Posted on March 1, 2014 @ 08:56:00 AM by Paul Meagher
On Friday evening the servers rebooted and the main server didn't reboot normally. Something in the startup process got corrupted and left me with only a very basic level of access to the server. The decision was made to reinstall a minimal operating system on another hard drive and to rebuild the web server from there. We attached the old hard drive to the main server box as a secondary drive so I could mount it from the new installation I was rebuilding. The fix involved installing and configuring numerous pieces of server software, copying data from the old server to the new server, and testing that everything was working as it should again. The process took around 14 hours and mostly happened overnight and into Saturday morning. I apologize for the downtime.
I've had to do two disaster recoveries in the last 3 weeks and I suspect it isn't a coincidence. The first disaster involved a failing hard disk and recovery from a RAID disk that still appeared to function normally. We never lost any data from that disaster and were able to copy the RAID array to a new disk and rebuild the array with around 6 hours of downtime. I'm suspicious that some corrupted initialization software may have found its way into the rebuilt RAID array. It probably would have been better to rebuild the server at that point and copy the data over to it, so I could be assured that the basic initialization/kernel software was not corrupted in any way when it was installed onto a new hard disk. That is the situation now: I have a new operating system and new server software on my rebuilt server, and the next step will be to create a RAID array from it sometime in the coming week so I have full redundancy again. Creating the RAID array will not require any downtime.
So that was how I spent the last 14 hours from Friday night into Saturday morning. It was pretty tense at first, but once I figured out how to mount the old hard drive and saw that I had all my data, it was a matter of going through the motions to rebuild the server, copy data over, and test. I had other backups of my data, but it is easier to recover when you can work directly from the contents of your original hard drive. Hopefully, I won't have any more potential disasters to deal with for a while. I'm used to running my Linux-based servers for half a year to a year without rebooting, so this is not the norm. I learned quite a bit about the mount commands, and the idea of creating a virtual device to mount the old disk saved my bacon:
How to mount software RAID1 member using mdadm.
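
For readers who have not done this before, here is a minimal sketch of the general technique: a single surviving RAID1 member is assembled into a degraded md device, and that device is then mounted read-only. It is not a record of the exact commands used during this recovery. The device names /dev/sdb1 and /dev/md9 and the mount point /mnt/old are placeholders chosen for illustration, and the run and mount_raid_member helpers are hypothetical names. The script only prints the commands unless DRY_RUN=0 is set, in which case it must be run as root.

```shell
#!/bin/sh
# Sketch: mount one member of a software RAID1 array via a degraded md device.
# Dry-run by default: commands are printed, not executed.
# Set DRY_RUN=0 and run as root to actually execute them.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Usage: mount_raid_member MEMBER MD_DEVICE MOUNT_POINT
# All three arguments are placeholders; substitute your own devices.
mount_raid_member() {
    member="$1"
    md_dev="$2"
    mount_point="$3"

    # Confirm the partition really carries an md superblock.
    run mdadm --examine "$member" || return 1

    # Assemble the array from this one member; --run starts it even
    # though the second half of the mirror is missing.
    run mdadm --assemble --run "$md_dev" "$member" || return 1

    # Mount the resulting virtual device read-only to protect the old data.
    run mkdir -p "$mount_point" || return 1
    run mount -o ro "$md_dev" "$mount_point"
}

# Example invocation with the placeholder names described above.
mount_raid_member /dev/sdb1 /dev/md9 /mnt/old
```

When you are finished copying data, the reverse steps would be `umount /mnt/old` followed by `mdadm --stop /dev/md9`, again with your own names substituted.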