Well it almost worked
One of my goals when I set up my server was to have a system that wasn't likely to fail. To that end, I set up the system with Linux software RAID and an Uninterruptible Power Supply (UPS). The idea was that the UPS would bridge minor power outages and the UPS monitoring software would gracefully shut down the server if the power failed for long enough. Setting the BIOS to resume when power was restored meant the system should restart on its own, and the RAID array would handle the failure of any single disk.
That was the plan.
As it turned out, the RAID array transparently handled the failure of one of the hard drives. The only outward sign of the drive failing was an e-mail from the RAID monitoring software (one for each partition on the disk) telling me that one of the drives had failed and the array was now running in degraded mode. A sample failure e-mail looks like:
This is an automatically generated mail message from mdadm
running on fraud.davenjudy.org
A Fail event had been detected on md device /dev/md3.
Faithfully yours, etc.
P.S. The /proc/mdstat file currently contains the following:
Personalities : [raid1]
md0 : active raid1 hdb1[1] hda1[0]
200704 blocks [2/2] [UU]
md1 : active raid1 hdb3[1] hda3[0]
4192896 blocks [2/2] [UU]
md3 : active raid1 hdb5[1] hda5[2](F)
33246400 blocks [2/1] [_U]
md2 : active raid1 hdb2[1] hda2[2](F)
20972736 blocks [2/1] [_U]
unused devices: <none>
As usual, things like this never happen when you have plenty of time to fix them, but waiting at least gave me time to research how to recover once I swapped in a spare disk (I had bought a couple of extra disks of the same size and type since I knew the drives would eventually die). I waited until I thought I had more than enough time to swap the disk and then rebooted the system, just to pick up an updated kernel that had been installed but never booted. That's when things went to hell in a hurry.
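A side note on the failure e-mail above: the /proc/mdstat excerpt in it is regular enough to check with a script rather than by eye. What follows is a minimal Python sketch and not part of the original setup. The function names (parse_mdstat, degraded_arrays) are made up for illustration, and the parser assumes the simple two-line-per-array layout shown in the e-mail; other kernels or RAID levels may print extra fields it doesn't handle.

```python
import re

# /proc/mdstat text copied from the failure e-mail above.
SAMPLE_MDSTAT = """\
Personalities : [raid1]
md0 : active raid1 hdb1[1] hda1[0]
200704 blocks [2/2] [UU]
md1 : active raid1 hdb3[1] hda3[0]
4192896 blocks [2/2] [UU]
md3 : active raid1 hdb5[1] hda5[2](F)
33246400 blocks [2/1] [_U]
md2 : active raid1 hdb2[1] hda2[2](F)
20972736 blocks [2/1] [_U]
unused devices: <none>
"""

# Header line, e.g. "md3 : active raid1 hdb5[1] hda5[2](F)"
ARRAY_LINE = re.compile(r"^(md\d+)\s*:\s*(\S+)\s+(\S+)\s+(.*)$")
# Member entry, e.g. "hda5[2](F)": device name plus optional failed marker
MEMBER = re.compile(r"(\w+)\[\d+\](\(F\))?")
# Status, e.g. "[2/1] [_U]": configured devices, working devices, slot map
STATUS = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")


def parse_mdstat(text):
    """Return {array name: info dict} for each md array found in text."""
    arrays = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        header = ARRAY_LINE.match(line)
        if header:
            name, state, level, rest = header.groups()
            members = MEMBER.findall(rest)
            current = {
                "state": state,
                "level": level,
                "members": [dev for dev, _ in members],
                "failed": [dev for dev, flag in members if flag],
                "degraded": False,
            }
            arrays[name] = current
            continue
        status = STATUS.search(line)
        if status and current is not None:
            configured, working, slots = status.groups()
            # Degraded if fewer devices are working than configured,
            # or if any slot in the map shows "_" instead of "U".
            current["degraded"] = int(working) < int(configured) or "_" in slots
    return arrays


def degraded_arrays(arrays):
    """Return the sorted names of arrays that are running degraded."""
    return sorted(name for name, info in arrays.items() if info["degraded"])


if __name__ == "__main__":
    info = parse_mdstat(SAMPLE_MDSTAT)
    for name in degraded_arrays(info):
        print(name, "is degraded; failed members:", info[name]["failed"])
```

On a live system the same function could be fed the contents of /proc/mdstat instead of the pasted sample. Note that in the sample only md2 and md3 had been marked failed at the time of the e-mail, even though all four arrays had a partition on the dying disk.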
I had originally built and tested my server with White Box Linux 3 which is a free clone of Red Hat Enterprise Linux (RHEL) version 3. At that time the GrUB boot loader only had minimal support for booting a RAID partition so I went with the tried and true LiLo boot loader. Testing with LiLo showed that I could easily remove the ribbon cable from any one drive and the system would boot correctly from the remaining drives. This fit with my requirement that the system be able to restart unattended after being down even if a drive had failed.
At some point I switched from White Box Linux to CentOS (another free clone of RHEL) in order to pick up the equivalent of RHEL version 4. LiLo hadn't quite gone away from Red Hat Linux, but version 4 definitely showed it wasn't going to be supported for long. By then, though, GrUB was supposed to support RAID on the boot partition. Little did I know that this support required the administrator to take several non-obvious (and not even really agreed to by the GrUB user community) steps to make each of the drives of a RAID array bootable.
With the release of RHEL 5, I decided to get all of my systems running the same version of CentOS, which happened to be version 5. At this point, not only was LiLo no longer supported, you couldn't even build it yourself without first installing several packages (as86, ld86 and bcc) which were also not supported for RHEL or CentOS version 5. So, I was fully committed to GrUB whether I wanted to be or not. Taking the server back to version 4 would have meant supporting two versions of Linux and losing some functionality, and a number of applications would also have been "different" on the server.
Back to the present: I had just attempted to reboot the server. Instead of the nice clean boot with no hassles that I expected, I got a screen full of gibberish and an unusable system. "Aha," I said, "The old drive has just enough life left in it to screw up the boot process." So I powered down the server, pulled the IDE cable off of the failed drive and tried to boot again. This time I got a screen that just said GRUB. Not gibberish but definitely not a working system.
At this point I went searching for my installation/recovery CD. Recovery CD in hand, the trick was to figure out how to get a bootable system again. I finally found that I could mount the remaining disk in the array (the "linux rescue" mode only found the empty, new disk) and run grub from there, although the incantation was a tad odd since I had to tell grub where to find itself, the grub shell, etc. While in this mode, I also used fdisk to partition the new drive. I also set the BIOS to try to boot first from the "old" disk in the array.
This solved the problem and the system came up. I had found a really well written explanation of how to recover from a failed drive with Linux software RAID. The only problem was the explanation wasn't for RHEL/CentOS and relied on a command called raidhotadd. This command wasn't available but the mdadm man page had a clear explanation of how to use mdadm to rebuild my failed array.
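The post doesn't record the exact mdadm commands, so the following is only a sketch of what the rebuild step typically looks like, written as a small Python dry run that prints the commands instead of executing them. It assumes the replacement disk took the place of hda and was partitioned to match hdb; the array-to-partition table is read off the /proc/mdstat excerpt in the failure e-mail, and the names REPLACED_MEMBERS, rebuild_commands and format_command are invented for the example. `mdadm <array> --add <device>` is the manage-mode form documented in the mdadm man page.

```python
import shlex

# Array -> partition on the replaced disk, read off the /proc/mdstat
# excerpt in the failure e-mail. Assumption: the new disk is hda again
# and has already been partitioned to match the surviving hdb.
REPLACED_MEMBERS = {
    "/dev/md0": "/dev/hda1",
    "/dev/md1": "/dev/hda3",
    "/dev/md2": "/dev/hda2",
    "/dev/md3": "/dev/hda5",
}


def rebuild_commands(members):
    """Return one 'mdadm <array> --add <partition>' argv list per array."""
    return [
        ["mdadm", array, "--add", partition]
        for array, partition in sorted(members.items())
    ]


def format_command(argv):
    """Render an argv list as a shell-safe command line for display."""
    return " ".join(shlex.quote(arg) for arg in argv)


if __name__ == "__main__":
    # Dry run only: print what would be typed as root, nothing is executed.
    for argv in rebuild_commands(REPLACED_MEMBERS):
        print(format_command(argv))
    print("cat /proc/mdstat")  # resync progress shows up here
```

Printing rather than running the commands is deliberate: adding the wrong partition to an array is not something to automate blindly, and the device names above are specific to this one machine.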
All is happiness now with my server back up and running the way it's supposed to. I need to take some time to make sure that BOTH drives of the RAID array are bootable. Snide, sniping comment: It would be nice if the folks who maintain GrUB would implement transparent support of bootable RAID partitions (the way LiLo has for years).
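For the "make BOTH drives bootable" chore, one commonly cited GRUB legacy recipe is to map each disk in turn to (hd0) and run setup on it, so that whichever disk survives can boot as the BIOS's first disk. The Python sketch below only generates the text one might feed to `grub --batch`; it runs nothing. It is an illustration under assumptions, not the procedure used on this server: it assumes /boot lives on the first partition of each disk (the post never says which array holds /boot), and grub_batch_script is a made-up helper name.

```python
def grub_batch_script(disks, boot_partition_index=0):
    """Build GRUB legacy shell commands that install the boot loader
    on every disk in `disks`, treating each one as (hd0) in turn.

    boot_partition_index is GRUB's zero-based partition number for /boot
    (0 means the first partition); this is an assumption, adjust to taste.
    """
    lines = []
    for disk in disks:
        lines.append(f"device (hd0) {disk}")            # pretend this disk is the first BIOS disk
        lines.append(f"root (hd0,{boot_partition_index})")  # where the GRUB stage files live
        lines.append("setup (hd0)")                     # write the boot loader to that disk's MBR
    lines.append("quit")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # The two IDE disks from the /proc/mdstat excerpt in the e-mail.
    print(grub_batch_script(["/dev/hda", "/dev/hdb"]), end="")
```

The printed text could be reviewed and then piped into `grub --batch` as root. The real test is still the one described earlier with LiLo: pull the cable from one drive at a time and see whether the box boots.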
Oh yeah, I'm still working on moving my blog from WordPress to Drupal but getting the RAID array back to 100% seemed like a higher priority.
Cheers,
Dave