how to do/repair a raid1 missing disk install (was: Re: lilo + raid = disaster (again))

To: Antony Gelberg <antony@antgel.co.uk>, debian-user <debian-user@lists.debian.org>
Subject: how to do/repair a raid1 missing disk install (was: Re: lilo + raid = disaster (again))
From: Henrique de Moraes Holschuh <hmh@debian.org>
Date: Fri, 26 Mar 2004 01:44:18 -0300
Message-id: <20040326044418.GA32671@khazad-dum.debian.net>
In-reply-to: <20040326005813.GC6336@brain.pulsesol.com>
References: <20040326005813.GC6336@brain.pulsesol.com>

WARNING: it is late at night here, make damn sure what is written in this
email actually makes sense before you attempt it.  I am somewhat sleepy.

Debian-boot removed from CC since this reply is OT there.

On Fri, 26 Mar 2004, Antony Gelberg wrote:
> I had a nice Woody system up and running on hda.  I created /dev/md0
> with hdc, and a missing drive.  I did a cp -ax to copy everything on hda
> to md0.

So far so good. It's best to use mdadm and drop raidtools. Using mdadm
[without a config file] is much less error-prone than raidtools can ever hope
to be.

I have done what you attempted at least five times now, and every time I
have had to use a checklist and be extremely careful to avoid trouble. As
others said, "missing disk installs are a sure way to lose data".

> All I needed was to get the boot loader sorted.  I put boot=/dev/md0 and
> root=/dev/md0 in lilo.conf, and changed fstab to mount / on md0.  Lilo

Lilo promptly corrupted some of your data.  The thing to understand about
lilo is that whoever wrote the initial RAID1 support for lilo was either a
moron, or a very sick person.  The same goes for whoever had the
"brilliant" idea to suggest in the howtos that people use "boot=/dev/hda1",
I should add.

New lilo can be told to act in a sane way, but I don't think it is the
default.  If you tell lilo that "boot=/dev/md0", you MUST give it
"raid-extra-boot=mbr-only" too.

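/etc/lilo.conf:
# install via the RAID device...
boot=/dev/md0
# ...but write boot records only to the MBRs of the member disks,
# never into the array itself
raid-extra-boot=mbr-only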

Otherwise, the dumb PoS will overwrite the first sector(s?) of whatever is
in your raid array, and that can be quite fatal to whatever is in there.
Indeed, something quite stupid to have as a default for anything IMHO.

Make sure you are using a new enough lilo to get the mbr-only option.

> came back with some errors.  Unfortunately I don't have them to hand,
> but it was something like the boot map not being on the root device
> (this is vague, sorry).

You did a snafu on the bootloader-and-kernel side of things too.  Do this to
repair:

1. Get a boot disk that supports RAID1, your SATA drives, and whatever
filesystem you used. A good bet is a Knoppix live CD.

2. Boot from it. Verify that you can see your SATA disks.

3. Manually start the RAID.

4. Repair whatever is the first thing in your RAID array. For filesystems,
that means *_repair or fsck.  For a lvm1 or lvm2 PV, I have no idea.

5. Mount /dev/md0 somewhere. Go in there, fix etc/lilo.conf with the
mbr-only option, chroot . , run lilo.

If the RAID array already has the two disks in it, umount everything,
shut down the RAID, reboot and you're done.  If it does not, go to step 6
below.
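
As a rough sketch, the repair sequence above could look like this when run
from the rescue CD.  The names are assumptions, not taken from the original
report: /dev/hdc1 stands for the one array member that is present, /mnt for
a free mount point, and fsck is assumed to know your filesystem.  By default
the script only prints the commands; set DRY_RUN=0 to really execute them.

```shell
#!/bin/sh
# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# repair_from_rescue_cd MD MEMBER MOUNTPOINT
# All three arguments are placeholders for your own setup.
repair_from_rescue_cd() {
    md=$1; member=$2; mnt=$3

    # Step 3: start the array by hand; --run starts it even though
    # only one of the two members is present.
    run mdadm --assemble --run "$md" "$member" || return 1

    # Step 4: repair whatever lilo overwrote. fsck exits with 0 (clean)
    # or 1 (errors corrected) when it is safe to go on.
    run fsck "$md"
    [ $? -le 1 ] || return 1

    # Step 5: fix lilo.conf inside the array, then run lilo chrooted there.
    run mount "$md" "$mnt" || return 1
    run "${EDITOR:-vi}" "$mnt/etc/lilo.conf"   # add raid-extra-boot=mbr-only
    run chroot "$mnt" /sbin/lilo || return 1

    # Unmount and stop the array before rebooting.
    run umount "$mnt"
    run mdadm --stop "$md"
}

repair_from_rescue_cd /dev/md0 /dev/hdc1 /mnt
```

Read the printed commands first, and only then repeat the run with
DRY_RUN=0.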

> (I have another box booting off RAID-1, and it doesn't need the
> raid-extra-boot line.)

It never does. You can tell lilo to install the MBR in /dev/hda, and that
will work just fine.  However, you won't have a copy of the boot loader in
/dev/hdb, so should /dev/hda fail, you will not survive a reboot.

The problem was caused because you did not correctly move the root fs and
kernel to the RAID array.

The process goes more or less like this:

1. Create the RAID with the missing disk. Prepare the filesystems in the
   RAID array. Double-check everything.
2. Go single user. Copy over all filesystems to the new ones in the RAID
   array. Fix etc/fstab and etc/lilo.conf (to avoid mistakes later) in the
   new root filesystem.
3. Change only the "root" option of lilo (or give the correct one during
   the reboot) in the current filesystem, and run lilo to update that.

   sync, umount everything, reboot.

Now, you have your system running entirely from the RAID array *BUT THE
KERNEL AND BOOT LOADERS ARE STILL LOADED FROM THE OLD DISK*.

If you get an unexpected reboot from this point on, you will need
a bootdisk to get the system up again. You've been warned.

MAKE VERY VERY SURE that you did boot with the root partition set
to the RAID array. Otherwise, step 6 will block and you WILL NEED TO
REBOOT WITH A BOOT DISK/CD TO RECOVER.
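
One possible way to check this is sketched below.  It assumes GNU stat is
installed and relies on md devices having block major number 9, which is
what Linux assigns to them.  The helper name is_md_devno is made up for this
example.

```shell
#!/bin/sh
# Succeed if a hex device number, as printed by `stat -c %D`, has
# major number 9, the block major Linux assigns to md devices.
is_md_devno() {
    major=$(( (0x$1 >> 8) & 0xfff ))
    [ "$major" -eq 9 ]
}

# Device number of whatever is mounted on / right now.
root_devno=$(stat -c %D / 2>/dev/null) || root_devno=0

if is_md_devno "$root_devno"; then
    echo "/ is on an md device (device number 0x$root_devno)"
else
    echo "/ is NOT on an md device (device number 0x$root_devno): stop here" >&2
fi

# /proc/mdstat lists the arrays the kernel is currently running.
if [ -r /proc/mdstat ]; then
    cat /proc/mdstat
fi
```

Checking the device number avoids trusting /etc/mtab or fstab, which only
say what was supposed to be mounted.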

4. Verify that /etc/lilo.conf is sane (root points to the fs in the
   raid array, boot=/dev/md0 is there, and raid-extra-boot=mbr-only
   is also there). DO NOT RUN LILO YET.

5. Make very damn sure nothing has anything on the old disk open.
   One safe way to check that is to run good old fdisk on the disk,
   print the partition table, change nothing, tell it to write to the
   disk.  If it complains that the kernel did not re-read the partition
   table, go find what you did wrong before you kill your system.

   Rebooting right now should still work.

6. Re-partition the old disk. From now on, reboots are impossible. If
   the kernel doesn't update the partition table, you are screwed: get
   your rescue CD, reboot, and do everything else from the CD.

   Add the newly partitioned disk to the RAID array. Make sure all
   RAID partitions in both disks are of type 0xFD, so that the kernel
   can autorun the raid array.  It doesn't matter if you have to update
   the partition table of the currently active second disk, as you are
   just changing partition types, and it doesn't matter if the kernel
   reloads the partition table or not for that.

7. Now the RAID is syncing. Run lilo. It will write all the crap to
   the proper places. Run "sync". Chances are that a forced reboot now
   will not hose your system anymore.

8. Wait until the RAID sync finishes. You're done.
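
The checklist above can be sketched as a script, mostly as a memory aid.
Everything specific in it is an assumption: /dev/hdc1 as the first array
member, /dev/hda as the old disk, /dev/hda1 as its new RAID partition, ext3
as the filesystem, and the function names themselves.  It prints the
commands unless DRY_RUN=0 is set, and it refuses to go near the old disk
until lilo.conf passes the step 4 check.

```shell
#!/bin/sh
# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# has_setting FILE KEY VALUE: succeed if FILE has a line "KEY=VALUE"
# (spaces around "=" allowed).
has_setting() {
    grep -Eq "^[[:space:]]*$2[[:space:]]*=[[:space:]]*$3[[:space:]]*\$" "$1" 2>/dev/null
}

# Step 4: all three settings must be present before lilo is run.
check_lilo_conf() {
    has_setting "$1" boot "$2" &&
    has_setting "$1" raid-extra-boot mbr-only &&
    has_setting "$1" root "$2"
}

# Steps 1-2, run from the old system. MD, MEMBER and MNT are placeholders.
prepare_array() {
    md=$1; member=$2; mnt=$3
    run mdadm --create "$md" --level=1 --raid-devices=2 missing "$member" &&
    run mke2fs -j "$md" &&      # ext3 is an assumption; use your own mkfs
    run mount "$md" "$mnt" &&
    run cp -ax / "$mnt"
}

# Steps 4-8, run after rebooting with root on the array.
finish_mirror() {
    md=$1; old_disk=$2; new_member=$3; conf=${4:-/etc/lilo.conf}
    check_lilo_conf "$conf" "$md" || {
        echo "$conf is not ready, not touching $old_disk" >&2
        return 1
    }
    run fdisk "$old_disk" &&            # steps 5-6: interactive, type fd
    run mdadm "$md" --add "$new_member" &&
    run lilo &&                         # step 7
    run sync &&
    run cat /proc/mdstat                # step 8: repeat until resync is done
}

prepare_array /dev/md0 /dev/hdc1 /mnt
```

After the reboot in step 3, the second half would be called as
"finish_mirror /dev/md0 /dev/hda /dev/hda1".  Editing fstab and lilo.conf,
and the reboot itself, are left as manual steps on purpose.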

You have to test the hard way to know for sure if your system will reboot
correctly from the second disk if the first one dies.  With new lilo, in any
recent system, it should.  On older ones, only God knows what the BIOS will
do.  

If you are going to do that test, run "mdadm --manage /dev/md0 --fail
/dev/hda" before you unplug the disk to simulate a boot with a broken first
disk.  That way, you know you will not lose any data.

If the system boots with the first disk unplugged, shut it down cleanly,
replug the disk, boot, and remove the "failed" drive from the RAID set and
re-add it to get it back online. 

If the system doesn't boot, plug the drive back and reboot. It will boot
(lilo doesn't care that md will think that drive is hosed), and after the
boot is complete, do the remove-and-readd dance to get the disk back online
in the RAID array.
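
The fail and remove-and-readd steps might look like the sketch below.  The
member name /dev/hda1 is an assumption; use whatever /proc/mdstat lists for
your array.  As written it only prints the commands; set DRY_RUN=0 to run
them.

```shell
#!/bin/sh
# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Before unplugging: mark the member failed so md stops using it.
fail_member() {
    run mdadm --manage "$1" --fail "$2"
}

# After plugging the disk back in and booting: remove the failed
# member, then add it again so md resyncs it from the surviving disk.
readd_member() {
    run mdadm --manage "$1" --remove "$2" &&
    run mdadm --manage "$1" --add "$2" &&
    run cat /proc/mdstat
}

fail_member /dev/md0 /dev/hda1
readd_member /dev/md0 /dev/hda1
```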

One last warning: I never do something like this without a bootdisk/CD
handy, and console access.  If anything goes wrong in the kernel move, you
will need it.  Out of the 5 times I did this, I had to use the rescue CD
3 times to get the kernel and lilo onto the first disk.  It is tricky to get
it right.

-- 
  "One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie." -- The Silicon Valley Tarot
  Henrique Holschuh
