
HELP!!! trying to recover crashed system



Hi Folks,

Here it is, Friday, and my birthday to boot, with one fire on my desk already, when I discover that a critical server has crashed....

The server is running Sarge (I know, I was just about to upgrade, but if it ain't broke, why fix it). It crashed this morning, and I'm having a horrible time recovering it. Any help anyone can offer would be very much appreciated.

The basic configuration:
- i686 motherboard, Pentium chip
- 2 SATA channels, 2 drives on each (total of 4)
- 4 partitions on each drive
- 4 md devices are built across the four drives (each with 3 active drives and 1 hot spare)
- two md devices are used for boot and swap
- the other two md devices have logical volumes on top of them (LVM) - used for / and /backup (large archive)
- all MBRs set up to boot

The failure:
- looks like one of the two SATA interfaces has died, taking down the two attached drives
-- the system should keep running, but doesn't, and won't come up
--- it gets pretty far in the boot process, then starts throwing errors "devfs_mk_dir invalid argument, could not append to parent for /disc" and freezes
--- if I boot from a live CD, I get errors from the ATA driver (IO error, and so forth) - very obviously hardware errors
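
To spell out why I say the system "should keep running", here is a rough sketch in Python. Everything specific in it is an assumption for illustration, not something taken from the actual box: the drive names (sda to sdd), which drives hang off which channel, which drive holds the spare slot, and the RAID level (both RAID1 and RAID5 are shown). The spare is not counted as in sync at the moment of failure, since it holds no data until a rebuild finishes.

```python
# Hypothetical model of one md device: 3 active members plus 1 hot spare.
# Drive names, channel assignment and RAID levels are assumptions.


def min_members(level, active_count):
    """Minimum number of in-sync members the array needs to keep running."""
    if level == "raid1":
        return 1                  # any single mirror is enough
    if level == "raid5":
        return active_count - 1   # tolerates the loss of exactly one member
    raise ValueError(f"unsupported level: {level}")


def survives(level, active, failed):
    """True if enough in-sync members remain right after `failed` drives vanish."""
    in_sync = [d for d in active if d not in failed]
    return len(in_sync) >= min_members(level, len(active))


# Assumed wiring: two drives per channel, sdd is the hot spare.
channels = {"channel0": {"sda", "sdb"}, "channel1": {"sdc", "sdd"}}
active = ["sda", "sdb", "sdc"]

results = {
    (level, name): survives(level, active, drives)
    for level in ("raid1", "raid5")
    for name, drives in channels.items()
}

for (level, name), ok in sorted(results.items()):
    state = "survives (degraded)" if ok else "fails"
    print(f"{level} with {name} dead: {state}")
```

Under these assumptions a 3-way RAID1 survives the loss of either channel, while a RAID5 only survives if the dead channel carried the spare plus one active member, so the RAID level and the exact member layout matter a lot here.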

Luckily, I have an identical box available. So... I simply moved the four disk drives from the failed machine to the new one. Silly me, I figured it would just come up, the RAIDs would repair themselves, and I'd be back on the air. Instead:

- I get the same devfs_mk_dir error (but if I boot from a live CD, I DON'T get any hardware errors)
-- suggests that one of the drives is so badly corrupted that the RAID can't rebuild
--- when I try looking at the disks (start up the Debian installer, go into the partitioner), the partitioner freezes halfway through scanning the drives
--- a little experimentation (pulling different drives) gets me to the point where the partitioner will start, and sees the various partitions
---- of course, at this point, I abort - I don't want to trash any of the data
- with the bad drive pulled, I try to boot, but all I get is a "boot from CD" prompt

Where this leaves me:
- I don't want to trash the system (or the user data) on the drives, if I can avoid it (obviously)
- I need to recover sufficiently to boot
- from there I'd like to try to rebuild the RAID devices and logical volumes and see where I am
- I'm guessing that something very basic has been trashed - like the MBR, or grub configuration

So.... any suggestions would be very much appreciated as to:

1. rescue tools - particularly something that lets me try to mount the existing md devices and LVMs, and then boot
2. generally restoring the system to a bootable state (MBR, grub, etc.)
3. thoughts on examining the one drive that might or might not be bad
-- diagnostic
-- if good: recovery or reformatting so I can add it back to the RAID/LVM pool
-- if bad: how to configure a spare drive to stick it into the existing RAID/LVM pool

Thanks VERY much.
Miles Fidelman



