Re: sarge freezes after failure of raid disk, incurring fs corruption on unrelated disk
On Fri, Feb 02, 2007 at 02:03:16PM +0100, Johannes Wiedersich wrote:
> First the good news: after some repairs, the system appears to be
> 'clean' again, running as usual.
> The box has three hard disks: /dev/hda with the root partition and
> /dev/hdb and /dev/hdd with a raid1 for data.
> Yesterday, /dev/hdb suddenly died:
> At first the system was still working (except for the raid). It was
> possible to ssh to the box and to diagnose via mdadm and looking at
> syslog. After a few minutes the system froze: it was unpingable, and
> it was impossible to locally switch to a console and/or log in, so I
> had to switch it off the hard way.
> I replaced the defective disk and rebooted. Rebuilding and syncing the
> raid device took a few hours, but worked fine.
> To check for sure that everything is ok I 'shutdown -rF now'ed the box.
> On fscking / there were a lot of errors, involving files in /var/ .
> After another boot and fsck of all partitions everything seemed fine.
> Only when I tried to install some program via aptitude did I get:
> dpkg: serious warning: files list file for package `libident' missing,
> assuming package has no files currently installed.
> This was very strange to me so I reinstalled all the mentioned packages.
> That worked without problems, and now all the warnings are gone.
> I would just like to _be sure_ that everything is ok now, and that
> there are no more missing or damaged files around.
> Here are my questions:
> Is it safe to leave the system as it is, or should I do a reinstall in
> order to be sure that the system is 'clean'? How could I check that no
> other files are affected except those 'reinstalled'?
> Is it common that a failure of a RAID disk leads to a system freeze,
> even though the affected drive is _NOT_ part of / or any FHS directory?
> Is it normal that an ext3 fs with journal gets corrupted in the process?
> Is there anything I could do to try to avoid this for the future?
Personally, I run something like samhain, not so much to check for
intrusion as to monitor data integrity.
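For the "how could I check" question, here is a minimal hand-rolled sketch of
that approach using only md5sum (the path and filenames are illustrative;
samhain automates this properly, and for package-owned files debsums does it
against dpkg's records):

```shell
# Snapshot checksums of a tree now; re-check after an incident, and any
# silently corrupted file shows up as FAILED. Paths are illustrative.
mkdir -p /tmp/integrity-demo
echo "important data" > /tmp/integrity-demo/file1
find /tmp/integrity-demo -type f -exec md5sum {} + > /tmp/integrity.md5
md5sum -c /tmp/integrity.md5    # every line should say OK
# (for files shipped by packages, 'debsums' checks them against the
#  md5sums contained in each .deb)
```

Of course a snapshot taken after the crash can only catch future damage,
which is the argument for running such a tool continuously.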
I wonder if the failed /dev/hdb took out the controller (ide0), and with
it /dev/hda, which shares that channel. That would explain both the
freeze and the corruption on a supposedly unrelated disk.
It's too bad that your system (as opposed to your data) wasn't also
protected by RAID.
If it were me, and I had solid backups and could afford the downtime to
reinstall, I'd reinstall. Etch. LVM on RAID1, for the system at least.
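A rough sketch of that layout, with purely illustrative device names and
sizes (note the mirror halves sit on hda and hdc, one per channel):

```shell
# Illustrative only: RAID1 across two disks on separate IDE channels,
# with LVM layered on top so system volumes live on redundant storage.
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/hda2 /dev/hdc2
pvcreate /dev/md0            # make the mirror an LVM physical volume
vgcreate vg0 /dev/md0        # volume group on the redundant device
lvcreate -L 8G -n root vg0   # logical volume for /
mkfs.ext3 /dev/vg0/root
```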
And I'd only have one drive on each controller channel, e.g. no hdb or
hdd unless it's for a CD/DVD drive or something.
Before a total reinstall, I'd really stress-test the ide0 controller to
ensure that it wasn't damaged.
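Concretely, a read-only workout of the surviving drive and channel could
look like this (device name illustrative; smartctl comes from the
smartmontools package, and all three commands only read, never write):

```shell
# Fire the drive's own long SMART self-test, then read every block
# through the ide0 controller; errors land in the SMART log and syslog.
smartctl -t long /dev/hda
badblocks -sv /dev/hda              # read-only surface scan
dd if=/dev/hda of=/dev/null bs=1M   # sustained sequential read load
```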
Then again, I'm paranoid. I've only had one drive failure (bearing
seize, hard head crash). I pulled the plug within 10 seconds, and there
was no other damage.
I really hope you have good backups.