Re: sarge freezes after failure of raid disk, incurring fs corruption on unrelated disk
> These messages look similar -- but not identical -- to the ones I had
> while installing an etch system -- and eventually I came to suspect the
> file-system-damage bug in the Debian 2.6.18-3 kernel (sometimes
> because of a race condition a buffer is not written to hard disk,
> although it should have been). It doesn't hit most systems, but when it
> does, it can be a disaster. Eventually one of my installs ended up with
> an unparseable apt-related status file -- I think it was the list of
> installed packages. I looked at it and it was binary gibberish
> (although it was supposed to be ordinary ASCII text).
I didn't know that sarge's kernel was also affected by this:
athene:~# uname -a
Linux athene 2.6.8-3-k7 #1 Tue Dec 5 23:58:25 UTC 2006 i686 GNU/Linux
>> Here are my questions:
>> Is it safe to leave the system as it is, or should I do a reinstall in
>> order to be sure that the system is 'clean'? How could I check that no
>> other files are affected except those 'reinstalled'?
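For the "how could I check" part, one option (my suggestion, not something mentioned upthread) is the debsums package, which compares installed files against the MD5 sums dpkg recorded at install time. A minimal sketch, assuming debsums is installed:

```shell
# Sketch, assuming the debsums package is available
# (apt-get install debsums). It checks every file shipped by an
# installed package against dpkg's recorded MD5 sums, so files that
# were silently corrupted on disk show up as changed.
if command -v debsums >/dev/null 2>&1; then
    debsums -s    # -s: stay silent except for changed/unreadable files
else
    echo "debsums not installed; try: apt-get install debsums"
fi
```

Note this only covers files owned by packages; configuration files and user data would still need your own checksums or backups to verify.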
>> Is it common that a failure of a raid disk leads to a system freeze,
>> even though the affected drive is _NOT_ part of / or any FHS directory?
> I've noticed freezes with NFS -- if the remote system carrying the
> physical volume is shut off without the clients first unmounting it, the
> client processes freeze next time they try to access the NFS volume.
> Eventually more and more processes freeze, unkillably, and the system
> gradually grinds to a halt. They stay frozen even if the remote system
> comes up again. Oddly, if the remote system is brought up again
> *before* they access it, they never notice, and just run normally.
> Could it be something similar?
Well, the box in question was _exporting_ the relevant partition via NFS
and Samba. Of course there were some 'problems' with the clients when
the NFS export suddenly disappeared...
So maybe these freezes also occur on the exporting machine.
>> Is there anything I could do to try to avoid this for the future?
> Maybe check for bad blocks?
I actually run smartmontools, but it only mailed me about the failing
health once the drive was already dead...
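For what it's worth, the passive health mails can be supplemented with explicitly scheduled self-tests. A sketch of what I mean, assuming smartmontools is installed and the drive is /dev/hdb (substitute your own device):

```shell
# Sketch, assuming smartmontools is installed and /dev/hdb is the
# drive to watch. Periodic self-tests can surface reallocated or
# pending sectors well before the drive dies outright.
if command -v smartctl >/dev/null 2>&1; then
    smartctl -t short /dev/hdb        # queue a short offline self-test
    smartctl -l selftest /dev/hdb     # review the self-test log afterwards
    smartctl -A /dev/hdb | grep -Ei 'reallocated|pending'
else
    echo "smartmontools not installed"
fi
```

smartd can also schedule these for you; a line like `/dev/hdb -a -s (S/../.././02|L/../../6/03)` in /etc/smartd.conf runs a short test daily at 02:00 and a long test Saturdays at 03:00 (times here are just an example).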
> Maybe avoid having both parts of the RAID on the same IDE chain?
Sorry for forgetting to mention this earlier: the RAID consists of
/dev/hdb and /dev/hdd. That is as far apart as possible on that
probably-too-cheap-for-the-purpose box ;-)
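To double-check the layout (hdb is the slave on the primary IDE channel, hdd the slave on the secondary, so they at least don't share a cable), something like this works; /dev/md0 is an assumption, use whatever device /proc/mdstat shows:

```shell
# Sketch: inspect the software-RAID layout. The md device name below
# is a guess; take the real one from the /proc/mdstat output.
if [ -r /proc/mdstat ]; then
    cat /proc/mdstat                  # shows arrays and member disks
fi
if command -v mdadm >/dev/null 2>&1; then
    mdadm --detail /dev/md0 2>/dev/null || echo "adjust the md device name"
fi
```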