
Re: sarge freezes after failure of raid disk, incurring fs corruption on unrelated disk



On Fri, Feb 02, 2007 at 05:19:48PM +0100, Johannes Wiedersich wrote:
> hendrik@topoi.pooq.com wrote:
> > These messages look similar -- but not identical -- to the ones I had 
> > while installing an etch system -- and eventually I came to suspect the 
> > file-system-damage bug in the Debian 2.6.18-3 kernel (because of a
> > race condition, a buffer is sometimes not written to the hard disk
> > although it should have been).  It doesn't hit most systems, but when it
> > does, it can be a disaster.  Eventually one of my installs ended up with 
> > an unparseable apt-related status file -- I think it was the list of 
> > installed packages.  I looked at it and it was binary gibberish 
> > (although it was supposed to be ordinary ASCII text).
> 
> I didn't know that sarge's kernel was also affected by this:
> athene:~# uname -a
> Linux athene 2.6.8-3-k7 #1 Tue Dec 5 23:58:25 UTC 2006 i686 GNU/Linux

I don't think it is.  In any case, I didn't know you were running sarge.  
As far as I know, sarge's 2.6.8 kernel predates that bug.

> 
> >> Here are my questions:
> >>
> >> Is it safe to leave the system as it is, or should I do a reinstall in
> >> order to be sure that the system is 'clean'? How could I check that no
> >> other files are affected except those 'reinstalled'?
> >>
> >> Is it common for the failure of a raid disk to lead to a system freeze,
> >> even though the affected drive is _NOT_ part of / or any FHS directory?
> > 
> > I've noticed freezes with NFS -- if the remote system carrying the 
> > physical volume is shut off without the clients first unmounting it, the 
> > client processes freeze next time they try to access the NFS volume.  
> > Eventually more and more processes freeze, unkillably, and the system 
> > gradually grinds to a halt.  They stay frozen even if the remote system 
> > comes up again.  Oddly, if the remote system is brought up again 
> > *before* they access it, they never notice, and just run normally.
> > 
> > Could it be something similar?
> 
> Well, the box in question was _exporting_ the relevant partition via nfs
> and samba. Of course there were some 'problems' with the clients when
> the nfs suddenly disappeared...
> 
> So maybe these freezes also occur for the exporting machine.

Don't think so.
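
For the client side, here is a rough sketch of how a script could test
whether a mount still answers without getting stuck itself.  The idea is
to do the stat() in a throwaway child process and give up on it after a
timeout.  The function name and the example path are made up for
illustration, and it assumes a reasonably recent Python 3 on the client.

```python
import subprocess
import sys

def path_responds(path, timeout=5.0):
    """Return True if stat() on `path` succeeds within `timeout` seconds.

    The stat() runs in a child process, so if the mount hangs only the
    child blocks.  False means either a timeout or an ordinary error
    such as a missing path.
    """
    child = subprocess.Popen(
        [sys.executable, "-c", "import os, sys; os.stat(sys.argv[1])", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        return child.wait(timeout=timeout) == 0
    except subprocess.TimeoutExpired:
        # A child stuck in uninterruptible sleep may ignore the kill,
        # so we deliberately do not wait for it afterwards.
        child.kill()
        return False

if __name__ == "__main__":
    # "/mnt/nfs" is a placeholder mount point, not one from this thread.
    print(path_responds("/mnt/nfs"))
```

This only tells you that the mount is not answering; it does not unfreeze
anything that is already blocked on it.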

> 
> >> Is there anything I could do to try to avoid this for the future?
> > 
> > Maybe check for bad blocks?
> 
> I actually run smartmontools, but it mailed the bad-health report only
> when the drive was already dead...
> 
> > Maybe avoid having both parts of the RAID on the same IDE chain?
> 
> Sorry for forgetting to post this, but the raid consists of /dev/hdb and
> /dev/hdd. That is as far apart as possible on that
> probably-too-cheap-for-the-purpose box ;-)

OK.  You did that right, too.
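
On the earlier question of how to check that no other files were
damaged: many Debian packages ship a list of MD5 sums under
/var/lib/dpkg/info/<package>.md5sums, one "digest  relative/path" line
per file, and the debsums package compares installed files against
those lists.  Below is a minimal sketch of the same comparison, assuming
a current Python 3 (not what sarge ships).  The function name is made
up, and packages without an md5sums file, as well as configuration and
data files, are not covered.

```python
import hashlib
from pathlib import Path

def verify_md5sums(md5sums_file, root="/"):
    """Return the files listed in `md5sums_file` that fail the check.

    Each line is expected to look like "<md5 hex digest>  <path>", with
    the path relative to `root`.  Unreadable or missing files are
    reported as failures too.
    """
    failed = []
    text = Path(md5sums_file).read_text(errors="replace")
    for line in text.splitlines():
        digest, _, rel_path = line.partition("  ")
        if not rel_path:
            continue  # skip blank or malformed lines
        target = Path(root) / rel_path
        try:
            actual = hashlib.md5(target.read_bytes()).hexdigest()
        except OSError:
            failed.append(str(target))
            continue
        if actual != digest:
            failed.append(str(target))
    return failed

if __name__ == "__main__":
    # Check every md5sums list that dpkg has on record.
    for sums in sorted(Path("/var/lib/dpkg/info").glob("*.md5sums")):
        for bad in verify_md5sums(sums):
            print(sums.stem, bad)
```

A clean run only says that the packaged files match their recorded sums.
If the md5sums lists themselves were hit by the corruption, the result
means little, so comparing against freshly downloaded packages would be
the more careful route.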

> 
> Thanks,
> 
> Johannes