
Re: sarge freezes after failure of raid disk, incurring fs corruption on unrelated disk



On Fri, Feb 02, 2007 at 02:03:16PM +0100, Johannes Wiedersich wrote:
> First the good news: after some repairs, the system appears to be
> 'clean' again, running as usual.
> 
> The box has three hard disks: /dev/hda with the root partition and
> /dev/hdb and /dev/hdd with a raid1 for data.
> 
> /---
> athene:~# mount
> /dev/hda1 on / type ext3 (rw,errors=remount-ro)
> proc on /proc type proc (rw)
> sysfs on /sys type sysfs (rw)
> devpts on /dev/pts type devpts (rw,gid=5,mode=620)
> tmpfs on /dev/shm type tmpfs (rw)
> /dev/hda6 on /diskathene type ext3 (rw)
> /dev/md0 on /atheneraid type ext3 (rw)
> /atheneraid/tausch on /tausch type none (rw,bind,uid=105,gid=100)
> usbfs on /proc/bus/usb type usbfs (rw)
> 192.168.0.24:/home on /home type nfs (rw,addr=192.168.0.24)
> none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
> \---
> 
> Yesterday, /dev/hdb suddenly died:
> /---
> Feb  1 12:17:06 athene kernel: hdb: dma_intr: status=0x51 { DriveReady
> SeekComplete Error }
> Feb  1 12:17:06 athene kernel: hdb: dma_intr: error=0x40 {
> UncorrectableError }, LBAsect=324442003, high=19, low=5674899,
> sector=324441984
> Feb  1 12:17:06 athene kernel: end_request: I/O error, dev hdb, sector
> 324441984
> Feb  1 12:17:08 athene kernel: hdb: dma_intr: status=0x51 { DriveReady
> SeekComplete Error }
> Feb  1 12:17:11 athene kernel: hdb: dma_intr: error=0x40 {
> UncorrectableError }, LBAsect=324442003, high=19, low=5674899,
> sector=324441992
> Feb  1 12:17:11 athene kernel: end_request: I/O error, dev hdb, sector
> 324441992
> Feb  1 12:17:11 athene kernel: hdb: dma_intr: status=0x51 { DriveReady
> SeekComplete Error }
> Feb  1 12:17:11 athene kernel: hdb: dma_intr: error=0x40 {
> UncorrectableError }, LBAsect=324442003, high=19, low=5674899,
> sector=324442000
> \---
> (This continues for a few MB)
> 
> At first the system was still working (except for the raid). It was
> possible to ssh to the box and to diagnose via mdadm and by looking at
> syslog. After a few minutes the system froze: it was unpingable, and it
> was impossible to switch to a local console or log in, so I had to
> switch it off the hard way.
> 
> I replaced the defective disk and rebooted. Rebuilding and syncing the
> raid device took a few hours, but worked fine.
> 
> To make sure that everything was ok, I ran 'shutdown -rF now' to force
> a filesystem check on reboot. The fsck of / reported a lot of errors,
> involving files in /var/ .
> 
> After another boot and an fsck of all partitions, everything seemed
> fine. Only when I tried to install a program via aptitude did I get
> the following:
> 
> /---
> dpkg: serious warning: files list file for package `libident' missing,
> assuming package has no files currently installed.
> 
> dpkg: serious warning: files list file for package `libldap2' missing,
> assuming package has no files currently installed.
> 
> dpkg: serious warning: files list file for package `libmagic1' missing,
> assuming package has no files currently installed.
> 
> dpkg: serious warning: files list file for package `libldap-2.2-7'
> missing, assuming package has no files currently installed.
> 
> dpkg: serious warning: files list file for package `mpack' missing,
> assuming package has no files currently installed.
> 
> dpkg: serious warning: files list file for package `binutils' missing,
> assuming package has no files currently installed.
> \---

These messages look similar -- but not identical -- to the ones I got 
while installing an etch system, and I eventually came to suspect the 
file-system-corruption bug in the Debian 2.6.18-3 kernel (because of a 
race condition, a buffer is sometimes not written to the hard disk even 
though it should have been).  It doesn't hit most systems, but when it 
does, it can be a disaster.  One of my installs eventually ended up with 
an unparseable apt-related status file -- I think it was the list of 
installed packages.  When I looked at it, it was binary gibberish 
(although it was supposed to be ordinary ASCII text).

My system was not any kind of RAID or LVM system, though.  The bug has 
been fixed -- so I've heard -- in the Debian 2.6.18-4 kernel.  I'm 
waiting for that one to percolate into the etch installer, whereupon 
I'll be trying to install again.

In the meantime, I've performed a bad-block check on the drive and 
discovered that some of the address markers were unreadable.  So
 (a) that might have been the problem, although I've had no indication 
that it was (those messages appeared during bad-block checking, but 
I don't remember seeing them during the earlier trouble), and
 (b) it would be good for me to do a low-level format on the drive, or 
else replace it, before I try again.
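For reference, a read-only bad-block scan can be done along these lines
(the device names here are only examples -- substitute the drive you
actually want to test):

```shell
# Read-only scan of the whole disk for unreadable blocks;
# safe even on a disk that is in use, though slow.
badblocks -sv /dev/hdb

# Or, on an *unmounted* ext3 partition, let e2fsck run badblocks
# itself and record any bad blocks in the filesystem's bad-block list:
e2fsck -c /dev/hdb1
```

Neither of these writes to the disk (e2fsck -c uses a read-only
badblocks test), so they should be safe as a first diagnostic.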

> 
> This was very strange to me so I reinstalled all the mentioned packages.
> That worked without problems, and now all the warnings are gone.

Well, you've been lucky.
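If you want more assurance than just reinstalling the packages that
warned, the debsums package can compare the checksums of installed files
against the md5sums shipped in the packages.  A sketch (it only covers
packages that ship md5sum lists, and I haven't run it on a box in your
exact state):

```shell
# Install the checker, then report only files whose checksum
# is wrong or missing (-s = silent except for errors):
apt-get install debsums
debsums -s

# Any package it flags can then be reinstalled:
# apt-get install --reinstall <package>
```

That won't catch damage to configuration files or to data outside the
packaging system, but it's a much wider net than waiting for dpkg to
complain.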

> 
> I would just like to _be sure_, that everything is ok now, and that
> there are no more missing or damaged files around.
> 
> Here are my questions:
> 
> Is it safe to leave the system as it is, or should I do a reinstall in
> order to be sure that the system is 'clean'? How could I check that no
> files other than those I reinstalled are affected?
> 
> Is it common that a failure of a raid disk leads to a system freeze,
> even though the affected drive is _NOT_ part of / or any FHS directory?

I've noticed freezes with NFS -- if the remote system carrying the 
physical volume is shut off without the clients first unmounting it, the 
client processes freeze next time they try to access the NFS volume.  
Eventually more and more processes freeze, unkillably, and the system 
gradually grinds to a halt.  They stay frozen even if the remote system 
comes up again.  Oddly, if the remote system is brought up again 
*before* they access it, they never notice, and just run normally.

Could it be something similar?
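For the NFS hangs I described, by the way, the blocking-forever
behaviour is what a default "hard" mount does.  The mount options can
soften it -- a sketch of an /etc/fstab line (your server address, from
the mount output above; the option choice is only an example):

```shell
# /etc/fstab fragment:
# 'intr' lets signals interrupt a process stuck on a dead server;
# adding 'soft' would make requests fail with an I/O error instead of
# retrying forever, but that risks silent data loss on writes, so
# 'hard,intr' is the usual compromise.
192.168.0.24:/home  /home  nfs  rw,hard,intr  0  0
```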


> 
> Is it normal that an ext3 fs with journal gets corrupted in the process?

I would guess not.
On the other hand, I had problems with both JFS and ext3, but I didn't 
have a RAID.

> 
> Is there anything I could do to try to avoid this for the future?

Maybe check for bad blocks?
Maybe avoid having both parts of the RAID on the same IDE chain?
Have full backups?
Use a different kernel (in case that's the problem)?
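Also, it helps to hear about a dying disk before the array degrades
rather than after.  A sketch of the usual monitoring setup (mail
address and device are examples):

```shell
# Have mdadm watch the arrays in the background and mail on failure
# or degraded events.  On Debian this is normally enabled via
# /etc/default/mdadm plus a MAILADDR line in /etc/mdadm/mdadm.conf;
# the equivalent one-shot invocation is:
mdadm --monitor --scan --mail=root --daemonise

# Quick SMART health check on one drive (smartmontools package):
smartctl -H /dev/hdb
```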

> 
> 
> Thanks for your attention and your help!
> 
> Johannes
> 
> 
