
sarge freezes after failure of raid disk, incurring fs corruption on unrelated disk



First the good news: after some repairs, the system appears to be
'clean' again, running as usual.

The box has three hard disks: /dev/hda with the root partition and
/dev/hdb and /dev/hdd with a raid1 for data.

/---
athene:~# mount
/dev/hda1 on / type ext3 (rw,errors=remount-ro)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
tmpfs on /dev/shm type tmpfs (rw)
/dev/hda6 on /diskathene type ext3 (rw)
/dev/md0 on /atheneraid type ext3 (rw)
/atheneraid/tausch on /tausch type none (rw,bind,uid=105,gid=100)
usbfs on /proc/bus/usb type usbfs (rw)
192.168.0.24:/home on /home type nfs (rw,addr=192.168.0.24)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
\---

Yesterday, /dev/hdb suddenly died:
/---
Feb  1 12:17:06 athene kernel: hdb: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Feb  1 12:17:06 athene kernel: hdb: dma_intr: error=0x40 { UncorrectableError }, LBAsect=324442003, high=19, low=5674899, sector=324441984
Feb  1 12:17:06 athene kernel: end_request: I/O error, dev hdb, sector 324441984
Feb  1 12:17:08 athene kernel: hdb: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Feb  1 12:17:11 athene kernel: hdb: dma_intr: error=0x40 { UncorrectableError }, LBAsect=324442003, high=19, low=5674899, sector=324441992
Feb  1 12:17:11 athene kernel: end_request: I/O error, dev hdb, sector 324441992
Feb  1 12:17:11 athene kernel: hdb: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Feb  1 12:17:11 athene kernel: hdb: dma_intr: error=0x40 { UncorrectableError }, LBAsect=324442003, high=19, low=5674899, sector=324442000
\---
(This continues for a few MB)
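
With several MB of such messages, a short script can condense the log to the distinct failing sectors. The following is only a minimal sketch, assuming that each entry occupies a single line in the syslog file and that the "end_request" lines have the format shown in the excerpt. The function name summarize_io_errors is made up for illustration.

```python
import re
from collections import Counter

# Matches kernel lines such as:
#   "... kernel: end_request: I/O error, dev hdb, sector 324441984"
# The pattern is an assumption based on the excerpt in this mail.
END_REQUEST = re.compile(r"end_request: I/O error, dev (\w+), sector (\d+)")


def summarize_io_errors(lines):
    """Count I/O errors per (device, sector) pair in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = END_REQUEST.search(line)
        if match:
            device, sector = match.group(1), int(match.group(2))
            counts[(device, sector)] += 1
    return counts
```

It could be used as `summarize_io_errors(open("/var/log/syslog", errors="replace"))`, where the path is an assumption (the usual Debian location). The result shows how many distinct sectors failed and how often each one was retried.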

At first the system kept working (except for the raid). I could still
ssh to the box and diagnose the problem with mdadm and by looking at
syslog. After a few minutes the system froze: it no longer answered
pings, and I could neither switch to a local console nor log in, so I
had to switch it off the hard way.

I replaced the defective disk and rebooted. Rebuilding and syncing the
raid device took a few hours, but worked fine.

To make sure that everything was ok, I rebooted the box with
'shutdown -rF now', which forces an fsck on the next boot. The fsck of /
reported a lot of errors involving files in /var/.

After another boot and an fsck of all partitions, everything seemed fine.
Only when I tried to install a program via aptitude did I get the
following:

/---
dpkg: serious warning: files list file for package `libident' missing,
assuming package has no files currently installed.

dpkg: serious warning: files list file for package `libldap2' missing,
assuming package has no files currently installed.

dpkg: serious warning: files list file for package `libmagic1' missing,
assuming package has no files currently installed.

dpkg: serious warning: files list file for package `libldap-2.2-7'
missing, assuming package has no files currently installed.

dpkg: serious warning: files list file for package `mpack' missing,
assuming package has no files currently installed.

dpkg: serious warning: files list file for package `binutils' missing,
assuming package has no files currently installed.
\---

This seemed very strange to me, so I reinstalled all the mentioned
packages. That worked without problems, and now all the warnings are gone.
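
The "files list file" in these warnings is the per-package list that dpkg keeps under /var/lib/dpkg/info/ as PACKAGE.list. The following is a minimal sketch of how one might look for other installed packages whose list file is missing, without waiting for dpkg to complain. It assumes the usual layout (status database in /var/lib/dpkg/status, list files named PACKAGE.list), and the function names are made up for illustration. It only detects missing list files, not damaged file contents.

```python
import os


def installed_packages(status_text):
    """Yield names of packages marked 'install ok installed' in dpkg's status text."""
    for stanza in status_text.split("\n\n"):
        fields = {}
        for line in stanza.splitlines():
            # Continuation lines start with whitespace; skip them.
            if line[:1].isspace() or ":" not in line:
                continue
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
        if fields.get("Status") == "install ok installed" and "Package" in fields:
            yield fields["Package"]


def packages_missing_list(status_text, info_dir):
    """Return installed packages that have no PACKAGE.list file in info_dir."""
    return sorted(
        name
        for name in installed_packages(status_text)
        if not os.path.exists(os.path.join(info_dir, name + ".list"))
    )
```

A call such as `packages_missing_list(open("/var/lib/dpkg/status").read(), "/var/lib/dpkg/info")` would then list the packages that need a closer look. Both paths are assumptions about a standard Debian installation.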

I would just like to _be sure_ that everything is ok now and that no
more files are missing or damaged.

Here are my questions:

Is it safe to leave the system as it is, or should I do a reinstall in
order to be sure that the system is 'clean'? How could I check that no
files other than those I reinstalled are affected?

Is it common that a failure of a raid disk leads to a system freeze,
even though the affected drive is _NOT_ part of / or any FHS directory?

Is it normal that an ext3 fs with journal gets corrupted in the process?

Is there anything I could do to avoid this in the future?


Thanks for your attention and your help!

Johannes


