
Re: sarge freezes after failure of raid disk, incurring fs corruption on unrelated disk



On Fri, Feb 02, 2007 at 02:03:16PM +0100, Johannes Wiedersich wrote:
> First the good news: after some repairs, the system appears to be
> 'clean' again, running as usual.
> 
> The box has three hard disks: /dev/hda with the root partition and
> /dev/hdb and /dev/hdd with a raid1 for data.

I agree with Doug: the failing hdb probably brought down the whole IDE
controller, resulting in corruption on hda. Also, a pull-the-plug
shutdown can easily corrupt data.

> 
> /---
> athene:~# mount
> /dev/hda1 on / type ext3 (rw,errors=remount-ro)
> proc on /proc type proc (rw)
> sysfs on /sys type sysfs (rw)
> devpts on /dev/pts type devpts (rw,gid=5,mode=620)
> tmpfs on /dev/shm type tmpfs (rw)
> /dev/hda6 on /diskathene type ext3 (rw)
> /dev/md0 on /atheneraid type ext3 (rw)
> /atheneraid/tausch on /tausch type none (rw,bind,uid=105,gid=100)
> usbfs on /proc/bus/usb type usbfs (rw)
> 192.168.0.24:/home on /home type nfs (rw,addr=192.168.0.24)
> none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
> \---
> 
> Yesterday, /dev/hdb suddenly died:

bummer

> I replaced the defective disk and rebooted. Rebuilding and syncing the
> raid device took a few hours, but worked fine.
> 
> To check for sure that everything is ok I 'shutdown -rF now'ed the box.
> On fscking / there were a lot of errors, involving files in /var/ .

Makes sense: /var has lots of files open while a system is running,
so /var gets corrupted when the drive goes down. On top of that,
there's a lot of r/w to /var, and if that's happening during a
failure, it'll definitely get munched.

> 
> After another boot and fsck of all partitions everything seemed fine.
> Only after trying to install some program via aptitude, I got the
> following:
> 
> /---
> dpkg: serious warning: files list file for package `libident' missing,
> assuming package has no files currently installed.
> 
> dpkg: serious warning: files list file for package `libldap2' missing,
> assuming package has no files currently installed.
> 
> dpkg: serious warning: files list file for package `libmagic1' missing,
> assuming package has no files currently installed.
> 
> dpkg: serious warning: files list file for package `libldap-2.2-7'
> missing, assuming package has no files currently installed.
> 
> dpkg: serious warning: files list file for package `mpack' missing,
> assuming package has no files currently installed.
> 
> dpkg: serious warning: files list file for package `binutils' missing,
> assuming package has no files currently installed.

Do you have records of the fscks you did? What you've got here is
stuff missing from /var. If all your corruption was in /var, you may
be safe. It says "files list file"... that's part of dpkg's database
(the .list files under /var/lib/dpkg/info), right? It's just dpkg
that thinks stuff is missing, because it keeps its records in /var.
It's very likely that none of these packages were actually damaged,
just the records of these packages.
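
A rough way to see whether any other packages are in the same state
is to look for missing files lists directly. dpkg keeps them as
<package>.list under /var/lib/dpkg/info. The sketch below is only an
illustration: missing_lists is a made-up helper name, and the extra
<package>:<arch>.list pattern is an assumption that only matters on
newer multiarch systems.

```shell
#!/bin/sh
# Hypothetical helper: read package names from stdin and print every
# one that has no files list in the given dpkg info directory.
missing_lists() {
    info_dir=$1
    while read -r pkg; do
        found=no
        # Plain name first, then the multiarch form (assumption: only
        # present on newer systems; an unmatched glob stays literal
        # and simply fails the -e test).
        for f in "$info_dir/$pkg.list" "$info_dir/$pkg":*.list; do
            if [ -e "$f" ]; then
                found=yes
            fi
        done
        if [ "$found" = no ]; then
            echo "$pkg"
        fi
    done
}

# Assumed usage on the real system (not run here):
#   dpkg-query -W -f='${Package}\n' | missing_lists /var/lib/dpkg/info
```

dpkg-query -W may also list packages that are known but not fully
installed, so a name in the output is a hint to look closer, not
proof of damage.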

> \---
> 
> This was very strange to me so I reinstalled all the mentioned packages.
> That worked without problems, and now all the warnings are gone.
> 
> I would just like to _be sure_, that everything is ok now, and that
> there are no more missing or damaged files around.
> 
> Here are my questions:
> 
> Is it safe to leave the system as it is, or should I do a reinstall in
> order to be sure that the system is 'clean'? How could I check, that no
> other files are affected except those 'reinstalled'?

Depends on how you use the system. If it's got mission-critical stuff
on it, then probably not. Otherwise, I'd say it's probably okay. If you
dist-upgrade regularly, or install stuff regularly, you'll eventually
run across all the problems and fix them. Likewise, as you discover
commands that don't work, you'll know there was a problem.

But... better just reinstall.
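
If you'd rather check before reinstalling: many packages ship a list
of MD5 sums for their files (/var/lib/dpkg/info/<package>.md5sums),
and installed files can be compared against it. This is only a
sketch. verify_md5sums is a made-up helper name, not every package
ships md5sums, and a list that was itself damaged in /var proves
nothing. The debsums package does the same job more thoroughly if
you have it installed.

```shell
#!/bin/sh
# Hypothetical helper: check one dpkg-style md5sums file against the
# files under a root directory and print only entries that do not
# verify. Entries look like "<md5>  <path relative to root>".
# The md5sums file must be given as an absolute path, because the
# check runs from inside the root directory.
verify_md5sums() {
    root=$1
    sums=$2
    # md5sum -c prints "<path>: OK" or "<path>: FAILED ..." per entry;
    # LC_ALL=C keeps those words untranslated so grep can filter them.
    (cd "$root" && LC_ALL=C md5sum -c "$sums" 2>/dev/null) \
        | grep -v ': OK$' || true
}

# Assumed usage on the real system (not run here):
#   verify_md5sums / /var/lib/dpkg/info/binutils.md5sums
```

No output means every listed file matched. Config files you edited
yourself are normally not in these lists, so they are not covered.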

> 
> Is it common, that a failure of a raid disk leads to a system freeze,
> even though the affected drive is _NOT_ part of / or any FHS directory?
> 
> Is it normal that an ext3 fs with journal gets corrupted in the process?

Sure. Any fs can get corrupted, especially when the IDE channel is
fried.

> 
> Is there anything I could do to try to avoid this for the future?

As Doug said, split your disks onto different IDE channels. Pick up a
$20 PCI IDE adapter and move your disks onto it. Maybe put your / on
RAID 1 as well.

Sounds to me, though, like your RAID did its job: it protected the data
in the array -- it came up and resynced properly. Unfortunately, your
/ got hammered by circumstance. Good lesson, I'd say. :)
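
To double-check that the array really is back at full strength after
the resync, look at /proc/mdstat: a healthy two-disk raid1 shows [UU],
and a missing member shows up as an underscore, e.g. [_U]. A small
sketch of that check follows. degraded_arrays is a made-up helper
name, and it assumes the usual layout where the status line follows
the "mdN :" line.

```shell
#!/bin/sh
# Hypothetical helper: print the name of every md array whose status
# field (e.g. [U_]) contains an underscore, i.e. has a missing member.
# Takes the path to an mdstat-style file, normally /proc/mdstat.
degraded_arrays() {
    awk '
        # Remember the array name from lines like "md0 : active raid1 ..."
        /^md[0-9]+ :/ { name = $1 }
        # A status field with at least one "_" marks a degraded array
        /\[[U_]*_[U_]*\]/ && name != "" { print name; name = "" }
    ' "$1"
}

# Assumed usage on the real system (not run here):
#   degraded_arrays /proc/mdstat
```

No output means no array reported a missing member at that moment.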

A
