Answering all messages in this thread in one:

also sprach Seth Mattinen <sethm@rollernet.us> [2009.07.19.0206 +0200]:
> [3948800.929508] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
> [3948800.949314] ata2.00: BMDMA stat 0x4
> [3948800.960273] ata2.00: cmd ca/00:10:6f:02:22/00:00:00:00:00/e0 tag 0 dma 8192 out
> [3948800.960276]          res 51/84:0a:75:02:22/00:00:00:00:00/e0 Emask 0x10 (ATA bus error)
> [3948801.007509] ata2.00: status: { DRDY ERR }
> [3948801.020017] ata2.00: error: { ICRC ABRT }
> [3948801.032537] ata2: soft resetting link
> [3948801.212298] ata2.00: configured for UDMA/33
> [3948801.225345] ata2: EH complete

I occasionally see those on my servers and have not yet been able to
figure out what they mean. I think they are related to SMART
self-tests initiated by smartd. Are you running any of those?

> It's running software raid, so why is it locking up? I managed to
> log in as root and cat /proc/mdstat:
>
> Personalities : [raid1]
> md0 : active raid1 sda1[0] sdb1[1]
>       78148096 blocks [2/2] [UU]
> unused devices: <none>

Yeah, the same happens here: the RAID does not degrade. This gives
me moderate confidence that the kernel messages do not indicate
a real error, and certainly not a read error, but just a hiccough
after which everyone moves on.

> Huh, I think to myself, stupid thing didn't work. So I try to
> manually fault it because it didn't figure it out on its own:
>
> # mdadm /dev/md0 --fail /dev/sdb1
>
> But that didn't help either:
>
> [3948838.514699] raid1: Disk failure on sdb1, disabling device.
> [3948838.514702] raid1: Operation continuing on 1 devices.
> [3948846.397781] RAID1 conf printout:
> [3948846.409726]  --- wd:1 rd:2
> [3948846.418353]  disk 0, wo:0, o:1, dev:sda1
> [3948846.430623]  disk 1, wo:1, o:0, dev:sdb1
> [3948846.452002] md: recovery of RAID array md0

Now *this* is very weird though.
If I do the same on a test array, it stays faulted:

  [323219.282745] raid1: Disk failure on sda, disabling device.
  [323219.282747] raid1: Operation continuing on 1 devices.
  [323219.314934] RAID1 conf printout:
  [323219.316738]  --- wd:1 rd:2
  [323219.318302]  disk 0, wo:1, o:0, dev:sda
  [323219.320309]  disk 1, wo:0, o:1, dev:sdc
  [323219.322267] RAID1 conf printout:
  [323219.324046]  --- wd:1 rd:2
  [323219.325645]  disk 1, wo:0, o:1, dev:sdc

I can restart mdadm, I can even restart the entire system, but the
RAID stays faulted. Granted, the test system runs kernel 2.6.30-1
with mdadm 2.6.9-3, but I wouldn't think this makes a big difference.

also sprach Henrique de Moraes Holschuh <hmh@debian.org> [2009.07.19.0218 +0200]:
> > It's running software raid, so why is it locking up? I managed to
>
> Because the EH can be quite... annoying to other ports in the same
> controller, especially if the controller is a shitty one (I don't know if
> that's the case).

Care to elaborate? I don't even know what "the EH" is.

also sprach Henrique de Moraes Holschuh <hmh@debian.org> [2009.07.19.0233 +0200]:
> > >I have never seen anything like this. Do you have any daemons
> > >trying to do "hotspare" services for MD? THAT could be it...
> >
> > Not that I'm aware of. How would I check?
>
> A bug in mdadm --monitor might cause it, I think. But I have
> never heard of it happening.

I don't see how --monitor could have anything to do with it.

also sprach Seth Mattinen <sethm@rollernet.us> [2009.07.19.0237 +0200]:
> I have a 600k capture file off the serial console if you're
> interested. Unfortunately it's a production system so I can't play
> with it.

Sure.
ftp://ftp.madduck.net/incoming

also sprach Henrique de Moraes Holschuh <hmh@debian.org> [2009.07.19.0246 +0200]:
> I wouldn't be able to help you much with it, but either the mdadm
> maintainer (IF "ps auxwww | grep mdadm" tells you mdadm is
> running) or a bug in bugzilla.kernel.org directed to the md driver
> maintainer should be able to get you someone who can ;)

I'd write to linux-raid@vger.kernel.org first (please CC me), or
file a bug against mdadm in the Debian bug tracker.

Cheers,

-- 
 .''`.   martin f. krafft <madduck@d.o>      Related projects:
: :'  :  proud Debian developer               http://debiansystem.info
`. `'`   http://people.debian.org/~madduck    http://vcs-pkg.org
  `-  Debian - when you have better things to do than fixing systems

"i never go without my dinner. no one ever does, except vegetarians
 and people like that."
                                                        -- oscar wilde
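
PS: Since the question of whether the array really degraded keeps
coming up in this thread, here is a minimal sketch (not anything that
mdadm ships) that flags arrays whose member map in /proc/mdstat
contains a "_". The function name degraded_arrays and the SAMPLE
string are made up for illustration; the sample is the mdstat output
quoted above, and the parsing assumes the classic "[n/m] [UU]" status
line format shown there.

```python
import re

# Sample taken from the /proc/mdstat output quoted in this thread.
SAMPLE = """Personalities : [raid1]
md0 : active raid1 sda1[0] sdb1[1]
      78148096 blocks [2/2] [UU]
unused devices: <none>
"""


def degraded_arrays(mdstat_text):
    """Return the names of md arrays whose member map shows a missing disk.

    Hypothetical helper: it only understands status lines of the form
    "... [n/m] [UU]", where "_" marks a missing or faulty member.
    """
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        # An array stanza starts with e.g. "md0 : active raid1 ...".
        header = re.match(r"^(md\w+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # The following line carries the member count and the U/_ map.
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if status and current:
            if "_" in status.group(3):
                degraded.append(current)
            current = None
    return degraded
```

On a live system you would pass it the contents of /proc/mdstat
instead of SAMPLE. Note that it only reports what md itself believes,
so it would not have caught the case above where the kernel logs ATA
errors but the array stays at [UU].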