Answering all messages in this thread in one go:
also sprach Seth Mattinen <sethm@rollernet.us> [2009.07.19.0206 +0200]:
> [3948800.929508] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
> [3948800.949314] ata2.00: BMDMA stat 0x4
> [3948800.960273] ata2.00: cmd ca/00:10:6f:02:22/00:00:00:00:00/e0 tag 0 dma 8192 out
> [3948800.960276]          res 51/84:0a:75:02:22/00:00:00:00:00/e0 Emask 0x10 (ATA bus error)
> [3948801.007509] ata2.00: status: { DRDY ERR }
> [3948801.020017] ata2.00: error: { ICRC ABRT }
> [3948801.032537] ata2: soft resetting link
> [3948801.212298] ata2.00: configured for UDMA/33
> [3948801.225345] ata2: EH complete
I occasionally see those on my servers and have not yet been able to
figure out what they mean. I think they are related to SMART
self-tests initiated by smartd. Are you running any of those?
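For what it's worth, the "status" and "error" lines are just the first two
bytes of the "res" field spelled out. Below is a minimal sketch of that
decoding, assuming the standard ATA status and error register bit values;
decode_res and the two tables are names I made up for illustration, and the
tables only cover the bits libata prints in these messages. As far as I know,
ICRC is the ATA interface CRC error bit, but the sketch does not tell you what
caused it.

```python
import re

# Assumed ATA status register bits (only the ones libata reports here).
STATUS_BITS = [(0x80, "BSY"), (0x40, "DRDY"), (0x20, "DF"), (0x08, "DRQ"), (0x01, "ERR")]
# Assumed ATA error register bits (only the ones libata reports here).
ERROR_BITS = [(0x80, "ICRC"), (0x40, "UNC"), (0x10, "IDNF"), (0x04, "ABRT")]

# The result taskfile starts with "res SS/EE:", where SS is status, EE is error.
RES_RE = re.compile(r"res\s+([0-9a-f]{2})/([0-9a-f]{2}):")


def decode_res(line):
    """Return (status_flags, error_flags) for a libata 'res' line, or None."""
    match = RES_RE.search(line)
    if match is None:
        return None
    status = int(match.group(1), 16)
    error = int(match.group(2), 16)
    status_flags = [name for bit, name in STATUS_BITS if status & bit]
    error_flags = [name for bit, name in ERROR_BITS if error & bit]
    return status_flags, error_flags


if __name__ == "__main__":
    sample = "[3948800.960276] res 51/84:0a:75:02:22/00:00:00:00:00/e0"
    print(decode_res(sample))
```

Fed the "res" line quoted above, this yields DRDY and ERR for the status byte
and ICRC and ABRT for the error byte, which matches what the kernel printed.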
> It's running software raid, so why is it locking up? I managed to
> log in as root and cat /proc/mdstat:
>
> Personalities : [raid1]
> md0 : active raid1 sda1[0] sdb1[1]
> 78148096 blocks [2/2] [UU]
> unused devices: <none>
Yeah, the same happens here: the RAID does not degrade. That makes me
moderately confident that the kernel messages do not indicate an
actual read error, just a hiccough that the kernel recovers from
before everyone moves on.
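If you would rather check for degradation in a script than by eye, here is a
rough sketch that scans /proc/mdstat-style text for arrays with a missing
member. It assumes the layout shown in the quoted output (an "mdN :" line
followed by a line carrying "[total/working] [UU]"); degraded_arrays is a
hypothetical helper name, and device names that are not of the form mdN are
not handled.

```python
import re

# Matches the per-array status, e.g. "[2/2] [UU]" or "[2/1] [U_]".
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")
# Matches the line that starts an array entry, e.g. "md0 : active raid1 ...".
ARRAY_RE = re.compile(r"^(md\d+)\s*:")


def degraded_arrays(mdstat_text):
    """Return the names of arrays whose status shows a missing member."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = ARRAY_RE.match(line)
        if header:
            current = header.group(1)
            continue
        status = STATUS_RE.search(line)
        if status and current is not None:
            total, working, flags = status.groups()
            # "_" marks a missing or failed member; the counts should agree too.
            if "_" in flags or total != working:
                degraded.append(current)
            current = None
    return degraded


if __name__ == "__main__":
    with open("/proc/mdstat") as handle:
        print(degraded_arrays(handle.read()))
```

On the output quoted above this returns an empty list, which is the point: as
far as md is concerned, nothing is wrong with the array.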
> Huh, I think to myself, stupid thing didn't work. So I try to
> manually fault it because it didn't figure it out on its own:
>
> # mdadm /dev/md0 --fail /dev/sdb1
>
> But that didn't help either:
>
> [3948838.514699] raid1: Disk failure on sdb1, disabling device.
> [3948838.514702] raid1: Operation continuing on 1 devices.
> [3948846.397781] RAID1 conf printout:
> [3948846.409726] --- wd:1 rd:2
> [3948846.418353] disk 0, wo:0, o:1, dev:sda1
> [3948846.430623] disk 1, wo:1, o:0, dev:sdb1
> [3948846.452002] md: recovery of RAID array md0
Now *this* is very weird though. If I do the same on a test array,
it stays faulted:
[323219.282745] raid1: Disk failure on sda, disabling device.
[323219.282747] raid1: Operation continuing on 1 devices.
[323219.314934] RAID1 conf printout:
[323219.316738] --- wd:1 rd:2
[323219.318302] disk 0, wo:1, o:0, dev:sda
[323219.320309] disk 1, wo:0, o:1, dev:sdc
[323219.322267] RAID1 conf printout:
[323219.324046] --- wd:1 rd:2
[323219.325645] disk 1, wo:0, o:1, dev:sdc
I can restart mdadm, I can even restart the entire system, but the
RAID stays faulted.
Granted, my test system runs kernel 2.6.30-1 with mdadm 2.6.9-3, but
I wouldn't expect that to make a big difference.
also sprach Henrique de Moraes Holschuh <hmh@debian.org> [2009.07.19.0218 +0200]:
> > It's running software raid, so why is it locking up? I managed to
>
> Because the EH can be quite... annoying to other ports in the same
> controller, especially if the controller is a shitty one (I don't know if
> that's the case).
Care to elaborate? I don't even know what "the EH" is.
also sprach Henrique de Moraes Holschuh <hmh@debian.org> [2009.07.19.0233 +0200]:
> > >I have never seen anything like this. Do you have any daemons
> > >trying to do "hotspare" services for MD? THAT could be it...
> >
> > Not that I'm aware of. How would I check?
>
> A bug in mdadm --monitor might cause it, I think. But I have
> never heard of it happening.
I don't see how --monitor could have anything to do with it.
also sprach Seth Mattinen <sethm@rollernet.us> [2009.07.19.0237 +0200]:
> I have a 600k capture file off the serial console if you're
> interested. Unfortunately it's a production system so I can't play
> with it.
Sure. ftp://ftp.madduck.net/incoming
also sprach Henrique de Moraes Holschuh <hmh@debian.org> [2009.07.19.0246 +0200]:
> I wouldn't be able to help you much with it, but either the mdadm
> maintainer (IF "ps auxwww | grep mdadm" tells you mdadm is
> running) or a bug in bugzilla.kernel.org directed to the md driver
> maintainer should be able to get you someone who can ;)
I'd write to linux-raid@vger.kernel.org first (please CC me), or
file a bug against mdadm in the Debian bug tracker.
Cheers,
--
.''`. martin f. krafft <madduck@d.o> Related projects:
: :' : proud Debian developer http://debiansystem.info
`. `'` http://people.debian.org/~madduck http://vcs-pkg.org
`- Debian - when you have better things to do than fixing systems
"i never go without my dinner. no one ever does, except vegetarians
and people like that."
-- oscar wilde