
Re: Why didn't software RAID detect a faulty drive?



Answering all messages in this thread in one:

also sprach Seth Mattinen <sethm@rollernet.us> [2009.07.19.0206 +0200]:
> [3948800.929508] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
> [3948800.949314] ata2.00: BMDMA stat 0x4
> [3948800.960273] ata2.00: cmd ca/00:10:6f:02:22/00:00:00:00:00/e0 tag
> 0 dma 8192 out
> [3948800.960276]          res 51/84:0a:75:02:22/00:00:00:00:00/e0
> Emask 0x10 (ATA bus error)
> [3948801.007509] ata2.00: status: { DRDY ERR }
> [3948801.020017] ata2.00: error: { ICRC ABRT }
> [3948801.032537] ata2: soft resetting link
> [3948801.212298] ata2.00: configured for UDMA/33
> [3948801.225345] ata2: EH complete

I occasionally see those on my servers and have not yet been able to
figure out what they mean. I think they are related to SMART
self-tests initiated by smartd. Are you running any of those?
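For reference, smartd only launches self-tests when its configuration carries a "-s" schedule directive. A minimal sketch of the check (the config line below is a hypothetical sample; the commented commands are what you would run on the live box):

```shell
#!/bin/sh
# Hypothetical smartd.conf line; "-s (...)" is the directive that
# schedules periodic self-tests (see smartd.conf(5)).
conf='/dev/sda -a -s (S/../.././02|L/../../6/03)'

# A non-comment line containing "-s (" means self-tests are scheduled.
echo "$conf" | grep -q -- '-s (' && echo "self-tests scheduled"

# On the live system, check the real config and the drive's test log:
#   grep -v '^#' /etc/smartd.conf | grep -- '-s'
#   smartctl -l selftest /dev/sda
```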

> It's running software raid, so why is it locking up? I managed to
> log in as root and cat /proc/mdstat:
> 
> Personalities : [raid1]
> md0 : active raid1 sda1[0] sdb1[1]
>       78148096 blocks [2/2] [UU]
> unused devices: <none>

Yeah, the same happens here: the RAID does not degrade. That gives
me moderate confidence that these kernel messages do not indicate an
actual error, and certainly not a read error; the link just
hiccoughs, resets, and everyone moves on.
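For what it's worth, the [n/m] counter in that mdstat output is the thing to watch. A sketch using the sample line quoted above (on a real box you would read /proc/mdstat or use mdadm --detail instead):

```shell
#!/bin/sh
# Sample mdstat line from the quoted output; [2/2] means both members
# are active, a degraded mirror would show [2/1] and [U_] instead.
line='      78148096 blocks [2/2] [UU]'
echo "$line" | grep -o '\[[0-9]*/[0-9]*\]'   # -> [2/2]

# Live equivalents:
#   cat /proc/mdstat
#   mdadm --detail /dev/md0
```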

> Huh, I think to myself, stupid thing didn't work. So I try to
> manually fault it because it didn't figure it out on its own:
> 
> # mdadm /dev/md0 --fail /dev/sdb1
> 
> But that didn't help either:
> 
> [3948838.514699] raid1: Disk failure on sdb1, disabling device.
> [3948838.514702] raid1: Operation continuing on 1 devices.
> [3948846.397781] RAID1 conf printout:
> [3948846.409726]  --- wd:1 rd:2
> [3948846.418353]  disk 0, wo:0, o:1, dev:sda1
> [3948846.430623]  disk 1, wo:1, o:0, dev:sdb1
> [3948846.452002] md: recovery of RAID array md0

Now *this* is very weird though. If I do the same on a test array,
it stays faulted:

[323219.282745] raid1: Disk failure on sda, disabling device.
[323219.282747] raid1: Operation continuing on 1 devices.
[323219.314934] RAID1 conf printout:
[323219.316738]  --- wd:1 rd:2
[323219.318302]  disk 0, wo:1, o:0, dev:sda
[323219.320309]  disk 1, wo:0, o:1, dev:sdc
[323219.322267] RAID1 conf printout:
[323219.324046]  --- wd:1 rd:2
[323219.325645]  disk 1, wo:0, o:1, dev:sdc

I can restart mdadm, I can even restart the entire system, but the
RAID stays faulted.

Granted, my test system runs kernel 2.6.30-1 with mdadm 2.6.9-3, but
I wouldn't expect that to make a big difference.
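The telltale difference between the two logs is the "md: recovery" line: on your box the kernel starts rebuilding onto the member that was just failed; on mine it stays out. That is easy to grep for (the sample lines are taken from the quoted log; on a live system you would grep dmesg):

```shell
#!/bin/sh
# Excerpt from the quoted kernel log: the fail event is immediately
# followed by a resync, which should not happen after a manual --fail.
log='raid1: Disk failure on sdb1, disabling device.
md: recovery of RAID array md0'

if echo "$log" | grep -q '^md: recovery'; then
    echo "array started rebuilding right after the --fail"
fi

# Live equivalent:
#   dmesg | grep -E 'raid1: Disk failure|md: recovery'
```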




also sprach Henrique de Moraes Holschuh <hmh@debian.org> [2009.07.19.0218 +0200]:
> > It's running software raid, so why is it locking up? I managed to
> 
> Because the EH can be quite... annoying to other ports in the same
> controller, especially if the controller is a shitty one (I don't know if
> that's the case).

Care to elaborate? I don't even know what "the EH" is.




also sprach Henrique de Moraes Holschuh <hmh@debian.org> [2009.07.19.0233 +0200]:
> > >I have never seen anything like this.  Do you have any daemons
> > >trying to do "hotspare" services for MD?  THAT could be it...
> > 
> > Not that I'm aware of. How would I check?
> 
> A bug in mdadm --monitor might cause it, I think.  But I have
> never heard of it happening.

I don't see how --monitor could have anything to do with it.
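As far as I understand it, mdadm --monitor only sends alerts and moves spares between arrays that share a spare-group in mdadm.conf; it has no code path that re-adds a member that was explicitly failed. Checking whether one is running at all is simple enough (the bracket in the grep pattern just keeps it from matching itself):

```shell
#!/bin/sh
# Look for a running monitor; prints its command line if one exists.
# On a box without one, the fallback message fires instead.
ps axww | grep '[m]dadm --monitor' || echo "no monitor running"
```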




also sprach Seth Mattinen <sethm@rollernet.us> [2009.07.19.0237 +0200]:
> I have a 600k capture file off the serial console if you're
> interested. Unfortunately it's a production system so I can't play
> with it.

Sure. ftp://ftp.madduck.net/incoming




also sprach Henrique de Moraes Holschuh <hmh@debian.org> [2009.07.19.0246 +0200]:
> I wouldn't be able to help you much with it, but either the mdadm
> maintainer (IF "ps auxwww | grep mdadm" tells you mdadm is
> running) or a bug in bugzilla.kernel.org directed to the md driver
> maintainer should be able to get you someone who can ;)

I'd write to linux-raid@vger.kernel.org first (please CC me), or
file a bug against mdadm in the Debian bug tracker.

Cheers,

-- 
 .''`.   martin f. krafft <madduck@d.o>      Related projects:
: :'  :  proud Debian developer               http://debiansystem.info
`. `'`   http://people.debian.org/~madduck    http://vcs-pkg.org
  `-  Debian - when you have better things to do than fixing systems
 
"i never go without my dinner. no one ever does, except vegetarians
 and people like that."
                                                        -- oscar wilde
