Re: Re: Just received a fail event from mdadm (UncorrectableError), is my drive dead?

To: debian-isp@lists.debian.org
Subject: Re: Re: Just received a fail event from mdadm (UncorrectableError), is my drive dead?
From: "Mike Garey" <random51k@gmail.com>
Date: Fri, 15 Dec 2006 23:31:14 -0500
Message-id: <c79e949d0612152031k28c848e9vf6c83866520dc6cd@mail.gmail.com>
In-reply-to: <20061206212521.GA9358@taz.net.au>
References: <c79e949d0610301031t2821bf30xf69ad2d36807d646@mail.gmail.com> <99E022B4C29BAC593D7C048D@dhcp-2-206.wgops.com> <20061030190053.GA19835@apep.keimel.com> <20061030231420.GA15604@piper.madduck.net> <c79e949d0612051557t745fcc31o54f7bbf27700bc98@mail.gmail.com> <20061206212521.GA9358@taz.net.au>

Alright, continuing the saga of my failed hard drive: I decided to
send back the slightly smaller 160 gig drive and exchange it for a
250 gig Western Digital.  I just installed the new drive as hdc,
copied over the partition table from hda (the 160 gig drive) using
"sfdisk -d /dev/hda | sfdisk /dev/hdc" and then re-added hdc to the
array using "mdadm --add /dev/md0 /dev/hdc1".  I checked /proc/mdstat
and everything seemed to be running fine.
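
For reference, a rough way to tell whether the rebuild is still running
is to look for a "resync" or "recovery" progress line in /proc/mdstat.
The following is only a sketch: the function name md_sync_state is made
up for illustration (it is not part of mdadm), and it assumes the usual
mdstat layout in which a rebuilding array shows a line containing
"recovery = ..." or "resync = ...".

```shell
#!/bin/sh
# md_sync_state [FILE]
# Hypothetical helper: prints "syncing" if FILE (default /proc/mdstat)
# contains a resync/recovery progress line, otherwise prints "idle".
md_sync_state() {
    file="${1:-/proc/mdstat}"
    if grep -Eq '(resync|recovery) *=' "$file"; then
        echo "syncing"
    else
        echo "idle"
    fi
}

# Usage on the server itself:   md_sync_state
# Usage on a saved copy:        md_sync_state /tmp/mdstat.txt
```

Note that "idle" only means no rebuild is in progress; a degraded array
that has given up syncing would also report "idle", so the [UU] / [U_]
status in /proc/mdstat still needs to be read by eye.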

After about an hour, I decided to check the status to see if the
drives had finished syncing, but I wasn't able to connect through ssh,
so I took a trip over to the server room and saw the following (keep in
mind that hda was the previously _working_ disk, it was hdc that had
failed and was being replaced):

hda: dma_intr: status = 0x51 {driveready seekcomplete error }
hda: dma_intr: error = 0x40 { uncorrectable error }

ide failed opcode was: unknown
raid1: hda: unrecoverable i/o read error for block...
<snip>
md: md0 sync done
raid1 conf printout
<snip>
md: syncing raid array md0
hda: dma_timer_expiry: dma status = 0x21
<snip>
hda: dma disabled
ide0 reset: success

irq timeout: status = 0xd0 { busy }
ide failed opcode was: unknown
ide: reset: success
<keeps looping.. >

So it looks like hda decided to die while I was syncing the two
drives.. I'm not sure whether the sync completed successfully or
whether hda started throwing errors before it finished.. All I know is
that it's not a good sign when your raid array has only one disk and
that disk starts producing errors in the midst of re-adding another.
In any case, I decided to pull the power, disconnect hdc and restart.
The machine has since booted without issue from hda, so it seems to be
working, although it doesn't fill me with much confidence..

Can anyone give me some suggestions on what I should do next?  Should
I try connecting only the new 250 gig drive (hdc) and seeing if it
boots, and if so, disconnect hda, buy another 250 gig replacement
drive, and then try to sync them? Or could this be a bad idea if hdc
didn't fully sync the first time?

I've also got a backup of the entire machine that I perform every
night.. Should I just get another 250 gig drive, copy over the
contents of the backup to one of these new drives and use them in the
array?

In other words, it seems as though hda should be considered
compromised and that I should get rid of it as soon as possible.. Is
this assumption correct?  For those of you who are curious, both of the
160 gig drives that I was using were made by Maxtor.. I don't think
I'll be buying any more Maxtor drives, that's for sure (I've also had
bad experiences in the past with Maxtor, which is why I tend to stay
away from them, but when I built this array, these drives were already
available and waiting to be used).

I've included the contents of /var/log/messages below, as well as the
output of smartctl.

Any help is greatly appreciated.  Thanks,

Mike

===================================
output from smartctl -a /dev/hda

SMART Error Log Version: 1
Warning: ATA error count 153 inconsistent with error log pointer 5

ATA Error Count: 153 (device log contains only the most recent five errors)

Error 153 occurred at disk power-on lifetime: 3375 hours (140 days + 15 hours)
 When the command that caused the error occurred, the device was in
an unknown state.

 After command completion occurred, registers were:
 ER ST SC SN CL CH DH
 -- -- -- -- -- -- --
 40 51 3a 7f f3 96 e0  Error: UNC 58 sectors at LBA = 0x0096f37f = 9892735

 Commands leading to the command that caused the error were:
 CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
 -- -- -- -- -- -- -- --  ----------------  --------------------
 25 00 40 7f f3 96 e0 08      00:49:23.568  READ DMA EXT
 25 00 48 77 f3 96 e0 08      00:49:22.496  READ DMA EXT
 25 00 50 6f f3 96 e0 08      00:49:21.424  READ DMA EXT
 25 00 58 67 f3 96 e0 08      00:48:14.816  READ DMA EXT
 25 00 60 5f f3 96 e0 08      00:48:13.744  READ DMA EXT

Error 152 occurred at disk power-on lifetime: 3375 hours (140 days + 15 hours)
 When the command that caused the error occurred, the device was in
an unknown state.
===================================

The last time I ran smartctl on hda was after hdc died, and it reported
no errors..
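
As a sanity check on the smartctl output, the LBA it prints for Error
153 can be rebuilt from the register dump. This is a sketch under the
assumption that the registers follow the classic 28-bit ATA layout (SN,
CL and CH hold the low 24 bits, and the low nibble of DH holds bits
24-27); the function name lba28 is just something I made up for
illustration.

```shell
#!/bin/sh
# lba28 SN CL CH DH
# Combines the ATA task-file registers (two hex digits each, as printed
# by smartctl) into a 28-bit LBA. Only the low nibble of DH is address;
# the upper nibble holds mode/device bits and is masked off.
lba28() {
    echo $(( ((0x$4 & 0x0f) << 24) | (0x$3 << 16) | (0x$2 << 8) | 0x$1 ))
}

lba28 7f f3 96 e0   # registers from Error 153; prints 9892735
```

That matches the "LBA = 0x0096f37f = 9892735" smartctl reported. Since
the failing commands are READ DMA EXT (a 48-bit command), this one
register line may not show the high-order bits of the real sector
number.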

===================================
output of /var/log/messages
Dec 15 19:32:50 asterisk kernel:  hdc: unknown partition table
Dec 15 19:32:53 asterisk kernel:  hdc: hdc1 hdc2 < hdc5 >
Dec 15 19:39:01 asterisk /USR/SBIN/CRON[7034]: (root) CMD (  [ -d
/var/lib/php5 ] && find /var/lib/php5/ -type f -cmin
+$(/usr/lib/php5/maxlifetime) -print0 | xargs -r -0 rm)
Dec 15 19:41:01 asterisk kernel: md: bind<hdc1>
Dec 15 19:41:01 asterisk kernel: RAID1 conf printout:
Dec 15 19:41:01 asterisk kernel:  --- wd:1 rd:2
Dec 15 19:41:01 asterisk kernel:  disk 0, wo:0, o:1, dev:hda1
Dec 15 19:41:01 asterisk kernel:  disk 1, wo:1, o:1, dev:hdc1
Dec 15 19:41:01 asterisk kernel: md: syncing RAID array md0
Dec 15 19:41:01 asterisk kernel: md: minimum _guaranteed_
reconstruction speed: 150000 KB/sec/disc.
Dec 15 19:41:01 asterisk kernel: md: using maximum available idle IO
bandwidth (but not more than 200000 KB/sec) for reconstruction.
Dec 15 19:41:01 asterisk kernel: md: using 128k window, over a total
of 158577472 blocks.
Dec 15 20:00:28 asterisk smartd[6783]: Device: /dev/hda, SMART
Prefailure Attribute: 8 Seek_Time_Performance changed from 231 to 245
Dec 15 20:00:29 asterisk smartd[6783]: Device: /dev/hdc, SMART Usage
Attribute: 194 Temperature_Celsius changed from 132 to 118
Dec 15 20:05:22 asterisk kernel: hda: dma_intr: status=0x51 {
DriveReady SeekComplete Error }
Dec 15 20:05:22 asterisk kernel: hda: dma_intr: error=0x40 {
UncorrectableError }, LBAsect=93778821, high=5, low=9892741,
sector=93777599
Dec 15 20:05:22 asterisk kernel: ide: failed opcode was: unknown
Dec 15 20:05:22 asterisk kernel: end_request: I/O error, dev hda,
sector 93777599
Dec 15 20:05:24 asterisk kernel: hda: dma_intr: status=0x51 {
DriveReady SeekComplete Error }
Dec 15 20:05:24 asterisk kernel: hda: dma_intr: error=0x40 {
UncorrectableError }, LBAsect=93778821, high=5, low=9892741,
sector=93777607
Dec 15 20:05:24 asterisk kernel: ide: failed opcode was: unknown
===================================
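
The numbers in the kernel messages above can be cross-checked with a
bit of shell arithmetic. This is only a sketch: it assumes "high" and
"low" are the upper and lower 24 bits of the 48-bit sector address, and
that the md "blocks" figure is in 1 KiB units (so 158577472 blocks is
317154944 sectors of 512 bytes). The function names lba48 and
percent_of are hypothetical.

```shell
#!/bin/sh
# lba48 HIGH LOW
# Assumption: HIGH is the upper 24 bits and LOW the lower 24 bits of
# the sector address reported by the IDE driver.
lba48() {
    echo $(( ($1 << 24) | $2 ))
}

# percent_of PART TOTAL
# Integer percentage, rounded down.
percent_of() {
    echo $(( $1 * 100 / $2 ))
}

lba48 5 9892741                 # prints 93778821, the LBAsect value logged
percent_of 93777599 317154944   # failing sector vs. array size; prints 29
```

If the resync reads the disk roughly linearly from the start, this
suggests the first read error hit somewhere around 29% of the way
through. It ignores the partition's start offset on the disk, so treat
it as a ballpark figure only.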

On 12/6/06, Craig Sanders <cas@taz.net.au> wrote:

On Tue, Dec 05, 2006 at 06:57:29PM -0500, Mike Garey wrote:
> sfdisk: ERROR: sector 0 does not have an msdos signature
> /dev/hdc: unrecognized partition table type
> Old situation:
> No partitions found
> Warning: given size (317155167) exceeds max allowable size (312576642)
>
> sfdisk: bad input
>
> dmesg shows the following:
>
> hda: 320173056 sectors (163928 MB) w/7936KiB Cache, CHS=19929/255/63,
> UDMA(100)
> hdc: 312581808 sectors (160041 MB) w/8192KiB Cache, CHS=19457/255/63,
> UDMA(100)
>
> so it seems the new drive (hdc) is slightly smaller than the current
> working drive (hda)..  Does anyone have any suggestions of what to do?
> I guess I could manually create the partition table to be similar to
> the current drive, then copy over the contents of the current drive to
> the new drive, then copy the partition table from the new drive to the
> current drive and then get them to resync.. But I'm wondering if maybe
> there's an easier way.  If anyone has any advice, please let me know..

either that, or get another drive which is at least as big as hda.  that
probably means a 200GB drive.

in other words, you have a choice between downtime and expense.

who knows, you may be able to get a refund for the 160GB drive you
bought. or sell it to someone. at worst you could use it for something
else....the office mp3 jukebox machine, for instance :)


craig

--
craig sanders <cas@taz.net.au>           (part time cyborg)
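
A footnote on the size mismatch in the quoted dmesg and sfdisk output:
the figures are consistent with 512-byte sectors and decimal megabytes,
which can be checked as below. This is a sketch; the function names
sectors_to_mb and fits are made up for illustration.

```shell
#!/bin/sh
# sectors_to_mb SECTORS
# Size in decimal megabytes, assuming 512-byte sectors (rounded down).
sectors_to_mb() {
    echo $(( $1 * 512 / 1000000 ))
}

# fits NEEDED AVAILABLE
# Exit status 0 if NEEDED sectors fit within AVAILABLE sectors.
fits() {
    [ "$1" -le "$2" ]
}

sectors_to_mb 320173056   # hda; prints 163928
sectors_to_mb 312581808   # the smaller replacement; prints 160041
# Sizes from the sfdisk warning: requested vs. maximum allowable.
fits 317155167 312576642 || echo "partition too large for the smaller drive"
```

The same comparison is worth doing before buying any replacement: the
new disk needs at least as many sectors as the partition being
mirrored, whatever the label on the box says.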

