
Which disk is failing?



I have a RAID1 (using md) running on two USB disks. (I'm working on moving
to eSATA, but it's USB for now.) That means I get no SMART data from the
drives. Meanwhile, I've been getting occasional fail events, and
unfortunately they don't tell me which disk is failing.
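
I gather newer smartmontools can sometimes reach SMART data through a USB
bridge with something along these lines, though I haven't verified that my
enclosures support the SAT passthrough at all:

  smartctl -d sat -a /dev/sda
  smartctl -d sat -a /dev/sdb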

When the system comes up, it seems to be entirely random which disk comes
up as /dev/sda and which comes up as /dev/sdb. In fact, since my root disk
is on SATA, at least once the root disk came up as /dev/sda and the USB
drives came up as /dev/sdb and /dev/sdc, though I think that was under a
different kernel version. When I get a failure email, it tells me that it
might be
due to /dev/sda1 failing -- except when it tells me that it might be due to
/dev/sdb1 failing. When things are working, mdadm -D /dev/md0 looks like
this:


/dev/md0:
        Version : 00.90
  Creation Time : Wed Feb 22 20:50:29 2006
     Raid Level : raid1
     Array Size : 312496256 (298.02 GiB 320.00 GB)
  Used Dev Size : 312496256 (298.02 GiB 320.00 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Thu Jul 22 07:30:46 2010
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           UUID : e4feee4a:6b6be6d2:013f88ab:1b80cac5
         Events : 0.17961786

    Number   Major   Minor   RaidDevice State
       0       8       17        0      active sync   /dev/sdb1
       1       8        1        1      active sync   /dev/sda1
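
At least while things are healthy, I assume I can map the sdX names back to
the physical drives by serial number -- either via the udev by-id links
(assuming udev creates them for these USB enclosures) or by reading the
member superblocks directly:

  ls -l /dev/disk/by-id/ | grep -i usb
  mdadm --examine /dev/sda1
  mdadm --examine /dev/sdb1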

When it fails, however, the device names disappear: mdadm -D just reports
the array as clean, degraded and shows an active disk, a removed disk, and
a faulty spare, with no device names attached to any of them.
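
I suppose /proc/mdstat and the kernel log might still name the culprit even
when mdadm -D doesn't -- assuming the USB errors actually get logged against
an sdX device:

  cat /proc/mdstat
  dmesg | grep -i 'i/o error'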

I even tried doing dd if=/dev/md0 of=/dev/null to see if I could get the
light flickering on one and not the other, but I just get I/O errors. Once
a disk fails, the RAID seems to go into a nasty state where reads still
work through the crypto loop and LVM I have on top of it, but the
filesystems become read-only and reading the block devices directly just
gives errors. Worse, the first indication that something is wrong (even
before the mdadm email) is a message on the console that an ext3 journal
write failed.
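
Maybe reading each member partition directly, rather than the assembled md
device, would make exactly one drive's light flicker -- assuming the bad
drive is still readable enough to try:

  dd if=/dev/sda1 of=/dev/null bs=1M count=1024
  dd if=/dev/sdb1 of=/dev/null bs=1M count=1024

(Reading the members directly bypasses md, so any I/O error would at least
point at a specific drive.)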

What I've been doing (which makes me tremendously uncomfortable since I
know a disk is failing) is to reboot and bring everything back up. This has
been working, but I know it's just a matter of time before the failing disk
becomes a failed disk. I could wait until then, since presumably I'll then
know which is which, but who knows what data corruption is possible between
now and then?
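
Once I do know which drive it is, I assume the right move is to fail and
remove it by hand rather than wait for md to do it for me -- something like
this, if /dev/sdb1 turned out to be the bad member:

  mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1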

So, um, help?

--Greg

