
Re: Paranoia about DegradedArray



On Wednesday 29 October 2008, Hendrik Boom wrote:
> On Wed, 29 Oct 2008 13:00:25 -0400, Hal Vaughan wrote:
> > On Wednesday 29 October 2008, Hendrik Boom wrote:
> >> I got the message (via email)
> >>
> >> This is an automatically generated mail message from mdadm running
> >> on april
> >>
> >> A DegradedArray event had been detected on md device /dev/md0.
> >>
> >> Faithfully yours, etc.
> >>
> >> P.S. The /proc/mdstat file currently contains the following:
> >>
> >> Personalities : [raid1]
> >> md0 : active raid1 hda3[0]
> >>       242219968 blocks [2/1] [U_]
> >>
> >> unused devices: <none>
> >
> > You don't mention that you've checked the array with mdadm --detail
> > /dev/md0.  Try that and it will give you some good information.
>
> april:/farhome/hendrik# mdadm --detail /dev/md0
> /dev/md0:
>         Version : 00.90.03
>   Creation Time : Sun Feb 19 10:53:01 2006
>      Raid Level : raid1
>      Array Size : 242219968 (231.00 GiB 248.03 GB)
>     Device Size : 242219968 (231.00 GiB 248.03 GB)
>    Raid Devices : 2
>   Total Devices : 1
> Preferred Minor : 0
>     Persistence : Superblock is persistent
>
>     Update Time : Wed Oct 29 13:23:15 2008
>           State : clean, degraded
>  Active Devices : 1
> Working Devices : 1
>  Failed Devices : 0
>   Spare Devices : 0
>
>            UUID : 4dc189ba:e7a12d38:e6262cdf:db1beda2
>          Events : 0.5130704
>
>     Number   Major   Minor   RaidDevice State
>        0       3        3        0      active sync   /dev/hda3
>        1       0        0        1      removed
> april:/farhome/hendrik#
>
>
>
> So from this do I conclude that /dev/hda3 is still working, but that
> it's the other drive (which isn't identified) that has trouble?
>
> I'm a bit surprised that none of the messages identifies the other
> drive, /dev/hdc3.  Is this normal?  Is that information available
> somewhere besides the sysadmin's memory?

Luckily it's been at least a couple of months since I last worked with a 
degraded array, but I *thought* it listed the failed devices as well.  
It looks like the device has not only failed but also been removed -- is 
there a chance you removed it after the failure, before running this 
command?
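
If you want to confirm which partition used to be the other half of the 
mirror without relying on memory, you could try something like this.  
Just a sketch -- I'm assuming the suspect partition is /dev/hdc3 and that 
the drive is still visible to the kernel:

   # read the md superblock on the suspect partition; if it was ever a
   # member of md0 it should report the same UUID as the array
   mdadm --examine /dev/hdc3

   # look for recent md/hdc messages in the kernel log
   dmesg | grep -iE 'md0|hdc'

If --examine shows the 4dc189ba:... UUID, that's your missing member.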


> > I've never used /proc/mdstat because the --detail option gives me
> > more data in one shot.  From what I remember, this is a raid1,
> > right?  It looks like it has 2 devices and one is still working,
> > but I might be wrong. Again --detail will spell out a lot of this
> > explicitly.
> >
> >> Now I gather from what I've googled that somehow I've got to get
> >> the RAID to reestablish the failed drive by copying from the
> >> nonfailed drive. I do believe the hardware is basically OK, and
> >> that what I've got is probably a problem due to a power failure 
> >> (We've had a lot of these recently) or something transient.
> >>
> >> (a) How do I do this?
> >
> > If a drive has actually failed, then mdadm --remove /dev/md0
> > /dev/hdxx. If the drive has not failed, then you need to fail it
> > first with --fail as an option/switch for mdadm.
>
> So presumably the thing to do is
>    mdadm --fail /dev/md0 /dev/hdc3
>    mdadm --remove /dev/md0 /dev/hdc3
> and then
>    mdadm --add /dev/md0 /dev/hdc3

I think there's a "--re-add" option you may have to use, or something like 
that, but I'd try --add first and see if that works.  You might find that 
hdc3 has already failed and, from the output above, it looks like it's 
already been removed.
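
For what it's worth, the full sequence I'd expect to use looks something 
like the commands below.  This is only a sketch based on your --detail 
output -- device names assumed, and the first two steps are only needed 
if hdc3 still shows up as a member of the array:

   # only if hdc3 is still listed in mdadm --detail /dev/md0:
   mdadm --fail /dev/md0 /dev/hdc3
   mdadm --remove /dev/md0 /dev/hdc3

   # put it back in and let the mirror resync
   mdadm --add /dev/md0 /dev/hdc3
   # (or try "mdadm --re-add /dev/md0 /dev/hdc3" if the old superblock
   # is intact)

   # watch the rebuild progress
   cat /proc/mdstat

A resync over a 231 GiB partition will take a while, so don't be alarmed 
if /proc/mdstat sits at a low percentage for some time.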

> Is the --fail really needed in my case?  The --detail option seems to
> have given /dev/hdc3 the status of "removed" (although it failed to
> mention it was /dev/hdc3).

I've had trouble with removing drives if I didn't manually fail them.  
Someone who knows the inner workings of mdadm might be able to provide 
more information on that.
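
A quick way to check whether there's anything left to fail, based on the 
--detail output you already posted (again, just a guess at the right 
device name):

   # if this prints nothing, hdc3 is already gone from the array and you
   # can skip --fail/--remove and go straight to --add
   mdadm --detail /dev/md0 | grep hdc3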

> >> (b) is hda3 the failed drive, or is it the one that's still
> >> working?
> >
> > That's one of the things mdadm --detail /dev/md0 will tell you.  It
> > will list the active drives and the failed drives.
>
> Well.  I'm glad I was paranoid enough to ask.  It seems to be the
> drive that's working.  Glad I didn't try to remove and re-add *that*
> one.

Yes, paranoia is a good thing in system administration.  It's kept me 
from severe problems previously!


Hal

