
Re: trouble HP SmartArray 6400



On Mon, 2008-07-28 at 09:41 -0300, Lucas Mocellin wrote:
> Hi,
> 
> I'm having some trouble with the HP SmartArray 6400 controller.
> 
> I had a failed drive, so I replaced it, and now I'm
> getting this error:
> 
> sp02:~# hpacucli
> => ctrl slot=4 pd all show
> 
> Smart Array 6400 in Slot 4
> 
>    array A
> 
>       physicaldrive 2:0   (port 2:id 0 , Parallel SCSI, 72.8 GB, OK)
>       physicaldrive 2:1   (port 2:id 1 , Parallel SCSI, 72.8 GB, OK)
>       physicaldrive 2:2   (port 2:id 2 , Parallel SCSI, 146.8 GB, OK)
>       physicaldrive 2:3   (port 2:id 3 , Parallel SCSI, 146.8 GB, OK)
>       physicaldrive 2:4   (port 2:id 4 , Parallel SCSI, 146.8 GB, Predictive Failure)
>       physicaldrive 2:5   (port 2:id 5 , Parallel SCSI, 146.8 GB, OK)
> 
> It shows a "Predictive Failure", but I don't know what that means.
> 
> I searched Google but found no answers.
> 
> Can somebody help me?
> 
> Thanks in advance,
> 
> Lucas.

Lucas,

I am CCing you also as this could be very bad for you.

I worked on Dell hardware, but I can tell you what a predictive failure
is.  It is one of two things:  

1.  The SMART hardware on the HD is reporting that drive failure is
imminent.  The drive may last for hours or months, but it is in a state
that says it is about to fail.

2.  I don't remember what chipset Dell RAID controllers use, but I bet
it comes from the same manufacturer as HP's.  Sometimes when the
meta-data gets corrupted (after a failure and a HD is replaced) the
stripe is punctured (google "punctured stripe").  If this is the case,
no matter what you do PD 4 will never rebuild correctly and it will
always report a predictive failure.

You did not say if you replaced pd4 or not.  If you did, there is a
chance that pd4 is just bad.  If you did not, there is a greater chance
that pd4 is bad.  The only thing you can do now is replace pd4 and see
if it rebuilds correctly.  If it does not and still shows a predictive
failure, there is only one recourse:  back up all the data, break the
RAID, rebuild the RAID, and restore the data.  You MIGHT get away with
clearing the stripe, then rebuilding the stripe in the controller, and
in a perfect world all the data will be there.  Slim chance.
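
To check after the swap whether any drive is still flagged, the
status column of the "pd all show" listing can be scanned by script.
This is only a minimal sketch in Python, assuming the output looks
like the listing quoted above with each drive on a single line.  The
names SAMPLE, DRIVE_RE and drives_needing_attention are made up for
illustration, and the sample text stands in for real controller
output.

```python
import re

# Sample text modelled on the hpacucli listing quoted in this thread.
SAMPLE = """\
Smart Array 6400 in Slot 4

   array A

      physicaldrive 2:0   (port 2:id 0 , Parallel SCSI, 72.8 GB, OK)
      physicaldrive 2:1   (port 2:id 1 , Parallel SCSI, 72.8 GB, OK)
      physicaldrive 2:2   (port 2:id 2 , Parallel SCSI, 146.8 GB, OK)
      physicaldrive 2:3   (port 2:id 3 , Parallel SCSI, 146.8 GB, OK)
      physicaldrive 2:4   (port 2:id 4 , Parallel SCSI, 146.8 GB, Predictive Failure)
      physicaldrive 2:5   (port 2:id 5 , Parallel SCSI, 146.8 GB, OK)
"""

# Assumed layout: "physicaldrive <id> (<field>, <field>, ..., <status>)".
DRIVE_RE = re.compile(r"physicaldrive\s+(\S+)\s+\((.*)\)")


def drives_needing_attention(text):
    """Return (drive_id, status) for every drive whose status is not OK."""
    flagged = []
    for match in DRIVE_RE.finditer(text):
        drive_id = match.group(1)
        # The status is the last comma-separated field in the parentheses.
        status = match.group(2).split(",")[-1].strip()
        if status != "OK":
            flagged.append((drive_id, status))
    return flagged


if __name__ == "__main__":
    for drive_id, status in drives_needing_attention(SAMPLE):
        print(f"physicaldrive {drive_id}: {status}")
```

If the same drive slot keeps showing up here after a replacement and a
finished rebuild, that points away from a simple bad disk.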

If your meta-data is corrupted, you are now gambling with your data.
Regardless of whether pd4 is in a predictive failure state or not, make
a complete backup and prepare for complete loss of that RAID.  A
punctured stripe means you have no good parity to rebuild from.  Or, to
put it differently, a bit of data was made into garbage, then copied as
part of the parity onto the stripe.  The corrupted parity faithfully
rebuilt the array, only this time it included that piece of bogus data.
Everything will work just fine until the machine tries to access that
bit, expecting to find the data it stored there, only to find
nonsensical data, then WHAM!  Lock up.  You can also experience
seemingly random HD failures, and sometimes multiple HDs will get kicked
from the array.  Needless to say, this plays havoc with data
preservation.
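
The way bad data gets baked into a rebuild can be shown with plain XOR
parity.  This is a toy sketch in Python, assuming a RAID-5-style
stripe of three data blocks plus one parity block.  The thread does not
say which RAID level is in use, and a real controller is far more
involved, so the function names and block contents here are
illustrative only.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving_blocks, parity):
    """Reconstruct the one missing block from the survivors and the parity."""
    return xor_blocks(list(surviving_blocks) + [parity])


if __name__ == "__main__":
    d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"
    parity = xor_blocks([d0, d1, d2])

    # Healthy case: the drive holding d2 dies and is rebuilt correctly.
    print("clean rebuild matches:", rebuild_missing([d0, d1], parity) == d2)

    # Punctured case: one byte of d1 went bad before the rebuild ran.
    bad_d1 = b"BB?B"
    rebuilt_d2 = rebuild_missing([d0, bad_d1], parity)
    print("punctured rebuild matches:", rebuilt_d2 == d2)

    # The stripe still checks out against its parity, so nothing looks
    # wrong until something reads the damaged bytes.
    print("stripe self-consistent:",
          xor_blocks([d0, bad_d1, rebuilt_d2]) == parity)
```

The last check is the nasty part:  after the rebuild the stripe agrees
with its own parity, so the garbage is now indistinguishable from real
data as far as the parity math is concerned.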

This could be as simple as replacing pd4 and rebuilding (if it is just a
SMART error), or it could be a prelude to complete data loss.  You have
to ask yourself, "Do you feel lucky?  Well, do you?"

The above was learned through two years working for Dell at the
Gold/Platinum level for server support.  Failed HDs comprised about 80%
of the job. 

HTH
-- 
Damon L. Chesser
damon@damtek.com
http://www.linkedin.com/in/dchesser
