Re: Suggestions wanted for
Joe Emenaker wrote:
> pinniped wrote:
>> You have to be pretty unlucky to get a bad drive (or controller) -
>> but it happens. However, I've seen problems like these come up many
>> times and usually it's because the power supply simply cannot meet
>> the needs of everything connected to it.
> I was starting to suspect the PS, too, but wasn't sure how to go about
> monitoring it. Fortunately, I've got a Kill-A-Watt, so I can see how
> much power the PS is drawing from the wall. I've also got one of those
> ATX power-supply testers, but I don't know what kind of load it places
> on the PS, so I don't know if it will tell me much about whether the
> PS is just not able to pump out enough juice for all of the drives.
> Still, I'm mindful of David Agans' debugging advice: "Make it Fail".
> I'd like to have a smoking gun. If it *is* the PS, then I'd like to
> actually see the voltages drop when the system is under load. Any
> suggestions on how to go about that? Got any suggestions for one of
> those front-panel LCD dealies, or should I just go with software, with
> something like lm-sensors?
Finding anything conclusive about the PSU is tough without spending some
money on hardware to monitor the rails, mostly because your PSU might be
rated at a certain level, but due to heat and spikes during its life it
may get nowhere near that. Even brand-new PSUs rarely live up to their
rated values, but you have some options here.
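If you want to try the software route first, here is a rough sketch that polls the voltage inputs the kernel exposes under /sys/class/hwmon (the same data lm-sensors reads) and remembers the lowest value seen on each one. The function names are mine, and it assumes your board's sensor driver is loaded. The raw numbers are in millivolts and are often unscaled for the 12V and 5V rails, so treat them as a relative "did it sag under load" indicator rather than an accurate measurement.

```python
import glob
import os
import time


def read_voltages(hwmon_root="/sys/class/hwmon"):
    """Return {"hwmonN/inM": volts} for every readable voltage input."""
    readings = {}
    pattern = os.path.join(hwmon_root, "hwmon*", "in*_input")
    for path in sorted(glob.glob(pattern)):
        chip = os.path.basename(os.path.dirname(path))
        channel = os.path.basename(path)[: -len("_input")]
        try:
            with open(path) as f:
                millivolts = int(f.read().strip())
        except (OSError, ValueError):
            continue  # sensor not readable right now; skip it
        readings[f"{chip}/{channel}"] = millivolts / 1000.0
    return readings


def track_minimums(samples, interval=1.0, hwmon_root="/sys/class/hwmon"):
    """Poll the sensors `samples` times and keep the lowest reading of each."""
    lowest = {}
    for i in range(samples):
        for name, volts in read_voltages(hwmon_root).items():
            lowest[name] = min(volts, lowest.get(name, volts))
        if i + 1 < samples:
            time.sleep(interval)
    return lowest
```

Run something like track_minimums(samples=60) once with the box idle and once while hammering the drives and CPUs, then compare the two sets of minimums.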
Firstly, adding another drive to the PSU _should_ make the
failure more pronounced if it is the PSU. Running all 4 drives
simultaneously OUT of the array should also produce the errors on every
drive. This is where I would start; if you can max out your
CPUs to pull on the PSU as much as possible, that's even better.
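One way to hit all the drives at once outside of md is sketched below. It reads every device in parallel, checksums what it reads, and flags any device whose checksum changes between passes or that throws a read error. The function names are made up for illustration, and you would pass in your own device paths. Keep in mind the page cache: unless you read more data than you have RAM (or drop caches between passes), the second pass may never touch the disks.

```python
import hashlib
import threading


def checksum_device(path, results, max_bytes=None, chunk_size=1024 * 1024):
    """Read `path` sequentially and store its SHA-256 (or an error string)."""
    digest = hashlib.sha256()
    remaining = max_bytes
    try:
        with open(path, "rb") as dev:
            while remaining is None or remaining > 0:
                size = chunk_size if remaining is None else min(chunk_size, remaining)
                chunk = dev.read(size)
                if not chunk:
                    break
                digest.update(chunk)
                if remaining is not None:
                    remaining -= len(chunk)
        results[path] = digest.hexdigest()
    except OSError as exc:
        results[path] = f"error: {exc}"


def read_all_at_once(paths, max_bytes=None):
    """Checksum every path at the same time, one thread per device."""
    results = {}
    threads = [
        threading.Thread(target=checksum_device, args=(p, results, max_bytes))
        for p in paths
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


def suspect_devices(paths, passes=2, max_bytes=None):
    """Return paths that errored or gave different checksums across passes."""
    runs = [read_all_at_once(paths, max_bytes) for _ in range(passes)]
    suspects = []
    for p in paths:
        seen = {run[p] for run in runs}
        if len(seen) > 1 or any(s.startswith("error:") for s in seen):
            suspects.append(p)
    return suspects
```

Reading raw block devices needs root. This only reads, so it is safe to run against drives that still hold array data.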
Now, assuming you get here and the errors are showing on each drive,
remove as much hardware as possible and test again with the machine at its
bare minimum: one CPU (if you have multiple), one HDD, one stick of
memory, a video card if you don't have on-board, and NO SATA cage (I've seen
them do interesting things in the past too). Run the same tests and see
if you get the errors. If so, the controller is a likely cause. It could be
the CPU, but they fail so rarely that it would be unlikely.
>> As for the disks, I'd suggest testing them individually rather than
>> trying to test them in a RAID or even while connected to the RAID
> Just to clarify, the controller isn't doing the RAID... it's the Linux
> md driver(s). So, I want to test them *in* and *out* of the RAID just
> to make sure that it's not some kooky problem with the RAID layer.
On this note, most HDD manufacturers have testing and certification
tools on their websites, and usually they're pretty close to what
they use for initial diagnosis when drives are returned. If you can get
the drives to another machine, do a full test with said tool and rule out
your drives. This is fast, relatively easy, and can give you good peace
of mind that it's not the drives.
Also, you have your own tests available. You already know exactly what to do
to get the corruption, so it'll likely be more reliable to use the same
method you used in the past to confirm the problem and the solution than
anything synthetic. Having said that, some testing with other apps may
reveal a much faster way to get to the problem.
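If you do want a synthetic test to compare against your known method, a minimal write-then-verify loop could look like the sketch below. It writes reproducible pseudo-random data to a scratch file, reads it back, and counts the rounds where the checksum doesn't match. The names and sizes are hypothetical. As written, the read-back can be served from the page cache, so use a file larger than RAM (or drop caches between the write and the read) if you want the data to really come back off the disk.

```python
import hashlib
import os
import random

BLOCK = 1024 * 1024  # write and read in 1 MiB blocks


def write_pattern(path, size_mb, seed=0):
    """Write size_mb MiB of seeded pseudo-random data; return its SHA-256."""
    rng = random.Random(seed)
    digest = hashlib.sha256()
    with open(path, "wb") as f:
        for _ in range(size_mb):
            block = rng.getrandbits(8 * BLOCK).to_bytes(BLOCK, "little")
            f.write(block)
            digest.update(block)
        f.flush()
        os.fsync(f.fileno())  # push the data out to the device
    return digest.hexdigest()


def file_checksum(path):
    """Return the SHA-256 of the file's current contents."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(BLOCK), b""):
            digest.update(block)
    return digest.hexdigest()


def count_mismatches(path, size_mb, rounds):
    """Write and verify `rounds` times; return how many rounds mismatched."""
    bad = 0
    for i in range(rounds):
        expected = write_pattern(path, size_mb, seed=i)
        if file_checksum(path) != expected:
            bad += 1
    return bad
```

Point it at a scratch file on the filesystem under test, never at a raw device with data you care about, since it overwrites whatever is at that path.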
Finally, depending on the driver, you might have found a hardware bug that
the driver doesn't know how to work around yet. That's only really likely
if it's a brand-new controller with a brand-new chip revision, though.