
Re: OT: IDE/Bad Blocks: Call for mental assistance



On Sun, Jul 13, 2003 at 01:51:23PM +0200, martin f krafft wrote:
> Folks,
> 
> Over the past year, I have replaced something around 20 IDE
> Harddrives in 5 different computers running Debian because of drive
> faults. I know about IDE and that it's "consumer quality" and no
> more, but it can't be the case that the failure rate is that high.
> 
> The drives are mostly made by IBM/Hitachi, and they run 24/7, as the
> machines in question are either routers, firewalls, or servers.
> 
> Replacing a drive would be a result of symptoms, such as frequent
> segmentation faults, corrupt files, and zombie processes. In all
> cases, I replaced the drive, transferred the data (mostly without
> problems), got the machine back into a running state, then ran
> `badblocks -svw` on the disk. And usually, I'd see a number of bad
> blocks, usually in excess of 100.
> 
> The other day, I received a replacement drive from Hitachi, plugged
> it into a test machine, ran badblocks and verified that there were
> no badblocks. I then put the machine into a firewall, sync'd the
> data (ext3 filesystems) and was ready to let the computers be and
> head off to the lake... when the new firewall kept reporting bad
> reloc headers in libraries, APT would stop working, there would be
> random single-letter flips in /var/lib/dpkg/available (e.g. swig's
> Version field would be labelled "Verrion"), and the system kept
> reporting segfaults. I consequently plugged the drive into another
> test machine and ran badblocks -- and it found more than 2000 -- on
> a drive that had none the day before.
> 
> Just now, I got another replacement from Hitachi (this time it
> wasn't a "serviceable used part", but a new drive), and out of the
> box, it featured 250 bad blocks.
> 
> My vendor says that bad blocks are normal, and that I should be
> running the IBM drive fitness test on the drives to verify their
> functionality. Moreover, he says that there are tools to remap bad
> blocks.

All hard drives have a certain number of defects when new, due to the
difficulty of making the platters absolutely perfect. The location of
these defects is stored in a table on the drive, and the drive then
doesn't use those areas. This is completely transparent to the host,
so badblocks should never even see these factory defects. (Some
drives even have
a dump of this defect list printed on the label, although not very
often these days.)
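
If you want a second opinion from the drive itself, its SMART
self-test is a better check than badblocks for this sort of thing.
A rough sketch, assuming smartmontools is installed and the drive is
/dev/hda (adjust the device name, and note that not every IDE drive
implements SMART properly):

  # ask the firmware for its overall health verdict
  smartctl -H /dev/hda

  # start the drive's built-in long self-test (same idea as the
  # IBM/Hitachi fitness test, but run in place)
  smartctl -t long /dev/hda

  # an hour or so later, read back the self-test log
  smartctl -l selftest /dev/hda

If the self-test aborts with a read failure, the drive is telling you
itself that it's on the way out, regardless of what badblocks sees.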

For your vendor to offer this as an explanation for what you're
experiencing suggests to me that either he doesn't know very much
about what he's selling, or he's pulling your plonker.

> My understanding was that EIDE does automatic bad sector remapping,
> and if badblocks actually finds a bad block, then the drive is
> declared dead. Is this not the case?

SCSI does this. In addition to the manufacturer's defect list referred
to above, SCSI drives have a separate grown defect list used for
automatic bad sector remapping. It's possible to dump the contents of
this list, e.g. with scsiinfo, and when the number of grown defects
starts increasing you get a chance to replace the drive before the
table fills up. I could be wrong, but I don't think EIDE does this.
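
If you want to watch that grown list yourself, here's a sketch,
assuming a SCSI drive at /dev/sda; the -d option is my assumption for
dumping the defect lists, so check scsiinfo(8) before relying on it:

  # show the manufacturer (P) and grown (G) defect lists
  scsiinfo -d /dev/sda

I believe recent smartmontools will also print an "Elements in grown
defect list" line for SCSI drives in `smartctl -a /dev/sda` output;
either way, a number that keeps climbing means order a replacement.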

> The reason I am posting this is because I need mental support. I'm
> going slightly mad.

I've been mad for years, absolutely f**king years, I've been over the
edge for yonks. :-)

> I seem to be unable to buy non-bad IDE drives,
> be they IBM, Maxtor, or Quantum. Thus I spend excessive time on
> replacing drives and keeping systems up by brute-force. And when
> I look around, there are thousands of consumer machines that run
> day-in-day-out without problems.

Are they all coming from the same source, or if you get them from
different sources, is there a common link in the delivery chain?

> It may well be that Windoze has better error handling when the
> harddrive's reliability degrades (I don't want to say this is a good
> thing). 

This is almost certainly not true, at least as far as FAT32 is
concerned. The bad sector won't be marked as bad in the FAT until you
run Scandisk, so it'll just go on trying to use it and crashing on the
resultant errors.

> It may be that IDE hates me. I don't think it's my IDE
> controller, since there are 5 different machines involved, and the
> chance that all IDE controllers report bad blocks where there aren't
> any, but otherwise function fine with respect to detecting the
> drives (and not reporting the dreaded dma:intr errors), is slim.
>
> So I call to you and would like to know a couple of things:
> 
>   - does anyone else experience this?

It is something I associate with secondhand drives that may not
necessarily have been handled with due care.

>   - does anyone know why this is happening?
>   - why is this happening to me?

I like to festoon my hard drives with fans (run off 5V instead of 12V
to keep the noise down) as drives can object to the temperatures they
heat themselves up to - but I think we can rule out heat in the
case of your brand-new Hitachi drive, which is knackered as soon as you
start it up. I think it is also unlikely to be static damage
accidentally inflicted by you - that might well cause various drive
errors, but not of the increasing-number-of-bad-blocks variety. If the
electronics were sufficiently screwed as to send commands to the
mechanism that would cause it to damage itself, it's unlikely the
drive would do anything remotely sensible at all. And they're pretty
well sealed against environmental contamination. My HDs are subjected
not only to cigarette smoke but to the extremely fine dust with which
pigeons maintain the condition of their feathers, and it doesn't seem
to bother them.

Two possibilities which occur to me are: 

Dirty mains - maybe you could try running a machine in a different part
of town. I think it unlikely that this would selectively affect your
HDs though.

Transit damage - maybe your vendor's warehouse staff tend to sling
boxes around without thought for their contents. Or if you get them
delivered, maybe the carrier's staff are careless.

>   - is it true that bad blocks are normal and can be handled
>     properly?

No. Once a drive starts to get bad blocks, their number tends to
increase exponentially. The only safe thing to do is replace the
drive. SCSI drives can remap bad blocks transparently - as long as
they 'catch' the dodgy block before it can't be read at all. IMO this
feature should be used to enable you to find out that the drive is
going dodgy and hopefully replace it before you lose any data, not to
enable you to blithely forget that bad blocks exist :-)
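
On IDE you can get a similar early warning from the SMART attribute
table, assuming the drive exposes it (a sketch, not gospel):

  # rising raw values here mean the drive is eating itself
  smartctl -A /dev/hda | egrep -i 'realloc|pending|uncorrect'

Reallocated_Sector_Ct counts sectors the firmware has already
remapped; Current_Pending_Sector counts sectors it wants to remap but
can't read yet - exactly the "catch it before it's unreadable" case.
Either one trending upwards is your cue to copy the data off while
you still can.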

>   - can bad blocks arise from static discharge or impurities? 

Static discharge IMO is more likely to cause faults other than bad
blocks, and as I say particulates don't seem to be a problem. If
they've been stored in damp/humid conditions, that could do them a lot
of no good.

>     when
>     i replace disks, I usually put the new one into the case
>     loosely and leave the cover open. The disk is not subjected to
>     any shocks or the like, it's sitting still as a rock, it's just
>     not affixed.

That shouldn't be a problem if it's not actually rattling/buzzing
against something. Mounting a piece of vibrating machinery rigidly can
actually increase the vibration-induced component of the loads on the
bearings.

> I will probably never buy IDE again. But before I bash companies
> like Hitachi for crap quality control, I would like to make sure
> that I am not the one screwing up.

If someone's screwing up, it sounds to me like it's someone between
Hitachi etc. and you.

-- 
Pigeon

Be kind to pigeons
Get my GPG key here: http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x21C61F7F
