
Resolution & recovery: Re: Slow disk - hdparm, S.M.A.R.T, badblocks, what else?



on Fri, Jul 09, 2004 at 02:17:04AM -0700, Karsten M. Self (kmself@ix.netcom.com) wrote:
> on Thu, Jul 08, 2004 at 05:59:57PM -0400, Silvan (dmmcintyr@users.sourceforge.net) wrote:
> > On Thursday 08 July 2004 06:59 am, Karsten M. Self wrote:
> > 
> > >   - Bad drive?
> > 

> That's pretty much the conclusion I'm coming to.  No more results to
> publish right now, but some more poking around with SMART tells me the
> drive's risking imminent failure.  I'm backing data off of it now (very
> slowly), hope to replace it tomorrow.

...and now the (mostly) conclusion.

I still haven't run final conclusive tests on the old drive, but with a
new Maxtor 80 GB 7200 RPM 8 MB cache mumble disk in the box, I'm getting
hdparm disk read results in the 48 - 51 MiB/sec range.  *Vastly* better.
I'm also no longer hearing the clicks which were coming from the box
previously and which I wasn't sure were the (pretty much completely
unused) Zip drive, or the hard drive.

A few notes about S.M.A.R.T. (Self-Monitoring, Analysis, and Reporting
Technology), which I think a lot of folks know _about_ but few actually
know how to use.

  - It's on pretty much every IDE hard drive sold in the past few years,
    and all of 'em sold currently.  SCSI-3 and later drives, likewise.

  - Here's 98.3% of what you need to know about SMART:

    - There are two drive self tests (DSTs) which can be run.  Short
      (DST) and long (eNhanced DST or NDST).  Nominally 2 and 27 minutes
      respectively.  Install the smartmontools package, and run them with:

        # smartctl -t short <device>
        # smartctl -t long <device>

      ...and access the results with:

        # smartctl -a <device>

      ...where "<device>" is the drive's device node, e.g. /dev/hd[a-h].
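
      If you want just the self-test results rather than the full
      report, smartctl can also print the self-test log on its own:

        # smartctl -l selftest <device>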

    - The short test can rule *in* a bad drive, but is *not* definitive
      in determining a drive is *good*.  That is, it's got a relatively
      high false negative rate in catching bad drives.  This is good for
      the manufacturers (fewer spurious returns) but means you have to
      run additional diagnostics if the short test comes up clean but
      you've got concerns about the disk.  The short test reads at least
      the first 1.5 GB of the disk.  Overall accuracy of this test is
      60-70%.

    - The long test is far better at correctly identifying _bad_ drives,
      with few false positives.  Overall accuracy of this test is 95%.

   - The other 1.7% is:

     - smartmontools installs a daemon, smartd, which can run these
       tests on a regular schedule:  e.g., DST daily and NDST weekly.
       There's some impact on disk performance, though the tests *can*
       be run on a live system.
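
       As a sketch (directive syntax per the smartd.conf man page; the
       device name and times here are illustrative), a line like the
       following in /etc/smartd.conf schedules a short test daily at
       02:00 and a long test Saturdays at 03:00:

         /dev/hda -a -s (S/../.././02|L/../../6/03)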

     - There are additional attributes monitored, and you should go
       through the 'smartctl -a' output to see what your drive has to
       say about its current status and any logged errors.
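
       The attribute table and the drive's error log can also be pulled
       individually:

         # smartctl -A <device>
         # smartctl -l error <device>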

     - Drive lifetime is *highly* temperature sensitive.  A 5°C
       temperature rise from 25°C to 30°C reduces MTBF by 25%.  Blow
       those cases!  Failure rates also _fall_ dramatically as a drive
       survives its infancy:  your 2 year old drive is statistically far
       more reliable than the one fresh out of the box.

     - As before:  the smartmontools Sourceforge homepage has some
       really good information, which is where I've got most of what I'm
       discussing here.  Read the PDFs, particularly the Seagate
       references.

         http://smartmontools.sourceforge.net/

The main problem I encountered was that, given my drive's failure mode,
the long test wouldn't complete in anything short of geological time.
smartmontools diagnostics indicated a "pass" on the short test, but with
some marginal attributes.


However:

  - The disk was replaced.

  - No data were lost.

  - We had relatively minimal downtime (one day) while the problem was
    identified, repaired, and the system rebuilt.

I'd still like to drop the disk into another system and run Maxtor's
test utility (which I strongly suspect is an NDST).



I thought I'd also detail the system backup and restore process.

Short version:

  - Back up critical data.
  - Swap drives.
  - Booting Knoppix, partition disk.
  - Booting Knoppix, run debootstrap install.
  - Point sources to local apt-cache proxy.
  - Read in package list from old system:  dpkg --set-selections < file.
  - Reinstall packages with 'apt-get dselect-upgrade'.
  - Restore backed-up data.
  - Modify /etc/fstab appropriately.
  - Modify /boot and /boot/grub/menu.lst appropriately.
  - Reboot to recovered system.
  - Test drive performance.


Long description:

  - When it became clear that there _was_ a problem with the drive, even
    though full diagnostics were not available, I switched from "find
    out what's wrong" to "salvage all data" mode.  Fortunately, although
    reads were slow, they were reliable.  I got ~2 GiB transferred over
    the LAN in about five hours (running unattended overnight).

  - I backed up the following trees:

    - /home
    - /etc
    - /root
    - /boot
    - /usr/local
    - /var/www
    - /var/log
    - /var/backups

    I later found that there was some data in /var/spool (my uptimed
    records) which I didn't have.  Should probably add that.

    I'd also archived /var/lib but decided against restoring it.  Most
    of the state-related data there are created on install anyway.

  - Backups were run from the root directory (to keep the full relative
    path) with:

      # tar czvf - <tree> | ssh user@remote 'cat > <path>/<tree>.tar.gz'

    This prompts for a password as the connection is made, then runs to
    completion.
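
    The matching restore, run from the root of the new system (same
    placeholder paths), is roughly:

      # ssh user@remote 'cat <path>/<tree>.tar.gz' | tar xzvf -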

  - I tested the integrity of the archives both on the remote host and
    after copying them back to the damaged system with:

       for f in *.tar.gz
       do
           echo -e "Testing $f ... \c"
           tar tzvf "$f" >/dev/null && echo OK || echo Wups
       done

  - I saved the package selection status with:

       # dpkg --get-selections | ssh user@remote 'gzip > <path>/packages.gz'

  - At this point, the damaged system was powered down and the drive
    replaced.

  - The damaged system with new drive was booted with Knoppix.  
  
  - The drive was partitioned (I prefer fdisk).

  - Reboot after running fdisk (recommended by some utilities, not sure
    this is required).  Still running Knoppix.
  
  - Filesystems / swap partitions created.

  - Create /mnt/target.  Mount the new intended root partition here.
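
    Taken together, the partitioning, filesystem, and mount steps above
    look roughly like this, assuming a single root partition on
    /dev/hda1 with swap on /dev/hda2 (adjust to your own layout):

      # fdisk /dev/hda
      # mkfs.ext3 /dev/hda1
      # mkswap /dev/hda2 && swapon /dev/hda2
      # mkdir -p /mnt/target
      # mount /dev/hda1 /mnt/target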

  - Run debootstrap.  I'd intended to run installations from my
    apt-cache proxy but had to use http://ftp.us.debian.org/debian
    instead.  If anyone could straighten me out here, I'd prefer local
    fetches:

      # debootstrap woody . http://ftp.us.debian.org/debian

    ...generally, that's:

      # debootstrap --arch <arch> <dist> <mountpoint> <archive>

    This takes a while.  Chat on #debian at irc.debian.org....
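
    The package steps below run inside the new tree.  One way in (a
    sketch; adjust paths to your setup):

      # mount -t proc proc /mnt/target/proc
      # chroot /mnt/target /bin/bash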

  - With the base system installed, copy in files from /etc/apt/ on the
    old system (effectively pointing my apt archive now to my apt-proxy
    cache), read in the package list, and install about 580 packages.  I
    may have had to install apt first, not positive:

      # zcat packages.gz | dpkg --set-selections
      # apt-get -dy dselect-upgrade

    This was some 510 MiB of data, and took a few hours.  Got to look at
    my LAN performance.  50 kB/s on a 100 Mbps hubbed LAN is far too
    slow.

    Note that I ran a download-only install.  After this, commit with:

      # apt-get dselect-upgrade 
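
    For the /etc/apt step above, one way (a sketch; the archive path is
    a placeholder) is to pull just that subtree out of the etc archive
    from within the target:

      # tar xzvf <path>/etc.tar.gz -C / etc/apt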

  - While the above processes were running, I was also copying the
    archived filesystems back over, testing integrity of the archived
    data, and restoring /home and /usr/local.  The /root, /boot, /etc,
    and /var trees are recovered later.

  - I also ran some preliminary diagnostics on the new hard drive and
    found performance was in the expected range -- 48 - 51 MiB/sec for
    buffered disk reads.  Knoppix's SMART utilities (using the older
    UCSC smartsuite package) indicated normal behavior, and both short
    and long disk tests were run and passed.
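
    For reference, those buffered-read figures come from hdparm's
    timing tests (-T for cached reads, -t for buffered disk reads):

      # hdparm -tT /dev/hda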

  - With the system's prior packages (finally) installed, I restored the
    /etc, /boot, /root, and /var data.  For /etc, /boot, and /root, I
    moved the existing data to /etc.bak, /boot.bak, /root.bak, etc.  If
    there are any discrepancies, I can go through and clean this up
    later.
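
    For each tree, that was roughly (shown here for /etc; the archive
    path is a placeholder):

      # cd /
      # mv etc etc.bak
      # tar xzvf <path>/etc.tar.gz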

  - The 'Knoppix' hostname worked its way into a few config files.
    Find them with:

      # find /etc -type f -print0 | xargs -0 grep -il knoppix

    ...and clean those up.

  - Edit /etc/fstab to reflect current partitioning.
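
    E.g., for the hda1/hda2 layout assumed earlier:

      /dev/hda1  /     ext3  defaults,errors=remount-ro  0  1
      /dev/hda2  none  swap  sw                          0  0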

  - Run 'grub-install /dev/hda'.

  - Modify kernels and re-run 'update-grub' to get a working GRUB
    configuration.

  - Reboot into new system and test services.  Samba, Apache, SSH,
    XDMCP, etc., working well.

Finished.

Oh hell, dawn on the horizon.   Mmmuuusssttt   sssllleeeeeeppp....


Peace.

-- 
Karsten M. Self <kmself@ix.netcom.com>        http://kmself.home.netcom.com/
 What Part of "Gestalt" don't you understand?
  Backgrounder on the Caldera/SCO vs. IBM and Linux dispute.
      http://sco.iwethey.org/
