
Re: Signs of hard drive failure?



On 10/20/19 10:21 PM, Ken Heard wrote:

In the past week or so some of my computer procedures have become
sluggish, and some others have stopped working altogether.

For example the following script works:

#!/bin/bash
# Save the starting directory, work from /home/ken so the relative
# paths in the exclude list and the archive target resolve, then return.
CURPWD=$PWD
cd /home/ken
tar -czf /media/fde/backups/kfinancescurrent.tgz --wildcards \
    --exclude-from=docs/tarlists/kfinancesarchive.lst docs/01-kens/Finances
cd $CURPWD

Whereas this one does not work now but did two weeks ago:

#!/bin/bash
# Shell script to create a tgz file for the contents of the
# /home/ken/docs and the /usr/local/ directories,
# minus the files in file /home/ken/docs/tarlists/kexcludedocs.lst
# This script may be run from any directory to which the user has write
# permission.

# Start by creating a variable with the current directory.
CURPWD=$PWD
# Change directory to /
cd /
# Create the tarball.
tar -czpf media/fde/backups/kdocsfull.tgz \
    -X /home/ken/docs/tarlists/kdocsfullexclude.lst \
    -T /home/ken/docs/tarlists/kdocsfullinclude.lst
# Return to the starting directory.
cd $CURPWD

Now when I try to run it, it returns the following:

ken@SOL:~$ tar -czpf media/fde/backups/kdocsfull.tgz -X /home/ken/docs/tarlists/kdocsfullexclude.lst -T /home/ken/docs/tarlists/kdocsfullinclude.lst
tar (child): media/fde/backups/kdocsfull.tgz: Cannot open: No such file or directory
tar: home/ken/docs: Cannot stat: No such file or directory
tar: usr/local: Cannot stat: No such file or directory
tar (child): Error is not recoverable: exiting now
tar: Child returned status 2
tar: Error is not recoverable: exiting now

All the files and directories which this script cannot stat do in fact
exist, as shown by the fact that the first script uses the same
directories -- albeit different files in them -- and still works.

As these symptoms can, I think, indicate hard drive failure or
impending failure, I am trying to explore that possibility.

I am using Stretch and TDE with two 2 TB Seagate Barracuda hard drives
in a RAID 1 configuration.  Both drives were purchased at the same
time and were installed in the box on 2016-05-30.  Although that was
three and a half years ago, this particular box is only used six
months out of twelve.  I would not have thought that drives -- if they
survive the first year -- would show signs of failure after only 1.75
years of actual use.

In any event, I ran smartctl -a on both drives.  For both, the "SMART
overall-health self-assessment test result" was 'PASSED'.
Nevertheless, of the specific attributes -- which are identical for
both drives -- three carried the indication 'Pre-fail' and the other
nineteen 'Old-age'.

I also ran 'badblocks -v' on both drives.  Both have 1953514583
blocks.  The test for /dev/sda was interrupted at block 738381440, and
the one for /dev/sdb at block 42064448.

I am not sure what all these test results mean, or how reliable they
are as indicators of whether or when a failure will occur.  It did
occur to me that, after copying all my data files to an external hard
drive, I could replace the /dev/sdb device with a new one and copy all
the data from /dev/sda onto it, on the assumption that, given the
choice, the OS would read the data it wanted from whichever drive
responded more quickly -- presumably the new and pristine one.

If that approach worked, I could replace the other drive in another
year or two (really one year of use), so that the two drives in the
RAID 1 would not be of the same age.
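
For the record, what I have in mind for the swap is roughly the
following; the md device and partition names are only my assumptions
and would have to match whatever /proc/mdstat actually reports:

# Mark the old disk's member as failed and drop it from the array
mdadm --manage /dev/md0 --fail /dev/sdb1
mdadm --manage /dev/md0 --remove /dev/sdb1
# Power down, fit the new disk, give it the same partition layout as sda
sfdisk -d /dev/sda | sfdisk /dev/sdb
# Add the new partition; md then copies everything across from sda
mdadm --manage /dev/md0 --add /dev/sdb1
cat /proc/mdstat   # watch the resync progress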

Comments and advice as to the best way of getting this computer back
to 'normal' -- to the extent that such a state could ever be 'normal'
-- would be appreciated.

Regards Ken


I would first check whether the RAID is working.  Use "cat /proc/mdstat".  You will see something like this for each RAID device configured:

md0 : active raid1 sdb1[3] sda1[2]
      28754230 blocks super 1.2 [2/2] [UU]

Make sure that both U's are there.  If not, be careful, because the RAID is operating on only one disk.  Before you reboot, copy all the important data off that RAID device.
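
If you want more detail than mdstat gives, something along these lines will do; I am assuming the array is /dev/md0, so substitute whatever name mdstat reports on your system:

# Full status of the array, including failed/removed members and sync state
mdadm --detail /dev/md0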

Next use smartctl to do a long self test.  Use "smartctl -t long /dev/sda".  You can still use the machine but it will slow the test down.  The tests take a long time and smartctl will estimate how long.  Then do the second drive "smartctl -t long /dev/sdb".
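
Roughly, as root, the sequence is as follows; the -l selftest step just reads back the drive's self-test log once the test has finished:

smartctl -t long /dev/sda
smartctl -t long /dev/sdb
# wait for the completion time smartctl estimates, then read the results:
smartctl -l selftest /dev/sda
smartctl -l selftest /dev/sdb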

If these pass then you could try booting with a system rescue CD.  First check what device names it has used by running "ls /dev/md*".  You will see something like /dev/md0 or /dev/md123.  Now check the filesystem on the RAID device with "fsck -f /dev/mdX", replacing X with what you found in the previous command.
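
From the rescue system that looks roughly like this; the md name is whatever the rescue system assembled, and the filesystem must not be mounted while fsck runs:

ls /dev/md*
cat /proc/mdstat        # confirm the array came up with both members
fsck -f /dev/md0        # substitute the name found above; unmounted only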

That should keep you busy for a while.  Let the list know what you find.


--


...Bob
