Re: Failing Hard Drive, or False Alarms?
Hi Alain,
On Thu, Sep 11, 2025 at 04:19:33PM +0100, alain williams wrote:
> In the general operation of a machine: how aware of the causes of a program
> failing would you be ? Most programs would generate some vague error message,
> the system might contain something more specific.
Are you asking what happens to userland in this scenario? The read
system call fails with EIO ("Input/output error"), which you'd hope the
application is able to handle, though it won't have any specific
information as to why that happened. The system log will contain the
device and the LBA (sector address) that failed to be read.
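As a rough sketch of getting the device and sector back out of the log:
this assumes the kernel lines contain "I/O error, dev <name>, sector
<number>", which is what I'd expect on recent kernels but the exact
wording does vary between versions. The function name is made up for
illustration.

```shell
#!/bin/sh
# Sketch: print "device sector" for each kernel I/O error line on stdin.
# Assumption: lines look like "... I/O error, dev sda, sector 1234 ...".
# Lines that do not match are silently dropped.
extract_io_errors() {
    sed -n 's|.*I/O error, dev \([^,]*\), sector \([0-9][0-9]*\).*|\1 \2|p'
}

# Possible use (needs permission to read the kernel log):
#   dmesg | extract_io_errors | sort -u
```

The sector number the kernel reports is in 512-byte units counted from
the start of the whole device, not the partition.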
> How do we determine *which* files need to be restored. I suspect this
> is more difficult than it seems.
Basically I advocate using some redundancy anywhere that is practical
because if you ever think you'll need to do this it will be a
significant use of your time, which likely costs you more than providing
the redundancy.
Having said that, it isn't hard to do this with ext* filesystems, so it
probably isn't hard with XFS or similar either. btrfs and zfs will tell
you which files are corrupt of course, because they checksum the data.
But otherwise as long as you know where the filesystem starts and ends
you can do the maths and use a debugfs-like tool to answer the question.
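A minimal sketch of that maths, assuming an ext* filesystem sitting
directly on a partition (no LVM, MD or encryption in between, each of
which adds its own offset). The function name, device names and numbers
are only for illustration.

```shell
#!/bin/sh
# Sketch: turn a whole-device LBA into a filesystem block number.
#   $1 = failing LBA as logged by the kernel (512-byte units)
#   $2 = partition start sector, e.g. from /sys/block/sdX/sdX1/start
#   $3 = sector size in bytes (default 512)
#   $4 = filesystem block size in bytes (default 4096; check with
#        "tune2fs -l /dev/sdX1")
lba_to_fs_block() {
    lba=$1
    part_start=$2
    sector_size=${3:-512}
    block_size=${4:-4096}
    echo $(( (lba - part_start) * sector_size / block_size ))
}

# Then ask debugfs which inode owns the block, and which path that is:
#   debugfs -R "icheck BLOCK" /dev/sdX1
#   debugfs -R "ncheck INODE" /dev/sdX1
```

If icheck reports no inode for the block then the sector is in free
space or filesystem metadata rather than in a file.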
I needed to do something like it once on a redundant system to work out
what was causing some strange behaviour:
https://strugglers.net/~andy/mothballed-blog/2021/07/24/resolving-a-sector-offset-to-a-logical-volume/
(It was I/O to an NVMe drive not on a 4k boundary)
If you do have redundancy then depending on how it's implemented it may
write back the correct data automatically, causing the drive to
reallocate a sector and life goes on.
You can force a sector write with:
# hdparm --yes-i-know-what-i-am-doing --write-sector 1234 /dev/sdX
which will write zeroes to sector 1234 of /dev/sdX, destroying whatever
was stored there. If the sector is damaged this will force a
reallocation as long as the drive has any spare sectors, which should
clear the pending-sector count in the SMART attributes and stop the
constant alerts.
SMART "selective" tests can read-test a range of sector addresses rather
than the whole drive like a "long" test does (which can take days).
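For example, something like this would build a selective test command
for a window around a suspect LBA. It only prints the command rather
than running it; the function name and the default margin of 1000
sectors are arbitrary choices for illustration.

```shell
#!/bin/sh
# Sketch: print a smartctl selective self-test command covering
# [lba - margin, lba + margin], clamped so the start is never negative.
#   $1 = device, $2 = suspect LBA, $3 = margin in sectors (default 1000)
selective_test_cmd() {
    dev=$1
    lba=$2
    margin=${3:-1000}
    if [ "$lba" -gt "$margin" ]; then
        start=$(( lba - margin ))
    else
        start=0
    fi
    end=$(( lba + margin ))
    echo "smartctl -t select,${start}-${end} ${dev}"
}

# Once the test has finished, the result can be read with:
#   smartctl -l selective /dev/sdX
```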
Thanks,
Andy
--
https://bitfolk.com/ -- No-nonsense VPS hosting