
Re: fsck error on boot: /dev/sda1: UNEXPECTED INCONSISTENCY and Partition 1 does not start on physical sector boundary



Hello,

On Fri, Mar 19, 2021 at 10:36:37PM +0500, Alexander V. Makartsev wrote:
> Personally, I don't think it is wise to throw away any HDD as soon as it
> gets a few pending bad blocks for whatever reason.

It really depends upon your risk stance.

At home, my fileserver has RAID and backups, so if an HDD sees a few
remapped sectors I'm not going to throw it out. When it starts
seeing a large and steadily increasing number of remapped sectors,
then yes, it gets replaced. But indeed it can be many years between
picking up a few remapped sectors and complete meltdown.
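If you want to automate the "wait until the count keeps climbing" part, here is a minimal sketch. It assumes you log the raw Reallocated_Sector_Ct value yourself (for example from cron) as lines of `DATE DEVICE COUNT`; that log format, the function name and the default threshold of 3 increases are all made up for illustration, not anything smartctl produces.

```shell
# realloc_trend (hypothetical helper): summarise how each drive's raw
# Reallocated_Sector_Ct has moved over time.
# Input on stdin, one reading per line, in a self-made log format:
#   YYYY-MM-DD DEVICE RAW_COUNT
# Output, one line per device:
#   DEVICE FIRST_COUNT LAST_COUNT RISES VERDICT
# RISES counts readings where the value went up compared with the
# previous reading. VERDICT is "replace" once RISES reaches $1
# (default 3, an arbitrary illustrative threshold), "watch" if it has
# risen at all, otherwise "ok".
realloc_trend() {
    local max_rises=${1:-3}
    # Group readings by device, oldest first (ISO dates sort as text).
    LC_ALL=C sort -k2,2 -k1,1 | awk -v max="$max_rises" '
        {
            dev = $2; n = $3 + 0
            if (!(dev in first)) {
                first[dev] = n          # first reading for this drive
                order[++ndev] = dev     # remember output order
            } else if (n > last[dev]) {
                rises[dev]++            # count went up since last reading
            }
            last[dev] = n
        }
        END {
            for (i = 1; i <= ndev; i++) {
                d = order[i]; r = rises[d] + 0
                verdict = (r >= max) ? "replace" : (r > 0 ? "watch" : "ok")
                print d, first[d], last[d], r, verdict
            }
        }'
}
```

For instance, a daily cron job could append `$(date +%F) $d <raw count>` for each drive to a log file of your choosing, and `realloc_trend < that-log` would then show which drives are merely carrying a few old remaps and which are still gaining them.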

https://gist.github.com/grifferz/64808f61079fe610c6f21f03ac7fd1aa

$ sudo ./blkleaderboard.sh 
     sdd 100418 hours (11.45 years) 0.29TiB ST3320620AS
     sdb  95783 hours (10.92 years) 0.29TiB ST3320620AS
     sda  94252 hours (10.75 years) 0.29TiB ST3320620AS
     sdi  66276 hours ( 7.56 years) 0.45TiB ST500DM002-1BD14
     sdk  55418 hours ( 6.32 years) 2.73TiB WDC WD30EZRX-00D
     sdh  44511 hours ( 5.07 years) 0.91TiB Hitachi HUA72201
     sde  24239 hours ( 2.76 years) 0.91TiB SanDisk SDSSDH31
     sdc  17672 hours ( 2.01 years) 0.29TiB ST3320418AS
     sdf   7252 hours ( 0.82 years) 1.82TiB Samsung SSD 860
     sdj   7130 hours ( 0.81 years) 1.75TiB KINGSTON SUV5001
     sdg   1560 hours ( 0.17 years) 1.75TiB KINGSTON SUV5001
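I haven't reproduced the gist's code here, but the years column above is consistent with dividing power-on hours by 8766 (365.25 days * 24 hours) and truncating to two decimal places. A rough sketch of the same idea, assuming the hours come from SMART attribute 9 (Power_On_Hours), which most but not all drives report; the function names are hypothetical and this is not the gist's implementation.

```shell
# poh_hours (hypothetical helper): print the raw Power_On_Hours value
# (attribute ID 9) from `smartctl -A` output on stdin. Adding 0 makes
# awk keep only the leading number of the raw field.
poh_hours() {
    awk '$1 == 9 { print $10 + 0; exit }'
}

# leaderboard (hypothetical helper): read "DEVICE HOURS" lines on stdin,
# sort them by hours, highest first, and show each as years using
# 365.25 * 24 = 8766 hours per year, truncated to two decimals.
leaderboard() {
    sort -k2,2nr | awk '{
        years = int($2 * 100 / 8766) / 100
        printf "%8s %6d hours (%5.2f years)\n", $1, $2, years
    }'
}
```

Something like `for d in /dev/sd?; do echo "${d#/dev/} $(sudo smartctl -A "$d" | poh_hours)"; done | leaderboard` would then produce the first three columns of the table above.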

I've replaced some drives in the last 2 years: once those started
gaining reallocated sectors they didn't survive long, even though I
gave them the chance. Hence the three replacements in that period.
sdc and sdd are hanging on:

$ for d in /dev/sd?; do echo -n "$d: "; sudo smartctl -A $d | grep '^  5'; done
/dev/sda:   5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
/dev/sdb:   5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
/dev/sdc:   5 Reallocated_Sector_Ct   0x0033   097   097   036    Pre-fail  Always       -       151
/dev/sdd:   5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       5
/dev/sde:   5 Reallocated_Sector_Ct   0x0032   100   100   ---    Old_age   Always       -       0
/dev/sdf:   5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
/dev/sdg:   5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
/dev/sdh:   5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
/dev/sdi:   5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
/dev/sdj:   5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
/dev/sdk:   5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
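In that output VALUE, WORST and THRESH are normalised, vendor-scaled figures (higher is better) and the last column is the raw count. SMART only treats an attribute as failing when its normalised VALUE drops to THRESH or below, so sdc, with 151 reallocated sectors, still sits at 097 against a threshold of 036 and would not trip that check; the raw count is the thing to watch. A small sketch that picks the non-zero raw counts out of the loop's output; the function name is hypothetical and it assumes exactly the column layout shown above.

```shell
# realloc_report (hypothetical helper): read lines produced by the loop
# above, e.g.
#   /dev/sdc:   5 Reallocated_Sector_Ct   0x0033   097   097   036 ... 151
# and print every device whose raw count (field 11) is non-zero, with
# the normalised VALUE and THRESH and whether VALUE has reached THRESH.
realloc_report() {
    awk '$2 == 5 && $11 + 0 > 0 {
        dev = $1; sub(/:$/, "", dev)            # strip the trailing colon
        # A non-numeric THRESH ("---", as on sde) means no threshold.
        failing = ($7 ~ /^[0-9]+$/ && $5 + 0 <= $7 + 0) ? "yes" : "no"
        print dev, "raw=" $11, "value=" $5, "thresh=" $7, "at_thresh=" failing
    }'
}
```

Piping the original loop into it (`for d in /dev/sd?; do ...; done | realloc_report`) would list just sdc and sdd from the output above, both with `at_thresh=no`.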

At work, where it's other people's data on the line, drives get
replaced as soon as they show any defect like that, because when it
does escalate it tends to do so very quickly.

My own risk stance doesn't even permit running without redundancy
(unless that's inherently impossible because the machine in question
doesn't support it), because if you encounter Offline_Uncorrectable
in normal daily use without redundancy, data loss has already
occurred.
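For the stricter "any defect means replace" stance, a check along these lines could run from cron or a monitoring system. It treats a non-zero raw count in attribute 5 (Reallocated_Sector_Ct), 197 (Current_Pending_Sector) or 198 (Offline_Uncorrectable) as a trigger; the function name and the choice of attributes are my own illustration, and not every vendor reports these attributes the same way.

```shell
# any_defects (hypothetical helper): read `smartctl -A` output on stdin
# and print the name and raw value of attributes 5
# (Reallocated_Sector_Ct), 197 (Current_Pending_Sector) and 198
# (Offline_Uncorrectable) whenever the raw count is non-zero.
# Exit status: 1 if any such count was found, 0 if none were.
any_defects() {
    awk '($1 == 5 || $1 == 197 || $1 == 198) && $10 + 0 > 0 {
            print $2, $10
            found = 1
         }
         END { exit (found ? 1 : 0) }'
}
```

Usage would be along the lines of `sudo smartctl -A /dev/sdc | any_defects || echo "/dev/sdc needs attention"`.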

The drive couldn't read one or more of its sectors. If it's just
file data you can get it back from backup, but if, like the OP here,
it's filesystem metadata, then your actual filesystem is damaged and
needs an fsck. And if you're unluckier still, the whole filesystem
can be broken. I'd really rather not have to spend time fixing that
sort of thing.

> Even brand new drives are shipped with information about factory remapped
> sectors in special section inside their firmware, to cover up platter
> imperfections.

That's true, and to some extent, at the densities in use today, every
read from a drive is probabilistic and relies on error-correcting
codes. But drives that arrive like that are supposed to be in a
stable state, without such errors increasing, so when new ones do
start to appear it is a cause for serious concern.

> This is why performing regular backups and validating them is better, I mean
> you do it all anyway, than replacing drives as soon as they get a few bad
> sectors.

I would say the two strategies are orthogonal because backups and
self-tests are advisable for everyone. Once a drive gets some
Offline_Uncorrectable the data is gone from it; backups and
self-tests didn't stop that from happening, they just helped you
recover from it (backups) or spot it early by testing even unused
areas of the drive (self-tests).
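Extended self-tests can be started with `smartctl -t long /dev/sdX` (the drive runs them in the background) and their results read back with `smartctl -l selftest /dev/sdX`; smartd can also schedule them via the `-s` directive described in smartd.conf(5). A sketch of a filter over that log, assuming the usual `# N  Description  Status ...` entry layout; the function name is hypothetical.

```shell
# selftest_problems (hypothetical helper): read `smartctl -l selftest`
# output on stdin and print every logged test whose status is anything
# other than "Completed without error" (so aborted, interrupted and
# in-progress tests are shown too, not just read failures).
# Exit status: 1 if any such entry was printed, 0 otherwise.
selftest_problems() {
    awk '/^# *[0-9]+ / && !/Completed without error/ { print; bad = 1 }
         END { exit (bad ? 1 : 0) }'
}
```

Running `sudo smartctl -l selftest /dev/sdc | selftest_problems` after a long test would then print nothing for a drive whose surface read back cleanly.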

Anyway, in the OP's position they have lost data which they need to
restore, and while they could wait and see whether the errors keep
increasing in number, they probably just want to get the drive
replaced ASAP.

Cheers,
Andy

-- 
https://bitfolk.com/ -- No-nonsense VPS hosting

