
Re: SOLVED Re: Disk corruption and performance issue.



On 2/26/24 13:25, Tim Woodall wrote:
TL;DR: there was a firmware bug in a disk in the RAID array resulting in
data corruption. A subsequent kernel workaround dramatically reduced the
disk performance (probably just writes, but I didn't confirm).


Initially, under heavy disk load I got errors like:

Preparing to unpack .../03-libperl5.34_5.34.0-5_arm64.deb ...
Unpacking libperl5.34:arm64 (5.34.0-5) ...
dpkg-deb (subprocess): decompressing archive '/tmp/apt-dpkg-install-zqY3js/03-libperl5.34_5.34.0-5_arm64.deb' (size=4015516) member 'data.tar': lzma error: compressed data is corrupt
dpkg-deb: error: <decompress> subprocess returned error exit status 2
dpkg: error processing archive /tmp/apt-dpkg-install-zqY3js/03-libperl5.34_5.34.0-5_arm64.deb (--unpack):
 cannot copy extracted data for './usr/lib/aarch64-linux-gnu/libperl.so.5.34.0' to '/usr/lib/aarch64-linux-gnu/libperl.so.5.34.0.dpkg-new': unexpected end of file or stream

The checksum will have been verified by apt during the download, but
when dpkg comes to read the downloaded deb to unpack and install it, it
doesn't get the same data back. The corruption can happen on either the
write (the file on disk is corrupted) or the read (the file on disk has
the correct checksum but reads back wrong).
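One way to tell the two apart is to hash the cached archive repeatedly,
dropping the page cache between reads so each hash really comes off the
disk. A minimal sketch (the .deb path is illustrative):

  # Needs root; drop_caches forces the next read to hit the disk.
  for i in 1 2 3; do
      echo 3 > /proc/sys/vm/drop_caches
      sha256sum /var/cache/apt/archives/libperl5.34_5.34.0-5_arm64.deb
  done
  # A stable but wrong hash suggests the file was corrupted on write;
  # hashes that vary from run to run suggest corruption on read.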


A second problem I got was 503 errors from apt-cacher-ng (which ran on
the same machine that produced the above errors).



I initially assumed this was due to faulty memory, or possibly a faulty
CPU. I leaned towards memory because the disk errors were happening in
one VM and no other VMs were affected; because I always start the same
VMs in the same order, I assumed they'd be using the same physical
memory each time.

However, nothing I tried could pin down where the supposed memory
problem was. Everything worked perfectly except when using the disk
under load.
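For completeness, one userspace way to exercise RAM without taking the
host down is memtester; a minimal sketch (not what was used here, and it
won't catch faults that only appear under disk load):

  # Lock and test 2GiB of RAM for 3 passes (needs root to mlock).
  memtester 2G 3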

At this time I spent a significant amount of time migrating everything
important, including the big job that triggered this problem, off this
machine onto its pair. After that the corruption problems went away, but
I continued to get periodic 503 errors from apt-cacher-ng.


I continued to worry at this on and off but failed to make any progress
in finding what was wrong. The version of the motherboard is no longer
available, otherwise I'd probably have bought another one. During this
time I also spent quite a lot of time ensuring that it was much easier
to move VMs between my two machines. I'd underestimated how tricky this
would be if the dodgy machine failed totally, something I only became
aware of when I did migrate the VM that was having problems.


Late last year or early this year someone (possibly Andy Smith?) posted
a question about logical/physical sector sizes on SSDs. That set me off
investigating again, as that's not something I'd thought of. It didn't
prove fruitful either, but I did notice this in the kernel logs:

Feb 17 17:01:49 xen17 vmunix: [    3.802581] ata1.00: disabling queued TRIM support
Feb 17 17:01:49 xen17 vmunix: [    3.805074] ata1.00: disabling queued TRIM support
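As an aside, the logical/physical sector sizes that started this
investigation can be checked from the block layer; a minimal sketch:

  # Logical and physical sector size per block device.
  lsblk -o NAME,MODEL,LOG-SEC,PHY-SEC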


From drivers/ata/libata-core.c:

  { "Samsung SSD 870*",  NULL, ATA_HORKAGE_NO_NCQ_TRIM |
       ATA_HORKAGE_ZERO_AFTER_TRIM |
       ATA_HORKAGE_NO_NCQ_ON_ATI },
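Whether the workaround has bitten on a given device can be confirmed
from the kernel log and from what the block layer now advertises; a
sketch, with the device name assumed:

  # The libata message logged when queued TRIM gets disabled:
  dmesg | grep -i 'queued trim'
  # Discard alignment/granularity/limits as the block layer sees them:
  lsblk --discard /dev/sda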

This kernel workaround fixed the disk corruption errors, at the cost of
dramatically reduced performance. (I'm not sure why, because a manual
fstrim didn't improve things.)


At this point I discovered that the big job that had been regularly
hitting corruption issues now completed successfully. However, it was
taking 19 hours instead of 11.

I ordered some new disks. I'd assumed both disks were affected, but
while writing this I notice that "disabling queued TRIM support" prints
twice for the same disk (ata1.00), not once per disk.

I thought one of the entries below covered my other disk, but looking
again now I see I had a 1000MX500, which doesn't actually match either
pattern.

  { "Crucial_CT*M500*",  NULL, ATA_HORKAGE_NO_NCQ_TRIM |
       ATA_HORKAGE_ZERO_AFTER_TRIM },
  { "Crucial_CT*MX100*",  "MU01", ATA_HORKAGE_NO_NCQ_TRIM |
       ATA_HORKAGE_ZERO_AFTER_TRIM },
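The model string that libata matches these globs against can be read
from sysfs or smartctl; a sketch, /dev/sda assumed:

  # Model string as the ATA layer reports it:
  cat /sys/block/sda/device/model
  # Fuller identify data, including the firmware revision:
  smartctl -i /dev/sda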

While waiting for my disks I started looking at the apt-cacher-ng
503 problem, which had continued to bug me. I got lucky and discovered
a way I could almost always trigger it.

I managed to track that down to a race condition in updating the
Release files when multiple machines request the same file at the same
moment.
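I won't claim this is the trigger that was used here, but given the
shape of the race, two simultaneous fetches of the same Release file
through the cache would be the obvious sketch to try (host, port 3142
and mirror path are all assumptions):

  # Two concurrent requests for the same file via apt-cacher-ng.
  URL=http://localhost:3142/deb.debian.org/debian/dists/bookworm/Release
  curl -sS -o /dev/null "$URL" &
  curl -sS -o /dev/null "$URL" &
  wait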

After finding a fix I found this bug reporting the same problem:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1022043

There is now a patch attached to that bug that I've been running for a
few weeks without a single 503 error.

And Sunday I replaced the two disks with new ones. Today that big job
completed in 10h15m.

Another thing I notice, although I'm not sure I understand what is going
on, is that my iSCSI disks all have
            Thin-provisioning: No

This means that fstrim in the VM doesn't work. Switching them to Yes
makes it work. So I'm not exactly sure where the queued TRIM was coming
from in the first place.
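The guest-side symptom is easy to check: with Thin-provisioning: No the
device advertises no discard capability at all. A sketch, device and
mountpoint assumed:

  # 0 means the device accepts no discards, so fstrim has nothing to send:
  cat /sys/block/sdb/queue/discard_max_bytes
  # With Thin-provisioning: Yes this goes non-zero and fstrim works:
  fstrim -v /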


Are you using systemd?

/etc/systemd/system/timers.target.wants/fstrim.timer

[Unit]
Description=Discard unused blocks once a week
Documentation=man:fstrim
ConditionVirtualization=!container
ConditionPathExists=!/etc/initrd-release

[Timer]
OnCalendar=weekly
AccuracySec=1h
Persistent=true
RandomizedDelaySec=6000

[Install]
WantedBy=timers.target

You should not be running trim in a container/virtual machine.
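If the timer is the source, it's easy to check and switch off inside
the guest:

  # See whether the weekly trim timer is enabled and when it last fired:
  systemctl status fstrim.timer
  # Stop it now and keep it from coming back at boot:
  systemctl disable --now fstrim.timer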

Here is some info: https://wiki.archlinux.org/title/Solid_state_drive


