
Bug#964494: File system corruption with ext3 + kernel-4.19.0-9-amd64



On 7/7/20 8:13 PM, Ben Hutchings wrote:
> Control: reassign -1 src:linux
> Control: tag -1 moreinfo

> On Tue, 2020-07-07 at 17:30 -0700, Sarah Newman wrote:
>> Package: linux-signed-amd64
>> Version: 4.19.0-9-amd64
>>
>> We've had two separate reports now of debian buster users running
>> 4.19.0-9-amd64 who experienced serious file system corruption.

> Which version?  (I.e. what does "uname -v" or
> "dpkg -s linux-image-4.19.0-9-amd64" say?)

One is version: 4.19.118-2+deb10u1
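
(For reference, that value comes from commands along the lines of the following; the package name is just the one asked about above:)

uname -v
dpkg -s linux-image-4.19.0-9-amd64 | grep '^Version:'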

>> - Both were using ext3
>> - Both are running Xen HVM, but I do not have reason to believe this to be related

> I have no reason to assume that this is unrelated to the hypervisor, so
> please report the version of Xen and whatever provides the back-end
> block driver.

For the failures, there are two different Xen hypervisor versions involved; the most recent is 4.9.4.45.g8d2a6880, with various security patches applied.

For Linux, the base back-end version is 4.9.197. That is missing the xen-blkback patches "xen/blkback: Avoid unmapping unmapped grant pages" and "xen-blkback: prevent premature module unload", but based on their descriptions I don't think either of those is relevant here.

Nothing in the backend has been updated within the last few weeks.

We believe we have positively identified around 90 VMs running Debian Buster under the same back-end versions, though we can't say for certain which kernel version or file system they use. I would guess at least 15 of them are running ext4 + linux-image-4.19.0-8-amd64/4.19.98-1 or later.

Some of our own test systems, under the exact same back-end kernel and hypervisor as the hosts with failures, are running:

4.19.0-5-amd64 #1 SMP Debian 4.19.37-5+deb10u2 (2019-08-08) x86_64 GNU/Linux
4.19.0-6-amd64 #1 SMP Debian 4.19.67-2+deb10u1 (2019-09-20) x86_64 GNU/Linux

They are running on top of ext4, with file system options:

rw,relatime,nobarrier,errors=remount-ro,stripe=XX

They are not heavily loaded, so if load is a factor they would not be expected to exhibit issues.
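
(For anyone checking their own systems, the effective mount options can be read with something like the following, assuming the root filesystem is the one of interest:)

findmnt -no OPTIONS /
# or
grep ' / ' /proc/mounts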

What we don't have first-hand knowledge of is 4.19.0-9-amd64, either with or without ext3.

Normally we would gather more data before making an upstream report, but given the severity I thought it best to do so sooner rather than later.


>> - Both are on distinct physical hosts
>> - Both had upgraded from an older non-4.19 kernel within the last two or three weeks

> From which older versions?

In one case:

"the upgrade was from Debian 9 Stretch and the system was up to date before running the upgrade."

For the other, linux-image-4.9.0-11-amd64.


>> One user had the error:
>>
>> ext4-fs error (device xvda1): ext4_validate_block_bitmap:393: comm cat: bg 812: block 26607617: invalid block bitmap
>> aborting journal on device xvda1-8
>> ext4-fs error (device xvda1): ext4_journal_check_start:61: Detected abnormal journal
>> ext4-fs (xvda1): Remounting filesystem read-only
>> ext4-fs (xvda1): Remounting filesystem read-only
>> ext4-fs error (device xvda1) in ext4_orphan_add:2863: Journal has aborted

> And were there any other error messages, e.g. relating to I/O errors,
> around the same time?  How about in the back-end domain?

For the back-end, I do not see errors around that time, or for several weeks prior, on either physical host.

The user who gave us that report saw no other errors. They say:

After the live recovery fsck completed, I was able to use the partition and it reported clean, but it was clearly still pretty damaged. Grub2, for example, wouldn't install, insisting on an unknown filesystem. I copied all the data to a new ext4 filesystem and was able to boot into that, but later saw there was pretty significant file corruption, including in files that had not been modified in weeks or months. PHP files had random strings inserted in them. Debsums reported probably 10% of packages having invalid checksums in some of their installed files. And a few MySQL database tables had corruption. I was able to restore the database and replace pretty much everything else from backups made about 10 minutes prior to the filesystem corruption, and then re-installed every package. So far things seem to be working fine, since I've more or less replaced every file.
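
(For reference, the kind of check and reinstall they describe can be approximated with something like the following; I don't know the exact commands they used:)

# list files whose recorded md5sums no longer match, then map them back to package names
debsums -c 2>/dev/null | xargs -r dpkg -S | cut -d: -f1 | sort -u
# reinstall a damaged package from the archive
apt-get install --reinstall <package>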

I am not sure about the other user.


>> The other gave us the output of tune2fs -l:
>> [...]

> Looks like a fairly ordinary ext3 filesystem.  It doesn't tell us
> anything about what went wrong.
>
> In general I would advise against continued use of the ext3 format.  It
> should continue to be supported by the ext4 code, but it is inevitably
> going to be less well-tested than the ext4 format.  So far as I can
> remember, it is easy to upgrade in-place.

Thank you. One user has already converted to ext4 and the other plans to.
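
(For anyone else doing the same, the in-place conversion is roughly as follows; run it against an unmounted or read-only device and check the e2fsprogs documentation first. The device name is just the one from the log above:)

tune2fs -O extents,uninit_bg,dir_index /dev/xvda1
e2fsck -fD /dev/xvda1
# then change the fstab entry for the filesystem from ext3 to ext4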

--Sarah

