
Bug#1037223: Possible bug causing I/O hangs



Control: tags -1 + moreinfo

Hi Niels,

On Thu, Jun 08, 2023 at 11:33:13AM +0200, Niels Hendriks wrote:
> Package: linux-image-amd64
> Version: 5.10.178-3

From the screenshot I guess you mean 5.10.179-1, or was the problem
possibly already present in 5.10.178-3?
> 
> 
> Hi all,
> 
> I do not usually report kernel bugs so hopefully this is the right
> place!
> 
> We recently updated the kernel of our Debian 11 servers and since
> then we have encountered a bunch of servers (both VMs and bare
> metal) that suffer I/O hanging issues.
> We can access the server through a console where I cannot copy text,
> but I have attached a screenshot showing the message we see in
> dmesg.
> 
> We initially thought this was related to the ext4 fast_commit
> feature flag we have enabled, and we do feel the issue occurs less
> often with fast_commit disabled, but it does not appear to be solved
> completely when we disable this feature.
> 
> With this error, we've been googling a bit and I ended up on this
> thread: https://www.spinics.net/lists/linux-ext4/msg86261.html
> through initially https://github.com/flatcar/Flatcar/issues/847 It
> mentions this
> fix: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/fs/ext4?h=linux-5.15.y&id=5bc0b2fda4b47c86278f7c6d30c211f425bf51cf
> I believe this fix is currently not present in the 5.10 kernel
> available for Debian 11.

That commit is upstream commit
a44e84a9b7764c72896f7241a0ec9ac7e7ef38dd, which was backported to
various stable series, in particular to 5.10.163 as
1be16a0c2f10186df505e28b0cc92d7f3366e2a8.
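
As a rough sanity check of that reasoning, here is a minimal Python
sketch. It assumes the upstream part of a Debian kernel version is
everything before the first "-" and ignores epochs and other suffixes;
the helper names are made up for illustration.

```python
from typing import Tuple


def upstream_tuple(version: str) -> Tuple[int, ...]:
    """Turn '5.10.178-3' or '5.10.163' into a comparable tuple of ints."""
    # Assumption: the Debian revision is whatever follows the first '-'.
    upstream = version.split("-", 1)[0]
    return tuple(int(part) for part in upstream.split("."))


def includes_stable_release(installed: str, fixed_in: str) -> bool:
    """True if 'installed' is in the same stable series as 'fixed_in'
    and at or beyond its patch level."""
    inst = upstream_tuple(installed)
    fix = upstream_tuple(fixed_in)
    return inst[:2] == fix[:2] and inst >= fix


if __name__ == "__main__":
    # Reported package version vs. the stable release carrying the backport.
    print(includes_stable_release("5.10.178-3", "5.10.163"))
```

This only compares version numbers; it does not prove that a given
Debian build actually carries the patch.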
> 
> However, the linked fix also mentions:
> > This bug has been around for many years, but it became *much* easier
> > to hit after commit 65f8b80053a1 ("ext4: fix race when reusing xattr
> > blocks").
> 
> Looking at the
> changelog: https://metadata.ftp-master.debian.org/changelogs//main/l/linux-signed-amd64/linux-signed-amd64_5.10.178+3_changelog
> We do see the "ext4: fix race when reusing xattr blocks" change
> being added in 5.10.178-1.  This is why we believe we are now
> hitting this bug.
> 
> My question is whether this seems plausible, and if so, whether the
> fix I linked can also be released for Debian 11?

Right now I do not see that as the cause, as the above-mentioned
commit *is* in that version, unless I'm misunderstanding.
> 
> We could also upgrade to the bullseye-backports kernel, but given
> that this issue makes the system essentially unusable and we hit it
> every few days on one of our servers it may be more widespread and
> worth it to fix it in the regular bullseye kernel as well.

Do you have a 5.10.y kernel which was fine, and can you bisect the
changes between that version and 5.10.179 to pinpoint the first bad
commit causing the issue?
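
To illustrate the idea only: in practice this would be done with git
bisect on the stable tree, but the underlying search can be sketched in
Python as below. The version range and the is_bad stand-in are purely
hypothetical and say nothing about where the regression really is. The
sketch also assumes a single good-to-bad transition. Since the hang
only shows up every few days, a "good" verdict needs a long enough
soak time to be trusted.

```python
from typing import Callable, Sequence


def first_bad(versions: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Binary search for the first bad entry.

    Assumes versions are ordered oldest to newest, versions[0] is known
    good, and versions[-1] is known bad.
    """
    lo, hi = 0, len(versions) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(versions[mid]):
            hi = mid
        else:
            lo = mid
    return versions[hi]


if __name__ == "__main__":
    # Hypothetical candidate range; the real known-good version is unknown here.
    candidates = [f"5.10.{n}" for n in range(162, 180)]

    # Arbitrary stand-in for "boot this kernel and wait to see if I/O hangs".
    def is_bad(version: str) -> bool:
        return int(version.rsplit(".", 1)[1]) >= 170

    print(first_bad(candidates, is_bad))
```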

If your infrastructure is prepared to do so, next steps might involve
trying the most recent 5.10.y kernel to see if it still exhibits the
problem, then going up to newer stable series and/or mainline.

In particular, please test the current upstream 5.10.182, as there are
interesting ext4-related changes between 5.10.179 and 5.10.182.

Regards,
Salvatore

