
Bug#908216: btrfs blocked for more than 120 seconds



I used btrfs-progs (4.17-1~bpo9+1) from the backports repository, which does have zstd support. btrfs-check does need zstd support. Otherwise, it will error out with unsupported feature (10).

 

--
Russell Mosemann

-----Original Message-----
From: "Nicholas D Steeves" <nsteeves@gmail.com>
Sent: Saturday, March 2, 2019 3:59pm
To: "Russell Mosemann" <rmosemann@futurefoam.com>, 908216@bugs.debian.org
Subject: Re: Bug#908216: btrfs blocked for more than 120 seconds

Hi Russell,

The backport of btrfs-progs is old and doesn't have libzstd support,
and I'm still waiting for a sponsor (#922951), but if you'd like to
give it a try before it hits the archive:

dget https://mentors.debian.net/debian/pool/main/b/btrfs-progs/btrfs-progs_4.20.1-2~bpo9+1.dsc

cd btrfs-progs-4.20.1/
dpkg-buildpackage -us -uc # will require the installation of build-deps
cd ../
sudo dpkg -i btrfs-progs_4.20.1-2~bpo9+1_amd64.deb \
btrfs-tools_4.20.1-2~bpo9+1_amd64.deb \
libbtrfs0_4.20.1-2~bpo9+1_amd64.deb \
libbtrfsutil1_4.20.1-2~bpo9+1_amd64.deb \
python3-btrfsutil_4.20.1-2~bpo9+1_amd64.deb

I will confess that I don't know if btrfs-check actually needs libzstd
support to check fs structures, or if --check-data-csum requires it,
but if this bug is forwarded then upstream will ask if btrfs-check
found anything using a recent version of btrfs-progs.

On Tue, Feb 26, 2019 at 11:29:25AM -0600, Russell Mosemann wrote:
> On Monday, February 25, 2019 10:17pm, "Nicholas D Steeves"
> <nsteeves@gmail.com> said:
>
> > On Mon, Feb 25, 2019 at 12:33:51PM -0600, Russell Mosemann wrote:
> >
>
> In every case, the btrfs partition is used exclusively as an archive for
> backups. In no circumstance is a vm or something like a database run on
> the partition. Consequently, it is not possible for CoW on CoW to happen.
> The partition is simply storing files.
>
[snip]
> >
> > It might be that >4.17 fixed some corner-case corruption issue, for
> > example by adding an additional check during each step of a backref
> > walk, and that this makes the timeout more frequent and severe. eg:
> > 4.17 works because it is less strict.
> >
> > By the way, is it your VM host that locks up, or your VM guests? Do[es]
> > they[it] recover if you leave it alone for many hours? I didn't see
> > any oopses or panics in your kernel logs.
>
>
>
> It is the host that locks up. This does not involve vm's in any way. If
> vm's are present, they are running on different drives. Some of the vm's
> even use btrfs partitions themselves. None of the vm's experience issues
> with their btrfs volumes. None of the vm's are affected by the hung btrfs
> tasks on the host. That is because the issue exclusively involves the
> separate, dedicated, archive partition used by the host. For all practical
> purposes, vm's aren't part of this picture.
>

Right. This simplifies things. So:

1) VM images and backup target are on different volumes on the same
host.
2) VM images are copied between volumes using "cp --reflink"
* IIRC the VM images are stored on a btrfs volume, on the host?
3) Because this is an inter-volume copy, "cp --reflink=always vhost002.img
/megaraid/backup" should emit a warning or fail.
* please confirm that it warns or fails (a quick check is sketched below).
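A quick way to check (paths are illustrative, based on the ones
mentioned in this thread):

cp --reflink=always /path/to/vhost002.img /megaraid/backup/ ; echo "exit status: $?"

A non-zero exit status would confirm that an explicit cross-volume
reflink is refused.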

I think you said that the following two things will also produce a hang?

cp --reflink /megaraid/backup/vhost002.img /megaraid/vhost002.img.0 ?

or

rm /megaraid/backup/vhost002.img.44 # oldest backup ?

Yes, Btrfs struggles with large sparse files...and these limitations
are amplified by the 45 reflinked copies, amplified when using
deduplication (I'm assuming you're not using this, btw), and amplified
by transparent compression. If this class of bugs is at fault, then
tests will fail at (5) (see below).

> As far as I am aware, a hung task does not recover, even after many hours.
> A number of times, it has hung at night. When I check in the morning hours
> later, it is still hung. In many cases, the server must be forcibly
> rebooted, because the hung task hangs the reboot process.
>

Back in the linux-3.16 to 4.4 days btrfs hung tasks were often
triggered by desktop databases such as those used by Thunderbird or
Firefox, and were quite common. My laptop always recovered after
about 15-20 minutes...

[snipped info on counting fragments and references]
> This is useful information, but it doesn't seem directly related to the
> hung tasks. The btrfs tasks hang when a file is being copied into the
> btrfs partition. No references or vm's are involved in that process. It is
> a simple file copy.
>

If the source volume is btrfs then it will have to do the complex
backref work when reading a VM.img. Because it's an inter-volume cp,
and cp defaults to "--sparse=auto", /megaraid/VM.img should be written
sparsely, in one go, and reading from the resulting copy should not
have the overhead of reading from the master copy VM.img (the backup
copy should not replicate the fragmentation of the master copy when
copied in this way). This can be confirmed by comparing the filefrag
output for each VM.img to that of its backup copy.
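For example (paths are illustrative; filefrag reports the number of
extents per file):

filefrag /path/to/vhost002.img           # master copy on the VM image volume
filefrag /megaraid/backup/vhost002.img   # freshly written backup copy

A much lower extent count for the backup copy would confirm that it
does not replicate the master copy's fragmentation.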

> > > using compression: Yes, compress-force=zstd
> > >
> >
> > If userspace CPU usage is already high then compression may introduce
> > additional latency and contribute to the >120sec warning.
>
> There is more than enough CPU. Depending on the server, there are 12 to 16
> threads running at 2.67GHz to 3.5GHz. The servers have 64GB or more of
> memory. The servers are not loaded during the day, and the backups take
> place at night, when not much else is happening. It is difficult to
> imagine a scenario where it would take longer than 120 seconds to compress
> a block.
>

Unlike ZFS, blocks aren't compressed; large extents are:
https://btrfs.wiki.kernel.org/index.php/Compression
https://btrfs.wiki.kernel.org/index.php/Resolving_Extent_Backrefs
https://btrfs.wiki.kernel.org/index.php/Btrfs_design#Extent_Block_Groups

> When I look at the kernel errors, they involve btrfs cleanup transactions,
> dirty blocks, caching and extents. I don't recall ever seeing a reference
> to compression in a call trace.
>

Have you installed linux-image-4.19.0-0.bpo.2-amd64 and
linux-image-4.19.0-0.bpo.2-amd64-dbg (or their unsigned variants) yet?

> > > number of snapshots: Zero
> > >
> >
> > But 45 reflinked copies per VM.
> >
> > > number of subvolumes: top level subvolume only
> > >
> >
> > I believe it was Chris Murphy who wrote (on linux-btrfs) about how
> > segmenting different functions/datasets into different
> > non-hierarchically structured (eg: flat layout) subvolumes reduces
> > lock contention during backref walks. This is a performance tuning
> > tip that needs to be investigated and integrated into the wiki
> > article. Eg:
> >
> >           _____id:5 top-level_____        <- either unmounted, or
> >          /      |     |     |     \          mounted somewhere like
> >         /       |     |     |      \         /.btrfs-admin, /.volume, etc.
> >        /        |     |     |       \
> > host_rootfs    VM0   VM1   VM2   data_shared_between_VMs
>
>
>
> This is an interesting idea, but it implies that btrfs does not handle
> large files or large file systems very well. The trick is to make it look
> like multiple, small file systems.
>

Correct: the upstream development focus is mostly on stabilisation
(and to a certain extent on adding new features), not on performance.
At this time there seems to be consensus on the linux-btrfs mailing
list that ext4 or xfs should be used in preference to btrfs for
volumes holding VM images, and that nodatacow isn't that great.
"Large files" are not the issue; VM images and databases are.

That said, storing (non-live) backup copies of VM images is a good use
case for btrfs.
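For what it's worth, a flat layout along those lines would be set up
roughly as follows (device, mountpoint, and subvolume names are
illustrative):

mkdir -p /.volume
mount -o subvolid=5 /dev/sdb4 /.volume       # the top-level subvolume
btrfs subvolume create /.volume/backups      # one subvolume per dataset/function
btrfs subvolume create /.volume/vm-images
mount -o subvol=backups,noatime,compress-force=zstd /dev/sdb4 \
/usr/local/data/datastore2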

[snip]

> I have carefully checked the logs for months for an explanation of why
> btrfs is hanging, and I have never seen any other error message. If this
> were one host, then a SATA reset might be in the set of possibilities, but
> since this involves multiple hosts on different architectures with and
> without RAID, a SATA reset as the only explanation for all of the hangs is
> improbable.

Agreed!

> We briefly experimented with SMR drives, and the performance was abysmal.
>

Thanks for confirming.

> > > vhost003
> > >
> > > # grep btrfs /etc/mtab
> > > /dev/sdb4 /usr/local/data/datastore2 btrfs
> > > rw,relatime,compress-force=zstd,space_cache,subvolid=5,subvol=/ 0 0
> > >
> >
> > [1] Also, why aren't you using noatime here too?
>
>
>
> noatime is a more recent change, as an experiment to determine if it would
> affect hangs. It has not been implemented on all hosts, yet. The presence
> or absence of noatime does not appear to affect hangs, which makes sense,
> because hangs happen during writes, not reads.
>

FYI, btrfs always runs better with noatime, because without it each
read will update the atime (once a day with relatime), and each atime
update will trigger a CoW operation for each file that is read.
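A minimal example, reusing the device and mountpoint from your
vhost003 output (adjust per host):

mount -o remount,noatime /usr/local/data/datastore2
# and/or make it permanent in /etc/fstab:
/dev/sdb4  /usr/local/data/datastore2  btrfs  noatime,compress-force=zstd,space_cache,subvolid=5,subvol=/  0 0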

[snip]
> > > lxc008
> > >
> > > number of subvolumes: 1416
> >
> > That's *way* too many. This is a major contributing factor to the
> > timeouts...
>
>
>
> lxc008 does not experience btrfs transaction hangs with 4.17. It does
> experience hangs with 4.18 and 4.19. Those hangs happen shortly after a
> copy starts. From that perspective, a hang is easily reproducible.
>
> > [snip]
> >
> > > # grep btrfs /etc/mtab
> > > /dev/sdc1 /usr/local/data2 btrfs
> > > rw,noatime,compress-force=zstd,space_cache,subvolid=5,subvol=/ 0 0
> > >
> >
> > Is this in a container rather than a VM?
>
>
>
> lxc008 is a physical host that runs containers, rather than vm's. The
> btrfs partition is a separate partition on the RAID array. The btrfs
> partition is only used to store backup files.
>
> > > (RAID controller)
> > >
> > > # smartctl -d megaraid,0 -l scterc /dev/sdc
> > > smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.19.0-0.bpo.2-amd64] (local build)
> > > Copyright (C) 2002-16, Bruce Allen, Christian Franke,
> > > www.smartmontools.org
> > >
> > > Write SCT (Get) Error Recovery Control Command failed: ATA return
> > > descriptor not supported by controller firmware
> > > SCT (Get) Error Recovery Control command failed
> > >
> >
> > A different raid controller? Aiie, this is a complex setup...
>
>
>
> They are all simple setups. Either the host has a dedicated hard drive for
> the btrfs partition, or the host has a RAID array where the btrfs
> partition is located.
>

Oh... It would have been nice to know there were multiple physical
systems early on :-/

[snip]
> > Maybe I've misunderstood, but it looks like you're running btrfs
> > volumes, on top of qcow2 images, on top of a btrfs host volume.
> > That's an easy to reproduce recipe for problems of this kind.
> >
> >
> > Sincerely,
> > Nicholas
> >
>
>
>
> This last part kind of went off the rails. We are only talking about one
> btrfs partition per physical host, which is only used to store backups. It
> is the most simple, vanilla situation, which should be perfectly suited
> for a file system.
>

This would have been nice to know at the outset... Please pick one
host without a megaraid for the purposes of this bug. I'll assume
that this host is not in production and can afford to crash, that you
have external backups, and (optionally) that you can add a SAS- or
SATA-connected hard drive.

1)
Install the newest bpo kernel and its dbg package on this machine,
along with the locally-compiled btrfs-progs 4.20.x bpo (if you prefer
I can upload it somewhere).

Add "scsi_mod.use_blk_mq=0" as a kernel argument and update-grub, or
manually edit at the grub menu. This removes the new blk-mq as a
variable.
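For example, on Debian that normally means (your existing
GRUB_CMDLINE_LINUX_DEFAULT contents will differ):

# in /etc/default/grub:
GRUB_CMDLINE_LINUX_DEFAULT="quiet scsi_mod.use_blk_mq=0"
# then regenerate the config and reboot:
update-grub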

Add a new disk, partition it, format it using btrfs-progs 4.20.x, and
mount it without compression. This removes compression as a variable,
which is notable, because most corruption or crash reports on
linux-btrfs involve transparent compression. It also removes ancient
filesystem structures as a variable. Yes, really. There is a large
class of reports on linux-btrfs that are caused by the leftover lint
of various allocation bugs in past kernels. (a)
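Roughly, assuming the new disk shows up as /dev/sdX (illustrative):

# partition it with your usual tool (fdisk, parted, ...), then:
mkfs.btrfs /dev/sdX1
mkdir -p /mnt/test
mount -o noatime /dev/sdX1 /mnt/test   # note: no compress/compress-force option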

To isolate the kernel reader from the writer, netcat (or ssh) a VM
image from another machine.
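For instance (hostnames, port, and paths are illustrative; the listen
syntax differs between netcat variants):

# on the test host, writing straight into the new filesystem:
nc -l -p 9000 > /mnt/test/vhost002.img
# on the machine that currently holds the image:
nc test-host 9000 < /path/to/vhost002.img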

If it crashes, then we'll have a simple, reproducible case for core
btrfs functionality. I expect this test case to pass.

2)
Try again with scsi_mod.use_blk_mq=1. If this fails, the bug is
either in blk-mq itself or in a bad interaction between blk-mq and
btrfs. I'm not sure whether this one will pass or fail. If it fails,
then this bug might be a duplicate of #913138.

3)
Try again with scsi_mod.use_blk_mq=0, but this time mount with
compress=zstd. This test will fail if it's a zstd compressed extent
bug.
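e.g., with the same illustrative device and mountpoint as above:

mount -o noatime,compress=zstd /dev/sdX1 /mnt/test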

4) Try again with scsi_mod.use_blk_mq=1 (kernel default), with
compress=zstd. This is the closest to your existing test case. It
will fail, unless (a) is the cause.

5) If for some reason all of 1-to-4 pass, then redo them without
isolating the reader and the writer on different machines. This
probably won't be necessary.

--

If everything passes, then try 1, 2, and maybe 3 on one of the
systems with a megaraid, to test for a bad interaction between this
HBA and blk-mq. I think it's highly unlikely it will come to this!


Cheers,
Nicholas

