
Bug#987069: document which file systems support cgroupv2 I/O controller



On Thu, May 13, 2021 at 04:38:18PM -0400, Nicholas D Steeves wrote:
> >
> > On 17-04-2021 00:35, Nicholas D Steeves wrote:
> >> Last I checked, only btrfs supports the cgroupv2 I/O controller; this
> >> should probably be documented.  Alternatively, if more than btrfs (ie:
> >> XFS and ext4) supports it, but not other file systems (ie: f2fs,
> >> reiser4, jfs, etc.) then this should be documented.

There are two things which are being confused here.  One is whether
you are using cgroup v1 versus cgroup v2, and the other is which
I/O-related cgroup controller you are using.

There is the I/O cost model based controller (CONFIG_BLK_CGROUP_IOCOST)
and the I/O latency controller (CONFIG_BLK_CGROUP_IOLATENCY).  These are
two *different* I/O controllers that are supported by cgroup v2.

The I/O cost controller is the simpler of the two, and is similar to
the cgroup v1 block I/O controller, although it's more
complex/featureful than that older controller.

The I/O latency controller is experimental (from block/Kconfig: "Note,
this is an experimental interface and could be changed someday.") and,
since it was implemented by Josef Bacik, who is one of the maintainers
of btrfs, it was only tested on btrfs (and on Facebook workloads).
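
If you want to check what a given system actually has available, the
place to look is the cgroup.controllers file at the root of the cgroup
v2 hierarchy; whether CONFIG_BLK_CGROUP_IOCOST and/or
CONFIG_BLK_CGROUP_IOLATENCY were built in is a separate question,
answered by your kernel config.  Here's a minimal sketch, assuming the
unified hierarchy is mounted at /sys/fs/cgroup (which is where systemd
normally puts it):

    /* List the cgroup v2 controllers the kernel is offering on the
     * unified hierarchy.  Assumes it is mounted at /sys/fs/cgroup;
     * seeing "io" here only tells you the v2 I/O controller is
     * available, not which of iocost/iolatency were enabled in the
     * kernel config.
     */
    #include <stdio.h>

    int main(void)
    {
            char buf[4096];
            FILE *f = fopen("/sys/fs/cgroup/cgroup.controllers", "r");

            if (!f) {
                    perror("cgroup.controllers (is cgroup v2 mounted?)");
                    return 1;
            }
            if (fgets(buf, sizeof(buf), f))
                    printf("available controllers: %s", buf);
            fclose(f);
            return 0;
    }

(A simple "cat /sys/fs/cgroup/cgroup.controllers" will tell you the
same thing, of course.)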

Now, there is nothing file-system-specific in any of the I/O
controllers, whether we're talking about the cgroup v1 blkio
controller, the cgroup v2 I/O cost controller, or the cgroup v2 I/O
latency controller, but it turns out I/O controllers are *subtle*.
How they interact with memory cgroups and file systems ends up
producing all sorts of "interesting" interactions.

Solving this problem in a general fashion takes a lot of software
engineering investment, and I know of only two companies who have done
that dedicated work: (a) Google, where we made the cgroup v1 block I/O
controller work with ext4 in no-journal mode.  To make things work
well at high levels of memory, CPU, and I/O load with fine-grained
control, we needed to make changes to the cfq I/O scheduler that were
rejected by upstream because they were "too complicated" and
"insufficiently general", so we ended up forking cfq to create an
internal gfq I/O scheduler that we've been carrying for the last 8
years or so.  And (b) Facebook, where Josef basically created his own
I/O controller; since it was new and experimental, he only got it
working for Facebook's configuration, got it upstream, and he hasn't
done much with it since 2018.

I'm not saying this to diss Josef; it's a hard problem, and, speaking
from experience, solving it for a particular company's workload can
easily take 1-3 focused SWE-years --- and the business case to make
that investment more generally hasn't really existed.

So why is it that the I/O latency controller has been tested to work
well only on btrfs?  Two reasons.  The first is that the file system
has to mark its metadata I/O using the REQ_META flag, so that the
metadata doesn't get throttled.  Why is this important?  Because
metadata I/O often happens while holding locks, or happens on kernel
threads located in a different (or system) cgroup, on behalf of
processes in a different cgroup.  So the I/O latency controller
exempts from throttling any I/O request that is marked with REQ_META
or REQ_SWAP (the latter for swap requests).  Btrfs and ext4 mark
metadata I/O with REQ_META[1]; XFS does not.

[1] REQ_META was useful even before the I/O latency controller was
introduced, since it allows you to easily identify metadata I/O using
the block tracing tools, or to collect timing information for metadata
I/O using eBPF, etc.
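
To make [1] a bit more concrete, here is a rough, hypothetical sketch
(not lifted from btrfs or ext4, and the function name is made up) of
what tagging a metadata read looks like inside a file system.  The
exact allocation and submission helpers vary by kernel version, so
this only shows where the flag goes:

    #include <linux/bio.h>
    #include <linux/fs.h>

    /* Illustrative only: tag a metadata read with REQ_META (plus
     * REQ_PRIO, which file systems commonly add for latency-sensitive
     * metadata) before submitting it, so the block layer and the I/O
     * latency controller can tell it apart from ordinary data I/O.
     */
    static void example_submit_meta_read(struct super_block *sb,
                                         struct bio *bio,
                                         struct page *page,
                                         sector_t block)
    {
            bio_set_dev(bio, sb->s_bdev);
            bio->bi_opf = REQ_OP_READ | REQ_META | REQ_PRIO;
            bio->bi_iter.bi_sector = block << (sb->s_blocksize_bits - 9);
            bio_add_page(bio, page, sb->s_blocksize, 0);
            submit_bio(bio);
    }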

The other issue is that ext4 will do synchronous (blocking) data
writebacks as part of the commit processing in order to make sure
freshly allocated blocks won't accidentally reveal previously deleted
data after a crash or power failure.  So if the I/O latency
controller throttles I/Os issued by the ext4 commit thread, it
will slow down the commit, and this slows down *all* threads.  And if
we exempt writebacks from the commit thread, a large enough percentage
of writebacks can end up getting exempted that the I/O latency
controller doesn't work very well at all.

There is nothing *stopping* you from using the I/O latency controller
with ext4.  However, it may not work well, and you may be very unhappy
with the results.  Actually, I suspect that if you used an ext4 file
system created without a journal ("mkfs.ext4 -O ^has_journal", which
is how Google uses ext4 on our data center servers) or mounted with
data=writeback, the I/O latency controller would work *fine*.
However, it's not been tested, and so if it breaks, you get to keep
both pieces.
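
If you do want to experiment, the mechanics of turning it on are not
the hard part; the validation is.  Here's a rough sketch (the cgroup
name, the 8:16 device number, and the target value below are made-up
examples; see Documentation/admin-guide/cgroup-v2.rst for the exact
io.latency syntax and units):

    /* Enable the cgroup v2 "io" controller for children of the root
     * and put a latency target on a hypothetical "test" cgroup, which
     * you would have created beforehand (e.g. with mkdir(2)).  The
     * "8:16 target=10" line is purely illustrative; check the kernel
     * documentation for the real format and units.
     */
    #include <stdio.h>
    #include <stdlib.h>

    static void write_file(const char *path, const char *val)
    {
            FILE *f = fopen(path, "w");

            if (!f || fputs(val, f) == EOF || fclose(f) == EOF) {
                    perror(path);
                    exit(1);
            }
    }

    int main(void)
    {
            write_file("/sys/fs/cgroup/cgroup.subtree_control", "+io");
            write_file("/sys/fs/cgroup/test/io.latency", "8:16 target=10");
            return 0;
    }

Then move the processes you care about into that cgroup (by writing
their PIDs to /sys/fs/cgroup/test/cgroup.procs) and watch what happens
to them, and to everything else on the box, under a realistic load.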

Of course, running ext4 without a journal means the file system can
get corrupted after a crash.  Given how Google uses ext4 in our data
centers, as the back-end store for a cluster file system that uses
erasure coding (e.g., Reed-Solomon encoding) or n=3
replication, the speed benefits are worth it, and we have other ways
of protecting against single node failures --- either due to fs
corruption after a power failure, or the network router at the top of
the rack failing, or a power distribution unit failure taking out an
entire row of racks of servers.  And data=writeback also has
tradeoffs:

  "A crash+recovery can cause incorrect data to appear in files which
  were written shortly before the crash."
  - https://www.kernel.org/doc/html/latest/admin-guide/ext4.html#data-mode

> I'm also not sure, because everywhere I've looked appears to document
> that this technology is not filesystem specific (except the Facebook
> cgroupv2 iogroup announcement which asserts btrfs-only).
> 
> CCing Theodore Ts'o, who will definitely know if ext4 is supported!

As described above, under *some* circumstances, ext4 *may* work with
the I/O latency controller.  But until you test it, you won't know for
sure.

More generally, that's the problem with I/O controllers: they are
fundamentally oblivious to how they interact with other parts of the
kernel, and until you do the testing, and perform whatever remediation
you find is necessary, you may end up being
surprised when you put it into production use.  Josef did some brief
testing with XFS and ext4, and noted that it didn't work in their
default configuration, and he did whatever work was needed to make it
work well for Facebook using btrfs.

I'll give you another example that we learned the hard way.  Depending
on how tightly you size your memory cgroups and how tightly you
constrain your I/O controller, it's possible to trigger write
throttling, where processes which are dirtying memory faster than it
can be written out are put to sleep instead of triggering the OOM
killer.  It turns out that write throttling when total system memory
is low is quite different from write throttling when a particular
memory cgroup is low on free memory, and so the complex interaction
between the memory cgroup controller and the I/O cgroup controller is
another reason why there appears to be a guaranteed employment act for
data center kernel engineers.  :-)

So even with btrfs, how you configure your system may be quite
different from how Facebook configures its data center servers, so I'd
encourage you to be careful and do a lot of testing before you deploy
the I/O latency controller in production.  It's not that the block I/O
latency controller won't work with any particular file system --- the
problem is that it may work too well, or at least not the way you
expect, a la the magic broomstick carrying water in Disney's
Fantasia[2].

[2] https://video.disney.com/watch/sorcerer-s-apprentice-fantasia-4ea9ebc01a74ea59a5867853

Cheers,

					- Ted

P.S.  As far as other file systems (f2fs, jfs, reiserfs, etc.) are
concerned, the same issues apply; if they don't mark metadata I/O with
the REQ_META flag, things may not go well.  If they do, the next
question is
whether they are doing a lot of I/O on kernel threads on behalf of
various userspace processes.  But until you actually test and verify,
I would hesitate to make any kind of guarantee.

