
Bug#861964: marked as done (Increased ext4_inode_cache size wastes RAM under default SLAB allocator)



Your message dated Thu, 22 Feb 2018 17:16:39 +0100
with message-id <1519316199.2617.234.camel@decadent.org.uk>
and subject line Re: Increased ext4_inode_cache size wastes RAM under default SLAB allocator
has caused the Debian Bug report #861964,
regarding Increased ext4_inode_cache size wastes RAM under default SLAB allocator
to be marked as done.

This means that you claim that the problem has been dealt with.
If this is not the case it is now your responsibility to reopen the
Bug report if necessary, and/or fix the problem forthwith.

(NB: If you are a system administrator and have no idea what this
message is talking about, this may indicate a serious mail system
misconfiguration somewhere. Please contact owner@bugs.debian.org
immediately.)


-- 
861964: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=861964
Debian Bug Tracking System
Contact owner@bugs.debian.org with problems
--- Begin Message ---
Package: linux-image-amd64
Version: 4.9+80

Debian's use of the SLAB allocator, combined with ongoing kernel changes, means
the ext4 inode cache wastes ~21% of the space allocated to it on recent amd64
kernels, a regression from the ~2% waste in jessie.

SLAB enforces a lowest-order (order-0) allocation, i.e. a single 4KB page on
x86[-64], for slabs containing VFS-reclaimable objects such as ext4_inode_info:
http://elixir.free-electrons.com/linux/v4.9.25/source/mm/slab.c#L1827
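
For illustration, here is a minimal, self-contained sketch of that policy - not
the kernel's actual code, just the behaviour described above, with a local
stand-in for the flag:

  /* Sketch of the constraint described above: caches whose objects are
   * accounted as VFS-reclaimable stay at the lowest page order, so each
   * slab is a single 4KB page on x86-64, however well objects pack. */
  #include <stdio.h>

  #define PAGE_BYTES            4096UL
  #define SLAB_RECLAIM_ACCOUNT  0x1U   /* stand-in for the kernel flag */

  static unsigned int pick_slab_order(unsigned int flags,
                                      unsigned int preferred_order)
  {
      if (flags & SLAB_RECLAIM_ACCOUNT)
          return 0;                    /* one page per slab */
      return preferred_order;
  }

  int main(void)
  {
      unsigned int order = pick_slab_order(SLAB_RECLAIM_ACCOUNT, 3);
      printf("slab size: %lu bytes (order %u)\n", PAGE_BYTES << order, order);
      return 0;
  }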

In jessie's Linux 3.16 kernel, an ext4_inode_cache entry is ~1000 bytes, so 
four fit nicely in a slab. Additions to this structure and its members have 
increased it to ~1072 bytes in 4.9.25 (on a machine with 32 logical cores):

  # grep ext4_inode_cache /proc/slabinfo
  name              <active_objs> <num_objs> <objsize> <objperslab>
  ext4_inode_cache            956        987      1072            3  …

…leaving 880 bytes wasted per slab in Debian stretch (and jessie-backports).
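
That waste follows directly from the arithmetic; a quick worked check (the
~1000-byte jessie figure is approximate, as above):

  /* Worked check of the per-slab waste in a 4KB slab: ~1000-byte objects
   * (jessie's 3.16) versus 1072-byte objects (stretch's 4.9). */
  #include <stdio.h>

  #define SLAB_BYTES 4096UL

  static void report(const char *label, unsigned long objsize)
  {
      unsigned long objects = SLAB_BYTES / objsize;
      unsigned long waste   = SLAB_BYTES - objects * objsize;

      printf("%s: %lu objects/slab, %lu bytes wasted (%.1f%%)\n",
             label, objects, waste, 100.0 * waste / SLAB_BYTES);
  }

  int main(void)
  {
      report("jessie,  ~1000 B", 1000);   /* 4 objects,  96 bytes, ~2%  */
      report("stretch,  1072 B", 1072);   /* 3 objects, 880 bytes, ~21% */
      return 0;
  }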

Having 3 objects rather than 4 per slab may make partially-used slabs easier
to reclaim, but inodes can't linger in the cache for as long, and recreating
them evicts other data, leading to increased disk activity. Slab allocation
also takes time; and if the slabs were denser, more inodes (or other content)
could fit in CPU cache.

By comparison, mainline's default SLUB allocator (used by Ubuntu) seems to
use a 4-page/16KB or 8-page/32KB slab size, which fits 15 or 30
ext4_inode_cache objects. That count has also dropped since 3.16, but the
result is not nearly as wasteful.
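
The same arithmetic with those larger slab sizes gives the 15- and 30-object
figures (again assuming the 1072-byte object size from the slabinfo output
above); waste drops to roughly 2%:

  /* Same check for SLUB-style higher-order slabs with 1072-byte objects. */
  #include <stdio.h>

  int main(void)
  {
      const unsigned long objsize = 1072;
      const unsigned long slab_bytes[] = { 16384, 32768 };  /* 4 and 8 pages */

      for (unsigned int i = 0; i < 2; i++) {
          unsigned long objects = slab_bytes[i] / objsize;
          unsigned long waste   = slab_bytes[i] - objects * objsize;

          printf("%luKB slab: %lu objects, %lu bytes wasted (%.1f%%)\n",
                 slab_bytes[i] / 1024, objects, waste,
                 100.0 * waste / slab_bytes[i]);
      }
      return 0;
  }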

The inode cache is initially small, but may grow to ~50% of RAM under heavy
workloads, e.g. rsync runs on a file server.

== Possible workarounds/resolutions ==

A custom-compiled kernel with the right options reduces ext4_inode_cache 
object size below 1000 bytes - for me, it cut ~160MB from slab_cache on an 
active 32GB web app/file server with nightly rsync. (It may reduce CPU and 
disk utilization, but the load in question is not constant enough to 
benchmark.)

Some flags have a big impact on ext4_inode_info
(and subsidiary structs such as rw_semaphore):
http://elixir.free-electrons.com/linux/v4.9.25/source/fs/ext4/ext4.h#L937

The precise sizes change with kernel version and CPU configuration. For
jessie-backports' Linux 4.7.8, disabling both
* EXT4 encryption (CONFIG_EXT4_FS_ENCRYPTION), _and_ either:
  a) VFS quota (CONFIG_QUOTA; OCFS2 must be disabled first), or
  b) Optimistic rw_semaphore spinning (CONFIG_RWSEM_SPIN_ON_OWNER)
reduced ext4_inode_cache objects to 1008-1016 bytes - sufficient to fit four
inodes in a slab. It worked on 4.8.7 as well, reducing the size to exactly
1024 bytes.
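
A quick way to check whether a rebuilt kernel really gets back to four inodes
per slab is to read the objsize and objperslab columns from /proc/slabinfo.
The small checker below was written for this report (it is not an existing
tool); it assumes the slabinfo 2.1 column layout and 4KB pages, and usually
needs to run as root:

  /* Print object size, objects per slab, pages per slab and the resulting
   * per-slab waste for ext4_inode_cache, from /proc/slabinfo. */
  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
      FILE *f = fopen("/proc/slabinfo", "r");
      char line[512];

      if (!f) {
          perror("/proc/slabinfo");
          return 1;
      }
      while (fgets(line, sizeof(line), f)) {
          char name[64];
          unsigned long active, total, objsize, objperslab, pagesperslab;

          /* Header and version lines fail this parse and are skipped. */
          if (sscanf(line, "%63s %lu %lu %lu %lu %lu",
                     name, &active, &total, &objsize,
                     &objperslab, &pagesperslab) != 6)
              continue;
          if (strcmp(name, "ext4_inode_cache") != 0)
              continue;

          unsigned long slab_bytes = pagesperslab * 4096UL;  /* 4KB pages */
          unsigned long waste = slab_bytes - objperslab * objsize;

          printf("objsize=%lu objperslab=%lu pagesperslab=%lu "
                 "waste/slab=%lu (%.1f%%)\n",
                 objsize, objperslab, pagesperslab, waste,
                 100.0 * waste / slab_bytes);
      }
      fclose(f);
      return 0;
  }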

But custom compilation is time-consuming and workload-dependent. Tossing ext4
encryption and quota is fine for our purposes, but Debian may not want to drop
them.

Disabling optimistic semaphore owner spinning - perhaps only below a certain
core count? - may be part of a general solution; there is no menu option for
CONFIG_RWSEM_SPIN_ON_OWNER, so it has to be set in the build config, or
possibly on the command line.

https://lkml.org/lkml/2014/8/3/120 suggests optimistic spinning improves some
contention-heavy workloads - or at least benchmarks thereof - but it may not
be worth the trade-off by default. Incidentally, I found no documentation
noting that it can increase memory usage.

Getting into more significant code changes: Ted Ts'o shrank ext4_inode_info
by 8% six years ago:
http://linux-ext4.vger.kernel.narkive.com/D3sK9Flg/patch-0-6-shrinking-the-size-of-ext4-inode-info

…but it has since grown ~22%, due to features such as ext4 encryption, 
project-based quota, and the aforementioned optimistic spinning on the three 
read-write semaphores in the struct:
https://github.com/torvalds/linux/commit/4fc828e24cd9c385d3a44e1b499ec7fc70239d8a
https://github.com/torvalds/linux/commit/ce069fc920e5734558b3d9cbef1ab06cf01ee793
https://lwn.net/Articles/697603/

Ted mentioned that "it would be possible to further slim down the 
ext4_inode_cache by another 100 bytes or so, by breaking the ext4_inode_info 
into the portion of the inode required [when] a file is opened for writing, 
and everything else."

This might be worth it, given that we're on the borderline, and particularly 
if rw_semaphore is included; there are attempts to make those even bigger:
http://lists-archives.com/linux-kernel/28643980-locking-rwsem-enable-count-based-spinning-on-reader.html

Adding a define to configure out project quota (kprojid_t i_projid) may cut a
few bytes - or maybe more, given alignment? I don't know whether this would
have a negative impact on filesystems that use it, beyond the feature not
working. At least it would give another knob to tweak.
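
To make the alignment point concrete, here is a toy illustration - the struct,
field names and the EXT4_PROJECT_QUOTA macro are invented for this sketch, not
taken from the kernel - showing how compiling out a 4-byte field can save 8
bytes once padding is counted:

  /* Hypothetical example: dropping a 4-byte field saves 8 bytes here,
   * because the following 8-byte member forces alignment padding. */
  #include <stdio.h>
  #include <stdint.h>

  struct toy_inode_info {
      uint64_t i_flags;
  #ifdef EXT4_PROJECT_QUOTA
      uint32_t i_projid;       /* project quota ID, compiled out otherwise */
  #endif
      uint64_t i_disksize;     /* 8-byte alignment pads after i_projid */
  };

  int main(void)
  {
      /* 16 bytes without EXT4_PROJECT_QUOTA, 24 with -DEXT4_PROJECT_QUOTA. */
      printf("sizeof(struct toy_inode_info) = %zu\n",
             sizeof(struct toy_inode_info));
      return 0;
  }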

Adjusting struct alignment may also be beneficial, either in all cases or 
based on the presence/absence of flags, as in 
https://patchwork.ozlabs.org/patch/62051/

ext4_inode_info appears to contain a copy of the 256-byte on-disk format. 
Maybe it's feasible to use some of this in-place rather than duplicating it 
and writing it back later? Or it could be separated into its own object; 
it's a nice round size.  (In-place use may violate style guidelines, if 
nothing else…)

Lastly, 32-bit and uniprocessor kernels have far smaller ext4_inode_cache
objects - I got one down to 560 bytes (7 objects/slab) - and may remain
beneficial where RAM is strictly limited (VMs in particular).

== SLAB vs. SLUB ==

Debian's use of SLAB allocation (vs. SLUB) might also be reconsidered. But 
I'm not sure this is as useful as just reducing the inode size.

Both allocators appear to have improved over time (e.g. SLAB got 1-byte 
freelist entries). If anything, SLAB has had more work recently.

The view in 2012 appeared to be that SLUB was less suitable for 
multiprocessor systems than SLAB:
https://lists.debian.org/debian-kernel/2012/03/msg00944.html

And while Linus seems to want to get rid of SLAB:
http://marc.info/?l=linux-mm&m=147423350524545&w=2

... it seems SuSE also still uses it:
http://marc.info/?l=linux-mm&m=147426644529856&w=2

In fact the problem discussed there might have been avoided with SLAB, 
because it would have soaked up 4K blocks:
http://marc.info/?l=linux-mm&m=147422898523307&w=2

Reducing structure size would benefit every allocator, so that should 
probably be the focus.
--
Laurence "GreenReaper" Parry - Inkbunny administrator
greenreaper.co.uk - wikifur.com - flayrah.com - inkbunny.net
"Eternity lies ahead of us, and behind. Have you drunk your fill?" 


--- End Message ---
--- Begin Message ---
Version: 4.15.4-1

We've switched to using SLUB in unstable.

Ben.

-- 
Ben Hutchings
[W]e found...that it wasn't as easy to get programs right as we had
thought. ... I realized that a large part of my life from then on was
going to be spent in finding mistakes in my own programs. - Maurice
Wilkes, 1949

Attachment: signature.asc
Description: This is a digitally signed message part


--- End Message ---
