
Re: kernel configs in Debian



On Fri, Apr 30, 2021 at 4:10 AM Ryutaroh Matsumoto
<ryutaroh@ict.e.titech.ac.jp> wrote:
>
> This is a followup to my previous post about the impact of kernel
> compile options on kernel performance:
>
> Summary:
> * CONFIG_PARAVIRT=n has probably no positive impact on either
>   linux-image-arm64 or linux-image-rt-arm64.

Ok

> * CONFIG_DEBUG_PREEMPT=n much improves performance of linux-image-rt-arm64,
>   while it is unselectable with linux-image-arm64, as CONFIG_DEBUG_PREEMPT
>   depends on CONFIG_PREEMPTION.
>
> * linux-image-rt-arm64 is much slower than the standard linux-image-arm64,
>   but its performance probably becomes comparable by omitting unnecessary
>   compile options for a given hardware.

I would not expect any change in performance from omitting unused drivers.
If turning off support for the other platforms does have a performance
impact, that would point to a serious performance regression somewhere
we do not expect one.

CONFIG_DEBUG_PREEMPT is a tough choice here: in a distro kernel,
this should probably be enabled, since it may find RT-specific bugs in
arbitrary drivers. Generally speaking, PREEMPT_RT is less well tested
than normal kernels, so having this enabled is particularly useful when
running on hardware that nobody else has tried it on before.
The impact of CONFIG_DEBUG_PREEMPT is also higher than I expected
here; it may be worth asking on the linux-rt-users list what the
expected cost on arm64 hardware is.
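If you do want to benchmark without it, the corresponding line in a
config fragment would look like this (assuming you rebuild from the
Debian config; kconfig expresses a disabled bool as a comment, not
as =n):

```
# Catches use of per-CPU data and smp_processor_id() in preemptible
# context; valuable on a distro RT kernel, but it adds checks on very
# hot paths such as the network stack.
# CONFIG_DEBUG_PREEMPT is not set
```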

> The job of the RPi4B is taking IPv4 packets, applying NAPT, encapsulating
> them in IPv6, and vice versa. Almost no user process is involved. The CPU
> is mainly in kernel mode or servicing interrupts. The hard irq + softirq
> consumption of a single cpu core spikes to 85-100% during the speedtest.

This is likely all driver specific, and if you just need to improve network
throughput, tuning or hacking the driver probably makes more difference
than kernel-wide options do.

If this is the internal network device in the Raspberry Pi 4, I can see
that the platform is not particularly optimized for throughput, even
though the driver doesn't contain any serious blunders.

The first thing I see is that the driver can support 40 bit addressing,
but the platform doesn't declare the bus to be wider than 32 bits,
so it will always use bounce buffers for any address above the first
four gigabytes. Interestingly, the DTB file that comes with raspbian
does declare a /scb/dma-ranges property for the bus that ethernet
and PCI are attached to, which would make their kernel much
faster than a mainline kernel!
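For comparison, the property in question looks roughly like this in DTS
form. The numbers below are purely illustrative; dump the real values
with "dtc -I dtb -O dts" on the raspbian DTB before relying on them:

```
/* Illustrative sketch only, not the actual raspbian values */
&scb {
        /* <child-addr parent-addr size>: declares how much of the
         * address space devices on this bus can reach by DMA, so
         * streaming mappings above that limit fall back to bounce
         * buffers. */
        dma-ranges = <0x0 0x00000000  0x0 0x00000000  0xfc000000>;
};
```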

Another thing I see is that the ethernet device is actually able to
use four separate transmit queues, but it seems they are all
wired up to the same interrupt line.  For rx, the hardware does
seem to support multiple queues, but the driver doesn't use them.
I doubt there is anything you can do about this to make it use
multiple CPUs.

Finally, I see that the TX queue is protected by a spinlock that
prevents the bcmgenet_xmit() function from running concurrently
with the __bcmgenet_tx_reclaim() function, so even when you
call xmit on different CPU cores, it still won't utilize multiple cores
at any one time, but will instead lead to either spinning (with the
normal kernel) or blocking the thread (on an rt kernel). If the
transmit path can be changed to work without spinlocks, the
difference between rt and non-rt would get smaller for your
workload, and both would probably get faster.

> The observed speeds are shown below:
>
> linux-image-arm64 with no change:
>    Download:   577.23 Mbps (data used: 370.7 MB)
>      Upload:   386.99 Mbps (data used: 353.0 MB)
>    Download:   592.79 Mbps (data used: 1.1 GB)
>      Upload:   380.41 Mbps (data used: 171.0 MB)
>
>
> linux-image-arm64 with CONFIG_PARAVIRT=n
>    Download:   485.35 Mbps (data used: 406.0 MB)
>      Upload:   380.57 Mbps (data used: 171.5 MB)
>    Download:   514.57 Mbps (data used: 256.8 MB)
>      Upload:   376.92 Mbps (data used: 169.2 MB)

Curiously, these numbers suggest that turning off CONFIG_PARAVIRT
actually makes the kernel slower in the non-preempt version, while for
the preempt-rt kernel it does not show that counterintuitive effect.
Can you check whether there are any other differences in the .config
file besides CONFIG_PARAVIRT that may cause the difference, and
that you didn't mix up the results?

> linux-image-rt-arm64 with no change:
>    Download:   380.85 Mbps (data used: 422.2 MB)
>      Upload:   283.87 Mbps (data used: 127.8 MB)
>
> linux-image-rt-arm64 with CONFIG_PARAVIRT=n
>    Download:   332.95 Mbps (data used: 265.4 MB)
>      Upload:   310.06 Mbps (data used: 273.7 MB)
>    Download:   385.97 Mbps (data used: 400.1 MB)
>      Upload:   295.57 Mbps (data used: 133.2 MB)
>    Download:   379.69 Mbps (data used: 394.0 MB)
>      Upload:   293.07 Mbps (data used: 139.4 MB)
>
> linux-image-rt-arm64 with CONFIG_PARAVIRT=n & CONFIG_DEBUG_PREEMPT=n
>    Download:   425.95 Mbps (data used: 753.7 MB)
>      Upload:   347.50 Mbps (data used: 382.8 MB)
>    Download:   423.05 Mbps (data used: 499.4 MB)
>      Upload:   332.48 Mbps (data used: 149.4 MB)

Nice!

> RT kernel specialized for RPi:
> https://github.com/emojifreak/debian-rpi-image-script/blob/main/build-debian-raspi-kernel.sh
>
>    Download:   488.33 Mbps (data used: 514.6 MB)
>      Upload:   416.72 Mbps (data used: 330.8 MB)
>    Download:   504.79 Mbps (data used: 633.5 MB)
>      Upload:   404.07 Mbps (data used: 258.5 MB)

I see you do a couple of things in this fragment. One of them is the
CONFIG_BPF_JIT_ALWAYS_ON=y option, which might make
a significant difference if you actually use BPF (otherwise it makes
no difference).

Given that the numbers here are actually higher than the non-RT
kernel numbers, you clearly hit something very interesting here.

I also see that you enable a number of debugging options, including
CONFIG_UBSAN_SANITIZE_ALL=y, which I would expect to make
the kernel significantly slower when turned on. Is this one enabled
in the other kernels as well, or did you find that it has a positive
effect here?

As mentioned above, turning off the unused platforms /should/ not
make a difference other than code size. Do you get different
results if you drop all the CONFIG_ARCH_*=n lines from the
fragment? If you do, I would consider that a problem in the
upstream kernel that needs to be investigated further.
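Concretely, the test would be to rebuild with only the platform-selection
lines removed and everything else identical, i.e. something like the
following (symbol names here are the usual arm64 ones; adjust to
whatever the fragment actually contains):

```
# keep the Raspberry Pi platform enabled...
CONFIG_ARCH_BCM2835=y
# ...but for this test, delete the lines that disable the other
# platforms, such as "# CONFIG_ARCH_QCOM is not set", so those
# platforms build exactly as in the Debian config
```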

        Arnd

