Re: Aw: Re: C3600 kernel/64bit 4.* slow IO due to -mlong-calls

To: Helge Deller <deller@gmx.de>
Cc: debian-hppa@lists.debian.org, linux-parisc <linux-parisc@vger.kernel.org>
Subject: Re: Aw: Re: C3600 kernel/64bit 4.* slow IO due to -mlong-calls
From: John David Anglin <dave.anglin@bell.net>
Date: Sun, 18 Mar 2018 09:31:41 -0400
Message-id: <bb7defc8-c53e-7842-a847-a4d792e634de@bell.net>
In-reply-to: <e3b11e2e-490a-5882-2bfa-a199ae4a2634@bell.net>
References: <CA+QBN9C=wHxV5Y=SSt2UB_4b1H_Nx1mZxZu3rvuTY2wRe=GWBw@mail.gmail.com> <d637ebd1-db27-585b-c324-0e02baf8c8b2@gmx.de> <113ad157-e8c3-81dc-5752-75416c8ce531@bell.net> <trinity-f7583046-4aa6-488a-a797-20501118a1c0-1521199558775@3c-app-gmx-bs75> <e3b11e2e-490a-5882-2bfa-a199ae4a2634@bell.net>

On 2018-03-16 9:37 AM, John David Anglin wrote:

On 2018-03-16 7:25 AM, Helge Deller wrote:

kernel  gcc    binutils  with -mlong-calls  without -mlong-calls
4.15.7  4.9.3  2.25.1    13.4 MB/s          27.0 MB/s
4.15.7  6.4.0  2.25.1    13.4 MB/s          27.0 MB/s
4.15.7  6.4.0  2.29.1    14.4 MB/s          25.0 MB/s

Interesting bad results!

It's hard to understand why the performance would deteriorate so much
but I see essentially the same behavior.

The performance difference between with and without long calls is
exaggerated by the I/O test used for the above results. I see 22:05 and
21:39 hours for a gcc build and check with and without kernel long calls
on c8000, respectively.
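
To put the two measurements side by side, here is a small sketch that turns the build times quoted above into a relative overhead and compares it with the I/O test ratio from the first row of the table. The helper name hhmm_to_minutes is made up for this illustration; the numbers are the ones reported in this thread.

```python
def hhmm_to_minutes(text: str) -> int:
    """Convert an 'HH:MM' duration string to whole minutes."""
    hours, minutes = text.split(":")
    return int(hours) * 60 + int(minutes)


# gcc build and check on c8000, with and without kernel long calls.
with_long = hhmm_to_minutes("22:05")
without_long = hhmm_to_minutes("21:39")

# Relative slowdown of the build when the kernel uses long calls.
overhead_pct = 100.0 * (with_long - without_long) / without_long

# I/O test from the table (first row): throughput without / with long calls.
io_ratio = 27.0 / 13.4

print(f"build overhead with long calls: {overhead_pct:.1f}%")
print(f"I/O test throughput ratio:      {io_ratio:.2f}x")
```

The build differs by 26 minutes out of roughly 21.6 hours, while the I/O test shows about a factor of two.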

I think the poor performance of long calls is primarily due to the loads
which can trigger TLB misses. This implies we should work to minimize the
impact of TLB flushes. Flushing the whole TLB is quite detrimental to
overall performance and it doesn't scale well to multiple CPUs. On rp3440,
a pdc instruction takes about 570 cycles because of the broadcast to other
CPUs. So, we need to know whether a mapping is local and possibly the set
of CPUs a mapping applies to.

Speaking of the Debian kernel, it's nearly impossible to link a kernel without -mlong-calls.
Compiling without mlong-calls generates this (R_PARISC_PCREL22F):
         b,l external_func,%r2
         nop
On PA 2.0, this is a 22-bit pc-relative call that has a branch distance of 8 MB. We have no stub support in the GNU 64-bit linker. If we had stub support, this would be the best solution.
In addition to the argument registers, the argument pointer needs to be loaded for each call.
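
The 8 MB branch distance mentioned above follows from the displacement width: a signed 22-bit displacement counted in 4-byte instruction words. The sketch below works that out and checks whether a given call would be in range. The function names are invented for this illustration, and the check ignores the small fixed offset the hardware adds to the pc.

```python
def pcrel_reach_bytes(displacement_bits: int, insn_bytes: int = 4) -> int:
    """Reach in bytes, in each direction, of a signed word displacement."""
    return (1 << (displacement_bits - 1)) * insn_bytes


def in_branch_range(caller: int, callee: int, displacement_bits: int = 22) -> bool:
    """True if callee is reachable from caller with a pc-relative branch.

    Simplified model: the fixed pc offset applied by the hardware is ignored.
    """
    reach = pcrel_reach_bytes(displacement_bits)
    delta = callee - caller
    return -reach <= delta < reach


reach_22 = pcrel_reach_bytes(22)
print(reach_22 // (1024 * 1024), "MB")          # reach of a 22-bit branch
print(in_branch_range(0x100000, 0x100000 + 4 * 1024 * 1024))   # 4 MB away
print(in_branch_range(0x100000, 0x100000 + 16 * 1024 * 1024))  # 16 MB away
```

A kernel image whose text is larger than this reach cannot rely on the short call without linker stubs, which is the situation described above.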
With -mlong-calls it is much more complex:
.LC0:
         .dword  P%external_func
.globl a
a:
         addil LT'.LC0,%r27
         ldd RT'.LC0(%r1),%r28
         ldd 0(%r28),%r28
         ldd 16(%r28),%r2
         bve,l (%r2),%r2
This is the standard 64-bit indirect call. It calls via a function descriptor. It assumes the PIC register may change and the callee may be in a different space (i.e., 64-bit HP-UX runtime). The bve instruction is specific to PA 2.0.
In the kernel, we probably don't need the load of the new PIC register (omitted from the above).

We don't think we need function descriptors in the kernel. They are only needed to load a new PIC register.

So, we can load the function address directly from the linkage table.

        addil LT'external_function,%r27
        ldd RT'external_function(%r1),%r2
        bve,l (%r2),%r2
        Delay slot

The above sequence is PIC. It is the same length as the one suggested by Helge below, and the linker could convert it to Helge's sequence when the call is not external to the main Linux kernel.

It does have one load that might trigger a TLB miss.
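
As a rough tally of the sequences discussed in this message, the sketch below counts instructions and memory loads for the standard -mlong-calls sequence, the linkage-table sequence above, and the ldil/ldo sequence Helge suggests further down. The instruction lists are copied from this thread; the dictionary keys and the set of load mnemonics are choices made for this illustration, and delay-slot instructions are left out.

```python
# Mnemonics that read memory; ldil and ldo only build a value in a register.
MEMORY_LOADS = {"ldb", "ldh", "ldw", "ldd"}

SEQUENCES = {
    "standard -mlong-calls": ["addil", "ldd", "ldd", "ldd", "bve,l"],
    "linkage table":         ["addil", "ldd", "bve,l"],
    "ldil/ldo (Helge)":      ["ldil", "ldo", "bve,l"],
}


def tally(insns):
    """Return (instruction count, memory load count) for one sequence."""
    loads = sum(1 for mnemonic in insns if mnemonic in MEMORY_LOADS)
    return len(insns), loads


summary = {name: tally(insns) for name, insns in SEQUENCES.items()}
for name, (count, loads) in summary.items():
    print(f"{name:24s} {count} insns, {loads} loads")
```

Each load is a potential TLB miss, so the count of loads is the figure that matters for the argument above.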

I don't know enough about the call sequences used to call functions in
external modules, but it might be easier to do the relocation for the
above. It's probably already handled, as the addil/ldd sequence should
already load the address of external_function.

It might also be possible to use a 32-bit PIC pc-relative sequence, but
it is longer and 32-bit pc-relative relocations might not be supported.

Since our kernel is running in the first 4GB of RAM (even on 64bit), couldn't we instead introduce a gcc option, e.g. "-mkernel-indirect-calls", which translates to:
         ldil    L%external_func, %r2        // R_PARISC_DIR21L
         ldo     R%external_func(%r2), %r2   // R_PARISC_DIR14R
         bve,l (%r2),%r2
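
To illustrate why this sequence depends on the kernel sitting below 4GB, here is a simplified model of the L%/R% split: L% is taken as the upper 21 bits of a 32-bit address and R% as the lower 11 bits, so ldil followed by ldo rebuilds the full address without a memory load. The function names are hypothetical, and the model deliberately ignores how a 64-bit register extends the ldil result.

```python
from typing import Tuple


def split_lr(addr: int) -> Tuple[int, int]:
    """Split a 32-bit absolute address into its L% and R% parts."""
    if not 0 <= addr < (1 << 32):
        raise ValueError("address does not fit in 32 bits")
    left = addr & 0xFFFFF800   # upper 21 bits, as loaded by ldil
    right = addr & 0x7FF       # lower 11 bits, as added by ldo
    return left, right


def ldil_ldo(addr: int) -> int:
    """Model of 'ldil L%addr,%r2' followed by 'ldo R%addr(%r2),%r2'."""
    left, right = split_lr(addr)
    reg = left          # ldil: load the left part
    reg = reg + right   # ldo: add the right part
    return reg


example = 0x40123ABC    # arbitrary address chosen for the example
print([hex(part) for part in split_lr(example)])
print(hex(ldil_ldo(example)))
```

An address at or above 4GB does not fit the two relocations, which is why the sequence only works under the assumption stated in the question.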
Another option is to use ble (i.e., call sequence generated using -mfast-indirect-calls). It yields the same length
call sequence as your above sequence and it works on both PA 1.x and 2.0.

The above sequence is not PIC.  What about modules?
In the above three sequences, there is a delay slot after the branch which might be filled by the compiler with a
useful instruction.
Does -mfast-indirect-calls have any effect at all?
I haven't seen any difference when using this option.
At the moment, this option only applies to the 32-bit compiler.
Thoughts?
I don't remember any huge increase in gcc build time with -mlong-calls. Calls don't usually dominate performance.
Dave


--
John David Anglin  dave.anglin@bell.net
