Re: Aw: Re: C3600 kernel/64bit 4.* slow IO due to -mlong-calls
On 2018-03-16 9:37 AM, John David Anglin wrote:
On 2018-03-16 7:25 AM, Helge Deller wrote:
kernel  gcc    binutils  with mlong  without mlong
4.15.7  4.9.3  2.25.1    13.4 MB/s   27.0 MB/s
4.15.7  6.4.0  2.25.1    13.4 MB/s   27.0 MB/s
4.15.7  6.4.0  2.29.1    14.4 MB/s   25.0 MB/s
Interesting bad results!
It's hard to understand why the performance would deteriorate so much
but I see essentially the same behavior.
The performance difference between with and without long calls is
exaggerated by the I/O test used for the above results. I see 22:05 and
21:39 hours for a gcc build and check with and without kernel long calls
on c8000, respectively.
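To put the two measurements side by side, here is a small sketch of the arithmetic. The numbers are copied from this thread; the helper names are made up for illustration and are not part of any tool.

```python
# Figures quoted in this thread: I/O throughput in MB/s and gcc
# build-and-check wall time in hh:mm, with and without -mlong-calls.
IO_ROWS = [
    # (kernel, gcc, binutils, with_mlong, without_mlong)
    ("4.15.7", "4.9.3", "2.25.1", 13.4, 27.0),
    ("4.15.7", "6.4.0", "2.25.1", 13.4, 27.0),
    ("4.15.7", "6.4.0", "2.29.1", 14.4, 25.0),
]


def to_minutes(hhmm):
    """Convert an 'hh:mm' duration string to minutes."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def io_slowdown(with_mlong, without_mlong):
    """Throughput ratio: how many times faster the short-call kernel is."""
    return without_mlong / with_mlong


def build_overhead_percent(with_time, without_time):
    """Extra build time with long calls, as a percentage of the short-call time."""
    w = to_minutes(with_time)
    wo = to_minutes(without_time)
    return (w - wo) / wo * 100.0


if __name__ == "__main__":
    for kernel, gcc, binutils, w, wo in IO_ROWS:
        print(kernel, gcc, binutils, "I/O ratio: %.2f" % io_slowdown(w, wo))
    print("build overhead: %.1f%%" % build_overhead_percent("22:05", "21:39"))
```

The I/O test shows roughly a factor of two, while the gcc build differs by 26 minutes out of about 21.5 hours.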
I think the poor performance of long calls is primarily due to the loads
which can trigger TLB misses. This implies we should work to minimize the
impact of TLB
Flushing the whole TLB is quite detrimental to overall performance and it
doesn't scale well to multiple CPUs. On rp3440, a pdc instruction takes
about 570 cycles because of the broadcast to other CPUs. So, we need to
know whether a mapping is local and possibly the set of CPUs a mapping
applies to.
We don't think we need function descriptors in the kernel. They are
only needed to load a new PIC register.
Speaking of debian kernel, it's nearly impossible to link a kernel
Compiling without mlong-calls generates this (R_PARISC_PCREL22F):
On PA 2.0, this is a 22-bit pc-relative call that has a branch distance
of 8 MB. We have no stub support in the GNU 64-bit linker. If we had stub
support, this would be best
In addition to the argument registers, the argument pointer needs to
be loaded for each call.
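The 8 MB branch distance for the 22-bit pc-relative call follows from a signed word displacement on 4-byte instructions. A rough sketch of that arithmetic, with hypothetical helper names, and ignoring the small fixed offset the hardware adds to the pc before applying the displacement:

```python
INSN_BYTES = 4  # PA-RISC instructions are 4 bytes wide


def pcrel_reach_bytes(displacement_bits):
    """Reach in each direction of a signed word displacement of this width."""
    words = 1 << (displacement_bits - 1)  # half the range is negative
    return words * INSN_BYTES


def fits_short_call(caller_addr, callee_addr, displacement_bits=22):
    """True if the callee is within reach of a direct pc-relative call.

    Simplified model: treats the reach as symmetric and ignores the
    pc offset, so treat results near the boundary as approximate.
    """
    return abs(callee_addr - caller_addr) < pcrel_reach_bytes(displacement_bits)


if __name__ == "__main__":
    print(pcrel_reach_bytes(22) // (1024 * 1024), "MB for a 22-bit displacement")
    print(pcrel_reach_bytes(17) // 1024, "KB for a 17-bit displacement")
```

A kernel image plus modules larger than this reach is where either linker stubs or a long call sequence becomes necessary.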
With -mlong-calls it is much more complex:
This is the standard 64-bit indirect call. It calls via a function
descriptor. It assumes the PIC register may change and the callee may be
in a different space (i.e., 64-bit HP-UX runtime). The bve instruction is
specific to PA 2.0.
In the kernel, we probably don't need the load of the new PIC register
(omitted from the above).
So, we can load the function address directly from the linkage table.
The above sequence is PIC. It is the same length as the one suggested by
Helge below and the linker could convert it to Helge's sequence when the
call is not external to the main Linux kernel.
It does have one load that might trigger a TLB miss.
I don't know enough about the call sequences used to call functions in
external modules but it might be easier to do the relocation for the
above. It's probably already handled as the addil/ldd sequence should
already load the address of external_function.
It might also be possible to use a 32-bit PIC pc-relative sequence, but
it is longer and 32-bit pc-relative relocations might not be supported.
Since our kernel is running in the first 4GB of RAM (even on 64bit),
couldn't we instead introduce a gcc option, e.g.
"-mkernel-indirect-calls", which
    ldil L%external_func, %r2       // R_PARISC_DIR21L
    ldo R%external_func(%r2), %r2   // R_PARISC_DIR14R
The above sequence is not PIC. What about modules?
Another option is to use ble (i.e., call sequence generated using
-mfast-indirect-calls). It yields the same length call sequence as your
above sequence and it works on both PA 1.x and 2.0.
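For readers less familiar with the L%/R% selectors in the ldil/ldo pair quoted in this message, here is a minimal sketch of how a 32-bit absolute address splits into the two fields. It assumes the plain (non-rounding) selectors, with L% covering the upper 21 bits and R% the lower 11 bits. The function name is hypothetical, and the sketch says nothing about how the upper half of a 64-bit register ends up being filled.

```python
R_FIELD_BITS = 11                    # assumed width of the R% field
R_MASK = (1 << R_FIELD_BITS) - 1     # 0x7ff


def split_lr(addr):
    """Split a 32-bit absolute address into (L% part, R% part).

    ldil would supply the left part and ldo would add the right part,
    so left + right must reproduce the original address.
    """
    if not 0 <= addr < (1 << 32):
        raise ValueError("address does not fit in the first 4 GB")
    left = addr & ~R_MASK    # upper 21 bits, low 11 bits cleared
    right = addr & R_MASK    # lower 11 bits
    return left, right


if __name__ == "__main__":
    left, right = split_lr(0x40123ABC)
    print(hex(left), hex(right), hex(left + right))
```

Because the address is baked into the instruction pair at link time, the result is an absolute (non-PIC) sequence, which is why it only makes sense when the target is known to sit in the first 4 GB.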
In the above three sequences, there is a delay slot after the branch
which might be filled by the compiler with a
Does -mfast-indirect-calls have any effect at all?
I haven't seen any difference when using this option.
At the moment, this option only applies to the 32-bit compiler.
I don't remember any huge increase in gcc build time with
-mlong-calls. Calls don't usually dominate performance.
John David Anglin email@example.com