Re: cortex / arm-hardfloat-linux-gnueabi (was Re: armelfp: new architecture name for an armel variant)
Please understand we know what we're talking about here :D
We want a new port that uses the hard floating point version of the
EABI - floating point arguments to functions are passed in floating
point registers (sN, dN, qN) as the ABI allows.
This is to get around the fact that in soft (no FPU at all) and softfp
mode (can use FPU) the EABI is defined such that all floating point
arguments to a function are passed in integer registers - r0, r1 and
so on, and in pairs of registers for double arguments. In the case
where FPU instruction generation is enabled (softfp and fpu=vfpv3 for
example - softfp abi does not imply an FPU), there is significant code
inserted by the compiler that moves data from integer to floating
point registers before the FPU can use it.
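To make the difference concrete, here is a minimal sketch in C. It assumes a GCC-style compiler that defines __ARM_PCS_VFP for the hard-float calling convention and __SOFTFP__ for pure soft float; the names float_abi_name() and half() are made up for illustration and are not part of any library.

```c
/* Returns a short label for the float ABI this file was compiled with.
 * Assumption: a GCC-style compiler that defines __ARM_PCS_VFP for the
 * hard-float calling convention and __SOFTFP__ for pure soft float. */
const char *float_abi_name(void)
{
#if defined(__ARM_PCS_VFP)
    return "hard";     /* float arguments travel in s0, d0, ... */
#elif defined(__arm__) && defined(__SOFTFP__)
    return "soft";     /* no FPU instructions, arguments in r0, r1, ... */
#elif defined(__arm__)
    return "softfp";   /* FPU instructions, arguments still in r0, r1, ... */
#else
    return "not-arm";  /* built for some other architecture */
#endif
}

/* One float in, one float out. Under soft and softfp, x arrives in r0
 * and the result leaves in r0; under hard, both use s0. */
float half(float x)
{
    return x * 0.5f;
}
```

Built as softfp with an FPU enabled, a function like half() has to move x from r0 into a floating point register before the multiply and move the result back afterwards. Built as hard, neither move is needed.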
As Konstantinos explained, this is something of the order of 6 moves
around from integer to float and back again for a relatively simple
function like sinf() which takes one floating point argument and
returns one. vmov rN, sN has a penalty of around 20 cycles. No useful
work is done in that time; it stalls the entire pipeline until it is
complete. You can't schedule around it. And it does this 6 times.
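As back-of-the-envelope arithmetic, using only the rough figures quoted above (about 6 moves per call, about 20 cycles per move), the cost can be sketched like this. The function name is hypothetical and the numbers are estimates, not measurements.

```c
/* Estimated cycles lost per call to integer<->float register
 * transfers alone, before any real floating point work is done. */
long transfer_overhead_cycles(long moves_per_call, long cycles_per_move)
{
    return moves_per_call * cycles_per_move;
}
```

With the figures above, transfer_overhead_cycles(6, 20) gives 120 cycles of pure stall for each call to something like sinf().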
The basic trade-offs as you move between the three are:
soft:
* FPU is all emulated. FPU work is done in integer registers.
softfp:
* Actual FPU used, FPU argument passing done in integer registers due
to the soft/softfp EABI spec. Your 10x speedup is here and comes from
using the FPU instead of emulating it
* You can use NEON here but you still are limited to passing float
arguments in integer registers per the ABI
* Each register transfer from integer to float register costs about 20 cycles
* Boost in performance from using the FPU or NEON instead of emulation
* Hidden performance penalty from the register transfers
* Compatible with the above - soft and softfp code can be mixed
hard:
* Actual FPU is used in the same way
* Actual FPU code does not run faster
* Boost in performance from using the FPU or NEON is the same
* No hidden performance penalty
* Completely incompatible ABI with the two above - no code mixing.
That is what we're proposing: the hard-float ABI, coupled with the benefits of
compiling for an improved ISA (ARMv7-A instead of ARMv4) with better,
more efficient instructions, potentially a slightly different strategy
for scheduling instructions, and removing the need to run emulated FPU
library code by specifying VFPv3-D16 as the base level of FPU
required. Using VFPv3-D16 in the base system means not having to deal
with Debian multilib just to get FPU code. Everything is FPU enabled
by default. Debian multilib would be used to enable extra features
such as NEON (which is still not in every ARMv7 processor) and the
FP16 extension (which isn't present on any Cortex-A8).
That's what justifies the port, the fact that the ABI is incompatible,
plus the baseline architecture requirement (it will no longer run on
ARMv4 or ARMv5 .. ARMv6 is possible if you're lucky) plus the baseline
FPU requirement (needs VFPv3-D16 at least).
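For anyone wanting to check a build against that baseline, a minimal sketch in C follows. It assumes a toolchain that defines the ACLE feature macros __ARM_ARCH, __ARM_PCS_VFP and __ARM_NEON; both function names are made up for illustration. With GCC the baseline would typically be selected with -march=armv7-a -mfpu=vfpv3-d16 -mfloat-abi=hard. These macros do not distinguish the D16 register file from the full 32-register one, so the sketch does not try to.

```c
/* Returns 1 when this build targets ARMv7 or later with the hard-float
 * calling convention, 0 when it targets 32-bit ARM but misses that
 * baseline, and -1 when the target is not 32-bit ARM at all.
 * Assumption: the compiler defines the ACLE macros used below. */
int meets_proposed_baseline(void)
{
#if !defined(__arm__)
    return -1;
#elif defined(__ARM_ARCH) && (__ARM_ARCH >= 7) && defined(__ARM_PCS_VFP)
    return 1;
#else
    return 0;
#endif
}

/* Returns 1 when NEON code generation was enabled for this build.
 * NEON is an extra on top of the baseline, not part of it. */
int built_with_neon(void)
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    return 1;
#else
    return 0;
#endif
}
```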
Why VFPv3-D16? Simply because VCVT and VMOV immediate have immediate
optimization opportunities. Converting between integer and floating
point is a very common need (think floor() and ceil() kind of stuff),
and being able to put immediate values in FP registers is the first
thing you learn when optimizing for AltiVec (vec_splat is your
greatest friend!) to reduce the need to access memory which causes
pipeline stalls. Because we're used to using AltiVec we think we'd
absolutely, positively miss that functionality if we had to be
restricted to VFPv2 which does not include them :)
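To give a feel for what VMOV immediate covers, here is a small sketch in C. It relies on my understanding of the encoding, which is that a single-precision constant fits when it can be written as +/-n * 2^-r with 16 <= n <= 31 and 0 <= r <= 7 (roughly 0.125 to 31.0, and not 0.0). It also assumes float is IEEE-754 binary32. The function names are hypothetical, and which instructions a compiler actually emits for truncate_scaled() is up to the compiler and its flags.

```c
#include <stdint.h>
#include <string.h>

/* Returns 1 if value could be loaded with a single VMOV immediate,
 * 0 otherwise. Assumption: float is IEEE-754 binary32, and the
 * encodable set is +/-n * 2^-r with 16 <= n <= 31 and 0 <= r <= 7. */
int fits_vmov_immediate(float value)
{
    uint32_t bits;
    uint32_t fraction;
    int exponent;

    memcpy(&bits, &value, sizeof bits);       /* safe type pun */
    fraction = bits & 0x007FFFFFu;            /* 23 fraction bits */
    exponent = (int)((bits >> 23) & 0xFFu) - 127;

    /* Only the top 4 fraction bits may be set (n = 16 + those bits). */
    if (fraction & 0x0007FFFFu)
        return 0;

    /* n * 2^-r with r in 0..7 means an unbiased exponent of -3..4. */
    return exponent >= -3 && exponent <= 4;
}

/* The floor()/ceil() kind of work: multiply by a constant that fits
 * the immediate form, then convert float to int. */
int truncate_scaled(float x)
{
    return (int)(x * 2.0f);
}
```

Constants such as 1.0f, 2.0f, 0.5f and -2.5f pass the check, while 0.0f, 0.1f and 32.0f do not and would have to come from memory or be built another way.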
No, we don't need to do anything but change the ABI for the purpose of
the port. But there is all the multilib mess of 10 different FPU types
and slightly better architectures (5, 6) than the one the port is
compiled for. This is an opportunity to clean it up a bit and reduce
the workload by standardizing, at the very least, on a common
denominator (which just happens to be the Marvell ARMADA 500) while
running well on Tegra2 and still working at pretty close to the best
possible performance on Snapdragon, i.MX51 and OMAP3.
I am fairly sure you could (oh, you did!) find a contrived benchmark
to show that some code is faster on softfp in some cases. But taking a
holistic approach, I find it hard to believe that every time a
floating point function is called across any of the 20,000 packages
possibly running on a system in a Debian port, you will be able to
benchmark a softfp+vfp system running faster than a hard+vfp one. The
features outlined above in the VFPv3 spec, and the ability to judge
the benefits of VFP vs. NEON without compilers generating special
magic in the way, will help people out and make for a "nicer" system.
Any optimizations made on this "performance blend" of Debian for
armv7+hard+vfp when it comes to NEON will backport easily to the
"armel" port and work just the same with the same relative
improvement; they just won't have the *base* performance of the port.
Anyway, I think everyone agrees that it should be done, just not on the name.
Matt Sealey <firstname.lastname@example.org>
Product Development Analyst, Genesi USA, Inc.