
Re: cortex / arm-hardfloat-linux-gnueabi (was Re: armelfp: new architecture name for an armel variant)



On Thu, Jul 15, 2010 at 11:19 AM, Paul Brook <paul@codesourcery.com> wrote:
>> > Enabling use of VFP does not require use of the hard-float ABI. Please
>> > don't confuse the two.
>>
>> The whole point of the port is that we get rid of the softfloat ABI in
>> order to use the VFP unit without playing around moving
>> registers around. This sort of came about from Konstantinos' porting
>> of the Eigen2 library (after he had done it for AltiVec)
>> to NEON and some of the developers noticed it wasn't so much faster
>> because gcc inserts what can only be described as
>> evil between the start of the function and the real meat of the code.
>> The pipeline stalls for register movement are noticeable
>> in real code as a 20% or higher performance hit.
>
> Yes, but the point I was responding to is that you don't necessarily need to
> use hard-float ABI to get most of the performance gain.

I believe Konstantinos when he says we sort of do.

I've been working with him on AltiVec stuff for a long while, and the
performance gains we saw and tested there (work that eventually got
picked up by YellowDog Linux) were pretty good. Exactly what you would
expect.

Using the softfp ABI means that code you expected to run exactly 4x
faster actually can't, because the compiler inserts a significant
prologue and epilogue whose register moves cause pipeline stalls. This
means the 4x faster code isn't running for most of the lifetime of the
function.
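
A minimal sketch of where those register moves come from, assuming a
gcc that targets ARM. The function name `scale_add` and the file name
`scale.c` are made up for illustration; the flags are standard gcc ARM
options. Under softfp the float arguments arrive in core registers
(r0-r2) and the result goes back in r0, so the compiler has to move
them into VFP registers and out again. Under the hard-float ABI they
arrive in s0-s2 and the result stays in s0.

```c
/*
 * Hypothetical example file: scale.c
 *
 * Build the same source with both ABIs and compare the assembly:
 *
 *   gcc -O2 -march=armv7-a -mfpu=vfpv3-d16 -mfloat-abi=softfp -S scale.c
 *   gcc -O2 -march=armv7-a -mfpu=vfpv3-d16 -mfloat-abi=hard   -S scale.c
 *
 * softfp: a, x, y are passed in r0, r1, r2; the function must move them
 *         into VFP registers before the multiply-add and move the
 *         result back to r0 before returning.
 * hard:   a, x, y are passed in s0, s1, s2 and the result is returned
 *         in s0, so no core-to-VFP moves are needed at the call
 *         boundary.
 */
float scale_add(float a, float x, float y)
{
    return a * x + y;
}
```

The smaller the function body, the larger the share of its run time
that goes to those moves, which is why short math kernels are hit
hardest.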

> I completely agree that if you want to use the hard-float ABI then you need a
> new port.
>
> However changing the ABI doesn't solve many of the underlying problems.
> Specifically how to provide optimized binaries that take advantage of new
> features on modern CPUs while still supporting older hardware.

... the point is to not support older hardware. We picked a base level:
armv7-a and vfpv3-d16 is our target for it. This is the same way lpia
picked the Pentium III core, SSE2 as an FPU and certain other
optimizations as the basis for that port. Not only does compiling for
that CPU and the hardfloat ABI increase performance (instruction set
improvements, reduced amount of code), it should in the same way
improve battery life (lpia was reported as a 10% reduction in power
usage).

10% on a 90 minute battery for Atom notebooks is not much, but we know
we can already get 6 hours on a Cortex-A8 with a 2 cell battery. Bump
the battery size to 4 cells and it goes up, oddly enough, to about 12
hours. A 10% improvement in power usage on 12 hours is an extra hour of
battery life, or 30 minutes for the smaller size. On Atom the same 10%
buys about enough time to put the system into an emergency standby
state and keep it on suspend current until it reaches an AC adapter.
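
The battery arithmetic can be sketched as below. This is a
back-of-the-envelope model, assuming a fixed battery capacity and a
uniform drop in average power draw; `extra_runtime_minutes` is a
hypothetical helper, not something from lpia or any port.

```c
/*
 * Extra run time gained from a fractional reduction in average power
 * draw, assuming the battery holds a fixed amount of energy.
 *
 *   new_runtime = baseline / (1 - power_reduction)
 *
 * baseline_minutes: run time before the improvement, in minutes
 * power_reduction:  fractional saving, e.g. 0.10 for 10%
 */
double extra_runtime_minutes(double baseline_minutes, double power_reduction)
{
    double new_runtime = baseline_minutes / (1.0 - power_reduction);
    return new_runtime - baseline_minutes;
}
```

Called with 720 minutes (12 hours) and 0.10 it gives roughly 80
minutes; with 360 minutes roughly 40; with 90 minutes roughly 10. That
is the same ballpark as the figures above.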

> Switching to the hard-float ABI certainly does give some benefit. While 20%
> isn't a trivial difference, it's important to keep this in context.  This is
> on top of what I'd guess is a 10x (i.e. 1000%) speedup achieved without
> breaking the ABI and requiring a whole new port.

How do you figure a 10x speedup?

> about performance then a NEON optimized version of your critical code should
> get you another 4x or so on a Cortex-A8.

Yes, it's about 4x mathematically, but 2x in practice because of the
ABI fudging.
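
One way to read the 4x versus 2x gap is as a fixed per-call cost that
the NEON speedup does not touch. The toy model below assumes exactly
that; `effective_speedup` and `overhead_fraction` are hypothetical
names, and the 0.25 used in the note after the code is chosen only
because it reproduces the 2x figure, not because it was measured.

```c
/*
 * Toy model: the kernel itself gets kernel_speedup times faster, but
 * each call also pays a fixed overhead (argument and result register
 * moves) that is not accelerated.
 *
 * kernel_speedup:    ideal speedup of the math, e.g. 4.0 for NEON
 * overhead_fraction: fixed overhead per call, expressed as a fraction
 *                    of the original scalar kernel time
 *
 * Returns the original kernel time divided by (overhead + accelerated
 * kernel time).
 */
double effective_speedup(double kernel_speedup, double overhead_fraction)
{
    return 1.0 / (overhead_fraction + 1.0 / kernel_speedup);
}
```

With a 4x kernel and an overhead equal to a quarter of the original
kernel time, the accelerated function spends as long shuffling
registers as it does computing, and the observed speedup is 2x. With
zero overhead the model returns the full 4x.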

>> What would not be so great is that even if it was fixed, the option to
>> use a faster floating point ABI drags in a clone of
>> every package on your system (at the very least, libc, libm, and all
>> the system library dependencies) increasing the
>> size of the installed system.
>
> What you're describing here is multiarch.

Yes, which is needed anyway to support NEON where it's available. But
we're considering taking a more comprehensive base CPU and FPU
requirement and adding a little bit of multiarch (NEON, -fp16), instead
of taking the lowest common denominator, adding an entire distro's
worth of multiarch, and not seeing anything like the performance
improvement you'd expect.

Using the hard float ABI and picking a base level of the ARM
architecture means the port is going to run its best on a certain
subset of CPUs which are going into smartphones and smartbooks right
now - OMAP3, iMX51, iMX53, Snapdragon, Samsung/Apple.. all benefit.
Beagleboard, EfikaMX, Nexus One, Tegra2...

That it won't run on other CPUs of a lower pedigree is unfortunate, but
armel will always still work for them. What you have to do is weigh the
advantage of building for last year's base system against the status
quo of 15 years ago. In fact, ARM obsoleted 90% of the architecture and
cores that "armel" runs on. This base level we have chosen is the new
base level...

(Genesi has a commercial interest in it running on certain ARM926EJ-S
cores and a couple of ARM11 cores too, and it won't - i.MX27, i.MX37
and Toshiba TMPA910 will be left out. It's not like we're focusing
entirely on what we want to do. This is actually going to be a benefit
to every modern smartbook processor, basically lpia done right)

-- 
Matt

