
Re: About the sense of removing -march=native (Was: Is theano worth saving?)



On 09.02.2017 07:23, lumin wrote:
> 
>> So do you think we are doing a bad service to our users by stripping
>> -march=native?  Could you please provide some numbers?
> 
> No, we are not doing bad. Nobody is wrong. We cannot gain compatibility
> and performance at the same time. I don't remember the exact numbers of
> those experiments conducted 4 months ago. Here's the fuzzy data
> on my Torch-based program:
> 
>  (1) generic openBLAS:
>      i7-6900K is only capable of ~1 experiment process at the same time.
>      E5-2687Wv4 ~1
>  (2) -march=native openBLAS:
>      i7-6900K is only capable of ~2 experiment processes at the same time.
>      E5-2687Wv4 ~2
>  (3) -march=native openBLAS + proper OMP_NUM_THREADS:
>      i7-6900K is capable of ~6 experiment processes at the same time.
>      E5-2687Wv4 ~8
>  (4) generic OpenBLAS + proper OMP_NUM_THREADS : not tested.
> 
> So the tuned OpenBLAS is >= 6x "faster" for us...
> Sorry for the ambiguous word "faster".
> 
> I wrote the -march=native example to illustrate that some
> users need to compile specific software themselves and don't
> really need a .deb package.

These numbers are very surprising to me.
OpenBLAS's main advantage is that it detects the CPU at runtime,
and all its performance-sensitive parts are written in assembly (which
is not a good thing; you should be using intrinsics nowadays).
So -march=native, which only affects compiler-generated code, should
do nothing.
Last time I tested it against a tuned ATLAS, it performed the same.

Are you maybe testing very small matrices? OpenBLAS performs horribly
with those, and -march=native *could* improve this by reducing the
overhead a tiny bit.

The best thing we can do is encourage upstreams to do runtime detection,
as OpenBLAS, FFTW3 and now also NumPy do.
This is easiest for the distributions, but unfortunately it is a lot
more work for upstream.
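
To make "runtime detection" concrete, here is a minimal sketch of manual
dispatch in C. It is not how OpenBLAS, FFTW3 or NumPy implement it; the
names dot, dot_generic, dot_avx2 and select_dot are hypothetical, and it
assumes GCC (or a compatible compiler) on x86-64, falling back to the
generic version elsewhere.

```c
#include <stddef.h>

typedef float (*dot_fn)(size_t, const float *, const float *);

/* Baseline version, built with the distribution's generic flags. */
static float dot_generic(size_t n, const float *x, const float *y)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}

#if defined(__GNUC__) && defined(__x86_64__)
/* Same source, but the compiler is allowed to use AVX2 in this function. */
__attribute__((target("avx2")))
static float dot_avx2(size_t n, const float *x, const float *y)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}
#endif

/* Ask the CPU what it supports and pick an implementation accordingly. */
static dot_fn select_dot(void)
{
#if defined(__GNUC__) && defined(__x86_64__)
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2"))
        return dot_avx2;
#endif
    return dot_generic;
}

/* Public entry point; resolves the implementation on first use.
   Assumption: single-threaded first call (no locking in this sketch). */
float dot(size_t n, const float *x, const float *y)
{
    static dot_fn impl;
    if (!impl)
        impl = select_dot();
    return impl(n, x, y);
}
```

The binary stays runnable on any x86-64 CPU because the AVX2 code is only
reached after the check succeeds. The cost for upstream is that every hot
function needs this kind of wrapper.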

Btw., since GCC 6 the compiler can automatically clone a function for
multiple instruction sets and pick the matching clone at runtime, which
can simplify this a bit.
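
A minimal sketch of that GCC feature, the target_clones function
attribute. The saxpy kernel and the MULTIVERSION macro are hypothetical
names for illustration, and the chosen targets (avx2, sse4.2) are just an
example. It assumes GCC 6 or later on x86-64 GNU/Linux; on other setups
the macro expands to nothing and only the generic version is built.

```c
#include <stddef.h>

/* Assumption: target_clones needs GCC >= 6 and ifunc support (glibc).
   Anywhere else we compile a single generic version. */
#if defined(__GNUC__) && !defined(__clang__) && \
    defined(__x86_64__) && defined(__gnu_linux__)
#define MULTIVERSION \
    __attribute__((target_clones("avx2", "sse4.2", "default")))
#else
#define MULTIVERSION
#endif

/* Hypothetical kernel: y[i] += a * x[i].
   GCC emits one clone per listed target plus a resolver that selects
   the best clone for the running CPU when the program is loaded. */
MULTIVERSION
void saxpy(size_t n, float a, const float *x, float *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}
```

Callers just call saxpy() as usual; no hand-written dispatch code is
needed. The clones only differ if the compiler actually vectorizes the
loop, so this is meant to be built with optimization enabled.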

cheers,
Julian

