
Summary of fftw findings on amd64 Was: fftw: Usage of SSE in 64bit?



Hi all,

This one dragged on longer than I wanted, but daily work eats up a great deal 
of time :)

Attached to this message you will find 3 graphs and a tarball of summary 
results, all obtained on one of our compute nodes, which was otherwise idle. 
The system is (still) running Debian Lenny in the amd64 flavor. The CPU is a 
Xeon X3220, 2.4 GHz quad-core (though I only used a single core for these 
tests), with 8 GB RAM. FFT sizes range from 2**4 to 2**27 points, all 
out-of-place transforms.

I used the stock Debian package (3.1.2-3.1) as a reference and recompiled 
versions thereof with different options (a mixture of --enable-sse, 
--enable-fma and --enable-alloca). The most significant change came from using 
SSE on amd64, which gave almost a factor of two in speed. It's true that gcc 
automatically enables SSE on 64bit, but it seems FFTW also has special code 
optimizations for SSE, which the stock Debian fftw does not use.

Thus my request would be to use --enable-sse on amd64 as well, i.e. patch the 
debian/rules file.

OK, let the discussion begin ;)

Cheers

Carsten

Gory details:

%%%%%%%%%%%%%%%%%%
debian-vs-optim-estimated-plan.svg

These tests were performed with FFTW_ESTIMATE plans and the following FFTW 
compile options:
debian: stock libraries from /usr/lib
alloca: recompiled with --enable-alloca
fftw-default: baseline check, recompiled fftw without special options
fma: recompiled with --enable-fma
fma-sse-alloca: recompiled with --enable-fma --enable-alloca --enable-sse
sse: recompiled with --enable-sse
core2+all else: recompiled with -mtune=core2 and all of fma-sse-alloca

The clear result from this (apart from hitting different CPU cache size 
limits) is that just enabling SSE yields a performance boost of up to 100%.

%%%%%%%%%%%%%%%%%%%
debian-vs-optim-measure-plan.svg

Same as above, but now with FFTW_MEASURE, yielding essentially the same 
conclusion: we want to have better amd64 libraries in Debian ;)

%%%%%%%%%%%%%%%%%%%
debian-final.svg

Final comparison between stock Debian fftw and the version recompiled with 
--enable-sse. Here one sees multiple things:

* Users want to have --enable-sse for amd64 :)
* Users should always use FFTW_MEASURE (or an even more thorough planner 
setting, and save the resulting plan) if they plan to use fftw heavily.

%%%%%%%%%%%%%%%%%%%
Raw result files have a simple column-oriented structure:

1. size of FFT
2. theoretically needed flops to perform one FFT (5*N*log_2(N)/2)
3. time for plan generation in microseconds
4. time per FFT in nanoseconds
5. number of iterations (each test ran for at least 60s)
6. theoretical MFlops/s of CPU (that's what is plotted above)

The last 4 columns are repeated for each plan, i.e. here once for 
FFTW_ESTIMATE and once for FFTW_MEASURE.
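
For anyone who wants to re-derive the plotted numbers from the raw files, here 
is a minimal Python sketch of the column layout described above. It rests on 
assumptions not stated in this mail: that the columns are whitespace-separated, 
and that column 6 is simply column 2 divided by column 4 (converted to 
MFlops/s). The function names and the example row are made up for 
illustration and are not taken from the tarball.

```python
import math

# Plan order as described above: the last 4 columns repeat once per plan.
PLANS = ("FFTW_ESTIMATE", "FFTW_MEASURE")


def nominal_flops(n):
    """Nominal operation count for one FFT of size n: 5*N*log2(N)/2 (column 2)."""
    return 5.0 * n * math.log2(n) / 2.0


def mflops_per_s(n, ns_per_fft):
    """Nominal MFlops/s from the time per FFT in nanoseconds.

    Assumption: column 6 = column 2 / column 4, i.e.
    flops / (ns * 1e-9 s) / 1e6 = flops * 1e3 / ns.
    """
    return nominal_flops(n) * 1e3 / ns_per_fft


def parse_row(line):
    """Split one result line into size, flops and a per-plan dictionary.

    Assumption: fields are separated by whitespace.
    """
    fields = line.split()
    row = {"size": int(fields[0]), "flops": float(fields[1]), "plans": {}}
    for i, plan in enumerate(PLANS):
        plan_us, fft_ns, iters, mflops = fields[2 + 4 * i : 6 + 4 * i]
        row["plans"][plan] = {
            "plan_time_us": float(plan_us),   # column 3
            "fft_time_ns": float(fft_ns),     # column 4
            "iterations": int(float(iters)),  # column 5
            "mflops": float(mflops),          # column 6
        }
    return row


# Hypothetical row, invented only to exercise the parser.
EXAMPLE_ROW = "1024 25600 150.0 10000.0 6000000 2560.0 900000.0 5000.0 12000000 5120.0"

if __name__ == "__main__":
    row = parse_row(EXAMPLE_ROW)
    for plan, cols in row["plans"].items():
        recomputed = mflops_per_s(row["size"], cols["fft_time_ns"])
        print(plan, cols["mflops"], recomputed)
```

If the recomputed value and column 6 disagree for the real files, the assumed 
definition of column 6 is wrong and should be checked against the benchmark 
source.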

Attachment: debian-vs-optim-measure-plan.svg
Description: image/svg

Attachment: debian-vs-optim-estimated-plan.svg
Description: image/svg

Attachment: debian-final.svg
Description: image/svg

Attachment: fftw_raw_results.tar.gz
Description: application/compressed-tar

