Re: fftw: Usage of SSE in 64bit?
On 6/21/11 1:40 AM, Carsten Aulbert wrote:
In addition to x86-64, note that this is SAFE to enable in general for
all 32-bit x86 platforms. FFTW checks at runtime to see whether the
processor supports SSE/SSE2 and disables its SSE/SSE2 code if not.
(Similarly for Altivec on PowerPC, and similarly in the next release for
Well, that depends what you are aiming for. If you want to have a single 32bit
x86 package which is guaranteed to work for all x86 compatible CPUs out there
starting say at a Pentium II level you have to ensure that this will still
work - for my case where I have ~ 1800 computers doing number crunching and
all are 64bit this is another matter than the one Debian has for packaging.
Your example seems a little off because presumably your cluster uses the
amd64 distro and would not be using i386 packages at all.
However, the larger point is that FFTW is designed to reduce this
tension between portability and performance. Running in 32-bit mode, it
is indeed the case that you can have a *single* FFTW binary that runs on
everything from (literally) a 386 to a modern processor, and still get
near-optimal performance (for 32-bit mode) on the modern processor.
Features like SSE2 are automatically enabled on the modern processor and
disabled on old processors like the 386, because we explicitly
segregated the new instructions into separate kernels that we can
disable at runtime.
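To make the idea of runtime kernel selection concrete, here is a minimal
sketch in Python. It is only an illustration of the dispatch pattern, not
FFTW's actual code: the kernel functions, the KERNELS table, and
select_kernel are all hypothetical names, and the set of CPU features is
passed in by the caller rather than detected from the hardware.

```python
def scalar_kernel(values):
    """Portable fallback: uses nothing beyond baseline instructions."""
    return [2.0 * v for v in values]


def simd_kernel(values):
    """Stand-in for a kernel built from SSE2 instructions; same result as the fallback."""
    return [v + v for v in values]


# Kernels listed from most to least demanding; each names the CPU feature it needs.
# None means "no special feature required".
KERNELS = [("sse2", simd_kernel), (None, scalar_kernel)]


def select_kernel(cpu_features):
    """Return the first kernel whose required feature the CPU reports."""
    for required, kernel in KERNELS:
        if required is None or required in cpu_features:
            return kernel
    raise RuntimeError("no usable kernel")
```

Because the choice happens when the program runs, the same binary can call
select_kernel({"sse2"}) on a modern processor and select_kernel(set()) on an
old one, and both paths return the same numerical result.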
This way, Debian can have a single binary package of FFTW for each
architecture without sacrificing performance on modern processors.
(Note that Debian should configure FFTW with --enable-portable-binary to
use -mtune instead of -march ... last I checked, Debian already did
this. From what I recall, this makes a near-negligible difference in
performance. FFTW may be somewhat unusual in that it doesn't benefit
too much from arch-specific compiler cleverness ... indeed, we actually
have to manually disable some of gcc's optimizations to prevent them
from screwing up our code schedule.)
For benchmarking, I would recommend using the "bench" program that comes
with FFTW. For example, you can compare a size-1024 FFT with and without
the SSE/SSE2 kernels just by doing:
    ./bench -opatient 1024
    ./bench -opatient -onosimd 1024
On my 64-bit Intel Xeon E5440 running FFTW 3.2.2 and Debian GNU/Linux,
the SSE/SSE2 version is faster for size 1024 by a factor of 1.7 in
double precision and by a factor of 3.4 in single precision.
Interesting, I think I need to rerun my tests again but then again this could
be that I was just using a 'measured' plan.
No, I get exactly the same performance in measured mode (omit -opatient
above) vs. patient mode -- for such a small transform they give the same
algorithm. (There is a sacrifice in estimate [-oestimate] mode, but
even there I get a speedup of 1.5 in double precision and 3.32 in
single precision.) This is just a stock FFTW 3.2.2 built with
./configure --enable-sse --enable-float (single precision) or
./configure --enable-sse2 (double precision), with FFTW's default
compiler flags.
Possibly you are using FFTW suboptimally in some other way, or there is
a problem with your benchmark. For example, are you including the plan
creation time (or worse, re-creating the plan for each transform)? Or
possibly you have some other problem. For instance, if you repeatedly
FFT the same nonzero array, it is a diverging process (FFTW's transforms
are unnormalized, so the values grow with every pass) and eventually you
are timing floating-point exceptions. If you don't obtain speedups
comparable to mine using FFTW's bench program as above, please email
firstname.lastname@example.org.
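The two benchmarking pitfalls mentioned above can be sketched in Python
as follows. This is a rough illustration under stated assumptions, not
FFTW code: a naive O(n^2) unnormalized DFT stands in for FFTW, the
precomputed twiddle table stands in for a plan, and make_plan, dft,
time_transform and growth_per_pass are hypothetical names.

```python
import cmath
import math
import time


def make_plan(n):
    """Precompute the twiddle factors once (stands in for plan creation)."""
    return [cmath.exp(-2j * math.pi * k / n) for k in range(n)]


def dft(plan, x):
    """Unnormalized O(n^2) DFT using the precomputed twiddles."""
    n = len(x)
    return [sum(x[j] * plan[(j * k) % n] for j in range(n)) for k in range(n)]


def norm(x):
    """Euclidean norm of a complex sequence."""
    return math.sqrt(sum(abs(v) ** 2 for v in x))


def time_transform(n, repeats):
    """Average seconds per transform, excluding plan creation."""
    plan = make_plan(n)                      # created once, outside the timed region
    data = [complex(j % 7, 0.0) for j in range(n)]
    start = time.perf_counter()
    for _ in range(repeats):
        dft(plan, data)                      # input is never overwritten, so values stay bounded
    return (time.perf_counter() - start) / repeats


def growth_per_pass(n):
    """Factor by which the norm grows if the output is fed back in as input."""
    plan = make_plan(n)
    x = [complex(j % 7, 0.0) for j in range(n)]
    return norm(dft(plan, x)) / norm(x)
```

By Parseval's theorem, growth_per_pass(n) is sqrt(n) for an unnormalized
transform, so a loop that keeps transforming its own output overflows
after enough passes, which is why the timing loop above leaves its input
untouched.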
PS. A general comment: we FFTW authors use Debian ourselves, and we are
very willing to offer advice to Debian packagers (or indeed to packagers
for any GNU/Linux distro). Although I try to search mailing lists
occasionally, it would be easier for us to keep on top of things if
Debian made a greater effort to contact upstream authors when issues arise.