Re: fftw: Usage of SSE in 64bit?

On 6/21/11 1:40 AM, Carsten Aulbert wrote:
>> In addition to x86-64, note that this is SAFE to enable in general for
>> all 32-bit x86 platforms.  FFTW checks at runtime to see whether the
>> processor supports SSE/SSE2 and disables its SSE/SSE2 code if not.
>> (Similarly for Altivec on PowerPC, and similarly in the next release for
>> AVX instructions.)
>
> Well, that depends on what you are aiming for. If you want a single 32-bit
> x86 package that is guaranteed to work on all x86-compatible CPUs out there,
> starting at, say, Pentium II level, you have to ensure that this will still
> work. In my case, where I have ~1800 computers doing number crunching and
> all are 64-bit, this is a different matter than the one Debian faces for
> packaging.

Your example seems a little off because presumably your cluster uses the amd64 distro and would not be using i386 packages at all.

However, the larger point is that FFTW is designed to reduce this tension between portability and performance. Running in 32-bit mode, it is indeed the case that you can have a *single* FFTW binary that runs on everything from (literally) a 386 to a modern processor, and still get near-optimal performance (for 32-bit mode) on the modern processor. Features like SSE2 are automatically enabled on the modern processor and disabled on old processors like the 386, because we explicitly segregated the new instructions into separate kernels that we can disable at runtime.

This way, Debian can have a single binary package of FFTW for each architecture without sacrificing performance on modern processors.
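
To make the runtime-dispatch idea above concrete, here is a minimal sketch in C. It is NOT FFTW's actual detection code: the kernel names (add_scalar, add_sse2, select_add_kernel, add_arrays) are hypothetical, and it assumes GCC or Clang on x86 for the __builtin_cpu_supports check and the target("sse2") attribute. On any other compiler or architecture it simply falls back to the portable kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Signature shared by all kernels: out[i] = a[i] + b[i]. */
typedef void (*add_kernel)(const double *a, const double *b,
                           double *out, size_t n);

/* Portable fallback: plain C, runs on every CPU the binary targets. */
void add_scalar(const double *a, const double *b, double *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}

#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__i386__) || defined(__x86_64__))
#include <emmintrin.h>
#define HAVE_X86_DISPATCH 1

/* SSE2 kernel, kept in its own function so that the rest of the
 * program can still be compiled for a baseline CPU. */
__attribute__((target("sse2")))
void add_sse2(const double *a, const double *b, double *out, size_t n)
{
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {          /* two doubles per register */
        __m128d va = _mm_loadu_pd(a + i);
        __m128d vb = _mm_loadu_pd(b + i);
        _mm_storeu_pd(out + i, _mm_add_pd(va, vb));
    }
    for (; i < n; ++i)                    /* leftover element, if any */
        out[i] = a[i] + b[i];
}
#endif

/* Ask the CPU at runtime which kernel it can run. */
add_kernel select_add_kernel(void)
{
#ifdef HAVE_X86_DISPATCH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse2"))
        return add_sse2;
#endif
    return add_scalar;
}

/* Public entry point: the kernel is chosen once, on first use. */
void add_arrays(const double *a, const double *b, double *out, size_t n)
{
    static add_kernel kernel = NULL;
    if (kernel == NULL)
        kernel = select_add_kernel();
    assert(kernel != NULL);
    kernel(a, b, out, n);
}
```

Callers only ever use add_arrays(); the same binary then takes the SSE2 path on a processor that reports SSE2 and the scalar path everywhere else, which is the property that lets a distribution ship one package per architecture.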

(Note that Debian should configure FFTW with --enable-portable-binary to use -mtune instead of -march ... last I checked, Debian already did this. From what I recall, this makes a near-negligible difference in performance. FFTW may be somewhat unusual in that it doesn't benefit too much from arch-specific compiler cleverness ... indeed, we actually have to manually disable some of gcc's optimizations to prevent them from screwing up our code schedule.)

>> For benchmarking, I would recommend using the "bench" program that comes
>> with FFTW. e.g. you can compare for a size-1024 FFT with and without the
>> SSE/SSE2 kernels just by doing:
>>       ./bench -opatient 1024
>>       ./bench -opatient -onosimd 1024
>> On my 64-bit Intel Xeon E5440 running FFTW 3.2.2 and Debian GNU/Linux,
>> the SSE/SSE2 version is faster for size 1024 by a factor of 1.7 in
>> double precision and by a factor of 3.4 in single precision.
>
> Interesting. I think I need to rerun my tests, but then again this could
> be because I was just using a 'measured' plan.

No, I get exactly the same performance in measured mode (omit -opatient above) as in patient mode -- for such a small transform, both give the same algorithm. (There is a sacrifice in estimate [-oestimate] mode, but even there I get a 1.5x speedup in double precision and a 3.32x speedup in single precision.) This is just a stock FFTW 3.2.2 built with ./configure --enable-sse --enable-float (single precision) or ./configure --enable-sse2 (double precision), with FFTW's default compiler flags.

Possibly you are using FFTW suboptimally in some other way, or there is a problem with your benchmark. For example, are you including the plan creation time (or worse, re-creating the plan for each transform)? Or possibly you have some other problem: for example, if you repeatedly FFT the same nonzero array, the process diverges (FFTW's transforms are unnormalized, so the values grow with every pass) and eventually you are timing floating-point exceptions. If you don't obtain speedups comparable to mine using FFTW's bench program as above, please email fftw@fftw.org.
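
To illustrate the "diverging process" pitfall, here is a small self-contained sketch in C. It does not use FFTW: dft8 is a hypothetical, naive size-8 unnormalized DFT standing in for the real transform, and the other names are likewise made up for illustration. For an unnormalized DFT of size n, the sum of squared magnitudes grows by a factor of n on every pass (Parseval), so re-transforming the same array in a timing loop eventually overflows.

```c
#include <assert.h>
#include <float.h>

#define N 8
#define INV_SQRT2 0.70710678118654752440   /* cos(pi/4) = sin(pi/4) */

/* Twiddle factors w^k = exp(-2*pi*i*k/8), k = 0..7, as (re, im). */
static const double TW[N][2] = {
    { 1.0,        0.0      }, {  INV_SQRT2, -INV_SQRT2 },
    { 0.0,       -1.0      }, { -INV_SQRT2, -INV_SQRT2 },
    {-1.0,        0.0      }, { -INV_SQRT2,  INV_SQRT2 },
    { 0.0,        1.0      }, {  INV_SQRT2,  INV_SQRT2 }
};

/* Naive unnormalized forward DFT, in place:
 * X[k] = sum_j x[j] * w^(j*k). */
void dft8(double re[N], double im[N])
{
    double out_re[N], out_im[N];
    for (int k = 0; k < N; ++k) {
        double sr = 0.0, si = 0.0;
        for (int j = 0; j < N; ++j) {
            const double *w = TW[(j * k) % N];
            sr += re[j] * w[0] - im[j] * w[1];
            si += re[j] * w[1] + im[j] * w[0];
        }
        out_re[k] = sr;
        out_im[k] = si;
    }
    for (int k = 0; k < N; ++k) {
        re[k] = out_re[k];
        im[k] = out_im[k];
    }
}

/* Sum of squared magnitudes of the array. */
double energy(const double re[N], const double im[N])
{
    double e = 0.0;
    for (int k = 0; k < N; ++k)
        e += re[k] * re[k] + im[k] * im[k];
    return e;
}

/* Re-transform the same array, as a careless timing loop would,
 * until its energy is no longer a finite double. Returns the number
 * of passes taken, or -1 if max_passes is reached first. */
int passes_until_overflow(double re[N], double im[N], int max_passes)
{
    assert(max_passes > 0);
    for (int p = 1; p <= max_passes; ++p) {
        dft8(re, im);
        double e = energy(re, im);
        if (!(e <= DBL_MAX))      /* true for +inf and for NaN */
            return p;
    }
    return -1;
}
```

Starting from a unit impulse, the energy is about 8 after one pass and about 64 after two, and it stops being finite after a few hundred passes in double precision. Larger transforms grow faster per pass. One way a timing loop can sidestep this is to refill the input array before the values get large.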


PS. A general comment: we FFTW authors use Debian ourselves, and we are very willing to offer advice to Debian packagers (or indeed to packagers for any GNU/Linux distro). Although I try to search mailing lists occasionally, it would be easier for us to keep on top of things if Debian made a greater effort to contact upstream authors when issues arise.