Re: Fast blas1
Camm Maguire wrote:
> Greetings, and thanks for your work on the blas!
Thanks, but it's really not my work, Kazushige Goto deserves the credit. I just
spent a couple of hours plugging it in.
> Have you by chance timed these routines against the atlas package blas? Atlas
> automatically tunes the blas for your particular hardware, and is open source.
No, but I'll give it a try; I suspect the Goto routines (dgemm in particular) will
be at least twice as fast because he does some really interesting unrolling
things, and it's written in very tight assembler. You should see the inner loop
of the gemm routine: it has exactly one add, one multiply, and two load/store
instructions per clock cycle, hand-unrolled to 32 cycles, which totally maxes out
the Alpha's theoretical performance. (I'm not an assembler programmer, just know
a little bit about the chip's operation, and read through his code once a while
back. BTW, the code is so well documented, it's worth the read even if you have
no interest whatsoever in Alpha assembly.) It's *extremely* difficult, probably
impossible, to get near this level of performance from compiled code, even using
Compaq's new FORTRAN Linux compilers.
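
For readers who don't read Alpha assembler, the unrolling idea can be sketched in
C. This is only an illustration, not Goto's code: the name dot_unrolled4 and the
four-way unroll factor are made up for the example, and a compiler will not
necessarily schedule it at one add and one multiply per cycle. Separate
accumulators let the adds and multiplies of different iterations overlap instead
of waiting on a single running sum.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: dot product unrolled four ways, with four
   independent partial sums so consecutive operations do not all depend
   on one accumulator. */
double dot_unrolled4(const double *x, const double *y, size_t n)
{
    assert(n == 0 || (x != NULL && y != NULL));

    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;

    /* Main unrolled body: four multiply-adds per trip. */
    for (; i + 4 <= n; i += 4) {
        s0 += x[i]     * y[i];
        s1 += x[i + 1] * y[i + 1];
        s2 += x[i + 2] * y[i + 2];
        s3 += x[i + 3] * y[i + 3];
    }

    /* Remainder loop for the last n % 4 elements. */
    for (; i < n; ++i)
        s0 += x[i] * y[i];

    return (s0 + s1) + (s2 + s3);
}
```

Note that the summation order differs from a plain loop, so results can differ in
the last bits for general floating-point input.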
In fact, his routine is twice as fast as the original assembler BLAS routines
shipped by Digital!! Compaq has included his routines in their CXML library which
ships with their Linux compilers, with the author's permission for a GPL exception.
There are Debian packages for the Compaq compilers and libs, but they're very
non-free. At some point I'm going to check performance of this enhanced GPL BLAS
package against CXML, and also try compiling the regular BLAS with the Compaq
compilers, to see whether the free libs (they're still free if compiled by a
non-free compiler, right?) are as fast as the non-free CXML.
I'll also at some point try to make a deb of Goto and Joachim Wesner's Free Fast
Math lib for Alpha, which is not quite as fast as CPML but is free and includes
"vectorized" versions of sqrt, cos, sin which are substantially faster than looped
calls to CPML (I think he said something like 12 clocks per argument, but I don't
remember which function that was).
> And what's better, it has hooks allowing the user to provide a small number of
> kernel routines to time against the others and possibly include in the finished
> library. In fact, a few others and myself are working on that right now for the
> PIII, using the kni and prefetch x86 extensions. We're seeing significant gains
> with these instructions, and hope to contribute them to atlas soon.
Sounds interesting. A bit beyond my understanding, but if there's significant
vectorization, could it approach Alpha's 800 MFlops performance? How many add/mul
double-precision floating-point units are there on the PIII?
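
The arithmetic behind a peak figure like this is just clock rate times
floating-point operations per cycle. A small sketch in C, where peak_mflops is a
made-up helper and the 400 MHz clock in the note below is an assumption (the
thread gives neither the clock rate nor says whether 800 MFlops is a peak or a
measured number):

```c
#include <assert.h>

/* Hypothetical helper: theoretical peak in MFlops is the clock rate in
   MHz times the floating-point operations completed per cycle. */
double peak_mflops(double clock_mhz, int flops_per_cycle)
{
    assert(clock_mhz >= 0.0 && flops_per_cycle >= 0);
    return clock_mhz * (double)flops_per_cycle;
}
```

With one add and one multiply per cycle (2 flops per cycle), peak_mflops(400.0, 2)
gives 800, so if 800 MFlops were the theoretical peak it would correspond to a
400 MHz clock. The same formula applies to the PIII once its unit count and clock
are known.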
> The interested reader can check out
Thanks very much, I'll give these a try (probably in a week or two) and report a
comparison, also vs. the non-free Compaq libs.
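
One possible shape for such a comparison, sketched in C under some assumptions:
dgemm_ref is a plain reference multiply standing in for whichever library dgemm is
being timed, and the names dgemm_ref, gemm_mflops and time_dgemm_ref are invented
for the example. Matrices are square and stored column-major, as in the Fortran
BLAS, and an n x n multiply is counted as 2*n^3 floating-point operations.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Reference C = A * B for n x n column-major matrices (hypothetical
   stand-in for the library routine under test). */
void dgemm_ref(size_t n, const double *a, const double *b, double *c)
{
    assert(n == 0 || (a != NULL && b != NULL && c != NULL));
    for (size_t j = 0; j < n; ++j) {
        for (size_t i = 0; i < n; ++i)
            c[i + j * n] = 0.0;
        for (size_t k = 0; k < n; ++k) {
            double bkj = b[k + j * n];
            for (size_t i = 0; i < n; ++i)
                c[i + j * n] += a[i + k * n] * bkj;
        }
    }
}

/* MFlops for one n x n multiply: n^3 multiplies plus n^3 adds. */
double gemm_mflops(size_t n, double seconds)
{
    assert(seconds > 0.0);
    double flops = 2.0 * (double)n * (double)n * (double)n;
    return flops / seconds / 1.0e6;
}

/* CPU seconds for one call of dgemm_ref, measured with clock(). */
double time_dgemm_ref(size_t n, const double *a, const double *b, double *c)
{
    clock_t t0 = clock();
    dgemm_ref(n, a, b, c);
    clock_t t1 = clock();
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}
```

Since clock() has coarse resolution, small matrices would need the call repeated
many times and the total divided by the repeat count; the same inputs should be
used for every library being compared.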
BTW, I haven't heard a reply to the legal question yet: "This software is in the
public domain" is GPL-compatible, right? If not, I'll have to take down the
patches and debs. :-(
Adam Powell http://lyre.mit.edu/~powell/
Thomas B. King Assistant Professor of Materials Engineering
77 Massachusetts Ave. Rm. 4-117 Phone (617) 452-2086
Cambridge, MA 02139 USA Fax (617) 253-5418