# Re: Fast blas1

(Ancient thread, but here goes...)
Adam C Powell IV wrote:
> Camm Maguire wrote:
>
> > Greetings, and thanks for your work on the blas!
>
> Thanks, but it's really not my work, Kazushige Goto deserves the credit. I just
> spent a couple of hours plugging it in.
>
> > Have you by chance timed these routines against the atlas package blas? Atlas
> > automatically tunes the blas for your particular hardware, and is open source.
>
> No, but I'll give it a try; I suspect the Goto routines (dgemm in particular) will
> be at least twice as fast because he does some really interesting unrolling
> things, and it's written in very tight assembler.

Okay, I was wrong. I finally ran some tests with the FORTRAN BLAS, Atlas, and Goto's
dgemm. Goto's is faster, but not by nearly as much as I had thought. My little
program uses dgemm (matrix multiply), dgetrf (LU decomposition) and dtrsm
(back-substitution), and a weighted average gave the following results (in MFlop/s):

| Platform | FORTRAN | atlas | Goto's |
|----------|---------|-------|--------|
| ev5      | 53      | 331   | 550    |
| ev6      | 191     | 681   | 830    |

ev5 here is an LX164 at 600 MHz with 2 MB cache; ev6 is a 667 MHz UP2000. So Goto's
assembler routines are quite a bit faster, especially on ev5, but nowhere near the
"at least twice as fast" that I had expected. Atlas has some truly impressive
performance, especially for compiled code!
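
To put numbers on "not nearly twice as fast", here is a minimal sketch that just divides the weighted-average figures from the table above. The dictionary layout and the `speedup` helper are made up for illustration; only the six MFlop/s values come from the measurements.

```python
# Weighted-average MFlop/s figures copied from the results above.
# The dict layout and helper name are illustrative, not part of any BLAS package.
results = {
    "ev5": {"fortran": 53, "atlas": 331, "goto": 550},
    "ev6": {"fortran": 191, "atlas": 681, "goto": 830},
}


def speedup(platform, fast, slow):
    """Ratio of two MFlop/s figures measured on the same platform."""
    row = results[platform]
    return row[fast] / row[slow]


for platform in results:
    print(
        platform,
        "Goto/atlas: %.2f" % speedup(platform, "goto", "atlas"),
        "atlas/FORTRAN: %.2f" % speedup(platform, "atlas", "fortran"),
    )
```

By this arithmetic Goto's routines come out roughly 1.7 times as fast as atlas on ev5 and roughly 1.2 times as fast on ev6, while atlas itself is several times faster than the reference FORTRAN on both machines.
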

Interesting notes:

* On the ev6, the best dgemm performance I saw from Goto's routine was 1079
MFlop/s!! (a 6k x 6k matrix times a 6k x 100 matrix in under 6.7 seconds).
* Using atlas, ev6 might be worth the extra money it costs; using Goto's, the LX164
still has the better price/performance (if you can get one) for this particular
application of dense linear algebra.
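
As a sanity check on the peak figure in the first note, here is a rough sketch of the arithmetic. It assumes that "6k" means 6000 and that flops are counted in the conventional way for a matrix multiply (one multiply and one add per inner-product term, so 2*m*n*k); the post does not say how the test program counted them, and the function names are only illustrative.

```python
def dgemm_flops(m, n, k):
    """Conventional flop count for C = A*B, with A m-by-k and B k-by-n."""
    return 2 * m * n * k


def mflops(flops, seconds):
    """Convert a flop count and a wall time into MFlop/s."""
    return flops / seconds / 1e6


# Assumption: "6k" is taken as 6000, so A is 6000 x 6000 and B is 6000 x 100.
flops = dgemm_flops(6000, 100, 6000)

print("flops:", flops)
print("MFlop/s at 6.7 s: %.0f" % mflops(flops, 6.7))
print("seconds needed for 1079 MFlop/s: %.2f" % (flops / 1079e6))
```

Under those assumptions the multiply is 7.2e9 flops, which at 1079 MFlop/s takes a little under 6.7 seconds, so the two numbers in the note agree with each other.
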

So that's what I know. Once again, the patches against the blas source package, and
ev5 debs using them, are at http://lyre.mit.edu/~powell/debs/ (note that Goto's
routines are GPL).

-Adam P.
