Re: Fast blas1

(Ancient thread, but here goes...)

Adam C Powell IV wrote:

> Camm Maguire wrote:
> > Greetings, and thanks for your work on the blas!
> Thanks, but it's really not my work, Kazushige Goto deserves the credit.  I just
> spent a couple of hours plugging it in.
> > Have you by chance timed these routines against the atlas package blas?  Atlas
> > automatically tunes the blas for your particular hardware, and is open source.
> No, but I'll give it a try; I suspect the Goto routines (dgemm in particular) will
> be at least twice as fast because he does some really interesting unrolling
> things, and it's written in very tight assembler.

Okay, I was wrong.  I finally ran some tests with the FORTRAN BLAS, Atlas, and Goto's
dgemm.  Goto's is faster, but not by nearly as much as I had thought.  My little
program uses dgemm (matrix multiply), dgetrf (LU decomposition) and dtrsm
(triangular solve, used here for back-substitution), and a weighted average over the
three gave the following results (in MFlop/s):

Platform  FORTRAN  atlas  Goto's
ev5          53     331    550
ev6         191     681    830

ev5 here is an LX164 at 600 MHz with 2 MB cache; ev6 is a 667 MHz UP2000.  So Goto's
assembler routines are quite a bit faster than Atlas, especially on ev5 (about 1.7x,
versus about 1.2x on ev6), but that is well short of the "at least twice as fast"
that I had predicted.  Atlas has some truly impressive performance, especially for
compiled code!
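For anyone who wants to check the ratios, here is a minimal Python sketch that
recomputes them from the table above.  The dictionary layout and the speedup()
helper are just names made up for this illustration; the numbers are the MFlop/s
figures from the table and nothing else.

```python
# MFlop/s figures copied from the table above.
mflops = {
    "ev5": {"FORTRAN": 53, "atlas": 331, "Goto": 550},
    "ev6": {"FORTRAN": 191, "atlas": 681, "Goto": 830},
}

def speedup(platform, fast, slow):
    """Ratio of two measured rates on the same platform (hypothetical helper)."""
    row = mflops[platform]
    return row[fast] / row[slow]

for platform in mflops:
    goto_vs_atlas = speedup(platform, "Goto", "atlas")
    atlas_vs_fortran = speedup(platform, "atlas", "FORTRAN")
    print(f"{platform}: Goto/atlas = {goto_vs_atlas:.2f}, "
          f"atlas/FORTRAN = {atlas_vs_fortran:.2f}")
```

Neither Goto/atlas ratio reaches 2, while atlas beats the FORTRAN BLAS by a much
larger factor on both machines.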

Interesting notes:

   * On the ev6, the best dgemm performance I saw from Goto's routine was 1079
     MFlop/s!!  (A 6k x 6k matrix times a 6k x 100 matrix in under 6.7 seconds.)
   * Using atlas, the ev6 might be worth the extra money it costs; using Goto's, the
     LX164 still has the better performance/price (if you can get one) for this
     particular application of dense linear algebra.
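The peak dgemm figure in the first note can be sanity-checked with the usual
2*m*n*k flop count for a matrix multiply.  The sketch below assumes "6k" means
exactly 6000, which the post does not spell out; gemm_flops() and mflops_rate() are
hypothetical helper names for this example only.

```python
def gemm_flops(m, n, k):
    """Flops for C = A*B with A m-by-k and B k-by-n:
    one multiply and one add per inner-product term."""
    return 2 * m * n * k

def mflops_rate(flops, seconds):
    """Convert a flop count and a wall time into MFlop/s."""
    return flops / seconds / 1e6

# Assumption: "6k" is taken to mean 6000.
flops = gemm_flops(6000, 100, 6000)

# The run took "under 6.7 seconds", so this is a lower bound on the rate.
print(f"flops             = {flops}")
print(f"rate at 6.7 s     = {mflops_rate(flops, 6.7):.1f} MFlop/s")

# Time implied by the quoted 1079 MFlop/s.
print(f"time at 1079 MF/s = {flops / 1079e6:.3f} s")
```

Under that assumption the multiply is 7.2e9 flops, 6.7 seconds corresponds to a bit
under 1075 MFlop/s, and 1079 MFlop/s corresponds to a run of about 6.67 seconds, so
the two quoted numbers agree with each other.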

So that's what I know.  Once again, the patches against the blas source package, and
ev5 debs using them, are in http://lyre.mit.edu/~powell/debs/ (note that Goto's
routines are GPL).

-Adam P.
