
Re: BLAS recommendations



Greetings!

Emil Briggs <briggs@tick.physics.ncsu.edu> writes:

> >
> >   a) What is the "best" way to add BLAS to a 'wulfish cluster of PPro's
> >and PII's?  I say best instead of fastest or cheapest or GPL'd-est to
> >allow for a variety of personal interpretations of the word best.  My
> >own would be very much GPL, SRPM based (and fastest possible within that
> >constraint) if such a thing were possible, but I'd love to hear about
> >fastest under any circumstances, best for sale, and so forth as well.
> >   b) How is BLAS documented?  Is it of the "if you have to ask, you
> >shouldn't be using it" variety?  Books?  A manual somewhere?
> >   c) Is there e.g. a website for software written to make use of BLAS?
> >   d) Are any of the above really ignorant questions?  If so please
> >Enlighten me...
> >
> 
> For recommendations I would say that ATLAS is a great choice for the
> Level3 stuff. If you've got an Athlon then my libs are OK
> for the Level1. No recommendations for Level2 (Our codes don't
> require much Level2 stuff so it hasn't been a high priority for me).
> 
> 

What he said!  Debian has an ATLAS package that supplies a
binary-compatible, optimized BLAS library, which can be selected at
runtime via LD_LIBRARY_PATH.  Just to put some numbers behind what Emil
said, here are results from our cluster of 16 PII 350s:

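xd3blastst (compares ATLAS against the reference BLAS on a single
processor; in each pair of rows the first is the reference BLAS and the
second, with PASS = YES, is ATLAS):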

DGEMM
TEST  TA  TB    M    N    K  alpha   beta    Time  Mflop  SpUp  PASS
====  ==  ==  ===  ===  ===  =====  =====  ======  =====  ====  ====

   2   N   N  200  200  200    1.0    0.0    0.33   48.5  1.00   ---
   2   N   N  200  200  200    1.0    0.0    0.07  228.6  4.71   YES
   3   N   N  300  300  300    1.0    0.0    1.64   32.9  1.00   ---
   3   N   N  300  300  300    1.0    0.0    0.20  270.0  8.20   YES
   4   N   N  400  400  400    1.0    0.0    4.22   30.3  1.00   ---
   4   N   N  400  400  400    1.0    0.0    0.50  256.0  8.44   YES
   5   N   N  500  500  500    1.0    0.0    8.22   30.4  1.00   ---
   5   N   N  500  500  500    1.0    0.0    0.95  263.2  8.65   YES
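
For anyone who wants to check the Mflop and SpUp columns, here is a
minimal sketch in Python.  It assumes the usual 2*M*N*K operation count
for DGEMM and reads the table as reference BLAS first, ATLAS second in
each pair; the function name dgemm_mflops is just illustrative and is
not part of the ATLAS tester.

```python
# DGEMM on an M x K by K x N product costs about 2*M*N*K floating-point
# operations (one multiply and one add per inner-product term).
def dgemm_mflops(m, n, k, seconds):
    return 2.0 * m * n * k / seconds / 1e6

# (N, reference BLAS time in s, ATLAS time in s), copied from the table above.
runs = [(200, 0.33, 0.07), (300, 1.64, 0.20), (400, 4.22, 0.50), (500, 8.22, 0.95)]

for n, t_ref, t_atlas in runs:
    print(f"N={n}: reference {dgemm_mflops(n, n, n, t_ref):6.1f} Mflop/s, "
          f"ATLAS {dgemm_mflops(n, n, n, t_atlas):6.1f} Mflop/s, "
          f"speedup {t_ref / t_atlas:4.2f}")
```

The speedup is just the ratio of the two times, so the rounding of the
printed times limits how closely the last digit can be reproduced.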


mpi xdlutime

Simple Timer for ScaLAPACK routine PDGESV
Number of processors used:  16

TIME     N  NB   P   Q  LU Time   Sol Time  MFLOP/S Residual  CHECK
---- ----- --- --- --- --------- --------- -------- -------- -------
WALL   100  64   4   4      0.36      0.02     1.79 0.010005 PASSED
WALL  4096  64   4   4    104.68      0.66   435.14 0.003757 PASSED



LD_LIBRARY_PATH=/usr/lib/atlas mpi -x LD_LIBRARY_PATH xdlutime

Simple Timer for ScaLAPACK routine PDGESV
Number of processors used:  16

TIME     N  NB   P   Q  LU Time   Sol Time  MFLOP/S Residual  CHECK
---- ----- --- --- --- --------- --------- -------- -------- -------
WALL   100  64   4   4      0.15      0.08     2.99 0.004056 PASSED
WALL  4096  64   4   4     32.83      0.65  1369.22 0.000857 PASSED
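
The MFLOP/S column in the two xdlutime runs can be reproduced roughly
as follows.  This is a sketch under an assumption: that the timer
counts 2/3 N^3 - 1/2 N^2 operations for the LU factorization plus
2 N^2 for the solve, and divides by LU time plus solve time.  That
count is my reading of the numbers, not something the output states,
and pdgesv_mflops is a made-up helper name.

```python
def pdgesv_mflops(n, lu_seconds, solve_seconds):
    # Assumed operation count: LU factorization plus one right-hand-side solve.
    flops = (2.0 / 3.0) * n**3 - 0.5 * n**2 + 2.0 * n**2
    return flops / (lu_seconds + solve_seconds) / 1e6

# N = 4096 rows from the two runs above (reference BLAS, then ATLAS).
ref = pdgesv_mflops(4096, 104.68, 0.66)
atlas = pdgesv_mflops(4096, 32.83, 0.65)

print(f"reference: {ref:.1f} MFLOP/S")
print(f"ATLAS:     {atlas:.1f} MFLOP/S")
print(f"ratio:     {atlas / ref:.2f}")
```

With the printed times this lands within a fraction of a MFLOP/S of
the table, and the ratio between the two runs comes out a bit above 3.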



mpi /usr/lib/scalapack/xdpblas3tim-lam


ScaLAPACK Level-3 PBLAS timing program.
'Intel iPSC/860 hypercube, gamma model.'                                       

Tests of the real double precision Level-3 PBLAS

  Number of Tests           :      1
  Number of process grids   :      1
  P                         :      4
  Q                         :      4
  Alpha                     :      2.00000    
  Beta                      :      3.00000    
  Routines to be tested     :      PDGEMM  ... Yes
                                   PDSYMM  ... Yes
                                   PDSYRK  ... Yes
                                   PDSYR2K ... Yes
                                   PDTRAN  ... Yes
                                   PDTRMM  ... Yes
                                   PDTRSM  ... Yes

  Tests started.

  Test number  1 started on a    4 x    4 process grid.

     -------------------------------------------------------------------
          M      N      K    SIDE  UPLO  TRANSA  TRANSB  DIAG
     -------------------------------------------------------------------
       1024   1024   1024      L     U      N       N      N
     -------------------------------------------------------------------
         IA     JA     MA     NA    MBA    NBA RSRCA CSRCA
     -------------------------------------------------------------------
          1      1   1024   1024     64     64     0     0
     -------------------------------------------------------------------
         IB     JB     MB     NB    MBB    NBB RSRCB CSRCB
     -------------------------------------------------------------------
          1      1   1024   1024     64     64     0     0
     -------------------------------------------------------------------
         IC     JC     MC     NC    MBC    NBC RSRCC CSRCC
     -------------------------------------------------------------------
          1      1   1024   1024     64     64     0     0
     -------------------------------------------------------------------
              WALL time (s)    WALL Mflops   CPU time (s)     CPU Mflops
     PDGEMM           3.583        599.366         -1.000          0.000
     PDSYMM           5.254        408.731         -1.000          0.000
     PDSYRK           2.896        371.170         -1.000          0.000
     PDSYR2K          5.760        372.799         -1.000          0.000
     PDTRAN           0.128          0.000         -1.000          0.000
     PDTRMM           2.862        375.217         -1.000          0.000
     PDTRSM           5.727          0.000         -1.000          0.000
     -------------------------------------------------------------------

  Test number  1 completed.

  End of Tests.

LD_LIBRARY_PATH=/usr/lib/atlas mpi -x LD_LIBRARY_PATH /usr/lib/scalapack/xdpblas3tim-lam


ScaLAPACK Level-3 PBLAS timing program.
'Intel iPSC/860 hypercube, gamma model.'                                       

Tests of the real double precision Level-3 PBLAS

  Number of Tests           :      1
  Number of process grids   :      1
  P                         :      4
  Q                         :      4
  Alpha                     :      2.00000    
  Beta                      :      3.00000    
  Routines to be tested     :      PDGEMM  ... Yes
                                   PDSYMM  ... Yes
                                   PDSYRK  ... Yes
                                   PDSYR2K ... Yes
                                   PDTRAN  ... Yes
                                   PDTRMM  ... Yes
                                   PDTRSM  ... Yes

  Tests started.

  Test number  1 started on a    4 x    4 process grid.

     -------------------------------------------------------------------
          M      N      K    SIDE  UPLO  TRANSA  TRANSB  DIAG
     -------------------------------------------------------------------
       1024   1024   1024      L     U      N       N      N
     -------------------------------------------------------------------
         IA     JA     MA     NA    MBA    NBA RSRCA CSRCA
     -------------------------------------------------------------------
          1      1   1024   1024     64     64     0     0
     -------------------------------------------------------------------
         IB     JB     MB     NB    MBB    NBB RSRCB CSRCB
     -------------------------------------------------------------------
          1      1   1024   1024     64     64     0     0
     -------------------------------------------------------------------
         IC     JC     MC     NC    MBC    NBC RSRCC CSRCC
     -------------------------------------------------------------------
          1      1   1024   1024     64     64     0     0
     -------------------------------------------------------------------
              WALL time (s)    WALL Mflops   CPU time (s)     CPU Mflops
     PDGEMM           1.396       1537.853         -1.000          0.000
     PDSYMM           2.904        739.497         -1.000          0.000
     PDSYRK           1.271        845.707         -1.000          0.000
     PDSYR2K          2.482        865.127         -1.000          0.000
     PDTRAN           0.128          0.000         -1.000          0.000
     PDTRMM           1.476        727.324         -1.000          0.000
     PDTRSM           3.456          0.000         -1.000          0.000
     -------------------------------------------------------------------

  Test number  1 completed.

  End of Tests.
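
To compare the two PBLAS runs routine by routine, a small sketch that
divides the reference wall time by the ATLAS wall time.  The numbers
are copied from the two tables above; the dictionary names are only
for illustration.

```python
# Wall times in seconds for the 1024 x 1024 case on the 4 x 4 process grid.
reference = {"PDGEMM": 3.583, "PDSYMM": 5.254, "PDSYRK": 2.896,
             "PDSYR2K": 5.760, "PDTRAN": 0.128, "PDTRMM": 2.862,
             "PDTRSM": 5.727}
atlas = {"PDGEMM": 1.396, "PDSYMM": 2.904, "PDSYRK": 1.271,
         "PDSYR2K": 2.482, "PDTRAN": 0.128, "PDTRMM": 1.476,
         "PDTRSM": 3.456}

# Speedup = reference time / ATLAS time for the same routine.
speedup = {name: reference[name] / atlas[name] for name in reference}

for name, s in speedup.items():
    print(f"{name:8s} {s:4.2f}x")
```

PDTRAN comes out unchanged, which is plausible since a transpose is
mostly data movement rather than Level 3 BLAS arithmetic.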


When using a 4096x4096 matrix, the ATLAS results approach 3 gigaflops
on PDGEMM.
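
As a rough way to put the aggregate numbers in context, here is a
sketch that converts them to a per-processor rate and compares that
with the single-processor ATLAS DGEMM figure from the first table.
The 3000 Mflops entry is only the rounded "approaching 3 gigaflops"
figure, not a measured value, and per_processor_mflops is a
hypothetical helper.

```python
def per_processor_mflops(aggregate_mflops, processors):
    return aggregate_mflops / processors

processors = 16
single_cpu_atlas = 263.2   # ATLAS DGEMM, N=500, one PII 350 (first table)

cases = [
    ("PDGEMM 1024x1024 (measured)", 1537.853),
    ("PDGEMM 4096x4096 (approx.)", 3000.0),   # rounded, not measured
]

for label, total in cases:
    each = per_processor_mflops(total, processors)
    print(f"{label}: {each:6.1f} Mflops per processor, "
          f"{each / single_cpu_atlas:.2f} of the single-CPU DGEMM rate")
```

The larger matrix gives each processor more arithmetic per byte moved
over the network, which fits with the per-processor rate rising with
matrix size.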

Take care,



> Regards
> Emil

-- 
Camm Maguire			     			camm@enhanced.com
==========================================================================
"The earth is but one country, and mankind its citizens."  --  Baha'u'llah

