Re: developer release 3.1.2
1) It looks as though the prefetch-assisted double precision level2
routines will max out at about 50% above standard atlas. Transpose:
94 -> 140, Notrans: 67 -> 97. dger remains to be completed.
Basically, I just looked at the atlas compiled assembler and added
prefetch. So the rule of thumb appears to be SIMD +50%, prefetch
+50%.
2) If anyone has a handy reference for the Athlon SIMD instructions, I
think these routines will port over to that platform with minimal
change. We even have an Athlon that I can try out :-)
3) I do hope we can find a solution for distributed atlas binaries. I
know the idea is for the user to build atlas on each platform they
will use, and that the current tree will skip any routines which
fail to compile on a given platform (e.g. if there is no SIMD
support). Serious users will no doubt do this.
But I do maintain an atlas package for Debian, and I've found that,
while the distributed library is obviously not completely optimal,
it is very frequently significantly better than the reference blas,
and gives new users a quick way to try atlas out to see if it's
worth their while. (The Debian package provides an atlas drop-in
shared library replacement for the standard blas, so one can
compare performance gains at runtime simply by setting the
LD_LIBRARY_PATH environment variable.)
One solution is to compile several versions covering the most
common platforms. Perhaps this is best. However, this strategy
has the potential for confusion, both at compile time, and for
users inadvertently installing the wrong binary, getting a crash,
and filing a bug. Another option is to have a flag somewhere in
the build process specifying "compiled code only" or some such.
Then we could leave the SIMD gains to the serious users and maybe
provide a note to this effect in the docs accompanying the package.
I really don't know what to do; I just thought I'd mention it.
4) Do we have an idea as to when we might want to release a
SIMD-enhanced atlas, say in Debian? Is there any word on the most
important level3 front?
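
To make point 1 a little more concrete, here is a rough C sketch of
what "adding prefetch" to a transpose matrix-vector product looks like.
It is only an illustration, not the hand-edited assembler itself: the
routine name dgemv_t_prefetch, the prefetch distance PF_DIST, and the
use of the GCC/Clang builtin __builtin_prefetch are all assumptions
made for the example, and the right distance would have to be found by
timing on the target machine.

```c
#include <stddef.h>

/* Hypothetical prefetch distance, in doubles. Machine-dependent;
   this value is a placeholder, not a tuned number. */
#define PF_DIST 64

/* y = A^T * x for a column-major m x n matrix A with leading
   dimension lda. Each y[j] is the dot product of column j with x,
   so the inner loop walks memory contiguously. */
void dgemv_t_prefetch(size_t m, size_t n, const double *a, size_t lda,
                      const double *x, double *y)
{
    for (size_t j = 0; j < n; j++) {
        const double *col = a + j * lda;
        double sum = 0.0;
        for (size_t i = 0; i < m; i++) {
            /* Every 8 doubles (one 64-byte line, assumed), ask for
               data PF_DIST elements ahead. The guard keeps the
               address inside the current column. */
            if ((i & 7) == 0 && i + PF_DIST < m)
                __builtin_prefetch(col + i + PF_DIST, 0, 0);
            sum += col[i] * x[i];
        }
        y[j] = sum;
    }
}
```

The prefetch is purely a hint: the result is the same with or without
it, and only the timing changes.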
R Clint Whaley <firstname.lastname@example.org> writes:
> I just returned from vacation; thanks a lot for the complex stuff.
> Looks like GEMV is faster than GEMM for complex; I will look into this,
> see if we want a GEMV-based GEMM for this platform, until the emmerald stuff
> comes through :) My guess is no (short vector lengths), but worth a shot,
> I guess . . .
> Anyway, I haven't scoped it yet, but I'll get it in to a developer
> release ASAP (I'm working right now mostly on getting Antoine's
> pthreads stuff in) . . .
> As to your question, I do not believe inlining is a problem when
> the entire lib is compiled with same flags (I have never experienced
> the problems with inlining you spoke about); the only tricky thing
> I know of is that you need to make damn sure to prototype routines
> when they contain single precision scalars; not doing so can cause
> seg or bus faults . . .
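
On the prototyping point quoted above: in C, a call made without a
prototype in scope applies the default argument promotions, so a float
scalar is passed as a double while the callee still expects a float.
A minimal sketch of the safe pattern follows; the routine name scale_f
is hypothetical and is not an atlas routine.

```c
/* Prototype visible to every caller: alpha is passed as a float.
   Without this declaration, an old-style call would promote alpha
   to double, mismatching the definition below (undefined behavior,
   which on some platforms shows up as a seg or bus fault). */
void scale_f(int n, float alpha, float *x);

/* Scale the n-element single precision vector x by alpha. */
void scale_f(int n, float alpha, float *x)
{
    for (int i = 0; i < n; i++)
        x[i] *= alpha;
}
```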
Camm Maguire email@example.com
"The earth is but one country, and mankind its citizens." -- Baha'u'llah