Re: Atlas proposal
Don Armstrong, on Tue 17 Aug 2010 15:13:15 -0700, wrote:
> On Tue, 17 Aug 2010, Samuel Thibault wrote:
> > Roger Leigh, on Tue 17 Aug 2010 22:45:50 +0100, wrote:
> > > Why can't this be fixed the correct way:
> > > by building all optimised variants for a given architecture and
> > > selecting the appropriate variant at runtime based upon the system's
> > > capabilities e.g. from CPUID on i386/amd64?
> > Because atlas doesn't optimise only for the instruction set, but
> > also the number of available cores, the size of the caches, etc.
> > etc.
> All of these are things that can be detected at run time and
> appropriate libraries dlopened or codepaths diverged, etc.
Errr, then you'll need a myriad of libraries/codepaths for all the
combinations of L1/L2/L3 cache sizes, number of processors, speed, etc.
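To give an idea of how quickly that grows, here is a rough sketch. The
value lists are hypothetical, picked only for illustration; they are not
what Atlas actually probes for, and processor speed is left out entirely.

```python
from itertools import product

# Hypothetical parameter values, for illustration only.
L1_KIB = [16, 32, 64]
L2_KIB = [256, 512, 1024, 2048, 4096]
L3_KIB = [0, 2048, 4096, 8192]  # 0 stands for "no L3"
CORES = [1, 2, 4, 8]
ISA = ["sse2", "sse3"]

# One prebuilt library would be needed per combination.
variants = list(product(ISA, L1_KIB, L2_KIB, L3_KIB, CORES))
print(len(variants), "variants to build and ship")
```

Even with these short lists that is 2 * 3 * 5 * 4 * 4 = 480 builds per
architecture.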
> > > Disabling threading is also suspect: how can the optimal number of
> > > threads possibly be determined at build time?
> > Because it changes how Atlas will statically schedule the
> > computation kernels.
> This answers the wrong half of the question; there's no way to know at
> build time what precisely the machine is going to be doing.
In HPC, there is: the machine will be running nothing but Atlas.
> > It's not "wrong", it's HPC. And HPC people will happily rebuild the
> > package to get an optimized version.
> It's wrong even in HPC unless you tweak the settings of atlas
> compilation for your particular problem set as well as your hardware
> and software architecture.
Err, what kind of tuning, for example?
The hardware architecture is known: the atlas build system is running on
> But all of that is fine; we can't possibly hope to optimize to get the
> last iota of performance out of a system. We should attempt to provide
> a reasonable set of optimized binaries (whether that means one or ten
> is up to the package maintainer),
The problem is that the Atlas build system currently has no way to do
generic optimization without also applying aggressive L1/L2/L3
cache-size-related optimizations, which actually make performance
considerably worse when running on a machine with, for instance, a
smaller L2.
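A simplified model of why a smaller L2 hurts, assuming a blocked matrix
multiply that keeps three square tiles of doubles resident in L2. This is
only a sketch of the general idea; the helper names are hypothetical and
it is not how Atlas actually chooses its parameters.

```python
import math

ELEM_BYTES = 8   # double precision
TILES = 3        # one tile each of A, B and C kept in cache (assumption)

def working_set(nb):
    """Bytes needed to hold TILES square nb x nb tiles."""
    return TILES * ELEM_BYTES * nb * nb

def max_block(l2_bytes):
    """Largest nb whose working set still fits in l2_bytes."""
    return math.isqrt(l2_bytes // (TILES * ELEM_BYTES))

build_l2 = 1024 * 1024   # L2 of the build machine: 1 MiB
run_l2 = 256 * 1024      # L2 of the machine running the package: 256 KiB

nb = max_block(build_l2)
print("block size tuned at build time:", nb)
print("working set in bytes:", working_set(nb))
print("fits in the smaller L2:", working_set(nb) <= run_l2)
print("block size the smaller L2 would want:", max_block(run_l2))
```

In this model the tile size tuned for the 1 MiB L2 needs roughly four
times the 256 KiB that the smaller machine has, so the tiles keep getting
evicted and the kernel runs out of the next cache level or main memory.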