
RFC: threading-aware virtual BLAS/LAPACK



Hi fellow devs,

I'd like advice on a system design issue specific to Debian: whether we
should introduce threading-aware BLAS/LAPACK virtual packages. The
question matters because it affects a reverse dependency tree with
> 1 million popcon.  (One week ago I asked for comments on the -science
team list, but I have not received enough feedback yet:
https://lists.debian.org/debian-science/2020/05/msg00023.html)

--- Background ---

BLAS/LAPACK provide fundamental linear algebra routines for dense vectors
and matrices, and their performance is critical for many of their rdeps.
Apart from optimizing cache access and accelerating computation with
architecture-specific SIMD intrinsics, parallelization is also an
essential way to boost BLAS/LAPACK performance.

An optimized BLAS/LAPACK library may choose one of these threading
libraries for parallelization: GNU OpenMP (gomp), Intel/LLVM OpenMP
(iomp), TBB, Pthread.

Because of this choice, problems may appear deep inside a dependency
tree. Assume that we compile a scientific program with clang+openmp, so
that it is linked against both libblas.so.3 and libiomp.so.5. When the
libblas.so.3 provider is the openmp version of OpenBLAS, our program
loads libgomp and libiomp at the same time. Undefined behaviour may
result because iomp and gomp export the same set of symbols.
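
One way to check whether a process is affected is to look at which
OpenMP runtimes it has actually mapped. The following is a minimal
Python sketch, assuming a Linux system where /proc/<pid>/maps is
readable. The function names and the library-name prefixes (libgomp,
libiomp, libomp) are my own assumptions for illustration and not part
of any Debian tool:

```python
import os
import re

# Assumed mapping from shared-object name prefixes to OpenMP runtime
# families. Adjust to the sonames actually shipped on your system.
OPENMP_RUNTIMES = {
    "libgomp": "GNU OpenMP (gomp)",
    "libiomp": "Intel/LLVM OpenMP (iomp)",
    "libomp": "Intel/LLVM OpenMP (iomp)",
}


def openmp_runtimes(paths):
    """Return the set of OpenMP runtime families found among library paths."""
    found = set()
    for path in paths:
        name = os.path.basename(path)
        for prefix, family in OPENMP_RUNTIMES.items():
            # Matches e.g. "libgomp.so.1" or "libiomp5.so",
            # but not unrelated names such as "libgompfoo.so".
            if re.match(re.escape(prefix) + r"\d*\.so", name):
                found.add(family)
    return found


def mapped_libraries(pid="self"):
    """List the file paths mapped into a process (Linux only)."""
    libs = set()
    with open(f"/proc/{pid}/maps") as maps:
        for line in maps:
            fields = line.split(None, 5)
            if len(fields) == 6 and fields[5].startswith("/"):
                libs.add(fields[5].strip())
    return sorted(libs)


def has_mixed_openmp(paths):
    """True if more than one OpenMP runtime family is loaded at once."""
    return len(openmp_runtimes(paths)) > 1
```

Calling has_mixed_openmp(mapped_libraries()) from inside the affected
process (or passing another pid) reports the gomp+iomp case described
above. It only inspects library names, so it will not notice a
pthread/openmp mix.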

Similarly, when pthread and openmp are indirectly used at the same time,
the consequences are undefined as well. See e.g.
https://lists.debian.org/debian-science/2020/05/msg00031.html
where R-4.0.0 computations become ridiculously slow due to the mixed
usage of pthread and openmp.

The fact is that some rdeps of BLAS/LAPACK prefer openmp, some prefer
pthread, and some prefer serial. In that sense, our current
libblas.so.3 alternative is threading-unaware:

 ^ (high priority in alternatives system)
 | OpenBLAS (pthread)             <- libopenblas0, libopenblas0-pthread
 | OpenBLAS (openmp)
 | OpenBLAS (serial)
 | BLIS (openmp)                  <- libblis3, libblis3-openmp
 | BLIS (pthread)
 | BLIS (serial)
 | Atlas (? I forgot it)
 | Netlib (serial)                <- libblas3 | libblas.so.3
 | Intel-MKL (gomp/iomp/tbb)      <- libmkl-rt (non-free)
 v (low priority in alternatives system)

By default, the standard serial implementation (Netlib) will be installed
to satisfy the "libblas3 | libblas.so.3" dependency, as the libblas.so.3
provider. But this implementation can be >40x slower than the fastest one.

--- RFC / My Tentative Proposal ---

 * Apart from libblas.so.3, we create more virtual BLAS/LAPACK alternative
   groups. E.g. libblasp.so.3 for a pthread BLAS, libblaso.so.3 for an
   openmp BLAS, and libblass.so.3 for a serial BLAS.

(How this would be implemented is not my point here.
 --  Let's focus on the Debian system design issue.)

That way:
 (1) maintainers who have a specific requirement on the threading library
     can directly depend on a BLAS implementation that uses that library.
 (2) the threading troubles won't be propagated to our end users again and
     again.
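
To make the difference concrete, here is a toy Python model of the
proposal. It is only a sketch under assumptions: the priority numbers
are made up (only their ordering follows the table above), the Provider
type and the select function are hypothetical, and it assumes the usual
alternatives behaviour in automatic mode, where the highest-priority
installed candidate wins.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Provider:
    name: str
    threading: str   # "pthread", "openmp" or "serial"
    priority: int    # hypothetical value; only the ordering matters


# Subset of the table above; priorities are invented for illustration.
PROVIDERS = [
    Provider("OpenBLAS", "pthread", 100),
    Provider("OpenBLAS", "openmp", 95),
    Provider("OpenBLAS", "serial", 90),
    Provider("BLIS", "openmp", 80),
    Provider("BLIS", "pthread", 75),
    Provider("BLIS", "serial", 70),
    Provider("Netlib", "serial", 10),
]


def select(installed: List[Provider],
           threading: Optional[str] = None) -> Optional[Provider]:
    """Pick the highest-priority installed provider.

    threading=None models today's threading-unaware libblas.so.3 group;
    a value models a per-threading group such as libblaso.so.3 (openmp).
    """
    candidates = [p for p in installed if threading in (None, p.threading)]
    return max(candidates, key=lambda p: p.priority, default=None)
```

With everything installed, select(PROVIDERS) hands the pthread OpenBLAS
to every rdep regardless of what it needs, while
select(PROVIDERS, "openmp") picks the openmp OpenBLAS and
select(PROVIDERS, "serial") the serial one.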

Downsides:
 (1) currently src:openblas builds 12 versions of the OpenBLAS shared
     object. If we decide to add threading-aware virtual packages, it
     will build 18 versions.
 (2) fewer Debian developers will be able to readily handle our
     BLAS/LAPACK ecosystem, apart from Sébastien and me ....

Please comment:
 1. Do we have a better solution where we can retain high performance and
    avoid threading trouble at the same time?
 2. If we don't have a better solution, is my proposal acceptable?
 3. In which way can my proposal be improved?

Thank you in advance. :-)

Ack
---
This work is supported by GSoC2020.

