
Re: Bug#921207: Octave GEMM error on large matrix due to openmp thread race condition



control: severity -1 important

Hi Sébastien and Sylvestre,

On Sun, Feb 03, 2019 at 10:16:05AM +0100, Sébastien Villemot wrote:
> Control: tags -1 unreproducible
> 
> Dear Lumin,
> 
> I've tried to reproduce the problem with Netlib BLAS, OpenBLAS and
> BLIS, but without success (I did not try with MKL since I don't want
> such a large binary blob on my system).

I tried to reproduce this issue in a Docker container. It seems that
the problem only occurs after libmkl-rt is installed. That seemed
strange to me, so I investigated further:

root@6c3d05276fb0:~/x# cat x.sh
for i in $(seq 10); do octave -q --no-gui a.m ; done

root@6c3d05276fb0:~/x# OMP_NUM_THREADS=2 MKL_THREADING_LAYER=intel MKL_NUM_THREADS=1 sh x.sh
   333338333350000   333338333350000   333338333350000   333338333350000
   333338333350000   333338333350000   333338333350000   333338333350000
   333338333350000   333338333350000   333338333350000   333338333350000
   333338333350000   333338333350000   333338333350000   333338333350000
   333338333350000   333338333350000   333338333350000   333338333350000
   333338333350000   333338333350000   333338333350000   333338333350000
   333338333350000   333338333350000   333338333350000   333338333350000
   333338333350000   333338333350000   333338333350000   333338333350000
   333338333350000   333338333350000   333338333350000   333338333350000
   333338333350000   333338333350000   333338333350000   333338333350000

root@6c3d05276fb0:~/x# OMP_NUM_THREADS=2 MKL_THREADING_LAYER=intel MKL_NUM_THREADS=2 sh x.sh
   641731270638496   641731270638496   641348657394560   641348657394560
   484510470256176   484510470256176   485846751162000   485846751162000
   640975146516736   640975146516736   641915530512672   641915530512672
   646390298635168   646390298635168   646390298635168   646390298635168
   541379330715152   541379330715152   546317613096592   546317613096592
   495137794802960   495137794802960   497281139942736   497281139942736
   418469161038160   418469161038160   417908282300720   417908282300720
   550962358819680   550962358819680   555512823424000   555512823424000
   447356352104528   447356352104528   452646448263472   452646448263472
   401738136831792   401738136831792   405814754050064   405814754050064

root@6c3d05276fb0:~/x# ln -sr /usr/lib/llvm-7/lib/libgomp.so /usr/lib/llvm-7/lib/libgomp.so.1

root@6c3d05276fb0:~/x# OMP_NUM_THREADS=2 MKL_THREADING_LAYER=intel MKL_NUM_THREADS=2 LD_LIBRARY_PATH=/usr/lib/llvm-7/lib/ sh x.sh
   333338333350000   333338333350000   333338333350000   333338333350000
   333338333350000   333338333350000   333338333350000   333338333350000
   333338333350000   333338333350000   333338333350000   333338333350000
   333338333350000   333338333350000   333338333350000   333338333350000
   333338333350000   333338333350000   333338333350000   333338333350000
   333338333350000   333338333350000   333338333350000   333338333350000
   333338333350000   333338333350000   333338333350000   333338333350000
   333338333350000   333338333350000   333338333350000   333338333350000
   333338333350000   333338333350000   333338333350000   333338333350000
   333338333350000   333338333350000   333338333350000   333338333350000
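
As a sanity check on which runs are correct: the value 333338333350000
equals the sum of squares 1^2 + 2^2 + ... + 100000^2, i.e.
n(n+1)(2n+1)/6 for n = 100000. The script a.m is not shown in this mail,
so reading it as an inner product of 1:100000 with itself is only an
assumption; the arithmetic itself can be checked with a small sketch
(sum_of_squares is a hypothetical helper, and 64-bit shell arithmetic is
assumed):

```shell
# Hypothetical helper: closed-form sum of squares 1^2 + ... + n^2.
# Assumes the shell uses 64-bit integers (true for dash/bash on amd64).
sum_of_squares() {
    n=$1
    echo $(( n * (n + 1) * (2 * n + 1) / 6 ))
}

sum_of_squares 100000
```

The runs with MKL_NUM_THREADS=1 and with the LLVM runtime preloaded
print exactly this value, while the MKL_NUM_THREADS=2 run does not.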

It turns out that the incorrect matrix product results from a clash
between the gomp and iomp libraries: Octave is linked against GNU
OpenMP (gomp), while libmkl-rt.so uses the Intel (LLVM) OpenMP runtime
(iomp) by default.
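
A rough way to spot such a clash is to look at which OpenMP runtimes a
running process has mapped. The sketch below is only a heuristic based
on library file names: detect_omp_clash is a hypothetical helper, and
the name patterns (libgomp.so.1 for GNU, libiomp5 or libomp.so for
Intel/LLVM) are assumptions about how the files are named on the
system.

```shell
# Hypothetical helper: read a /proc/<pid>/maps style listing on stdin
# and report whether both a GNU and an Intel/LLVM OpenMP runtime appear.
# Matching is by file name only, so this is a heuristic, not a proof.
detect_omp_clash() {
    maps=$(cat)
    has_gnu=no
    has_intel=no
    case "$maps" in
        *libgomp.so.1*) has_gnu=yes ;;
    esac
    case "$maps" in
        *libiomp5*|*libomp.so*) has_intel=yes ;;
    esac
    if [ "$has_gnu" = yes ] && [ "$has_intel" = yes ]; then
        echo clash
    else
        echo ok
    fi
}

# Usage, assuming $pid holds the PID of a running octave process:
#   detect_omp_clash < "/proc/$pid/maps"
```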

@Sylvestre, do you have any idea how to avoid the gomp/iomp clash?
Although people should keep in mind not to mix gomp and iomp, the
matrix product error happened silently, without any warning...

> Basically you're suggesting that Octave's basic matrix multiplication
> functionality is utterly broken, without anybody else noticing. This is
> highly unlikely.
>
> Did you try to reproduce the problem on a pristine sid chroot, or
> another system?

I think any BLAS implementation that uses iomp would end up with this
error. Unfortunately, MKL is the only one that does.
 
I confirm that this problem is reproducible, as long as you make
gomp and iomp clash.

OpenBLAS/Netlib/BLIS are innocent, even though I encountered the error
with them: LAPACK still pointed to libmkl-rt, which eventually led to
the gomp/iomp clash again. (I finally found the answer.)

So ... how do I fix this gomp/iomp clash?
(I guess it's not fixable.)
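
One avenue that may be worth testing (not tried in the runs above): MKL
documents GNU as a value for MKL_THREADING_LAYER, which asks libmkl-rt
to use the GNU OpenMP threading layer instead of the Intel one. Whether
this removes the wrong results here is an assumption, and
run_with_gnu_layer is a hypothetical wrapper:

```shell
# Hypothetical wrapper: run a command with MKL's threading layer set to
# GNU, so that libmkl-rt and the calling program would share libgomp.
# Untested against this bug; the effect on a.m is an assumption.
run_with_gnu_layer() {
    MKL_THREADING_LAYER=GNU "$@"
}

# Usage, mirroring the failing invocation above:
#   OMP_NUM_THREADS=2 MKL_NUM_THREADS=2 run_with_gnu_layer sh x.sh
```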

