IBM POWER9 SIMD support? (Was: SIMDebian: ...)
Hi Mo,
I love your footnotes.
My understanding is that IBM's "POWER9" CPUs
1.) have SIMD instructions[1] and
2.) are used by the new, and very cool, open
*hardware* Talos II workstations[2], which
3.) already run Debian.
FYI,
Kingsley
[1] POWER9
https://en.wikipedia.org/wiki/POWER9
[2] Introducing Talos II
https://www.raptorcs.com/TALOSII/
On 02/08/2019 16:25, Mo Zhou wrote:
> Hi folks,
>
> For most programs the "-march=native" option is not expected to bring any
> significant performance improvement. However, for some scientific applications
> this does not hold. When I was creating the TensorFlow Debian
> package, I observed a significant performance gap between generic code and
> kabylake (Intel 7XXX Series) code[1].
>
> The improvement in performance basically stems from the Eigen
> library (a header-only numerical linear algebra library). Here is a simple
> example[2] demonstrating the performance gap[3] between different ISA
> baselines. (Elapsed time is roughly measured with "perf stat ...".)
>
> Having seen such interesting results, I immediately created a Debian partial
> fork named SIMDebian (SIMD + Debian)[0]. It makes great sense for some
> applications due to the significant performance gain brought by SIMD code.
> Currently this partial fork is still at a very early stage, and it needs:
>
> * More experience with software that benefits a lot from SIMD code
> (e.g. Which packages would potentially benefit from SIMD code?)
> * Suggestions and comments
> (e.g. Is such a partial fork really useful and valuable?)
> * More people interested in this
>
> SIMDebian is only a PARTIAL fork, which means that it only takes care of
> packages that would obviously benefit from SIMD code, because no performance
> gain is expected for the majority of packages in the Debian archive.
>
> Generally speaking, in order to bump the ISA baseline for a given package, one
> could add the -march=xxx flag to {C,CXX,F}FLAGS by modifying debian/rules.
> However, SIMDebian employs a more economical approach to this end: forking
> dpkg[5] and injecting the -march=xxx flag into the system default flag list.
> With the resulting dpkg package, most Debian packages could be rebuilt with a
> bumped ISA baseline without any code modification.
>
> I think the Debian Science team is interested in this partial fork as well. In
> the past there was a highly related GSoC project[4] (if my fuzzy memory serves,
> I raised the topic that led to the creation of that GSoC project). However,
> for some reason (I forgot it) it didn't start.
>
> This is the first time I have tried to fork Debian, and I have no experience
> in running a fork. I especially need comments from the Debian Science Team.
> Any response/pointer would be much appreciated!
>
> P.S. SIMDebian has an alias: SIGILLbian (SIGILL + Debian).
> -------------------------------------------------------------------------------
>
> [0] https://github.com/SIMDebian/SIMDebian
>
> [1] https://github.com/SIMDebian/SIMDebian/blob/master/benchmarks/tensorflow.md
>
> [2] ```c++
> #include <iostream>
> #include <Eigen/Dense>
> using namespace std;
>
> #define N 4096
> int main(void)
> {
> auto A = Eigen::MatrixXd::Random(N, N);
> auto B = Eigen::MatrixXd::Random(N, N);
> auto C = A * B;
> //cout << A << endl << B << endl << C << endl;
> (void) C(0,0);
> return 0;
> }
> ```
>
> [3] ``` (command-line) (perf-stat-elapsed-time)
> CPU: Intel I5-7440HQ
>
> g++ a.cc -I/usr/include/eigen3 -O2 -march=skylake \
> -DEIGEN_USE_MKL_ALL -I/usr/include/mkl -lmkl_rt
> 1.275162977 (seconds)
>
> g++ a.cc -I/usr/include/eigen3 -O2 \
> -DEIGEN_USE_MKL_ALL -I/usr/include/mkl -lmkl_rt
> 1.382608279
>
> g++ a.cc -I/usr/include/eigen3 -O2 -march=skylake -fopenmp
> 1.460047514
>
> g++ a.cc -I/usr/include/eigen3 -O3 -march=skylake -fopenmp
> 1.313478657
>
> g++ a.cc -I/usr/include/eigen3 -O2 -march=haswell -fopenmp
> 1.334523068
>
> g++ a.cc -I/usr/include/eigen3 -O2 -march=sandybridge -fopenmp
> 1.988947143
>
> g++ a.cc -I/usr/include/eigen3 -O2 -march=nehalem -fopenmp
> 3.099827038
>
> g++ a.cc -I/usr/include/eigen3 -O2 -march=x86-64 -fopenmp
> 3.106337852
>
> However, please note that Eigen's fastest result is still much slower
> than OpenBLAS, even if Eigen called MKL:
>
> ~ ❯❯❯ julia -e 'A = rand(Float64, 4096, 4096); A*A; @time A*A;'
> 1.011168 seconds (6 allocations: 128.000 MiB, 2.69% gc time)
>
> BLAS optimization is another story. Omitted here.
> ```
>
> [4] https://wiki.debian.org/SummerOfCode2017/Projects/Benchmarking
>
> [5] https://github.com/SIMDebian/dpkg
> Currently this fork targets "haswell" because of the availability of AVX2.
> Only a minor modification to my patch is required to further bump the
> baseline to e.g. icelake (AVX-512).
>
--
Time is the fire in which we all burn.