Hi Michael,
Specifically, I'm interested in seeing more of our packages for the latest Raspberry Pi systems (arm64).
Likewise, I am interested in testing things on ARM. However, remote debugging is not a lot of fun and I am busy writing my dissertation anyway.
Two downsides of using the SIMDE library:
1) Doesn't work with raw assembly, only C/C++ compiler intrinsics (<emmintrin.h> and friends)
I don't see this as a downside. Embedding your intrinsics into
the regular source will enable more optimizations for the
compiler.
2) Switching between different types of SIMD (like using SSE fallbacks for an SSE2 operation) is done at compile time and not run time.
This is a bummer, but can be solved (see below).
Questions for you all:
1) Is this a good idea?
I think it is a good idea, iff you have a benchmark proving that
the optimizations will improve the runtimes significantly. For
instance, there are a number of different ways to compute the
reverse complement: a switch statement is very slow, a table
is ten times faster, and a SIMD approach can give another 7x
speedup on top of that [3].
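To make the first two variants concrete, here is a minimal sketch in plain C. The function names and the choice to map unknown characters to N are my assumptions for illustration; this is not the libdna code, and it makes no claim about the exact timings.

```c
#include <stddef.h>

/* Variant 1: complement a single base with a switch statement. */
static char complement_switch(char c)
{
    switch (c) {
    case 'A': return 'T';
    case 'C': return 'G';
    case 'G': return 'C';
    case 'T': return 'A';
    default:  return 'N'; /* assumption: anything else becomes N */
    }
}

/* dest must have room for len + 1 bytes. */
void revcomp_switch(char *dest, const char *src, size_t len)
{
    for (size_t i = 0; i < len; i++)
        dest[len - 1 - i] = complement_switch(src[i]);
    dest[len] = '\0';
}

/* Variant 2: a 256-entry lookup table indexed by the byte value.
 * Entries not listed here are zero-initialised. */
static const char comp_table[256] = {
    ['A'] = 'T', ['C'] = 'G', ['G'] = 'C', ['T'] = 'A',
};

/* dest must have room for len + 1 bytes. */
void revcomp_table(char *dest, const char *src, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        char c = comp_table[(unsigned char)src[i]];
        dest[len - 1 - i] = c ? c : 'N'; /* zero entry: unknown base */
    }
    dest[len] = '\0';
}
```

Both functions produce the same output; only a benchmark on real data will tell you how much the difference matters for a given program.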
2) Should we carry these patches if upstream doesn't accept them?
Dunno.
3) Any ideas about compiling with different -m{avx2,avx,sse4.2,sse4.1,ssse3,sse3,sse2,sse,mmx} settings + simple wrapper generation to pick the right executable?
I did that just recently for phylonium [1]. Here is the best
approach I found: put each optimized function in a separate file
and compile each with its specific -m setting. Also provide a
generic implementation and a single entry-point function. The
latter can then determine at call time which optimized
implementation to use via __builtin_cpu_supports(). With ifuncs,
this choice can even be delegated to dynamic-link time.
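A minimal sketch of that layout, squeezed into one file so it compiles on its own. All names (count_gc and friends) are hypothetical and this is not the phylonium code. In a real build, count_gc_avx2() would live in its own file, be compiled with -mavx2, and use intrinsics; here it is a plain C stand-in.

```c
#include <stddef.h>

/* Generic implementation, built without any special -m flags. */
size_t count_gc_generic(const char *seq, size_t len)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++)
        n += (seq[i] == 'G' || seq[i] == 'C');
    return n;
}

/* Stand-in for the optimized variant (hypothetical). In a real
 * build: separate file, compiled with -mavx2, using intrinsics. */
size_t count_gc_avx2(const char *seq, size_t len)
{
    return count_gc_generic(seq, len);
}

/* Single entry point: picks an implementation at call time. */
size_t count_gc(const char *seq, size_t len)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    /* Only ask about x86 features when building for x86. */
    if (__builtin_cpu_supports("avx2"))
        return count_gc_avx2(seq, len);
#endif
    return count_gc_generic(seq, len);
}
```

Callers only ever see count_gc(); the per-CPU variants stay an implementation detail of the library.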
The devil is in the details: Hurd and kFreeBSD (and macOS) don't
support ifuncs [2]. __builtin_cpu_supports() needs some help to
work in ifuncs. Also, you have to disable the shenanigans for
non-x86/whatever platforms.
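Roughly, an ifunc version could look like the sketch below. The names are hypothetical and the preprocessor guard is my assumption of a reasonable condition (GCC or Clang, glibc on Linux, x86), not a tested portability matrix. The "help" is the call to __builtin_cpu_init(): the resolver runs before any constructors, so the CPU feature data has to be initialised by hand. Everywhere else the code falls back to a plain function.

```c
#include <stddef.h>
#include <stdlib.h> /* any libc header; makes __GLIBC__ visible */

typedef size_t (*count_n_fn)(const char *, size_t);

/* Generic implementation. */
size_t count_n_generic(const char *seq, size_t len)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++)
        n += (seq[i] == 'N');
    return n;
}

/* Stand-in for a variant compiled with -mavx2 in its own file. */
size_t count_n_avx2(const char *seq, size_t len)
{
    return count_n_generic(seq, len);
}

#if defined(__GNUC__) && defined(__linux__) && defined(__GLIBC__) && \
    defined(__ELF__) && (defined(__x86_64__) || defined(__i386__))
/* Resolver: called by the dynamic linker, before constructors. */
static count_n_fn resolve_count_n(void)
{
    __builtin_cpu_init(); /* required this early */
    return __builtin_cpu_supports("avx2") ? count_n_avx2
                                          : count_n_generic;
}

size_t count_n(const char *seq, size_t len)
    __attribute__((ifunc("resolve_count_n")));
#else
/* No ifunc support assumed: always use the generic version. */
size_t count_n(const char *seq, size_t len)
{
    return count_n_generic(seq, len);
}
#endif
```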
I definitely think that a library is the right place for these
optimizations. (That's one of the reasons I started my libdna
project.) If you want to optimize libssw you can try using my
approach and see how far it gets you. ☺
Best
Fabian
1: https://salsa.debian.org/med-team/phylonium/libs/
2: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=945133
3: https://github.com/kloetzl/libdna/blob/master/bench/Brevcomp.cxx