Re: Request for feedback: adding additional arch support to libssw using the SIMDE headers

To: debian-med@lists.debian.org
Subject: Re: Request for feedback: adding additional arch support to libssw using the SIMDE headers
From: Fabian Klötzl <kloetzl@evolbio.mpg.de>
Date: Thu, 12 Dec 2019 13:38:51 +0100
Message-id: <ffe6e0fc-d261-6f2f-8715-2972281fba52@evolbio.mpg.de>
In-reply-to: <CAD=WrcJ2mq-mHTfKBqS7bQkqRw=2=oZNT=_SUU01iS3FkyvS-g@mail.gmail.com>
References: <CAD=WrcJ2mq-mHTfKBqS7bQkqRw=2=oZNT=_SUU01iS3FkyvS-g@mail.gmail.com>

Hi Michael,

On 12.12.19 13:05, Michael Crusoe wrote:

> Specifically I'm interested in seeing more of our packages for the latest RaspberryPI systems (arm64).

Likewise, I am interested in testing things on ARM. However, remote debugging is not much fun, and I am busy writing my dissertation anyway.

> Two downsides of using the SIMDE library:
>
> 1) Doesn't work with raw assembly, only C/C++ compiler intrinsics (<emmintrin.h> and friends)

I don't see this as a downside. Intrinsics embedded in the regular source are visible to the compiler, which enables more optimizations than an opaque assembly block would.

> 2) Switching between different types of SIMD (like using SSE fallbacks for an SSE2 operation) is done at compile time and not run time.

This is a bummer, but can be solved (see below).

> Questions for you all:
>
> 1) Is this a good idea?

I think it is a good idea, iff you have a benchmark proving that the optimizations improve the runtimes significantly. For instance, there are a number of different ways to compute the reverse complement: a switch statement is very slow, a lookup table is ten times faster, and a SIMD approach can give another 7x speedup on top of that [3].
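To make the first two variants concrete, here is a minimal sketch of a switch-based and a table-based reverse complement. It is not the code from the benchmark in [3]; the function names (revcomp_switch, revcomp_table, revcomp_table_init) are made up for illustration, and it assumes upper-case ACGT input with everything else mapped to 'N'. It only shows the two techniques; it says nothing about their speed until you benchmark it yourself.

```c
#include <stddef.h>
#include <string.h>

/* Variant 1: complement each base with a switch statement. */
static char complement_switch(char c)
{
    switch (c) {
    case 'A': return 'T';
    case 'C': return 'G';
    case 'G': return 'C';
    case 'T': return 'A';
    default:  return 'N'; /* assumption: anything else becomes N */
    }
}

/* dst must have room for len + 1 bytes and must not overlap src. */
void revcomp_switch(const char *src, size_t len, char *dst)
{
    for (size_t i = 0; i < len; i++)
        dst[len - 1 - i] = complement_switch(src[i]);
    dst[len] = '\0';
}

/* Variant 2: complement each base with a 256-entry lookup table. */
static char comp_table[256];

/* Fill the table once before the first call to revcomp_table(). */
void revcomp_table_init(void)
{
    memset(comp_table, 'N', sizeof comp_table);
    comp_table['A'] = 'T';
    comp_table['C'] = 'G';
    comp_table['G'] = 'C';
    comp_table['T'] = 'A';
}

/* Same contract as revcomp_switch(), but branch-free per base. */
void revcomp_table(const char *src, size_t len, char *dst)
{
    for (size_t i = 0; i < len; i++)
        dst[len - 1 - i] = comp_table[(unsigned char)src[i]];
    dst[len] = '\0';
}
```

Timing both functions on a long sequence is the kind of benchmark I mean above.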

> 2) Should we carry these patches if upstream doesn't accept them?

Dunno.

> 3) Any ideas about compiling with different -m{avx2,avx,sse4.2,sse4.1,ssse3,sse3,sse2,sse,mmx} settings + simple wrapper generation to pick the right executable?

I did that just recently for phylonium [1]. Here is the best approach I found: put each optimized function in a separate file and compile each file with its specific -m setting. In addition, provide a generic implementation and a single entry-point function. The entry point can then determine at call time which optimized implementation to use via __builtin_cpu_supports(). With ifuncs, this choice can even be delegated to dynamic-link time.
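A rough sketch of the call-time variant, assuming GCC or Clang. This is not the phylonium code: the function names (mismatches, mismatches_generic, mismatches_avx2) and the HAVE_X86_DISPATCH macro are hypothetical, and to keep it in one file it uses __attribute__((target("avx2"))) where the real setup would use a separate file compiled with -mavx2.

```c
#include <stddef.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define HAVE_X86_DISPATCH 1 /* hypothetical macro, for this sketch only */
#endif

/* Generic implementation: works on every architecture. */
static size_t mismatches_generic(const char *a, const char *b, size_t len)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++)
        n += (a[i] != b[i]);
    return n;
}

#ifdef HAVE_X86_DISPATCH
/* AVX2 implementation. In a real build this would sit in its own file
 * compiled with -mavx2; the target attribute stands in for that here. */
__attribute__((target("avx2")))
static size_t mismatches_avx2(const char *a, const char *b, size_t len)
{
    size_t equal = 0;
    size_t i = 0;
    for (; i + 32 <= len; i += 32) {
        __m256i va = _mm256_loadu_si256((const __m256i *)(a + i));
        __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i));
        /* One mask bit per byte; a set bit means the bytes are equal. */
        unsigned mask = (unsigned)_mm256_movemask_epi8(_mm256_cmpeq_epi8(va, vb));
        equal += (size_t)__builtin_popcount(mask);
    }
    size_t n = i - equal; /* mismatches in the vectorized part */
    for (; i < len; i++)  /* scalar tail */
        n += (a[i] != b[i]);
    return n;
}
#endif

/* Entry point: picks an implementation at call time. */
size_t mismatches(const char *a, const char *b, size_t len)
{
#ifdef HAVE_X86_DISPATCH
    if (__builtin_cpu_supports("avx2"))
        return mismatches_avx2(a, b, len);
#endif
    return mismatches_generic(a, b, len);
}
```

Callers only ever see mismatches(); on non-x86 builds the preprocessor guard removes the AVX2 path and the generic version is used.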

The devil is in the details: Hurd and kFreeBSD (and macOS) don't support ifuncs [2]. __builtin_cpu_supports() needs some help to work in ifuncs. You also have to disable the shenanigans on non-x86/whatever platforms.
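For illustration, an ifunc version could look roughly like the sketch below. The names (count_gc, resolve_count_gc, USE_IFUNC) are hypothetical, and the guard is my assumption of a conservative condition (x86-64, glibc, GCC-compatible compiler); the right condition for a given package may differ. The "help" in question is the call to __builtin_cpu_init(), which the GCC documentation requires when the CPU checks run before constructors, as they do in an ifunc resolver.

```c
#include <stddef.h>
#include <stdlib.h> /* any libc header; makes __GLIBC__ visible on glibc */

/* Assumed guard: only use ifuncs where they are known to work. */
#if defined(__x86_64__) && defined(__GLIBC__) && defined(__GNUC__)
#define USE_IFUNC 1
#else
#define USE_IFUNC 0
#endif

typedef size_t count_fn(const char *, size_t);

/* Generic implementation: counts G and C bases. */
static size_t count_gc_generic(const char *s, size_t len)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++)
        n += (s[i] == 'G') | (s[i] == 'C');
    return n;
}

#if USE_IFUNC
/* Same loop, but the compiler may use AVX2 instructions for it. */
__attribute__((target("avx2")))
static size_t count_gc_avx2(const char *s, size_t len)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++)
        n += (s[i] == 'G') | (s[i] == 'C');
    return n;
}

/* Resolver: runs once, at dynamic-link time, before constructors. */
static count_fn *resolve_count_gc(void)
{
    __builtin_cpu_init(); /* required before __builtin_cpu_supports() here */
    return __builtin_cpu_supports("avx2") ? count_gc_avx2 : count_gc_generic;
}

size_t count_gc(const char *s, size_t len)
    __attribute__((ifunc("resolve_count_gc")));
#else
/* Fallback for platforms without ifunc support. */
size_t count_gc(const char *s, size_t len)
{
    return count_gc_generic(s, len);
}
#endif
```

Either way the public symbol is count_gc(), so callers do not need to know whether an ifunc is behind it.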

I definitely think that a library is the right place for these optimizations. (That's one of the reasons I started my libdna project.) If you want to optimize libssw, you can try my approach and see how far it gets you. ☺

Best
Fabian

1: https://salsa.debian.org/med-team/phylonium/libs/
2: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=945133
3: https://github.com/kloetzl/libdna/blob/master/bench/Brevcomp.cxx

References:
- Request for feedback: adding additional arch support to libssw using the SIMDE headers
  - From: Michael Crusoe <michael.crusoe@gmail.com>