
Re: how best to package when using hardware vectorization with vector-unit specific code?



On 10.05.2017 at 19:42, Wookey wrote:
On 2017-05-10 18:01 +0200, Kay F. Jahnke wrote:
#! /bin/bash

# Later entries override earlier ones, so the last matching set wins.
for instruction_set in mmx sse sse2 sse3 ssse3 sse4 sse4a sse4.1 sse4.2 avx \
                       avx2 avx512f avx512pf avx512er avx512cd
do
  if lscpu | grep -q "$instruction_set"
  then
    bestarch=$instruction_set
  fi
done

Because it is install-time, not run-time, detection, it would go wrong
in a range of circumstances, so it is frowned upon (installing images,
hardware which gets upgraded while keeping the OS image, cross-installing,
NFS-mounting, containers, etc.).

Okay, I did not think of that. Kind of a show-stopper for my simple-minded plan.

But yes, it is possible in the absence of more correct solutions. It
would be much better to run such a 'choose-binary' script at runtime
and have it run the right one as that would work in all the
circumstances I can think of offhand.

So why don't I use a run-time chooser then? I am currently doing that with the shell script above, simply passing all arguments on to a call to myprogram_$bestarch. Of course this would have to be made more comprehensive, but it could always fall back on the scalar variant if it can't positively identify a friendly environment. Alternatively, I could have C++ code do the job. Which is better? Can I rely on a specific shell, and on lscpu, being present on all systems Debian runs on? Or is there perhaps a ready-made solution for just this purpose?
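To make the run-time chooser idea concrete, here is a minimal sketch in POSIX sh that does not need lscpu. Everything specific in it is an assumption for illustration: the variant names (scalar, sse, avx, avx2), the myprogram_<variant> naming, and reading the flags from /proc/cpuinfo, which only exists on Linux. Anything it cannot positively identify ends up on the scalar variant.

```shell
#!/bin/sh
# Sketch only. Variant names and the use of /proc/cpuinfo are assumptions,
# not something settled in this thread.

# pick_variant FLAGS
# Print the best variant for a space-separated CPU flag list.
# Candidates are ordered from weakest to strongest; the last match wins.
pick_variant() {
    best=scalar
    for candidate in sse avx avx2; do
        case " $1 " in
            *" $candidate "*) best=$candidate ;;
        esac
    done
    printf '%s\n' "$best"
}

# cpu_flags
# Print the flag list of the first CPU, or nothing if it cannot be read
# (no /proc, or an architecture without a "flags" line).
cpu_flags() {
    [ -r /proc/cpuinfo ] || return 0
    awk -F': ' '/^flags/ { print $2; exit }' /proc/cpuinfo
}

# A wrapper script would then end with something like:
#   exec "myprogram_$(pick_variant "$(cpu_flags)")" "$@"
```

Matching whole words avoids the substring problem of a plain grep, where "avx" also matches "avx2". An empty or unreadable flag list yields "scalar".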

How fat would 15 versions of the program be (on x86)? Do you really
need all 15? Might a subset suffice?

Not really 15; I think even four would be good enough. If the processor doesn't even have SSE, it's a bit slow for that kind of application anyway, so I'd say at least SSE, AVX, and AVX2, plus the scalar version as a runs-everywhere fallback. And the code itself is slim; I prefer to link libVc.a in statically for performance reasons, but SFML and vigra can be linked dynamically. The binaries are about 1 MB each.

Where should the architecture-dependent binaries go in the target's file system, to make sure they're not in the execution path accidentally?
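One common convention, offered here as an assumption rather than an answer from this thread, is to keep such internal binaries in a package-private directory under /usr/lib, which the FHS describes as the place for internal binaries not meant to be run directly, and to put only a small wrapper on the PATH. The directory name /usr/lib/myprogram and the function below are hypothetical; the sketch just shows the lookup with a scalar fallback.

```shell
#!/bin/sh
# Sketch only. The layout <dir>/myprogram_<variant> is a hypothetical example.

# resolve_binary DIR VARIANT
# Print the path of the requested variant if it is executable, otherwise the
# path of the scalar variant. Return non-zero if neither is available.
resolve_binary() {
    for name in "myprogram_$2" myprogram_scalar; do
        if [ -x "$1/$name" ]; then
            printf '%s\n' "$1/$name"
            return 0
        fi
    done
    return 1
}

# A wrapper on the PATH could then run, for example:
#   exec "$(resolve_binary /usr/lib/myprogram avx2)" "$@"
```

Since the variants live outside the PATH, they cannot be started by accident, and a missing variant degrades to the scalar build instead of failing.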

Does this software only work on x86 or does it work on other
architectures, with other vector units (neon, altivec)? Remember to
consider more than just x86 when pondering this issue.

I am using Vc, so whatever Vc supports, my software supports as well. Vc is
a generic C++ library to abstract away the architecture. I've coded it so that
my program will also run without using the vector units.

OK. Looks like neon support is 'in development'. And you can run on
non-vectorised hardware (but only very slowly).

In fact, non-vectorized performance isn't all that bad; the program is very memory-bound, with lots of DDA and irregular, possibly widely scattered memory access patterns. Vectorization speeds up the processing pipelines only: AVX2 roughly halves my rendering times.

Kay

