Re: How to write optimized code for an instruction set not supported by my computer?

To: debian-user@lists.debian.org
Subject: Re: How to write optimized code for an instruction set not supported by my computer?
From: Mario Castelán Castro <marioxcc.MT@yandex.com>
Date: Sat, 14 Nov 2015 09:22:01 -0600
Message-id: <56475199.2010509@yandex.com>
In-reply-to: <CAAr43iP1bFTgiRoArDFxhUywZ9wo7aqw=CRvqSogdadxMsQ3Sw@mail.gmail.com>
References: <563BD91E.6050902@yandex.com> <563FD2B5.4060702@yandex.com> <56467B87.3090804@yandex.com> <CAAr43iP1bFTgiRoArDFxhUywZ9wo7aqw=CRvqSogdadxMsQ3Sw@mail.gmail.com>

On 13/11/15 at 19:32, Joel Rees wrote:

Do you use the -S option to get assembly language output to look at?


Yes.

Yeah, SIMD instructions can also be used as a rather expensive
substitute for pure DMA. Expensive because they are used for more than
pure data move.

What do you mean by "pure DMA" in this context? If you mean "direct memory access", that is a concept that applies to peripherals that access memory directly, not to the CPU.

Those computations aren't as simple as you might think. Quite a bit of
branching, and the instruction cache having to restart, etc. It's
definitely not SIMD, unless your SIMD has table lookup that can operate
independently in each data stream.

The computation does not require any branching (looping over the array is not counted as part of the computation). Conditional expressions are evaluated using masks. I have already implemented it that way, processing 48 input bits at a time that expand to 64 output bits, using 64-bit integers. It's like SSE or AVX, but the distinction between vector elements is done completely in software; as far as the hardware is concerned, it's 64-bit logic and arithmetic. This implementation is a kind of "prototype" that I made so that I could actually evaluate how complex it is. It is simpler than I thought; with SSE or AVX it will be simpler still, since they have the separation between vector elements built in and have instructions dedicated to making masks and shuffling data.
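A minimal sketch of the mask technique described above, in plain C with 64-bit integers. It covers only the per-lane mapping step, not the spreading of 48 input bits into eight lanes. The mapping itself is an assumption made for illustration: each byte lane holds a 6-bit value that is turned into a character of the standard base64 alphabet. The names `LANES`, `ge_mask`, `sextets_to_ascii` and `swar_map8` are hypothetical and do not come from the prototype mentioned in the message.

```c
#include <stdint.h>
#include <string.h>

/* Replicate one byte value into all eight byte lanes of a 64-bit word. */
#define LANES(b) ((uint64_t)(b) * 0x0101010101010101ULL)

/* 0xFF in each lane holding a value >= k, 0x00 elsewhere.
 * Only valid while every lane of v is below 64 and 1 <= k <= 64,
 * so that the addition never carries into the neighbouring lane. */
static uint64_t ge_mask(uint64_t v, unsigned k)
{
    uint64_t top = (v + LANES(128u - k)) & LANES(0x80);
    return (top >> 7) * 0xFFu;
}

/* Map eight 6-bit values (one per lane) to ASCII without branching.
 * Assumed mapping: 0..25 -> 'A'..'Z', 26..51 -> 'a'..'z',
 * 52..61 -> '0'..'9', 62 -> '+', 63 -> '/'.
 * Each step keeps every lane within 0..255, so the ordinary 64-bit
 * add and subtract never carry or borrow across lanes. */
static uint64_t sextets_to_ascii(uint64_t v)
{
    uint64_t r = v + LANES('A');
    r += ge_mask(v, 26) & LANES(6);   /* 'a' - 26 - 'A'          */
    r -= ge_mask(v, 52) & LANES(75);  /* down to '0' - 52        */
    r -= ge_mask(v, 62) & LANES(15);  /* down to '+' - 62        */
    r += ge_mask(v, 63) & LANES(3);   /* up to '/' - 63          */
    return r;
}

/* Convenience wrapper: eight 6-bit values in, eight characters out. */
void swar_map8(const unsigned char in[8], unsigned char out[8])
{
    uint64_t v;
    memcpy(&v, in, 8);
    v = sextets_to_ascii(v);
    memcpy(out, &v, 8);
}
```

Feeding the lanes 0, 25, 26, 51, 52, 61, 62, 63 to `swar_map8` yields the bytes `AZaz09+/`. Byte order does not matter here because every lane is treated identically.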

Also, SSE can process 96 bits of input that expand to 128 bits of output at a time (more with wider AVX versions). If it takes 16 instructions to process one such block (and I predict that it will take fewer), that is only 1 instruction per byte of output, compared to a lookup table implementation, which has a lower bound of 1 logical operation (a shift) AND 1 byte memory access per byte of output. This provides a very rough comparison of complexity (in terms of number of instructions) and performance, to address your concern about how complex the SIMD implementation is.
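For comparison, a lookup table implementation of the kind referred to above might look like the following sketch. It assumes, for illustration only, that the table is the standard base64 alphabet; `b64_table` and `encode_groups_lut` are hypothetical names. Each output byte costs one shift, one mask and one table read, on top of the reads of the input itself.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed 64-entry table: the standard base64 alphabet. */
static const char b64_table[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/* Encode ngroups complete 3-byte groups from in into 4 characters each.
 * No padding is handled; out must have room for 4 * ngroups bytes. */
void encode_groups_lut(const unsigned char *in, size_t ngroups, char *out)
{
    for (size_t i = 0; i < ngroups; i++, in += 3, out += 4) {
        /* Gather 24 input bits, most significant byte first. */
        uint32_t w = (uint32_t)in[0] << 16 | (uint32_t)in[1] << 8 | in[2];
        out[0] = b64_table[(w >> 18) & 63];
        out[1] = b64_table[(w >> 12) & 63];
        out[2] = b64_table[(w >> 6) & 63];
        out[3] = b64_table[w & 63];
    }
}
```

With the input bytes `Man` and `ngroups` set to 1, the four output characters are `TWFu`.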

As for "instruction cache having to restart": I have never read that in any optimization or ISA manual for the case of SSE or AVX; that seems to be a misconception on your part. Can you point to a source?

[3]: A lookup table implementation accesses the input data *and* the lookup
table;


That's a really small table. Thirty years ago, it would have caused
cache thrashing. Not so much now.

It is not about cache size but about the required cache throughput. Using a lookup table means more reads than an arithmetical implementation (as I said, it requires reading the input data *and* the table), and data has to be read and written in very small blocks (1 byte for the lookup table, 6 bytes at most for the input data), which is generally slower than using bigger blocks.

replacing the lookup table with SIMD arithmetic reduces the demand on
memory (including cache) throughput.


Does the AVX have that kind of table access? If you have a url for the
relevant specs, I'd be interested.

AVX2 has gather support (the original AVX does not). You can look up the details in Intel's or AMD's ISA manuals. However, _note that I am *not* talking about accessing a table with SIMD instructions._ I am completely replacing the lookup table with arithmetic done using SIMD instructions.
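To make the distinction concrete, here is a rough sketch of table-free arithmetic with SSE2 intrinsics: no gather and no table, only compares that produce masks, followed by masked adds and subtracts. As an illustrative assumption, each of the 16 byte lanes holds a 6-bit value that is mapped to the standard base64 alphabet; the function names are hypothetical, and this is not the implementation discussed in the thread.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Map sixteen 6-bit values (one per byte lane) to ASCII.
 * Assumed mapping: 0..25 -> 'A'..'Z', 26..51 -> 'a'..'z',
 * 52..61 -> '0'..'9', 62 -> '+', 63 -> '/'.
 * _mm_cmpgt_epi8 is a signed compare, which is fine for values 0..63;
 * it yields 0xFF in lanes where the condition holds, 0x00 elsewhere. */
static __m128i sextets_to_ascii_sse2(__m128i v)
{
    __m128i r = _mm_add_epi8(v, _mm_set1_epi8('A'));
    r = _mm_add_epi8(r, _mm_and_si128(_mm_cmpgt_epi8(v, _mm_set1_epi8(25)),
                                      _mm_set1_epi8(6)));
    r = _mm_sub_epi8(r, _mm_and_si128(_mm_cmpgt_epi8(v, _mm_set1_epi8(51)),
                                      _mm_set1_epi8(75)));
    r = _mm_sub_epi8(r, _mm_and_si128(_mm_cmpgt_epi8(v, _mm_set1_epi8(61)),
                                      _mm_set1_epi8(15)));
    r = _mm_add_epi8(r, _mm_and_si128(_mm_cmpgt_epi8(v, _mm_set1_epi8(62)),
                                      _mm_set1_epi8(3)));
    return r;
}

/* Convenience wrapper using unaligned loads and stores. */
void sse2_map16(const unsigned char in[16], unsigned char out[16])
{
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    _mm_storeu_si128((__m128i *)out, sextets_to_ascii_sse2(v));
}
```

The only memory traffic is one 16-byte load and one 16-byte store per block; the constants live in registers. This requires an x86 target with SSE2, which is the baseline on x86-64.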
