[Date Prev][Date Next] [Thread Prev][Thread Next] [Date Index] [Thread Index]

Re: How to write optimized code for an instruction set not supported by my computer?



El 13/11/15 a las 19:32, Joel Rees escribió:
Do you use the -S option to get assembly language output to look at?

Yes.

Yeah, SIMD instructions can also be used as a rather expensive
substitute for pure DMA. Expensive because they are used for more than
pure data move.

What do you mean by "pure DMA" in this context?. If you mean "direct memory access", that is a concept that applies to peripherals who access memory directly, not the CPU.

Those computations aren't as simple as you might think. Quite a bit of
branching, and the instruction cache having to restart, etc. It's
defnitely not SIMD, unless your SIMD has table lookup that can operate
independently in each data stream.

The computation does not require any branching (looping throughout the array not counted as part of the computation). Conditional expressions are evaluated using masks. I have already implemented it that way processing 48 input bits at a time that expand to 64 output bits, using 64 bits integers. It's like SSE or AVX, but the distinction between vector elements is done completely in software; as far as hardware is concerned it's 64 bits logic and arithmetic. This implementation is like a "prototype" that I made so that I could actually evaluate how complex it is. It is simpler than I thought; with SSE or AVX it will be simpler still, since they have the separation between vector elements built in and have instructions dedicated to making masks and shuffling data.

Also, SSE can process 96 bits of input that expand to 128 bits of output at a time (more with wider AVX versions). If it takes 16 instructions to process one such block (and I predict that it will take less), that is only 1 instruction per byte of output, to be compared to a lookup table implementation that has a lower bound of 1 logical operation (a shift) AND 1 byte memory access per byte of output. This provides a very rough comparison of complexity (in terms of numbers of instructions) and performance to address your concern about how complex the SIMD implementation is.

As for "instruction cache having to restart": I have never read that in any optimization or ISA manual for the case of SSE or AVX; that seems to be a misconception of you. Can you point to a source?.

[3]: A lookup table implementation access the input data *and* the lookup
table;

That's a really small table. Thirty years ago, it would have caused
cache thrashing. Not so much now.

It is not about cache size but the required cache throughput. Using a lookup table means more reads than an arithmetical implementation (like I said: it requires reading the input data *and* the table), and data has to be read and written in very small blocks (1 byte for the lookup table, 6 bytes at most for the input data), which is generally slower than using bigger blocks.

replacing the lookup table with SIMD arithmetic reduces the demand on
memory (including cache) throughput.

Does the AVX have that kind of table access? If you have a url for the
relevant specs, I'd be interested.

AVX has gather support. You can look up the details in the Intel's or AMD's ISA manuals. However, _note that I am *not* talking about accessing a table with SIMD instructions._ I am completely replacing the lookup table with arithmetic done using SIMD instructions.


Reply to: