[Date Prev][Date Next] [Thread Prev][Thread Next] [Date Index] [Thread Index]

Re: How to write optimized code for an instruction set not supported by my computer?

On Sun, Nov 15, 2015 at 12:22 AM, Mario Castelán Castro
<marioxcc.MT@yandex.com> wrote:
> El 13/11/15 a las 19:32, Joel Rees escribió:
>> Do you use the -S option to get assembly language output to look at?
> Yes.
>> Yeah, SIMD instructions can also be used as a rather expensive
>> substitute for pure DMA. Expensive because they are used for more than
>> pure data move.
> What do you mean by "pure DMA" in this context?. If you mean "direct memory
> access", that is a concept that applies to peripherals who access memory
> directly, not the CPU.

When the CPU provides DMA instructions, it's DMA.
I don't remember how far Intel has gone with this, but many processors
implement string moves as full-fledged DMA, allowing the CPU to
continue processing other instructions in parallel.

Yeah, it's a bit tricky, and the compiler has to be a little
clairvoyant about pointers.

>> Those computations aren't as simple as you might think. Quite a bit of
>> branching, and the instruction cache having to restart,

Instruction queue. Sorry.

>> etc. It's
>> defnitely not SIMD, unless your SIMD has table lookup that can operate
>> independently in each data stream.

And I'm losing track of my thoughts there, as well.

> The computation does not require any branching (looping throughout the array
> not counted as part of the computation). Conditional expressions are
> evaluated using masks.

At the assembler level or compiler level?

> I have already implemented it that way processing 48
> input bits at a time that expand to 64 output bits, using 64 bits integers.

I can see that much for breaking the eight 6-bit fields out of a pair
of bytes, but the other stuff is puzzling me. In particular, pulling
six bytes out of memory is a bit field extraction, which is probably
not going to use AVX or whatever. And the parallel conversion from
binary to ascii is going to have half the processing units idle about
half the time in SIMD, and you can't predict which will be idle.

Idling processing units is not always bad, of course, if speeding up
the process is worth the cost of not letting the SIMD hardware work on
other things.

The reverse conversion has exceptional conditions, so it can only get
harder to parallelize.

> It's like SSE or AVX, but the distinction between vector elements is done
> completely in software;

So you are talking C level code?

> as far as hardware is concerned it's 64 bits logic
> and arithmetic. This implementation is like a "prototype" that I made so
> that I could actually evaluate how complex it is. It is simpler than I
> thought; with SSE or AVX it will be simpler still, since they have the
> separation between vector elements built in and have instructions dedicated
> to making masks and shuffling data.

If you know this much about the instructions, why are you wondering
about emulation? Why not just get (access to) the hardware?

> Also, SSE can process 96 bits of input that expand to 128 bits of output at
> a time (more with wider AVX versions). If it takes 16 instructions to
> process one such block (and I predict that it will take less), that is only
> 1 instruction per byte of output, to be compared to a lookup table
> implementation that has a lower bound of 1 logical operation (a shift) AND 1
> byte memory access per byte of output. This provides a very rough comparison
> of complexity (in terms of numbers of instructions) and performance to
> address your concern about how complex the SIMD implementation is.

And you don't actually need the hardware for this. In fact, you can
hand-assemble the SIMD instructions and get a good guess about how
well things will work.

> As for "instruction cache having to restart": I have never read that in any
> optimization or ISA manual for the case of SSE or AVX; that seems to be a
> misconception of you. Can you point to a source?.

Again, sorry for the brain gas. Instruction queue. Pipeline.
Restarting the pipeline.

But, if you are using masks, not in C level code, of course, but in
assembler, you are avoiding the restarts. You just have a varying
length no-op in your code, which might be a good trade-off, especially
since it would be (in hand assembled code on a CPU that can do it)
about four of eight instructions of effective no-op on expected

>>> [3]: A lookup table implementation access the input data *and* the lookup
>>> table;
>> That's a really small table. Thirty years ago, it would have caused
>> cache thrashing. Not so much now.

Sorry about mixing my thoughts without warning.

> It is not about cache size but the required cache throughput. Using a lookup
> table means more reads than an arithmetical implementation (like I said: it
> requires reading the input data *and* the table), and data has to be read
> and written in very small blocks (1 byte for the lookup table, 6 bytes at
> most for the input data), which is generally slower than using bigger
> blocks.

I was thinking out loud about dedicatable hardware that could lock the
table in cache. Sorry.

>>> replacing the lookup table with SIMD arithmetic reduces the demand on
>>> memory (including cache) throughput.
>> Does the AVX have that kind of table access? If you have a url for the
>> relevant specs, I'd be interested.
> AVX has gather support. You can look up the details in the Intel's or AMD's
> ISA manuals. However, _note that I am *not* talking about accessing a table
> with SIMD instructions._ I am completely replacing the lookup table with
> arithmetic done using SIMD instructions.

After I posted the previous, i had vague memories of some such, but I
couldn't remember the word "gather". Heh.

Anyway, the point is that you are moving your project forward, right?

Joel Rees

Be careful when you look at conspiracy.
Arm yourself with knowledge of yourself, as well:

Reply to: