
Re: How to write optimized code for an instruction set not supported by my computer?

On Sat, Nov 14, 2015 at 9:08 AM, Mario Castelán Castro
<marioxcc.MT@yandex.com> wrote:
> [...]
> monnier@iro.umontreal.ca writes:
>> I think the question was: what makes you think AVX will improve
>> the performance of *your* code?  Base64 encoding/decoding should be
>> completely bandwidth-constrained, so it seems very unlikely that AVX
>> could make much of a difference.
> Maybe it's bandwidth constrained; I can't tell beforehand (and I don't think
> you can either);

Some of us do have experience working with base64.  :)

> I could only say that with some certainty after doing
> tests.
> I did some limited testing but not enough yet. Depending on the testing
> method and the specific Base64 implementation, memcpy is significantly
> faster than a typical lookup table in memory implementation of Base64;

memcpy() is a special case. There is usually no computation in the
inner loop(s), although there will often be some carefully crafted
setup and/or loop unrolling, often hand-written to avoid the vagaries
of compiler optimization (and the resulting bad interaction with cache
controllers, etc.). (On some CPUs, there is no inner loop, memcpy
being handed off to DMA, but that also implies some scheduling policy
relative to the DMA controller hardware.)

Do you use the -S option to get assembly language output to look at?

> indicating that computation has a non-negligible role in performance
> (as opposed to being memory constrained).
> Answering your question: What makes me think that AVX, SSE, or similar SIMD
> instruction sets will improve the performance of my code is:
> [1] SIMD instructions are more efficient for copying memory because they
> have less dispatch overhead since they copy in bigger blocks. memcpy usually
> takes advantage of that; so there is a benefit in the case that the problem
> is bandwidth constrained.

Yeah, SIMD instructions can also be used as a rather expensive
substitute for pure DMA. Expensive because they are used for more than
pure data move.

> [2] Although Base64 is usually implemented with a lookup table, the encoding
> can be performed by relatively simple arithmetical computations, because the
> mapping can be described by bit expansion (6 bits to 8) and mapping a few
> contiguous input ranges to output ranges. For example: 0 to 25 are mapped to
> 'A' (ASCII 65) to 'Z' (ASCII 90).

Those computations aren't as simple as you might think. Quite a bit of
branching, and the instruction cache having to restart, etc. It's
definitely not SIMD, unless your SIMD has table lookup that can operate
independently in each data stream.
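
For reference, here is a rough scalar sketch of the range mapping under
discussion, written with comparisons in place of explicit branches. The
names b64_ref_table and b64_char_arith are made up for illustration, and
nothing here guarantees that a given compiler emits branch-free or
vectorized code for it; that would have to be checked in the -S output.

```c
#include <stdint.h>

/* Reference: the usual 64-entry lookup table (hypothetical name). */
static const char b64_ref_table[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/* Map a 6-bit value (0..63) to its Base64 character by arithmetic.
 * Each comparison evaluates to 0 or 1, so the result is adjusted
 * once per range boundary instead of being looked up. */
static uint8_t b64_char_arith(uint8_t v)
{
    int c = 'A' + v;        /*  0..25 -> 'A'..'Z' (65..90)       */
    c += (v > 25) * 6;      /* 26..51 -> 'a'..'z' (97..122)      */
    c -= (v > 51) * 75;     /* 52..61 -> '0'..'9' (48..57)       */
    c -= (v > 61) * 15;     /* 62     -> '+'      (43)           */
    c += (v > 62) * 3;      /* 63     -> '/'      (47)           */
    return (uint8_t)c;
}
```

The four boundary comparisons are the cost that has to be weighed
against one load from the table per output character.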

> [3]: A lookup table implementation accesses the input data *and* the lookup
> table;

That's a really small table. Thirty years ago, it would have caused
cache thrashing. Not so much now.
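
To put a size on it, a minimal sketch of the table-driven inner loop,
handling only complete 3-byte groups (no padding). The names b64_table
and b64_encode_groups are hypothetical, not taken from any particular
implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* The whole table: 64 characters (plus the string terminator). */
static const char b64_table[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/* Encode every complete 3-byte group of in[0..n) into out.
 * Trailing bytes that do not fill a group are ignored here.
 * out must have room for 4 * (n / 3) characters.
 * Returns the number of characters written. */
static size_t b64_encode_groups(const uint8_t *in, size_t n, char *out)
{
    size_t o = 0;
    for (size_t i = 0; i + 3 <= n; i += 3) {
        /* Pack 24 input bits, then split them into four 6-bit indices. */
        uint32_t w = ((uint32_t)in[i] << 16)
                   | ((uint32_t)in[i + 1] << 8)
                   |  (uint32_t)in[i + 2];
        out[o++] = b64_table[(w >> 18) & 63];
        out[o++] = b64_table[(w >> 12) & 63];
        out[o++] = b64_table[(w >> 6) & 63];
        out[o++] = b64_table[w & 63];
    }
    return o;
}
```

Each group costs three input byte reads and four table reads, and all
of the table reads land in the same 64 characters.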

> replacing the lookup table with SIMD arithmetic reduces the demand on
> memory (including cache) throughput.

Does AVX have that kind of table access? If you have a URL for the
relevant specs, I'd be interested.

> I do not claim to be certain that this will improve performance, but there
> is a very good possibility that it does, and I will know (in my particular
> case, for my particular CPU) after completing my implementation.
> Regards.

Joel Rees

Be careful when you look at conspiracy.
Arm yourself with knowledge of yourself, as well:
