
Re: How to write optimized code for an instruction set not supported by my computer?



tomas@tuxteam.de writes:
I see. But a soft emulation won't give you an idea of performance
anyway? Just thinking about the whole mess from caching down to
instruction set (all of which the emulator has wildly different
timings for)... I'd guess that the single/multi-thread issue is
just a ripple in a sea of uncertainty.

I think expecting just a guess for the timings from an emulator
(at least at this level) is too much. You'd be better off with
your back-of-the-envelope calculations (and then testing, once you
get your hands on "real" hardware).

I agree. Thanks for pointing out this line of reasoning.

monnier@iro.umontreal.ca writes:
I think the question was: what makes you think AVX will improve
the performance of *your* code?  Base64 encoding/decoding should be
completely bandwidth-constrained, so it seems very unlikely that AVX
could make much of a difference.

Maybe it is bandwidth-constrained; I can't tell beforehand (and I don't think you can either). I could only say that with some certainty after doing tests.

I did some limited testing, but not enough yet. Depending on the testing method and the specific Base64 implementation, memcpy is significantly faster than a typical in-memory lookup-table implementation of Base64, indicating that computation plays a non-negligible role in performance (as opposed to the task being purely memory-constrained).
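To make that kind of comparison concrete, here is a minimal sketch of the sort of test I mean, assuming a toy lookup-table encoder (the names `b64_encode` and `run_comparison`, and the buffer sizes, are illustrative, not my actual implementation), timed against memcpy over the same input:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

static const char tbl[64] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/* Toy lookup-table Base64 encoder; n should be a multiple of 3
   (padding handling is omitted for brevity). */
static void b64_encode(const unsigned char *in, size_t n, char *out)
{
    size_t i, j = 0;
    for (i = 0; i + 2 < n; i += 3) {
        out[j++] = tbl[in[i] >> 2];
        out[j++] = tbl[((in[i] & 0x03) << 4) | (in[i + 1] >> 4)];
        out[j++] = tbl[((in[i + 1] & 0x0f) << 2) | (in[i + 2] >> 6)];
        out[j++] = tbl[in[i + 2] & 0x3f];
    }
    out[j] = '\0';
}

/* Time both operations over the same buffer. A single pass of clock()
   is noisy; a real test should repeat and take the minimum. */
static void run_comparison(size_t n)
{
    unsigned char *in = malloc(n);
    char *out = malloc(n / 3 * 4 + 1);
    if (!in || !out) { free(in); free(out); return; }
    memset(in, 'x', n);

    clock_t t0 = clock();
    memcpy(out, in, n);                  /* pure bandwidth baseline */
    clock_t t1 = clock();
    b64_encode(in, n, out);              /* bandwidth + table lookups */
    clock_t t2 = clock();

    printf("memcpy: %ld ticks, table base64: %ld ticks\n",
           (long)(t1 - t0), (long)(t2 - t1));
    free(in);
    free(out);
}
```

If the two timings were close, that would suggest the task is bandwidth-constrained; a large gap, as I observed, suggests computation matters.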

Answering your question: What makes me think that AVX, SSE, or similar SIMD instruction sets will improve the performance of my code is:

[1] SIMD instructions are more efficient for copying memory because they have less dispatch overhead: they move bigger blocks per instruction. memcpy usually takes advantage of that, so there is a benefit even in the case that the problem is bandwidth-constrained.

[2] Although Base64 is usually implemented with a lookup table, the encoding can be performed by relatively simple arithmetic, because the mapping can be described by bit expansion (6 bits to 8) and by mapping a few contiguous input ranges to output ranges. For example: 0 to 25 are mapped to 'A' (ASCII 65) to 'Z' (ASCII 90).

[3] A lookup-table implementation accesses the input data *and* the lookup table; replacing the lookup table with SIMD arithmetic reduces the demand on memory (including cache) throughput.
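As an illustration of [2] and [3], here is a scalar sketch (the function name `b64_char` is hypothetical, just for this example) of the range arithmetic that a SIMD version would perform on many bytes at once, with no table in sight:

```c
/* Map a 6-bit value (0..63) to its Base64 character by range
   arithmetic alone -- no lookup table, hence no extra memory traffic. */
static char b64_char(unsigned v)
{
    if (v < 26) return (char)('A' + v);         /* 0..25  -> 'A'..'Z' (65..90)  */
    if (v < 52) return (char)('a' + (v - 26));  /* 26..51 -> 'a'..'z' (97..122) */
    if (v < 62) return (char)('0' + (v - 52));  /* 52..61 -> '0'..'9' (48..57)  */
    return v == 62 ? '+' : '/';                 /* 62 -> '+', 63 -> '/' */
}
```

A SIMD implementation would evaluate those range comparisons as masks over 16 or 32 lanes at once (compare-and-blend style), which is exactly what removes the table lookups from the memory path.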

I do not claim to be certain that this will improve performance, but there is a good chance that it will, and I will know (in my particular case, on my particular CPU) after completing my implementation.

Regards.

