
Re: How to write optimized code for an instruction set not supported by my computer?



tomas@tuxteam.de writes:
I see. But a soft emulation won't give you an idea of performance
anyway? Just thinking about the whole mess from caching down to
instruction set (all of which the emulator has wildly different
timings for)... I'd guess that the single/multi-thread issue is
just a ripple in a sea of uncertainty.

I think expecting just a guess for the timings from an emulator
(at least at this level) is too much. You'd be better off with
your back-of-the-envelope calculations (and then testing, once you
get your hands on "real" hardware).

I agree. Thanks for pointing out this line of reasoning.

monnier@iro.umontreal.ca writes:
I think the question was: what makes you think AVX will improve
the performance of *your* code?  Base64 encoding/decoding should be
completely bandwidth-constrained, so it seems very unlikely that AVX
could make much of a difference.

Maybe it is bandwidth-constrained; I can't tell beforehand (and I don't think you can either). I could only say that with some certainty after doing tests.

I did some limited testing, but not enough yet. Depending on the testing method and the specific Base64 implementation, memcpy is significantly faster than a typical in-memory lookup-table implementation of Base64, indicating that computation plays a non-negligible role in performance (as opposed to the task being purely memory-constrained).
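To make that kind of comparison concrete, here is a minimal sketch of the sort of test I mean, assuming a toy lookup-table encoder (the names `b64_encode` and `run_comparison`, and the buffer sizes, are illustrative, not my actual implementation), timed against memcpy over the same input:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

static const char tbl[64] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/* Toy lookup-table Base64 encoder; n should be a multiple of 3
   (padding handling is omitted for brevity). */
static void b64_encode(const unsigned char *in, size_t n, char *out)
{
    size_t i, j = 0;
    for (i = 0; i + 2 < n; i += 3) {
        out[j++] = tbl[in[i] >> 2];
        out[j++] = tbl[((in[i] & 0x03) << 4) | (in[i + 1] >> 4)];
        out[j++] = tbl[((in[i + 1] & 0x0f) << 2) | (in[i + 2] >> 6)];
        out[j++] = tbl[in[i + 2] & 0x3f];
    }
    out[j] = '\0';
}

/* Time both operations over the same buffer. A single pass of clock()
   is noisy; a real test should repeat and take the minimum. */
static void run_comparison(size_t n)
{
    unsigned char *in = malloc(n);
    char *out = malloc(n / 3 * 4 + 1);
    if (!in || !out) { free(in); free(out); return; }
    memset(in, 'x', n);

    clock_t t0 = clock();
    memcpy(out, in, n);                  /* pure bandwidth baseline */
    clock_t t1 = clock();
    b64_encode(in, n, out);              /* bandwidth + table lookups */
    clock_t t2 = clock();

    printf("memcpy: %ld ticks, table base64: %ld ticks\n",
           (long)(t1 - t0), (long)(t2 - t1));
    free(in);
    free(out);
}
```

If the two timings were close, that would suggest the task is bandwidth-constrained; a large gap, as I observed, suggests computation matters.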

Answering your question: What makes me think that AVX, SSE, or similar SIMD instruction sets will improve the performance of my code is:

[1] SIMD instructions are more efficient for copying memory because they have less dispatch overhead: they move bigger blocks per instruction. memcpy usually takes advantage of that, so there is a benefit even in the case that the problem is bandwidth-constrained.

[2] Although Base64 is usually implemented with a lookup table, the encoding can be performed by relatively simple arithmetic, because the mapping can be described by bit expansion (6 bits to 8) and by mapping a few contiguous input ranges to output ranges. For example: 0 to 25 are mapped to 'A' (ASCII 65) to 'Z' (ASCII 90).

[3] A lookup-table implementation accesses the input data *and* the lookup table; replacing the lookup table with SIMD arithmetic reduces the demand on memory (including cache) throughput.
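As an illustration of [2] and [3], here is a scalar sketch (the function name `b64_char` is hypothetical, just for this example) of the range arithmetic that a SIMD version would perform on many bytes at once, with no table in sight:

```c
/* Map a 6-bit value (0..63) to its Base64 character by range
   arithmetic alone -- no lookup table, hence no extra memory traffic. */
static char b64_char(unsigned v)
{
    if (v < 26) return (char)('A' + v);         /* 0..25  -> 'A'..'Z' (65..90)  */
    if (v < 52) return (char)('a' + (v - 26));  /* 26..51 -> 'a'..'z' (97..122) */
    if (v < 62) return (char)('0' + (v - 52));  /* 52..61 -> '0'..'9' (48..57)  */
    return v == 62 ? '+' : '/';                 /* 62 -> '+', 63 -> '/' */
}
```

A SIMD implementation would evaluate those range comparisons as masks over 16 or 32 lanes at once (compare-and-blend style), which is exactly what removes the table lookups from the memory path.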

I do not claim to be certain that this will improve performance, but there is a good chance that it will, and I will know (in my particular case, on my particular CPU) after completing my implementation.

Regards.

