Re: Announcing cdrskin-0.7.2
| From: Thomas Schmitt <scdbackup@gmx.net>
| Some of my own experiments yielded surprising
| setbacks. E.g. I replaced
| gfpow[44 - i]
| by
| h45[i]
| with a suitable constant array h45[].
| This was 7 percent slower !
| (I suspect a less fortunate cache situation.)
Right. gcc can fold the constant 44 into the address computation, so
the indexing itself costs very little. You might have saved a unary -
operation, but I don't know. As you suggest, the extra burden on the
cache may be the problem.
Even worse: the price of a cache burden depends on the CPU's cache
implementation and size, so testing on one machine does not give a fair
picture of performance on other machines.
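
To illustrate the folding point, here is a small sketch of my own. It is
not libburn code: the table contents are arbitrary placeholders, the
range 0..44 for i is an assumption, and h45[] is filled at run time
rather than being a true constant array. It only shows that
gfpow[44 - i] can be read as a fixed base address plus a negated index,
while h45[] is a second table that competes for cache.

```c
#include <assert.h>

/* Placeholder table; the values are arbitrary, not real GF powers. */
static unsigned char gfpow[255];

static void fill_placeholder(void)
{
    int i;

    for (i = 0; i < 255; i++)
        gfpow[i] = (unsigned char) (i * 7 + 1);
}

/* As written in the source: one table, index 44 - i. */
static unsigned char lookup_plain(int i)
{
    assert(i >= 0 && i <= 44);   /* assumed range of i */
    return gfpow[44 - i];
}

/* What the compiler can effectively do: fold 44 into the base
   address, leaving only the negated index for run time. */
static unsigned char lookup_folded(int i)
{
    const unsigned char *base = gfpow + 44;

    return base[-i];
}

/* The experiment from the quoted mail: a second table with
   h45[i] == gfpow[44 - i]. It adds 45 bytes of table data. */
static unsigned char h45[45];

static void fill_h45(void)
{
    int i;

    for (i = 0; i < 45; i++)
        h45[i] = gfpow[44 - i];
}

/* Returns 1 if all three ways of looking up agree for i = 0..44. */
static int lookups_agree(void)
{
    int i;

    fill_placeholder();
    fill_h45();
    for (i = 0; i < 45; i++)
        if (lookup_plain(i) != lookup_folded(i) ||
            lookup_plain(i) != h45[i])
            return 0;
    return 1;
}
```

Whether the extra table really causes the 7 percent is something only a
measurement on the machine in question can tell.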
| > In burn_rspc_div, you return -1 if the division is by 0.
|
| This has been replaced by a specialized
| burn_rspc_div_3() which divides by (x^1+1).
| Fewer ifs, fewer array lookups, but no speed-up:
|
| /* Divides by polynomial 0x03. Derived from burn_rspc_div() */
| static unsigned char burn_rspc_div_3(unsigned char a)
| {
| if (a == 0)
| return 0;
| if (gflog[a] >= 25)
| return gfpow[gflog[a] - 25];
| else
| return gfpow[230 + gflog[a]];
| }
Given that gfpow is doubled (so that gfpow[i + 255] == gfpow[i]), the
index never has to wrap, and this code should be faster and simpler:
| {
| if (a == 0)
| return 0;
| /* Note: gflog(x^1 + x^0) == 25 */
| return gfpow[(255 - 25) + gflog[a]];
| }
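
For anyone who wants to check the claim, here is a self-contained sketch.
The table setup is my assumption, not copied from libburn: generator 2
and field polynomial 0x11D, which is at least consistent with
gflog[0x03] == 25. The function names div_3_doubled and div_3_branchy
are made up for this comparison.

```c
#include <assert.h>

/* Hypothetical stand-ins for the real tables. */
static unsigned char gfpow[510];   /* doubled: gfpow[i + 255] == gfpow[i] */
static unsigned char gflog[256];   /* gflog[0] is unused */

/* Fill both tables for GF(2^8), generator 2, polynomial 0x11D. */
static void gf_init(void)
{
    unsigned int x = 1;
    int i;

    for (i = 0; i < 255; i++) {
        gfpow[i] = gfpow[i + 255] = (unsigned char) x;
        gflog[x] = (unsigned char) i;
        x <<= 1;
        if (x & 0x100)
            x ^= 0x11D;
    }
    assert(gflog[0x03] == 25);   /* the constant used in this thread */
}

/* Single lookup path; the largest index is 230 + 254 == 484. */
static unsigned char div_3_doubled(unsigned char a)
{
    if (a == 0)
        return 0;
    return gfpow[(255 - 25) + gflog[a]];
}

/* The two-branch variant from the quoted code, for comparison. */
static unsigned char div_3_branchy(unsigned char a)
{
    if (a == 0)
        return 0;
    if (gflog[a] >= 25)
        return gfpow[gflog[a] - 25];
    return gfpow[230 + gflog[a]];
}

/* Returns 1 if both variants agree on all 256 inputs. */
static int div_3_variants_agree(void)
{
    int a;

    gf_init();
    for (a = 0; a < 256; a++)
        if (div_3_doubled((unsigned char) a) !=
            div_3_branchy((unsigned char) a))
            return 0;
    return 1;
}
```

This only shows that the two variants compute the same thing. It says
nothing about which one is faster.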
I think that 0x03 has a multiplicative inverse too. This would allow
a division to be replaced by a multiplication.
Since gflog(0x03) is 25, the multiplicative inverse ought to have a
gflog of -25 == 230 (mod 255). It turns out that gflog(244) == 230 so 244 ==
0xF4 appears to be the multiplicative inverse. Curiously,
0xF4 + 0x03 == 0xF7 for either kind of + (integer or XOR), since the
two values share no set bits.
So, not too surprisingly, 255-25 could be replaced by 0+230. No
advantage, just interesting. It still requires the gfpow table to be
doubled.
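
The inverse can be checked without any tables. The sketch below is mine
and assumes the field polynomial 0x11D; gf_mul and div_3_by_inverse are
illustrative names, not libburn functions. It multiplies by shifting and
XORing, so it is slow, but it does not depend on gfpow or gflog being
right.

```c
#include <assert.h>

/* Table-free multiplication in GF(2^8), polynomial 0x11D (assumed). */
static unsigned char gf_mul(unsigned char a, unsigned char b)
{
    unsigned int acc = 0, x = a;

    while (b) {
        if (b & 1)
            acc ^= x;          /* add a * 2^k for each set bit of b */
        x <<= 1;
        if (x & 0x100)
            x ^= 0x11D;        /* reduce back into 8 bits */
        b >>= 1;
    }
    return (unsigned char) acc;
}

/* Division by 0x03 expressed as multiplication by its inverse 0xF4. */
static unsigned char div_3_by_inverse(unsigned char a)
{
    return gf_mul(a, 0xF4);
}

/* Returns 1 if (a / 3) * 3 == a holds for all 256 byte values. */
static int inverse_roundtrip_ok(void)
{
    int a;

    assert(gf_mul(0x03, 0xF4) == 0x01);
    for (a = 0; a < 256; a++)
        if (gf_mul(div_3_by_inverse((unsigned char) a), 0x03) != a)
            return 0;
    return 1;
}
```

In table form, multiplying by 0xF4 is the same lookup as dividing by
0x03, which matches the "no advantage" remark above.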
| I trust in gcc -O2 that it handles the double
| lookup of gfpow[a] properly.
Probably.
| The code swallowed far more obvious workload
| improvements without showing speed reactions.
Right. I'm shooting in the dark, given that I'm not testing, let alone
measuring.
| ------------------------------------------------
|
| I see some potential in parallelization.
| We have at least 32 bit for exor operations.
| There are two neighbored bytes multiplied by
| the same byte simultaneously.
|
| But already now a 1000 MHz CPU can easily feed
| a 48x CD stream. I am not aware of faster CD
| media. And this stuff is for CD only.
OK.
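
To make the parallelization idea quoted above a bit more concrete, here
is a sketch of my own. It is not the table-based method used in the code
under discussion: it packs four bytes into one 32-bit word and
multiplies all of them by the same constant with shifts and XORs. The
polynomial 0x11D and all function names are assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar reference: table-free multiply in GF(2^8), polynomial 0x11D. */
static unsigned char gf_mul(unsigned char a, unsigned char b)
{
    unsigned int acc = 0, x = a;

    while (b) {
        if (b & 1)
            acc ^= x;
        x <<= 1;
        if (x & 0x100)
            x ^= 0x11D;
        b >>= 1;
    }
    return (unsigned char) acc;
}

/* Multiply each of the four packed bytes by 2. */
static uint32_t gf_mul2_x4(uint32_t v)
{
    uint32_t hi = v & 0x80808080u;              /* lanes that overflow */
    uint32_t shifted = (v & 0x7F7F7F7Fu) << 1;  /* shift within lanes */

    /* (hi >> 7) holds 0 or 1 per lane; * 0x1D puts the reduction
       constant into exactly the lanes that overflowed. */
    return shifted ^ ((hi >> 7) * 0x1Du);
}

/* Multiply each of the four packed bytes by the same constant c. */
static uint32_t gf_mul_x4(uint32_t v, unsigned char c)
{
    uint32_t acc = 0;

    while (c) {
        if (c & 1)
            acc ^= v;
        v = gf_mul2_x4(v);
        c >>= 1;
    }
    return acc;
}

/* Returns 1 if every lane matches the scalar result for all c and b. */
static int packed_matches_scalar(void)
{
    unsigned int c, b, k;

    for (c = 0; c < 256; c++)
        for (b = 0; b < 256; b++) {
            unsigned char in[4];
            uint32_t v = 0, r;

            in[0] = (unsigned char) b;
            in[1] = (unsigned char) (b ^ 0x5A);
            in[2] = (unsigned char) (255 - b);
            in[3] = (unsigned char) (b * 3);
            for (k = 0; k < 4; k++)
                v |= (uint32_t) in[k] << (8 * k);
            r = gf_mul_x4(v, (unsigned char) c);
            for (k = 0; k < 4; k++)
                if (((r >> (8 * k)) & 0xFF) !=
                    gf_mul(in[k], (unsigned char) c))
                    return 0;
        }
    return 1;
}
```

Whether this beats two table lookups is an open question. It trades
memory accesses for up to eight shift-and-XOR rounds per word.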
Any further improvement should probably be guided by measuring for hot
spots.
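
A crude way to start measuring is a timing loop around one candidate
function. The sketch below is only that: candidate() is a stand-in
workload of my choosing (table-free multiplication by 0xF4, polynomial
0x11D assumed), and clock() gives coarse CPU time. A real profiler would
be the better tool for finding hot spots in the whole program.

```c
#include <assert.h>
#include <time.h>

/* Stand-in workload; replace with the function under test. */
static unsigned char candidate(unsigned char a)
{
    unsigned int acc = 0, x = a, b = 0xF4;

    while (b) {
        if (b & 1)
            acc ^= x;
        x <<= 1;
        if (x & 0x100)
            x ^= 0x11D;
        b >>= 1;
    }
    return (unsigned char) acc;
}

/* Runs `rounds` passes over all 256 byte values and returns the CPU
   seconds used. The sum is handed back so that the compiler cannot
   discard the loop as dead code. */
static double time_candidate(unsigned long rounds, unsigned long *sum_out)
{
    unsigned long r, sum = 0;
    unsigned int a;
    clock_t start, end;

    assert(sum_out != NULL);
    start = clock();
    for (r = 0; r < rounds; r++)
        for (a = 0; a < 256; a++)
            sum += candidate((unsigned char) a);
    end = clock();
    *sum_out = sum;
    return (double) (end - start) / CLOCKS_PER_SEC;
}

/* Returns 1 if a short run yields a non-negative time and the
   expected checksum (the workload permutes 0..255, so each round
   adds 32640). */
static int timing_sanity(void)
{
    unsigned long sum = 0;
    double secs = time_candidate(10, &sum);

    return secs >= 0.0 && sum == 326400UL;
}
```

Numbers from such a loop are only comparable on the same machine, with
the same compiler options, and with enough rounds to get well above the
clock resolution.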