
Re: Announcing cdrskin-0.7.2



| From: Thomas Schmitt <scdbackup@gmx.net>

| Some of my own experiments yielded surprising
| setbacks. E.g. i replaced
|    gfpow[44 - i]
| by
|    h45[i]
| with a suitable constant array h45[].
| This was 7 percent slower !
| (I suspect a less fortunate cache situation.)

Right.  gcc can fold the constant 44 into the indexing, so at best
you saved the negation of i, and I don't know whether even that costs
anything.  But, as you suggest, the extra burden on the cache may be
the problem.

Even worse: the price of a cache burden depends on the CPU's cache
implementation and size, so testing on one machine does not give a
fair overview of performance on other machines.

| > In burn_rspc_div, you return -1 if the division is by 0.
| 
| This has been replaced by a specialized
| burn_rspc_div_3() which divides by (x^1+1).
| Less ifs, less array lookups, but no speed-up:
| 
| /* Divides by polynomial 0x03. Derived from burn_rspc_div() */
| static unsigned char burn_rspc_div_3(unsigned char a)
| {
|         if (a == 0)
|                 return 0;
|         if (gflog[a] >= 25)
|                 return gfpow[gflog[a] - 25];
|         else
|                 return gfpow[230 + gflog[a]];
| }  

Given that gfpow is doubled, this code should be faster and simpler:

| {
|         if (a == 0)
|                 return 0;
|         /* Note: gflog(x^1 + x^0) == 25 */
|         return gfpow[(255 - 25) + gflog[a]];
| }  
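
To make the equivalence concrete, here is a minimal sketch that
compares both variants over all 256 inputs.  It assumes the usual CD
field, GF(2^8) over x^8 + x^4 + x^3 + x^2 + 1 (0x11d) with generator
0x02, and it builds its own tables.  The function names are made up
for this example, and the layout gfpow[i + 255] == gfpow[i] is my
reading of "doubled", not something I checked against libburn.

```c
#include <assert.h>

/* Stand-ins for the real tables (assumption: field polynomial 0x11d,
   generator 0x02, gfpow holding two consecutive copies of the powers). */
static unsigned char gflog[256];
static unsigned char gfpow[510];

/* Hypothetical helper: fill gflog[] and the doubled gfpow[]. */
static void rspc_init_tables(void)
{
        unsigned int x = 1;
        int i;

        for (i = 0; i < 255; i++) {
                gfpow[i] = gfpow[i + 255] = (unsigned char) x;
                gflog[x] = (unsigned char) i;
                x <<= 1;
                if (x & 0x100)
                        x ^= 0x11d;     /* reduce modulo the polynomial */
        }
        gflog[0] = 0;   /* log of 0 is undefined; callers test a == 0 */
}

/* The two-branch version quoted above. */
static unsigned char div_3_branch(unsigned char a)
{
        if (a == 0)
                return 0;
        if (gflog[a] >= 25)
                return gfpow[gflog[a] - 25];
        return gfpow[230 + gflog[a]];
}

/* The single-lookup version; relies on the second copy in gfpow[]. */
static unsigned char div_3_doubled(unsigned char a)
{
        if (a == 0)
                return 0;
        /* Note: gflog(x^1 + x^0) == 25 */
        return gfpow[(255 - 25) + gflog[a]];
}

/* Returns the number of inputs on which the two versions disagree. */
static int div_3_mismatches(void)
{
        int a, bad = 0;

        rspc_init_tables();
        for (a = 0; a < 256; a++)
                if (div_3_branch((unsigned char) a) !=
                    div_3_doubled((unsigned char) a))
                        bad++;
        return bad;
}
```

The largest index used is 230 + 254 == 484, so the second copy of the
table is what makes the branch unnecessary.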

I think that 0x03 has a multiplicative inverse too.  This would allow
a division to be replaced by a multiplication.

Since gflog(0x03) is 25, the multiplicative inverse ought to have a
gflog of -25 == 230.  It turns out that gflog(244) == 230 so 244 ==
0xF4 appears to be the multiplicative inverse.  (Note that
0xF4 + 0x03 == 0xF7, for either kind of +, since no bits overlap.)

So, not too surprisingly, 255-25 could be replaced by 0+230.  No
advantage, just interesting.  It still requires the gfpow table to be
doubled.
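
The inverse can be checked without any tables.  The following is a
sketch only: gf_mul() and inverse_of_3_ok() are names I invented, and
the field polynomial 0x11d is an assumption that happens to agree
with gflog(0x03) == 25.

```c
#include <assert.h>

/* Hypothetical shift-and-xor multiplication in GF(2^8), assuming the
   field polynomial x^8 + x^4 + x^3 + x^2 + 1 (0x11d). */
static unsigned char gf_mul(unsigned char a, unsigned char b)
{
        unsigned int acc = 0, aa = a;

        while (b) {
                if (b & 1)
                        acc ^= aa;      /* add (xor) this multiple of a */
                aa <<= 1;
                if (aa & 0x100)
                        aa ^= 0x11d;    /* reduce modulo the polynomial */
                b >>= 1;
        }
        return (unsigned char) acc;
}

/* Returns 1 if 0xF4 behaves as the inverse of 0x03 for every byte:
   multiplying by 0xF4 and then by 0x03 must give back the input. */
static int inverse_of_3_ok(void)
{
        int a;

        if (gf_mul(0x03, 0xF4) != 1)
                return 0;
        for (a = 0; a < 256; a++)
                if (gf_mul(gf_mul((unsigned char) a, 0xF4), 0x03) != a)
                        return 0;
        return 1;
}
```

With log tables, multiplying by 0xF4 is gfpow[gflog[a] + 230], which
is the same lookup as the division above.  That is why it brings no
advantage.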

| I trust in gcc -O2 that it handles the double
| lookup of gfpow[a] properly.

Probably.

| The code swallowed far more obvious workload
| improvements without showing speed reactions.

Right.  I'm shooting in the dark, given that I'm not testing, let
alone measuring.

| ------------------------------------------------
| 
| I see some potential in parallelization.
| We have at least 32 bit for exor operations.
| There are two neighbored bytes multiplied by
| the same byte simultaneously.
| 
| But already now a 1000 MHz CPU can easily feed
| a 48x CD stream. I am not aware of faster CD
| media. And this stuff is for CD only.

OK.

Any further improvement should probably be guided by measuring for hot
spots.
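
As a starting point, a crude timing harness could look like the
sketch below.  Everything in it is hypothetical: time_byte_func() and
identity_byte() are invented names, and the call through a function
pointer may prevent inlining, so the numbers are only good for rough
comparisons between candidates.

```c
#include <assert.h>
#include <time.h>

/* Written once at the end so the compiler cannot drop the loop. */
static volatile unsigned char sink;

/* Placeholder candidate; substitute the function to be measured. */
static unsigned char identity_byte(unsigned char a)
{
        return a;
}

/* Calls f() on all 256 byte values, rounds times over, and returns
   the consumed CPU time in seconds as reported by clock(). */
static double time_byte_func(unsigned char (*f)(unsigned char),
                             long rounds)
{
        clock_t start, end;
        unsigned char acc = 0;
        long r;
        int a;

        start = clock();
        for (r = 0; r < rounds; r++)
                for (a = 0; a < 256; a++)
                        acc ^= f((unsigned char) a);
        end = clock();
        sink = acc;
        return (double) (end - start) / CLOCKS_PER_SEC;
}
```

Calling time_byte_func(identity_byte, 1000000) and then the same with
a real candidate gives two figures to compare.  As said above, the
result on one machine need not carry over to another.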

