Re: [xine-user] [ANN] PowerPC Assembly Patch
On May 25 2002, Andrew Patrikalakis wrote:
> With all the recent talk of use of assembly on the PowerPC, I came
> up with a patch to use assembly versions of memcpy. It's about 35%
> faster. Here is a sample of the memcpy speed test (which also now
> works):
You're going to die laughing. I beat this by 8% in plain C,
without using 64-bit at all. :-) It's kind of portable even.
> Benchmarking memcpy methods (smaller is better):
> glibc memcpy() : 136
> ppcasm_memcpy() : 137
> ppcasm_cacheable_memcpy() : 88
> xine: using ppcasm_cacheable_memcpy()
> (The lower time resolution is because I'm using times(NULL) in rdtsc())
My numbers below are in MB/s (so bigger is better here), measured
on a 450MHz MPC7400, with a granularity of 8 MB/s.
You can mostly ignore the low scores, assuming that the
code got unlucky with something. I recompiled for every test,
and did a bit of web browsing in between every few tests.
I was careful to initialize the data first; failure to do
so would mean reading from the zero page.
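As a rough illustration of the point about initializing the data, here is a minimal sketch of a timing harness that touches both buffers before the clock starts. It is not the mem.c used for the numbers in this post; the names bench_copy and copy_fn are made up for the example, and clock() is assumed to be an adequate timer.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical signature shared by every copy routine under test. */
typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Returns throughput in decimal MB/s, or -1.0 if allocation failed or
 * the run was too short for clock() to resolve. */
static double bench_copy(copy_fn fn, size_t bytes, int reps)
{
    unsigned char *src = (unsigned char *)malloc(bytes);
    unsigned char *dst = (unsigned char *)malloc(bytes);
    double secs, mbps = -1.0;
    clock_t t0, t1;
    int i;

    if (src && dst) {
        /* Write to every byte before timing, so the loop below does not
         * read never-written pages (the zero-page case mentioned above). */
        memset(src, 0x5a, bytes);
        memset(dst, 0xa5, bytes);

        t0 = clock();
        for (i = 0; i < reps; i++)
            fn(dst, src, bytes);
        t1 = clock();

        /* A fast copy that copies wrongly is not worth timing. */
        assert(memcmp(dst, src, bytes) == 0);

        secs = (double)(t1 - t0) / CLOCKS_PER_SEC;
        if (secs > 0.0)
            mbps = (double)bytes * reps / 1e6 / secs;
    }
    free(src);
    free(dst);
    return mbps;
}
```

Something like bench_copy(memcpy, 1 << 20, 100) would time the library routine; a routine with a void return type would need a small wrapper to match copy_fn.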
glibc:    96,  96, 104, 104, 104, 104, 104, 112, 112, 112, 112
kernel:  104, 104, 104, 112, 120, 120, 128, 128, 128, 128, 128
c2_flt:  112, 120, 120, 120, 120, 120, 128, 136, 144, 144, 144
c_flt:    88,  88,  88,  96, 104, 104, 104, 112, 112, 112, 112
c_dbl:   152, 152, 152, 152, 152, 168, 168, 168, 168, 168, 184
c2_dbl:  120, 136, 144, 144, 152, 152, 152, 160, 160, 160, 168
glibc is the stock glibc memcpy()
kernel is the assembly code that was posted
c2_flt is the code below
c_flt is like c2_flt, but with the loads and stores in normal
0,1,2,3,4,5... order
c2_dbl is like c2_flt, but with type "double"
c_dbl is like c_flt, but with type "double"
For the old bus, decimal MB/s copied should be 3.2 times the
bus speed in MHz, if you count each byte copied once rather
than counting its load and its store separately. If you
have a "G4" on the Max bus, it should be 4x bus speed minus
a tiny bit of overhead for occasional load/store turnaround.
So unless something is wrong with Mac motherboards, none of
these methods is anywhere near the limit.
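To make that arithmetic concrete, here is a small sketch that turns a bus clock into the copy limits described above. The function name copy_limit_mbps is made up for the example, and the factor is passed in tenths (32 for 3.2x, 40 for 4x) just to stay in integer math.

```c
#include <assert.h>

/* Theoretical copy limit in decimal MB/s.
 * bus_mhz:       bus clock in MHz (an input; not stated in this post)
 * factor_tenths: 32 for the old bus (3.2x), 40 for the Max bus (4x) */
static unsigned copy_limit_mbps(unsigned bus_mhz, unsigned factor_tenths)
{
    assert(bus_mhz > 0);
    return bus_mhz * factor_tenths / 10;
}
```

Purely as an assumption for illustration, a 100MHz bus would give 320 MB/s on the old bus and 400 MB/s on the Max bus. The best figure measured above is 184 MB/s.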
Command line:
gcc -Wall -O2 mem.c kern.S && ./a.out
gcc version:
Reading specs from /usr/lib/gcc-lib/powerpc-linux/2.95.4/specs
gcc version 2.95.4 20011006 (Debian prerelease)
////////////////////////////////////////////////////////////////////////
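#include <stddef.h> /* size_t */

/* Copy n bytes in 64-byte blocks through 16 float variables, with the
 * accesses alternating between the two 32-byte halves of each block.
 * Assumes src and dst are aligned for float and do not overlap. */
static void c2_flt_memcpy(void *dst, const void *src, size_t n){
  float r0,r1,r2,r3,r4,r5,r6,r7,r8,r9,ra,rb,rc,rd,re,rf;
  size_t i=n/(16*4); /* 16 is loop unroll factor, 4 is sizeof float */
  /* Start one block early; each pass advances before it copies. */
  const float *sp = (const float*)src - 16;
  float *dp = (float*)dst - 16;
  while(i--){
    sp += 16;
    r0 = sp[0];
    r8 = sp[8];
    r1 = sp[1];
    r9 = sp[9];
    r2 = sp[2];
    ra = sp[10];
    r3 = sp[3];
    rb = sp[11];
    r4 = sp[4];
    rc = sp[12];
    r5 = sp[5];
    rd = sp[13];
    r6 = sp[6];
    re = sp[14];
    r7 = sp[7];
    rf = sp[15];
    dp += 16;
    dp[ 0] = r0;
    dp[ 8] = r8;
    dp[ 1] = r1;
    dp[ 9] = r9;
    dp[ 2] = r2;
    dp[10] = ra;
    dp[ 3] = r3;
    dp[11] = rb;
    dp[ 4] = r4;
    dp[12] = rc;
    dp[ 5] = r5;
    dp[13] = rd;
    dp[ 6] = r6;
    dp[14] = re;
    dp[ 7] = r7;
    dp[15] = rf;
  }
  /* Copy the tail that does not fill a whole 64-byte block. */
  {
    const char *s = (const char*)(sp + 16);
    char *d = (char*)(dp + 16);
    size_t t = n%(16*4);
    while(t--) *d++ = *s++;
  }
}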
////////////////////////////////////////////////////////////////////////
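For comparison, here is a guess at what the c_dbl variant described above might look like: the same 16-way unroll, but with type "double" and plain 0,1,2,3... order. This is a reconstruction for illustration, not the code that produced the numbers. It assumes n is a multiple of 128 bytes, that both pointers are aligned for double, and that moving data through double variables preserves every bit pattern on the target. The helper c_dbl_selftest is made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical "c_dbl": 16 doubles (128 bytes) per pass, in order. */
static void c_dbl_memcpy(void *dst, const void *src, size_t n)
{
    double r0, r1, r2, r3, r4, r5, r6, r7;
    double r8, r9, ra, rb, rc, rd, re, rf;
    size_t i = n / (16 * sizeof(double));
    const double *sp = (const double *)src;
    double *dp = (double *)dst;

    /* This sketch does not copy a partial trailing block. */
    assert(n % (16 * sizeof(double)) == 0);

    while (i--) {
        /* All 16 loads first, then all 16 stores. */
        r0 = sp[0];  r1 = sp[1];  r2 = sp[2];  r3 = sp[3];
        r4 = sp[4];  r5 = sp[5];  r6 = sp[6];  r7 = sp[7];
        r8 = sp[8];  r9 = sp[9];  ra = sp[10]; rb = sp[11];
        rc = sp[12]; rd = sp[13]; re = sp[14]; rf = sp[15];
        dp[0] = r0;  dp[1] = r1;  dp[2] = r2;  dp[3] = r3;
        dp[4] = r4;  dp[5] = r5;  dp[6] = r6;  dp[7] = r7;
        dp[8] = r8;  dp[9] = r9;  dp[10] = ra; dp[11] = rb;
        dp[12] = rc; dp[13] = rd; dp[14] = re; dp[15] = rf;
        sp += 16;
        dp += 16;
    }
}

/* Returns 1 if two blocks of ordinary double values copy exactly. */
static int c_dbl_selftest(void)
{
    double src[32], dst[32];
    size_t k;

    for (k = 0; k < 32; k++) {
        src[k] = (double)k + 0.5;
        dst[k] = -1.0;
    }
    c_dbl_memcpy(dst, src, sizeof src);
    return memcmp(dst, src, sizeof src) == 0;
}
```

Unlike the code above, this version advances the pointers at the end of each pass instead of starting them one block early. That is a choice made for the sketch, not something taken from the benchmarked code.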