
Re: Dual Opteron on Sarge



Ron Johnson wrote:

On Sun, 2005-09-18 at 13:18 +1000, Hamish Moffatt wrote:
On Sat, Sep 17, 2005 at 06:49:55PM -0700, lordSauron wrote:
pentium 4s use a 21 stage pipeline or something like that... so they
take approximately 21 clock cycles to get anything done.  AMD uses
about 7 stages (or something in that neighbourhood) so if you divide
2.8 by 21 and 2.0 (my Athlon64) by 7, you get a really interesting
breakdown.  You'll certainly find a HUGE increase in performance,
That's a terrible simplification. Yes, it takes longer to get the first

Not only is it a simplification, it's wrong.

result (21 cycles versus 7) but the idea of the pipeline is that you can
get a result every clock cycle after that.

But when you context-switch or branch, the pipeline gets dirty,
and the new process needs to fill up the pipeline.

Short pipelines like in Athlon & G4 are easier on branching,
but other techniques like speculative fetching and OOE mitigate
that somewhat.

And then, deep pipelines let you ramp up the clock much easier
than do short pipelines.  Don't know why, though.

You can ramp up the clock speed on a deep pipeline because
each stage in the pipeline does very little.  Each stage can therefore
finish faster than the more complex stages found in a shorter pipeline
doing the same job.
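
To put rough numbers on that, here is a minimal sketch in Python.  It
assumes a fixed amount of logic split evenly across the stages, plus a
fixed latch overhead per stage.  The function names and all figures
(7 ns of logic, 0.1 ns overhead) are invented for illustration; they are
not measurements of any real CPU.

```python
# Toy model: a fixed amount of logic is split evenly across the pipeline
# stages, and every stage pays a fixed latch/register overhead.
# All numbers used below are made up for illustration.

def cycle_time_ns(total_logic_ns, stages, latch_overhead_ns):
    """Clock period when the logic is divided evenly over `stages`."""
    return total_logic_ns / stages + latch_overhead_ns


def max_clock_ghz(total_logic_ns, stages, latch_overhead_ns):
    """Highest clock this model allows (1 / period in ns gives GHz)."""
    return 1.0 / cycle_time_ns(total_logic_ns, stages, latch_overhead_ns)


if __name__ == "__main__":
    for stages in (7, 21):
        ghz = max_clock_ghz(7.0, stages, 0.1)
        print(f"{stages:2d} stages: {ghz:.2f} GHz")
```

In this model, tripling the number of stages raises the attainable clock,
but by less than a factor of three, because the per-stage overhead does
not shrink.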

Unfortunately, the ability to ramp up the clock doesn't help when
a faster clock becomes necessary just to keep up.  We all know
how an Opteron beats any Pentium with the _same_ clock rating
by a wide margin.

A 21-stage pipeline _is_ too deep - on average, x86 code
has a branch about every 7th instruction, and each branch
may invalidate that deep pipeline.  A shorter pipeline has fewer of
these problems.
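
A back-of-the-envelope sketch of that cost, in Python.  It assumes one
instruction per cycle when nothing goes wrong, and that a mispredicted
branch costs about one pipeline depth in lost cycles.  That is a
simplification (the real penalty depends on where the branch resolves),
and the mispredict rate is a free parameter here, not a measured value.

```python
# Toy model of average cycles per instruction (CPI) with branch flushes.
# Assumptions (not from any datasheet): base CPI of 1.0, and a flush
# penalty roughly equal to the pipeline depth.

def cycles_per_instruction(branch_every, mispredict_rate, flush_penalty,
                           base_cpi=1.0):
    """Average CPI when one in `branch_every` instructions is a branch
    and a fraction `mispredict_rate` of branches flushes the pipeline."""
    branch_fraction = 1.0 / branch_every
    return base_cpi + branch_fraction * mispredict_rate * flush_penalty


if __name__ == "__main__":
    # Worst case: every branch flushes (mispredict_rate = 1.0).
    for depth in (7, 21):
        cpi = cycles_per_instruction(7, 1.0, depth)
        print(f"depth {depth:2d}: CPI about {cpi:.2f}")
```

Lowering mispredict_rate shows how much a good branch predictor hides;
the penalty term still scales with the depth.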

The latency is higher but the
throughput is also higher (more clock cycles per second).

Yes, except when throughput is ruined by latency from all those
branches.  People try fixing this by unrolling loops, but then
they use up the instruction cache instead.  A good pipeline should
be short enough to be (almost) full most of the time.  Otherwise
it is going to lose, no matter what clock speed.
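
A small Python sketch of "full most of the time".  It assumes the
pipeline is refilled from empty after every run of instructions, so a
run of N instructions through a pipeline of depth D takes D + N - 1
cycles.  The 2.8 GHz / 21 stages and 2.0 GHz / 7 stages figures are the
ones quoted earlier in this thread; the run lengths are assumptions, and
the model ignores everything else (issue width, caches, prediction).

```python
# Toy model: how full the pipeline stays, and what that does to the
# useful instruction rate.  Run lengths are illustrative assumptions.

def fill_fraction(run_length, depth):
    """Fraction of cycles that complete an instruction when the pipeline
    restarts from empty after every `run_length` instructions."""
    return run_length / (run_length + depth - 1)


def useful_rate(clock_ghz, run_length, depth):
    """Completed instructions per nanosecond under this model."""
    return clock_ghz * fill_fraction(run_length, depth)


if __name__ == "__main__":
    for run_length in (7, 1000):
        deep = useful_rate(2.8, run_length, 21)
        short = useful_rate(2.0, run_length, 7)
        print(f"run of {run_length}: deep {deep:.2f}, short {short:.2f}")
```

With short runs between flushes the shorter pipeline comes out ahead in
this model despite the lower clock; with very long runs the higher clock
wins.  Which case applies depends on how often the pipeline really gets
flushed.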

Helge Hafting


