
Re: pa-risc/linux abi



On Wed, Jun 23, 2004 at 10:14:06PM -0600, Grant Grundler wrote:
> AFAIK, the general registers (integer) have interlocks so
> the depth of the pipeline is irrelevant. I don't know if
> that's true for FP regs in general or specific CPU models.

I was able to confirm this is true for at least PA8500 CPUs (PCXW).
I expect this to be true for PA8600 and PA8700 as well.

> Normally I'd expect how many cycles it takes to complete
> a particular FOP to dictate when the FP reg (both left and right)
> can be used again as the target or source for other ops.

The same document explained that instructions are loaded in
groups of 4 from memory and decoded as a group. ie organize
the asm instructions in groups of 4 (aka "quad") with each insn
targeting a different part of the CPU (some combination of 2 loads,
2 stores, 2 FOPs, 1 shift, etc).

The PA8500 FMAC units can handle all (most?) floating-point operations
except floating-point divides and square-roots. Up to two floating-point
instructions can be launched for execution in each cycle.

In a nutshell:
o can only have 2 FP ops per quad (without stalling the quad)
o use the PA 2.0 architected prefetch in "spare" cycles (load (GRX) -> GR0)
o perform 2 or fewer memory loads per quad.
o do not use the target FP registers until we know the data has arrived.
o do not reference target FP registers until we know FMAC is done.
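
As a rough illustration of the first and third points, here is a small
C sketch that checks whether a hand-scheduled quad stays within the
FP-op and load budget. The insn classes and the function name are made
up for this example. It only counts insns and knows nothing about
latencies or real PA8500 decode rules.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical insn classes, for illustration only. */
enum insn_class { INSN_LOAD, INSN_STORE, INSN_FOP, INSN_OTHER };

#define QUAD_SIZE 4
#define MAX_FOPS  2   /* "2 FP ops per quad" from the list above */
#define MAX_LOADS 2   /* "2 or fewer memory loads per quad" */

/* Returns 1 if the quad stays within the FP-op and load budget, else 0. */
int quad_fits_budget(const enum insn_class quad[QUAD_SIZE])
{
    int fops = 0, loads = 0;
    size_t i;

    assert(quad != NULL);
    for (i = 0; i < QUAD_SIZE; i++) {
        if (quad[i] == INSN_FOP)
            fops++;
        if (quad[i] == INSN_LOAD)
            loads++;
    }
    return fops <= MAX_FOPS && loads <= MAX_LOADS;
}
```

A quad of 2 loads + 2 FOPs passes; a quad with 3 FOPs does not.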

ie by hand, unroll the loop so that each quad prefetches N+2, loads
N+1, and operates on N (roughly). Use all the registers
available to unroll more than once if possible since that's what all
the registers are for. Saving/restoring registers is cheap if it
cuts the number of iterations by a factor of 2.
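
The shape of that loop, sketched in C rather than asm. This is only
meant to show the prefetch N+2 / load N+1 / operate on N structure.
The function name, the block size BLK and the PREFETCH macro are
assumptions for this example, and a compiler is free to reschedule
all of it, so the real thing would still be hand-written asm.
__builtin_prefetch is the GCC builtin; on other compilers the macro
expands to nothing.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__GNUC__)
#define PREFETCH(p) __builtin_prefetch(p)
#else
#define PREFETCH(p) ((void)0)
#endif

#define BLK 4  /* elements handled per step; arbitrary choice */

/* y[i] += a * x[i] for i in [0, n); n must be a multiple of BLK. */
void axpy_pipelined(double a, const double *x, double *y, size_t n)
{
    double cur[BLK];
    double next[BLK] = { 0 };
    size_t i, k;

    assert(n % BLK == 0);
    if (n == 0)
        return;

    /* Prologue: load block 0 so the first iteration has data ready. */
    for (k = 0; k < BLK; k++)
        cur[k] = x[k];

    for (i = 0; i < n; i += BLK) {
        /* Prefetch block N+2 (hint only, no register target). */
        if (i + 2 * BLK < n)
            PREFETCH(&x[i + 2 * BLK]);

        /* Load block N+1 into registers not used by the FP ops below. */
        if (i + BLK < n)
            for (k = 0; k < BLK; k++)
                next[k] = x[i + BLK + k];

        /* Operate on block N, whose data was loaded one step earlier. */
        for (k = 0; k < BLK; k++)
            y[i + k] += a * cur[k];

        /* Rotate: block N+1 becomes block N for the next step. */
        for (k = 0; k < BLK; k++)
            cur[k] = next[k];
    }
}
```

In asm the "rotate" copy would go away by unrolling twice and
alternating register sets, which is one more reason to use all the
registers.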

I was also told that in each cycle the oldest data-ready floating-point
instruction occupying an odd queue-slot is selected for execution
on the FMAC unit or (an idle) FP divide/square-root unit dedicated
to odd queue-slots. Ditto for even queue-slots. I forgot what
the even and odd queue slots refer to. I'm not sure if
this is related to the left/right side of FP registers or the
insn address within the quad.

The FMAC units can accept a new operation every cycle.
The FP divide/square-root units cannot.
An FP divide/square-root insn will be delayed if the corresponding functional
unit is busy. However, "like-parity" (eg odd queue slot) FP insns can be
launched out of order to the corresponding FMAC unit.

Note this is for ONE implementation only and not architected.
But I expect at least PA8500/8600/8700 CPUs to be similar
in this regard.

hth,
grant


