Srikanth T. Srinivasan, Ravi Rajwar, Haitham Akkary, Amit Gandhi, Mike Upton. Continual Flow Pipelines. ASPLOS 2004. ACM DL Link
High system throughput
Many cores per chip
constant chip size and power envelope > many large cores cannot fit on a single chip / small cores traditionally do not provide high single-thread performance
CFP: efficiently manages cycle-critical resources > continues processing instr even in the presence of a long-latency data cache miss to memory.
CFP does:
draining out the long-latency-dependent slice (along with its ready source values) while releasing scheduler entries and registers
reacquiring these resources upon reinsertion into the pipeline
integrating results of independent instr without their re-execution
splitting instr into those dependent on the miss (the slice) and those free to go (see the sketch below)
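A minimal Python sketch of that split, assuming a simple register-dependence walk (the instruction and register names are illustrative, not from the paper): the register written by the missing load is marked as waiting, and any instr reading a waiting register joins the slice, propagating the mark through its destination.

```python
# Minimal sketch (not the paper's hardware) of CFP-style slice identification:
# a "waiting" mark propagates from the missing load through register
# dependences, so every instruction that transitively depends on the miss is
# tagged as slice; everything else is free to go.

def split_slice(instructions, miss_dest_regs):
    """instructions: list of (name, src_regs, dst_reg) in program order."""
    waiting = set(miss_dest_regs)   # registers whose values depend on the miss
    slice_instrs, independent = [], []
    for name, srcs, dst in instructions:
        if any(s in waiting for s in srcs):
            slice_instrs.append((name, srcs, dst))
            waiting.add(dst)        # its result also depends on the miss
        else:
            independent.append((name, srcs, dst))
    return slice_instrs, independent

# Example: a load misses in the L2 and produces r1.
program = [
    ("add r2 = r1 + r3", ["r1", "r3"], "r2"),   # depends on the miss -> slice
    ("mul r4 = r5 * r6", ["r5", "r6"], "r4"),   # independent -> free to go
    ("sub r7 = r2 - r4", ["r2", "r4"], "r7"),   # transitively dependent -> slice
]
print(split_slice(program, {"r1"}))
```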
decoupling critical resources
Scheduler
take the L2-miss-dependent instr out of the scheduler and store them in a small, temporary first-in first-out (FIFO) Slice Data Buffer (SDB)
Register file
Completed source registers: values already produced by instr that have executed (to be read later by the slice) > save the value in the FIFO so the register can be released
Dependent destination registers: destination operands of slice instr > no value yet; released and reacquired when the slice re-enters the pipeline (see the sketch below)
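A rough sketch of what an SDB entry could hold, under assumed names (SliceDataBuffer, drain, and reinsert_all are mine, not the paper's interface): completed source operands are read from the register file once and their values travel with the drained instr, so those registers can be released, while miss-dependent sources and the destination are recorded only by name.

```python
from collections import deque

class SliceDataBuffer:
    """Sketch of a small FIFO holding drained slice instructions."""

    def __init__(self):
        self.fifo = deque()                     # drained and replayed in order

    def drain(self, instr, regfile, waiting_regs):
        entry = {
            "instr": instr["name"],
            "dst": instr["dst"],                # value produced later, on re-execution
            "ready_values": {},                 # completed sources: the value travels with the slice
            "pending_srcs": [],                 # miss-dependent sources: name only
        }
        for src in instr["srcs"]:
            if src in waiting_regs:
                entry["pending_srcs"].append(src)
            else:
                entry["ready_values"][src] = regfile[src]   # read once; register can now be freed
        self.fifo.append(entry)

    def reinsert_all(self):
        while self.fifo:
            yield self.fifo.popleft()           # re-enters the pipeline in FIFO order
```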
key actions
identify and drain the slice
process and retire indep instr
exec the slice when miss returns
merge results of indep instr and slice instr > not difficult, since the two sets of instr are independent of each other (see the sketch below)
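A toy end-to-end sketch of the drain / retire / re-execute / merge flow (simplified; run_cfp and the entry fields are my assumptions, not the paper's interface): independent instr write their results while the miss is outstanding, the slice replays from the SDB with the source values saved at drain time once the miss data returns, and merging needs no extra work because the two sets write disjoint destinations.

```python
def run_cfp(regfile, independent, sdb_entries, miss_reg, miss_value):
    # 1. Independent instructions complete and retire under the miss shadow.
    for fn, dst in independent:
        regfile[dst] = fn(regfile)

    # 2. Miss data returns: the waiting register finally gets its value.
    regfile[miss_reg] = miss_value

    # 3. Slice re-execution: saved ready values + now-available pending values.
    for entry in sdb_entries:
        operands = dict(entry["ready_values"])
        for src in entry["pending_srcs"]:
            operands[src] = regfile[src]        # produced by the miss or an earlier slice instr
        regfile[entry["dst"]] = entry["compute"](operands)

    # 4. "Merge" = nothing extra: independent results were never discarded.
    return regfile

# Tiny example: r1 is the miss, an independent mul writes r4, a slice add writes r2.
regs = {"r3": 5, "r5": 2, "r6": 7}
indep = [(lambda rf: rf["r5"] * rf["r6"], "r4")]
slice_entries = [{
    "ready_values": {"r3": 5}, "pending_srcs": ["r1"], "dst": "r2",
    "compute": lambda ops: ops["r1"] + ops["r3"],
}]
print(run_cfp(regs, indep, slice_entries, "r1", miss_value=10))
```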
Perf
isolated cache misses handled effectively
extracts memory-level parallelism by continuing past a cache miss, so later misses overlap with the outstanding one