Srikanth T. Srinivasan, Ravi Rajwar, Haitham Akkary, Amit Gandhi, Mike Upton. Continual Flow Pipelines. ASPLOS 2004. ACM DL Link
High system throughput
Many cores per chip
constant chip size and power envelope > many large cores cannot fit on a single chip / small cores traditionally do not provide high single-thread performance
CFP: efficiently manages cycle-critical resources > continues processing instr even in the presence of a long-latency data cache miss to memory.
CFP does:
draining out the long-latency-dependent slice (along with its ready source values) while releasing scheduler entries and registers
reacquiring these resources upon reinsertion into the pipeline
integrating results of independent instr without their re-execution
splitting instr into those dependent on the miss (the slice) and those free to go (see the sketch below)
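A minimal Python sketch of that split, assuming a simple register-dependence walk (the instruction and register names are illustrative, not from the paper): the register written by the missing load is marked as waiting, and any instr reading a waiting register joins the slice, propagating the mark through its destination.

```python
# Minimal sketch (not the paper's hardware) of CFP-style slice identification:
# a "waiting" mark propagates from the missing load through register
# dependences, so every instruction that transitively depends on the miss is
# tagged as slice; everything else is free to go.

def split_slice(instructions, miss_dest_regs):
    """instructions: list of (name, src_regs, dst_reg) in program order."""
    waiting = set(miss_dest_regs)   # registers whose values depend on the miss
    slice_instrs, independent = [], []
    for name, srcs, dst in instructions:
        if any(s in waiting for s in srcs):
            slice_instrs.append((name, srcs, dst))
            waiting.add(dst)        # its result also depends on the miss
        else:
            independent.append((name, srcs, dst))
    return slice_instrs, independent

# Example: a load misses in the L2 and produces r1.
program = [
    ("add r2 = r1 + r3", ["r1", "r3"], "r2"),   # depends on the miss -> slice
    ("mul r4 = r5 * r6", ["r5", "r6"], "r4"),   # independent -> free to go
    ("sub r7 = r2 - r4", ["r2", "r4"], "r7"),   # transitively dependent -> slice
]
print(split_slice(program, {"r1"}))
```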
decoupling critical resources
Scheduler
take the L2-miss-dependent instr out of the scheduler and store them in a small, temporary first-in first-out (FIFO) Slice Data Buffer (SDB)
Register file
Completed source registers: values already produced by instr that have executed (to be read later by the slice) > save the value in the FIFO so the register can be released
Dependent destination registers: destination operands of slice instr > no value yet; released and reacquired when the slice re-enters the pipeline (see the sketch below)
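A rough sketch of what an SDB entry could hold, under assumed names (SliceDataBuffer, drain, and reinsert_all are mine, not the paper's interface): completed source operands are read from the register file once and their values travel with the drained instr, so those registers can be released, while miss-dependent sources and the destination are recorded only by name.

```python
from collections import deque

class SliceDataBuffer:
    """Sketch of a small FIFO holding drained slice instructions."""

    def __init__(self):
        self.fifo = deque()                     # drained and replayed in order

    def drain(self, instr, regfile, waiting_regs):
        entry = {
            "instr": instr["name"],
            "dst": instr["dst"],                # value produced later, on re-execution
            "ready_values": {},                 # completed sources: the value travels with the slice
            "pending_srcs": [],                 # miss-dependent sources: name only
        }
        for src in instr["srcs"]:
            if src in waiting_regs:
                entry["pending_srcs"].append(src)
            else:
                entry["ready_values"][src] = regfile[src]   # read once; register can now be freed
        self.fifo.append(entry)

    def reinsert_all(self):
        while self.fifo:
            yield self.fifo.popleft()           # re-enters the pipeline in FIFO order
```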
key actions
identify and drain the slice
process and retire indep instr
exec the slice when miss returns
merge results of indep instr and slice instr > not difficult, since the two sets of instr are independent of each other (see the sketch below)
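A toy end-to-end sketch of the drain / retire / re-execute / merge flow (simplified; run_cfp and the entry fields are my assumptions, not the paper's interface): independent instr write their results while the miss is outstanding, the slice replays from the SDB with the source values saved at drain time once the miss data returns, and merging needs no extra work because the two sets write disjoint destinations.

```python
def run_cfp(regfile, independent, sdb_entries, miss_reg, miss_value):
    # 1. Independent instructions complete and retire under the miss shadow.
    for fn, dst in independent:
        regfile[dst] = fn(regfile)

    # 2. Miss data returns: the waiting register finally gets its value.
    regfile[miss_reg] = miss_value

    # 3. Slice re-execution: saved ready values + now-available pending values.
    for entry in sdb_entries:
        operands = dict(entry["ready_values"])
        for src in entry["pending_srcs"]:
            operands[src] = regfile[src]        # produced by the miss or an earlier slice instr
        regfile[entry["dst"]] = entry["compute"](operands)

    # 4. "Merge" = nothing extra: independent results were never discarded.
    return regfile

# Tiny example: r1 is the miss, an independent mul writes r4, a slice add writes r2.
regs = {"r3": 5, "r5": 2, "r6": 7}
indep = [(lambda rf: rf["r5"] * rf["r6"], "r4")]
slice_entries = [{
    "ready_values": {"r3": 5}, "pending_srcs": ["r1"], "dst": "r2",
    "compute": lambda ops: ops["r1"] + ops["r3"],
}]
print(run_cfp(regs, indep, slice_entries, "r1", miss_value=10))
```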
Perf
isolated cache misses handled effectively
extracts memory-level parallelism by continuing past a cache miss, so later misses overlap with the outstanding one