(2.1.7) Continual Flow Pipelines

Srikanth T. Srinivasan, Ravi Rajwar, Haitham Akkary, Amit Gandhi, Mike Upton. Continual Flow Pipelines. ASPLOS 2004. ACM DL Link



High system throughput
     Many cores per chip
     constant chip size and power envelope > many large cores cannot fit on a single chip / small cores traditionally do not provide high single-thread performance

CFP : efficiently manages cycle-critical resources > continues processing instructions even in the presence of a long-latency data cache miss to memory.
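
A back-of-the-envelope sketch of what continuing past a miss buys, with made-up cycle counts (a rough model, not figures from the paper): a blocking core serializes the miss latency with the miss-independent work, while a CFP-style core overlaps them.

# Illustrative cycle counts only; none of these numbers come from the paper.
miss_latency = 300      # L2 miss to memory
independent_work = 200  # cycles of work that does not need the missing load
slice_work = 20         # cycles of work that does need the missing load

# Blocking pipeline: miss latency and independent work are serialized.
blocking = miss_latency + independent_work + slice_work

# CFP-style pipeline: independent work proceeds under the miss shadow.
continual_flow = max(miss_latency, independent_work) + slice_work

print(f"blocking: {blocking} cycles, continual flow: {continual_flow} cycles")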

CFP does :
     draining out the long-latency-dependent slice (along with its ready source values) while releasing scheduler entries and registers
     reacquiring those resources upon reinsertion into the pipeline
     integrating results of independent instr without their re-execution

     i.e. splits instructions into the miss-dependent slice vs. independent instructions that are free to go (see the slice-identification sketch after this block)
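
A minimal sketch of slice identification, assuming a poison-bit style scheme where a "waiting on the miss" tag propagates from the missing load through register dependences (the register names and instruction tuples are invented for illustration):

# Each instruction: (destination_register, source_registers, is_l2_miss).
program = [
    ("r1", [],           True),    # load r1 <- [mem]  (L2 miss)
    ("r2", ["r1"],       False),   # consumes r1 -> part of the slice
    ("r3", ["r4"],       False),   # independent of the miss
    ("r5", ["r2", "r3"], False),   # consumes r2 -> also in the slice
]

poisoned = set()                   # registers whose value is deferred behind the miss
slice_instrs, independent_instrs = [], []

for dest, srcs, is_miss in program:
    if is_miss or any(s in poisoned for s in srcs):
        poisoned.add(dest)         # propagate the "waiting on miss" tag
        slice_instrs.append(dest)
    else:
        independent_instrs.append(dest)

print("slice:", slice_instrs)              # ['r1', 'r2', 'r5']
print("independent:", independent_instrs)  # ['r3']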

decoupling critical resources 
     Scheduler 
          drain the L2 miss and its dependent slice out of the scheduler and store them in a small temporary first-in, first-out (FIFO) Slice Data Buffer (SDB)
     Register file
          Completed source registers : produced by instr that have already executed (to be read later by the slice) > save the value in the FIFO and free the register
          Dependent destination registers : destination operands of slice instr > released now, reacquired when the slice re-enters the pipeline (see the SDB sketch below)
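
A rough sketch of the Slice Data Buffer idea, assuming drained slice instructions enter a FIFO together with whatever source values are already available, so their scheduler entries and completed source registers can be released (the data structures and field names are invented):

from collections import deque

# Scheduler entries for the slice: each carries the source values that are
# already available (produced by completed instructions); the rest are pending.
scheduler_slice = [
    ("load r1 <- [mem]",  {}),
    ("add  r2 <- r1+r7",  {"r7": 42}),   # r7 already computed -> carry its value
    ("sub  r5 <- r2-r3",  {"r3": 7}),    # r3 already computed -> carry its value
]

slice_data_buffer = deque()               # the small temporary FIFO (SDB)
released_registers = []

# Drain: move slice instructions plus captured values into the FIFO,
# releasing both their scheduler entries and the completed source registers.
for instr, ready_values in scheduler_slice:
    slice_data_buffer.append((instr, ready_values))
    released_registers.extend(ready_values)

scheduler_slice.clear()                   # scheduler entries are now free
print("SDB:", list(slice_data_buffer))
print("registers freed for independent work:", released_registers)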

key actions
     identify and drain the slice
     process and retire independent instr
     execute the slice when the miss data returns
     merge results of independent and slice instructions > not difficult, since the two sets of instructions are independent of each other (end-to-end sketch below)
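
Putting the key actions together, a hedged end-to-end sketch with toy values and invented helper names (real CFP does this in hardware across rename, scheduling and retirement, not in software):

from collections import deque

def run_cfp(program, miss_values):
    """program: list of (dest, srcs, fn, is_miss); fn computes dest from srcs."""
    regs, poisoned, sdb = {}, set(), deque()

    # 1) identify and drain the slice; 2) process and retire independent instrs
    for dest, srcs, fn, is_miss in program:
        if is_miss or any(s in poisoned for s in srcs):
            poisoned.add(dest)
            ready = {s: regs[s] for s in srcs if s in regs}   # carry ready values
            sdb.append((dest, srcs, fn, ready))
        else:
            regs[dest] = fn(*(regs[s] for s in srcs))         # execute and retire

    # 3) the miss data returns: execute the slice from the SDB, in order
    regs.update(miss_values)
    while sdb:
        dest, srcs, fn, ready = sdb.popleft()
        regs[dest] = fn(*(ready.get(s, regs.get(s)) for s in srcs))

    # 4) merge: results of independent instructions already sit in regs
    #    and are simply reused, never re-executed.
    return regs

prog = [
    ("r1", ["m"],  lambda m: m,     True),    # L2-missing load
    ("r2", ["r1"], lambda a: a + 1, False),   # slice instruction
    ("r3", [],     lambda: 10,      False),   # independent
    ("r4", ["r3"], lambda a: a * 2, False),   # independent
]
print(run_cfp(prog, {"m": 100}))   # r3=10, r4=20 retired early; r1=100, r2=101 after the miss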

Perf
     isolated cache misses are handled effectively
     exposes memory-level parallelism by going ahead after a cache miss, letting later independent misses overlap with the outstanding one (arithmetic sketch below)
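
A small arithmetic illustration of the memory-level parallelism point, with invented latencies: once the pipeline keeps going after the first miss, a second independent miss can be issued under the shadow of the first, so the two overlap instead of serializing.

miss_latency = 300   # cycles per L2 miss to memory (assumed value)
gap = 20             # cycles of execution between discovering the two misses

# Blocking pipeline: the second miss is only discovered after the first completes.
serialized = miss_latency + gap + miss_latency         # 620 cycles

# Continual flow: the second miss is issued while the first is still outstanding.
overlapped = gap + miss_latency                         # 320 cycles

print(f"serialized: {serialized} cycles, overlapped: {overlapped} cycles")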