## U. Wisconsin CS/ECE 752 Advanced Computer Architecture I

Prof. David A. Wood

Unit 3: Pipelining

Slides developed by Amir Roth of University of Pennsylvania with sources that included University of Wisconsin slides by Mark Hill, Guri Sohi, Jim Smith, and David Wood.

Slides enhanced by Milo Martin, Mark Hill, and David Wood with sources that included Profs. Asanovic, Falsafi, Hoe, Lipasti, Shen, Smith, Sohi, Vijaykumar, and Wood.

CS/ECE 752 (Wood): Pipelining



















## Pipeline Performance Calculation

- Back of the envelope calculation
  - Branch: 20%, load: 20%, store: 10%, other: 50%
- Single-cycle
  - Clock period = 50ns, CPI = 1
  - Performance = 50ns/insn
- Pipelined
  - Clock period = 12ns
  - CPI = 1 (each insn takes 5 cycles, but 1 completes each cycle)
  - Performance = 12ns/insn

### Principles of Pipelining

- Let: insn execution require N stages, each takes $t_n$ time
- Single-cycle execution
  - $L_1$ (1-insn latency) = $\sum t_n$
  - $T$ (throughput) = $1/L_1$
  - $L_M$ (M-insn latency, where M >> 1) = $M*L_1$
- Now: (ideal) N-stage pipeline
  - $L_{1+P} = L_1$
  - $T_{+P} = 1/\max(t_n) \le N/L_1$
    - If $t_n$ are equal (i.e., $\max(t_n) = L_1/N$), throughput = $N/L_1$
  - $L_{M+P} = M*\max(t_n) \ge M*L_1/N$
  - $S_{+P}$ (speedup) = $M*L_1 / L_{M+P} \le M*L_1 / (M*L_1/N) = N$
- Q: for arbitrarily high speedup, use arbitrarily high N?


### No, Part I: Pipeline Overhead

- Let: O be extra delay per pipeline stage
  - Latch overhead: pipeline latches take time
  - Clock/data skew
- Now: N-stage pipeline with overhead
  - Assume $\max(t_n) = L_1/N$
  - $L_{1+P+O} = L_1 + N*O$
  - $T_{+P+O} = 1/(L_1/N + O) = 1/(1/T_{+P} + O) \le T_{+P}$, and $\le 1/O$
  - $L_{M+P+O} = M*L_1/N + M*O = L_{M+P} + M*O$
  - $S_{+P+O} = M*L_1 / (M*L_1/N + M*O) \le N = S_{+P}$, and $\le L_1/O$
- O limits throughput and speedup → limits useful N
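
The overhead model above can be checked numerically. The following is a minimal Python sketch, assuming perfectly balanced stages; the function name and the values L1 = 50 ns and O = 2 ns are illustrative assumptions (chosen so that N = 5 gives a 12 ns clock), not figures from the slides.

```python
def pipeline_with_overhead(l1, n, o):
    """N-stage pipeline with per-stage overhead o, assuming balanced stages.

    l1: single-cycle latency of one insn
    n:  number of pipeline stages
    o:  extra delay per stage (latch overhead, clock/data skew)
    Returns (one-insn latency, throughput, speedup over single-cycle).
    """
    cycle = l1 / n + o          # clock period: stage logic plus overhead
    latency = n * cycle         # L_{1+P+O} = L_1 + N*O
    throughput = 1.0 / cycle    # T_{+P+O} = 1 / (L_1/N + O)
    speedup = l1 / cycle        # S_{+P+O} = L_1 / (L_1/N + O)
    return latency, throughput, speedup


if __name__ == "__main__":
    L1, O = 50.0, 2.0  # hypothetical values in ns
    for n in (1, 5, 10, 50, 1000):
        lat, thr, s = pipeline_with_overhead(L1, n, O)
        print(f"N={n:5d}  latency={lat:8.1f}  throughput={thr:.3f}/ns  speedup={s:6.2f}")
```

As N grows, the speedup approaches L1/O (25 with these values) and never reaches N, while the one-insn latency keeps growing by O per added stage.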


### No, Part II: Hazards

- Dependence: relationship that serializes two insns
  - Data: two insns use the same value or storage location
  - Control: one instruction affects whether another executes at all
  - Maybe: two insns may have a dependence
- Hazard: dependence causes potential incorrect execution
  - Possibility of using or corrupting data or execution flow
- Structural: two insns want to use same structure, one must wait
  - Often fixed with stalls: insn stays in same stage for multiple cycles
- Let: H be average number of hazard stall cycles per instruction

  - $L_{1+P+H} = L_{1+P}$ (no hazards for one instruction)
  - $T_{+P+H} = [N/(N+H)] * N/L_1 = [N/(N+H)] * T_{+P}$
  - $L_{M+P+H} = M*L_1/N * [(N+H)/N] = [(N+H)/N] * L_{M+P}$
  - $S_{+P+H} = M*L_1 / (M*L_1/N * [(N+H)/N]) = [N/(N+H)] * S_{+P}$
- H also limits throughput, speedup → limits useful N
  - N↑ → H↑ (more insns "in flight" → more dependences become hazards)
  - Exact H depends on program, requires detailed simulation/model
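
The hazard formulas can be evaluated the same way. This is a small Python sketch of the slide's model only, assuming balanced stages and no per-stage overhead; the function name and the default M are illustrative assumptions, and a real H would come from simulation.

```python
def pipeline_with_hazards(l1, n, h, m=1_000_000):
    """Slide model of an N-stage pipeline with h average hazard stall cycles per insn.

    l1: single-cycle latency of one insn
    n:  number of pipeline stages
    h:  average hazard stall cycles per instruction
    m:  number of instructions (M >> 1)
    Returns (throughput, M-insn latency, speedup over single-cycle).
    """
    t_ideal = n / l1                        # T_{+P} with balanced stages
    factor = n / (n + h)                    # fraction of ideal throughput retained
    throughput = factor * t_ideal           # T_{+P+H}
    latency_m = m * l1 / n * (n + h) / n    # L_{M+P+H}
    speedup = m * l1 / latency_m            # S_{+P+H} = [N/(N+H)] * N
    return throughput, latency_m, speedup


if __name__ == "__main__":
    for h in (0.0, 0.5, 1.0, 2.0):  # hypothetical stall rates
        thr, lat, s = pipeline_with_hazards(l1=50.0, n=5, h=h)
        print(f"H={h:3.1f}  throughput={thr:.4f}/ns  speedup={s:.2f}")
```

With H = 0 the speedup is exactly N; each added stall cycle per insn scales it by N/(N+H).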


### Clock Rate vs. IPC

- Deeper pipeline (bigger N)
  - + Frequency ↑
  - − IPC ↓
  - Ultimate metric is IPC \* frequency
    - For years Intel got people to buy frequency, not IPC \* frequency
- Trend was deeper pipelines, now not too deep
  - Intel example:
    - 486: 5 stages (50+ gate delays / clock)
    - Pentium: 7 stages
    - Pentium II/III: 12 stages
    - Pentium 4: 22 stages (10 gate delays / clock)
      - 800 MHz Pentium III was faster than 1 GHz Pentium 4
    - Core2: 14 stages, less than Pentium 4
    - Haswell: 14-19 stages, depending upon uOp cache hit/miss

### Optimizing Pipeline Depth

- Parameterize clock cycle in terms of gate delays
  - G gate delays to process (fetch, decode, execute) a single insn
  - O gate delays overhead per stage
  - X average stall per instruction per stage
    - Simplistic: real X function much, much more complex
- Compute optimal N (pipeline stages) given G, O, X
  - IPC = 1 / (1 + X \* N)
  - f = 1 / (G / N + O)
  - Maximizing IPC \* f optimizes performance
- Example: G = 80, O = 1, X = 0.16

| N  | IPC = 1/(1+0.16*N) | freq = 1/(80/N+1) | IPC*freq |
|----|--------------------|-------------------|----------|
| 5  | 0.56               | 0.059             | 0.033    |
| 10 | 0.38               | 0.111             | 0.043    |
| 20 | 0.24               | 0.200             | 0.048    |
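
The two formulas can be swept over N to locate the best depth. The sketch below is a minimal Python illustration of this simplistic model; the function names and the search bound `max_n` are assumptions. Minimizing the denominator (1 + X\*N)(G/N + O) analytically gives N = sqrt(G / (X\*O)), about 22.4 for the example parameters.

```python
import math


def ipc(n, x):
    """IPC = 1 / (1 + X*N): stalls grow linearly with depth in this model."""
    return 1.0 / (1.0 + x * n)


def freq(n, g, o):
    """f = 1 / (G/N + O): gate delays per stage plus per-stage overhead."""
    return 1.0 / (g / n + o)


def perf(n, g, o, x):
    """Performance metric IPC * f."""
    return ipc(n, x) * freq(n, g, o)


def best_depth(g, o, x, max_n=100):
    """Integer depth in 1..max_n that maximizes IPC * f (brute-force sweep)."""
    return max(range(1, max_n + 1), key=lambda n: perf(n, g, o, x))


if __name__ == "__main__":
    G, O, X = 80, 1, 0.16
    for n in (5, 10, 20):
        print(n, round(ipc(n, X), 3), round(freq(n, G, O), 3), round(perf(n, G, O, X), 4))
    print("closed form:", math.sqrt(G / (X * O)))
    print("best integer N:", best_depth(G, O, X))
```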


### Research: Razor

- Razor [Uht, Ernst+]
  - Identify pipeline stages with narrow signal margins (e.g., X)
  - Add "Razor" X/M latch: relatches X/M input signals after safe delay
  - Compare X/M latch with "safe" Razor X/M latch, different?
    - Flush F, D, X & M
    - Restart M using X/M Razor latch, restart F using D/X latch
  - + Pipeline will not "break" → reduce V<sub>DD</sub> until flush rate too high
  - + Alternatively: "over-clock" until flush rate too high


### Managing a Pipeline

- Proper flow requires two pipeline operations
  - Mess with latch write-enable and clear signals to achieve
- Operation I: stall
  - Effect: stops some insns in their current stages
  - Use: make younger insns wait for older ones to complete
  - Implementation: de-assert write-enable
- Operation II: flush
  - Effect: removes insns from current stages
  - Use: see later
  - Implementation: assert clear signals
- Both stall and flush must be propagated to younger insns




### Avoiding Structural Hazards (PRS)

- Pipeline the contended resource
  - + No IPC degradation, low area, power overheads
  - − Sometimes tricky to implement (e.g., for RAMs)
  - For multi-cycle resources (e.g., multiplier)
- Replicate the contended resource
  - + No IPC degradation
  - − Increased area, power, latency (interconnect delay?)
  - For cheap, divisible, or highly contended resources (e.g., I\$/D\$)
- Schedule pipeline to reduce structural hazards (RISC)
  - Design ISA so insn uses a resource at most once
    - Eliminate same insn hazards
  - Always in same pipe stage (hazards between two of same insn)
    - Reason why integer operations forced to go through M stage
  - And always for one cycle



### **ISA Branch Techniques**

- Fast branch: resolves at D, not X
  - Test must be comparison to zero or equality, no time for ALU
  - + New taken branch penalty is 1
  - − Additional comparison insns (e.g., cmplt, slt) for complex tests
  - − Must bypass into decode now, too
- Delayed branch: branch that takes effect one insn later
  - Insert insns that are independent of branch into "branch delay slot"
    - Preferably from before branch (always helps then)
    - But from after branch OK too
      - As long as no undoable effects (e.g., a store)
  - Upshot: short-sighted feature (MIPS regrets it)
    - − Not a big win in today's pipelines
    - − Complicates interrupt handling

### Big Idea: Speculation

- Speculation
  - "Engagement in risky transactions on the chance of profit"
- Speculative execution
  - Execute before all parameters known with certainty
- Correct speculation
  - Avoid stall, improve performance
- Incorrect speculation (mis-speculation)
  - Must abort/flush/squash incorrect instructions
  - Must undo incorrect changes (recover pre-speculation state)
- The "game": [%<sub>correct</sub> \* gain] > [(1−%<sub>correct</sub>) \* penalty]
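
The speculation "game" is a one-line inequality, and solving it for the accuracy gives a break-even point. The Python sketch below is an illustration only; the function names and the sample gain and penalty values are assumptions, and both quantities are taken to be positive and in the same unit (e.g., cycles).

```python
def speculation_pays_off(p_correct, gain, penalty):
    """True when [%correct * gain] > [(1 - %correct) * penalty]."""
    return p_correct * gain > (1.0 - p_correct) * penalty


def break_even_accuracy(gain, penalty):
    """Accuracy at which both sides of the inequality are equal.

    Solving p * gain = (1 - p) * penalty gives p = penalty / (gain + penalty).
    Assumes gain + penalty > 0.
    """
    return penalty / (gain + penalty)


if __name__ == "__main__":
    # Hypothetical numbers: correct speculation saves 2 cycles, a wrong one costs 10.
    print(break_even_accuracy(gain=2, penalty=10))
    print(speculation_pays_off(0.95, gain=2, penalty=10))
```

The larger the penalty relative to the gain, the higher the accuracy needed before speculating is worthwhile.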









### Why Does a BTB Work?

- Because control insn targets are stable
  - Direct means constant target, indirect means register target
  - Direct conditional branches? Check
  - Direct calls? Check
  - Direct unconditional jumps? Check
  - Indirect conditional branches? Not that useful → not widely supported
  - Indirect calls? Two idioms
    - Dynamically linked functions (DLLs)? Check
    - Dynamically dispatched (virtual) functions? Pretty much check
  - Indirect unconditional jumps? Two idioms
    - Switches? Not really, but these are rare
    - Returns? Nope, but...





### Branch History Table (BHT)

- Branch history table (BHT): simplest direction predictor
  - PC indexes table of bits (0 = N, 1 = T), no tags
  - Essentially: branch will go same way it went last time
  - Problem: consider inner loop branch below (\* = mis-prediction)

        for (i=0;i<100;i++)
          for (j=0;j<3;j++)
            // whatever

| State/prediction | N* | T | T | T* | N* | T | T | T* | N* | T | T | T* |
|------------------|----|---|---|----|----|---|---|----|----|---|---|----|
| Outcome          | T  | T | T | N  | T  | T | T | N  | T  | T | T | N  |

- Two "built-in" mis-predictions per execution of the inner loop
- Branch predictor "changes its mind too quickly"
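
The table can be reproduced with a few lines of simulation. This is a minimal Python sketch of a single 1-bit BHT entry, assuming the entry starts at N and the branch outcome repeats as T T T N; the function name is illustrative.

```python
def simulate_one_bit(outcomes, initial="N"):
    """Simulate one 1-bit BHT entry: predict whatever the branch did last time.

    Returns (trace, mispredicts), where trace lists the state before each
    branch, with '*' appended on a mis-prediction.
    """
    state = initial
    mispredicts = 0
    trace = []
    for outcome in outcomes:
        wrong = state != outcome
        trace.append(state + ("*" if wrong else ""))
        mispredicts += wrong
        state = outcome  # remember only the most recent outcome
    return trace, mispredicts


pattern = list("TTTN") * 3  # three executions of the inner loop
trace, misses = simulate_one_bit(pattern)
print(" ".join(trace))  # N* T T T* N* T T T* N* T T T*
print(misses)           # 6
```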


### Two-Bit Saturating Counters (2bc)

- Two-bit saturating counters (2bc) [Smith]
  - Replace each single-bit prediction
    - (0,1,2,3) = (N,n,t,T)
  - Force direction predictor (DIRP) to mis-predict twice before "changing its mind"

| State/prediction | N* | n* | t | T* | t | T | T | T* | t | T | T | T* |
|------------------|----|----|---|----|---|---|---|----|---|---|---|----|
| Outcome          | T  | T  | T | N  | T | T | T | N  | T | T | T | N  |

- + Fixes this pathology (which is not contrived, by the way)
- Sometimes (wrongly) called a "bimodal predictor"
  - Note that this is NOT the same as a "bi-mode" predictor
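
A two-bit saturating counter can be simulated just as briefly. The Python sketch below assumes a single counter that starts at 0 (strong not-taken) and the same T T T N outcome pattern; the names `NAMES` and `simulate_two_bit` are illustrative.

```python
NAMES = "NntT"  # counter values 0..3 map to states N, n, t, T


def simulate_two_bit(outcomes, initial=0):
    """Simulate one 2-bit saturating counter; predict taken when counter >= 2.

    Returns (trace, mispredicts), where trace lists the state before each
    branch, with '*' appended on a mis-prediction.
    """
    counter = initial
    trace, mispredicts = [], 0
    for outcome in outcomes:
        predicted_taken = counter >= 2
        taken = outcome == "T"
        wrong = predicted_taken != taken
        trace.append(NAMES[counter] + ("*" if wrong else ""))
        mispredicts += wrong
        # Saturating update: move toward 3 on taken, toward 0 on not-taken.
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return trace, mispredicts


pattern = list("TTTN") * 3
trace, misses = simulate_two_bit(pattern)
print(" ".join(trace))  # N* n* t T* t T T T* t T T T*
print(misses)           # 5
```

After warm-up, only the loop-exit branch is mis-predicted: one miss per four branches.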


### **Correlated Predictor**

- Correlated (two-level) predictor [Patt]
  - Exploits observation that branch outcomes are correlated
  - Maintains separate prediction per (PC, BHR)
  - Branch history register (BHR): recent branch outcomes
  - Simple working example: assume program has one branch
    - BHT: one 1-bit DIRP entry
    - BHT+2BHR: 4 1-bit DIRP entries

| State/prediction | BHR=NN | N* | T  | T  | T  | T  | T | T  | T  | T | T | T  | T  |
|------------------|--------|----|----|----|----|----|---|----|----|---|---|----|----|
| "active pattern" | BHR=NT | N  | N* | T  | T  | T  | T | T  | T  | T | T | T  | T  |
|                  | BHR=TN | N  | N  | N  | N  | N* | T | T  | T  | T | T | T  | T  |
|                  | BHR=TT | N  | N  | N* | T* | N  | N | N* | T* | N | N | N* | T* |
| Outcome          |        | T  | T  | T  | N  | T  | T | T  | N  | T | T | T  | N  |

- We didn't make anything better, what's the problem?


### **Correlated Predictor**

- What happened?
  - BHR wasn't long enough to capture the pattern
  - Try again: BHT+3BHR: 8 1-bit DIRP entries

| State/prediction | BHR=NNN | N* | T  | T  | T | T  | T  | T | T | T | T | T | T |
|------------------|---------|----|----|----|---|----|----|---|---|---|---|---|---|
|                  | BHR=NNT | N  | N* | T  | T | T  | T  | T | T | T | T | T | T |
|                  | BHR=NTN | N  | N  | N  | N | N  | N  | N | N | N | N | N | N |
| "active pattern" | BHR=NTT | N  | N  | N* | T | T  | T  | T | T | T | T | T | T |
|                  | BHR=TNN | N  | N  | N  | N | N  | N  | N | N | N | N | N | N |
|                  | BHR=TNT | N  | N  | N  | N | N  | N* | T | T | T | T | T | T |
|                  | BHR=TTN | N  | N  | N  | N | N* | T  | T | T | T | T | T | T |
|                  | BHR=TTT | N  | N  | N  | N | N  | N  | N | N | N | N | N | N |
| Outcome          |         | T  | T  | T  | N | T  | T  | T | N | T | T | T | N |

- + No mis-predictions after predictor learns all the relevant patterns
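
The effect of BHR length can be checked by simulation. The Python sketch below models a single branch with 1-bit entries selected by the BHR, all entries and the BHR starting at N; using a dict for the table and the function name are illustrative assumptions.

```python
def simulate_correlated(outcomes, history_bits):
    """Two-level predictor for one branch: 1-bit entries indexed by the BHR.

    Returns a list of booleans, True where the prediction was wrong.
    """
    table = {}                  # BHR pattern -> last outcome seen (default "N")
    bhr = "N" * history_bits    # branch history register, oldest outcome first
    mispredicted = []
    for outcome in outcomes:
        prediction = table.get(bhr, "N")
        mispredicted.append(prediction != outcome)
        table[bhr] = outcome            # 1-bit entry: remember the latest outcome
        bhr = (bhr + outcome)[1:]       # shift the new outcome into the history
    return mispredicted


pattern = list("TTTN") * 3
for bits in (0, 2, 3):
    wrong = simulate_correlated(pattern, bits)
    print(bits, sum(wrong), "".join("*" if w else "." for w in wrong))
```

With 0 history bits this degenerates to the plain 1-bit BHT (6 misses), 2 bits gives 9 misses because history TT is followed by both T and N, and 3 bits gives 5 training misses and none afterwards.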

### **Correlated Predictor**

- Design choice I: one global BHR or one per PC (local)?
  - Each one captures different kinds of patterns
  - Global is better, captures local patterns for tight loop branches
  - Combination is better still
- Design choice II: how many history bits (BHR size)?
  - + Longer BHRs are better for some apps, shorter better for others
  - − BHT utilization decreases w/ long BHRs
    - Many history patterns are never seen
    - Many branches are history independent (don't care)
    - + PC ^ BHR allows multiple PCs to dynamically share BHT
    - BHR length < log<sub>2</sub>(BHT size)
  - − Long BHR takes longer to train
  - Typical length: 8-12, some predictors use multiple lengths
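
The PC ^ BHR indexing mentioned above (often called gshare) can be sketched as follows. This Python fragment is an illustration under stated assumptions: 4-byte instructions (so the low two PC bits are dropped), a power-of-two BHT, and hypothetical function names.

```python
def bht_index(pc, bhr, index_bits):
    """Index into a 2**index_bits-entry BHT by XORing PC bits with the global BHR.

    Assumes 4-byte instructions, so the two low PC bits carry no information.
    """
    mask = (1 << index_bits) - 1
    return ((pc >> 2) ^ bhr) & mask


def update_bhr(bhr, taken, history_bits):
    """Shift the newest outcome (1 = taken) into a history_bits-wide BHR."""
    mask = (1 << history_bits) - 1
    return ((bhr << 1) | int(taken)) & mask


if __name__ == "__main__":
    pc = 0x1000  # hypothetical branch address
    print(bht_index(pc, 0b0000, 10))  # same branch, different histories,
    print(bht_index(pc, 0b1010, 10))  # different BHT entries
```

Branches whose behavior does not depend on history tend to see few distinct patterns, so they occupy few entries and leave the rest of the table to others.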




### Predictor Updates

- Speculative update
  - Need to update history before we know if predictor is correct
  - Assume correct, fix up if wrong
  - Harder to do with local predictors (lots more state)
- Partial vs. full updates in hybrid predictors
  - Full = update all predictors
  - Partial = only update a subset of predictors
    - On correct prediction:
      - If all predictors agree, no update
      - If they differ, strengthen chooser and correct predictor(s)
    - On incorrect prediction:
      - Update chooser
        - If now correct, strengthen correct predictor(s)
        - If still wrong, update all predictors




## Branch Prediction Performance

- Same parameters
  - Branch: 20%, load: 20%, store: 10%, other: 50%
  - 75% of branches are taken
- Dynamic branch prediction
  - Branches predicted with 95% accuracy
  - CPI = 1 + 0.20\*0.05\*2 = 1.02
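
The CPI figure is the base CPI plus branch fraction times mis-prediction rate times penalty, where the factor 2 in the formula is read here as a 2-cycle mis-prediction penalty. A small Python sketch, with an illustrative function name, makes it easy to try other accuracies:

```python
def cpi_with_branch_prediction(branch_frac, mispredict_rate, penalty_cycles, base_cpi=1.0):
    """CPI = base + (fraction of branches) * (mis-prediction rate) * (penalty)."""
    return base_cpi + branch_frac * mispredict_rate * penalty_cycles


if __name__ == "__main__":
    # Slide parameters: 20% branches, 95% accuracy, 2-cycle penalty.
    print(round(cpi_with_branch_prediction(0.20, 0.05, 2), 4))  # 1.02
    # Hypothetical deeper pipeline with a 20-cycle penalty, same accuracy.
    print(round(cpi_with_branch_prediction(0.20, 0.05, 20), 4))  # 1.2
```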

### **Data Hazards**

- Real insn sequences pass values via registers/memory
- Three kinds of data dependences (where's the fourth?)

| Read-after-write (RAW) | Write-after-read (WAR) | Write-after-write (WAW) |
|------------------------|------------------------|-------------------------|
| True-dependence        | Anti-dependence        | Output-dependence       |
| add r2,r3→r1           | add r2,r3→r1           | add r2,r3→r1            |
| sub r1,r4→r2           | sub r5,r4→r2           | sub r1,r4→r2            |
| or r6,r3→r1            | or r6,r3→r1            | or r6,r3→r1             |

- Only one dependence between any two insns (RAW has priority)
- Dependence is property of the program and ISA
- Data hazards: function of data dependences and pipeline
  - Potential for executing dependent insns in wrong order
  - Require both insns to be in pipeline ("in flight") simultaneously
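
The three register dependence kinds can be detected mechanically from source and destination names. The Python sketch below is a simplified illustration: the `Insn` type and function name are assumptions, only register operands are considered, and intervening writes between the two insns are ignored.

```python
from typing import NamedTuple, Set, Tuple


class Insn(NamedTuple):
    op: str
    srcs: Tuple[str, ...]  # source registers
    dst: str               # destination register


def dependences(older: Insn, younger: Insn) -> Set[str]:
    """Return every register dependence kind from `older` to `younger`."""
    kinds = set()
    if older.dst in younger.srcs:
        kinds.add("RAW")   # younger reads what older writes
    if younger.dst in older.srcs:
        kinds.add("WAR")   # younger overwrites what older reads
    if older.dst == younger.dst:
        kinds.add("WAW")   # both write the same register
    return kinds


add = Insn("add", ("r2", "r3"), "r1")
print(dependences(add, Insn("sub", ("r5", "r4"), "r2")))  # {'WAR'}
print(dependences(add, Insn("or", ("r6", "r3"), "r1")))   # {'WAW'}
print(sorted(dependences(add, Insn("sub", ("r1", "r4"), "r2"))))  # ['RAW', 'WAR']
```

The last pair carries both a RAW (through r1) and a WAR (through r2); under the slide's rule that RAW has priority, it is counted as a RAW dependence.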

### Dependences and Loops

- Data dependences in loops
  - Intra-loop: within same iteration
  - Inter-loop: across iterations
- Example: DAXPY (Double precision A X Plus Y)

```
for (i=0;i<100;i++)
   Z[i]=A*X[i]+Y[i];

0: ldf f2,X(r1)
1: mulf f2,f0,f4
2: ldf f6,Y(r1)
3: addf f4,f6,f8
4: stf f8,Z(r1)
5: addi r1,8,r1
6: cmplti r1,800,r2
7: beq r2,Loop
```

- RAW intra: 0→1(f2), 1→3(f4), 2→3(f6), 3→4(f8), 5→6(r1), 6→7(r2)
- RAW inter: 5→0(r1), 5→2(r1), 5→4(r1), 5→5(r1)
- WAR intra: 0→5(r1), 2→5(r1), 4→5(r1)
- WAR inter: 1→0(f2), 3→1(f4), 3→2(f6), 4→3(f8), 6→5(r1), 7→6(r2)
- WAW intra: none
- WAW inter: 0→0(f2), 1→1(f4), 2→2(f6), 3→3(f8), 6→6(r2)

















### Compiler Scheduling Requires

- Large scheduling scope
  - Independent instruction to put between load-use pairs
  - + Original example: large scope, two independent computations
  - − This example: small scope, one computation

```
Before                      After
ld r2,4(sp)                 ld r2,4(sp)
ld r3,8(sp)                 ld r3,8(sp)
add r3,r2,r1 //stall        add r3,r2,r1 //stall
st r1,0(sp)                 st r1,0(sp)
```

### Compiler Scheduling Requires

- Enough registers
  - To hold additional "live" values
  - Example code contains 7 different values (including sp)
  - Before: max 3 values live at any time → 3 registers enough
  - After: max 4 values live → 3 registers not enough → WAR violations

```
Original                    After
ld r2,4(sp)                 ld r2,4(sp)
ld r1,8(sp)                 ld r1,8(sp)
add r1,r2,r1 //stall        ld r2,16(sp)
st r1,0(sp)                 add r1,r2,r1
ld r2,16(sp)                ld r1,20(sp)
ld r1,20(sp)                st r1,0(sp)
sub r2,r1,r1 //stall        sub r2,r1,r1
st r1,12(sp)                st r1,12(sp)
```

### Compiler Scheduling Requires

- Alias analysis (maybe dependence)
  - Ability to tell whether load/store reference same memory locations
    - Effectively, whether load/store can be rearranged
  - Example code: easy, all loads/stores use same base register (sp)
  - New example: can compiler tell that r8 = sp?

```
Before                      Wrong(?)
ld r2,4(sp)                 ld r2,4(sp)
ld r3,8(sp)                 ld r3,8(sp)
add r3,r2,r1 //stall        ld r5,0(r8)
st r1,0(sp)                 add r3,r2,r1
ld r5,0(r8)                 ld r6,4(r8)
ld r6,4(r8)                 st r1,0(sp)
sub r5,r6,r4 //stall        sub r5,r6,r4
st r4,8(r8)                 st r4,8(r8)
```

### WAW Hazards

- Write-after-write (WAW)

```
add r2,r3,r1
sub r1,r4,r2
or r6,r3,r1
```

- Compiler effects
  - Scheduling problem: reordering would leave wrong value in r1
    - Later instruction reading r1 would get wrong value
  - Artificial: no value flows through dependence
    - Eliminate using different output register name for or
- Pipeline effects
  - Doesn't affect in-order pipeline with single-cycle operations
    - One reason for making ALU operations go through M stage
  - Can happen with multi-cycle operations (e.g., FP or cache misses)








### WAR Hazards

- Write-after-read (WAR)

      add r2,r3,r1
      sub r5,r4,r2
      or r6,r3,r1

- Compiler effects
  - Scheduling problem: reordering would mean add uses wrong value for r2
  - Artificial: solve using different output register name for sub
- Pipeline effects
  - Can't happen in simple in-order pipeline
  - Can happen with out-of-order execution





### **Pipeline Performance Summary**

- Base CPI is 1, but hazards increase it
- Nothing magical about a 5 stage pipeline
  - Pentium 4 has 22 stage pipeline
- Increasing pipeline depth
  - + Increases clock frequency (that's why companies do it)
  - − But decreases IPC
    - Branch mis-prediction penalty becomes longer
      - More stages between fetch and whenever branch computes
    - Non-bypassed data hazard stalls become longer
      - More stages between register read and write
  - At some point, CPI losses offset clock gains, question is when?


### **Dynamic Pipeline Power**

- Remember control-speculation game
  - [2 cycles \* %<sub>correct</sub>] − [0 cycles \* (1−%<sub>correct</sub>)]
  - No penalty → mis-speculation no worse than stalling
  - This is a performance-only view
  - From a power standpoint, mis-speculation is worse than stalling
- Power control-speculation game
  - [0 nJ \* %<sub>correct</sub>] − [X nJ \* (1−%<sub>correct</sub>)]
  - No benefit → correct speculation no better than stalling
    - Not exactly, increased execution time increases static power
  - How to balance the two?


### Research: Speculation Gating

- Speculation gating [Manne+]
  - Extend branch predictor to give prediction + confidence
  - Speculate on high-confidence (mis-prediction unlikely) branches
  - Stall (save energy) on low-confidence branches
- Confidence estimation
  - What kind of hardware circuit estimates confidence?
  - Hard in absolute sense, but easy relative to given threshold
  - Counter scheme similar to %<sub>miss</sub> threshold for cache resizing
  - Example: assume 90% accuracy is high confidence
    - PC-indexed table of confidence-estimation counters
    - Correct prediction? table[PC]+=1 : table[PC]-=9;
    - Prediction for PC is confident if table[PC] > 0;
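
The +1/−9 counter scheme can be sketched in software. The Python class below is an illustration only: the class and method names are assumptions, and it uses an unbounded dict of unbounded integers where hardware would presumably use a finite table of small saturating counters.

```python
from collections import defaultdict


class ConfidenceEstimator:
    """PC-indexed confidence counters: +reward on a correct prediction, -penalty on a miss.

    With reward=1 and penalty=9 a counter drifts upward only when accuracy
    exceeds 9 / (1 + 9) = 90%, matching the threshold in the example above.
    """

    def __init__(self, reward=1, penalty=9):
        self.table = defaultdict(int)  # unbounded here; an assumption of this sketch
        self.reward = reward
        self.penalty = penalty

    def update(self, pc, correct):
        if correct:
            self.table[pc] += self.reward
        else:
            self.table[pc] -= self.penalty

    def is_confident(self, pc):
        return self.table[pc] > 0


est = ConfidenceEstimator()
pc = 0x400  # hypothetical branch address
for _ in range(10):
    est.update(pc, correct=True)
est.update(pc, correct=False)
print(est.is_confident(pc))  # True: counter is 10 - 9 = 1
est.update(pc, correct=False)
print(est.is_confident(pc))  # False: counter is 1 - 9 = -8
```

Speculation gating would then speculate past a branch only when `is_confident` holds for its PC, and stall otherwise.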


### Summary

- Principles of pipelining
  - Effects of overhead and hazards
- Pipeline diagrams
- Data hazards
  - Stalling and bypassing
- Control hazards
  - Branch prediction
- Power techniques
  - Dynamic power: speculation gating
  - Static and dynamic power: razor latches
