### U. Wisconsin CS/ECE 752 Advanced Computer Architecture I

Prof. David A. Wood

Unit 5: Dynamic Scheduling I

Slides developed by Amir Roth of University of Pennsylvania with sources that included University of Wisconsin slides by Mark Hill, Guri Sohi, Jim Smith, and David Wood.

Slides enhanced by Milo Martin, Mark Hill, and David Wood with sources that included Profs. Asanovic, Falsafi, Hoe, Lipasti, Shen, Smith, Sohi, Vijaykumar, and Wood

CS/ECE 752 (Wood): Dynamic Scheduling I



### The Problem With In-Order Pipelines

[Pipeline diagram, cycles 1-16: addf f0,f1,f2; mulf f2,f3,f2; subf f0,f1,f4]

- What's happening in cycle 4?
  - mulf stalls due to RAW hazard
    - OK, this is a fundamental problem
  - subf stalls due to pipeline (structural) hazard
    - Why? subf can't proceed into D because mulf is there
    - That is the only reason, and it isn't a fundamental one
- Why can't subf go into D in cycle 4 and E+ in cycle 6?



### Register Renaming

- To eliminate WAW and WAR hazards
- Example
  - Names: r1, r2, r3
  - Locations: p1, p2, p3, p4, p5, p6, p7
  - Original mapping: r1→p1, r2→p2, r3→p3, p4-p7 are "free"

| MapTable (r1, r2, r3) | FreeList       | Raw insns    | Renamed insns |
|-----------------------|----------------|--------------|---------------|
| p1, p2, p3            | p4, p5, p6, p7 | add r2,r3,r1 | add p2,p3,p4  |
| p4, p2, p3            | p5, p6, p7     | sub r2,r1,r3 | sub p2,p4,p5  |
| p4, p2, p5            | p6, p7         | mul r2,r3,r3 | mul p2,p5,p6  |
| p4, p2, p6            | p7             |              | div p4,4,p7   |

- + Removes WAW and WAR dependences
- + Leaves RAW intact!
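The renaming walk-through can be sketched in a few lines of Python. This is a minimal illustration, not a hardware description: the function name `rename`, the tuple layout `(op, src1, src2, dest)`, and the use of a `deque` as the free list are assumptions made for the example, and freeing of old physical locations is not modeled.

```python
from collections import deque

def rename(insns, map_table, free_list):
    """Rename (op, src1, src2, dest) tuples (hypothetical layout)."""
    map_table = dict(map_table)     # architectural name -> physical location
    free_list = deque(free_list)    # unallocated physical locations
    renamed = []
    for op, src1, src2, dest in insns:
        # Read the source mappings first, so a source that names the same
        # register as the destination still refers to the older value.
        p1 = map_table.get(src1, src1)   # immediates pass through unchanged
        p2 = map_table.get(src2, src2)
        pd = free_list.popleft()         # every write gets a fresh location
        map_table[dest] = pd
        renamed.append((op, p1, p2, pd))
    return renamed, map_table

raw = [("add", "r2", "r3", "r1"),
       ("sub", "r2", "r1", "r3"),
       ("mul", "r2", "r3", "r3")]
renamed, final_map = rename(raw,
                            {"r1": "p1", "r2": "p2", "r3": "p3"},
                            ["p4", "p5", "p6", "p7"])
for insn in renamed:
    print(insn)
```

Because sub and mul receive different destinations (p5 and p6), their WAW dependence on r3 disappears, while mul still reads p5, so the RAW dependence survives.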

### Dynamic Scheduling - OoO Execution

- Dynamic scheduling
  - Totally in the hardware
  - Also called "out-of-order execution" (OoO)
- Fetch many instructions into instruction window
  - Use branch prediction to speculate past (multiple) branches
  - Flush pipeline on branch misprediction
- Rename to avoid false dependencies (WAW and WAR)
- Execute instructions as soon as possible
  - Register dependencies are known
  - Handling memory dependencies more tricky (much more later)
- Commit instructions in order
  - If anything strange happens before commit, just flush the pipeline
- Current machines: 64-200+ instruction scheduling window

### Static Instruction Scheduling

- Issue: time at which insns execute
- Schedule: order in which insns execute
  - Related to issue, but the distinction is important
- Scheduling: re-arranging insns to enable rapid issue
  - Static: by compiler
    - Requires knowledge of pipeline and program dependences
    - Pipeline scheduling: the basics
    - Requires large scheduling scope full of independent insns
    - Loop unrolling, software pipelining: increase scope for loops
    - Trace scheduling: increase scope for non-loops

### Anything software can do ... hardware can do better


### Motivation: Dynamic Scheduling

- Dynamic scheduling (out-of-order execution)
  - Execute insns in non-sequential (non-von Neumann) order...
    - + Reduce RAW stalls
    - + Increase pipeline and functional unit (FU) utilization
      - Original motivation was to increase FP unit utilization
    - + Expose more opportunities for parallel issue (ILP)
      - Not in-order → can be in parallel
  - ...but make it appear like sequential execution
    - Important, but difficult
    - Next unit


### Before We Continue

- If we can do this in software...
- ...why build complex (slow-clock, high-power) hardware?
  - + Performance portability
    - Don't want to recompile for new machines
  - + More information available
    - Memory addresses, branch directions, cache misses
  - + More registers available (??)
    - Compiler may not have enough to fix WAR/WAW hazards
  - + Easier to speculate and recover from mis-speculation
    - Flush instead of recovery code
  - But compiler has a larger scope
    - Compiler does as much as it can (not much)
    - Hardware does the rest


### Going Forward: What's Next

- We'll build this up in steps over the next few weeks
  - "Scoreboarding": first OoO, no register renaming
  - "Tomasulo's algorithm": adds register renaming
  - Handling precise state and speculation
    - P6-style execution (Intel Pentium Pro)
    - R10k-style execution (MIPS R10k)
  - Handling memory dependencies
    - Conservative and speculative
- Let's get started!


### Dynamic Scheduling as Loop Unrolling

- Three steps of loop unrolling
  - Step I: combine iterations
    - Increase scheduling scope for more flexibility
  - Step II: pipeline schedule
    - Reduce impact of RAW hazards
  - Step III: rename registers
    - Remove WAR/WAW violations that result from scheduling


### Loop Example: SAX (SAXPY - PY)

- SAX (Single-precision A X)
  - Only because there won't be room in the diagrams for SAXPY

```
for (i=0;i<N;i++)
  Z[i]=A*X[i];

0: ldf X(r1),f1     // loop
1: mulf f0,f1,f2    // A in f0
2: stf f2,Z(r1)
3: addi r1,4,r1     // i in r1
4: blt r1,r2,0      // N*4 in r2
```

- Consider two iterations, ignore branch
  - ldf, mulf, stf, addi, ldf, mulf, stf
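To see which register dependences the two-iteration sequence carries, a small Python sketch can classify them. The read and write sets below are transcribed by hand from the assembly above; the function name `find_dependences` and the tuple format are assumptions for this example, and memory dependences are ignored.

```python
def find_dependences(insns):
    """insns: list of (text, reads, writes).
    Returns (kind, earlier_index, later_index, register) tuples."""
    last_writer = {}   # register -> index of its most recent writer
    readers = {}       # register -> readers since that most recent write
    deps = []
    for j, (_, reads, writes) in enumerate(insns):
        for reg in reads:
            if reg in last_writer:                 # true dependence
                deps.append(("RAW", last_writer[reg], j, reg))
        for reg in writes:
            for i in readers.get(reg, []):         # overwrite a pending read
                deps.append(("WAR", i, j, reg))
            if reg in last_writer:                 # overwrite an older write
                deps.append(("WAW", last_writer[reg], j, reg))
        for reg in reads:
            readers.setdefault(reg, []).append(j)
        for reg in writes:
            last_writer[reg] = j
            readers[reg] = []
    return deps

# Two iterations of the SAX loop body, branch ignored.
body = [("ldf X(r1),f1",  ["r1"],       ["f1"]),
        ("mulf f0,f1,f2", ["f0", "f1"], ["f2"]),
        ("stf f2,Z(r1)",  ["f2", "r1"], []),
        ("addi r1,4,r1",  ["r1"],       ["r1"])]
insns = body + body[:3]

deps = find_dependences(insns)
for kind, i, j, reg in deps:
    print(f"{kind}: insn {i} -> insn {j} on {reg}")
```

The WAR entries on r1 (from the first ldf and stf to addi) and the WAR/WAW entries on f1 and f2 are the false dependences that renaming would remove; the RAW entries remain in any schedule.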













### Dynamic Scheduling Algorithms

- Three parts to loop unrolling
  - Scheduling scope: insn buffer
  - Pipeline scheduling and register renaming: scheduling algorithm
- Look at two register scheduling algorithms
  - Register scheduler: scheduler based on register dependences
  - Scoreboard
    - No register renaming → limited scheduling flexibility
  - Tomasulo
    - Register renaming → more flexibility, better performance
- Big simplification in this unit: memory scheduling
  - Pretend register algorithm magically knows memory dependences
  - A little more realism next unit


### Scheduling Algorithm I: Scoreboard

- Scoreboard
  - Centralized control scheme: insn status explicitly tracked
  - Insn buffer: Functional Unit Status Table (FUST)
- First implementation: CDC 6600 [1964]
  - Separate non-pipelined functional units (7 int, 4 FP, 5 mem)
  - No register bypassing
- Our example: "Simple Scoreboard"


### Simple Scoreboard Data Structures

- FU Status Table
  - FU, busy, op, R, R1, R2: destination/source register names
  - T: destination register tag (FU producing the value)
  - T1, T2: source register tags (FU producing the values)
- Register Status Table
  - T: tag (FU that will write this register)
- Tags interpreted as ready-bits
  - Tag == 0 → Value is ready in register file
  - Tag != 0 → Value is not ready, will be supplied by T
- Insn status table
  - S, X bits for all active insns
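A minimal Python sketch of these tables may help make the tag convention concrete. The class names, the choice of numbering FUs from 1 so that tag 0 can mean "ready", and the `dispatch` helper are assumptions for illustration; hazard checks and the insn status table are left out.

```python
from dataclasses import dataclass

@dataclass
class FUStatusEntry:
    busy: bool = False
    op: str = ""
    R: str = None     # destination register name
    R1: str = None    # source register names
    R2: str = None
    T1: int = 0       # tag of the FU producing source 1 (0 = ready)
    T2: int = 0       # tag of the FU producing source 2 (0 = ready)

class Scoreboard:
    def __init__(self, num_fus):
        # Assumption: FU tags run 1..num_fus so that tag 0 means
        # "value is ready in the register file".
        self.fu_status = {t: FUStatusEntry() for t in range(1, num_fus + 1)}
        self.reg_status = {}   # register name -> tag; absent means 0

    def dispatch(self, tag, op, dest, src1, src2=None):
        """Fill the FU's row and record it as the pending writer of dest."""
        e = self.fu_status[tag]
        e.busy, e.op, e.R, e.R1, e.R2 = True, op, dest, src1, src2
        # Copy the current tags of the sources: nonzero means "wait for FU".
        e.T1 = self.reg_status.get(src1, 0)
        e.T2 = self.reg_status.get(src2, 0)
        if dest is not None:
            self.reg_status[dest] = tag

sb = Scoreboard(num_fus=2)
sb.dispatch(1, "ldf", dest="f1", src1="r1")               # ldf X(r1),f1
sb.dispatch(2, "mulf", dest="f2", src1="f0", src2="f1")   # mulf f0,f1,f2
print(sb.fu_status[2])
print(sb.reg_status)
```

After the two dispatches, the mulf row holds T1 == 0 (f0 is ready) and T2 == 1 (f1 will be supplied by FU 1), which is exactly the ready-bit reading of tags described above.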


[Figure: Simple Scoreboard datapath showing the fetched insn, the FU Status table (fields R1, R2, R, op, T, T1, T2), and CAMs. Legend: insn fields and status bits, tags, values.]

### Scoreboard Pipeline

- New pipeline structure: F, D, S, X, W
  - F (fetch)
    - Same as it ever was
  - D (dispatch)
    - Structural or WAW hazard ? stall : allocate scoreboard entry
  - S (issue)
    - RAW hazard ? wait : read registers, go to execute
  - X (execute)
    - Execute operation, notify scoreboard when done
  - W (writeback)
    - WAR hazard ? wait : write register, free scoreboard entry
    - W and RAW-dependent S in same cycle
    - W and structural-dependent D in same cycle



































|               | In-Order |      |     | Scoreboard |     |      |     |
|---------------|----------|------|-----|------------|-----|------|-----|
| Insn          | D        | X    | W   | D          | S   | X    | W   |
| ldf X(r1),f1  | c1       | c2+  | c7  | c1         | c2  | c3+  | c8  |
| mulf f0,f1,f2 | c7       | c8+  | c11 | c2         | c8  | c9+  | c12 |
| stf f2,Z(r1)  | c11      | c12  | c13 | c3         | c12 | c13  | c14 |
| addi r1,4,r1  | c12      | c13  | c14 | c4         | c5  | c6   | c13 |
| ldf X(r1),f1  | c14      | c15  | c16 | c5         | c13 | c14  | c15 |
| mulf f0,f1,f2 | c16      | c17+ | c20 | c6         | c15 | c16+ | c19 |
| stf f2,Z(r1)  | c20      | c21  | c22 | c7         | c19 | c20  | c21 |

- Multi-cycle cache miss on the first ldf
- Ignore FUST structural hazards
- Little relative advantage
  - addi WAR hazard (c7 → c13) stalls second iteration

### Scoreboard Redux

- The good
  - + Cheap hardware
    - InsnStatus + FuStatus + RegStatus ~ 1 FP unit in area
  - + Pretty good performance
    - 1.7X for FORTRAN (scientific array) programs
- The less good
  - - No bypassing
    - Is this a fundamental problem?
  - - Limited scheduling scope
    - Structural/WAW hazards delay dispatch
  - - Slow issue of truly-dependent (RAW) insns
    - WAR hazards delay writeback
  - Fix with hardware register renaming

### Scoreboard Pipeline Recap

- New pipeline structure: F, D, S, X, W
  - D (dispatch)
    - Structural or WAW hazard ? stall : allocate scoreboard entry
  - S (issue)
    - RAW hazard ? wait : read registers, go to execute
    - Detect? FUStatus.Ti != 0 → waiting for write
  - W (writeback)
    - WAR hazard ? wait : write register, free scoreboard entry
    - Detect WAR hazard? FUStatus.Ri matches && FUStatus.Ti == 0
    - Detect RAW hazard? FUStatus.Ti matches
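The per-stage checks in the recap can be written as small predicates over the FU Status Table. The sketch below is one possible reading, not the slides' implementation: it uses plain dicts, assumes that a busy FU is the structural hazard and a nonzero register-status tag is the WAW hazard at dispatch, and adds a hypothetical `issued` flag (standing in for the S bit of the insn status table) so that an insn that has already read its registers no longer blocks a later writer.

```python
def make_entry(busy=False, R=None, R1=None, R2=None, T1=0, T2=0, issued=False):
    """One FU Status Table row; `issued` is an added bookkeeping flag."""
    return {"busy": busy, "R": R, "R1": R1, "R2": R2,
            "T1": T1, "T2": T2, "issued": issued}

def must_stall_dispatch(fust, reg_status, fu, dest):
    """D: stall on a structural hazard (FU busy) or a WAW hazard."""
    structural = fust[fu]["busy"]
    waw = dest is not None and reg_status.get(dest, 0) != 0
    return structural or waw

def must_wait_issue(entry):
    """S: RAW hazard while either source tag is nonzero."""
    return entry["T1"] != 0 or entry["T2"] != 0

def must_wait_writeback(fust, fu):
    """W: WAR hazard if an un-issued insn has a ready source naming our dest."""
    dest = fust[fu]["R"]
    if dest is None:
        return False
    for tag, e in fust.items():
        if tag == fu or not e["busy"] or e["issued"]:
            continue
        if (e["R1"] == dest and e["T1"] == 0) or (e["R2"] == dest and e["T2"] == 0):
            return True
    return False

def writeback(fust, reg_status, fu):
    """W: wake RAW-dependents (Ti matches), clear reg status, free the entry."""
    dest = fust[fu]["R"]
    for e in fust.values():
        if e["T1"] == fu:
            e["T1"] = 0
        if e["T2"] == fu:
            e["T2"] = 0
    if dest is not None and reg_status.get(dest, 0) == fu:
        reg_status[dest] = 0
    fust[fu] = make_entry()

# Snapshot: FU1 holds stf f2,Z(r1) waiting on mulf (FU2); FU3 holds addi r1,4,r1.
fust = {1: make_entry(busy=True, R1="f2", R2="r1", T1=2),
        2: make_entry(busy=True, R="f2", R1="f0", R2="f1", issued=True),
        3: make_entry(busy=True, R="r1", R1="r1", issued=True),
        4: make_entry()}
reg_status = {"f2": 2, "r1": 3}

stall_busy = must_stall_dispatch(fust, reg_status, 3, "f5")
stall_waw = must_stall_dispatch(fust, reg_status, 4, "f2")
stall_none = must_stall_dispatch(fust, reg_status, 4, "f5")
stf_waits = must_wait_issue(fust[1])
addi_blocked = must_wait_writeback(fust, 3)

writeback(fust, reg_status, 2)        # mulf writes f2 and wakes stf
stf_ready = not must_wait_issue(fust[1])
fust[1]["issued"] = True              # stf reads f2 and r1
addi_free = not must_wait_writeback(fust, 3)
```

The snapshot mirrors the timing table's situation: addi finishes early but cannot write r1 until stf has issued, which is the WAR stall that register renaming later removes.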
























































### Can We Add Superscalar?

- Dynamic scheduling and multiple issue are orthogonal
  - E.g., Pentium4: dynamically scheduled 5-way superscalar
  - Two dimensions
    - N: superscalar width (number of parallel operations)
    - W: window size (number of reservation stations)
- What do we need for an N-by-W Tomasulo?
  - RS: N tag/value w-ports (D), 2N value r-ports (S), 2N tag CAMs (W)
  - Select logic: W→N priority encoder (S)
  - MT: 2N r-ports (D), N w-ports (D)
  - RF: 2N r-ports (D), N w-ports (W)
  - CDB: N (W)
  - Which are the expensive pieces?

### Superscalar Select Logic

- Superscalar select logic: W→N priority encoder
  - Somewhat complicated (N² log W)
  - Can simplify using different RS designs
- Split design
  - Divide RS into N banks: 1 per FU?
  - Implement N separate W/N→1 encoders
  - Simpler: N * log(W/N)
  - Less scheduling flexibility
- FIFO design [Palacharla+]
  - Can issue only head of each RS bank
  - Simpler: no select logic at all
  - Less scheduling flexibility (but surprisingly not that bad)
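The three select designs can be compared with a behavioral Python sketch. It models only which entries get picked, not the encoder circuitry; the function names are hypothetical, lower index is assumed to mean higher priority, and W is assumed to be a multiple of N.

```python
def select_unified(ready, n):
    """W->N priority select: up to n ready entries, lowest index first."""
    return [i for i, r in enumerate(ready) if r][:n]

def select_split(ready, n):
    """Split design: N banks of W/N entries, one W/N->1 encoder per bank."""
    bank_size = len(ready) // n
    picks = []
    for b in range(n):
        for i in range(b * bank_size, (b + 1) * bank_size):
            if ready[i]:
                picks.append(i)   # first ready entry in this bank wins
                break
    return picks

def select_fifo(ready, n):
    """FIFO design: only the head entry of each bank may issue."""
    bank_size = len(ready) // n
    return [b * bank_size for b in range(n) if ready[b * bank_size]]

# W = 8 reservation stations, N = 2; entries 1 and 2 are ready.
ready = [False, True, True, False, False, False, False, False]
print(select_unified(ready, 2))
print(select_split(ready, 2))
print(select_fifo(ready, 2))
```

With both ready entries in the same bank, the unified encoder issues two insns, the split design issues one, and the FIFO design issues none because neither bank head is ready, which illustrates the flexibility each simplification gives up.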




### Why Out-of-Order Bypassing Is Hard

|               | No Bypassing |     |      |     | Bypassing |     |     |     |
|---------------|--------------|-----|------|-----|-----------|-----|-----|-----|
| Insn          | D            | S   | X    | W   | D         | S   | X   | W   |
| ldf X(r1),f1  | c1           | c2  | c3   | c4  | c1        | c2  | c3  | c4  |
| mulf f0,f1,f2 | c2           | c4  | c5+  | c8  | c2        | c3  | c4+ | c7  |
| stf f2,Z(r1)  | c3           | c8  | c9   | c10 | c3        | c6  | c7  | c8  |
| addi r1,4,r1  | c4           | c5  | c6   | c7  | c4        | c5  | c6  | c7  |
| ldf X(r1),f1  | c5           | c7  | c8   | c9  | c5        | c7  | c7  | c9  |
| mulf f0,f1,f2 | c6           | c9  | c10+ | c13 | c6        | c9  | c8+ | c13 |
| stf f2,Z(r1)  | c10          | c13 | c14  | c15 | c10       | c13 | c11 | c15 |

- Bypassing: ldf X in c3 → mulf X in c4 → mulf S in c3
  - But how can mulf S in c3 if ldf W in c4? Must change pipeline
- Modern scheduler
  - Split CDB tag and value, move tag broadcast to S
    - ldf tag broadcast now in cycle 2 → mulf S in cycle 3
  - How do multi-cycle operations work? How do cache misses work?


### **Dynamic Scheduling Summary**

- Dynamic scheduling: out-of-order execution
  - Higher pipeline/FU utilization, improved performance
  - Easier and more effective in hardware than software
  - + More storage locations than architectural registers
  - + Dynamic handling of cache misses
- Instruction buffer: multiple F/D latches
  - Implements large scheduling scope + "passing" functionality
  - Split decode into in-order dispatch and out-of-order issue
  - Stall vs. wait
- Dynamic scheduling algorithms
  - Scoreboard: no register renaming, limited out-of-order
  - Tomasulo: copy-based register renaming, full out-of-order