### U. Wisconsin CS/ECE 752 Advanced Computer Architecture I

Prof. David A. Wood

Unit 4: Multiple Issue and Static Scheduling

Slides developed by Amir Roth of University of Pennsylvania with sources that included University of Wisconsin slides by Mark Hill, Guri Sohi, Jim Smith, and David Wood.

Slides enhanced by Milo Martin, Mark Hill, and David Wood with sources that included Profs. Asanovic, Falsafi, Hoe, Lipasti, Shen, Smith, Sohi, Vijaykumar, and Wood

CS/ECE 752 (Wood): Multiple Issue & Static Scheduling

### Lots of Parallelism...

- Last unit: pipeline-level parallelism
  - Work on execute of one instruction in parallel with decode of next
- Next: instruction-level parallelism (ILP)
  - Execute multiple independent instructions fully in parallel
  - Today: limited multiple issue
  - Next unit: dynamic scheduling
    - Extract much more ILP via out-of-order processing
- Data-level parallelism (DLP)
  - Single-instruction, multiple data
  - Example: one instruction, four 16-bit adds (using 64-bit registers)
- Thread-level parallelism (TLP)
  - Multiple software threads running on multiple processors










### Superscalar Challenges - Front End

- Wide instruction fetch
  - Modest: need multiple instructions per cycle
  - Aggressive: predict multiple branches, trace cache
- Wide instruction decode
  - Replicate decoders
  - Dependences between instructions decoded in same cycle
- Wide instruction issue
  - Determine when instructions can proceed in parallel
  - Not all combinations possible
  - More complex stall logic: order N<sup>2</sup> for N-wide machine
- Wide register read
  - One port for each register read
  - Example: 4-wide superscalar → >= 8 read ports


### Superscalar Challenges - Back End

- Wide instruction execution
  - Replicate arithmetic units
  - Multiple cache ports
- Wide instruction register writeback
  - One write port per instruction that writes a register
  - Example: 4-wide superscalar → >= 4 write ports
- Wide bypass paths
  - More possible sources for data values
  - Order (N<sup>2</sup> \* P) for N-wide machine with execute pipeline depth P
- Fundamental challenge:
  - Amount of ILP (instruction-level parallelism) in the program
  - Compiler must schedule code and extract parallelism






















### Aside: Multiple-issue CISC

- How do we apply superscalar techniques to CISC?
  - Such as x86
  - Or CISCy ugly instructions in some RISC ISAs
- Break "macro-ops" into "micro-ops"
  - Also called "uops" or "RISC-ops"
  - A typical CISCy instruction "add [r1], [r2] → [r3]" becomes:
    - Load [r1] → t1 (t1 is a temp. register, not visible to software)
    - Load [r2] → t2
    - Add t1, t2 → t3
    - Store t3 → [r3]
  - However, conversion is expensive (latency, area, power)
  - Solution: cache converted instructions in trace cache
    - Used by Pentium 4
    - Internal pipeline manipulates only these RISC-like instructions
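
The macro-op to micro-op split can be sketched in Python. The `crack` function, the tuple encoding, and the bracket convention for memory operands are assumptions made for illustration; they do not describe any real decoder's internal format.

```python
from itertools import count

def crack(op, srcs, dst):
    """Split one macro-op into RISC-like micro-ops (hypothetical encoding).

    Operands written like "[r1]" are memory references; bare names are
    registers. Temps t1, t2, ... stand for registers invisible to software.
    """
    temps = (f"t{i}" for i in count(1))
    uops, regs = [], []
    for s in srcs:
        if s.startswith("["):          # memory source: load it into a temp
            t = next(temps)
            uops.append(("load", s, t))
            regs.append(t)
        else:                          # register source: use it directly
            regs.append(s)
    if dst.startswith("["):            # memory destination: compute, then store
        t = next(temps)
        uops.append((op, *regs, t))
        uops.append(("store", t, dst))
    else:
        uops.append((op, *regs, dst))
    return uops

for uop in crack("add", ["[r1]", "[r2]"], "[r3]"):
    print(uop)
```

A register-only instruction passes through as a single micro-op, which is why the conversion cost falls mostly on the CISCy memory-operand forms.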


### Wide Decode

- What is involved in decoding multiple (N) insns per cycle?
- Actually doing the decoding?
  - Easy if fixed length (multiple decoders), doable if variable length
- Reading input registers?
  - 2N register read ports (latency ∝ #ports)
  - + Actually less than 2N, most values come from bypasses
- What about the stall logic?


### N<sup>2</sup> Dependence Cross-Check

- Stall logic for 1-wide pipeline with full bypassing
  - Full bypassing = load/use stalls only
  - X/M.op==LOAD && (D/X.rs1==X/M.rd || D/X.rs2==X/M.rd)
- Now: same logic for a 2-wide pipeline

  - X/M<sub>1</sub>.op==LOAD && (D/X<sub>1</sub>.rs1==X/M<sub>1</sub>.rd || D/X<sub>1</sub>.rs2==X/M<sub>1</sub>.rd) ||
  - X/M<sub>1</sub>.op==LOAD && (D/X<sub>2</sub>.rs1==X/M<sub>1</sub>.rd || D/X<sub>2</sub>.rs2==X/M<sub>1</sub>.rd) ||
  - X/M<sub>2</sub>.op==LOAD && (D/X<sub>1</sub>.rs1==X/M<sub>2</sub>.rd || D/X<sub>1</sub>.rs2==X/M<sub>2</sub>.rd) ||
  - X/M<sub>2</sub>.op==LOAD && (D/X<sub>2</sub>.rs1==X/M<sub>2</sub>.rd || D/X<sub>2</sub>.rs2==X/M<sub>2</sub>.rd)
- Eight "terms": ∝ 2N<sup>2</sup>
  - This is the N<sup>2</sup> dependence cross-check
- Not quite done, also need
  - D/X<sub>2</sub>.rs1==D/X<sub>1</sub>.rd || D/X<sub>2</sub>.rs2==D/X<sub>1</sub>.rd
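
A minimal Python sketch of the same stall logic, generalized to N-wide, makes the term count explicit. The dict-based latch format and the function names are assumptions for illustration; real stall logic is combinational hardware, not a loop.

```python
def load_use_terms(n):
    """Register comparisons between N decoding insns (2 sources each) and N insns in X/M."""
    return 2 * n * n

def intra_group_terms(n):
    """Register comparisons between each decoding insn and the older insns in its own group."""
    return 2 * (n * (n - 1) // 2)

def reads(insn, reg):
    """True if insn uses reg as a source (reg None means 'writes no register')."""
    return reg is not None and reg in (insn["rs1"], insn["rs2"])

def stall(dx, xm):
    """dx: decoding insns in program order; xm: insns one stage ahead.

    Assumed latch format: dx entries have rs1, rs2, rd; xm entries have op, rd.
    """
    for i, insn in enumerate(dx):
        for older in xm:               # load/use check against every X/M slot
            if older["op"] == "LOAD" and reads(insn, older["rd"]):
                return True
        for older in dx[:i]:           # cross-check within the decode group
            if reads(insn, older["rd"]):
                return True
    return False

print(load_use_terms(2), intra_group_terms(2))
```

For N=2 the two counts are the eight load/use terms and the two extra intra-group comparisons listed above; both grow quadratically with N.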


### Superscalar Stalls

- Invariant: stalls propagate upstream to younger insns
- If older insn in pair stalls, younger insns must stall too
- What if younger insn stalls?
  - Can older insn from younger group move up?
  - Fluid: yes, but requires some muxing
    - ± Helps CPI a little, hurts clock a little, moves check out of decode
  - Rigid: no
    - ± Hurts CPI a little, but doesn't impact clock

| Rigid        | 1 | 2 | 3   | 4   | 5 |
|--------------|---|---|-----|-----|---|
| ld 0(r1),r4  | F | D | X   | M   | W |
| addi r4,1,r4 | F | D | d\* | d\* | X |
| sub r5,r2,r3 |   | F | p\* | p\* | D |
| st r3,0(r1)  |   | F | p\* | p\* | D |
| ld 4(r1),r8  |   |   |     |     | F |

| Fluid        | 1 | 2 | 3-5 |
|--------------|---|---|-----|
| ld 0(r1),r4  | F | D | …   |
| addi r4,1,r4 | F | D | …   |
| sub r5,r2,r3 |   | F | …   |
| st r3,0(r1)  |   | F | …   |
| ld 4(r1),r8  |   |   | …   |

### Wide Execute



- What is involved in executing multiple (N) insns per cycle?
- Multiple execution units ... N of every kind?
  - N ALUs? OK, ALUs are small
  - N FP dividers? No, FP dividers are huge and `fdiv` is uncommon
  - How many branches per cycle?
  - How many loads/stores per cycle?
  - Typically some mix of functional units proportional to insn mix
    - Intel Pentium: 1 any + 1 ALU


### Wide Memory Access

- How do we allow multiple loads/stores to execute?
  - Option #1: Extra read ports on data cache
    - Higher latency, etc.
  - Option #2: "Bank" the cache
    - Can support a load to an "odd" and an "even" address
    - Problem: address not known until execute stage
      - Complicates stall logic
    - With two banks, conflicts will occur frequently
  - Option #3: Replicate the cache
    - Multiple read bandwidth only
    - Larger area, but no conflicts, can be faster than more ports
    - Independent reads to replicas, writes (stores) go to all replicas
  - Option #4: Pipeline the cache ("double pump")
    - Start cache access every half cycle
    - Difficult circuit techniques
- Example: the Alpha 21164 uses option #3
  - 8KB L1-caches, supports two loads, but only one store
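
The bank-conflict check behind option #2 can be sketched as follows. The 32-byte block size, the choice of bank-select bits, and the function names are assumptions for illustration only.

```python
def bank_of(addr, num_banks=2, block_bytes=32):
    """Pick a bank from the block-address bits just above the block offset (assumed mapping)."""
    return (addr // block_bytes) % num_banks

def conflicts(addr_a, addr_b, num_banks=2, block_bytes=32):
    """Two same-cycle accesses conflict when they map to the same bank."""
    return bank_of(addr_a, num_banks, block_bytes) == bank_of(addr_b, num_banks, block_bytes)

print(conflicts(0, 32))   # adjacent blocks: "even" and "odd" bank
print(conflicts(0, 64))   # two blocks apart: same bank
```

Because both addresses are only available once they have been computed, this comparison cannot run in decode, which is what complicates the stall logic.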






### Wide Writeback

- What is involved in multiple (N) writebacks per cycle?
  - N register file write ports (latency ∝ #ports)
  - Must handle multiple writes to same register (or stall in decode)
  - Usually less than N, stores and branches don't do writeback
  - But some ISAs have update or auto-incr/decr addressing modes
- Multiple exceptions per cycle?
  - No, just the oldest one

### Multiple-Issue Implementations

- Statically-scheduled (in-order) superscalar
  - + Executes unmodified sequential programs
  - − Hardware must figure out what can be done in parallel
  - E.g., Pentium (2-wide), UltraSPARC (4-wide), Alpha 21164 (4-wide)
- Very Long Instruction Word (VLIW)
  - + Hardware can be dumb and low power
  - − Compiler must group parallel insns, requires new binaries
  - E.g., TransMeta Crusoe (4-wide)
- Explicitly Parallel Instruction Computing (EPIC)
  - A compromise: compiler does some, hardware does the rest
  - E.g., Intel Itanium (6-wide)
- Dynamically-scheduled superscalar
  - Pentium Pro/II/III (3-wide), Alpha 21264 (4-wide)
- We've already talked about statically-scheduled superscalar


### VLIW

- Hardware-centric multiple issue problems
  - − Wide fetch+branch prediction, N<sup>2</sup> bypass, N<sup>2</sup> dependence checks
  - Hardware solutions have been proposed: clustering, trace cache
- Software-centric: very long insn word (VLIW)
  - Effectively, a 1-wide pipeline, but unit is an N-insn group
  - Compiler guarantees insns within a VLIW group are independent
    - If no independent insns, slots filled with nops
  - Group travels down pipeline as a unit
    - + Simplifies pipeline control (no rigid vs. fluid business)
    - + Cross-checks within a group un-necessary
    - Downstream cross-checks (maybe) still necessary
  - Typically "slotted": 1st insn must be ALU, 2nd mem, etc.
    - + Further simplification

### History of VLIW

- Started with "horizontal microcode"
  - Culler-Harrison array processors ('72-'91)
  - Floating Point Systems FPS-120B
- Academic projects
  - Yale ELI-512 [Fisher, '85]
  - Illinois IMPACT [Hwu, '91]
- Commercial attempts
  - Multiflow [Colwell-Fisher, '85] → failed
  - Cydrome [Rau, '85] → failed
  - Motorola/TI embedded processors → successful as DSPs
  - Intel Itanium [Colwell,Fisher+Rau, '97] → ??
  - Transmeta Crusoe [Ditzel, '99] → failed

### Pure and Tainted VLIW

- Pure VLIW: no hardware dependence checks at all
  - Not even between VLIW groups
  - + Very simple and low power hardware
  - Compiler responsible for scheduling stall cycles
  - Requires precise knowledge of pipeline depth and structure
    - These must be fixed for compatibility
  - Doesn't support caches well
  - Used in some cache-less micro-controllers and signal processors
    - Not useful for general-purpose computation
- Tainted (more realistic) VLIW: inter-group checks
  - Compiler doesn't schedule stall cycles
  - + Precise pipeline depth and latencies not needed, can be changed
  - + Supports caches
  - TransMeta Crusoe

### What Does VLIW Actually Buy Us?

- + Simpler I\$/branch prediction
  - No trace cache necessary
- + Simpler dependence check logic
- Bypasses are the same
  - Clustering can help VLIW, too
  - Compiler can schedule for limited bypass networks
- Not compatible across machines of different widths
  - Is non-compatibility worth all of this?
- P.S. How does TransMeta deal with the compatibility problem?
  - Dynamically translates x86 to internal VLIW

### **EPIC**

- Tainted VLIW
  - Compatible across pipeline depths
  - But not across pipeline widths and slot structures
  - Must re-compile if going from 4-wide to 8-wide
  - TransMeta sidesteps this problem by re-compiling transparently
- EPIC (Explicitly Parallel Insn Computing)
  - New VLIW (Variable Length Insn Words)
  - Implemented as "bundles" with explicit dependence bits
  - Code is compatible with different "bundle" width machines
  - Compiler discovers as much parallelism as it can, hardware does rest
  - E.g., Intel Itanium (IA-64)
    - 128-bit bundles (3 41-bit insns + 5 template bits)
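
The bundle arithmetic (3 × 41 + 5 = 128) can be checked with a small packing sketch. Placing the template in the low-order bits with the three slots above it is an assumption about field order made here for illustration; the function names are hypothetical.

```python
TEMPLATE_BITS = 5
SLOT_BITS = 41
SLOTS_PER_BUNDLE = 3

def pack_bundle(template, slots):
    """Pack a template and three instruction slots into one 128-bit integer."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template does not fit in 5 bits")
    if len(slots) != SLOTS_PER_BUNDLE:
        raise ValueError("a bundle holds exactly three insns")
    word = template
    for i, insn in enumerate(slots):
        if not 0 <= insn < (1 << SLOT_BITS):
            raise ValueError("insn does not fit in 41 bits")
        word |= insn << (TEMPLATE_BITS + i * SLOT_BITS)
    return word

def unpack_bundle(word):
    """Recover (template, [slot0, slot1, slot2]) from a packed bundle."""
    template = word & ((1 << TEMPLATE_BITS) - 1)
    mask = (1 << SLOT_BITS) - 1
    slots = [(word >> (TEMPLATE_BITS + i * SLOT_BITS)) & mask
             for i in range(SLOTS_PER_BUNDLE)]
    return template, slots

bundle = pack_bundle(0b10000, [1, 2, 3])
print(bundle.bit_length() <= 128, unpack_bundle(bundle))
```

The template is where the explicit dependence information lives: hardware reads those few bits instead of cross-checking the registers of the three slots.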


### ILP and Static Scheduling

- No point to having an N-wide pipeline...
  - ...if average number of parallel insns per cycle (ILP) << N
- How can the compiler help extract parallelism?
  - These techniques applicable to regular superscalar
  - These techniques critical for VLIW/EPIC


### Code Example: SAXPY

- SAXPY (Single-precision A X Plus Y)
  - Linear algebra routine (used in solving systems of equations)
  - Part of early "Livermore Loops" benchmark suite

```
for (i=0;i<N;i++)
  Z[i]=A*X[i]+Y[i];

0: ldf X(r1),f1      // loop
1: mulf f0,f1,f2     // A in f0
2: ldf Y(r1),f3      // X,Y,Z are constant addresses
3: addf f2,f3,f4
4: stf f4,Z(r1)
5: addi r1,4,r1      // i in r1
6: blt r1,r2,0       // N*4 in r2
```

### SAXPY Performance and Utilization

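| Insn          | 1 | 2 | 3 | 4   | 5   | 6   | 7   | 8   | 9   | 10  | 11 | 12 | 13 | 14 | 15 | 16 |
|---------------|---|---|---|-----|-----|-----|-----|-----|-----|-----|----|----|----|----|----|----|
| ldf X(r1),f1  | F | D | X | M   | W   |     |     |     |     |     |    |    |    |    |    |    |
| mulf f0,f1,f2 |   | F | D | d\* | E\* | E\* | E\* | E\* | E\* | W   |    |    |    |    |    |    |
| ldf Y(r1),f3  |   |   | F | p\* | D   | X   | M   | W   |     |     |    |    |    |    |    |    |
| addf f2,f3,f4 |   |   |   |     | F   | D   | d\* | d\* | d\* | E+  | E+ | W  |    |    |    |    |
| stf f4,Z(r1)  |   |   |   |     |     | F   | p\* | p\* | p\* | D   | X  | M  | W  |    |    |    |
| addi r1,4,r1  |   |   |   |     |     |     |     |     |     | F   | D  | X  | M  | W  |    |    |
| blt r1,r2,0   |   |   |   |     |     |     |     |     |     |     | F  | D  | X  | M  | W  |    |
| ldf X(r1),f1  |   |   |   |     |     |     |     |     |     |     |    | F  | D  | X  | M  | W  |

- Scalar pipeline
  - Full bypassing, 5-cycle E\*, 2-cycle E+, branches predicted taken
  - Single iteration (7 insns) latency: 16 - 5 = 11 cycles
  - Performance: 7 insns / 11 cycles = 0.64 IPC
  - Utilization: 0.64 actual IPC / 1 peak IPC = 64%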
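
The scalar-pipeline arithmetic can be reproduced with a few lines of Python. The helper names are hypothetical; the cycle numbers are the ones quoted on this slide (writeback of the first `ldf` in cycle 5, of the next iteration's `ldf` in cycle 16).

```python
def ipc(insns, cycles):
    """Instructions per cycle over one steady-state loop iteration."""
    return insns / cycles

def utilization(actual_ipc, peak_ipc):
    """Fraction of the machine's peak issue bandwidth actually used."""
    return actual_ipc / peak_ipc

first_ldf_writeback = 5     # cycle of W for the first ldf
next_ldf_writeback = 16     # cycle of W for the same ldf one iteration later
latency = next_ldf_writeback - first_ldf_writeback

scalar_ipc = ipc(7, latency)
scalar_util = utilization(scalar_ipc, peak_ipc=1)
print(f"{latency} cycles, IPC {scalar_ipc:.2f}, utilization {scalar_util:.0%}")
```

Passing a different `peak_ipc` shows why widening the machine without removing stalls lowers utilization even when IPC rises slightly.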


### SAXPY Performance and Utilization

(Pipeline diagram over cycles 1-20 for `ldf X(r1),f1`, `mulf f0,f1,f2`, `ldf Y(r1),f3`, `addf f2,f3,f4`, `stf f4,Z(r1)`, `addi r1,4,r1`, `blt r1,r2,0`, and the next iteration's `ldf X(r1),f1`.)

- Dual issue pipeline (fluid)
  - Same + any two insns per cycle + embedded taken branches
  - + Performance: 7 insns / 10 cycles = 0.70 IPC
  - − Utilization: 0.70 actual IPC / 2 peak IPC = 35%
  - More hazards → more stalls (why?)
  - − Each stall is more expensive (why?)

### Schedule and Issue

- Issue: time at which insns begin execution
  - Want to maintain issue rate of N
- Schedule: order in which insns execute
  - In in-order pipeline, schedule + stalls determine issue
  - A good schedule that minimizes stalls is important
    - For both performance and utilization
- Schedule/issue combinations
  - Pure VLIW: static schedule, static issue
  - Tainted VLIW: static schedule, partly dynamic issue
  - Superscalar, EPIC: static schedule, dynamic issue


### **Instruction Scheduling**

- Idea: place independent insns between slow ops and uses
  - Otherwise, pipeline stalls while waiting for RAW hazards to resolve
  - Have already seen pipeline scheduling
- To schedule well, need ... independent insns
- Scheduling scope: code region we are scheduling
  - The bigger the better (more independent insns to choose from)
  - Once scope is defined, schedule is pretty obvious
  - Trick is creating a large scope (must schedule across branches)
- Compiler scheduling (really scope enlarging) techniques
  - Loop unrolling (for loops)
  - Trace scheduling (for non-loop control flow)


### Aside: Profiling

- Profile: statistical information about program tendencies
  - Software's answer to everything
  - Collected from previous program runs (different inputs)
  - ± Works OK depending on information
    - Memory latencies (cache misses)
      - + Identities of frequently missing loads stable across inputs
      - But are tied to cache configuration & dataset size
    - Memory dependences
      - + Stable across inputs
      - But exploiting this information is hard (need hw help)
    - Branch outcomes
      - Not so stable across inputs
  - More difficult to use, need to run program and then re-compile
  - Much prior research


### Loop Unrolling SAXPY

- Goal: separate dependent insns from one another
- SAXPY problem: not enough flexibility within one iteration
  - Longest chain of insns is 9 cycles
    - Load (1)
    - Forward to multiply (5)
    - Forward to add (2)
    - Forward to store (1)
  - Can't hide a 9-cycle chain using only 7 insns
  - But how about two 9-cycle chains using 14 insns?
- Loop unrolling: schedule two or more iterations together
  - Fuse iterations
  - Pipeline schedule to reduce RAW stalls
  - Pipeline schedule introduces WAR violations, rename registers to fix


### Unrolling SAXPY I: Fuse Iterations

- Combine two (in general K) iterations of loop
  - Fuse loop control: induction variable (i) increment + branch
  - Adjust implicit induction uses

```
// before: two iterations      // after: fused into one
ldf X(r1),f1                   ldf X(r1),f1
mulf f0,f1,f2                  mulf f0,f1,f2
ldf Y(r1),f3                   ldf Y(r1),f3
addf f2,f3,f4                  addf f2,f3,f4
stf f4,Z(r1)                   stf f4,Z(r1)
addi r1,4,r1
blt r1,r2,0
ldf X(r1),f1                   ldf X+4(r1),f1
mulf f0,f1,f2                  mulf f0,f1,f2
ldf Y(r1),f3                   ldf Y+4(r1),f3
addf f2,f3,f4                  addf f2,f3,f4
stf f4,Z(r1)                   stf f4,Z+4(r1)
addi r1,4,r1                   addi r1,8,r1
blt r1,r2,0                    blt r1,r2,0
```
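
The same fusion can be sketched at source level in Python. The function names are hypothetical, and the cleanup step for an odd trip count is an addition for illustration: the assembly on this slide does not show one.

```python
def saxpy(a, x, y):
    """Reference loop: one logical iteration per trip."""
    return [a * xi + yi for xi, yi in zip(x, y)]

def saxpy_unrolled2(a, x, y):
    """Two logical iterations fused per trip, with a single increment and branch."""
    n = len(x)
    z = [0] * n
    i = 0
    while i + 1 < n:
        z[i] = a * x[i] + y[i]               # first copy: offsets X, Y, Z
        z[i + 1] = a * x[i + 1] + y[i + 1]   # second copy: offsets X+4, Y+4, Z+4
        i += 2                               # fused induction update (addi r1,8,r1)
    if i < n:                                # leftover iteration when n is odd (assumed cleanup)
        z[i] = a * x[i] + y[i]
    return z

print(saxpy_unrolled2(2, [1, 2, 3, 4, 5], [10, 20, 30, 40, 50]))
```

Fusing only removes loop overhead; the scheduling benefit comes in the next steps, when the two copies are interleaved and registers renamed.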










### Trace Scheduling

- Problem: not everything is a loop
  - How to create large scheduling scopes from non-loop code?
- Idea: trace scheduling [Ellis, '85]
  - Find common paths in program (profile)
  - Realign basic blocks to form straight-line "traces"
    - Basic-block: single-entry, single-exit insn sequence
    - Trace: fused basic block sequence
  - Schedule insns within a trace
    - This is the easy part
  - Create fixup code outside trace
    - In case implicit trace path doesn't equal actual path
    - Nasty
  - Good scheduling needs ISA support for software speculation























### Predication

- Conventional control
  - Conditionally executed insns also conditionally fetched
- Predication
  - Conditionally executed insns unconditionally fetched
  - Full predication (ARM A32, IA-64)
    - Can tag every insn with predicate, but extra bits in instruction
  - Conditional moves (Alpha, IA-32)
    - Construct appearance of full predication from one primitive
      - `cmoveq r1,r2,r3   // if (r1==0) r3=r2;`
    - − May require some code duplication to achieve desired effect
    - + Only good way of adding predication to an existing ISA
- If-conversion: replacing control with predication
  - + Good if branch is unpredictable (save mis-prediction)
  - − But more instructions fetched and "executed"
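
If-conversion with a conditional move can be modeled in Python as below. This only mirrors the semantics: Python has no predicated instructions, so `cmoveq` is a hypothetical helper standing in for the hardware primitive, and the `then`/`else` example is an assumption chosen for illustration.

```python
def cmoveq(cond, src, dst):
    """Model of a conditional move: result is src if cond == 0, else the old dst."""
    return src if cond == 0 else dst

def with_branch(p, a, b):
    """Conventional control: only one arm is fetched and executed."""
    if p == 0:
        x = a - b
    else:
        x = a + b
    return x

def if_converted(p, a, b):
    """Both arms always execute; the conditional move selects the result."""
    t_then = a + b                 # executed even when p == 0
    t_else = a - b                 # executed even when p != 0
    x = t_then
    x = cmoveq(p, t_else, x)       # if (p == 0) x = t_else
    return x

print(if_converted(0, 5, 3), if_converted(1, 5, 3))
```

The converted version has no branch to mispredict but always pays for both arms, which is the trade-off named in the last two bullets.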

### ISA Support for Predication

```
0: ldf Y(r1),f2
1: fspne f2,p1
2: ldf.p p1,W(r1),f2
4: stf.np p1,f0,Y(r1)
5: ldf X(r1),f4
6: mulf f4,f2,f6
7: stf f6,Z(r1)
```

- IA-64: change branch 1 to set-predicate insn fspne
- Change insns 2 and 4 to predicated insns
  - ldf.p performs ldf if predicate p1 is true
  - stf.np performs stf if predicate p1 is false



### Static Scheduling Summary

- Goal: increase scope to find more independent insns
- Loop unrolling
  - + Simple
  - Expands code size, can't handle recurrences or non-loops
- Software pipelining
  - Handles recurrences
  - Complex prologue/epilogue code
  - Requires register copies (unless rotating register file....)
- Trace scheduling
  - Superblocks and hyperblocks
  - + Works for non-loops
  - More complex, requires ISA support for speculation and predication
  - Requires nasty repair code

### Multiple Issue Summary

- Problem spots
  - Wide fetch + branch prediction → trace cache?
  - N<sup>2</sup> dependence cross-check
  - N<sup>2</sup> bypass → clustering?
- Implementations
  - Statically scheduled superscalar
  - VLIW/EPIC
  - Research: Grid Processor
- What's next:
  - Finding more ILP by relaxing the in-order execution requirement



### **Loop Unrolling Shortcomings**

- − Static code growth → more I\$ misses (relatively minor)
- − Poor scheduling along "seams" of unrolled copies
- − Need more registers to resolve WAR hazards
- − Doesn't handle recurrences (inter-iteration dependences)
  - `for (i=0;i<N;i++) X[i]=A*X[i-1];`
  - One iteration: `ldf X-4(r1),f1`, `mulf f0,f1,f2`, `stf f2,X(r1)`, `addi r1,4,r1`, `blt r1,r2,0`
  - Unrolled by two: `ldf X-4(r1),f1`, `mulf f0,f1,f2`, `stf f2,X(r1)`, `mulf f0,f2,f3`, `stf f3,X+4(r1)`, `addi r1,8,r1`, `blt r1,r2,0`
  - Two mulf's are not parallel



### Software Pipelining

- Software pipelining: deals with these shortcomings
  - Also called "symbolic loop unrolling" or "poly-cyclic scheduling"
  - Reinvented a few times [Charlesworth, '81], [Rau, '85], [Lam, '88]
  - One physical iteration contains insns from multiple logical iterations
- The pipeline analogy
  - In a hardware pipeline, a single cycle contains...
    - Stage 3 of insn i, stage 2 of insn i+1, stage 1 of insn i+2
  - In a software pipeline, a single physical (SP) iteration contains...
    - Insn 3 from iter i, insn 2 from iter i+1, insn 1 from iter i+2


### Software Pipelined Recurrence Example

- Goal: separate mulf from stf
- Physical iteration (box) contains
  - stf from original iteration i
  - ldf, mulf from original iteration i+1
- Prologue: get pipeline started (ldf, mulf from iteration 0)
- Epilogue: finish up leftovers (stf from iteration N-1)
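
The prologue / physical iteration / epilogue structure can be sketched in Python for the recurrence X[i] = A\*X[i-1]. Treating `x[0]` as the seed value (the slide's X[-1]) and the variable names `f1`, `f2` are assumptions for illustration.

```python
def recurrence(a, x):
    """Reference loop: ldf, mulf, stf in every logical iteration."""
    x = list(x)
    for i in range(1, len(x)):
        x[i] = a * x[i - 1]
    return x

def recurrence_swp(a, x):
    """Software-pipelined version: stf of iteration i, then ldf+mulf of iteration i+1."""
    x = list(x)
    n = len(x)
    if n < 2:
        return x
    f1 = x[0]                  # prologue: ldf of the first logical iteration
    f2 = a * f1                # prologue: mulf of the first logical iteration
    for i in range(1, n - 1):  # physical iterations
        x[i] = f2              # stf from logical iteration i
        f1 = x[i]              # ldf from logical iteration i+1
        f2 = a * f1            # mulf from logical iteration i+1
    x[n - 1] = f2              # epilogue: stf of the last logical iteration
    return x

print(recurrence_swp(2, [1, 0, 0, 0]))
```

Inside the loop body the store writes a product computed one physical iteration earlier, so the `mulf` latency overlaps the loop control instead of stalling the `stf`.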



### Software Pipelining Pipeline Diagrams

```
LM S
  LM S
```

- Same diagrams, new terminology
  - Across: cycles → physical iterations
  - Down: insns → logical iterations
  - In the squares: stages → insns
- How many physical software pipelined iterations?
  - N: number of logical (original) iterations
  - K: number of logical iterations in one physical iteration


### Software Pipelined Example II

- Vary software pipelining structure to tolerate more latency
- Example: physical iteration combines three logical iterations
  - Original loop body: `ldf X(r1),f1`, `mulf f0,f1,f2`, `stf f2,X(r1)`, `addi r1,4,r1`, `blt r1,r2,0`
- Notice: no recurrence this time
  - Can't software pipeline recurrence three times

### Software Pipelining Pipeline Diagram



- Things to notice
  - Within physical iteration (column)...
    - Original iteration insns are in reverse order
    - That's OK, they are from different logical iterations
    - And are independent of each other
  - + Perfect for VLIW/EPIC


### Software Pipelining

- + Doesn't increase code size
- + Good scheduling at iteration "seams"
- + Can vary degree of pipelining to tolerate longer latencies
  - "Software super-pipelining"
  - One physical iteration: insns from logical iterations i, i+2, i+4
- Hard to do conditionals within loops
  - Easier with loop unrolling


### Scheduling: Compiler or Hardware

- Each has some advantages
- Compiler
  - + Potentially large scheduling scope (full program)
  - + Simple hardware → fast clock, short pipeline, and low power
  - Low branch prediction accuracy (profiling?)
  - Little information on memory dependences and latencies (profiling?)
  - Pain to speculate and recover from mis-speculation (h/w support?)
- Hardware
  - + High branch prediction accuracy
  - + Dynamic information about memory dependences and latencies
  - + Easy to speculate and recover from mis-speculation
  - Finite buffering resources fundamentally limit scheduling scope
  - Scheduling machinery adds pipeline stages and consumes power


### Research: Frames

- New experimental scheduling construct: frame
  - rePLay [Patel+Lumetta]
  - Frame: an atomic superblock
  - Atomic means all or nothing, i.e., transactional
  - Two new insns
    - begin frame: start buffering insn results
    - commit frame: make frame results permanent
    - Hardware support required for buffering
  - Any branches out of frame: abort the entire thing
  - + Eliminates nastiest part of trace scheduling ... nasty repair code
    - If frame path is wrong just jump to original basic block code
    - Repair code still exists, but it's just the original code




### Research: Grid Processor

- Grid processor architecture (aka TRIPS)
  - [Nagarajan, Sankaralingam, Burger+Keckler]
  - EDGE (Explicit Dataflow Graph Execution) execution model
  - Holistic attack on many fundamental superscalar problems
    - Specifically, the nastiest one: N<sup>2</sup> bypassing
    - But also N<sup>2</sup> dependence check
    - And wide-fetch + branch prediction
  - Two-dimensional VLIW
    - Horizontal dimension is insns in one parallel group
    - Vertical dimension is several vertical groups
  - Executes atomic hyperblocks
  - IBM looking into building it


### **Grid Processor**

- Components
  - Next h-block logic/predictor (NH), I\$, D\$, regfile
  - NxN ALU grid: here 4x4
- Pipeline stages
  - Fetch h-block to grid
  - Read registers
  - Execute/memory
    - Cascade
  - Write registers
- Block atomic
  - No intermediate regs
  - Grid limits size/shape

### **Grid Processor SAXPY**

|           |           |             |          |
|-----------|-----------|-------------|----------|
| read r2,0 | read f1,0 | read r1,0,1 | nop      |
| pass 0    | pass 1    | pass -1,1   | ldf X,-1 |
| pass 0    | pass 0,1  | mulf 1      | ldf Y,0  |
| pass 0    | addi      | pass 1      | addf 0   |
| blt       | nop       | pass 0,r1   | stf Z    |

- An h-block for this Grid processor has 5 4-insn words
  - The unit is all 5
- Some notes about Grid ISA
  - read: read register from register file
  - pass: null operation
  - -1,0,1: routing directives send result to next word
    - One insn left (-1), insn straight down (0), one insn right (1)
  - Directives specify value flow, no need for interior registers












