# **Reading Assignment**

- Read Goodman and Hsu's paper, "Code Scheduling and Register Allocation in Large Basic Blocks."
- Read Bernstein and Rodeh's paper, "Global Instruction Scheduling for Superscalar Machines." (Linked from the class Web page.)

#### Gibbons & Muchnick Postpass Code Scheduler

1. If there is only one root, schedule it.
2. If there is more than one root, choose the root that won't be stalled by instructions already scheduled.

   If more than one root can be scheduled without stalling, consider the following rules (in order):
   - (a) Does this root stall any of its successors? (If so, schedule it immediately.)
   - (b) How many new roots are exposed if this node is scheduled? (More is better.)
   - (c) Which root has the longest weighted path to a leaf (using instruction delays as the weight)? (The "critical path" in the DAG gets priority.)
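The selection rules above can be sketched as a small list scheduler. This is only an illustration under stated assumptions, not Gibbons and Muchnick's actual implementation: the names `gm_schedule`, `succs`, and `latency` are hypothetical, an instruction issued at cycle t is assumed to make its result available at cycle t + latency, and rule (a) is modeled as "the root has a latency greater than 1 and at least one successor."

```python
from functools import lru_cache

def gm_schedule(succs, latency):
    """Order the DAG `succs` (node -> list of successors) using the rules above."""
    preds = {n: [] for n in succs}
    for n, ss in succs.items():
        for s in ss:
            preds[s].append(n)

    @lru_cache(maxsize=None)
    def path_to_leaf(n):
        # Rule (c): longest latency-weighted path from n to a leaf.
        return latency[n] + max((path_to_leaf(s) for s in succs[n]), default=0)

    issue = {}   # node -> cycle at which it was issued
    order = []
    cycle = 0
    while len(order) < len(succs):
        # A root is an unscheduled node whose predecessors are all scheduled.
        roots = [n for n in succs
                 if n not in issue and all(p in issue for p in preds[n])]

        def key(n):
            # Step 2: prefer roots whose operands are already available.
            ready = all(issue[p] + latency[p] <= cycle for p in preds[n])
            # Rule (a), as modeled here: n would stall a successor issued right after it.
            stalls_succ = latency[n] > 1 and bool(succs[n])
            # Rule (b): successors that become roots once n is scheduled.
            exposed = sum(all(p == n or p in issue for p in preds[s])
                          for s in succs[n])
            return (ready, stalls_succ, exposed, path_to_leaf(n))

        best = max(roots, key=key)
        # If even the best root must wait, the machine stalls until it is ready.
        cycle = max([cycle] + [issue[p] + latency[p] for p in preds[best]])
        issue[best] = cycle
        order.append(best)
        cycle += 1
    return order, issue

# Hypothetical example: two loads feed an add; a third load is independent.
succs = {"ld_a": ["add"], "ld_b": ["add"], "add": [], "ld_c": []}
latency = {"ld_a": 2, "ld_b": 2, "add": 1, "ld_c": 2}
order, issue = gm_schedule(succs, latency)
print(order)   # ['ld_a', 'ld_b', 'ld_c', 'add']
```

In this example the independent load fills the slot in which the add would otherwise stall waiting for `ld_b`.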

#### Example



#### **False Dependencies**

We still have delays in the schedule that was produced because of "false dependencies."

Both b and c are loaded into %r2. This prevents the load of c from being moved ahead of any instruction that still uses the value of b in %r2.

To improve our schedule we can use a processor that renames registers *or* allocate additional registers to remove false dependencies.
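As a rough illustration of what makes a dependence "false," the sketch below classifies the register dependences between an earlier and a later instruction. The representation (dicts with `reads` and `writes` sets) and the function name `classify` are assumptions made for this example. Only a read-after-write dependence carries a value; the other two kinds exist only because a register name is reused.

```python
def classify(first, second):
    """Return the kinds of register dependence from `first` to a later `second`."""
    kinds = set()
    if first["writes"] & second["reads"]:
        kinds.add("true")     # second needs a value that first produces
    if first["reads"] & second["writes"]:
        kinds.add("anti")     # false: second overwrites a register first still reads
    if first["writes"] & second["writes"]:
        kinds.add("output")   # false: both write the same register
    return kinds

ld_b = {"reads": set(), "writes": {"%r2"}}             # ld  [b],%r2
add  = {"reads": {"%r1", "%r2"}, "writes": {"%r1"}}    # add %r1,%r2,%r1
ld_c = {"reads": set(), "writes": {"%r2"}}             # ld  [c],%r2

print(classify(ld_b, add))    # {'true'}
print(classify(add, ld_c))    # {'anti'}
print(classify(ld_b, ld_c))   # {'output'}
```

The anti and output dependences on %r2 are the ones that disappear if c is loaded into a different register.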

## **Register Renaming**

Many out-of-order processors automatically rename distinct uses of the same architectural register to distinct internal registers.

Thus

```
ld [a],%r1
ld [b],%r2
add %r1,%r2,%r1
ld [c],%r2
```

is executed as if it were

```
ld [a],%r1
ld [b],%r2
add %r1,%r2,%r3
ld [c],%r4
```

Now the final load can be executed prior to the add, eliminating a stall.
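The renaming idea can be sketched in a few lines. This is a simplified model, not a description of any particular processor: every register write is given a fresh internal register (named `%p1`, `%p2`, ... here, a hypothetical naming), and each read uses the most recent mapping of its architectural register. Instructions are represented as hypothetical `(op, sources, dest)` tuples.

```python
import itertools

def rename(instrs):
    """Give every register write a fresh internal register."""
    fresh = (f"%p{i}" for i in itertools.count(1))
    current = {}   # architectural register -> latest internal register
    out = []
    for op, srcs, dst in instrs:
        # Sources read the latest mapping; memory operands pass through unchanged.
        new_srcs = [current.get(s, s) for s in srcs]
        current[dst] = next(fresh)
        out.append((op, new_srcs, current[dst]))
    return out

code = [
    ("ld",  ["[a]"],        "%r1"),
    ("ld",  ["[b]"],        "%r2"),
    ("add", ["%r1", "%r2"], "%r1"),
    ("ld",  ["[c]"],        "%r2"),
]
renamed = rename(code)
for op, srcs, dst in renamed:
    print(op, ",".join(srcs + [dst]))
# ld [a],%p1
# ld [b],%p2
# add %p1,%p2,%p3
# ld [c],%p4
```

After renaming, the last load no longer shares a register with the add, so only true dependences remain.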

## **Compiler Renaming**

A compiler can also use the idea of renaming to avoid unnecessary stalls.

An extra register may be needed (as was the case for scheduling expression trees).

Also, a *round-robin* allocation policy is needed. Registers are reused in a *cyclic* fashion, so that the most recently freed register is reused last, not first.
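A minimal sketch of such a round-robin pool, assuming a simple free list kept as a queue (the class name `RoundRobinRegs` and its methods are hypothetical): registers are allocated from the front and freed registers are appended at the back, so a just-freed register is the last one to be handed out again.

```python
from collections import deque

class RoundRobinRegs:
    """Free registers are reused in cyclic (FIFO) order."""

    def __init__(self, names):
        self.free = deque(names)

    def alloc(self):
        # Take the register that has been free the longest.
        return self.free.popleft()

    def release(self, reg):
        # The most recently freed register goes to the back of the queue.
        self.free.append(reg)

pool = RoundRobinRegs(["%r1", "%r2", "%r3", "%r4"])
a = pool.alloc()      # %r1
b = pool.alloc()      # %r2
pool.release(a)       # %r1 is free again, but is queued behind %r3 and %r4
c = pool.alloc()
print(c)              # %r3
```

A stack-like free list would have returned %r1 for the third allocation, reintroducing a false dependence on the register that was just freed.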

#### Example



#### After Scheduling:

| No. | Instruction | Reason chosen |
|-----|-------------|---------------|
| 4 | `ld [c],%r3` | Longest path |
| 5 | `ld [d],%r4` | Exposes a root |
| 1 | `ld [a],%r1` | Stalls succ. |
| 2 | `ld [b],%r2` | Exposes a root |
| 6 | `smul %r3,%r4,%r5` | Stalls succ. |
| 8 | `add %r3,%r4,%r3` | Longest path |
| 9 | `smul %r3,%r4,%r3` | Stalls succ. |
| 3 | `add %r1,%r2,%r1` | Only choice |
| 7 | `add %r1,%r5,%r2` | Only choice |
| 10 | `add %r2,%r3,%r2` | Only choice |
| 11 | `st %r2,[a]` | |

(0 Stalls Total)

# **Balanced Scheduling**

When scheduling a load, we normally anticipate the *best* case, a hit in the primary cache.

On older architectures this makes sense, since we stall execution on a cache miss.

Many newer architectures are *non-blocking*. This means we can continue execution after a miss until the loaded value is used.

Assume a cache miss takes N cycles (N is typically 10 or more).

Do we schedule a load anticipating a 1 cycle delay (a hit) or an N cycle delay (a miss)?





#### **Balance Placement of Loads**

Eggers suggests a *balanced scheduler* that spaces out loads, using available independent instructions as "filler."

The insight is that scheduling should be driven not by worst-case latencies but by the *independent* instructions that are actually available.

For





# Idea of the Algorithm

Look at each instruction i in the Dependency DAG.

Determine which loads can run in parallel with i and use all (or part) of i's execution time to cover the latency of these loads.

Compute the available latency of each load:

- Give each load instruction an initial latency of 1.
- For each instruction i in the Dependency DAG:
  - Consider the instructions independent of i: $G_{ind} = \mathit{DepDAG} - (\mathit{AllPred}(i) \cup \mathit{AllSucc}(i) \cup \{i\})$
  - For each connected subgraph c in $G_{ind}$:
    - Find m = the maximum number of load instructions on any path in c.
    - For each load d in c, add 1/m to d's latency.
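The latency computation above can be sketched as follows. This is one reading of the steps, with assumptions: the DAG is a dict `succs` mapping each node to its successors, `loads` is the set of load nodes, "connected subgraph" is taken to mean a weakly connected component of the subgraph induced by $G_{ind}$, and exact fractions are used for the 1/m increments. All names are hypothetical.

```python
from fractions import Fraction

def balanced_latencies(succs, loads):
    nodes = set(succs)
    preds = {n: set() for n in nodes}
    for n in nodes:
        for s in succs[n]:
            preds[s].add(n)

    def reachable(start, edges):
        # AllPred(i) or AllSucc(i): transitive closure along `edges`.
        seen, stack = set(), list(edges[start])
        while stack:
            n = stack.pop()
            if n not in seen:
                seen.add(n)
                stack.extend(edges[n])
        return seen

    def components(sub):
        # Weakly connected components of the subgraph induced by `sub`.
        left = set(sub)
        while left:
            comp, stack = set(), [left.pop()]
            while stack:
                n = stack.pop()
                comp.add(n)
                for nb in (set(succs[n]) | preds[n]) & left:
                    left.discard(nb)
                    stack.append(nb)
            yield comp

    def max_loads_on_path(comp):
        memo = {}
        def loads_from(n):
            if n not in memo:
                below = max((loads_from(s) for s in succs[n] if s in comp),
                            default=0)
                memo[n] = below + (n in loads)
            return memo[n]
        return max(loads_from(n) for n in comp)

    latency = {d: Fraction(1) for d in loads}   # initial latency of 1
    for i in nodes:
        g_ind = nodes - reachable(i, preds) - reachable(i, succs) - {i}
        for c in components(g_ind):
            m = max_loads_on_path(c)
            if m == 0:
                continue                        # no loads to cover in c
            for d in c & loads:
                latency[d] += Fraction(1, m)
    return latency

# Hypothetical DAG: L0 -> L1 -> U is a chain of two loads and a use;
# X1 and X2 are independent of the chain.
succs = {"L0": ["L1"], "L1": ["U"], "U": [], "X1": [], "X2": []}
result = balanced_latencies(succs, loads={"L0", "L1"})
print(result["L0"], result["L1"])   # 2 2
```

Each of the two independent instructions contributes 1/2 to each load on the chain, so the two available filler slots are shared evenly between the loads.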

# Computing the Schedule Using Adjusted Latencies

Once latencies are assigned to each load (other instructions have a latency of 1), we annotate each instruction in the Dependency DAG with its critical path weight: the maximum total latency, along any path, from the instruction to a leaf of the DAG.

Instructions are scheduled using critical path values; the root with the highest critical path value is always scheduled next. In cases of ties (same critical path value), operations with the longest latency are scheduled first.
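A minimal sketch of this scheduling step, under the assumption that an instruction's critical path weight includes its own latency. The function name, the DAG representation, and the example latencies are hypothetical.

```python
def critical_path_schedule(succs, latency):
    """Schedule roots by highest critical path weight; break ties by latency."""
    preds_left = {n: 0 for n in succs}
    for n in succs:
        for s in succs[n]:
            preds_left[s] += 1

    weight = {}
    def cp(n):
        # Maximum total latency on any path from n to a leaf (n included).
        if n not in weight:
            weight[n] = latency[n] + max((cp(s) for s in succs[n]), default=0)
        return weight[n]
    for n in succs:
        cp(n)

    roots = [n for n in succs if preds_left[n] == 0]
    order = []
    while roots:
        best = max(roots, key=lambda n: (weight[n], latency[n]))
        roots.remove(best)
        order.append(best)
        for s in succs[best]:
            preds_left[s] -= 1
            if preds_left[s] == 0:
                roots.append(s)      # s has become a root
    return order, weight

# Hypothetical DAG with example adjusted latencies: both loads have latency 2.
succs = {"L0": ["L1"], "L1": ["U"], "U": [], "X1": [], "X2": []}
latency = {"L0": 2, "L1": 2, "U": 1, "X1": 1, "X2": 1}
order, weight = critical_path_schedule(succs, latency)
print(order)    # ['L0', 'L1', 'X1', 'X2', 'U']
print(weight)   # {'U': 1, 'L1': 3, 'L0': 5, 'X1': 1, 'X2': 1}
```

Remaining ties (here X1, X2, and U, which all have weight 1 and latency 1) are broken by the order in which they became roots, which is an arbitrary choice in this sketch.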



# Using the annotated Dependency DAG, instructions can be scheduled:

