Kurlander solved this problem by mapping it to a known problem in Integer Programming: the Dual Network Flow Problem. Solution techniques for this problem are well known—libraries of standard solution algorithms exist. Moreover, this problem can be solved in polynomial time. That is, it is "easier" than optimal global (intraprocedural) register allocation, which is NP-complete! #### **Adding Saves & Restores** Wall designed his save-free interprocedural allocator for a machine with 52 registers. Most computers have far fewer registers, and hence saving and restoring across calls, when profitable, should be allowed. Kurlander's Technique can be extended to include save/restore costs. If the cost of saving and restoring is *less* than the benefit of allocating an extra register, saving is done. Moreover, saving is done where it is *cheapest* (not closest!). CS 701 Fall 2014 365 CS 701 Fall 2014 ... #### **Example** ``` main() { ... p(); ...} p(){ ... for (i=0; i<1000000; i++){ q(): } }</pre> ``` We first allocate registers in a savefree mode. After all Registers have been allocated, q may need additional registers. Most allocators would add save/ restore code at q's call site (or q's prologue and epilogue). An optimal allocator will place save/ restore code at p's call site, freeing a register that p doesn't even want (but that q does want!) ### **Extending the Cost Model** - As before, we group register candidates of different subprograms into registers. - Now only candidates within the same subprogram automatically interfere. - Saves are placed on the edges of the call graph. - · We aim to solve $$\begin{array}{ccc} \textbf{Maximize} & \sum_{\substack{k \\ c_j \ \in \ \bigcup_{i=1}^{k} R_i}} & - & \sum_{\substack{e_m \in \ \text{Edges}}} s_m * \text{Saved}_m \end{array}$$ where $s_m$ is the per/register save/ restore cost and $saved_m$ is the number of registers saved on edge $e_m$ . - As registers are saved, they may be reused in child subprograms. - This optimization problem can be solved as a Network Dual Flow Problem. - Again, the solution algorithm is polynomial. # **Example (One Register)** P<sub>1</sub> gets R1 save-free for m. A save on $P_1 \rightarrow P_4$ costs 1 and gives a register to n (net profit =2), so we do it. A save on $P_1 \rightarrow P_2$ for z costs 2, and yields 1, which isn't profitable. A save on $P_2 \rightarrow P_3$ for q costs 4, and yields 3, which isn't profitable. A save on $P_1 \rightarrow P_2$ for q costs 2, and yields 3, which is a net gain. CS 701 Fall 2014® 9 CS 701 Fall 2014® 370 ### **Handling Global Variables** - Wall's technique handled globals by assuming they interfere with all subprograms and all other globals. - Kurlander's approach is incremental (and non-optimal): - First, an optimal allocation for r registers is computed. Next, one register is "stolen" and assigned interprocedurally to the most beneficial global. (Subprograms that don't use the global can save and restore it locally, allowing local reuse). An optimal allocation using R-1 registers is computed. If this solution plus the shared global is more profitable than the R register solution, the global allocation is "locked in." Next, another register is "stolen" for a global, leaving R-2 for interprocedural allocation. This process continues until stealing another register for a global isn't profitable. # Why is Optimal Interprocedural Register Allocation Easier than Optimal IntraProcedural Allocation? This result seems counter-intuitive. How can allocating a whole program be *easier* (computationally) than allocating only one subprogram. Two observations provide the answer: - Interprocedural allocation assumes some form of local allocation has occurred (to identify register candidates). - Interprocedural interference is transitive (if A interferes with B and B interferes with C then A interferes with B). But intraprocedural interference isn't transitive! #### **Reading Assignment** - Read Section 15.4 (Code Scheduling) of Crafting a Compiler. - Read Gibbon's and Muchnick's paper, "Efficient Instruction Scheduling for a Pipelined Architecture." - Read Kerns and Eggers' paper, "Balanced Scheduling: Instruction Scheduling When Memory Latency is Uncertain." CS 701 Fall 2014 S 373 CS 701 Fall 2014 S 3 ### **Code Scheduling** Modern processors are pipelined. They give the impression that all instructions take unit time by executing instructions in *stages* (steps), as if on an assembly line. Certain instructions though (loads, floating point divides and square roots, delayed branches) take more than one cycle to execute. These instructions may *stall* (halt the processor) or require a nop (null operation) to execute properly. A Code Scheduling phase may be needed in a compiler to avoid stalls or eliminate nops. #### **Scheduling Expression DAGs** After generating code for a DAG or basic block, we may wish to schedule (reorder) instructions to reduce or eliminate stalls. A Postpass Scheduler is run after code selection and register allocation. Postpass schedulers are very general and flexible, since they can be used with code generated by any compiler with any degree of optimization But, since they can't modify register allocations, they can't always avoid stalls. #### **Dependency DAGs** Obviously, not all reorderings of generated instructions are acceptable. Computation of a register value must precede all uses of that value. A store of a value must precede all loads that might read that value. A Dependency Dag reflects essential ordering constraints among instructions: - · Nodes are Instructions to be scheduled. - An arc from Instruction i to Instruction j indicates that i must be executed before j may be executed. **Kinds of Dependencies** We can identify several kinds of dependencies: · True Dependence: An operation that uses a value has a true dependence (also called a flow dependence) upon an earlier operation that computes the value. For example: Anti Dependence: An operation that writes a value has a anti dependence upon an earlier operation that reads the value. For example: CS 701 Fall 2014<sup>©</sup> 377 CS 701 Fall 2014 ® 378 #### Output Dependence: An operation that writes a value has a output dependence upon an earlier operation that writes the value. For example: ``` mov 1, %12 mov 2, %12 ``` Collectively, true, anti and output dependencies are called data dependencies. Data dependencies constrain the order in which instructions may be executed. # Example Consider the code that might be generated for ``` a = ((a+b) + (c*d)) + ((c+d) * d); ``` We'll assume 4 registers, the minimum possible, and we'll reuse already loaded values. Assume a 1 cycle stall between a load and use of the loaded value and a 2 cycle stall between a multiplication and first use of the product. CS 701 Fall 2014® #### **Scheduling Requires Topological Traversal** Any valid code schedule is a **Topological Sort of the dependency** dag. To create a code schedule you - (1) Pick any root of the Dag. - (2) Remove it from the Dag and schedule it. - (3) Iterate! Choosing a Minimum Delay schedule is NP-Complete: "Computers and Intractability," M. Garey and D. Johnson, W.H. Freeman, 1979. CS 701 Fall 2014® #### **Dynamically Scheduled** (Out of Order) Processors To avoid stalls, some processors can execute instructions Out of Program Order. If an instruction can't execute because a previous instruction it depends upon hasn't completed yet, the instruction can be "held" and a successor instruction executed instead. When needed predecessors have completed, the held instruction is released for execution. #### **Example** [a], %r1 [b], %r2 [d], %r3 1. ld 2. ld 5. ld 3. add %r1,%r2,%r1 4. ld [c], %r2 Stall 6. smul %r2,%r3,%r4 8. add %r2,%r3,%r2 9. smul %r2,%r3,%r2 7. add %r1,%r4,%r1 Stall 10. add %r1,%r2,%r1 11. st %r1,[a] (2 Stalls Total) #### **Limitations of Dynamic Scheduling** - 1. Extra processor complexity. - 2. Register renaming (to avoid False Dependencies) may be needed. - 3. Identifying instructions to be delayed takes time. - 4. Instructions "late" in the program can't be started earlier. **Postpass Code Scheduler** Gibbons & Muchnick - 1. If there is only one root, schedule it. - 2. If there is more than one root. choose that root that won't be stalled by instructions already scheduled. - 3. If more than one root can be scheduled without stalling, consider the following rules (in order); - (a) Does this root stall any of its successors? (If so, schedule it immediately.) - (b) How many new roots are exposed if this node is scheduled? (More is better.) CS 701 Fall 2014® CS 701 Fall 2014 (c) Which root has the longest weighted path to a leaf (using instruction delays as the weight). (The "critical path" in the DAG gets priority.) #### **Example** [a], %r1 //Longest path [b], %r2 //Exposes a root [d], %r3 //Not delayed %r1,%r2,%r1 //Only choice 1. ld 2. ld 5. ld 3. add [c], %r2 //Only choice 4. ld 6. smul %r2,%r3,%r4 7/Stalls succ. %r2,%r3,%r2 //Not delayed 8. add 9. smul %r2,%r3,%r2 //Not delayed %r1,%r4,%r1 //Only choice %r1,%r2,%r1 //Only choice 7. add 10. add 11. st %r1,[a] (2 Stalls Total) #### **False Dependencies** We still have delays in the schedule that was produced because of "false dependencies." Both b and c are loaded into %r2. This limits the ability to move the load of c prior to any use of %r2 that uses b. To improve our schedule we can use a processor that renames registers or allocate additional registers to remove false dependencies. #### **Register Renaming** Many out of order processors automatically rename distinct uses of the same architectural register to distinct internal registers. #### Thus ``` ld [a],%r1 ld [b],%r2 add %r1,%r2,%r1 ld [c],%r2 ``` is executed as if it were ``` ld [a],%r1 ld [b],%r2 add %r1,%r2,%r3 ld [c],%r4 ``` Now the final load can be executed prior to the add, eliminating a stall. CS 701 Fall 2014<sup>©</sup> 389 CS 701 Fall 2014<sup>©</sup> 390 ### **Compiler Renaming** A compiler can also use the idea of renaming to avoid unnecessary stalls. An extra register may be needed (as was the case for scheduling expression trees). Also, a round-robin allocation policy is needed. Registers are reused in a cyclic fashion, so that the most recently freed register is reused last, not first. #### **Example** ``` 1. ld [a], %r1 2. 1d [b], %r2 Stall 3. add %r1,%r2,%r1 4. ld [c], %r3 5. ld [d], %r4 Stall 6. smul %r3,%r4,%r5 Stall*2 7. add %r1,%r5,%r2 8. add %r3,%r4,%r3 9. smul %r3,%r4,%r3 - Stall*2 10. add %r2,%r3,%r2 11. st %r2,[a] (6 Stalls Total) 3 ``` ## **After Scheduling:** ``` [c], %r3 //Longest path [d], %r4 //Exposes a ro [a], %r1 //Stalls succ. [b], %r2 //Exposes a ro 4. ld 5. ld //Exposes a root 1. ld //Stalls succ. 2. ld //Exposes a root ``` 6. smul %r3,%r4,%r5 //Stalls succ. 8. add %r3,%r4,%r3 //Longest path //Stalls succ. 9. smul %r3,%r4,%r3 3. add %r1,%r2,%r1 //Only choice 7. add %r1,%r5,%r2 //Only choice 10. add %r2,%r3,%r2 //Only choice 11. st %r2,[a] (0 Stalls Total) CS 701 Fall 2014 $^{\odot}$ 393 CS 701 Fall 2014 $^{\odot}$