Overview

The goal of optimization is to produce better code (fewer instructions and, more importantly, code that runs faster). However, it is important not to change the behavior of the program (what it computes)!

We will look at the following ways to improve a program:

  1. Peephole Optimization. This is done after code generation. It involves finding opportunities to improve the generated code by making small, local changes.
  2. Moving Loop-Invariant Computations. This is done before code generation. It involves finding computations inside loops that can be moved outside, thus speeding up the execution time of the loop.
  3. Strength-Reduction in for Loops. This is done before code generation. It involves replacing multiplications inside loops with additions. If it takes longer to execute a multiplication than an addition, then this speeds up the code.
  4. Copy Propagation. This is done before code generation. It involves replacing the use of a variable with a literal or another variable. Copy propagation can sometimes uncover more opportunities for moving loop-invariant computations. It may also make it possible to remove some assignments from the program, thus making the code smaller and faster (see the short example after this list).
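
For example (using made-up variable names), if the code contains the copy tmp = x, and neither tmp nor x is reassigned before tmp's later uses, those uses can be replaced with x:

     before:  tmp = x;
              a = tmp + 1;
              b = tmp * 2;

     after:   tmp = x;
              a = x + 1;
              b = x * 2;

If no other use of tmp remains, the assignment tmp = x is dead and can be removed as well.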
Peephole Optimization

The idea behind peephole optimization is to examine the code "through a small window," looking for special cases that can be improved. Below are some common optimizations that can be performed this way. Note that, in all cases that involve removing an instruction, it is assumed that the removed instruction is not the target of a branch.

  1. Remove a redundant load (fewer instructions generated, and fewer executed):
     before:  store Rx, M
              load M, Rx

     after:   store Rx, M
  2. Remove a redundant push/pop (fewer instructions generated, and fewer executed):
     before:  push Rx
              pop into Rx

     after:   (nothing -- both instructions can be removed)
  3. Replace a jump to a jump (same number of instructions generated, but fewer executed):
     before:  goto L1
              ...
          L1: goto L2

     after:   goto L2
              ...
          L1: goto L2
  4. Remove a jump to the next instruction (fewer instructions generated, and fewer executed):
     before:  goto L1
          L1: ...

     after:   L1: ...
  5. Replace a jump around jump (fewer instructions generated; possibly fewer executed):
     before:  if T0 == 0 goto L1
              goto L2
          L1: ...

     after:   if T0 != 0 goto L2
          L1: ...
  6. Remove useless operations (fewer instructions generated and fewer executed):
     before:  add T0, T0, 0
              mul T0, T0, 1

     after:   (nothing -- adding 0 or multiplying by 1 has no effect,
               so these instructions are useless)
  7. Reduction in strength: don't use a slow, general-purpose instruction where a fast, special-purpose instruction will do (same number of instructions, but faster):
     before:  mul T0, T0, 2
     after:   shift-left T0

     before:  add T0, T0, 1
     after:   inc T0
Note that doing one optimization may enable another. For example, removing a useless add exposes a redundant store:

     original:        load Tx, M
                      add Tx, 0
                      store Tx, M

     after round 1:   load Tx, M
                      store Tx, M

     after round 2:   load Tx, M
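
To make the idea concrete, here is a minimal sketch in C of a peephole pass; the instruction representation is made up for illustration (this is not the notes' implementation), add/mul are assumed to have the form op Rx, Rx, imm, and no instruction is assumed to be a branch target. Because one optimization may enable another, the pass repeats until nothing changes:

    #include <stdbool.h>
    #include <string.h>

    /* A made-up three-address instruction representation. */
    typedef enum { LOAD, STORE, ADD, MUL, NOP } Op;

    typedef struct {
        Op   op;
        char reg[8];   /* register operand, e.g. "T0" */
        char mem[8];   /* memory operand,   e.g. "M"  */
        int  imm;      /* immediate operand for ADD/MUL */
    } Instr;

    /* One pass of a two-instruction peephole window.
       Marks removed instructions as NOP; returns true if anything changed. */
    static bool peephole_pass(Instr *code, int n) {
        bool changed = false;
        for (int i = 0; i < n; i++) {
            Instr *a = &code[i];

            /* Useless operation: add Rx, Rx, 0 or mul Rx, Rx, 1. */
            if ((a->op == ADD && a->imm == 0) || (a->op == MUL && a->imm == 1)) {
                a->op = NOP;
                changed = true;
            }

            if (i + 1 < n) {
                Instr *b = &code[i + 1];

                /* Redundant load: store Rx, M ; load M, Rx  =>  drop the load.
                   Redundant store: load M, Rx ; store Rx, M =>  drop the store. */
                if (((a->op == STORE && b->op == LOAD) ||
                     (a->op == LOAD && b->op == STORE)) &&
                    strcmp(a->reg, b->reg) == 0 && strcmp(a->mem, b->mem) == 0) {
                    b->op = NOP;
                    changed = true;
                }
            }
        }
        return changed;
    }

    /* Squeeze out the NOPs, returning the new instruction count. */
    static int compact(Instr *code, int n) {
        int m = 0;
        for (int i = 0; i < n; i++)
            if (code[i].op != NOP)
                code[m++] = code[i];
        return m;
    }

    /* Repeat until a fixed point is reached. */
    int peephole(Instr *code, int n) {
        while (peephole_pass(code, n))
            n = compact(code, n);
        return n;
    }

Running this on the three-instruction sequence above removes the useless add in the first pass, which brings the load and store together so the second pass can remove the store.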


TEST YOURSELF #1

Consider the following program:

Question 1: The code generated for this program contains opportunities for the first two kinds of peephole optimization (removing a redundant load, and replacing a jump to a jump). Can you explain how those opportunities arise just by looking at the source code?

Question 2: Below is the generated code. Verify your answer to question 1 by finding the opportunities for the two kinds of optimization. What other opportunity for removing redundant code is common in this example?

Optimization #1: Moving Loop-Invariant Computations Out of the Loop

An expression is loop-invariant if its value does not change while the loop executes. Consider the following triple-nested loop, where A is a 100x100x100 array of 4-byte integers:

    for (i=0; i<100; i++) {
        for (j=0; j<100; j++) {
            for (k=0; k<100; k++) {
                A[i][j][k] = i*j*k;
            }
        }
    }

The address of A[i][j][k] is FP - offsetA + i*40000 + j*400 + k*4, and the straightforward generated code recomputes all of it, plus the right-hand side i*j*k, on every iteration of the innermost loop. However, FP - offsetA is invariant with respect to all three loops; i*40000 is invariant with respect to the j and k loops; and j*400 and i*j are invariant with respect to the k loop.

We can move the computations of the loop-invariant expressions out of their loops, assigning the values of those expressions to new temporaries, and then using the temporaries in place of the expressions. When we do that for the example above, we get:

    tmp0 = FP - offsetA
    for (i=0; i<100; i++) {
        tmp1 = tmp0 + i*40000
        for (j=0; j<100; j++) {
            tmp2 = tmp1 + j*400
            temp = i*j
            for (k=0; k<100; k++) {
                T0 = temp * k       // T0 is i*j*k
                T1 = tmp2 + k*4     // T1 is the address of A[i][j][k]
                store T0, 0(T1)     // store i*j*k into A[i][j][k]
            }
        }
    }

Here is a comparison of the original code and the optimized code (the number of instructions performed in the innermost loop, which is executed 1,000,000 times):

    Original Code                                      New Code
    ----------------------------------------------     ----------------------------------------------
    5 multiplications (3 for lvalue, 2 for rvalue)     2 multiplications (1 for lvalue, 1 for rvalue)
    1 subtraction; 3 additions (for lvalue)            1 addition (for lvalue)
    1 indexed store                                    1 indexed store

Questions:

  1. How do we recognize loop-invariant expressions?
  2. When and where do we move the computations of those expressions?
Answers:
  1. An expression is invariant with respect to a loop if, for every operand, one of the following holds:
    1. It is a literal, or
    2. It is a variable that gets its value only from outside the loop.

     For example, in the code above, i*40000 is invariant with respect to the j and k loops: 40000 is a literal, and i gets its value only from code outside those loops.

  2. To answer question 2, we need to consider safety and profitability.
Safety

If evaluating the expression might cause an error, then moving it out of the loop is a problem whenever the expression might not have been executed at all in the original, unoptimized code. For example:

    b = a;
    while (a != 0) {
        x = 1/b;      // possible divide by zero if moved out of the loop
        a--;
    }

What about preserving the order of events? For example, if the unoptimized code performed some output and then had a runtime error, is it valid for the optimized code to have the runtime error without first producing the output? Also note that changing the order of floating-point computations may change the result, because rounding makes floating-point arithmetic non-associative.
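
For example, reassociating a floating-point sum changes the printed result (a minimal C illustration; the values are chosen only to make the rounding visible):

    #include <stdio.h>

    int main(void) {
        double a = 1e16, b = 1.0, c = 1.0;
        /* The gap between adjacent doubles near 1e16 is 2.0, so
           1e16 + 1.0 rounds back to 1e16 and both 1s are lost. */
        printf("%.1f\n", (a + b) + c);   /* prints 10000000000000000.0 */
        printf("%.1f\n", a + (b + c));   /* prints 10000000000000002.0 */
        return 0;
    }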

Profitability

If the computation might not execute in the original program, moving the computation might actually slow the program down!

Moving a computation is both safe and profitable if one of the following holds:

  1. It can be determined that the loop will execute at least once, and the code is guaranteed to execute if the loop does (for example, because it does not appear inside any condition in the loop body).
  2. The expression is in a (non-short-circuited) part of the loop test / loop bounds; see the sketch after this list.
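
For instance (a made-up fragment, with n and m assumed to be loop-invariant), the expression n * m below is part of the loop test, so the original program evaluates it at least once even if the loop body never runs; hoisting it is therefore both safe and profitable:

     before:  while (i < n * m) {
                  ...
              }

     after:   tmp = n * m;
              while (i < tmp) {
                  ...
              }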


TEST YOURSELF #2

What are some examples of loops for which the compiler can be sure that the loop will execute at least once?



Optimization #2: Strength reduction in for-loops

The basic idea here is to take advantage of patterns in for-loops to replace expensive operations, like multiplications, with cheaper ones, like additions.

The particular pattern that we will handle takes the general form of a loop where:

  1. L is the loop index
  2. B is the beginning value of the loop
  3. E is the end value of the loop
  4. The body of the loop contains a right-hand-side expression of the form L * M + C. We call this the induction expression.
  5. The other operands of the induction expression, M and C, must be constant with respect to the loop.
These rules define a sort of "template" of the following form:

    for L from B to E do {
        ...
        ... = L * M + C
        ...
    }
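
For instance, in the following made-up C fragment (a and n are assumed to be declared elsewhere), the right-hand side i * 8 + 40 is an induction expression with L = i, M = 8, and C = 40:

    for (int i = 0; i < n; i++) {
        a[i] = i * 8 + 40;    /* L * M + C, with M = 8 and C = 40 */
    }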

Consider the sequences of values for L and for the induction expression:

    Iteration #    L            L * M + C
    1              B            B * M + C
    2              B + 1        (B + 1) * M + C      =  B * M + M + C
    3              B + 1 + 1    (B + 1 + 1) * M + C  =  B * M + M + M + C
Note that, on each iteration, the value of the induction expression is the value it had on the previous iteration plus an extra + M. In other words, each time around the loop the induction expression increases by adding M, a constant value! So we can avoid doing the multiplication each time around the loop by:

  1. initializing a new temporary to B * M + C (the value of the induction expression on the first iteration) just before the loop;
  2. replacing the induction expression in the loop body with the temporary; and
  3. incrementing the temporary by M at the end of each iteration.

Here is the transformed loop:

    temp = B * M + C
    for L from B to E do {
        ...
        ... = temp
        ...
        temp = temp + M
    }
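
Applied to the made-up C fragment above, the transformation replaces the multiplication with an addition:

    int tmp = 0 * 8 + 40;        /* value of i * 8 + 40 when i == 0 */
    for (int i = 0; i < n; i++) {
        a[i] = tmp;              /* was: a[i] = i * 8 + 40; */
        tmp = tmp + 8;           /* add M = 8 instead of multiplying */
    }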