Overview

The goal of optimization is to produce better code (fewer instructions,
and, more importantly, code that runs faster).
However, it is important not to change the behavior of the program (what
it computes)!
We will look at the following ways to improve a program:
  - peephole optimization
  - moving loop-invariant computations out of loops
  - reduction in strength
  - copy propagation
Peephole Optimization

The idea behind peephole optimization is to examine the code "through a
small window," looking for special cases that can be improved.
Below are some common optimizations that can be performed this way.
Note that in all cases that involve removing an instruction,
it is assumed that that instruction is not the target of a branch.
  - Remove a redundant load (a load of a value just stored):

        store Rx, M                     store Rx, M
        load M, Rx              =>

  - Remove a redundant push/pop pair:

        push Rx
        pop into Rx             =>      (nothing)

  - Replace a jump to a jump:

        goto L1                         goto L2
           ...                  =>         ...
        L1: goto L2                     L1: goto L2

  - Remove a jump to the next instruction:

        goto L1                 =>      L1: ...
        L1: ...

  - Replace a jump around a jump:

        if T0 = 0 goto L1               if T0 != 0 goto L2
        else goto L2            =>      L1: ...
        L1: ...

  - Remove useless operations:

        add T0, T0, 0           // changes nothing; remove it
        mul T0, T0, 1           // changes nothing; remove it

  - Reduce strength (replace an operation with a cheaper one):

        mul T0, T0, 2           =>      shift-left T0
        add T0, T0, 1           =>      inc T0
Note that doing one optimization may enable another; for example:

    load Tx, M                  load Tx, M
    add Tx, 0           =>      store Tx, M     =>      load Tx, M
    store Tx, M
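To make the "small window" idea concrete, here is a minimal sketch of such a
pass (our own illustration, not taken from any particular compiler). It assumes
instructions are represented as plain strings, and that no removed instruction
is the target of a branch:

    import java.util.ArrayList;
    import java.util.List;

    // A sketch of a two-instruction peephole pass over a list of
    // instructions represented as strings (a simplifying assumption; a
    // real optimizer would use a structured instruction representation
    // and would check branch targets before removing anything).
    public class Peephole {
        static List<String> optimize(List<String> code) {
            boolean changed = true;
            while (changed) {               // one rewrite may enable another
                changed = false;
                List<String> out = new ArrayList<>();
                for (int i = 0; i < code.size(); i++) {
                    String a = code.get(i);
                    String b = (i + 1 < code.size()) ? code.get(i + 1) : null;
                    // push Rx followed by pop into Rx  ==>  nothing
                    if (b != null && a.startsWith("push ")
                            && b.equals("pop into " + a.substring(5))) {
                        i++;                // skip both instructions
                        changed = true;
                        continue;
                    }
                    // store Rx, M followed by load M, Rx  ==>  store Rx, M
                    if (b != null && a.startsWith("store ")) {
                        String[] ops = a.substring(6).split(", ");
                        if (ops.length == 2
                                && b.equals("load " + ops[1] + ", " + ops[0])) {
                            out.add(a);
                            i++;            // drop the redundant load
                            changed = true;
                            continue;
                        }
                    }
                    out.add(a);
                }
                code = out;
            }
            return code;
        }

        public static void main(String[] args) {
            List<String> code = List.of("store R1, M", "load M, R1",
                                        "push R2", "pop into R2");
            System.out.println(optimize(new ArrayList<>(code)));
            // prints: [store R1, M]
        }
    }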
Consider the following program:
public class Opt {
    public static void main() {
        int a;
        int b;
        if (true) {
            if (true) {
                b = 0;
            }
            else {
                b = 1;
            }
            return;
        }
        a = 1;
        b = a;
    }
}
Question 1:
The code generated for this program contains opportunities for two of the
peephole optimizations listed above (removing a redundant load, and replacing
a jump to a jump).
Can you explain how those opportunities arise just by looking at the
source code?
Question 2: Below is the generated code. Verify your answer to Question 1 by
finding the opportunities for the two kinds of optimization. What other
opportunity for removing redundant code is common in this example?
        .text
        .globl main
main:                           # FUNCTION ENTRY
        sw    $ra, 0($sp)       # PUSH
        subu  $sp, $sp, 4
        sw    $fp, 0($sp)       # PUSH
        subu  $sp, $sp, 4
        addu  $fp, $sp, 8
        subu  $sp, $sp, 8
                                # STATEMENTS
                                # if-then
        li    $t0, 1
        sw    $t0, 0($sp)       # PUSH
        subu  $sp, $sp, 4
        lw    $t0, 4($sp)       # POP
        addu  $sp, $sp, 4
        beq   $t0, 0, _L0
                                # if-then-else
        li    $t0, 1
        sw    $t0, 0($sp)       # PUSH
        subu  $sp, $sp, 4
        lw    $t0, 4($sp)       # POP
        addu  $sp, $sp, 4
        beq   $t0, 0, _L1
        li    $t0, 0
        sw    $t0, 0($sp)       # PUSH
        subu  $sp, $sp, 4
        lw    $t0, 4($sp)       # POP
        addu  $sp, $sp, 4
        sw    $t0, -12($fp)
        b     _L2
_L1:
        li    $t0, 1
        sw    $t0, 0($sp)       # PUSH
        subu  $sp, $sp, 4
        lw    $t0, 4($sp)       # POP
        addu  $sp, $sp, 4
        sw    $t0, -12($fp)
_L2:
                                # return
        b     main_Exit
_L0:
        li    $t0, 1
        sw    $t0, 0($sp)       # PUSH
        subu  $sp, $sp, 4
        lw    $t0, 4($sp)       # POP
        addu  $sp, $sp, 4
        sw    $t0, -8($fp)
        lw    $t0, -8($fp)
        sw    $t0, 0($sp)       # PUSH
        subu  $sp, $sp, 4
        lw    $t0, 4($sp)       # POP
        addu  $sp, $sp, 4
        sw    $t0, -12($fp)
                                # FUNCTION EXIT
main_Exit:
        lw    $ra, 0($fp)
        move  $sp, $fp          # restore SP
        lw    $fp, -4($fp)      # restore FP
        jr    $ra               # return
Moving Loop-Invariant Computations Out of the Loop

The ideas behind this optimization are:
  - Code that is inside a loop may be executed many times.
  - An expression that produces the same value on every iteration of the loop
    (a loop-invariant expression) need only be computed once, before the loop.

Example:
    for (i=0; i<100; i++) {
        for (j=0; j<100; j++) {
            for (k=0; k<100; k++) {
                A[i][j][k] = i*j*k;
            }
        }
    }
In this example, i*j
is invariant with respect to the inner loop.
But there are more loop-invariant expressions; to find them,
we need to look at a lower-level version of this code.
If we assume the following:
  - A is a 100x100x100 array of 4-byte integers,
  - A is stored in the current activation record, at offset <offset of A>
    from the FP,
then the code for A[i][j][k] = ... involves computing the address
of A[i][j][k] (i.e., where to store the value of the right-hand-side
expression).
That computation looks something like:

    address = FP + <offset of A> - (i * 10,000 * 4) - (j * 100 * 4) - (k * 4)

So the code for the inner loop is actually something like:

    T0 = i*j*k
    T1 = FP + <offset of A> - i*40000 - j*400 - k*4
    store T0, 0(T1)

And we have the following loop-invariant expressions:

    invariant to the i loop: FP + <offset of A>
    invariant to the j loop: i*40000
    invariant to the k loop: i*j, j*400
We can move the computations of the loop-invariant expressions out of their
loops, assigning the values of those expressions to new temporaries, and then
using the temporaries in place of the expressions. When we do that for the
example above, we get:

    tmp0 = FP + <offset of A>
    for (i=0; i<100; i++) {
        tmp1 = tmp0 - i*40000
        for (j=0; j<100; j++) {
            tmp2 = tmp1 - j*400
            tmp3 = i*j
            for (k=0; k<100; k++) {
                T0 = tmp3 * k       // T0 is i*j*k
                T1 = tmp2 - k*4     // T1 is the address of A[i][j][k]
                store T0, 0(T1)     // store i*j*k into A[i][j][k]
            }
        }
    }
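As a sanity check, here is a small runnable Java sketch of the same rewrite,
applied by hand (our own illustration; it uses a flattened array, so the flat
index i*10000 + j*100 + k plays the role of the address arithmetic above):

    import java.util.Arrays;

    // Compares the original triple loop with a hand-hoisted version.
    public class HoistDemo {
        public static void main(String[] args) {
            int[] a1 = new int[100 * 100 * 100];
            int[] a2 = new int[100 * 100 * 100];

            // original: the full index expression is recomputed in the
            // innermost loop
            for (int i = 0; i < 100; i++)
                for (int j = 0; j < 100; j++)
                    for (int k = 0; k < 100; k++)
                        a1[i * 10000 + j * 100 + k] = i * j * k;

            // hoisted: loop-invariant pieces are computed once per
            // enclosing loop
            for (int i = 0; i < 100; i++) {
                int tmp1 = i * 10000;           // invariant in the j loop
                for (int j = 0; j < 100; j++) {
                    int tmp2 = tmp1 + j * 100;  // invariant in the k loop
                    int tmp3 = i * j;           // invariant in the k loop
                    for (int k = 0; k < 100; k++)
                        a2[tmp2 + k] = tmp3 * k;
                }
            }

            System.out.println(Arrays.equals(a1, a2));  // prints: true
        }
    }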
Here is a comparison of the original code and the optimized code (the number
of instructions performed in the innermost loop, which is executed
1,000,000 times):

    Original Code                           New Code
    -------------                           --------
    5 multiplications (3 for the lvalue,    2 multiplications (1 for the lvalue,
      2 for the rvalue)                       1 for the rvalue)
    3 subtractions (for the lvalue)         1 subtraction (for the lvalue)
    1 indexed store                         1 indexed store
Questions:
  Safety: is it always safe to move a loop-invariant computation out of its
  loop?
  Profitability: is it always profitable to do so?

Answers:
Safety: If evaluating the expression might cause an error, then there is a
possible problem if the expression might not be executed in the original,
unoptimized code. For example:

    b = a;
    while (a != 0) {
        x = 1/b;    // possible divide-by-zero if 1/b is moved out of the loop
        a--;
    }
What about preserving the order of events? For example, if the unoptimized
code performed some output and then had a runtime error, is it valid for the
optimized code to simply have the runtime error (producing no output)? Also
note that changing the order of floating-point computations may change the
result, due to differing precisions.
Profitability: If the computation might not execute in the original program,
moving the computation might actually slow the program down! For example:

    while (x < i + j * 100) ...   // j*100 will always be evaluated, since the
                                  // loop test runs at least once

Moving a computation is both safe and profitable if one of the following holds:
  - The computation is not nested inside any condition in the loop body, and
    the loop is guaranteed to execute at least once; or
  - the computation is guaranteed to be evaluated at least once on every
    execution of the loop (for example, it is part of the loop test, as in
    the example above).

Question: What are some examples of loops for which the compiler can be sure
that the loop will execute at least once?
Reduction in Strength

Given a loop of the form "for i from low to high do ...", an induction
expression is an expression used inside the loop that has the form
"i * k1 + k2", where "i" is the loop index, and "k1" and "k2" are constant
with respect to the loop.
Consider the sequences of values for variable i and for the induction
expression:

    iteration | value of i    | value of i * k1 + k2
    ----------+---------------+---------------------------------------------------
        1     | low           | low * k1 + k2
        2     | low + 1       | (low + 1) * k1 + k2 = low * k1 + k1 + k2
        3     | low + 1 + 1   | (low + 1 + 1) * k1 + k2 = low * k1 + k1 + k1 + k2

Note that the value of the induction expression increases by exactly k1 on
each iteration, so it can be computed incrementally:
    temp = low * k1 + k2    // initialize temp to the first value of the induction exp
    for i from low to high do
        ... temp ...        // use temp in place of i * k1 + k2
        temp = temp + k1    // increment temp at the end of the loop by adding k1
    end
Note that instead of doing a multiplication and an addition each time
around the loop, we now do just one addition each time.
Although in this example we've removed a multiplication,
in general we are replacing a multiplication with an addition (that is
why this optimization is called reduction in strength).
In particular, if there were no k2, the original induction
expression would be: i * k1, and that would be replaced
inside the loop by: temp = temp + k1;
an addition replaces a multiplication.
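Here is a small runnable Java sketch of the same rewrite, applied by hand (the
array a and the constants k1 and k2 are our own, chosen for illustration):

    import java.util.Arrays;

    // Replaces the induction expression i*k1 + k2 with a temp that is
    // incremented by k1 each time around the loop.
    public class StrengthDemo {
        public static void main(String[] args) {
            final int k1 = 40, k2 = 8;
            int[] a = new int[100];
            int[] b = new int[100];

            // original: one multiplication and one addition per iteration
            for (int i = 0; i < 100; i++)
                a[i] = i * k1 + k2;

            // after reduction in strength: one addition per iteration
            int temp = 0 * k1 + k2;     // first value of the induction expression (low = 0)
            for (int i = 0; i < 100; i++) {
                b[i] = temp;            // use temp in place of i * k1 + k2
                temp = temp + k1;       // next value differs by exactly k1
            }

            System.out.println(Arrays.equals(a, b));  // prints: true
        }
    }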
Question: Some languages actually have for-loops with the syntax used above
("for i from low to high do ..."), but other languages (including Java) do
not use that syntax. Must a Java compiler give up on performing this
optimization, or might it be able to recognize opportunities in some cases?
Now let's see how to apply this optimization to the example code we used to
illustrate moving loop-invariant computations out of the loop. Below is the
code we had after moving the loop-invariant computations; the induction
expressions are identified in the comments:
    tmp0 = FP + <offset of A>
    for (i=0; i<100; i++) {
        tmp1 = tmp0 - i*40000           // i * -40000 + tmp0
        for (j=0; j<100; j++) {
            tmp2 = tmp1 - j*400         // j * -400 + tmp1
            tmp3 = i*j                  // j * i + 0
            for (k=0; k<100; k++) {
                T0 = tmp3 * k           // k * tmp3 + 0
                T1 = tmp2 - k*4         // k * -4 + tmp2
                store T0, 0(T1)
            }
        }
    }
After performing the reduction in strength optimizations:
    tmp0 = FP + <offset of A>
    temp1 = tmp0                        // temp1 = 0*-40000 + tmp0
    for (i=0; i<100; i++) {
        tmp1 = temp1
        temp2 = tmp1                    // temp2 = 0*-400 + tmp1
        temp3 = 0                       // temp3 = 0*i + 0
        for (j=0; j<100; j++) {
            tmp2 = temp2
            tmp3 = temp3
            temp4 = 0                   // temp4 = 0*tmp3 + 0
            temp5 = tmp2                // temp5 = 0*-4 + tmp2
            for (k=0; k<100; k++) {
                T0 = temp4
                T1 = temp5
                store T0, 0(T1)
                temp4 = temp4 + tmp3
                temp5 = temp5 - 4
            }
            temp2 = temp2 - 400
            temp3 = temp3 + i
        }
        temp1 = temp1 - 40000
    }
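To double-check the transformation, here is a runnable Java sketch (our own;
it uses a flat array index, which steps by +1 where the byte-address version
above steps by -4) comparing the nest after loop-invariant motion with the
strength-reduced nest:

    import java.util.Arrays;

    // Compares the hoisted nest against the strength-reduced nest.
    public class NestDemo {
        public static void main(String[] args) {
            int[] a1 = new int[1000000], a2 = new int[1000000];

            // after loop-invariant code motion (previous section)
            for (int i = 0; i < 100; i++) {
                int tmp1 = i * 10000;
                for (int j = 0; j < 100; j++) {
                    int tmp2 = tmp1 + j * 100;
                    int tmp3 = i * j;
                    for (int k = 0; k < 100; k++)
                        a1[tmp2 + k] = tmp3 * k;
                }
            }

            // after reduction in strength: no multiplications at all
            int temp1 = 0;
            for (int i = 0; i < 100; i++) {
                int temp2 = temp1, temp3 = 0;
                for (int j = 0; j < 100; j++) {
                    int temp4 = 0, temp5 = temp2;
                    for (int k = 0; k < 100; k++) {
                        a2[temp5] = temp4;
                        temp4 += temp3;   // next value of k * (i*j)
                        temp5 += 1;       // flat index steps by 1
                    }
                    temp2 += 100;         // next value of i*10000 + j*100
                    temp3 += i;           // next value of i*j
                }
                temp1 += 10000;           // next value of i*10000
            }

            System.out.println(Arrays.equals(a1, a2));  // prints: true
        }
    }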
In the original code, the innermost loop (executed 1,000,000 times) had
two multiplications and one subtraction.
In the optimized code, the inner loop has no multiplications,
one subtraction, and one addition.
(Similarly, the middle loop went from two multiplications and one
subtraction to no multiplications, one subtraction, and one addition;
the outer loop went from one multiplication and one subtraction to
no multiplications and one subtraction.)
On the other hand, we have added a number of assignments;
for example, the inner loop had just two assignments, and now it
has four.
To see how to deal with that, read the next section, about
copy propagation!
Question: Suppose that the index variable is incremented by something other
than one each time around the loop. For example, consider a loop of the form:

    for (i=low; i<=high; i+=2) ...

Can strength reduction still be performed? If yes, what changes must be made
to the proposed algorithm?
Copy Propagation

A copy statement is a statement of the form x = y. Given a copy statement
d: x = y and a use u of x, the use can be replaced with y provided that two
conditions hold:
  1. u is reached only by d (no other definition of x can reach u), and
  2. the value of y cannot change between d and u.

Examples:

    x = y                       x = y                       x = y
    a = x + z                   if (...) x = 2              if (...) y = 3
                                a = x + z                   a = x + z

    x can be replaced           x cannot be replaced        x cannot be replaced
    with y                      with y; violates            with y; violates
                                condition 1                 condition 2
Question: Why is this a useful transformation?

Answers: One reason is that it can enable better code for a machine like the
MIPS, which allows an operand to be an "immediate" (literal) value but not a
memory location. For example:

    x = 5;          ==>     x = 5;
    a = b + x;              a = b + 5;

Better code can be generated for the transformed version (which uses
a = b + 5) than for the original version (which uses a = b + x), because we
can avoid loading the value of x into a register. Furthermore, this kind of
copy propagation can lead to opportunities for constant folding: evaluating,
at compile time, an expression that involves only literals. For example:

    x = 5;          ==>     x = 5;          ==>     x = 5;
    a = 3 + x;              a = 3 + 5;              a = 8;
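As an illustration, here is a tiny constant-folding sketch over a made-up
expression representation (Lit, Var, and Add are our own names, not part of
any compiler described in these notes):

    // Folds additions whose operands are both literals, e.g. 3 + 5 ==> 8.
    public class Fold {
        interface Exp {}
        static class Lit implements Exp {
            final int val;
            Lit(int val) { this.val = val; }
            public String toString() { return "" + val; }
        }
        static class Var implements Exp {
            final String name;
            Var(String name) { this.name = name; }
            public String toString() { return name; }
        }
        static class Add implements Exp {
            final Exp left, right;
            Add(Exp left, Exp right) { this.left = left; this.right = right; }
            public String toString() { return left + " + " + right; }
        }

        static Exp fold(Exp e) {
            if (e instanceof Add) {
                Add a = (Add) e;
                Exp l = fold(a.left), r = fold(a.right);
                if (l instanceof Lit && r instanceof Lit)  // both operands known
                    return new Lit(((Lit) l).val + ((Lit) r).val);
                return new Add(l, r);
            }
            return e;               // Lit or Var: nothing to fold
        }

        public static void main(String[] args) {
            // after copy propagation, a = 3 + x has become a = 3 + 5
            System.out.println(fold(new Add(new Lit(3), new Lit(5))));  // prints: 8
        }
    }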
A second reason is that copy propagation can create new opportunities for
moving loop-invariant computations out of loops. For example:

    while (...) {
        x = a * b;      // loop-invariant
        y = x * c;
        ...
    }
Move "a * b" out of the loop:
tmp1 = a * b;
while (...) {
x = tmp1;
y = x * c;
...
}
Note that at this point, even if c is not modified in
the loop, we cannot move "x * c" out
of the loop, because x gets its value inside the loop.
However, after we do copy propagation:
    tmp1 = a * b;
    while (...) {
        x = tmp1;
        y = tmp1 * c;
        ...
    }
"tmp1 * c" can also be moved out of the loop:
tmp1 = a * b;
tmp2 = tmp1 * c;
while (...) {
x = tmp1;
y = tmp2;
...
}
Given a definition d that is a copy statement, x = y, and a use u of x, we must determine whether the two properties given above hold, permitting the use of x to be replaced with y.
The first property (use u is reached only by definition d) is best solved using the standard "reaching-definitions" dataflow-analysis problem, which computes, for each definition of a variable x, all of the uses of x that might be reached by that definition. Note that this property can also be determined by doing a backward depth-first or breadth-first search in the control-flow graph, starting at use u, and terminating a branch of the search when a definition of x is reached. If definition d is the only definition encountered in the search, then it is the only one that reaches use u. (This technique will, in general, be less efficient than doing reaching-definitions analysis.)
The second property (that variable y cannot change its value between definition d and use u), can also be verified using dataflow analysis, or using a backwards search in the control-flow graph starting at u, and quitting at d. If no definition of y is encountered during the search, then its value cannot change, and the copy propagation can be performed. Note that when y is a literal, property 2 is always satisfied.
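Here is a sketch of that backward search for property (2), over a made-up
control-flow-graph representation (Node, preds, and defines are our own names;
real dataflow frameworks differ):

    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.Deque;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Checks property (2): starting at the use u, walk backward over
    // predecessor edges; a branch of the search stops when it reaches the
    // copy statement d. If any node visited in between defines y, the
    // value of y may change between d and u, so propagation is unsafe.
    public class CopyPropCheck {
        static class Node {
            final List<Node> preds = new ArrayList<>();   // CFG predecessors
            final Set<String> defines = new HashSet<>();  // variables assigned here
        }

        static boolean yUnchangedBetween(Node d, Node u, String y) {
            Deque<Node> work = new ArrayDeque<>();
            Set<Node> visited = new HashSet<>();
            work.addAll(u.preds);               // start just before the use
            while (!work.isEmpty()) {
                Node n = work.pop();
                if (n == d || !visited.add(n)) continue;  // quit at d; skip repeats
                if (n.defines.contains(y)) return false;  // y may be redefined
                work.addAll(n.preds);
            }
            return true;
        }

        public static void main(String[] args) {
            // d: x = y   -->   mid: (maybe) y = 3   -->   u: a = x + z
            Node d = new Node(), mid = new Node(), u = new Node();
            mid.preds.add(d);
            u.preds.add(mid);
            mid.defines.add("y");
            System.out.println(yUnchangedBetween(d, u, "y"));  // prints: false
        }
    }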
Below is our running example (after doing reduction in strength). In this particular example, each variable x that is defined in a copy statement reaches only one use. Comments indicate which of the copy statements cannot be propagated (because of a violation of property (1); in this example there are no instances where property (2) is violated).
    tmp0 = FP + <offset of A>
    temp1 = tmp0                        // cannot be propagated
    for (i=0; i<100; i++) {
        tmp1 = temp1
        temp2 = tmp1                    // cannot be propagated
        temp3 = 0                       // cannot be propagated
        for (j=0; j<100; j++) {
            tmp2 = temp2
            tmp3 = temp3
            temp4 = 0                   // cannot be propagated
            temp5 = tmp2                // cannot be propagated
            for (k=0; k<100; k++) {
                T0 = temp4
                T1 = temp5
                store T0, 0(T1)
                temp4 = temp4 + tmp3
                temp5 = temp5 - 4
            }
            temp2 = temp2 - 400
            temp3 = temp3 + i
        }
        temp1 = temp1 - 40000
    }
And here's the code after propagating the copies that are legal, and removing
the copy statements that become dead. Note that we are able to remove 5
copy statements, including 2 from the innermost loop.
    tmp0 = FP + <offset of A>
    temp1 = tmp0
    for (i=0; i<100; i++) {
        temp2 = temp1
        temp3 = 0
        for (j=0; j<100; j++) {
            temp4 = 0
            temp5 = temp2
            for (k=0; k<100; k++) {
                store temp4, 0(temp5)
                temp4 = temp4 + temp3
                temp5 = temp5 - 4
            }
            temp2 = temp2 - 400
            temp3 = temp3 + i
        }
        temp1 = temp1 - 40000
    }
Comparing this code with the original code, we see that in the inner loop
(which is executed 1,000,000 times) we originally had 5 multiplications,
3 subtractions, and 1 indexed store; we now have no multiplications, just
2 additions/subtractions, and the same 1 indexed store.
We have added 2 additions/subtractions and 2 copy statements to the
middle loop (which executes 10,000 times) and 1 addition/subtraction
and 1 copy statement to the outer loop (which executes 100 times), but
overall this should be a win!