Overview

The goal of optimization is to produce better code (fewer instructions,
and, more importantly, code that runs faster).
However, it is important not to change the behavior of the program (what
it computes)!
We will look at the following ways to improve a program:
  - peephole optimization
  - moving loop-invariant computations out of loops
  - reduction in strength
  - copy propagation
Peephole Optimization

The idea behind peephole optimization is to examine the code "through a
small window," looking for special cases that can be improved.
Below are some common optimizations that can be performed this way.
Note that in all cases that involve removing an instruction,
it is assumed that that instruction is not the target of a branch.
  - Remove a redundant load (a load of a value just stored):

        store Rx, M                     store Rx, M
        load M, Rx              =>

  - Remove a redundant push/pop pair:

        push Rx
        pop into Rx             =>      (nothing)

  - Replace a jump to a jump:

        goto L1                         goto L2
           ...                  =>         ...
        L1: goto L2                     L1: goto L2

  - Remove a jump to the next instruction:

        goto L1                 =>      L1: ...
        L1: ...

  - Replace a jump around a jump:

        if T0 = 0 goto L1               if T0 != 0 goto L2
        else goto L2            =>      L1: ...
        L1: ...

  - Remove useless operations:

        add T0, T0, 0           // changes nothing; remove it
        mul T0, T0, 1           // changes nothing; remove it

  - Reduce strength (replace an operation with a cheaper one):

        mul T0, T0, 2           =>      shift-left T0
        add T0, T0, 1           =>      inc T0
Note that doing one optimization may enable another; for example:

    load Tx, M                  load Tx, M
    add Tx, 0           =>      store Tx, M     =>      load Tx, M
    store Tx, M
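To make the "small window" idea concrete, here is a minimal sketch of such a
pass (our own illustration, not taken from any particular compiler). It assumes
instructions are represented as plain strings, and that no removed instruction
is the target of a branch:

    import java.util.ArrayList;
    import java.util.List;

    // A sketch of a two-instruction peephole pass over a list of
    // instructions represented as strings (a simplifying assumption; a
    // real optimizer would use a structured instruction representation
    // and would check branch targets before removing anything).
    public class Peephole {
        static List<String> optimize(List<String> code) {
            boolean changed = true;
            while (changed) {               // one rewrite may enable another
                changed = false;
                List<String> out = new ArrayList<>();
                for (int i = 0; i < code.size(); i++) {
                    String a = code.get(i);
                    String b = (i + 1 < code.size()) ? code.get(i + 1) : null;
                    // push Rx followed by pop into Rx  ==>  nothing
                    if (b != null && a.startsWith("push ")
                            && b.equals("pop into " + a.substring(5))) {
                        i++;                // skip both instructions
                        changed = true;
                        continue;
                    }
                    // store Rx, M followed by load M, Rx  ==>  store Rx, M
                    if (b != null && a.startsWith("store ")) {
                        String[] ops = a.substring(6).split(", ");
                        if (ops.length == 2
                                && b.equals("load " + ops[1] + ", " + ops[0])) {
                            out.add(a);
                            i++;            // drop the redundant load
                            changed = true;
                            continue;
                        }
                    }
                    out.add(a);
                }
                code = out;
            }
            return code;
        }

        public static void main(String[] args) {
            List<String> code = List.of("store R1, M", "load M, R1",
                                        "push R2", "pop into R2");
            System.out.println(optimize(new ArrayList<>(code)));
            // prints: [store R1, M]
        }
    }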
Consider the following program:
public class Opt {
    public static void main() {
        int a;
        int b;
        if (true) {
            if (true) {
                b = 0;
            }
            else {
                b = 1;
            }
            return;
        }
        a = 1;
        b = a;
    }
}
Question 1:
The code generated for this program contains opportunities for two of the
peephole optimizations listed above (removing a redundant load, and replacing
a jump to a jump).
Can you explain how those opportunities arise just by looking at the
source code?
Question 2: Below is the generated code. Verify your answer to Question 1 by
finding the opportunities for the two kinds of optimization. What other
opportunity for removing redundant code is common in this example?
        .text
        .globl main
main:                           # FUNCTION ENTRY
        sw    $ra, 0($sp)       # PUSH
        subu  $sp, $sp, 4
        sw    $fp, 0($sp)       # PUSH
        subu  $sp, $sp, 4
        addu  $fp, $sp, 8
        subu  $sp, $sp, 8
                                # STATEMENTS
                                # if-then
        li    $t0, 1
        sw    $t0, 0($sp)       # PUSH
        subu  $sp, $sp, 4
        lw    $t0, 4($sp)       # POP
        addu  $sp, $sp, 4
        beq   $t0, 0, _L0
                                # if-then-else
        li    $t0, 1
        sw    $t0, 0($sp)       # PUSH
        subu  $sp, $sp, 4
        lw    $t0, 4($sp)       # POP
        addu  $sp, $sp, 4
        beq   $t0, 0, _L1
        li    $t0, 0
        sw    $t0, 0($sp)       # PUSH
        subu  $sp, $sp, 4
        lw    $t0, 4($sp)       # POP
        addu  $sp, $sp, 4
        sw    $t0, -12($fp)
        b     _L2
_L1:
        li    $t0, 1
        sw    $t0, 0($sp)       # PUSH
        subu  $sp, $sp, 4
        lw    $t0, 4($sp)       # POP
        addu  $sp, $sp, 4
        sw    $t0, -12($fp)
_L2:
                                # return
        b     main_Exit
_L0:
        li    $t0, 1
        sw    $t0, 0($sp)       # PUSH
        subu  $sp, $sp, 4
        lw    $t0, 4($sp)       # POP
        addu  $sp, $sp, 4
        sw    $t0, -8($fp)
        lw    $t0, -8($fp)
        sw    $t0, 0($sp)       # PUSH
        subu  $sp, $sp, 4
        lw    $t0, 4($sp)       # POP
        addu  $sp, $sp, 4
        sw    $t0, -12($fp)
                                # FUNCTION EXIT
main_Exit:
        lw    $ra, 0($fp)
        move  $sp, $fp          # restore SP
        lw    $fp, -4($fp)      # restore FP
        jr    $ra               # return
Moving Loop-Invariant Computations Out of the Loop

The ideas behind this optimization are:
  - Code that is inside a loop may be executed many times.
  - An expression that produces the same value on every iteration of the loop
    (a loop-invariant expression) need only be computed once, before the loop.

Example:
    for (i=0; i<100; i++) {
        for (j=0; j<100; j++) {
            for (k=0; k<100; k++) {
                A[i][j][k] = i*j*k;
            }
        }
    }
In this example, i*j
is invariant with respect to the inner loop.
But there are more loop-invariant expressions; to find them,
we need to look at a lower-level version of this code.
If we assume the following:
  - A is a 100x100x100 array of 4-byte integers,
  - A is stored in the current activation record, at offset <offset of A>
    from the FP,
then the code for A[i][j][k] = ... involves computing the address
of A[i][j][k] (i.e., where to store the value of the right-hand-side
expression).
That computation looks something like:

    address = FP + <offset of A> - (i * 10,000 * 4) - (j * 100 * 4) - (k * 4)

So the code for the inner loop is actually something like:

    T0 = i*j*k
    T1 = FP + <offset of A> - i*40000 - j*400 - k*4
    store T0, 0(T1)

And we have the following loop-invariant expressions:

    invariant to the i loop: FP + <offset of A>
    invariant to the j loop: i*40000
    invariant to the k loop: i*j, j*400
We can move the computations of the loop-invariant expressions out of their
loops, assigning the values of those expressions to new temporaries, and then
using the temporaries in place of the expressions. When we do that for the
example above, we get:

    tmp0 = FP + <offset of A>
    for (i=0; i<100; i++) {
        tmp1 = tmp0 - i*40000
        for (j=0; j<100; j++) {
            tmp2 = tmp1 - j*400
            tmp3 = i*j
            for (k=0; k<100; k++) {
                T0 = tmp3 * k       // T0 is i*j*k
                T1 = tmp2 - k*4     // T1 is the address of A[i][j][k]
                store T0, 0(T1)     // store i*j*k into A[i][j][k]
            }
        }
    }
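As a sanity check, here is a small runnable Java sketch of the same rewrite,
applied by hand (our own illustration; it uses a flattened array, so the flat
index i*10000 + j*100 + k plays the role of the address arithmetic above):

    import java.util.Arrays;

    // Compares the original triple loop with a hand-hoisted version.
    public class HoistDemo {
        public static void main(String[] args) {
            int[] a1 = new int[100 * 100 * 100];
            int[] a2 = new int[100 * 100 * 100];

            // original: the full index expression is recomputed in the
            // innermost loop
            for (int i = 0; i < 100; i++)
                for (int j = 0; j < 100; j++)
                    for (int k = 0; k < 100; k++)
                        a1[i * 10000 + j * 100 + k] = i * j * k;

            // hoisted: loop-invariant pieces are computed once per
            // enclosing loop
            for (int i = 0; i < 100; i++) {
                int tmp1 = i * 10000;           // invariant in the j loop
                for (int j = 0; j < 100; j++) {
                    int tmp2 = tmp1 + j * 100;  // invariant in the k loop
                    int tmp3 = i * j;           // invariant in the k loop
                    for (int k = 0; k < 100; k++)
                        a2[tmp2 + k] = tmp3 * k;
                }
            }

            System.out.println(Arrays.equals(a1, a2));  // prints: true
        }
    }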
Here is a comparison of the original code and the optimized code (the number
of instructions performed in the innermost loop, which is executed
1,000,000 times):

    Original Code                           New Code
    -------------                           --------
    5 multiplications (3 for the lvalue,    2 multiplications (1 for the lvalue,
      2 for the rvalue)                       1 for the rvalue)
    3 subtractions (for the lvalue)         1 subtraction (for the lvalue)
    1 indexed store                         1 indexed store
Questions:
  Safety: is it always safe to move a loop-invariant computation out of its
  loop?
  Profitability: is it always profitable to do so?

Answers:
Safety: If evaluating the expression might cause an error, then there is a
possible problem if the expression might not be executed in the original,
unoptimized code. For example:

    b = a;
    while (a != 0) {
        x = 1/b;    // possible divide-by-zero if 1/b is moved out of the loop
        a--;
    }
What about preserving the order of events? For example, if the unoptimized
code performed some output and then had a runtime error, is it valid for the
optimized code to simply have the runtime error (producing no output)? Also
note that changing the order of floating-point computations may change the
result, due to differing precisions.
Profitability: If the computation might not execute in the original program,
moving the computation might actually slow the program down! For example:

    while (x < i + j * 100) ...   // j*100 will always be evaluated, since the
                                  // loop test runs at least once

Moving a computation is both safe and profitable if one of the following holds:
  - The computation is not nested inside any condition in the loop body, and
    the loop is guaranteed to execute at least once; or
  - the computation is guaranteed to be evaluated at least once on every
    execution of the loop (for example, it is part of the loop test, as in
    the example above).

Question: What are some examples of loops for which the compiler can be sure
that the loop will execute at least once?
Reduction in Strength

Given a loop of the form "for i from low to high do ...", an induction
expression is an expression used inside the loop that has the form
"i * k1 + k2", where "i" is the loop index, and "k1" and "k2" are constant
with respect to the loop.
Consider the sequences of values for variable i and for the induction
expression:

    iteration | value of i    | value of i * k1 + k2
    ----------+---------------+---------------------------------------------------
        1     | low           | low * k1 + k2
        2     | low + 1       | (low + 1) * k1 + k2 = low * k1 + k1 + k2
        3     | low + 1 + 1   | (low + 1 + 1) * k1 + k2 = low * k1 + k1 + k1 + k2

Note that the value of the induction expression increases by exactly k1 on
each iteration, so it can be computed incrementally:
    temp = low * k1 + k2    // initialize temp to the first value of the induction exp
    for i from low to high do
        ... temp ...        // use temp in place of i * k1 + k2
        temp = temp + k1    // increment temp at the end of the loop by adding k1
    end
Note that instead of doing a multiplication and an addition each time
around the loop, we now do just one addition each time.
Although in this example we've removed a multiplication,
in general we are replacing a multiplication with an addition (that is
why this optimization is called reduction in strength).
In particular, if there were no k2, the original induction
expression would be: i * k1, and that would be replaced
inside the loop by: temp = temp + k1;
an addition replaces a multiplication.
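Here is a small runnable Java sketch of the same rewrite, applied by hand (the
array a and the constants k1 and k2 are our own, chosen for illustration):

    import java.util.Arrays;

    // Replaces the induction expression i*k1 + k2 with a temp that is
    // incremented by k1 each time around the loop.
    public class StrengthDemo {
        public static void main(String[] args) {
            final int k1 = 40, k2 = 8;
            int[] a = new int[100];
            int[] b = new int[100];

            // original: one multiplication and one addition per iteration
            for (int i = 0; i < 100; i++)
                a[i] = i * k1 + k2;

            // after reduction in strength: one addition per iteration
            int temp = 0 * k1 + k2;     // first value of the induction expression (low = 0)
            for (int i = 0; i < 100; i++) {
                b[i] = temp;            // use temp in place of i * k1 + k2
                temp = temp + k1;       // next value differs by exactly k1
            }

            System.out.println(Arrays.equals(a, b));  // prints: true
        }
    }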
Question: Some languages actually have for-loops with the syntax used above
("for i from low to high do ..."), but other languages (including Java) do
not use that syntax. Must a Java compiler give up on performing this
optimization, or might it be able to recognize opportunities in some cases?
Now let's see how to apply this optimization to the example code we used to
illustrate moving loop-invariant computations out of the loop. Below is the
code we had after moving the loop-invariant computations; the induction
expressions are identified in the comments:
    tmp0 = FP + <offset of A>
    for (i=0; i<100; i++) {
        tmp1 = tmp0 - i*40000           // i * -40000 + tmp0
        for (j=0; j<100; j++) {
            tmp2 = tmp1 - j*400         // j * -400 + tmp1
            tmp3 = i*j                  // j * i + 0
            for (k=0; k<100; k++) {
                T0 = tmp3 * k           // k * tmp3 + 0
                T1 = tmp2 - k*4         // k * -4 + tmp2
                store T0, 0(T1)
            }
        }
    }
After performing the reduction in strength optimizations:
    tmp0 = FP + <offset of A>
    temp1 = tmp0                        // temp1 = 0*-40000 + tmp0
    for (i=0; i<100; i++) {
        tmp1 = temp1
        temp2 = tmp1                    // temp2 = 0*-400 + tmp1
        temp3 = 0                       // temp3 = 0*i + 0
        for (j=0; j<100; j++) {
            tmp2 = temp2
            tmp3 = temp3
            temp4 = 0                   // temp4 = 0*tmp3 + 0
            temp5 = tmp2                // temp5 = 0*-4 + tmp2
            for (k=0; k<100; k++) {
                T0 = temp4
                T1 = temp5
                store T0, 0(T1)
                temp4 = temp4 + tmp3
                temp5 = temp5 - 4
            }
            temp2 = temp2 - 400
            temp3 = temp3 + i
        }
        temp1 = temp1 - 40000
    }
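To double-check the transformation, here is a runnable Java sketch (our own;
it uses a flat array index, which steps by +1 where the byte-address version
above steps by -4) comparing the nest after loop-invariant motion with the
strength-reduced nest:

    import java.util.Arrays;

    // Compares the hoisted nest against the strength-reduced nest.
    public class NestDemo {
        public static void main(String[] args) {
            int[] a1 = new int[1000000], a2 = new int[1000000];

            // after loop-invariant code motion (previous section)
            for (int i = 0; i < 100; i++) {
                int tmp1 = i * 10000;
                for (int j = 0; j < 100; j++) {
                    int tmp2 = tmp1 + j * 100;
                    int tmp3 = i * j;
                    for (int k = 0; k < 100; k++)
                        a1[tmp2 + k] = tmp3 * k;
                }
            }

            // after reduction in strength: no multiplications at all
            int temp1 = 0;
            for (int i = 0; i < 100; i++) {
                int temp2 = temp1, temp3 = 0;
                for (int j = 0; j < 100; j++) {
                    int temp4 = 0, temp5 = temp2;
                    for (int k = 0; k < 100; k++) {
                        a2[temp5] = temp4;
                        temp4 += temp3;   // next value of k * (i*j)
                        temp5 += 1;       // flat index steps by 1
                    }
                    temp2 += 100;         // next value of i*10000 + j*100
                    temp3 += i;           // next value of i*j
                }
                temp1 += 10000;           // next value of i*10000
            }

            System.out.println(Arrays.equals(a1, a2));  // prints: true
        }
    }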
In the original code, the innermost loop (executed 1,000,000 times) had
two multiplications and one subtraction.
In the optimized code, the inner loop has no multiplications,
one subtraction, and one addition.
(Similarly, the middle loop went from two multiplications and one
subtraction to no multiplications, one subtraction, and one addition;
the outer loop went from one multiplication and one subtraction to
no multiplications and one subtraction.)
On the other hand, we have added a number of assignments;
for example, the inner loop had just two assignments, and now it
has four.
To see how to deal with that, read the next section, about
copy propagation!
Question: Suppose that the index variable is incremented by something other
than one each time around the loop. For example, consider a loop of the form:

    for (i=low; i<=high; i+=2) ...

Can strength reduction still be performed? If yes, what changes must be made
to the proposed algorithm?
Copy Propagation

A copy statement is a statement of the form x = y. Given a copy statement
d: x = y and a use u of x, the use can be replaced with y provided that two
conditions hold:
  1. u is reached only by d (no other definition of x can reach u), and
  2. the value of y cannot change between d and u.

Examples:

    x = y                       x = y                       x = y
    a = x + z                   if (...) x = 2              if (...) y = 3
                                a = x + z                   a = x + z

    x can be replaced           x cannot be replaced        x cannot be replaced
    with y                      with y; violates            with y; violates
                                condition 1                 condition 2
Question: Why is this a useful transformation?

Answers: One reason is that it can enable better code for a machine like the
MIPS, which allows an operand to be an "immediate" (literal) value but not a
memory location. For example:

    x = 5;          ==>     x = 5;
    a = b + x;              a = b + 5;

Better code can be generated for the transformed version (which uses
a = b + 5) than for the original version (which uses a = b + x), because we
can avoid loading the value of x into a register. Furthermore, this kind of
copy propagation can lead to opportunities for constant folding: evaluating,
at compile time, an expression that involves only literals. For example:

    x = 5;          ==>     x = 5;          ==>     x = 5;
    a = 3 + x;              a = 3 + 5;              a = 8;
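As an illustration, here is a tiny constant-folding sketch over a made-up
expression representation (Lit, Var, and Add are our own names, not part of
any compiler described in these notes):

    // Folds additions whose operands are both literals, e.g. 3 + 5 ==> 8.
    public class Fold {
        interface Exp {}
        static class Lit implements Exp {
            final int val;
            Lit(int val) { this.val = val; }
            public String toString() { return "" + val; }
        }
        static class Var implements Exp {
            final String name;
            Var(String name) { this.name = name; }
            public String toString() { return name; }
        }
        static class Add implements Exp {
            final Exp left, right;
            Add(Exp left, Exp right) { this.left = left; this.right = right; }
            public String toString() { return left + " + " + right; }
        }

        static Exp fold(Exp e) {
            if (e instanceof Add) {
                Add a = (Add) e;
                Exp l = fold(a.left), r = fold(a.right);
                if (l instanceof Lit && r instanceof Lit)  // both operands known
                    return new Lit(((Lit) l).val + ((Lit) r).val);
                return new Add(l, r);
            }
            return e;               // Lit or Var: nothing to fold
        }

        public static void main(String[] args) {
            // after copy propagation, a = 3 + x has become a = 3 + 5
            System.out.println(fold(new Add(new Lit(3), new Lit(5))));  // prints: 8
        }
    }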
A second reason is that copy propagation can create new opportunities for
moving loop-invariant computations out of loops. For example:

    while (...) {
        x = a * b;      // loop-invariant
        y = x * c;
        ...
    }
Move "a * b" out of the loop:
tmp1 = a * b;
while (...) {
x = tmp1;
y = x * c;
...
}
Note that at this point, even if c is not modified in
the loop, we cannot move "x * c" out
of the loop, because x gets its value inside the loop.
However, after we do copy propagation:
    tmp1 = a * b;
    while (...) {
        x = tmp1;
        y = tmp1 * c;
        ...
    }
"tmp1 * c" can also be moved out of the loop:
tmp1 = a * b;
tmp2 = tmp1 * c;
while (...) {
x = tmp1;
y = tmp2;
...
}
Given a definition d that is a copy statement, x = y, and a use u of x, we must determine whether the two properties given above hold, permitting the use of x to be replaced with y.
The first property (use u is reached only by definition d) is best solved using the standard "reaching-definitions" dataflow-analysis problem, which computes, for each definition of a variable x, all of the uses of x that might be reached by that definition. Note that this property can also be determined by doing a backward depth-first or breadth-first search in the control-flow graph, starting at use u, and terminating a branch of the search when a definition of x is reached. If definition d is the only definition encountered in the search, then it is the only one that reaches use u. (This technique will, in general, be less efficient than doing reaching-definitions analysis.)
The second property (that variable y cannot change its value between definition d and use u), can also be verified using dataflow analysis, or using a backwards search in the control-flow graph starting at u, and quitting at d. If no definition of y is encountered during the search, then its value cannot change, and the copy propagation can be performed. Note that when y is a literal, property 2 is always satisfied.
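Here is a sketch of that backward search for property (2), over a made-up
control-flow-graph representation (Node, preds, and defines are our own names;
real dataflow frameworks differ):

    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.Deque;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Checks property (2): starting at the use u, walk backward over
    // predecessor edges; a branch of the search stops when it reaches the
    // copy statement d. If any node visited in between defines y, the
    // value of y may change between d and u, so propagation is unsafe.
    public class CopyPropCheck {
        static class Node {
            final List<Node> preds = new ArrayList<>();   // CFG predecessors
            final Set<String> defines = new HashSet<>();  // variables assigned here
        }

        static boolean yUnchangedBetween(Node d, Node u, String y) {
            Deque<Node> work = new ArrayDeque<>();
            Set<Node> visited = new HashSet<>();
            work.addAll(u.preds);               // start just before the use
            while (!work.isEmpty()) {
                Node n = work.pop();
                if (n == d || !visited.add(n)) continue;  // quit at d; skip repeats
                if (n.defines.contains(y)) return false;  // y may be redefined
                work.addAll(n.preds);
            }
            return true;
        }

        public static void main(String[] args) {
            // d: x = y   -->   mid: (maybe) y = 3   -->   u: a = x + z
            Node d = new Node(), mid = new Node(), u = new Node();
            mid.preds.add(d);
            u.preds.add(mid);
            mid.defines.add("y");
            System.out.println(yUnchangedBetween(d, u, "y"));  // prints: false
        }
    }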
Below is our running example (after doing reduction in strength). In this particular example, each variable x that is defined in a copy statement reaches only one use. Comments indicate which of the copy statements cannot be propagated (because of a violation of property (1); in this example there are no instances where property (2) is violated).
    tmp0 = FP + <offset of A>
    temp1 = tmp0                        // cannot be propagated
    for (i=0; i<100; i++) {
        tmp1 = temp1
        temp2 = tmp1                    // cannot be propagated
        temp3 = 0                       // cannot be propagated
        for (j=0; j<100; j++) {
            tmp2 = temp2
            tmp3 = temp3
            temp4 = 0                   // cannot be propagated
            temp5 = tmp2                // cannot be propagated
            for (k=0; k<100; k++) {
                T0 = temp4
                T1 = temp5
                store T0, 0(T1)
                temp4 = temp4 + tmp3
                temp5 = temp5 - 4
            }
            temp2 = temp2 - 400
            temp3 = temp3 + i
        }
        temp1 = temp1 - 40000
    }
And here's the code after propagating the copies that are legal, and removing
the copy statements that become dead. Note that we are able to remove 5
copy statements, including 2 from the innermost loop.
    tmp0 = FP + <offset of A>
    temp1 = tmp0
    for (i=0; i<100; i++) {
        temp2 = temp1
        temp3 = 0
        for (j=0; j<100; j++) {
            temp4 = 0
            temp5 = temp2
            for (k=0; k<100; k++) {
                store temp4, 0(temp5)
                temp4 = temp4 + temp3
                temp5 = temp5 - 4
            }
            temp2 = temp2 - 400
            temp3 = temp3 + i
        }
        temp1 = temp1 - 40000
    }
Comparing this code with the original code, we see that in the inner loop
(which is executed 1,000,000 times) we originally had 5 multiplications,
3 subtractions, and 1 indexed store; we now have no multiplications, just
2 additions/subtractions, and the same 1 indexed store.
We have added 2 additions/subtractions and 2 copy statements to the
middle loop (which executes 10,000 times) and 1 addition/subtraction
and 1 copy statement to the outer loop (which executes 100 times), but
overall this should be a win!