The goal of optimization is to produce better code (fewer instructions,
and, more importantly, code that runs faster).
However, it is important not to change the behavior of the program (what
it computes)!
We will look at the following ways to improve a program:
1. Peephole Optimization.
   This is done after code generation.
   It involves finding opportunities to improve the generated code
   by making small, local changes.
2. Moving Loop-Invariant Computations.
   This is done before code generation.
   It involves finding computations inside loops that can be
   moved outside, thus speeding up the execution time of the loop.
3. Strength-Reduction in for Loops.
   This is done before code generation.
   It involves replacing multiplications inside
   loops with additions. If it takes longer to execute a
   multiplication than an addition, then this speeds up the
   code.
4. Copy Propagation.
   This is done before code generation.
   It involves replacing the use of a variable with a literal
   or another variable. Copy propagation can sometimes uncover
   more opportunities for moving loop-invariant computations.
   It may also make it
   possible to remove some assignments from the program, thus
   making the code smaller and faster.
Peephole Optimization
The idea behind peephole optimization is to examine the code "through a
small window," looking for special cases that can be improved.
Below are some common optimizations that can be performed this way.
Note that in all cases that involve removing an instruction,
the removed instruction is assumed not to be the target of a branch.
1. Remove a redundant load (fewer instructions generated, and
   fewer executed).
2. Remove a redundant push/pop (fewer instructions generated, and
   fewer executed).
3. Replace a jump to a jump (same number of instructions generated,
   but fewer executed).
4. Remove a jump to the next instruction (fewer instructions generated,
   and fewer executed).
5. Replace a jump around a jump (fewer instructions generated;
   possibly fewer executed).
6. Remove useless operations (fewer instructions generated and fewer
   executed).
7. Reduction in strength: don't use a slow, general-purpose instruction where
   a fast, special-purpose instruction will do (same number of instructions, but faster).
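As a rough illustration of the "small window" idea, here is a minimal sketch of a peephole pass in Java. The instruction syntax (MIPS-like strings such as `sw`, `lw`, `j`, `addi`) and the class name `Peephole` are assumptions made for this sketch, not the output format of any particular code generator. It handles three of the patterns: a load that immediately follows a store of the same register to the same location, a jump to the next instruction, and a useless add of zero. Labels are separate list entries, so an instruction that is a branch target is never adjacent to the instruction before its label and is therefore never removed by the redundant-load rule.

```java
import java.util.ArrayList;
import java.util.List;

class Peephole {
    // One pass over a hypothetical MIPS-like instruction list.
    static List<String> optimize(List<String> code) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < code.size(); i++) {
            String cur = code.get(i);
            // The window: the last instruction kept so far, and the next input line.
            String prev = out.isEmpty() ? "" : out.get(out.size() - 1);
            String next = i + 1 < code.size() ? code.get(i + 1) : "";

            // Redundant load: "sw R, X" immediately followed by "lw R, X".
            if (cur.startsWith("lw ") && prev.startsWith("sw ")
                    && cur.substring(3).equals(prev.substring(3))) {
                continue;
            }
            // Jump to the next instruction: "j L" immediately followed by "L:".
            if (cur.startsWith("j ") && next.equals(cur.substring(2) + ":")) {
                continue;
            }
            // Useless operation: adding the literal 0 to a register in place.
            if (cur.matches("addi (\\S+), \\1, 0")) {
                continue;
            }
            out.add(cur);
        }
        return out;
    }
}
```

Because `prev` is taken from the instructions already kept, removing one instruction can bring two others together and expose a further match, which is one way a single optimization can enable another.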
Note that doing one optimization may enable another. For example:
public class Opt {
    public static void main() {
        int a;
        int b;
        if (true) {
            if (true) {
                b = 0;
            }
            else {
                b = 1;
            }
            return;
        }
        a = 1;
        b = a;
    }
}
Question 1:
The code generated for this program contains opportunities for two of the
kinds of peephole optimization listed above (removing a redundant load, and
replacing a jump to a jump).
Can you explain how those opportunities arise just by looking at the
source code?
Question 2:
Below is the generated code.
Verify your answer to question 1 by finding the opportunities for the
two kinds of optimization.
What other opportunity for removing redundant code is common in this example?
Moving Loop-Invariant Computations
For greatest gain, optimize "hot spots", i.e., inner loops.
An expression is "loop invariant" if the same value is
computed for that expression on every iteration of the loop.
Instead of computing the same value over and over, compute the
value once outside the loop and reuse it.
Example:
for (i=0; i<100; i++) {
    for (j=0; j<100; j++) {
        for (k=0; k<100; k++) {
            A[i][j][k] = i*j*k;
        }
    }
}
In this example, i*j
is invariant with respect to the inner loop.
But there are more loop-invariant expressions; to find them,
we need to look at a lower-level version of this code.
If we assume the following:
A is a 3D array
each element requires 4 bytes
elements are stored in the current activation record in row-major order
(note: in Java, arrays are allocated from the heap, not stored on the
stack; however, in other languages they may be stored on the stack)
then the code for A[i][j][k] = ... involves computing the address
of A[i][j][k] (i.e., where to store the value of the right-hand-side
expression).
That computation looks something like:
address = FP + <offset of A> - (i*10,000*4)-(j*100*4)-(k*4)
So the code for the inner loop is actually something like:
    T0 = i*j*k
    T1 = FP + <offset of A> - i*40000 - j*400 - k*4
    Store T0, 0(T1)
And we have the following loop-invariant expressions:
invariant to i loop: FP + <offset of A>
invariant to j loop: i*40000
invariant to k loop: i*j, j*400
We can move the computations of the loop-invariant expressions out of
their loops, assigning the values of those expressions to new temporaries,
and then using the temporaries in place of the expressions.
When we do that for the example above, we get:
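A minimal Java sketch of this transformation follows. It is a model under stated assumptions rather than real generated code: memory is modeled as an `int[]` of 4-byte words, `BASE` is a hypothetical stand-in for `FP + <offset of A>`, and the temporaries are named `tmp0` to `tmp3` to match the names used in the strength-reduction table later in these notes. The `original()` method mirrors the unoptimized inner loop so the two versions can be compared.

```java
class LoopInvariantDemo {
    static final int N = 100;
    // Hypothetical value of FP + <offset of A>, chosen so all addresses are >= 0.
    static final int BASE = 4 * (N * N * N - 1);

    // Unoptimized: every subexpression is recomputed in the innermost loop.
    static int[] original() {
        int[] mem = new int[N * N * N];           // memory modeled as 4-byte words
        for (int i = 0; i < N; i++) {
            for (int j = 0; j < N; j++) {
                for (int k = 0; k < N; k++) {
                    int t0 = i * j * k;
                    int t1 = BASE - i * 40000 - j * 400 - k * 4;
                    mem[t1 / 4] = t0;             // models: Store T0, 0(T1)
                }
            }
        }
        return mem;
    }

    // Loop-invariant computations moved out of their loops.
    static int[] hoisted() {
        int[] mem = new int[N * N * N];
        int tmp0 = BASE;                          // invariant to the i loop
        for (int i = 0; i < N; i++) {
            int tmp1 = tmp0 - i * 40000;          // invariant to the j loop
            for (int j = 0; j < N; j++) {
                int tmp2 = tmp1 - j * 400;        // invariant to the k loop
                int tmp3 = i * j;                 // invariant to the k loop
                for (int k = 0; k < N; k++) {
                    int t0 = tmp3 * k;
                    int t1 = tmp2 - k * 4;
                    mem[t1 / 4] = t0;             // models: Store T0, 0(T1)
                }
            }
        }
        return mem;
    }
}
```

The division by 4 in the array index is only an artifact of modeling byte addresses with a Java array; it does not correspond to an instruction in the generated code.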
Here is a comparison of the original code and the optimized code (the number
of instructions performed in the innermost loop, which is executed
1,000,000 times):

Original Code                                    New Code
----------------------------------------------   ----------------------------------------------
5 multiplications (3 for lvalue, 2 for rvalue)   2 multiplications (1 for lvalue, 1 for rvalue)
3 subtractions (for lvalue)                      1 subtraction (for lvalue)
1 indexed store                                  1 indexed store
Questions:
1. How do we recognize loop-invariant expressions?
2. When and where do we move the computations of those expressions?
Answers:
1. An expression is invariant with respect to a loop if for
   every operand, one of the following holds:
   - It is a literal, or
   - It is a variable that gets its value only from outside
     the loop.
2. To answer question 2, we need to consider safety and
   profitability.
Safety
If evaluating the expression might cause an error,
then there is a possible problem if the expression might not be
executed in the original, unoptimized code. For example:
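A minimal Java sketch of this hazard, with hypothetical method names: the expression `n / d` is loop-invariant, but in the original loop it sits behind a guard and is never evaluated when `d` is zero. Hoisting it in front of the loop makes it execute unconditionally.

```java
class HoistSafety {
    // Original loop: n / d is loop-invariant, but it is only evaluated
    // when the guard d != 0 holds.
    static int original(int n, int d) {
        int sum = 0;
        for (int i = 0; i < n; i++) {
            if (d != 0) {
                sum += n / d;
            }
        }
        return sum;
    }

    // Unsafe hoisting: the division now executes unconditionally, so it
    // throws ArithmeticException when d == 0 even though the original did not.
    static int hoistedUnsafely(int n, int d) {
        int tmp = n / d;
        int sum = 0;
        for (int i = 0; i < n; i++) {
            if (d != 0) {
                sum += tmp;
            }
        }
        return sum;
    }

    // Helper for demonstration: reports whether the hoisted version fails.
    static boolean hoistedCrashes(int n, int d) {
        try {
            hoistedUnsafely(n, d);
            return false;
        } catch (ArithmeticException e) {
            return true;
        }
    }
}
```

For a nonzero `d` the two versions agree; for `d == 0` only the hoisted version fails, so the transformation changed the program's behavior.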
What about preserving the order of events?
For example, if the unoptimized code performed output and then had a runtime error,
is it valid for the optimized code to simply have a runtime error? Also
note that changing the order of floating-point computations may change
the result, because each intermediate result is rounded and so
floating-point arithmetic is not associative.
Profitability
If the computation might not execute in the
original program, moving the computation might actually slow the
program down!
Moving a computation is both safe and profitable if one of the
following holds:
1. It can be determined that the loop will execute at least once
   and the code is guaranteed to execute if the loop does, because either:
   - it isn't inside any condition, or
   - it is on all paths through the loop (e.g., it
     occurs in both branches of an if-then-else).
2. The expression is in (a non-short-circuited) part of the loop
   test / loop bounds, e.g.:
       while (x < i + j * 100) // j*100 will always be evaluated
Strength-Reduction in for Loops
The basic idea here is to take advantage of patterns in for-loops to replace expensive
operations, like multiplications, with cheaper ones, like additions.
The particular pattern that we will handle takes the general form of a loop where:
L is the loop index
B is the beginning value of the loop
E is the end value of the loop
The body of the loop contains a right-hand-side expression of the form L * M + C. We call this the induction expression.
The factors of the induction expression, M and C, must be constant with respect to the loop.
These rules define a sort of "template" of the following form:
for L from B to E do {
    $\vdots$
    $\ldots$ = L * M + C
    $\vdots$
}
Consider the sequences of values for L and
for the induction expression:
Iteration #   L           L * M + C
-----------   ---------   ---------------------------------------
1             B           B * M + C
2             B + 1       (B + 1) * M + C = B * M + M + C
3             B + 1 + 1   (B + 1 + 1) * M + C = B * M + M + M + C
Note that, from the second iteration on, the expanded form of the induction
expression is the value of the whole expression on the previous iteration,
followed by + M.
In other words, each time around the loop, the induction expression
increases by adding M, a constant value!
So we can avoid doing the multiplication each time around the loop by:
computing B * M + C once before the loop,
storing that value in a temporary,
using the temporary instead of the expression inside the loop, and
incrementing the temporary by M at the end of the loop.
Here is the transformed loop:
ind = B * M + C          // Initialize temp to first value of expression
for L from B to E do {
    $\vdots$
    $\ldots$ = ind       // Use ind instead of recalculating expression
    $\vdots$
    ind = ind + M        // Increment ind by M at the end of the loop
}
Note that instead of doing a multiplication and an addition each time
around the loop, we now do just one addition each time.
Although in this example we've removed a multiplication,
in general we are replacing a multiplication with an addition (that is
why this optimization is called reduction in strength).
Although this pattern may seem restrictive, in practice many loops fit into
this template, especially since we allow M or C
to be absent.
In particular, if there were no C, the original induction
expression would be: L * M, and that would be replaced
inside the loop by: ind = ind + M;
an addition replaces a multiplication.
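The template and its transformed version can be sketched in Java as follows. The class and method names are hypothetical, and the loop bounds are taken to be inclusive, as in `for L from B to E`. Each method records the sequence of values that the induction expression takes, so the two sequences can be compared.

```java
class StrengthReduction {
    // Template loop: evaluates L * M + C on every iteration.
    static long[] withMultiply(int b, int e, long m, long c) {
        long[] out = new long[Math.max(0, e - b + 1)];
        for (int l = b; l <= e; l++) {
            out[l - b] = l * m + c;      // one multiplication and one addition
        }
        return out;
    }

    // Transformed loop: the multiplication is done once, before the loop.
    static long[] withAddition(int b, int e, long m, long c) {
        long[] out = new long[Math.max(0, e - b + 1)];
        long ind = b * m + c;            // initialize to the first value
        for (int l = b; l <= e; l++) {
            out[l - b] = ind;            // use ind instead of L * M + C
            ind = ind + M(m);            // increment by M at the end of the loop
        }
        return out;
    }

    // Identity helper so the comment above can name M explicitly.
    private static long M(long m) {
        return m;
    }
}
```

The index `l - b` exists only to record the values for comparison and is not part of the optimization itself.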
Some languages actually have for-loops with the syntax used above
(for i from low to high do ...), but other languages
(including Java) do not use that syntax.
Must a Java compiler give up on performing this optimization, or
might it be able to recognize opportunities in some cases?
As mentioned earlier, many loops naturally fit the template for strength
reduction defined above.
Now let's see how to apply this optimization to the example
code we used to illustrate moving loop-invariant computations
out of the loop.
The code we had after moving the loop-invariant computations contains five
induction expressions, numbered #1 to #5 below:

Original Expression    Loop Index (L)   Multiply Term (M)   Addition Term (C)
--------------------   --------------   -----------------   -----------------
#1: tmp0 - i * 40000   i                -40000              tmp0
#2: tmp1 - j * 400     j                -400                tmp1
#3: i * j              j                i                   0
#4: tmp3 * k           k                tmp3                0
#5: tmp2 - k * 4       k                -4                  tmp2
After performing the reduction in strength optimizations:
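A Java sketch of the result, under the same modeling assumptions as before: memory is an `int[]` of 4-byte words and `BASE` is a hypothetical stand-in for `FP + <offset of A>`. The induction temporaries `ind1` to `ind5` are names introduced here for expressions #1 to #5, and `reference()` recomputes every address directly so the two versions can be compared.

```java
class StrengthReducedLoops {
    static final int N = 100;
    static final int BASE = 4 * (N * N * N - 1);  // hypothetical FP + <offset of A>

    static int[] strengthReduced() {
        int[] mem = new int[N * N * N];
        int tmp0 = BASE;
        int ind1 = tmp0;                      // #1: tmp0 - i * 40000, with i = 0
        for (int i = 0; i < N; i++) {
            int tmp1 = ind1;
            int ind2 = tmp1;                  // #2: tmp1 - j * 400, with j = 0
            int ind3 = 0;                     // #3: i * j, with j = 0
            for (int j = 0; j < N; j++) {
                int tmp2 = ind2;
                int tmp3 = ind3;
                int ind4 = 0;                 // #4: tmp3 * k, with k = 0
                int ind5 = tmp2;              // #5: tmp2 - k * 4, with k = 0
                for (int k = 0; k < N; k++) {
                    int t0 = ind4;
                    int t1 = ind5;
                    mem[t1 / 4] = t0;         // models: Store T0, 0(T1)
                    ind4 = ind4 + tmp3;
                    ind5 = ind5 - 4;
                }
                ind2 = ind2 - 400;
                ind3 = ind3 + i;
            }
            ind1 = ind1 - 40000;
        }
        return mem;
    }

    // Direct computation, used only to check the transformed loops.
    static int[] reference() {
        int[] mem = new int[N * N * N];
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                for (int k = 0; k < N; k++)
                    mem[(BASE - i * 40000 - j * 400 - k * 4) / 4] = i * j * k;
        return mem;
    }
}
```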
Before strength reduction, the innermost loop (executed 1,000,000 times) had
two multiplications and one subtraction.
In the optimized code, the inner loop has no multiplications,
one subtraction, and one addition.
(Similarly, the middle loop went from two multiplications and one
subtraction to no multiplications, one subtraction, and one addition;
the outer loop went from one multiplication and one subtraction to
no multiplications and one subtraction.)
On the other hand, we have added a number of assignments;
for example, the inner loop had just two assignments, and now it
has four.
We'll deal with that in the next section using copy
propagation.
Copy Propagation
Statements of the form x = y (call this definition $d$)
are called copy statements.
For every use $u$ of variable x reached by definition $d$
such that:
no other definition of x reaches $u$, and
y can't change between $d$ and $u$
we can replace the use of x at $u$ with a use of y.
Examples:
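The following Java sketch, with hypothetical method names, shows one case where the replacement is legal and one where it is not because y changes between $d$ and $u$.

```java
class CopyPropagation {
    // Before: "x = y" is a copy statement (definition d) with a single use.
    static int before(int y, int z) {
        int x = y;        // d
        int a = x + z;    // u: only d reaches this use, and y is unchanged
        return a;
    }

    // After: the use of x is replaced by y, so d is dead and has been removed.
    static int after(int y, int z) {
        int a = y + z;
        return a;
    }

    // Not legal: y is redefined between d and the use of x.
    static int notLegal(int y, int z) {
        int x = y;        // d
        y = y + 1;        // y changes between d and u
        int a = x + z;    // u: must keep using x here
        return a;
    }

    // What goes wrong if the copy is propagated anyway.
    static int wronglyPropagated(int y, int z) {
        y = y + 1;
        int a = y + z;    // uses the new value of y, so the result differs
        return a;
    }
}
```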
Question: Why is this a useful transformation?
Answers:
If all uses of x reached by $d$ are replaced,
then definition $d$ is useless, and can be removed.
Even if the definition cannot be removed, copy propagation
can lead to improved code:
If the definition is actually of the form: x = literal,
then copy propagation can create opportunities for better code:
For a machine like the MIPS, there are fast instructions that can be used
when one of the operands is an "immediate" (literal) value. These
instructions can be used for the operation a = b + 5, since one of the
operands is a literal, but not for a = b + x. The improvement is even
more striking because MIPS doesn't allow arithmetic operands to be memory
locations, so to generate assembly for statements like a = b + x, it
would be necessary to load the values for both b and x into registers,
which would require additional load instructions.
(Note: compilers work hard to keep values in registers and avoid loads; this
process is called register allocation. It is therefore possible that all
operand values are already in registers when an operation occurs, so we are
not necessarily saving a load here. Nevertheless, it's always better to use
constants where possible.)
Furthermore, this kind of copy propagation can lead to
opportunities for constant folding: evaluating,
at compile time,
an expression that involves only literals.
For example:
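A small Java sketch of the two steps, with hypothetical method names. The three methods show the same computation before copy propagation, after propagating the literal, and after folding the resulting constant expression.

```java
class ConstantFolding {
    // Original code: x is defined by a literal and then used.
    static int original(int b) {
        int x = 5;
        int y = x * 4;
        return b + y;
    }

    // After copy propagation: the use of x is replaced by the literal 5,
    // so the definition x = 5 is dead and has been removed.
    static int afterPropagation(int b) {
        int y = 5 * 4;    // now an expression that involves only literals
        return b + y;
    }

    // After constant folding: 5 * 4 is evaluated at compile time, and the
    // resulting literal is propagated into the use of y.
    static int afterFolding(int b) {
        return b + 20;
    }
}
```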
Sometimes copy propagation can be combined with moving
loop-invariant computations out of the loop, to lead to a better
overall optimization.
For example:
    while (...) {
        x = a * b; // loop-inv
        y = x * c;
        ...
    }
Move "a * b" out of the loop:
    tmp1 = a * b;
    while (...) {
        x = tmp1;
        y = x * c;
        ...
    }
Note that at this point, even if c is not modified in
the loop, we cannot move "x * c" out
of the loop, because x gets its value inside the loop.
However, after we do copy propagation:
    tmp1 = a * b;
    while (...) {
        x = tmp1;
        y = tmp1 * c;
        ...
    }
"tmp1 * c" can also be moved out of the loop:
    tmp1 = a * b;
    tmp2 = tmp1 * c;
    while (...) {
        x = tmp1;
        y = tmp2;
        ...
    }
Given a definition $d$ that is a copy statement: x = y,
and a use $u$ of x, we must
determine whether the two important properties hold that permit
the use of x to be replaced with y.
The first property (use $u$ is reached only by definition $d$)
is best checked using the standard "reaching-definitions" dataflow-analysis
problem, which computes, for each definition of a variable x,
all of the uses of x that might be reached by that definition.
Note that this property can also be determined by doing a backward
depth-first or breadth-first search in the control-flow graph, starting at
use $u$, and terminating a branch of the search when a definition of x
is reached. If definition $d$ is the only definition encountered
in the search, then it is the only one that reaches use $u$.
(This technique will, in general, be less efficient than doing
reaching-definitions analysis.)
The second property (that variable y cannot change its value
between definition $d$ and use $u$) can also be verified using dataflow
analysis, or using a backward search in the control-flow graph starting
at $u$, and quitting at $d$.
If no definition of y is encountered during the search, then
its value cannot change, and the copy propagation can be performed.
Note that when y is a literal, property (2) is always satisfied.
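The two backward searches can be sketched in Java as follows. This is a minimal sketch under simplifying assumptions: the control-flow graph representation (`Node`, with one statement per node, at most one defined variable, and an explicit predecessor list) is hypothetical, and the search for property (2) assumes property (1) has already been established.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class CopyPropCheck {
    // A CFG node: one statement that defines at most one variable.
    static final class Node {
        final String def;                        // variable defined here, or null
        final List<Node> preds = new ArrayList<>();
        Node(String def) { this.def = def; }
    }

    // Property (1): d is the only definition of x that reaches use u.
    static boolean onlyDefReaching(Node d, Node u, String x) {
        Set<Node> found = new HashSet<>();
        Set<Node> seen = new HashSet<>();
        Deque<Node> work = new ArrayDeque<>(u.preds);
        while (!work.isEmpty()) {
            Node n = work.pop();
            if (!seen.add(n)) continue;          // already visited
            if (x.equals(n.def)) {               // a definition of x:
                found.add(n);                    // record it and stop this branch
                continue;
            }
            work.addAll(n.preds);
        }
        return found.size() == 1 && found.contains(d);
    }

    // Property (2): y is not redefined on any path from d to u.
    static boolean sourceUnchanged(Node d, Node u, String y) {
        Set<Node> seen = new HashSet<>();
        Deque<Node> work = new ArrayDeque<>(u.preds);
        while (!work.isEmpty()) {
            Node n = work.pop();
            if (!seen.add(n) || n == d) continue; // quit at d
            if (y.equals(n.def)) return false;    // y changes between d and u
            work.addAll(n.preds);
        }
        return true;
    }
}
```

The search starts at the predecessors of $u$ rather than at $u$ itself, because a statement such as `i = i + 1` both uses and defines the same variable.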
Below is our running example (after doing reduction in strength), with
each copy statement marked according to whether it can be propagated.
In this particular example, each variable x that is defined in a copy
statement reaches only one use.
Every copy that cannot be propagated fails because of a violation
of property (1); in this example there are no instances where property (2)
is violated.
Here's the code after propagating the copies that are legal, and removing
the copy statements that become dead. Note that we are able to remove 5
copy statements, including 2 from the innermost loop.
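A Java sketch of the resulting loops, under the same modeling assumptions as the earlier sketches (memory as an `int[]` of 4-byte words, `BASE` as a hypothetical stand-in for `FP + <offset of A>`, and `ind1` to `ind5` as assumed names for the induction temporaries). In this reconstruction the removed copies are the ones that defined `tmp1`, `tmp2`, `tmp3`, `T0`, and `T1`; `reference()` is included only to check the result.

```java
class CopyPropagatedLoops {
    static final int N = 100;
    static final int BASE = 4 * (N * N * N - 1);  // hypothetical FP + <offset of A>

    static int[] optimized() {
        int[] mem = new int[N * N * N];
        int tmp0 = BASE;
        int ind1 = tmp0;                      // kept: ind1 is also redefined in the loop
        for (int i = 0; i < N; i++) {
            int ind2 = ind1;                  // was: tmp1 = ind1; ind2 = tmp1
            int ind3 = 0;
            for (int j = 0; j < N; j++) {
                int ind4 = 0;
                int ind5 = ind2;              // was: tmp2 = ind2; ind5 = tmp2
                for (int k = 0; k < N; k++) {
                    mem[ind5 / 4] = ind4;     // was: T0 = ind4; T1 = ind5; Store T0, 0(T1)
                    ind4 = ind4 + ind3;       // was: ind4 = ind4 + tmp3
                    ind5 = ind5 - 4;
                }
                ind2 = ind2 - 400;
                ind3 = ind3 + i;
            }
            ind1 = ind1 - 40000;
        }
        return mem;
    }

    // Direct computation, used only to check the optimized loops.
    static int[] reference() {
        int[] mem = new int[N * N * N];
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                for (int k = 0; k < N; k++)
                    mem[(BASE - i * 40000 - j * 400 - k * 4) / 4] = i * j * k;
        return mem;
    }
}
```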
Comparing this code with the original code, we see that, in the inner loop
(which is executed 1,000,000 times) we originally had 5 multiplications,
3 additions/subtractions, and 1 indexed store.
We now have no multiplications and just 2 additions/subtractions.
We have added 2 additions/subtractions and 2 copy statements to the
middle loop (which executes 10,000 times) and 1 addition/subtraction
and 1 copy statement to the outer loop (which executes 100 times), but
overall this should be a win!