Motivation for Dataflow Analysis
A compiler can perform some optimizations based only on local information. For example, consider the following code:
    x = a + b;
    x = 5 * 2;

It is easy for an optimizer to recognize that:
- the first assignment to x is a "useless" assignment, since the value computed for x is never used (so the first statement can be removed), and
- the expression 5 * 2 can be computed at compile time, simplifying the second statement to x = 10.
Some optimizations, however, require more "global" information. For example, consider the following code:
    a = 1;
    b = 2;
    c = 3;
    if (...)
        x = a + 5;
    else
        x = b + 4;
    c = x + 1;

In this example, the initial assignment to c (at line 3) is useless, and the expression x + 1 can be simplified to 7 (x is 6 on both branches), but it is less obvious how a compiler can discover these facts, since they cannot be discovered by looking only at one or two consecutive statements. A more global analysis is needed so that the compiler knows at each point in the program:
- which variables are guaranteed to have constant values, and
- which variables may be used before being redefined.
     1.  k = 2;
     2.  if (...) {
     3.      a = k + 2;
     4.      x = 5;
     5.  } else {
     6.      a = k * 2;
     7.      x = 8;
     8.  }
     9.  k = a;
    10.  while (...) {
    11.      b = 2;
    12.      x = a + k;
    13.      y = a * b;
    14.      k++;
    15.  }
    16.  print(a+x);
Constant Propagation
The goal of constant propagation is to determine where in the program variables are guaranteed to have constant values. More specifically, the information computed for each CFG node n is a set of pairs, each of the form (variable, value). If we have the pair (x, v) at node n, that means that x is guaranteed to have value v whenever n is reached during program execution.
Below is the CFG for the example program, annotated with constant-propagation information.
Live-Variable Analysis
The goal of live-variable analysis is to determine which variables are "live" at each point in the program; a variable is live if its current value might be used before being overwritten. The information computed for each CFG node is the set of variables that are live immediately after that node. Below is the CFG for the example program, annotated with live variable information.
Draw the CFG for the following program, and annotate it with the constant-propagation and live-variable information that holds before each node.
    N = 10;
    k = 1;
    prod = 1;
    MAX = 9999;
    while (k <= N) {
        read(num);
        if (MAX/num < prod) {
            print("cannot compute prod");
            break;
        }
        prod = prod * num;
        k++;
    }
    print(prod);
Before thinking about how to define a dataflow problem, note that there are two kinds of problems:
- forward problems (like constant propagation), where the information at a node n summarizes what can happen on paths from "enter" to n; and
- backward problems (like live-variable analysis), where the information at a node n summarizes what can happen on paths from n to "exit".
Another way that many common dataflow problems can be categorized is as may problems or must problems. The solution to a "may" problem provides information about what may be true at each program point (e.g., for live-variable analysis, a variable is live after node n if its value may be used before being overwritten), while the solution to a "must" problem provides information about what must be true (e.g., for constant propagation, the pair (x, v) holds before node n only if x must have the value v at that point).
Now let's think about how to define a dataflow problem so that it's clear what the (best) solution should be. When we do dataflow analysis "by hand", we look at the CFG and think about:
- the information that holds at the start of the program,
- how each node changes the information that reaches it, and
- how to combine the information that arrives at a node from several predecessors (or successors).
This intuition leads to the following definition. An instance of a dataflow problem includes:
- a CFG;
- a domain D of "dataflow facts";
- a dataflow fact "init" (the information true at the start of the program for a forward problem, or at the end for a backward problem);
- an operator ⌈⌉ (used to combine incoming information from multiple predecessors); and
- for each CFG node n, a dataflow function fn : D → D (that defines the effect of executing n).
For constant propagation, an individual dataflow fact is a set of pairs of the form (var, val), so the domain of dataflow facts is the set of all such sets of pairs (the power set). For live-variable analysis, it is the power set of the set of variables in the program.
For both constant propagation and live-variable analysis, the "init" fact is the empty set (no variable starts with a constant value, and no variables are live at the end of the program).
For constant propagation, the combining operation ⌈⌉ is set intersection. This is because if a node n has two predecessors, p1 and p2, then variable x has value v before node n iff it has value v after both p1 and p2. For live-variable analysis, ⌈⌉ is set union: if a node n has two successors, s1 and s2, then the value of x after n may be used before being overwritten iff that holds either before s1 or before s2. In general, for "may" dataflow problems, ⌈⌉ will be some union-like operator, while it will be an intersection-like operator for "must" problems.
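As a concrete illustration, here is a minimal Python sketch of the two combining operators; the facts and variable names below are invented for the example, not taken from the notes.

```python
def cp_meet(fact1, fact2):
    """Constant propagation: keep only the (var, val) pairs in BOTH facts."""
    return fact1 & fact2          # set intersection

def lv_combine(fact1, fact2):
    """Live variables: a variable is live if it is live on EITHER branch."""
    return fact1 | fact2          # set union

# After predecessor p1: x=2, y=7.  After p2: x=2, y=9.  Only x=2 survives.
assert cp_meet({("x", 2), ("y", 7)}, {("x", 2), ("y", 9)}) == {("x", 2)}
# Live before successor s1: {x}.  Before s2: {y}.  Both are live after n.
assert lv_combine({"x"}, {"y"}) == {"x", "y"}
```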
For constant propagation, the dataflow function associated with a CFG node that does not assign to any variable (e.g., a predicate) is the identity function. For a node n that assigns to a variable x, there are two possibilities:
1. The value assigned to x is a constant c (either a literal, or an expression whose operands all have known constant values in the incoming fact). In this case, the function removes any pair (x, *) from the incoming set and adds the pair (x, c).
2. The value assigned to x is not known to be constant. In this case, the function simply removes any pair (x, *) from the incoming set.
For live-variable analysis, the dataflow function for each node n has the form: fn(S) = (S - KILLn) union GENn, where KILLn is the set of variables defined at node n, and GENn is the set of variables used at node n. In other words, for a node that does not assign to any variable, the variables that are live before n are those that are live after n plus those that are used at n; for a node that assigns to variable x, the variables that are live before n are those that are live after n except x, plus those that are used at n (including x if it is used at n as well as being defined there).
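The GEN/KILL form of the liveness functions can be sketched in Python as follows; the three-statement program and its GEN/KILL sets are hypothetical, chosen just to exercise the formula, and the analysis runs backwards from the last statement.

```python
def live_before(live_after, kill, gen):
    # f_n(S) = (S - KILL_n) union GEN_n
    return (live_after - kill) | gen

# Straight-line code, processed backwards:  a = b + c;  d = a;  print(d)
live = set()                                            # nothing live at exit
live = live_before(live, kill=set(), gen={"d"})         # print(d): uses d
live = live_before(live, kill={"d"}, gen={"a"})         # d = a: kills d, uses a
live = live_before(live, kill={"a"}, gen={"b", "c"})    # a = b + c
assert live == {"b", "c"}                               # b and c live at entry
```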
An equivalent way of formulating the dataflow functions for live-variable analysis is: fn(S) = (S intersect NOT-KILLn) union GENn, where NOT-KILLn is the set of variables not defined at node n. The advantage of this formulation is that it permits the dataflow facts to be represented using bit vectors, and the dataflow functions to be implemented using simple bit-vector operations (and, or).
It turns out that a number of interesting dataflow problems have dataflow functions of this same form, where GENn and KILLn are sets whose definition depends only on n, and the combining operator ⌈⌉ is either union or intersection. These problems are called GEN/KILL problems, or bit-vector problems.
Consider using dataflow analysis to determine which variables might be used before being initialized; i.e., to determine, for each point in the program, for which variables there is a path to that point on which the variable was never defined. Define the "may-be-uninitialized" dataflow problem by specifying:
- the domain of dataflow facts;
- the "init" dataflow fact;
- whether it is a forward or backward problem;
- the combining operator ⌈⌉; and
- the dataflow function for each kind of CFG node.
If you did not define the may-be-uninitialized problem as a GEN/KILL problem, go back and do that now (i.e., say what the GEN and KILL sets should be for each kind of CFG node, and whether ⌈⌉ should be union or intersection).
A solution to an instance of a dataflow problem is a dataflow fact for each node of the given CFG. But what does it mean for a solution to be correct, and if there is more than one correct solution, how can we judge whether one is better than another?
Ideally, we would like the information at a node to reflect what might
happen on all possible paths to that node. This ideal solution is
called the meet over all paths (MOP) solution, and is
discussed below. Unfortunately, it is not always possible to compute
the MOP solution; we must sometimes settle for a solution that
provides less precise information.
The MOP solution (for a forward problem) for each CFG node n is
defined as follows:
For instance, in our running example program there are two paths from
the start of the program to line 9 (the assignment k = a): one through the
true branch (k = 2; a = k + 2; x = 5) and one through the false branch
(k = 2; a = k * 2; x = 8).
Combining the information from both paths, we see that the MOP
solution for node 9 is: k=2 and a=4.
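To make the path-by-path computation concrete, here is a small Python sketch that evaluates the constant-propagation facts along each of the two paths and then combines them; representing facts as dicts is an implementation choice, not part of the notes.

```python
def meet(f1, f2):
    """Keep only the variable/value pairs on which both paths agree."""
    return {v: c for v, c in f1.items() if f2.get(v) == c}

# Path 1:  k = 2;  a = k + 2;  x = 5
p1 = {"k": 2}
p1["a"] = p1["k"] + 2
p1["x"] = 5

# Path 2:  k = 2;  a = k * 2;  x = 8
p2 = {"k": 2}
p2["a"] = p2["k"] * 2
p2["x"] = 8

# Combining the two paths: only k=2 and a=4 hold on both.
assert meet(p1, p2) == {"k": 2, "a": 4}
```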
It is worth noting that even the MOP solution can be overly
conservative (i.e., may include too much information for a "may"
problem, and too little information for a "must" problem), because not
all paths in the CFG are executable. For example, a program may
include a predicate that always evaluates to false (e.g., a
programmer may include a test as a debugging device -- if the program
is correct, then the test will always fail, but if the program
contains an error then the test might succeed, reporting that error).
Another way that non-executable paths can arise is when two predicates
on the path are not independent (e.g., whenever the first evaluates to
true then so does the second). These situations are
illustrated below.
Unfortunately, since most programs include loops, they also have
infinitely many paths, and thus it is not possible to compute the MOP
solution to a dataflow problem by computing information for every path
and combining that information. Fortunately, there are other ways to
solve dataflow problems (given certain reasonable assumptions about
the dataflow functions associated with the CFG nodes). As we shall
see, if those functions are distributive, then the solution
that we compute is identical to the MOP solution. If the functions
are monotonic, then the solution may not be identical to the
MOP solution, but is a conservative approximation.
The alternative to computing the MOP solution directly, is to solve a
system of equations that essentially specify that local information
must be consistent with the dataflow functions. In particular, we
associate two dataflow facts with each node n:
- n.before: the information that holds just before n executes, and
- n.after: the information that holds just after n executes.
One question is whether, in general, our system of equations will have
a unique solution. The answer is that, in the presence of loops,
there may be multiple solutions. For example, consider the simple
program whose CFG is given below:
The equations for constant propagation are as follows (where
⌈⌉ is the intersection combining operator):
The solution we want is solution 4, which includes the most constant
information. In general, for a "must" problem the desired solution
will be the largest one, while for a "may" problem the desired
solution will be the smallest one.
Using the simple CFG given above, write the equations for
live-variable analysis, as well as the greatest and least solutions.
Which is the desired solution, and why?
Many different algorithms have been designed for solving a dataflow
problem's system of equations. Most can be classified as either
iterative algorithms or elimination algorithms.
These two classes of algorithms are discussed in the next two
sections.
Most of the iterative algorithms are variations on the following
algorithm (this version is for forward problems).
It uses a new value T (called "top"). T has the property
that, for all dataflow facts d, T ⌈⌉ d = d.
Also, for all dataflow functions, fn(T) = T.
(When we consider the lattice model for dataflow analysis we will see
that this initial value is the top element of the lattice.)
Run this iterative algorithm on the simple CFG given above (the one with
a loop) to solve the constant propagation problem.
Run the algorithm again on the example CFG from the
examples section of the notes.
This algorithm works regardless of the order in which nodes are
removed from the worklist. However, that order can affect the
efficiency of the algorithm. A number of variations have been
developed that involve different ways of choosing that order.
When we consider the lattice model, we will revisit the question of
complexity.
Three important questions are:
1. Does the iterative algorithm always halt?
2. If it halts, does it compute the MOP solution?
3. If not, how does the computed solution relate to the MOP solution?
The answers are provided by the framework first defined by Kildall.
The next section provides background on lattices;
the section after that presents Kildall's framework.
Definition:
A partially ordered set (poset) is a set S together with a partial ordering ⊆ that is reflexive, anti-symmetric, and transitive.
Note: "partial" means it is not necessary that for all x, y in S,
either x ⊆ y or y ⊆ x.
It is OK for a pair of set elements to be incomparable.
Example 1:
The set S is the set of English words, and the ordering ⊆ is
substring (i.e., w1 ⊆ w2 iff w1 is a substring of w2).
Here is a picture of some words and their ordering (having an
edge w1 → w2 means w1 > w2).
Example 2:
S is the set of English words, and the ordering ⊆ is
"is shorter than or equal to in length".
Example 3:
S is the set of integers, and the ordering ⊆ is "less than or equal to".
This is a poset (try verifying each of the three properties).
Example 4:
S is the set of integers and the ordering ⊆ is "strictly less than".
This is not a poset, because the ordering is not reflexive.
Example 5:
S is the set of all sets of letters and the ordering is subset.
This is a poset.
The "Meet Over All Paths" Solution
Solving a Dataflow Problem by Solving a Set of Equations
These n.befores and n.afters are the variables of our equations,
which are defined as follows (two equations for each node n):

    n.before = ⌈⌉( p1.after, p2.after, ..., pk.after )
    n.after = fn( n.before )

where p1, p2, etc. are n's predecessors in the CFG (and ⌈⌉ is the
combining operator for this dataflow problem). In addition, we have
one equation for the enter node:

    enter.after = init
These equations make intuitive sense: the dataflow
information that holds before node n executes is the combination of
the information that holds after each of n's predecessors executes,
and the information that holds after n executes is the result of
applying n's dataflow function to the information that holds before n
executes.
enter.after = empty set
Because of the cycle in the example CFG, the equations for
3.before, 3.after, 4.before, and
4.after are mutually recursive, which leads to the
four solutions shown below (differing on those four values).
1.before = enter.after
1.after = 1.before - (x, *) union (x, 2)
2.before = 1.after
2.after = if (x, c) is in 2.before then 2.before - (y, *) union (y, c), else 2.before - (y, *)
3.before = ⌈⌉(2.after, 4.after )
3.after = 3.before
4.before = 3.after
4.after = 4.before
Iterative Algorithms
Set enter.after = init. Set all other n.after to T.
Initialize a worklist to contain all CFG nodes except enter
and exit.
While the worklist is not empty:
    Remove a node n from the worklist.
    Compute n.before by combining all p.after such that p is a predecessor of n in the CFG.
    Compute tmp = fn( n.before ).
    If (tmp != n.after) then
        Set n.after = tmp.
        Put all of n's successors on the worklist.
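Below is one possible Python rendering of this worklist algorithm, run as a forward constant-propagation analysis on the four-node loop CFG whose equations appear elsewhere in these notes (node 1: x = 2; node 2: y = x; nodes 3 and 4 form a loop that changes nothing). Representing top as None is an implementation choice.

```python
def meet(f1, f2):
    if f1 is None: return f2          # T ⌈⌉ d = d
    if f2 is None: return f1
    return {v: c for v, c in f1.items() if f2.get(v) == c}

preds = {1: ["enter"], 2: [1], 3: [2, 4], 4: [3]}
succs = {1: [2], 2: [3], 3: [4], 4: [3]}

def fn(node, before):
    if before is None:                # f_n(T) = T
        return None
    fact = dict(before)
    if node == 1:                     # x = 2
        fact["x"] = 2
    elif node == 2:                   # y = x
        fact.pop("y", None)
        if "x" in fact:
            fact["y"] = fact["x"]
    return fact                       # nodes 3 and 4: identity

after = {"enter": {}, 1: None, 2: None, 3: None, 4: None}
worklist = [1, 2, 3, 4]
while worklist:
    n = worklist.pop(0)
    before = None
    for p in preds[n]:
        before = meet(before, after[p])
    tmp = fn(n, before)
    if tmp != after[n]:
        after[n] = tmp
        worklist.extend(succs[n])

assert after[4] == {"x": 2, "y": 2}   # greatest solution: x and y constant
```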
The Lattice Model of Dataflow Analysis
Motivation
Recall that while we would like to compute the meet-over-all-paths (MOP)
solution to a dataflow problem, direct computation of that solution
(by computing and combining solution for every path) is usually not
possible.
Therefore, dataflow problems are usually solved by finding a solution
to a set of equations that define two dataflow facts (n.before and n.after)
for each CFG node n.
Background
Partially ordered sets
A partially ordered set (poset) is a set S that has a
partial ordering ⊆ , such that the ordering is:
1. reflexive: for all x in S, x ⊆ x;
2. anti-symmetric: for all x, y in S, if x ⊆ y and y ⊆ x then x = y;
3. transitive: for all x, y, z in S, if x ⊆ y and y ⊆ z then x ⊆ z.
Below are some examples of sets with orderings;
some are partially ordered sets and some are not.
            candy        then
            /    \      /    \
           v      v    v      v
    annual  and   can  the   hen
        \    |    /      \   /
         \   |   /        v v
          v  v  v          he
            an
             |
             v
             a
Note that the "substring" ordering does have the three properties
required of a partial order: it is reflexive (every word is a substring
of itself), anti-symmetric (two different words cannot each be a substring
of the other), and transitive (a substring of a substring of w is itself
a substring of w).
           candy
             |
             v
            then
          /  |  |  \
         v   v  v   v
       can  and the  hen
         \   \  /   /
          v   vv   v
            an
             |
             v
             a
Does this ordering have the three properties? It is reflexive and
transitive, but not anti-symmetric: for example, the ⊆ hen and
hen ⊆ the (they have equal length), yet the ≠ hen.
Two out of three isn't good enough -- this is not a poset.
A lattice is a poset in which every pair of elements has:
- a least upper bound (called the join of the two elements), and
- a greatest lower bound (called the meet of the two elements).
The join of two elements x and y is defined to be the element z such that:
1. x ⊆ z,
2. y ⊆ z, and
3. for all w in S, if x ⊆ w and y ⊆ w, then z ⊆ w.
The first two rules say that z actually is an upper bound for
x and y, while the third rule says that z is the least
upper bound.
Pictorially:
      z
     / \        z is the least upper bound of x and y
    v   v
    y   x

    z     w
    |\   /|
    | \ / |     z is NOT the least upper bound of x and y
    | / \ |     (they have NO least upper bound)
    vv   vv
    y     x
The idea for the meet operation is similar, with the reverse orderings.
Examples: The poset of Example 5 (all sets of letters, ordered by subset) is a lattice: the join of two sets is their union, and their meet is their intersection.
Complete lattices
Definition:
A complete lattice is a lattice in which every subset of elements has a least upper bound and a greatest lower bound.
Note: Every finite lattice (i.e., S is finite) is complete.
Note: Every complete lattice has a greatest element, "Top" (written
as a capital T) and a least
element "Bottom" (written as an upside-down capital T).
They are the least-upper and the
greatest-lower bounds of the entire underlying set S.
Monotonic and distributive functions
Definition:
A function f : L → L (where L is a lattice) is monotonic iff for all x, y in L: x ⊆ y implies f(x) ⊆ f(y).
A function f is distributive iff for all x, y in L: f(x meet y) = f(x) meet f(y).
Every distributive function is also monotonic (proving that could be good practice!), but not vice versa. For the GEN/KILL dataflow problems, all dataflow functions are distributive. For constant propagation, all functions are monotonic, but not all functions are distributive. For example, the dataflow function f associated with the assignment x = a + b is:

    f(S) = if S == T then T
           else if (a, v1) is in S and (b, v2) is in S then S - (x, *) union (x, v1+v2)
           else S - (x, *)
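We can check non-distributivity of this function directly. In the sketch below (dict-based facts are an implementation choice; the top value is not needed for these inputs), applying f after the meet loses the fact that x = 5, while meeting after applying f keeps it.

```python
def meet(f1, f2):
    return {v: c for v, c in f1.items() if f2.get(v) == c}

def f(s):
    """Dataflow function for x = a + b (non-top inputs only)."""
    out = {v: c for v, c in s.items() if v != "x"}   # remove (x, *)
    if "a" in s and "b" in s:
        out["x"] = s["a"] + s["b"]                   # add (x, v1+v2)
    return out

s1 = {"a": 2, "b": 3}             # fact after one branch
s2 = {"a": 3, "b": 2}             # fact after the other branch

assert f(meet(s1, s2)) == {}                  # meet first: x is lost
assert meet(f(s1), f(s2)) == {"x": 5}         # apply f first: x = 5 survives
# So f(s1 meet s2) != f(s1) meet f(s2): monotonic but not distributive.
```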
Here is an important theorem about lattices and monotonic functions:
Theorem:
If L is a complete lattice and f : L → L is monotonic, then f has a greatest fixed point and a least fixed point.
We can create new lattices from old ones using cross-product: if L1, L2, ..., Ln are lattices, then so is the cross-product of L1, L2, ..., Ln (which we can write as: L1 x L2 x ... x Ln). The elements of the cross-product are tuples of the form <e1, e2, ..., en>, where each ek is an element of Lk.
The ordering is element-wise: <e1, e2, ..., en> ⊆ <e1', e2', ..., en'> iff ek ⊆ ek' for all 1 <= k <= n (where each ⊆ on the right is the ordering of the corresponding lattice Lk).
If L1, L2, ..., Ln are complete lattices, then so is their
cross-product.
The top element is the tuple that contains the top elements
of the individual lattices:
<top of L1, top of L2, ... , top of Ln>, and the
bottom element is the tuple that contains the bottom elements of
the individual lattices:
<bottom of L1, bottom of L2, ... , bottom of Ln>.
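A small Python sketch of the element-wise ordering on a cross-product of two powerset lattices; the particular component sets are assumptions made for the example.

```python
def leq_tuple(t1, t2):
    # element-wise ordering; for Python sets, "<=" is the subset relation
    return all(e1 <= e2 for e1, e2 in zip(t1, t2))

t1 = ({"a"}, {"x"})
t2 = ({"a", "b"}, {"x", "y"})
assert leq_tuple(t1, t2)            # each component is a subset
assert not leq_tuple(t2, t1)

# Top of the product = tuple of component tops; bottom = tuple of bottoms
# (assuming the component lattices are the powersets of {a,b} and {x,y}).
top = ({"a", "b"}, {"x", "y"})
bottom = (set(), set())
assert leq_tuple(bottom, top)
```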
Recall that our informal definition of a dataflow problem included:
- a domain D of dataflow facts,
- an "init" dataflow fact,
- an operator ⌈⌉ for combining incoming information, and
- a dataflow function fn for each CFG node n.
Kildall addressed this issue by putting some additional requirements
on D, ⌈⌉ , and fn.
In particular he required that:
- the domain D be a complete lattice (with no infinite descending chains);
- the combining operator ⌈⌉ be the lattice's meet operator; and
- the dataflow functions fn be distributive.
Given these properties, Kildall showed that:
- the iterative algorithm always terminates, and
- the solution it computes is the MOP solution.
A paper by Kam and Ullman (Acta Informatica 7, 1977)
extended Kildall's results to show that,
given monotonic dataflow functions:
- the iterative algorithm always terminates, and it always finds the greatest solution to the set of equations; but
- if the functions are monotonic and not distributive, that solution is not necessarily the same as the MOP solution.
To show that the iterative algorithm computes the greatest
solution to the set of equations, we can "transform" the
set of equations into a single, monotonic function L → L
(for a complete lattice L) as follows:
Consider the right-hand side of each equation to be a "mini-function".
For example, for the two equations:
Define the function that corresponds to all of the equations to be:
Note that every fixed point of f is a solution to the set of
equations!
We want the greatest solution (i.e., the greatest fixed point).
To guarantee that this solution exists we need to know that:
1. the domain of f (tuples of dataflow facts) is a complete lattice, and
2. f is monotonic.
To show (1), note that each individual value in the tuple
is an element of a complete lattice. (That is required by Kildall's
framework.)
So since cross product (tupling) preserves completeness,
the tuple itself is an element of a complete lattice.
To show (2), note that the mini-functions that define each
n.after value are monotonic (since those are the dataflow
functions, and we've required that they be monotonic).
It is easy to show that the
mini-functions that define each n.before value are monotonic, too.
For a node n with k predecessors, the equation is:
Base case k=1:
We must show that given a ⊆ a', f(a) ⊆ f(a').
For this f, f(a) = a and f(a') = a', so this f is monotonic.

Base case k=2:
We must show that given a1 ⊆ a1' and a2 ⊆ a2',
f(a1, a2) ⊆ f(a1', a2').
Induction Step:
Assume that for all k < n, the k-argument mini-function is monotonic.
Given that all the mini-functions are monotonic, it is easy to show that
f (the function that works on the tuples that represent the nodes'
before and after sets) is monotonic;
i.e., given two tuples:
We now know that the domain of f is a complete lattice and that f is
monotonic. Therefore, f has a greatest fixed point, which is the
greatest solution to our set of equations.
Another approach to solving a dataflow problem is to solve a system
of equations that relates the dataflow facts that hold before each
node to the facts that hold after the node.
Kildall showed that if the dataflow functions are distributive,
then the (original version of the)
iterative algorithm always terminates,
and always finds the MOP solution.
Kam and Ullman later showed that if the dataflow functions are
monotonic then the iterative algorithm always finds the
greatest solution to the set of equations.
They also showed that if the functions are monotonic but not
distributive, then that solution is not always the same as
the MOP solution.
It is also true that the
greatest solution to the system of equations is always
an approximation to the MOP solution (i.e.,
may be lower in the lattice of solutions).
Summary of lattice theory
If L is a complete lattice and f is monotonic,
then f has a greatest and a least fixed point.
If L has no infinite descending chains then we can compute the
greatest fixed point of f via iteration ( f(T), f(f(T)) etc.)
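That iteration can be sketched in Python on a toy powerset lattice; the function f below is an invented monotonic function, used only to show that the greatest and least fixed points can differ.

```python
TOP = frozenset("abc")        # greatest element of the powerset lattice
BOTTOM = frozenset()          # least element

def f(s):
    # monotonic: enlarging s can only enlarge f(s)
    out = {"a"}
    if {"b", "c"} <= s:       # keep b and c only if both are present
        out |= {"b", "c"}
    return frozenset(out)

def fix(start):
    """Iterate f(start), f(f(start)), ... until a fixed point is reached."""
    x = start
    while f(x) != x:
        x = f(x)
    return x

assert fix(TOP) == frozenset("abc")    # greatest fixed point
assert fix(BOTTOM) == frozenset("a")   # least fixed point is different
```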
Kildall's Lattice Framework for Dataflow Analysis
Recall that an instance of a dataflow problem includes a domain D of
dataflow facts, an "init" fact, a combining operator ⌈⌉ , and a dataflow
function fn for each CFG node n, and that our goal is to solve a given
instance of the problem by computing "before" and "after" sets for each
node of the control-flow graph.
A problem is that, with no additional information about the domain D, the
operator ⌈⌉ , and the dataflow functions fn, we can't say, in
general, whether a particular algorithm for computing the before and
after sets works correctly (e.g., does the algorithm always halt?
does it compute the MOP solution? if not, how does the computed solution
relate to the MOP solution?).
He also required (essentially) that the iterative algorithm initialize
n.after (for all nodes n other than the enter node) to the lattice's
"top" value.
(Kildall's algorithm is slightly different from the iterative algorithm
presented here, but computes the same result.)
It is interesting to note that, while his theorems are correct,
the example dataflow problem that he uses (constant propagation)
does not satisfy his requirements;
in particular, the dataflow functions for constant propagation
are not distributive (though they are monotonic).
This means that the solution computed by the iterative algorithm
for constant propagation will not, in general, be the MOP solution.
Below is an example to illustrate this:
1: enter
|
v
2: if (...)
/ \
v v
3: a = 2 4: a = 3
| |
v v
5: b = 3 6: b = 2
\ /
v v
7: x = a + b
|
v
8: print(x)
The MOP solution for the final print statement includes the pair (x,5),
since x is assigned the value 5 on both paths to that statement.
However, the greatest solution to the set of equations for this program
(the result computed using the iterative algorithm) finds that
x is not constant at the print statement.
This is because the equations require that n.before be the
meet of m.after for all predecessors m;
in particular, they require that the
"before" set for node 7 (x = a + b) has empty, since
the "after" sets of the two predecessors have (a,2), (b,3),
and (a,3), (b,2), respectively, and the intersection of
those two sets is empty.
Given that value for 7.before, the equations require that
7.after (and 8.before) say that x is not constant.
We can only discover that x is constant after node 7 if both a and b
are constant before node 7.
n3.before = n1.after meet n2.after
n3.after = f3( n3.before )

The two mini-functions, g11 and g12, are:

g11(a, b) = a meet b
g12(a) = f3( a )
f( <n1.before,n1.after,n2.before,n2.after ...> ) =
<g11(..),g12(...),g21(..), g22(...),...>
Where the (...)s are replaced with the appropriate arguments to those
mini-functions. In other words, function f takes one argument that is
a tuple of values. It returns a tuple of values, too. The returned
tuple is computed by applying the mini-functions associated with each
of the dataflow equations to the appropriate inputs (which are part
of the tuple of values that is the argument to function f).
n.before = m1.after meet m2.after meet ... meet mk.after
and the corresponding mini-function is:
f(a1, a2, ..., ak) = a1 meet a2 meet ... meet ak
We can prove that these mini-functions are monotonic by induction on k.
f(a1) = a1
f(a1, a2) = a1 meet a2
(a1 meet a2) ⊆ a1 , and
(a1 meet a2) ⊆ a2
(a1 meet a2) ⊆ a1 ⊆ a1' implies (a1 meet a2) ⊆ a1', and
(a1 meet a2) ⊆ a2 ⊆ a2' implies (a1 meet a2) ⊆ a2'
a1 ⊆ a1' and a2 ⊆ a2' and ... and an-1 ⊆ an-1' implies
f(a1, ..., an-1) ⊆ f(a1', ..., an-1')

Now we must show the same thing for k=n. Letting
x = f(a1, ..., an-1) and x' = f(a1', ..., an-1'),
we need to show: x meet an ⊆ x' meet an'
t1 = <e1, e2, ..., en>, and
t2 = <e1', e2', ..., en'>
such that t1 ⊆ t2, we must show f(t1) ⊆ f(t2).
Recall that, for a cross-product lattice, the ordering is element-wise;
thus, t1 ⊆ t2 means: ek ⊆ ek', for all k.
We know that all of the mini-functions g are monotonic, so for all k,
gk(ek) ⊆ gk(ek').
But since the ordering is element-wise, this is exactly what it means for
f to be monotonic!
This is not quite what the iterative algorithm does,
but it is not hard to see that it is equivalent to one that does
just this: initialize
all n.before and n.after to top, then on each iteration,
compute all of the "mini-functions" (i.e., recompute
n.before and n.after for all nodes) simultaneously,
terminating when there is no change.
The actual iterative algorithm presented here is an optimization
in that it only recomputes n.before and n.after for a node n
when the "after" value of some predecessor has changed.
SUMMARY
Given:
- a CFG,
- a complete lattice L of dataflow facts,
- a combining (meet) operator ⌈⌉ , and
- a monotonic dataflow function for each CFG node,
the goal of dataflow analysis is to compute a "dataflow fact"
(an element of L) for each CFG node.
Ideally, we want the MOP (meet over all paths) solution, in which
the fact at node n is the combination of the facts induced by all paths to n.
However, for CFGs with cycles, it is not possible to compute this solution
directly.