Defining a Dataflow Problem

Before thinking about how to define a dataflow problem, note that there are two kinds of problems:

Forward problems (like constant propagation), where the information at a node n summarizes what can happen on paths from "enter" to n.
Backward problems (like live-variable analysis), where the information at a node n summarizes what can happen on paths from n to "exit".

In what follows, we will assume that we're thinking about a forward problem unless otherwise specified.

Many common dataflow problems can also be categorized as may problems or must problems. The solution to a "may" problem provides information about what may be true at each program point (e.g., for live-variable analysis, a variable is considered live after node n if its value may be used before being overwritten), while the solution to a "must" problem provides information about what must be true at each program point (e.g., for constant propagation, the mapping x → c holds before node n if x must have the value c at that point).

Now let's think about how to define a dataflow problem so that it's clear what the (best) solution should be. When we do dataflow analysis "by hand", we look at the CFG and think about:

What information holds at the start of the program.
When a node n has more than one incoming edge in the CFG, how to combine the incoming information (i.e., given the information that holds after each predecessor of n, how to combine that information to determine what holds before n).
How the execution of each node changes the information.

This intuition leads to the following definition. An instance of a dataflow problem includes:

a CFG,
a domain D of "dataflow facts",
a dataflow fact "init" (the information true at the start of the program for forward problems, or at the end of the program for backward problems),
an operator ⌈⌉ (used to combine incoming information from multiple predecessors),
for each CFG node n, a dataflow function f_n : D → D (that defines the effect of executing n).
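
To make the definition concrete, here is a minimal Python sketch that bundles the five components into one record. The class name DataflowProblem, its field names, and the two-node example CFG are all assumptions made for illustration; none of them come from these notes.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Generic, Hashable, List, TypeVar

D = TypeVar("D")  # the type of a single dataflow fact


@dataclass
class DataflowProblem(Generic[D]):
    """One instance of a dataflow problem (all names here are illustrative)."""
    successors: Dict[Hashable, List[Hashable]]   # the CFG: node -> successor nodes
    init: D                                      # fact at "enter" (or at "exit" if backward)
    meet: Callable[[D, D], D]                    # the combining operator
    transfer: Dict[Hashable, Callable[[D], D]]   # the dataflow function f_n for each node n
    forward: bool = True                         # False for backward problems


# A live-variable instance for the straight-line CFG:  n1: x = 1  ->  n2: y = x
live = DataflowProblem(
    successors={"n1": ["n2"], "n2": []},
    init=frozenset(),                            # nothing is live at the end of the program
    meet=lambda a, b: a | b,                     # set union
    transfer={
        "n1": lambda s: s - {"x"},               # defines x, uses nothing
        "n2": lambda s: (s - {"y"}) | {"x"},     # defines y, uses x
    },
    forward=False,
)
```

Because live-variable analysis is a backward problem, applying live.transfer["n2"] to live.init gives the set of variables that are live before n2.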

For constant propagation, the domain of dataflow facts is the set of all possible maps from the variables in the program to values; for live-variable analysis, it is the power set of the set of variables in the program.

For constant propagation, the "init" fact is the empty mapping (no variables have constant values at the start of the program). For live-variable analysis, the "init" fact is the empty set (no variables are live at the end of the program).

For constant propagation, the combining operation ⌈⌉ is essentially intersection (x → v is in ⌈⌉(d1, d2) iff it is in both d1 and d2): if a node n has two predecessors, p1 and p2, then variable x has value v before node n iff it has value v after both p1 and p2. For live-variable analysis, ⌈⌉ is set union: if a node n has two successors, s1 and s2, then the value of x after n may be used before being overwritten iff that holds either before s1 or before s2. In general, for "may" dataflow problems, ⌈⌉ will be some union-like operator, while it will be an intersection-like operator for "must" problems.
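
The two combining operators described above can be sketched in a few lines of Python. We assume here (as an illustration only) that a constant-propagation fact is a dict from variable names to values and that a live-variable fact is a set of variable names; the function names are hypothetical.

```python
def meet_constants(d1, d2):
    """Intersection-like combine: keep x -> v only if both maps contain it."""
    return {x: v for x, v in d1.items() if x in d2 and d2[x] == v}


def meet_live(s1, s2):
    """Union combine: x is live if it is live on either outgoing path."""
    return s1 | s2


# Node n has predecessors p1 and p2, which agree on x but disagree on y.
after_p1 = {"x": 1, "y": 2}
after_p2 = {"x": 1, "y": 3}
before_n = meet_constants(after_p1, after_p2)   # only x -> 1 survives
```

Note that a variable mapped to different values on the two incoming paths is dropped, just like a variable that is missing from one of the maps.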

For constant propagation, the dataflow function associated with a CFG node that does not assign to any variable (e.g., a predicate) is the identity function. For a node n that assigns to a variable x, the dataflow function determines whether the assigned value is a constant c (using the mapping of variables to values that holds before n executes to evaluate the right-hand side of the assignment); if so, the dataflow-function result is the same as its input except that x is mapped to c; if not, the function result is the same as its input except that there is no mapping at all for x (x is not constant after n).

For live-variable analysis, the dataflow function for each node n has the form: f_n(S) = (S - KILL(n)) union GEN(n), where KILL(n) is the set of variables defined at node n, and GEN(n) is the set of variables used at node n. In other words, for a node that does not assign to any variable, the variables that are live before n are those that are live after n plus those that are used at n; for a node that assigns to variable x, the variables that are live before n are those that are live after n except x, plus those that are used at n (including x if it is used at n as well as being defined there).
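
The dataflow functions just described can be sketched as follows. The expression encoding is an assumption chosen to keep the example short: a right-hand side is an int literal, a variable name, or a tuple (op, left, right). The helper names evaluate, const_prop_assign, and live_transfer are likewise hypothetical.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def evaluate(expr, env):
    """Return the constant value of expr under env, or None if it is not constant."""
    if isinstance(expr, int):          # literal
        return expr
    if isinstance(expr, str):          # variable: constant only if env maps it
        return env.get(expr)
    op, left, right = expr             # binary operation
    lval, rval = evaluate(left, env), evaluate(right, env)
    if lval is None or rval is None:
        return None
    return OPS[op](lval, rval)


def const_prop_assign(x, rhs):
    """Build f_n for the assignment node `x = rhs` (constant propagation)."""
    def f(env):
        out = {v: c for v, c in env.items() if v != x}   # drop any old mapping for x
        value = evaluate(rhs, env)                       # uses the facts before n
        if value is not None:
            out[x] = value                               # x is the constant `value` after n
        return out
    return f


def live_transfer(kill, gen):
    """Build f_n(S) = (S - KILL(n)) union GEN(n) (live-variable analysis)."""
    def f(live_after):
        return (live_after - kill) | gen
    return f
```

For example, const_prop_assign("y", ("+", "x", 1)) applied to {"x": 2} adds the mapping y → 3, while live_transfer({"x"}, {"x", "y"}) models the node x = x + y, where x is both used and defined and therefore remains live before the node.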

An equivalent way of formulating the dataflow functions for live-variable analysis is: f_n(S) = (S intersect NOT-KILL(n)) union GEN(n), where NOT-KILL(n) is the set of variables not defined at node n. The advantage of this formulation is that it permits the dataflow facts to be represented using bit vectors, and the dataflow functions to be implemented using simple bit-vector operations (bitwise AND and OR).
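
As an illustration of the bit-vector formulation, the sketch below uses a Python int as the bit vector, with one bit per variable. The three-variable program and the helper names (to_bits, to_names, make_transfer) are assumptions for the example.

```python
VARIABLES = ["x", "y", "z"]                      # assumed variables of the program
BIT = {v: 1 << i for i, v in enumerate(VARIABLES)}
ALL = (1 << len(VARIABLES)) - 1                  # mask with one bit set per variable


def to_bits(names):
    """Encode a set of variable names as a bit vector."""
    bits = 0
    for name in names:
        bits |= BIT[name]
    return bits


def to_names(bits):
    """Decode a bit vector back into a set of variable names."""
    return {v for v in VARIABLES if bits & BIT[v]}


def make_transfer(kill, gen):
    """Build f_n(S) = (S AND NOT-KILL(n)) OR GEN(n) over bit vectors."""
    not_kill = ALL & ~to_bits(kill)              # precomputed once per node
    gen_bits = to_bits(gen)
    return lambda s: (s & not_kill) | gen_bits


f = make_transfer(kill={"x"}, gen={"y"})         # the node  x = y
live_before = to_names(f(to_bits({"x", "z"})))   # x is killed, y is generated, z passes through
```

Precomputing NOT-KILL(n) and GEN(n) as masks means that each application of f_n costs one AND and one OR, no matter how many variables the vector covers (up to the machine word size in a language with fixed-width integers).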

It turns out that a number of interesting dataflow problems have dataflow functions of this same form, where GEN and KILL are sets, and the combining operator ⌈⌉ is either union or intersection. These problems are called GEN/KILL problems, or bit-vector problems.

TEST YOURSELF #2

Consider using dataflow analysis to determine which variables might be used before being initialized; i.e., to determine, for each point in the program, which variables might be uninitialized at that point. Define an instance of the "may-be-uninitialized" dataflow problem for the running example program by specifying:

the domain D of "dataflow facts",
the operator ⌈⌉,
the "init" dataflow fact,
a dataflow function for each CFG node n.

Annotate the CFG with the solution to the problem (the information that holds before each node).

If you did not define the may-be-uninitialized problem as a GEN/KILL problem, go back and do that now (i.e., say what the GEN and KILL sets should be for each kind of CFG node, and whether ⌈⌉ should be union or intersection).
