Pointers (i.e., variables that hold the addresses of other quantities) introduce aliasing into a program. For example, consider the following code:
x = 0; *p = 1; print(x);
To solve dataflow problems like constant propagation, reaching definitions, or live variables, we need to know which variables might be modified by the statement *p = 1, which in turn requires knowing which locations p might point to at that statement (the second one above). To determine that information, we need some kind of pointer analysis.
The goal of pointer analysis is to determine safe points-to information for each CFG node; i.e., to determine for each CFG node, for each variable v, what locations v might be pointing to. Since we are interested in locations that might be pointed to, safe means a superset of the actual information; i.e., we want to err on the side of including too many locations in a points-to set, not too few.
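To make the "safe means superset" requirement concrete, here is a minimal sketch. It assumes points-to information is stored as a Python dict from variable names to sets of locations; the names `may_modify` and `is_safe` and the sample maps are hypothetical and used only for illustration.

```python
def may_modify(points_to, ptr):
    """Variables that an assignment through *ptr might overwrite."""
    return set(points_to.get(ptr, set()))


def is_safe(computed, actual):
    """True if every computed points-to set is a superset of the actual one."""
    return all(targets <= computed.get(var, set())
               for var, targets in actual.items())


# Hypothetical ground truth: at run time p only ever points to x.
actual = {"p": {"x"}}
conservative = {"p": {"x", "y"}}  # too many locations: imprecise but safe
unsafe = {"p": {"y"}}             # misses x: a client could wrongly keep x = 0

print(is_safe(conservative, actual))
print(is_safe(unsafe, actual))
print("x" in may_modify(conservative, "p"))
```

With the conservative map, a client such as constant propagation would give up on both x and y after *p = 1, which loses precision but is never wrong. With the unsafe map it would still report x as the constant 0, which is incorrect.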
We will describe a few ways to do pointer analysis for a C-like language, including single-procedure analysis and whole-program analysis.
(Note: throughout, we will ignore complicated language aspects like heap-allocated storage, arrays, structures, and casting; however, the techniques discussed can be extended to account for such language features.)
For each of the following kinds of statements, say which variables' points-to sets would be changed by the corresponding dataflow function, and how they would be changed.
x = y
x = *y
*x = y
Note that at the start of a procedure, a global or formal may also point (directly or transitively) to a local variable of some other procedure. However, this is not relevant for most uses of the results of pointer analysis. For example, when doing live-variable analysis on this sequence of statements:
x = 0; y = 1; *p = 1; z = x+y;
we only care whether p may point to x or to y (to determine whether x is live immediately after the first assignment and whether y is live immediately after the second assignment); we don't care whether p may point to a local of another procedure.
We can get better results using the supergraph rather than using safe dataflow functions for calls and enter nodes. Another option is to use less precise but also less expensive flow-insensitive algorithms for pointer analysis, discussed below.
Flow-insensitive analysis ignores the flow of control in a program, and simply computes one dataflow fact for the whole program. One way to think about the goal of flow-insensitive analysis is that you model a program whose CFG has n nodes as a single node with n self-loop edges. Each edge has one of the dataflow functions from the nodes of the original CFG. Then you apply Kildall's iterative algorithm. Of course, the idea is to find a more efficient way to get the same result.
In general, even perfect flow-insensitive analysis provides very conservative results. It is not likely to be useful for dataflow problems like constant propagation, live variables, or reaching definitions. However, it seems to be useful in practice for pointer analysis.
Below we discuss a flow-insensitive pointer-analysis technique. The technique assumes that the program being analyzed has been normalized so that there is no more than one pointer dereference per statement. For example, the statements
x = **y; *x = *z;
would be replaced by
tmp = *y; x = *tmp; tmp = *z; *x = tmp;
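The normalization step can be sketched as follows. This is an illustrative helper, not part of the technique itself: the function name `normalize` is hypothetical, statements are assumed to be simple strings of the form `lhs = rhs` with no spaces after `*` or `&`, and the sketch generates a fresh temporary (`tmp1`, `tmp2`, ...) each time rather than reusing a single `tmp`.

```python
import itertools


def normalize(stmts):
    """Rewrite statements so each has at most one pointer dereference."""
    counter = itertools.count(1)
    out = []
    for stmt in stmts:
        lhs, rhs = (side.strip() for side in stmt.split("="))
        l_name, l_stars = lhs.lstrip("*"), lhs.count("*")
        r_name, r_stars = rhs.lstrip("*"), rhs.count("*")
        # Peel dereferences off the right-hand side into temporaries
        # until the statement as a whole has at most one dereference.
        while r_stars > 1 or (r_stars == 1 and l_stars >= 1):
            tmp = f"tmp{next(counter)}"
            out.append(f"{tmp} = *{r_name}")
            r_name, r_stars = tmp, r_stars - 1
        # Peel any extra dereferences off the left-hand side.
        while l_stars > 1:
            tmp = f"tmp{next(counter)}"
            out.append(f"{tmp} = *{l_name}")
            l_name, l_stars = tmp, l_stars - 1
        out.append(f"{'*' * l_stars}{l_name} = {'*' * r_stars}{r_name}")
    return out


normalized = normalize(["x = **y", "*x = *z"])
print("; ".join(normalized))
```

For the two statements above this yields the same shape as the hand-normalized version, except that the second temporary gets its own name. Using distinct temporaries is a design choice of the sketch: a flow-insensitive analysis merges everything ever assigned to a variable, so sharing one temporary can only make the result less precise.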
p = &a; p = &b; m = &p; r = *m; q = &c; m = &q;
After one pass over this set of statements we would have the points-to graph shown below on the left, and after two iterations we would have the (final) graph shown below on the right (with the new edge added during the second iteration shown in red).
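The pass-by-pass behavior described above can be reproduced with a naive fixpoint loop. The sketch below is an assumption-laden illustration rather than the actual algorithm: it assumes normalized statements given as strings, represents the points-to graph as a dict of sets, and uses the hypothetical name `andersen_naive`.

```python
from collections import defaultdict


def andersen_naive(stmts):
    """Repeatedly apply every statement until no points-to set grows."""
    pts = defaultdict(set)  # variable -> set of locations it may point to
    passes = 0
    changed = True
    while changed:
        changed = False
        passes += 1
        for stmt in stmts:
            lhs, rhs = (s.strip() for s in stmt.split("="))
            # Locations the right-hand side may evaluate to.
            if rhs.startswith("&"):
                values = {rhs[1:]}
            elif rhs.startswith("*"):
                values = set().union(*(pts[v] for v in pts[rhs[1:]]))
            else:
                values = set(pts[rhs])
            # Variables whose points-to sets receive those locations.
            targets = set(pts[lhs[1:]]) if lhs.startswith("*") else {lhs}
            for t in targets:
                if not values <= pts[t]:
                    pts[t] |= values
                    changed = True
    return dict(pts), passes


stmts = ["p = &a", "p = &b", "m = &p", "r = *m", "q = &c", "m = &q"]
graph, passes = andersen_naive(stmts)
for var in sorted(graph):
    print(var, "->", sorted(graph[var]))
```

On this example the first pass gives r the targets a and b (because m points only to p when r = *m is processed), and the second pass adds the edge from r to c once m also points to q. The loop runs a third pass only to confirm that nothing changes.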
The actual algorithm is more efficient than this. Instead of building a points-to graph directly, it builds and manipulates a constraint structure, and only re-evaluates the effects of a statement that might cause a change in the points-to information. Nevertheless, the worst-case time for Andersen's algorithm is cubic (O(N^{3})) in the size of the program (assuming that the number of variables is linear in the size of the program).
To get some intuition on that N^{3} worst-case time, note that the most expensive assignment statements to process are those that involve a level of indirection; i.e., x = *y or *y = x. In both cases, we need to look at everything in the points-to set of y, and then everything in each of those points-to sets. In the worst case, y could point to all N variables in the program, and each of those could itself point to all N variables. Thus, processing the statement could involve O(N^{2}) work. Since there are O(N) statements in the program, the total amount of work could be as much as O(N^{3}).
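One common way to realize the "constraint structure" idea is a worklist over subset constraints, sketched below. This is a simplified illustration under stated assumptions, not necessarily the formulation the notes have in mind: the name `andersen_worklist`, the string statement format, and the `_t` temporaries used for statements of the form `*y = &a` are all hypothetical.

```python
from collections import defaultdict, deque


def andersen_worklist(stmts):
    pts = defaultdict(set)     # variable -> locations it may point to
    succ = defaultdict(set)    # edge v -> w means pts[v] is a subset of pts[w]
    loads = defaultdict(set)   # y -> {x : x = *y}
    stores = defaultdict(set)  # y -> {x : *y = x}
    work = deque()

    def add_edge(src, dst):
        # Record a new subset constraint and revisit src to propagate along it.
        if dst not in succ[src]:
            succ[src].add(dst)
            work.append(src)

    for i, stmt in enumerate(stmts):
        lhs, rhs = (s.strip() for s in stmt.split("="))
        if rhs.startswith("&"):
            if lhs.startswith("*"):  # *y = &a: route through a temporary
                tmp = f"_t{i}"
                stores[lhs[1:]].add(tmp)
                lhs = tmp
            pts[lhs].add(rhs[1:])
            work.append(lhs)
        elif rhs.startswith("*"):
            loads[rhs[1:]].add(lhs)
        elif lhs.startswith("*"):
            stores[lhs[1:]].add(rhs)
        else:
            add_edge(rhs, lhs)

    while work:
        v = work.popleft()
        for a in list(pts[v]):
            for x in loads[v]:   # x = *v and v may point to a
                add_edge(a, x)
            for x in stores[v]:  # *v = x and v may point to a
                add_edge(x, a)
        for w in succ[v]:
            if not pts[v] <= pts[w]:
                pts[w] |= pts[v]
                work.append(w)   # only variables whose sets grew are revisited
    return {v: s for v, s in pts.items() if s and not v.startswith("_t")}


result = andersen_worklist(
    ["p = &a", "p = &b", "m = &p", "r = *m", "q = &c", "m = &q"])
for var in sorted(result):
    print(var, "->", sorted(result[var]))
```

The indirect statements show up here as the nested loops over pts[v] and the load/store lists, which is where the quadratic per-statement cost described above comes from. The sketch makes no attempt at further optimizations such as propagating only newly added locations or collapsing cycles.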
Draw the points-to graph that would be produced by Andersen's analysis for the following set of statements:
p1 = &a; p2 = &b; p3 = &p2; p1 = p2; p4 = *p3; *p3 = &c;