Pointer Analysis

Overview

For most dataflow-analysis problems, the definition of the dataflow function for each statement (each CFG node) depends on what locations are defined and/or used by that statement. In the absence of aliasing, the def/use sets can be determined simply by looking at the statement. For example, for the statement x = y + z, we can see that x is defined, while y and z are used. (Call statements are an exception: as we've seen, even in a language without aliasing, interprocedural analysis is needed to determine what variables might be defined and/or used by a procedure call.)

Pointers (i.e., variables that hold the addresses of other quantities) introduce aliasing into a program. For example, consider the following code:

```
x = 0;
*p = 1;
print(x);
```
To solve dataflow problems like constant propagation, reaching definitions, or live variables, we need to know what variables might be modified by the statement *p = 1, which in turn requires knowing what locations p might point to at line 2. To determine that information we need to use some kind of pointer analysis.

The goal of pointer analysis is to determine safe points-to information for each CFG node; i.e., to determine for each CFG node, for each variable v, what locations v might be pointing to. Since we are interested in locations that might be pointed to, safe means a superset of the actual information; i.e., we want to err on the side of including too many locations in a points-to set, not too few.
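As a small illustration of how a client analysis would consume this information, the sketch below computes the set of variables that a statement may define, given a points-to map. The function name `may_defs`, the map `points_to`, and the tuple encoding of statements are assumptions made for this example; they are not part of any particular framework.

```python
# Hypothetical encoding: ("assign", "x") stands for x = ...,
# and ("store", "p") stands for *p = ...
def may_defs(stmt, points_to):
    """Return the set of variables that stmt might modify."""
    kind, var = stmt
    if kind == "assign":
        return {var}                         # x = ... defines exactly x
    if kind == "store":
        return set(points_to.get(var, ()))   # *p = ... may define any target of p
    raise ValueError(f"unknown statement kind: {kind}")


# Safe (possibly too large) points-to information for p at the store.
points_to = {"p": {"x", "y"}}
print(sorted(may_defs(("assign", "x"), points_to)))  # ['x']
print(sorted(may_defs(("store", "p"), points_to)))   # ['x', 'y']
```

Because the points-to set is a superset of the truth, the may-def set is too; a client such as reaching definitions stays safe but may lose precision.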

We will describe a few ways to do pointer analysis for a C-like language, including single-procedure analysis and whole-program analysis.

Flow-Sensitive Pointer Analysis

To do intraprocedural pointer analysis using our usual lattice model and iterative algorithm, we need to determine the following:
• What is the lattice of dataflow facts (and its meet operation)?
• What dataflow function should be used for each kind of statement (including calls)?
• What safe information should be used as the initial dataflow fact at the start of each procedure?
One standard way to define the lattice of dataflow facts is using mappings from variables to sets of possibly pointed-to locations. Since we're ignoring heap storage for now, the possibly pointed-to locations will also be variables. And since we want the set of possibly pointed-to locations, the meet operation needs to include a location in a variable's points-to set whenever the location is in that set in either operand; i.e., the two sets of possibly pointed-to locations must be unioned together. An example CFG and its dataflow facts are given below.
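The meet operation just described can be sketched as a pointwise union. This is a minimal sketch that assumes a dataflow fact is represented as a Python dict from variable names to sets of variable names, with a missing key standing for the empty set; the name `meet` is chosen for this example.

```python
def meet(fact_a, fact_b):
    """Pointwise union of two points-to facts (dict: variable -> set of variables)."""
    result = {}
    for var in fact_a.keys() | fact_b.keys():
        result[var] = fact_a.get(var, set()) | fact_b.get(var, set())
    return result


# Facts arriving at a join point from two CFG predecessors.
from_then = {"p": {"a"}, "q": {"c"}}
from_else = {"p": {"b"}}
joined = meet(from_then, from_else)
print(sorted(joined["p"]), sorted(joined["q"]))  # ['a', 'b'] ['c']
```

Under this representation the fact that maps every variable to the empty set is the identity for `meet`.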

TEST YOURSELF #1

For each of the following kinds of statements, say which variables' points-to sets would be changed by the corresponding dataflow function, and how they would be changed.

x = &y

x = y

x = *y

*x = y

Unfortunately, given no information about other procedures we would have to use very conservative dataflow functions for calls, and very conservative initial dataflow facts (for the entry nodes of the procedures). In particular, for a call we would have to update the points-to sets for all of the following:
• global variables;
• locals that may be pointed to (directly or transitively) by a global or actual parameter before the call;
• actual parameters whose addresses are passed (e.g., &x).
Every variable in these three categories would have to be added to the points-to set of every variable in these three categories. And for the entry node of a procedure other than main we would have to assume that all globals and formals might point to all globals and formals.
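The conservative treatment of a call can be sketched as follows. This is one possible reading of the rule above, not a definitive implementation: the names `reachable` and `conservative_call` and their parameters are assumptions, and the sketch also follows points-to edges out of address-taken actuals, treating &x as an actual that points to x.

```python
def reachable(roots, points_to):
    """Variables reachable from roots by following one or more points-to edges."""
    seen, stack = set(), []
    for root in roots:
        stack.extend(points_to.get(root, ()))
    while stack:
        var = stack.pop()
        if var not in seen:
            seen.add(var)
            stack.extend(points_to.get(var, ()))
    return seen


def conservative_call(points_to, global_vars, pointer_actuals, address_taken):
    """Points-to fact after a call to a procedure we know nothing about."""
    roots = set(global_vars) | set(pointer_actuals) | set(address_taken)
    # Everything the callee might be able to modify.
    affected = set(global_vars) | set(address_taken) | reachable(roots, points_to)
    after = {var: set(targets) for var, targets in points_to.items()}
    for var in affected:
        after.setdefault(var, set()).update(affected)
    return after


# g is a global; a, b, c, d are locals of the caller; the call is f(&c).
before = {"g": {"a"}, "a": {"b"}, "d": {"a"}}
after = conservative_call(before, global_vars={"g"}, pointer_actuals=set(),
                          address_taken={"c"})
print(sorted(after["g"]))  # ['a', 'b', 'c', 'g']
print(sorted(after["d"]))  # ['a']
```

The local d is untouched because the callee has no way to reach it, even though d itself points to an affected variable.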

Note that at the start of a procedure, a global or formal may also point (directly or transitively) to a local variable of some other procedure. However, this is not relevant for most uses of the results of pointer analysis. For example, when doing live-variable analysis on this sequence of statements:

```
x = 0;
y = 1;
*p = 1;
z = x+y;
```
we only care whether p may point to x or to y (to determine whether x is live immediately after the first assignment and whether y is live immediately after the second assignment); we don't care whether p may point to a local of another procedure.
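To make the dependence on p's points-to set concrete, here is a hedged sketch of backward live-variable analysis over those four statements. The statement encoding and the names `live_before` and `live_after_each` are assumptions for this example. The sketch also assumes that a singleton points-to set means p definitely points to that one variable, so the store kills it; that is only sound if p is known to be initialized and non-null at the store.

```python
def live_before(stmt, live_after, points_to):
    """Live variables just before stmt, given those live just after it."""
    kind, target, uses = stmt
    if kind == "assign":                      # target = f(uses)
        return (live_after - {target}) | set(uses)
    if kind == "store":                       # *target = f(uses)
        targets = points_to.get(target, set())
        # Assumption: only a singleton set lets us kill the target for certain.
        killed = set(targets) if len(targets) == 1 else set()
        return (live_after - killed) | {target} | set(uses)
    raise ValueError(f"unknown statement kind: {kind}")


def live_after_each(stmts, live_at_end, points_to):
    """Live sets immediately after each statement of a straight-line sequence."""
    live, result = set(live_at_end), []
    for stmt in reversed(stmts):
        result.append(set(live))
        live = live_before(stmt, live, points_to)
    result.reverse()
    return result


# x = 0; y = 1; *p = 1; z = x + y;
stmts = [("assign", "x", []), ("assign", "y", []),
         ("store", "p", []), ("assign", "z", ["x", "y"])]

only_x = live_after_each(stmts, {"z"}, {"p": {"x"}})
x_or_y = live_after_each(stmts, {"z"}, {"p": {"x", "y"}})
print("x" in only_x[0], "x" in x_or_y[0])  # False True
```

When p can only point to x, the first assignment is dead; when p may point to x or y, neither variable can be killed by the store, so both stay live.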

We can get better results using the supergraph rather than using safe dataflow functions for calls and enter nodes. Another option is to use less precise but also less expensive flow-insensitive algorithms for pointer analysis, discussed below.

Flow-Insensitive Pointer Analysis

Flow-insensitive analysis ignores the flow of control in a program, and simply computes one dataflow fact for the whole program. One way to think about the goal of flow-insensitive analysis is that you model a program whose CFG has n nodes as a single node with n self-loop edges. Each edge has one of the dataflow functions from the nodes of the original CFG. Then you apply Kildall's iterative algorithm. Of course, the idea is to find a more efficient way to get the same result.

In general, even perfect flow-insensitive analysis provides very conservative results. It is not likely to be useful for dataflow problems like constant propagation, live variables, or reaching definitions. However, it seems to be useful in practice for pointer analysis.

Below we discuss a flow-insensitive pointer-analysis technique. The technique assumes that the program being analyzed has been normalized so that there is no more than one pointer dereference per statement. For example, the statements

```
x = **y;
*x = *z;
```
would be replaced by
```
tmp = *y;
x = *tmp;
tmp = *z;
*x = tmp;
```
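The normalization step can be sketched as a small rewriting function. This is an illustrative sketch only: the encoding of a statement as two dereference counts plus two variable names, and the names `normalize` and `fresh_names`, are assumptions. Unlike the example above, the sketch uses a distinct temporary for each split, which keeps a flow-insensitive analysis from merging the points-to sets of unrelated temporaries.

```python
import itertools


def fresh_names(prefix="tmp"):
    """Generate tmp1, tmp2, ... (hypothetical naming scheme)."""
    return (f"{prefix}{i}" for i in itertools.count(1))


def normalize(lhs_derefs, lhs, rhs_derefs, rhs, fresh):
    """Split `*..*lhs = *..*rhs` into statements with at most one dereference."""
    out = []
    # If the left side keeps a dereference, the right side must become a plain variable.
    keep_on_rhs = 1 if lhs_derefs == 0 else 0
    while rhs_derefs > keep_on_rhs:
        tmp = next(fresh)
        out.append(f"{tmp} = *{rhs};")
        rhs, rhs_derefs = tmp, rhs_derefs - 1
    # Peel all but the last dereference off the left side.
    while lhs_derefs > 1:
        tmp = next(fresh)
        out.append(f"{tmp} = *{lhs};")
        lhs, lhs_derefs = tmp, lhs_derefs - 1
    out.append(f"{'*' * lhs_derefs}{lhs} = {'*' * rhs_derefs}{rhs};")
    return out


fresh = fresh_names()
print(normalize(0, "x", 2, "y", fresh))  # ['tmp1 = *y;', 'x = *tmp1;']
print(normalize(1, "x", 1, "z", fresh))  # ['tmp2 = *z;', '*x = tmp2;']
```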

Andersen's Analysis

The way to understand Andersen's analysis is that it processes each statement in the program (in arbitrary order), building a points-to graph, until there is no change. (In a points-to graph, the edge a → b means that a may point to b.) For example, suppose we're given the following set of statements:
```
p = &a;
p = &b;
m = &p;
r = *m;
q = &c;
m = &q;
```
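After one pass over this set of statements (in the order given) the points-to graph has the edges p → a, p → b, m → p, r → a, r → b, q → c, and m → q (shown below on the left). A second pass adds one more edge, r → c, because by then m also points to q and q points to c; this gives the final graph (shown below on the right, with the edge added during the second iteration shown in red).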
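The pass-by-pass process described above can be sketched directly. This is the naive formulation only: every statement is reprocessed until nothing changes. The tuple encoding of statements and the name `andersen_naive` are assumptions for this example.

```python
from collections import defaultdict


def andersen_naive(statements):
    """Re-process every statement until no points-to edge is added."""
    pts = defaultdict(set)          # variable -> set of variables it may point to
    changed = True
    while changed:
        changed = False
        for kind, lhs, rhs in statements:
            if kind == "addr":      # lhs = &rhs
                updates = [(lhs, {rhs})]
            elif kind == "copy":    # lhs = rhs
                updates = [(lhs, set(pts[rhs]))]
            elif kind == "load":    # lhs = *rhs
                new = set()
                for target in list(pts[rhs]):
                    new |= pts[target]
                updates = [(lhs, new)]
            elif kind == "store":   # *lhs = rhs
                updates = [(target, set(pts[rhs])) for target in list(pts[lhs])]
            else:
                raise ValueError(f"unknown statement kind: {kind}")
            for var, targets in updates:
                if not targets <= pts[var]:
                    pts[var] |= targets   # edges are only ever added, never removed
                    changed = True
    return dict(pts)


example = [("addr", "p", "a"), ("addr", "p", "b"), ("addr", "m", "p"),
           ("load", "r", "m"), ("addr", "q", "c"), ("addr", "m", "q")]
graph = andersen_naive(example)
print(sorted(graph["r"]))  # ['a', 'b', 'c']
```

On these six statements the loop runs three times: two passes that add edges, plus a final pass that confirms nothing changes.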

The actual algorithm is more efficient than this. Instead of building a points-to graph directly, it builds and manipulates a constraint structure, and only re-evaluates the effects of a statement that might cause a change in the points-to information. Nevertheless, the worst-case time for Andersen's algorithm is cubic (O(N^3)) in the size of the program (assuming that the number of variables is linear in the size of the program).

To get some intuition for the N^3 worst-case time, note that the most expensive assignment statements to process are those that involve a level of indirection; i.e., x = *y or *y = x. For x = *y, we need to look at everything in the points-to set of y, and then at everything in each of those variables' points-to sets. For *y = x, we need to look at everything in the points-to set of y, and add everything in the points-to set of x to each of those variables' points-to sets. In the worst case, y could point to all N variables in the program, and each of the other sets involved could also contain all N variables. Thus, processing one such statement could involve O(N^2) work. Since there are O(N) statements in the program, the total amount of work could be as much as O(N^3).
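One common way to realize the constraint-based idea mentioned above is a worklist over subset constraints, revisiting a variable only when its points-to set grows. The sketch below is one such formulation under stated assumptions, not necessarily the exact structure these notes have in mind; the statement encoding and all names (`andersen_worklist`, `succ`, `loads`, `stores`) are choices made for this example.

```python
from collections import defaultdict, deque


def andersen_worklist(statements):
    """Worklist-style inclusion-based points-to analysis (illustrative sketch)."""
    pts = defaultdict(set)      # variable -> possible targets
    succ = defaultdict(set)     # v -> {w : pts[v] must be a subset of pts[w]}
    loads = defaultdict(set)    # y -> {x : x = *y}
    stores = defaultdict(set)   # y -> {x : *y = x}
    work = deque()

    def add_targets(var, targets):
        new = targets - pts[var]
        if new:
            pts[var] |= new
            work.append(var)            # revisit var only because its set grew

    def add_edge(src, dst):
        if dst not in succ[src]:
            succ[src].add(dst)
            add_targets(dst, set(pts[src]))

    for kind, lhs, rhs in statements:
        if kind == "addr":              # lhs = &rhs
            add_targets(lhs, {rhs})
        elif kind == "copy":            # lhs = rhs
            add_edge(rhs, lhs)
        elif kind == "load":            # lhs = *rhs
            loads[rhs].add(lhs)
        elif kind == "store":           # *lhs = rhs
            stores[lhs].add(rhs)
        else:
            raise ValueError(f"unknown statement kind: {kind}")

    while work:
        var = work.popleft()
        for target in list(pts[var]):
            for dst in loads[var]:      # dst = *var: target's set flows into dst
                add_edge(target, dst)
            for src in stores[var]:     # *var = src: src's set flows into target
                add_edge(src, target)
        for dst in list(succ[var]):
            add_targets(dst, set(pts[var]))
    return {var: targets for var, targets in pts.items() if targets}


example = [("addr", "p", "a"), ("addr", "p", "b"), ("addr", "m", "p"),
           ("load", "r", "m"), ("addr", "q", "c"), ("addr", "m", "q")]
print(sorted(andersen_worklist(example)["r"]))  # ['a', 'b', 'c']
```

The nested loops over `pts[var]` and the propagation along `succ` correspond to the two levels of points-to sets in the cost argument above.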

TEST YOURSELF #2

Draw the points-to graph that would be produced by Andersen's analysis for the following set of statements:

```
p1 = &a
p2 = &b
p3 = &p2
p1 = p2
p4 = *p3
*p3 = &c
```