If the only source of aliasing is reference parameters, we can use the GMOD/GREF computations discussed in the paper Fast Interprocedural Alias Analysis by Cooper and Kennedy (or in the on-line notes on interprocedural analysis) to determine, for each reference parameter, to what other reference parameters and to what globals it might be aliased, and for each global, to what reference parameters it might be aliased. However, pointers introduce another kind of aliasing that cannot be handled by the usual GMOD/GREF computation.
For example, consider the following code:
x = 0;
*p = 1;
print(x);

In order to solve dataflow problems like constant propagation, reaching definitions, or live variables, we need to know what variables might be modified by the statement *p = 1, which in turn requires knowing what locations p might point to at line 2. To determine that information we need to use some kind of pointer analysis.
The goal of pointer analysis is to determine safe points-to information for each CFG node; i.e., to determine for each CFG node, for each variable v, what locations v might be pointing to. Since we are interested in locations that might be pointed to, safe means a superset of the actual information; i.e., we want to err on the side of including too many locations in a points-to set, not too few.
We will consider several ways to do pointer analysis for a C-like language,
including single-procedure analysis using Kildall's lattice model,
and several flow-insensitive, whole-program analyses.
We will start by ignoring complicating language aspects like
heap-allocated storage, arrays, structures, and casting, but
we'll talk about them at the end.
Flow-Sensitive Pointer Analysis
To do intraprocedural pointer analysis using our usual lattice model and
iterative algorithm, we need to determine the following:
One standard way to define the lattice of dataflow facts is using
mappings from variables to sets of possibly pointed-to locations.
Since we're ignoring heap storage for now, the possibly
pointed-to locations will also be variables.
And since we want the set of possibly pointed-to locations, the
meet operation needs to include a location in a variable's points-to
set whenever the location is in that set in either operand;
i.e., the two sets of possibly pointed-to locations must be unioned
together.
An example CFG and its dataflow facts are given below.
For each of the following kinds of statements, say which variables' points-to sets would be changed by the corresponding dataflow function, and how they would be changed.
x = y
x = *y
*x = y
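As a rough sketch, assuming dataflow facts are represented as mappings from variables to sets of possibly pointed-to variables (the tuple encoding of statements is invented for illustration, and the form x = &y is included for completeness), the dataflow functions might be written as:

```python
# Sketch of flow-sensitive transfer functions for points-to facts.
# A fact maps each variable name to the set of variables it may point to.
# Statements are encoded as (kind, lhs, rhs) tuples -- an invented encoding.

def transfer(stmt, fact):
    """Return the fact after one statement; 'fact' itself is not mutated."""
    out = {v: set(s) for v, s in fact.items()}
    kind, x, y = stmt
    if kind == "addr":                       # x = &y
        out[x] = {y}                         # strong update: replace x's set
    elif kind == "copy":                     # x = y
        out[x] = set(fact.get(y, set()))
    elif kind == "load":                     # x = *y
        out[x] = set()
        for z in fact.get(y, set()):         # everything y may point to
            out[x] |= fact.get(z, set())
    elif kind == "store":                    # *x = y
        for z in fact.get(x, set()):         # weak update: z may or may not
            out[z] = out.get(z, set()) | fact.get(y, set())   # be the target
    return out
```

Note that x = &y can use a strong update (x's old points-to set is discarded), while *x = y must use a weak update: since x may point to several locations, we cannot safely remove anything from any possible target's points-to set.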
Note that at the start of a procedure, a global or formal may also point (directly or transitively) to a local variable of some other procedure. However, this is not relevant for most uses of the results of pointer analysis. For example, when doing live-variable analysis on this sequence of statements:
x = 0;
y = 1;
*p = 1;
z = x+y;

we only care whether p may point to x or to y (to determine whether x is live immediately after the first assignment and whether y is live immediately after the second assignment); we don't care whether p may point to a local of another procedure.
We can get better results using the supergraph rather than
using safe dataflow functions for calls and enter nodes.
Another option is to use less precise but also less expensive
flow-insensitive algorithms for pointer analysis, discussed below.
Flow-insensitive analysis ignores the flow of control
in a program, and simply computes one dataflow
fact for the whole program.
One way to think about the goal of flow-insensitive analysis is that
you model a program whose CFG has n nodes as a single node with n
self-loop edges.
Each edge has one of the dataflow functions from the nodes of the
original CFG.
Then you apply Kildall's iterative algorithm.
Of course, the idea is to find a more efficient way to get the same
result.
In general, even perfect flow-insensitive analysis provides very
conservative results.
It is not likely to be useful for dataflow problems like
constant propagation, live variables, or reaching definitions.
However, it seems to be useful in practice for pointer analysis.
Below we discuss four flow-insensitive pointer-analysis techniques.
They vary in terms of their precision (how close they come to computing
the same results as the ideal version defined above that uses a
single-node CFG) and their worst-case runtimes.
Andersen's actual algorithm is more efficient than this.
Instead of building a points-to graph directly, it builds and
manipulates a constraint structure, and only
re-evaluates the effects of a statement that might
cause a change in the points-to information.
Nevertheless, the worst-case time for Andersen's algorithm
is cubic (O(N³)) in the size of the program (assuming that
the number of variables is linear in the size of the program).
To get some intuition on that N³ worst-case time,
note that the most expensive assignment statements to process
are those that involve a level of indirection; i.e.,
x = *y or *y = x.
In both cases, we need to look at everything in the points-to set
of y, and then everything in each of those points-to sets.
In the worst case, y could point to all N variables in the program,
and each of those could itself point to all N variables.
Thus, processing the statement could involve O(N²) work.
Since there are O(N) statements in the program, the total
amount of work could be as much as O(N³).
Flow-Insensitive Pointer Analysis
All four techniques assume that the program being analyzed has
been normalized so that there is no more than one pointer
dereference per statement.
For example, the statements
x = **y;
*x = *z;
would be replaced by
tmp = *y;
x = *tmp;
tmp = *z;
*x = tmp;
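A sketch of this normalization pass (the number-of-dereferences/variable encoding and the temp-naming scheme are invented for illustration; address-of expressions are not handled):

```python
# Rewrite an assignment with multiple pointer dereferences into a sequence
# of statements with at most one dereference each, introducing temporaries.
# An assignment is given as (lhs_stars, lhs, rhs_stars, rhs), where the
# "stars" counts are the number of dereferences on each side.

def normalize(lhs_stars, lhs, rhs_stars, rhs, _counter=[0]):
    out = []
    def fresh():
        _counter[0] += 1
        return f"tmp{_counter[0]}"
    # Peel dereferences off the right-hand side until at most one remains,
    # and none remain if the left-hand side itself has a dereference.
    while rhs_stars > 1 or (lhs_stars >= 1 and rhs_stars >= 1):
        t = fresh()
        out.append(f"{t} = *{rhs}")
        rhs, rhs_stars = t, rhs_stars - 1
    # Peel extra dereferences off the left-hand side the same way.
    while lhs_stars > 1:
        t = fresh()
        out.append(f"{t} = *{lhs}")
        lhs, lhs_stars = t, lhs_stars - 1
    out.append(f"{'*' * lhs_stars}{lhs} = {'*' * rhs_stars}{rhs}")
    return out
```

Here normalize(0, "x", 2, "y") produces tmp1 = *y followed by x = *tmp1, matching the hand-normalized version of x = **y above.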
Andersen's Analysis
The way to understand Andersen's analysis is that it processes
each statement in the program (in arbitrary order), building a
points-to graph, until there is no change.
(In a points-to graph, the edge a → b means
that a may point to b.)
For example, suppose we're given the following set of statements:
p = &a;
p = &b;
m = &p;
r = *m;
q = &c;
m = &q;
After one pass over this set of statements we would have the
points-to graph shown below on the left, and after two iterations
we would have the (final) graph shown below on the right (with the
new edge added during the second iteration shown in red).
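The iteration just described can be sketched as a naive fixpoint loop (this is not Andersen's efficient constraint-based formulation, and the statement encoding is invented for illustration):

```python
# Naive Andersen-style analysis: repeatedly apply the effect of every
# statement, in arbitrary order, until the points-to graph stops growing.
# Statements are encoded as (kind, lhs, rhs) tuples -- an invented encoding.

def andersen(stmts):
    pts = {}                                  # var -> set of possible targets
    def get(v):
        return pts.setdefault(v, set())
    def add(v, new):
        nonlocal changed
        before = len(get(v))
        get(v).update(new)
        if len(get(v)) != before:
            changed = True
    changed = True
    while changed:
        changed = False
        for kind, x, y in stmts:
            if kind == "addr":                # x = &y
                add(x, {y})
            elif kind == "copy":              # x = y
                add(x, get(y))
            elif kind == "load":              # x = *y
                for z in list(get(y)):
                    add(x, get(z))
            elif kind == "store":             # *x = y
                for z in list(get(x)):
                    add(z, get(y))
    return pts
```

Running this on the six statements above yields p → {a,b}, m → {p,q}, and r → {a,b,c}; the edge r → c is the one that appears only on the second iteration.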
Draw the points-to graph that would be produced by Andersen's analysis for the following set of statements:
p1 = &a
p2 = &b
p3 = &p2
p1 = p2
p4 = *p3
*p3 = &c
Steensgaard's Analysis
To process one assignment statement the algorithm performs the following steps:
1. Find the node that the left-hand side of the assignment points to.
2. Find the node that the right-hand side points to.
3. Merge those two nodes (if they are not already the same node).
Note: If steps 1 and/or 2 involve a node that isn't in the graph yet, then a dummy node is added. This is illustrated by the example below.
Processing one statement can cause multiple nodes to be merged, but the total number of merges can be no more than the number of variables. Merging nodes and finding a representative node are done using the union-find algorithm (with path compression, as analyzed by Tarjan), whose amortized cost per operation is proportional to the inverse Ackermann function, which is essentially constant. Therefore, the whole algorithm is essentially linear in the size of the program.
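A minimal sketch of this merging scheme, assuming statements are encoded as (kind, lhs, rhs) tuples (an invented encoding) and glossing over some corner cases:

```python
import itertools

class Steensgaard:
    """Sketch of Steensgaard's analysis.  Each variable has a node; each
    node has at most one outgoing points-to edge; processing a statement
    merges nodes using union-find."""

    def __init__(self):
        self.parent = {}                  # union-find parent pointers
        self.target = {}                  # representative -> pointed-to node
        self._fresh = itertools.count()

    def node(self, v):
        self.parent.setdefault(v, v)
        return v

    def find(self, v):
        while self.parent[v] != v:
            self.parent[v] = self.parent[self.parent[v]]   # path halving
            v = self.parent[v]
        return v

    def pointee(self, v):
        """The node v points to; a dummy node is added if there is none."""
        r = self.find(self.node(v))
        if r not in self.target:
            self.target[r] = self.node(f"_dummy{next(self._fresh)}")
        return self.find(self.target[r])

    def join(self, a, b):
        """Merge two nodes, and (recursively) merge what they point to."""
        a, b = self.find(a), self.find(b)
        if a == b:
            return
        ta, tb = self.target.pop(a, None), self.target.pop(b, None)
        self.parent[a] = b
        if ta is not None and tb is not None:
            self.join(ta, tb)
        t = tb if tb is not None else ta
        if t is not None:
            self.target[self.find(b)] = t

    def process(self, stmt):
        kind, x, y = stmt
        if kind == "addr":                                    # x = &y
            self.join(self.pointee(x), self.node(y))
        elif kind == "copy":                                  # x = y
            self.join(self.pointee(x), self.pointee(y))
        elif kind == "load":                                  # x = *y
            self.join(self.pointee(x), self.pointee(self.pointee(y)))
        elif kind == "store":                                 # *x = y
            self.join(self.pointee(self.pointee(x)), self.pointee(y))
```

On the example statements p = &a; p = &b; m = &p; r = *m; q = &c; m = &q, this merges a, b, and c into a single node and p and q into another; for example, q's points-to set becomes {a, b, c}, whereas Andersen's analysis gives q → {c} only.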
Draw the points-to graph that would be produced by Steensgaard's analysis for the following set of statements:
p1 = &a
p2 = &b
p3 = &p2
p1 = p2
p4 = *p3
*p3 = &c
Shapiro's Analysis
Given a value k (the amount of out-degree to allow), Shapiro's first algorithm divides the variables in the program into k "categories". It then processes each statement in the program once, updating the points-to graph to reflect the effect of the statement. When the points-to graph is updated to make a variable x point to both v1 and v2, the nodes for v1 and v2 are merged iff v1 and v2 are in the same category. As in Steensgaard's approach, if an update to the points-to graph involves a node that isn't in the graph yet, one or more dummy nodes are added. A dummy node has "tentative" outgoing edges (one for each category). The tentative edge for category c gets turned into a regular points-to edge if the dummy node gets filled in, and its points-to set for category c also gets filled in.
This technique is illustrated below, using our running example.
In this example, the results of Shapiro's points-to analysis are less precise than Andersen's (because the points-to facts p→c and q→b are included) but more precise than Steensgaard's (because the points-to fact q→a is not included).
Note that the choice of categories is important: if instead of the categories used in the above example we had used { a, c, q, m } and { b, p, r }, the result would have been closer to the result using Andersen's analysis; and if we had used { a, b, c } and { p, q, r, m }, the result would have been the same as Steensgaard's analysis.
However, although different choices of categories can lead to different results, all of those results are safe. This leads to an interesting observation: if the points-to sets computed using different categories are all safe (all over-approximations; i.e., the sets may be too large but are never too small), then the intersection of the results is also safe. This observation leads to the second version of Shapiro's points-to analysis:
Analyze the code given below using one category (to get the same results as Steensgaard's algorithm); using two categories: {a, b} and {c, d}; and using four categories (to get the same results as Andersen's algorithm). Then try using the two categories: {a, c} and {b, d}, and intersect the results with the results you got using the previous two categories.
a = &b
a = &c
a = &d
c = &d
An important question is how to assign variables to categories, and how to choose T (the number of times to run the original algorithm with different categories). One interesting answer investigated by Shapiro was to let the user choose the number of categories, k (between 2 and N, where N is the number of variables), and then to use T = ceiling(log_k(N)). This way, the categories can be chosen so that for every pair of variables there is at least one run for which they are in different categories. To do this, assign each variable a distinct T-digit number in base k, and on run t put each variable in the category given by digit t of its number. For example, with k = 2 and four variables:

a: 00
b: 01
c: 10
d: 11

On the first run (using the first digit), the categories will be {a,b} and {c,d}, while on the second run (using the second digit) they will be {a,c} and {b,d}.
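That numbering scheme can be sketched as follows (the function name and encoding are invented for illustration):

```python
def category_runs(variables, k):
    """Assign each of the N variables a distinct T-digit base-k number,
    with T = ceiling(log_k(N)); run t puts each variable in the category
    named by digit t of its number, so every pair of variables lands in
    different categories on at least one run."""
    n = len(variables)
    t = 1
    while k ** t < n:                     # smallest T with k^T >= N
        t += 1
    runs = []
    for digit in range(t):
        cats = [[] for _ in range(k)]
        for i, v in enumerate(variables):
            d = (i // k ** (t - 1 - digit)) % k   # base-k digit of i
            cats[d].append(v)
        runs.append([set(c) for c in cats if c])  # drop empty categories
    return runs
```

With four variables and k = 2 this reproduces the example above: the first run uses {a,b} and {c,d}, the second {a,c} and {b,d}.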
The worst-case time for the first version of Shapiro's algorithm is O(k²N), where N is the size of the program and k is the number of categories. As in Andersen's algorithm, the worst case for processing a single assignment statement is a statement of the form x = *y or *y = x. This requires examining the targets of all edges out of y in the points-to graph, and then all edges out of those nodes. There are at most k edges out of any node, so the total number of edges that might need to be followed is k². Each of the N statements in the program is processed once, so the total worst-case time is no more than O(k²N).
For the second version, the total time is O(T k²N). With T = ceiling(log_k(N)), which guarantees that there is at least one run that "separates" every pair of variables, this is O(k²N log_k(N)).
The loss of precision for Program 1 (the inclusion of the "extra" fact
p → k3) occurs when processing the assignment
q = p.
The way Steensgaard handles a "copy assignment" like that
is to merge the points-to sets of the left- and right-hand side
variables;
i.e., to treat the assignment as symmetric.
Das handles q = p differently:
instead of merging points-to sets, he introduces what he calls
a flow edge into the points-to graph from the node
that represents p's points-to set to the node that
represents q's points-to set.
(If those nodes don't exist, empty dummy nodes are added.)
This flow edge means that all
of the addresses in the points-to set of p
are also in the set of q.
In fact, flow edges are introduced when processing all
assignment statements,
including ones of the form v1 = &v2.
As in Steensgaard's algorithm, each node in the points-to graph
has only a single outgoing "points-to" edge, but it can have
any number of outgoing flow edges.
Below are the sequences of assignments for example Programs 1 and 2
again, and the corresponding sequences of points-to graphs built by
Das's algorithm.
The points-to graphs shown are actually slightly simplified;
we will see the missing parts below.
In the graphs, points-to edges are shown as plain arrows, and
flow edges are shown as arrows with stars above them.
Like Steensgaard, Das processes each assignment in the program
just once, building a points-to graph with both points-to and
flow edges.
After the graph is built, sets of variables must be propagated
forward along the flow edges to form the final points-to sets.
The results of this propagation step for the two final
points-to graphs shown above are shown below, along with the
final points-to sets for each variable.
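The propagation step can be sketched as a simple worklist pass over the flow edges (the node names used in the usage example are invented stand-ins for points-to-graph nodes):

```python
from collections import defaultdict, deque

def propagate(pts, flow_edges):
    """Sketch of Das's propagation step: push each node's points-to set
    forward along flow edges (src *-> dst means dst's final set must
    include src's set) until a fixpoint is reached."""
    succ = defaultdict(list)
    for src, dst in flow_edges:
        succ[src].append(dst)
    final = defaultdict(set)
    for node, addrs in pts.items():
        final[node] |= addrs
    work = deque(list(final))
    while work:
        n = work.popleft()
        for d in succ[n]:
            if not final[n] <= final[d]:
                final[d] |= final[n]
                work.append(d)            # d grew, so revisit its successors
    return dict(final)
```

For example, with a node holding {k1, k2} and a flow edge to a node holding {k3} (the shape that arises for p and q in Program 1), propagation leaves the second node with {k1, k2, k3}.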
The simplification in the above points-to graphs has to do with the
fact that
Das's algorithm is similar to Steensgaard's in that it does
merge some points-to sets.
In particular, whenever a flow edge n1 *→ n2
is added, all points-to sets "below" n1 and n2
are merged.
In order for this to work, whenever a flow
edge n1 *→ n2 is added to the graph,
outgoing points-to edges (to dummy nodes)
are also added if they don't already exist.
Below is an example that illustrates that, as well as the merging
of points-to sets performed by Das's algorithm.
The merging occurs after the last assignment (q = p).
In the figure below, this is shown using three points-to
graphs, left-to-right.
It is worth noting that because Das's algorithm only merges nodes
below a flow edge, it maintains precision for "top-level"
pointers;
i.e., pointers that are not themselves pointed to.
Das's Analysis
Das's algorithm is based on the observation (made by doing studies of
C programs) that in C code, most pointers arise because the address of
a variable is passed as a parameter.
This is done either to simulate call-by-reference (i.e., so that the procedure can modify that parameter), or for efficiency (e.g., when the parameter is a struct or an array).
This observation is important because the use of pointer parameters
can cause Steensgaard's algorithm to lose precision when a pointer
parameter is passed on from one procedure to another, or when two different
procedures are called with the same pointer parameter.
For example, consider the two programs shown below.
For each program, the actual code is shown on the left, and
the equivalent set of assignment statements is shown on the right.
PROGRAM 1
void Q(int *q) { |
*q = ...; |
} |
|
void P(int *p) { |
Q(p); | q = p
} |
|
int main() { |
int k1, k2, k3; |
P(&k1); | p = &k1
P(&k2); | p = &k2
Q(&k3); | q = &k3
} |
_________________________________________________
PROGRAM 2
void F(int *f) { |
... |
} |
|
void G(int *g) { |
... |
} |
|
int main() { |
int k1, k2, k3; |
F(&k1); | f = &k1
F(&k2); | f = &k2
G(&k2); | g = &k2
G(&k3); | g = &k3
} |
Here are the points-to graphs that would be built using
Andersen's analysis and using
Steensgaard's analysis:
In each points-to graph, the new nodes and edges are shown in red.
Draw the points-to graph and the final points-to sets that would be computed using Das's algorithm for the code below (that we used previously to illustrate the other flow-insensitive pointer analyses). Compare the final points-to sets with those computed by Andersen's algorithm.
p = &a;
p = &b;
m = &p;
r = *m;
q = &c;
m = &q;
The propagation step takes time proportional to the sum of the sizes of the final points-to sets. In the worst case, this is O(N2), where N is the number of variables. However, in practice (see Experimental Results below), the time appears to be linear.
Das did some experiments to compare his algorithm with those of Andersen and Steensgaard. He found that