Static Analysis involves finding properties of programs without actually running them. There are many reasons why people want to do static analysis, including the following:
Most interesting properties of programs are undecidable, and even those that are not may be very expensive to compute. Therefore, static analysis usually involves some kind of abstraction. For example, instead of keeping track of all of the values that a variable may have at each point in a program, we might only keep track of whether a variable's value is positive, negative, zero, or unknown. Abstraction makes it possible to discover interesting properties of programs, but the results of static analysis are usually incomplete: for example, an analysis may say that a variable's value is unknown at a point when in fact the value will always be positive at that point.
In CS 701 we studied Dataflow Analysis. That is a commonly used framework for static analysis. However, one problem with standard dataflow analysis is that it provides no guarantees that the results are consistent with the program's semantics. In contrast, abstract interpretation is a static-analysis framework that does guarantee that the information gathered about a program is a safe approximation to the program's semantics. This property is achieved by establishing key relationships between the static analysis and the formal semantics.
Let's start with a very simple example: We'll define an abstract interpretation of a language of integer expressions including only literals, addition, and multiplication. The goal of the abstract interpretation will be to determine whether each (sub)-expression is negative, zero, or positive. For example, if we know that xxx is negative, while both yyy and zzz are positive. we can determine that the expression
Now let's formalize these ideas.
Syntax. Expressions involve literals, addition, and multiplication.
exp | → n | // literals |
→ exp + exp | // addition | |
→ exp * exp | // multiplication |
Standard Interpretation. We define the standard interpretation of expressions using denotational semantics; i.e., we add the definitions of the valuation functions below. Note that the (standard) meaning of an integer expression is an Int.
E[[n]] = n
E[[E1 + E2]] = E[[E1]] + E[[E2]]
E[[E1 * E2]] = E[[E1]] * E[[E2]]
Abstract Interpretation. We want our abstract interpretation to tell us whether an expression is negative, zero, or positive. However, we can't always do that. For example, a negative plus a positive can be either negative or positive. Therefore, our abstract domain, which we'll call Sign, must include a "don't know" value, num.
⊕ | neg | zero | pos | num |
---|---|---|---|---|
neg | neg | neg | num | num |
zero | neg | zero | pos | num |
pos | num | pos | pos | num |
num | num | num | num | num |
⊗ | neg | zero | pos | num |
---|---|---|---|---|
neg | pos | zero | neg | num |
zero | zero | zero | zero | zero |
pos | neg | zero | pos | num |
num | num | zero | num | num |
Here are the abstract valuation functions; note that the abstract meaning of an integer expression is a Sign.
Eabs[[n]] = if (n<0) then neg else if (n=0) then zero else pos
Eabs[[E1 + E2]] = Eabs[[E1]] ⊕ Eabs[[E2]]
Eabs[[E1 * E2]] = Eabs[[E1]] ⊗ Eabs[[E2]]
And here is an example of applying the abstract interpretation to an expression:
The abstract interpretation defined above for the "rule-of-signs" example was very simple and intuitive. Assuming that I didn't make any typographical errors when typing in the two tables, it shouldn't be hard to convince yourself that the abstract semantics is consistent with the standard (concrete) semantics. However, to be sure that this consistency holds, we must do the following:
1. Abstraction function α.
For the rule-of-signs example, the abstraction function is defined as follows:
α({0}) = | zero |
α(S) = | if all values in S are greater than 0 then pos |
else if all values in S are less than 0 then neg | |
else num |
2. Concretization function γ.
And the concretization function is defined as follows:
γ(zero) = {0} |
γ(pos) = {all positive ints} |
γ(neg) = {all negative ints} |
γ(num) = Int (i.e., all ints) |
3. Galois Connection.
A Galois connection is a pair of functions, α and γ between two partially ordered sets (C, ⊆) and (A, ≤), such that both of the following hold.
Here are the two relationships we need, presented pictorially:
For our example, poset A is the set containing the four elements of Sign (with num as the top element, and no ordering relationship among the other three elements), and poset C is the set of all sets of integers, ordered by subset. Here is a picture with all of A and some of C. Some of the alpha mapping (the abstraction function) is shown using red arrows, and some of the gamma mapping (the concretization function) is shown using blue arrows.
Question 1: Fill in the remaining alpha and gamma edges in the figure above.
Question 2: Show that alpha and gamma do form a Galois connection.
4. Safety. Our final obligation in proving that our rule-of-signs abstract interpretation is consistent with the standard semantics is to prove that, for every expression exp,
Base case: exp is literal k. This case has three parts (based on the definition of Eabs):
E[[k]] = k | // def of E |
Eabs[[k]] = neg | // def of Eabs |
γ(neg) = {all negative ints} | // def of γ |
E[[k]] = 0 | // def of E |
Eabs[[k]] = zero | // def of Eabs |
γ(zero) = {0} | // def of γ |
E[[k]] = k | // def of E |
Eabs[[k]] = pos | // def of Eabs |
γ(pos) = {all positive ints} | // def of γ |
Inductive Step
The inductive step is quite tedious. There are two cases (one for addition and one for multiplication), and each has 16 sub-cases (for all possible combinations of the signs of the two sub-expressions). Here is one example to show the flavor of the proof.
Inductive case 1: exp is e1 + e2.
RHS: | γ(Eabs[[e1 + e2]]) | |
= γ(Eabs[[e1]] ⊕ Eabs[[e2]]) | // def of Eabs |
sub-case 1: both Eabs[[e1]] and Eabs[[e2]] are neg.
= γ(neg ⊕ neg) | ||
= γ(neg) | // def of ⊕ | |
= { all negative ints } | // def of γ | |
LHS: | E[[e1 + e2]] | |
= E[[e1]] + E[[e2]] | // def of E |
By the induction hypothesis, E[[e1]] is a subset of γ(Eabs[[e1]]), which is γ(neg), which is the set of all negative ints. The same applies to E[[e1]]. Thus, the LHS is the sum of two negative ints, which is a negative int, which is certainly a subset of { all negative ints } (the final value for the RHS).
In what way does proving that
For the simple rule-of-signs example, we were able to define an abstract interpretation as a variation on the standard denotational semantics. For more realistic static-analysis problems, however, the standard denotational semantics is usually not a good place to start. This is because we usually want the results of static analysis to tell us what holds at each point in the program, and program points are usually defined to be the nodes of the program's control-flow graph (CFG). For example, for constant propagation we want to know, for each CFG node, which variables are guaranteed to have constant values when execution reaches that node. Therefore, it is better to start with a (standard) semantics defined in terms of a CFG.
There are various ways to define a CFG semantics. The most straightforward is to define what is called an operational semantics; think of it as an interpreter whose input is the entry node of a CFG plus an initial state (a mapping from variables to values), and whose output is the program's final state. We'll define the standard sementics in terms of transfer functions, one for each CFG node. These are (semantic) functions whose inputs are states and whose outputs are pairs that include both an output state and the CFG node that is the appropriate successor. A node's transfer function captures the execution semantics of that node and specifies the next node to be executed.
For example, consider the CFG shown below (with labels on the nodes).
+----------+ | 1: start | +----------+ | v +----------+ | 2: a = 1 | +----------+ | v +----------+ | 3: b = 1 | +----------+ | v +----------+ F +------------+ +---------+ +---> | 4: a < 3 |---->| 6: c = a+b |---->| 7: exit | | +----------+ +------------+ +---------+ | | | T | | v | +------------+ | | 5: a = a+b | | +------------+ | | | | +-----------+
For this example, the transfer function for node 2 would be defined as follows:
Here's a (recursive) definition of the interpreter (the operational semantics). We use fn to mean the transfer function defined for CFG node n.
interp = λs.λn. | if isExitNode(n) then s |
else let (s', n') = fn(s) in interp s' n' |
Because this definition is recursive, we need to use the usual trick of abstracting on the function and defining the operational semantics as the least fixed point of that abstraction:
semantics = fix(λF.λs.λn. | if isExitNode(n) then s |
else let (s', n') = fn(s) in F s' n') |
While the operational semantics discussed above is defined in terms of the program's CFG, it has two properties that are undesirable as the basis for an abstract interpretation:
We will define a collecting semantics that maps CFG nodes to sets of states; i.e., for each CFG node n, the collecting semantics tells us what states can arise just before n is executed. The "approximate semantics" that we define using abstract interpretation will compute, for each CFG node, (a finite representation of) a superset of the set of states computed for that node by the collecting semantics. By showing that our abstract interpretation really does compute a superset of the possible states that can arise at each CFG node, we show that it is consistent with the program's actual semantics.
Because the collecting semantics involves sets of states, we need to define transfer functions whose inputs and outputs are sets of states. We'll define one function fn→m for each CFG edge n→m. That transfer function will be defined in terms of the (original) transfer function fn defined for the CFG node n:
For example, the transfer function for edge 2→3 of the example CFG given above would be defined as follows:
What is the transfer function (for the collecting semantics) for edge 4→5 of the example CFG?
Our collecting semantics will be of type CFG-node → set-of-states. The (recursive) definition is given below. It defines the set of states that holds just before node n to be the union of the sets of states produced by applying the transfer functions of all of n's in-edges to the sets of states that hold just before the sources of those in-edges execute.
recColl = λn. | if isEnterNode(n) then { all states } | |
else | let P = preds(n) in ∪p ∈ P fp→n(recColl(p)) |
And here's the non-recursive definition:
coll = fix(λF.λn. | if isEnterNode(n) then { all states } | |
else | let P = preds(n) in ∪p ∈ P fp→n(F(p)) |
The values computed for iterations 9 and 10 are the same, so line 9 of the table defines function coll.
Iteration # | Node 1 | Node 2 | Node 3 | Node 4 | Node 5 | Node 6 | Node 7 |
---|---|---|---|---|---|---|---|
0 | ∅ | ∅ | ∅ | ∅ | ∅ | ∅ | ∅ |
1 | [ * * * ] | ∅ | ∅ | ∅ | ∅ | ∅ | ∅ |
2 | [ * * * ] | [ * * * ] | ∅ | ∅ | ∅ | ∅ | ∅ |
3 | [ * * * ] | [ * * * ] | [ 1 * * ] | ∅ | ∅ | ∅ | ∅ |
4 | [ * * * ] | [ * * * ] | [ 1 * * ] | [ 1 1 * ] | ∅ | ∅ | ∅ |
5 | [ * * * ] | [ * * * ] | [ 1 * * ] | [ 1 1 * ] | [ 1 1 * ] | ∅ | ∅ |
6 | [ * * * ] | [ * * * ] | [ 1 * * ] | [ 1 1 * ] [ 2 1 * ] | [ 1 1 * ] | ∅ | ∅ |
7 | [ * * * ] | [ * * * ] | [ 1 * * ] | [ 1 1 * ] [ 2 1 * ] | [ 1 1 * ] [ 2 1 * ] | ∅ | ∅ |
8 | [ * * * ] | [ * * * ] | [ 1 * * ] | [ 1 1 * ] [ 2 1 * ] [ 3 1 * ] | [ 1 1 * ] [ 2 1 * ] | ∅ | ∅ |
9 | [ * * * ] | [ * * * ] | [ 1 * * ] | [ 1 1 * ] [ 2 1 * ] [ 3 1 * ] | [ 1 1 * ] [ 2 1 * ] | [ 3 1 * ] | ∅ |
10 | [ * * * ] | [ * * * ] | [ 1 * * ] | [ 1 1 * ] [ 2 1 * ] [ 3 1 * ] | [ 1 1 * ] [ 2 1 * ] | [ 3 1 * ] | [ 3 1 4 ] |
11 | [ * * * ] | [ * * * ] | [ 1 * * ] | [ 1 1 * ] [ 2 1 * ] [ 3 1 * ] | [ 1 1 * ] [ 2 1 * ] | [ 3 1 * ] | [ 3 1 4 ] |
What property of the example program allows us to compute coll? What modification to the program would cause the fixed-point iteration to be infinite (and thus not computable)?
To define an abstract interpretation we need to do the following:
This proof obligation is illustrated in the diagram below; the ⊆ relationship that must be proved (step 6) is shown using a purple line in the concrete domain.
Given an abstract interpretation, we can define the abstract semantics recursively or non-recursively, as we did for the collecting semantics. The definitions given below define the abstract semantics as a mapping CFG-node → abstract state. The abstract state that holds at CFG node n (a safe approximation to the set of concrete states that hold just before n executes) is the join of the abstract states produced by applying the abstract transfer functions of all of node n's incoming CFG edges to the abstract states that hold before those edges' source nodes.
recAbs = λn. | if isEnterNode(n) then α({all states}) | |
else | let P = preds(n) in | |
∪p ∈ P f#p→n(recAbs(p)) |
And here's the non-recursive definition:
abs = fix(λF.λn. | if isEnterNode(n) then α({all states}) | |
else | let P = preds(n) in | |
∪p ∈ P f#p→n(F(p)) |
Here is a definition of constant propagation:
The ordering of the abstract domain is based on the underlying flat ordering of individual values in which ? is the top element, and all other values are incomparable. Given two abstract states, a1 and a2, a1 ≤ a2 iff
A Galois insertion is a stronger relationship than a Galois connection. Functions α and γ form a Galois insertion iff
Show that functions α and γ defined above for constant propagation form a Galois insertion (by proving point 2 above).
How does abstract interpretation compare with the kind of dataflow analysis studied in CS 701? One thing that may look like a significant difference but in fact is not, is the way the analyses are actually carried out. In 701, we did the following:
Abstract interpretation is similar: To ensure termination of an iterative algorithm, the abstract domain must be a complete lattice with no infinite ascending chains, and the abstract transfer functions must still be monotonic. The solution is computed by iterating up from bottom, instead of down from top, but since a complete lattice is "symmetric", this is just a cosmetic difference.
The real difference between the two approaches is that for 701-style dataflow analysis, we define the lattice elements and the CFG-node dataflow functions based only on intuition. Therefore, there is no guarantee that the solution to a dataflow problem has any relationship to the program's semantics. In contrast, since part of abstract interpretation involves showing relationships between the concrete and abstract semantics, we do have such guarantees.
The price we pay is that it is not always clear how to define dataflow problems of interest using abstract interpretation. For example, problems like reaching definitions require knowing more than just the sets of states that can arise at each CFG node: we also need to know which other CFG nodes assigned the values to the variables. This is usually done by defining an instrumented collecting semantics, which keeps additional information (like the label of the CFG node that most recently assigned to a variable) in a state. While this allows reaching definitions to be defined, it may seem rather ad hoc.
A similar issue arises with backward problems like live variable analysis. One interesting approach to defining the live-variables problem using abstract interpretation and continuation semantics is given in a set of lecture notes called Introduction to Abstract Interpretation by Mads Rosendahl.