Abstract Interpretation


Motivation and Overview

Static Analysis involves finding properties of programs without actually running them. There are many reasons why people want to do static analysis, including performing compiler optimizations, finding potential bugs, and verifying properties of programs.

Most interesting properties of programs are undecidable, and even those that are not may be very expensive to compute. Therefore, static analysis usually involves some kind of abstraction. For example, instead of keeping track of all of the values that a variable may have at each point in a program, we might only keep track of whether a variable's value is positive, negative, zero, or unknown. Abstraction makes it possible to discover interesting properties of programs, but the results of static analysis are usually incomplete: for example, an analysis may say that a variable's value is unknown at a point when in fact the value will always be positive at that point.

In CS 701 we studied Dataflow Analysis. That is a commonly used framework for static analysis. However, one problem with standard dataflow analysis is that it provides no guarantees that the results are consistent with the program's semantics. In contrast, abstract interpretation is a static-analysis framework that does guarantee that the information gathered about a program is a safe approximation to the program's semantics. This property is achieved by establishing key relationships between the static analysis and the formal semantics.

Example 1: Rule of Sign

Let's start with a very simple example: We'll define an abstract interpretation of a language of integer expressions including only literals, addition, and multiplication. The goal of the abstract interpretation will be to determine whether each (sub-)expression is negative, zero, or positive. For example, if we know that x is negative, while both y and z are positive, we can determine that the expression

    x * (y + z)

is negative without actually knowing the values of x, y, and z: adding two positive integers yields a positive integer, and multiplying a negative integer by a positive one yields a negative integer.

Now let's formalize these ideas.

Syntax. Expressions involve literals, addition, and multiplication.

Standard Interpretation. We define the standard interpretation of expressions using denotational semantics; i.e., we give the definitions of the valuation functions below. Note that the (standard) meaning of an integer expression is an Int.

    E[[k]]       = k
    E[[e1 + e2]] = E[[e1]] + E[[e2]]
    E[[e1 × e2]] = E[[e1]] × E[[e2]]

Abstract Interpretation. We want our abstract interpretation to tell us whether an expression is negative, zero, or positive. However, we can't always do that. For example, a negative plus a positive can be either negative or positive. Therefore, our abstract domain, which we'll call Sign, must include a "don't know" value, num.

We'll define the valuation functions for the abstract interpretation in terms of the two tables below, which define abstract addition and multiplication operations.

     ⊕   |  neg   zero  pos   num
    -----+-------------------------
    neg  |  neg   neg   num   num
    zero |  neg   zero  pos   num
    pos  |  num   pos   pos   num
    num  |  num   num   num   num

     ⊗   |  neg   zero  pos   num
    -----+-------------------------
    neg  |  pos   zero  neg   num
    zero |  zero  zero  zero  zero
    pos  |  neg   zero  pos   num
    num  |  num   zero  num   num

Here are the abstract valuation functions; note that the abstract meaning of an integer expression is a Sign.

    Eabs[[k]]       = neg,  if k < 0
                      zero, if k = 0
                      pos,  if k > 0
    Eabs[[e1 + e2]] = Eabs[[e1]] ⊕ Eabs[[e2]]
    Eabs[[e1 × e2]] = Eabs[[e1]] ⊗ Eabs[[e2]]

where ⊕ and ⊗ are the abstract addition and multiplication operations defined by the tables above.

And here is an example of applying the abstract interpretation to an expression:
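To make the abstract interpretation concrete, here is a small executable sketch in Python. The names (abs_eval, ABS_ADD, and so on) are illustrative, not from the notes, and the sketch extends the expression language with variables whose signs are given by an environment, so that the motivating x * (y + z) example can be run.

```python
# Sign abstract domain for the rule-of-signs analysis.
# Expressions are: an int literal, a variable name (a str),
# ("+", e1, e2), or ("*", e1, e2).

NEG, ZERO, POS, NUM = "neg", "zero", "pos", "num"

# Abstract addition table (left operand, right operand) -> result.
ABS_ADD = {
    (NEG, NEG): NEG,   (NEG, ZERO): NEG,   (NEG, POS): NUM,   (NEG, NUM): NUM,
    (ZERO, NEG): NEG,  (ZERO, ZERO): ZERO, (ZERO, POS): POS,  (ZERO, NUM): NUM,
    (POS, NEG): NUM,   (POS, ZERO): POS,   (POS, POS): POS,   (POS, NUM): NUM,
    (NUM, NEG): NUM,   (NUM, ZERO): NUM,   (NUM, POS): NUM,   (NUM, NUM): NUM,
}

# Abstract multiplication table.
ABS_MUL = {
    (NEG, NEG): POS,   (NEG, ZERO): ZERO,  (NEG, POS): NEG,   (NEG, NUM): NUM,
    (ZERO, NEG): ZERO, (ZERO, ZERO): ZERO, (ZERO, POS): ZERO, (ZERO, NUM): ZERO,
    (POS, NEG): NEG,   (POS, ZERO): ZERO,  (POS, POS): POS,   (POS, NUM): NUM,
    (NUM, NEG): NUM,   (NUM, ZERO): ZERO,  (NUM, POS): NUM,   (NUM, NUM): NUM,
}

def abs_eval(e, env):
    """Abstract valuation function: maps an expression to a Sign.
    env maps variable names to Sign values."""
    if isinstance(e, int):
        return NEG if e < 0 else ZERO if e == 0 else POS
    if isinstance(e, str):
        return env[e]
    op, e1, e2 = e
    table = ABS_ADD if op == "+" else ABS_MUL
    return table[(abs_eval(e1, env), abs_eval(e2, env))]

# The motivating example: x * (y + z) with x negative, y and z positive.
result = abs_eval(("*", "x", ("+", "y", "z")), {"x": NEG, "y": POS, "z": POS})
print(result)  # -> neg
```

The analysis concludes that the expression is negative without knowing any concrete values: (pos ⊕ pos) = pos, and (neg ⊗ pos) = neg.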

Relationship between standard and abstract interpretations.

The abstract interpretation defined above for the "rule-of-signs" example was very simple and intuitive. Assuming that I didn't make any typographical errors when typing in the two tables, it shouldn't be hard to convince yourself that the abstract semantics is consistent with the standard (concrete) semantics. However, to be sure that this consistency holds, we must do the following:

  1. Define two partially-ordered sets (posets) C and A. The elements of C are all non-empty sets of values from the concrete domain (i.e., sets of integers). The elements of A are the values from the abstract domain (i.e., neg, zero, pos, and num).
  2. Define an abstraction function α that maps non-empty sets of integer values to Sign values (i.e., α is of type C → A).
  3. Define a concretization function γ that maps Sign values to non-empty sets of integer values (i.e., γ is of type A → C).
  4. Show that α and γ form a Galois connection (defined below).

  5. For each possible form of expression exp, show that
      { E[[exp]] } ⊆ γ(Eabs[[exp]])
    where ⊆ is the ordering of poset C, i.e., the subset ordering.

1. Abstraction function α.

For the rule-of-signs example, the abstraction function is defined as follows:

    α(S) = neg,  if every integer in S is negative
    α(S) = zero, if S = {0}
    α(S) = pos,  if every integer in S is positive
    α(S) = num,  otherwise

2. Concretization function γ.

And the concretization function is defined as follows:

    γ(neg)  = the set of all negative integers
    γ(zero) = {0}
    γ(pos)  = the set of all positive integers
    γ(num)  = the set of all integers

3. Galois Connection.

A Galois connection is a pair of functions, α and γ, between two partially ordered sets (C, ⊆) and (A, ≤), such that both of the following hold:

  1. ∀ a ∈ A, c ∈ C: α(c) ≤ a iff c ⊆ γ(a)
  2. ∀ a ∈ A: α(γ(a)) ≤ a

Here are the two relationships we need, presented pictorially:

For our example, poset A is the set containing the four elements of Sign (with num as the top element, and no ordering relationship among the other three elements), and poset C is the set of all sets of integers, ordered by subset. Here is a picture with all of A and some of C. Some of the alpha mapping (the abstraction function) is shown using red arrows, and some of the gamma mapping (the concretization function) is shown using blue arrows.


Question 1: Fill in the remaining alpha and gamma edges in the figure above.

Question 2: Show that alpha and gamma do form a Galois connection.
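Question 2 can also be spot-checked mechanically. Below is a sketch in Python (the names alpha, gamma, and leq are illustrative). Since γ(neg), γ(pos), and γ(num) are infinite sets, γ is represented as a membership predicate, and the Galois-connection conditions are checked only on sampled finite sets, which suggests, but of course does not prove, the result.

```python
NEG, ZERO, POS, NUM = "neg", "zero", "pos", "num"

def alpha(c):
    """Abstraction function: maps a non-empty set of integers to a Sign."""
    assert c, "alpha is defined only on non-empty sets"
    if all(x < 0 for x in c):
        return NEG
    if all(x == 0 for x in c):
        return ZERO
    if all(x > 0 for x in c):
        return POS
    return NUM

# gamma(a) is infinite for neg, pos, and num, so it is represented as a
# membership predicate rather than an explicit set.
GAMMA = {NEG: lambda x: x < 0, ZERO: lambda x: x == 0,
         POS: lambda x: x > 0, NUM: lambda x: True}

def leq(a1, a2):
    """The ordering on Sign: num is top; neg, zero, pos are incomparable."""
    return a1 == a2 or a2 == NUM

def subset(c, a):
    """Test c ⊆ gamma(a), with gamma(a) given as a predicate."""
    return all(GAMMA[a](x) for x in c)

# Condition 1: alpha(c) <= a  iff  c ⊆ gamma(a), on sampled finite sets.
for c in [{-3}, {0}, {7}, {-2, -1}, {-1, 1}, {0, 5}, {-4, 0, 4}]:
    for a in (NEG, ZERO, POS, NUM):
        assert leq(alpha(c), a) == subset(c, a)

# Condition 2: alpha(gamma(a)) <= a, on finite samples of gamma(a).
for a in (NEG, ZERO, POS, NUM):
    sample = {x for x in range(-5, 6) if GAMMA[a](x)}
    assert leq(alpha(sample), a)
```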


4. Safety. Our final obligation in proving that our rule-of-signs abstract interpretation is consistent with the standard semantics is to prove that, for every expression exp,

    { E[[exp]] } ⊆ γ(Eabs[[exp]])

This can be done using structural induction.

Base case: exp is a literal k. This case has three parts (based on the definition of Eabs):

  1. k < 0: In this case Eabs[[k]] = neg, so γ(Eabs[[k]]) = {all negative ints}, and {k} is a subset of {all negative ints} (case proved).

  2. k = 0: In this case Eabs[[k]] = zero, so γ(Eabs[[k]]) = {0}, and {0} is a subset of {0} (case proved).

  3. k > 0: In this case Eabs[[k]] = pos, so γ(Eabs[[k]]) = {all positive ints}, and {k} is a subset of {all positive ints} (case proved).

Inductive Step

The inductive step is quite tedious. There are two cases (one for addition and one for multiplication), and each has 16 sub-cases (for all possible combinations of the signs of the two sub-expressions). Here is one example to show the flavor of the proof.

Inductive case 1: exp is e1 + e2.
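Although writing out all 16 sub-cases by hand is tedious, they can be checked mechanically. The sketch below (names are illustrative) verifies, for every pair of Signs (s1, s2) and every sampled x ∈ γ(s1) and y ∈ γ(s2), that x + y ∈ γ(s1 ⊕ s2) and x × y ∈ γ(s1 ⊗ s2); the infinite sets γ(neg), γ(pos), and γ(num) are sampled over a finite range of integers.

```python
NEG, ZERO, POS, NUM = "neg", "zero", "pos", "num"
SIGNS = (NEG, ZERO, POS, NUM)

# gamma as membership predicates (gamma(neg) etc. are infinite sets).
GAMMA = {NEG: lambda x: x < 0, ZERO: lambda x: x == 0,
         POS: lambda x: x > 0, NUM: lambda x: True}

def abs_add(a, b):
    # zero is the identity; equal signs are preserved; all else is num.
    if a == ZERO:
        return b
    if b == ZERO:
        return a
    return a if a == b else NUM

def abs_mul(a, b):
    # anything times zero is zero; num is otherwise contagious.
    if a == ZERO or b == ZERO:
        return ZERO
    if a == NUM or b == NUM:
        return NUM
    return POS if a == b else NEG

# Check all 16 sub-cases for each operator on sampled concrete values.
for s1 in SIGNS:
    for s2 in SIGNS:
        for x in range(-4, 5):
            if not GAMMA[s1](x):
                continue
            for y in range(-4, 5):
                if not GAMMA[s2](y):
                    continue
                assert GAMMA[abs_add(s1, s2)](x + y)
                assert GAMMA[abs_mul(s1, s2)](x * y)
```

Here abs_add and abs_mul are compact encodings of the two tables given earlier; the nested loops cover each of the 16 (s1, s2) sub-cases for both operators.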


In what way does proving that

    { E[[exp]] } ⊆ γ(Eabs[[exp]])

show that our rule-of-signs abstract interpretation is consistent with the standard semantics?


Standard and Collecting Semantics for CFGs

For the simple rule-of-signs example, we were able to define an abstract interpretation as a variation on the standard denotational semantics. For more realistic static-analysis problems, however, the standard denotational semantics is usually not a good place to start. This is because we usually want the results of static analysis to tell us what holds at each point in the program, and program points are usually defined to be the nodes of the program's control-flow graph (CFG). For example, for constant propagation we want to know, for each CFG node, which variables are guaranteed to have constant values when execution reaches that node. Therefore, it is better to start with a (standard) semantics defined in terms of a CFG.

Standard Semantics

There are various ways to define a CFG semantics. The most straightforward is to define what is called an operational semantics; think of it as an interpreter whose input is the entry node of a CFG plus an initial state (a mapping from variables to values), and whose output is the program's final state. We'll define the standard semantics in terms of transfer functions, one for each CFG node. These are (semantic) functions whose inputs are states and whose outputs are pairs that include both an output state and the CFG node that is the appropriate successor. A node's transfer function captures the execution semantics of that node and specifies the next node to be executed.

For example, consider the CFG shown below (with labels on the nodes).

For this example, the transfer function for node 2 would be defined as follows:

    f2(s) = (s[a ← 1], node 3)

where s[a ← 1] means "a new state that is the same as s except that it maps variable a to 1." For node 4, the transfer function would be

    f4(s) = if s(a) < 3 then (s, node 5) else (s, node 6)

In this case, the output state is the same as the input state; the successor node depends on whether variable a is less than 3 in the current (input) state.

Here's a (recursive) definition of the interpreter (the operational semantics). We use fn to mean the transfer function defined for CFG node n:

    interp(n, s) = s, if n is the exit node
    interp(n, s) = interp(m, s'), where (s', m) = fn(s), otherwise

Because this definition is recursive, we need to use the usual trick of abstracting on the function and defining the operational semantics as the least fixed point of that abstraction:
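As a concrete sketch, here is the interpreter in Python, run on a reconstruction of the example CFG inferred from the transfer functions for nodes 2 and 4 and from the collecting-semantics table in the next section (node 1: enter, node 2: a := 1, node 3: b := 1, node 4: the test a < 3, node 5: a := a + 1 with a back edge to node 4, node 6: c := 4, node 7: exit). The function names are illustrative, not from the original notes.

```python
# Transfer functions: state -> (state, successor-node).
# fn is the transfer function for CFG node n.

def f1(s):            # node 1: enter
    return (s, 2)

def f2(s):            # node 2: a := 1
    return ({**s, "a": 1}, 3)

def f3(s):            # node 3: b := 1
    return ({**s, "b": 1}, 4)

def f4(s):            # node 4: test a < 3
    return (s, 5) if s["a"] < 3 else (s, 6)

def f5(s):            # node 5: a := a + 1, back to the test
    return ({**s, "a": s["a"] + 1}, 4)

def f6(s):            # node 6: c := 4
    return ({**s, "c": 4}, 7)

TRANSFER = {1: f1, 2: f2, 3: f3, 4: f4, 5: f5, 6: f6}
EXIT = 7

def interp(n, s):
    """Operational semantics: run the CFG from node n in state s,
    returning the final state."""
    if n == EXIT:
        return s
    s2, m = TRANSFER[n](s)
    return interp(m, s2)

final = interp(1, {"a": 0, "b": 0, "c": 0})
print(final)  # -> {'a': 3, 'b': 1, 'c': 4}
```

Note that the initial values of the variables do not matter here, since a, b, and c are all assigned before being used.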

Collecting Semantics

While the operational semantics discussed above is defined in terms of the program's CFG, it has two properties that are undesirable as the basis for an abstract interpretation:

  1. It is still just a function from a program's input state to its final state; the result of applying the operational semantics tells us nothing about the intermediate states that arise at each CFG node.
  2. It maps a particular initial state to the corresponding final state. We want a semantics that tells us what can happen for every possible initial state.

The advantage of abstract interpretation compared to the kind of dataflow analysis we studied in CS 701 is that it provides a guarantee about the relationship between the program's semantics and the analysis results. To obtain that advantage, we need a semantics that includes information about the set of states that can arise at each CFG node given any possible initial state. That kind of semantics is called a collecting semantics.

We will define a collecting semantics that maps CFG nodes to sets of states; i.e., for each CFG node n, the collecting semantics tells us what states can arise just before n is executed. The "approximate semantics" that we define using abstract interpretation will compute, for each CFG node, (a finite representation of) a superset of the set of states computed for that node by the collecting semantics. By showing that our abstract interpretation really does compute a superset of the possible states that can arise at each CFG node, we show that it is consistent with the program's actual semantics.

Because the collecting semantics involves sets of states, we need to define transfer functions whose inputs and outputs are sets of states. We'll define one function fn→m for each CFG edge n→m. That transfer function will be defined in terms of the (original) transfer function fn defined for the CFG node n:

    fn→m(S) = { s' | s ∈ S and fn(s) = (s', m) }

For example, the transfer function for edge 2→3 of the example CFG given above would be defined as follows:

    f2→3(S) = { s[a ← 1] | s ∈ S }


What is the transfer function (for the collecting semantics) for edge 4→5 of the example CFG?


Our collecting semantics will be of type CFG-node → set-of-states. The (recursive) definition is given below. It defines the set of states that holds just before node n to be the union of the sets of states produced by applying the transfer functions of all of n's in-edges to the sets of states that hold just before the sources of those in-edges execute:

    coll(n) = the set of all states, if n is the enter node
    coll(n) = ∪ { fm→n(coll(m)) | m→n is an edge of the CFG }, otherwise

And here's the non-recursive definition:

For our example program, we can actually find coll by iterating up from bottom. The elements of concrete poset C are sets of states (each with a value for variables a, b, and c) and the ordering is subset. This means that the bottom element of the poset is the empty set, and the bottom function is the one that ignores its input and returns the empty set. Below is a table that shows the computation of coll. We use the notation [ v1 v2 v3 ] to mean a state in which a=v1, b=v2, and c=v3. A tuple with a star, e.g., [1 * *], represents an infinite set of states, including all possible values in place of the star (so [ * * * ] represents all states, and [1 * *] represents all states in which the only constraint is that a=1).

Iter  Node 1     Node 2     Node 3     Node 4                          Node 5                Node 6     Node 7
  1   [ * * * ]
  2   [ * * * ]  [ * * * ]
  3   [ * * * ]  [ * * * ]  [ 1 * * ]
  4   [ * * * ]  [ * * * ]  [ 1 * * ]  [ 1 1 * ]
  5   [ * * * ]  [ * * * ]  [ 1 * * ]  [ 1 1 * ]                       [ 1 1 * ]
  6   [ * * * ]  [ * * * ]  [ 1 * * ]  [ 1 1 * ] [ 2 1 * ]             [ 1 1 * ]
  7   [ * * * ]  [ * * * ]  [ 1 * * ]  [ 1 1 * ] [ 2 1 * ]             [ 1 1 * ] [ 2 1 * ]
  8   [ * * * ]  [ * * * ]  [ 1 * * ]  [ 1 1 * ] [ 2 1 * ] [ 3 1 * ]   [ 1 1 * ] [ 2 1 * ]
  9   [ * * * ]  [ * * * ]  [ 1 * * ]  [ 1 1 * ] [ 2 1 * ] [ 3 1 * ]   [ 1 1 * ] [ 2 1 * ]   [ 3 1 * ]
 10   [ * * * ]  [ * * * ]  [ 1 * * ]  [ 1 1 * ] [ 2 1 * ] [ 3 1 * ]   [ 1 1 * ] [ 2 1 * ]   [ 3 1 * ]  [ 3 1 4 ]
 11   [ * * * ]  [ * * * ]  [ 1 * * ]  [ 1 1 * ] [ 2 1 * ] [ 3 1 * ]   [ 1 1 * ] [ 2 1 * ]   [ 3 1 * ]  [ 3 1 4 ]

(Each cell is a set of states; a blank cell is the empty set.) The values computed for iterations 10 and 11 are the same, so line 10 of the table defines function coll.
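The fixed-point iteration shown in the table can be sketched in Python. Since [ * * * ] is an infinite set, the sketch below stands in for it with a finite sample of initial states; the CFG is the reconstructed running example (node 2: a := 1, node 3: b := 1, node 4: the test a < 3, node 5: a := a + 1, node 6: c := 4), and all names are illustrative.

```python
from itertools import product

# Node transfer functions, as in the operational semantics:
# each maps a state to (state, successor-node).
def step(n, s):
    if n == 1: return (s, 2)                                   # enter
    if n == 2: return ({**s, "a": 1}, 3)                       # a := 1
    if n == 3: return ({**s, "b": 1}, 4)                       # b := 1
    if n == 4: return (s, 5) if s["a"] < 3 else (s, 6)         # test a < 3
    if n == 5: return ({**s, "a": s["a"] + 1}, 4)              # a := a + 1
    if n == 6: return ({**s, "c": 4}, 7)                       # c := 4

EDGES = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 4), (4, 6), (6, 7)]

def edge_fn(n, m, states):
    """Collecting-semantics transfer function for edge n->m:
    f_{n->m}(S) = { s' | s in S and f_n(s) = (s', m) }."""
    out = set()
    for s in states:
        s2, succ = step(n, dict(s))
        if succ == m:
            out.add(tuple(sorted(s2.items())))
    return out

# Iterate up from bottom (every node mapped to the empty set), with a
# finite sample of initial states standing in for [ * * * ].
init = {tuple(sorted({"a": a, "b": b, "c": c}.items()))
        for a, b, c in product(range(-1, 2), repeat=3)}
coll = {n: set() for n in range(1, 8)}
coll[1] = init
changed = True
while changed:
    changed = False
    for (n, m) in EDGES:
        new = edge_fn(n, m, coll[n]) - coll[m]
        if new:
            coll[m] |= new
            changed = True

# Every state reaching node 7 has a = 3, b = 1, c = 4, matching the table.
assert all(dict(s) == {"a": 3, "b": 1, "c": 4} for s in coll[7])
```

States are stored as sorted tuples of (variable, value) pairs so that they can be placed in Python sets.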


What property of the example program allows us to compute coll? What modification to the program would cause the fixed-point iteration to be infinite (and thus not computable)?


Abstract Interpretation

To define an abstract interpretation we need to do the following: choose an abstract domain (a poset of abstract states, each of which represents a set of concrete states); define abstraction and concretization functions α and γ that form a Galois connection between sets of concrete states and abstract states; and define, for each CFG edge, an abstract transfer function that safely approximates the corresponding collecting-semantics transfer function.

Given an abstract interpretation, we can define the abstract semantics recursively or non-recursively, as we did for the collecting semantics. The definitions given below define the abstract semantics as a mapping CFG-node → abstract state. The abstract state that holds at CFG node n (a safe approximation to the set of concrete states that hold just before n executes) is the join of the abstract states produced by applying the abstract transfer functions of all of node n's incoming CFG edges to the abstract states that hold before those edges' source nodes.

And here's the non-recursive definition:

Example: Constant Propagation

Here is a definition of constant propagation:
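Since the full definition is elided here, the following is an executable sketch of constant propagation as an abstract interpretation, run on the reconstructed example CFG (node 2: a := 1, node 3: b := 1, node 4: the test a < 3, node 5: a := a + 1, node 6: c := 4). An abstract state maps each variable to either a known constant or TOP ("not a constant"); None serves as bottom for nodes not yet reached, and the branch condition at node 4 is not interpreted. All names are illustrative.

```python
TOP = "not-a-constant"   # the "don't know" value

def join_val(v1, v2):
    return v1 if v1 == v2 else TOP

def join(s1, s2):
    """Join of two abstract states (None is bottom)."""
    if s1 is None:
        return s2
    if s2 is None:
        return s1
    return {x: join_val(s1[x], s2[x]) for x in s1}

# Abstract transfer functions for the CFG edges.  Node 4's test is not
# interpreted: both out-edges pass the abstract state through unchanged.
def edge_fn(n, m, s):
    if s is None:
        return None
    if n == 2:                       # a := 1
        return {**s, "a": 1}
    if n == 3:                       # b := 1
        return {**s, "b": 1}
    if n == 5:                       # a := a + 1
        return {**s, "a": (s["a"] + 1 if s["a"] != TOP else TOP)}
    if n == 6:                       # c := 4
        return {**s, "c": 4}
    return dict(s)                   # enter node and the test

EDGES = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 4), (4, 6), (6, 7)]

# Iterate up from bottom; node 1 starts at "all variables unknown".
abs_state = {n: None for n in range(1, 8)}
abs_state[1] = {"a": TOP, "b": TOP, "c": TOP}
changed = True
while changed:
    changed = False
    for (n, m) in EDGES:
        new = join(abs_state[m], edge_fn(n, m, abs_state[n]))
        if new != abs_state[m]:
            abs_state[m] = new
            changed = True

print(abs_state[7])  # b and c map to constants; a maps to TOP
```

At node 7, b = 1 and c = 4 are discovered to be constant, while a is reported as not-a-constant. The result is safe but imprecise: the collecting semantics shows a is always 3 there, but since the test at node 4 is not interpreted, the loop's join loses that fact. This is exactly the sense in which the abstract result is a superset of the collecting-semantics result.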


A Galois insertion is a stronger relationship than a Galois connection. Functions α and γ form a Galois insertion iff

  1. α and γ form a Galois connection, and
  2. for all a in the abstract domain: α(γ(a)) = a

Show that functions α and γ defined above for constant propagation form a Galois insertion (by proving point 2 above).


Comparison with CS 701-style Dataflow Analysis

How does abstract interpretation compare with the kind of dataflow analysis studied in CS 701? One thing that may look like a significant difference, but in fact is not, is the way the analyses are actually carried out. In 701, we did the following: we defined a complete lattice of dataflow facts with no infinite descending chains, defined a monotonic dataflow function for each CFG node, and computed the solution by iterating down from the top element of the lattice.

Abstract interpretation is similar: To ensure termination of an iterative algorithm, the abstract domain must be a complete lattice with no infinite ascending chains, and the abstract transfer functions must still be monotonic. The solution is computed by iterating up from bottom, instead of down from top, but since a complete lattice is "symmetric", this is just a cosmetic difference.

The real difference between the two approaches is that for 701-style dataflow analysis, we define the lattice elements and the CFG-node dataflow functions based only on intuition. Therefore, there is no guarantee that the solution to a dataflow problem has any relationship to the program's semantics. In contrast, since part of abstract interpretation involves showing relationships between the concrete and abstract semantics, we do have such guarantees.

The price we pay is that it is not always clear how to define dataflow problems of interest using abstract interpretation. For example, problems like reaching definitions require knowing more than just the sets of states that can arise at each CFG node: we also need to know which other CFG nodes assigned the values to the variables. This is usually done by defining an instrumented collecting semantics, which keeps additional information (like the label of the CFG node that most recently assigned to a variable) in a state. While this allows reaching definitions to be defined, it may seem rather ad hoc.
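The idea of an instrumented collecting semantics can be sketched concretely: each variable's value is paired with the label of the CFG node that most recently assigned it. The sketch below (names illustrative) instruments the operational semantics of the reconstructed running example (node 2: a := 1, node 3: b := 1, node 4: the test a < 3, node 5: a := a + 1, node 6: c := 4) so that the definitions reaching the exit can be read off the final state.

```python
# Instrumented state: each variable maps to (value, label-of-last-def).
# Label 0 stands for "initial value, defined before the program runs".

def step(n, s):
    if n == 1: return (s, 2)                                   # enter
    if n == 2: return ({**s, "a": (1, 2)}, 3)                  # a := 1
    if n == 3: return ({**s, "b": (1, 3)}, 4)                  # b := 1
    if n == 4: return (s, 5) if s["a"][0] < 3 else (s, 6)      # test a < 3
    if n == 5: return ({**s, "a": (s["a"][0] + 1, 5)}, 4)      # a := a + 1
    if n == 6: return ({**s, "c": (4, 6)}, 7)                  # c := 4

def interp(n, s):
    """Run the instrumented semantics from node n to the exit (node 7)."""
    while n != 7:
        s, n = step(n, s)
    return s

init = {x: (0, 0) for x in "abc"}
final = interp(1, init)
# The definitions reaching the exit: a from node 5, b from node 3, c from node 6.
print({x: lbl for x, (val, lbl) in final.items()})
```

Collecting these instrumented states at each node (exactly as the collecting semantics collects plain states) yields the reaching-definitions information, at the cost of a somewhat ad hoc change to the semantics itself.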

A similar issue arises with backward problems like live variable analysis. One interesting approach to defining the live-variables problem using abstract interpretation and continuation semantics is given in a set of lecture notes called Introduction to Abstract Interpretation by Mads Rosendahl.