# Abstract Interpretation

## Motivation and Overview

Static analysis involves finding properties of programs without actually running them. There are many reasons why people want to do static analysis, including the following:

• to find opportunities for applying transformations to speed up execution;
• to know whether the code is vulnerable to attack;
• to know whether a program's runtime will always be acceptably fast;
• to find (potential) bugs.

Most interesting properties of programs are undecidable, and even those that are not may be very expensive to compute. Therefore, static analysis usually involves some kind of abstraction. For example, instead of keeping track of all of the values that a variable may have at each point in a program, we might only keep track of whether a variable's value is positive, negative, zero, or unknown. Abstraction makes it possible to discover interesting properties of programs, but the results of static analysis are usually incomplete: for example, an analysis may say that a variable's value is unknown at a point when in fact the value will always be positive at that point.

In CS 701 we studied Dataflow Analysis. That is a commonly used framework for static analysis. However, one problem with standard dataflow analysis is that it provides no guarantees that the results are consistent with the program's semantics. In contrast, abstract interpretation is a static-analysis framework that does guarantee that the information gathered about a program is a safe approximation to the program's semantics. This property is achieved by establishing key relationships between the static analysis and the formal semantics.

## Example 1: Rule of Sign

Let's start with a very simple example: We'll define an abstract interpretation of a language of integer expressions including only literals, addition, and multiplication. The goal of the abstract interpretation will be to determine whether each (sub)expression is negative, zero, or positive. For example, if we know that x is negative, while both y and z are positive, we can determine that the expression

x * (y + z)

is negative without actually knowing the values of x, y, and z: Adding two positive integers yields a positive integer, and multiplying a negative integer by a positive one yields a negative integer.

Now let's formalize these ideas.

Syntax. Expressions involve literals, addition, and multiplication.

 exp → n          // literals
     | exp + exp  // addition
     | exp * exp  // multiplication

Standard Interpretation. We define the standard interpretation of expressions using denotational semantics; i.e., we add the definitions of the valuation functions below. Note that the (standard) meaning of an integer expression is an Int.

• E: Exp → Int

E[[n]] = n
E[[E1 + E2]] = E[[E1]] + E[[E2]]
E[[E1 * E2]] = E[[E1]] * E[[E2]]

Abstract Interpretation. We want our abstract interpretation to tell us whether an expression is negative, zero, or positive. However, we can't always do that. For example, a negative plus a positive can be either negative or positive. Therefore, our abstract domain, which we'll call Sign, must include a "don't know" value, num.

Sign = { zero, neg, pos, num }

We'll define the valuation functions for the abstract interpretation in terms of the two tables below, which define the abstract addition (⊕) and multiplication (⊗) operations.

| ⊕    | neg  | zero | pos  | num  |
|------|------|------|------|------|
| neg  | neg  | neg  | num  | num  |
| zero | neg  | zero | pos  | num  |
| pos  | num  | pos  | pos  | num  |
| num  | num  | num  | num  | num  |

| ⊗    | neg  | zero | pos  | num  |
|------|------|------|------|------|
| neg  | pos  | zero | neg  | num  |
| zero | zero | zero | zero | zero |
| pos  | neg  | zero | pos  | num  |
| num  | num  | zero | num  | num  |

Here are the abstract valuation functions; note that the abstract meaning of an integer expression is a Sign.

• Eabs: Exp → Sign

Eabs[[n]] = if (n<0) then neg else if (n=0) then zero else pos
Eabs[[E1 + E2]] = Eabs[[E1]] ⊕ Eabs[[E2]]
Eabs[[E1 * E2]] = Eabs[[E1]] ⊗ Eabs[[E2]]

And here is an example of applying the abstract interpretation to an expression:

Eabs[[ -22 * (14 + 7) ]] =
Eabs[[-22]] ⊗ Eabs[[14 + 7]] =
neg ⊗ (Eabs[[14]] ⊕ Eabs[[7]]) =
neg ⊗ (pos ⊕ pos) =
neg ⊗ pos =
neg
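The abstract interpreter above is easy to sketch in code. Here is an illustrative Python version (not part of the notes): expressions are nested tuples, and the two tables are encoded as dictionaries keyed on pairs of Sign values.

```python
# Sketch of the rule-of-signs abstract interpretation.
# Expressions: ('lit', n), ('add', e1, e2), ('mul', e1, e2).

NEG, ZERO, POS, NUM = 'neg', 'zero', 'pos', 'num'

# Abstract addition table (left operand is the row, right operand the column).
ABS_ADD = {
    (NEG, NEG): NEG,   (NEG, ZERO): NEG,   (NEG, POS): NUM,   (NEG, NUM): NUM,
    (ZERO, NEG): NEG,  (ZERO, ZERO): ZERO, (ZERO, POS): POS,  (ZERO, NUM): NUM,
    (POS, NEG): NUM,   (POS, ZERO): POS,   (POS, POS): POS,   (POS, NUM): NUM,
    (NUM, NEG): NUM,   (NUM, ZERO): NUM,   (NUM, POS): NUM,   (NUM, NUM): NUM,
}

# Abstract multiplication table.
ABS_MUL = {
    (NEG, NEG): POS,   (NEG, ZERO): ZERO,  (NEG, POS): NEG,   (NEG, NUM): NUM,
    (ZERO, NEG): ZERO, (ZERO, ZERO): ZERO, (ZERO, POS): ZERO, (ZERO, NUM): ZERO,
    (POS, NEG): NEG,   (POS, ZERO): ZERO,  (POS, POS): POS,   (POS, NUM): NUM,
    (NUM, NEG): NUM,   (NUM, ZERO): ZERO,  (NUM, POS): NUM,   (NUM, NUM): NUM,
}

def eval_abs(e):
    """Eabs: map an expression to a Sign value."""
    tag = e[0]
    if tag == 'lit':
        n = e[1]
        return NEG if n < 0 else ZERO if n == 0 else POS
    table = ABS_ADD if tag == 'add' else ABS_MUL
    return table[(eval_abs(e[1]), eval_abs(e[2]))]

# The worked example: -22 * (14 + 7) abstracts to neg.
e = ('mul', ('lit', -22), ('add', ('lit', 14), ('lit', 7)))
print(eval_abs(e))   # prints "neg"
```

The dictionary encoding makes it easy to check the tables against the text entry by entry.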

## Relationship between Standard and Abstract Interpretations

The abstract interpretation defined above for the "rule-of-signs" example was very simple and intuitive. Assuming that I didn't make any typographical errors when typing in the two tables, it shouldn't be hard to convince yourself that the abstract semantics is consistent with the standard (concrete) semantics. However, to be sure that this consistency holds, we must do the following:

1. Define two partially-ordered sets (posets) C and A. The elements of C are all non-empty sets of values from the concrete domain (i.e., sets of integers). The elements of A are the values from the abstract domain (i.e., neg, zero, pos, and num).
2. Define an abstraction function α that maps non-empty sets of integer values to Sign values (i.e., α is of type C → A).
3. Define a concretization function γ that maps Sign values to non-empty sets of integer values (i.e., γ is of type A → C).
4. Show that α and γ form a Galois connection (defined below).

5. For each possible form of expression exp, show that
{ E[[exp]] } ⊆ γ(Eabs[[exp]])
where ⊆ is the ordering of poset C, i.e., the subset ordering.

1. Abstraction function α.

For the rule-of-signs example, the abstraction function is defined as follows:

 α({0}) = zero
 α(S) = if all values in S are greater than 0 then pos
        else if all values in S are less than 0 then neg
        else num

2. Concretization function γ.

And the concretization function is defined as follows:

 γ(zero) = {0}
 γ(pos)  = {all positive ints}
 γ(neg)  = {all negative ints}
 γ(num)  = Int (i.e., all ints)

3. Galois Connection.

A Galois connection is a pair of functions, α and γ between two partially ordered sets (C, ⊆) and (A, ≤), such that both of the following hold.

1. ∀ a ∈ A, c ∈ C: α(c) ≤ a iff c ⊆ γ(a)
2. ∀ a ∈ A: α(γ(a)) ≤ a
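Condition 1 can be sanity-checked mechanically. The sketch below (illustrative code, not from the notes) encodes α, the Sign ordering, and membership in γ(a), then tests "α(c) ≤ a iff c ⊆ γ(a)" on a sample of finite concrete sets; γ(a) itself is infinite, so it is represented by a membership predicate rather than an explicit set.

```python
# Empirically check condition 1 of the Galois connection on finite samples.
from itertools import combinations

SIGNS = ['neg', 'zero', 'pos', 'num']

def leq(a1, a2):
    """Ordering on Sign: num is top; neg, zero, pos are incomparable."""
    return a1 == a2 or a2 == 'num'

def alpha(s):
    """Abstraction: map a non-empty set of ints to a Sign value."""
    if s == {0}:
        return 'zero'
    if all(n > 0 for n in s):
        return 'pos'
    if all(n < 0 for n in s):
        return 'neg'
    return 'num'

def in_gamma(n, a):
    """Membership predicate for the (possibly infinite) set γ(a)."""
    return {'zero': n == 0, 'pos': n > 0, 'neg': n < 0, 'num': True}[a]

universe = [-2, -1, 0, 1, 2]
samples = [set(c) for r in (1, 2, 3) for c in combinations(universe, r)]
for c in samples:
    for a in SIGNS:
        # condition 1: alpha(c) <= a  iff  c is a subset of gamma(a)
        assert leq(alpha(c), a) == all(in_gamma(n, a) for n in c)
print("condition 1 held on all sampled sets")
```

Passing on samples is of course not a proof; the TEST YOURSELF question below asks for the real argument.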

Here are the two relationships we need, presented pictorially: For our example, poset A is the set containing the four elements of Sign (with num as the top element, and no ordering relationship among the other three elements), and poset C is the set of all sets of integers, ordered by subset. Here is a picture with all of A and some of C. Some of the alpha mapping (the abstraction function) is shown using red arrows, and some of the gamma mapping (the concretization function) is shown using blue arrows.

TEST YOURSELF #1

Question 1: Fill in the remaining alpha and gamma edges in the figure above.

Question 2: Show that alpha and gamma do form a Galois connection.


4. Safety. Our final obligation in proving that our rule-of-signs abstract interpretation is consistent with the standard semantics is to prove that, for every expression exp,

{ E[[exp]] } ⊆ γ(Eabs[[exp]])
This can be done using structural induction.

Base case: exp is literal k. This case has three parts (based on the definition of Eabs):

1. k < 0: In this case,
 E[[k]] = k                     // def of E
 Eabs[[k]] = neg                // def of Eabs
 γ(neg) = {all negative ints}   // def of γ
and {k} is a subset of {all negative ints} (case proved).

2. k = 0: In this case,
 E[[k]] = 0                     // def of E
 Eabs[[k]] = zero               // def of Eabs
 γ(zero) = {0}                  // def of γ
and {0} is a subset of {0} (case proved).

3. k > 0: In this case,
 E[[k]] = k                     // def of E
 Eabs[[k]] = pos                // def of Eabs
 γ(pos) = {all positive ints}   // def of γ
and {k} is a subset of {all positive ints} (case proved).

Inductive Step

The inductive step is quite tedious. There are two cases (one for addition and one for multiplication), and each has 16 sub-cases (for all possible combinations of the signs of the two sub-expressions). Here is one example to show the flavor of the proof.

Inductive case 1: exp is e1 + e2.

 RHS: γ(Eabs[[e1 + e2]])
    = γ(Eabs[[e1]] ⊕ Eabs[[e2]])   // def of Eabs

Sub-case 1: both Eabs[[e1]] and Eabs[[e2]] are neg.

    = γ(neg ⊕ neg)
    = γ(neg)                       // def of ⊕
    = { all negative ints }        // def of γ

 LHS: E[[e1 + e2]] = E[[e1]] + E[[e2]]   // def of E

By the induction hypothesis, {E[[e1]]} is a subset of γ(Eabs[[e1]]), which is γ(neg), which is the set of all negative ints. The same applies to E[[e2]]. Thus, the LHS is the sum of two negative ints, which is a negative int, which is certainly an element of { all negative ints } (the final value for the RHS).
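A cheap complement to the (tedious) 32-sub-case proof is random testing of the safety condition. The sketch below (illustrative, and of course not a substitute for the induction) evaluates random expressions under both interpretations and checks that the concrete value always lies in the concretization of the abstract value; the abstract operations here are compact function versions of the ⊕ and ⊗ tables.

```python
# Randomly test { E[[exp]] } ⊆ γ(Eabs[[exp]]).
import random

def eval_std(e):
    """E: the standard interpretation."""
    tag = e[0]
    if tag == 'lit':
        return e[1]
    v1, v2 = eval_std(e[1]), eval_std(e[2])
    return v1 + v2 if tag == 'add' else v1 * v2

def abs_add(a1, a2):
    """Abstract ⊕, equivalent to the addition table."""
    if a1 == a2:
        return a1
    if 'zero' in (a1, a2):                 # zero is the identity
        return a1 if a2 == 'zero' else a2
    return 'num'                           # neg+pos, or anything with num

def abs_mul(a1, a2):
    """Abstract ⊗, equivalent to the multiplication table."""
    if 'zero' in (a1, a2):
        return 'zero'
    if 'num' in (a1, a2):
        return 'num'
    return 'pos' if a1 == a2 else 'neg'

def eval_abs(e):
    """Eabs: the abstract interpretation."""
    tag = e[0]
    if tag == 'lit':
        return 'neg' if e[1] < 0 else 'zero' if e[1] == 0 else 'pos'
    a1, a2 = eval_abs(e[1]), eval_abs(e[2])
    return abs_add(a1, a2) if tag == 'add' else abs_mul(a1, a2)

def in_gamma(n, a):
    """Membership in γ(a)."""
    return {'neg': n < 0, 'zero': n == 0, 'pos': n > 0, 'num': True}[a]

def rand_exp(depth):
    if depth == 0 or random.random() < 0.3:
        return ('lit', random.randint(-5, 5))
    return (random.choice(['add', 'mul']), rand_exp(depth - 1), rand_exp(depth - 1))

random.seed(0)
for _ in range(10000):
    e = rand_exp(4)
    assert in_gamma(eval_std(e), eval_abs(e))
print("safety condition held on 10000 random expressions")
```
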

TEST YOURSELF #2

In what way does proving that

{ E[[exp]] } ⊆ γ(Eabs[[exp]])
show that our rule-of-signs abstract interpretation is consistent with the standard semantics?


## Standard and Collecting Semantics for CFGs

For the simple rule-of-signs example, we were able to define an abstract interpretation as a variation on the standard denotational semantics. For more realistic static-analysis problems, however, the standard denotational semantics is usually not a good place to start. This is because we usually want the results of static analysis to tell us what holds at each point in the program, and program points are usually defined to be the nodes of the program's control-flow graph (CFG). For example, for constant propagation we want to know, for each CFG node, which variables are guaranteed to have constant values when execution reaches that node. Therefore, it is better to start with a (standard) semantics defined in terms of a CFG.

### Standard Semantics

There are various ways to define a CFG semantics. The most straightforward is to define what is called an operational semantics; think of it as an interpreter whose input is the entry node of a CFG plus an initial state (a mapping from variables to values), and whose output is the program's final state. We'll define the standard semantics in terms of transfer functions, one for each CFG node. These are (semantic) functions whose inputs are states and whose outputs are pairs that include both an output state and the CFG node that is the appropriate successor. A node's transfer function captures the execution semantics of that node and specifies the next node to be executed.

```
      +----------+
      | 1: start |
      +----------+
            |
            v
      +----------+
      | 2: a = 1 |
      +----------+
            |
            v
      +----------+
      | 3: b = 1 |
      +----------+
            |
            v
      +----------+  F  +------------+     +---------+
+---> | 4: a < 3 |---->| 6: c = a+b |---->| 7: exit |
|     +----------+     +------------+     +---------+
|           |
|         T |
|           v
|     +------------+
|     | 5: a = a+b |
|     +------------+
|           |
|           |
+-----------+
```

For this example, the transfer function for node 2 would be defined as follows:

λs.(s[a ← 1], 3)

where s[a ← 1] means "a new state that is the same as s except that it maps variable a to 1." For node 4, the transfer function would be

λs.(if lookup(s, a) < 3 then (s, 5) else (s, 6))

In this case, the output state is the same as the input state; the successor node depends on whether variable a is less than 3 in the current (input) state.

Here's a (recursive) definition of the interpreter (the operational semantics). We use fn to mean the transfer function defined for CFG node n.

 interp = λs.λn. if isExitNode(n)
                 then s
                 else let (s', n') = fn(s) in interp s' n'

Because this definition is recursive, we need to use the usual trick of abstracting on the function and defining the operational semantics as the least fixed point of that abstraction:

 semantics = fix(λF.λs.λn. if isExitNode(n)
                           then s
                           else let (s', n') = fn(s) in F s' n')
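The interpreter is easy to run on the example CFG. Here is an illustrative Python sketch (the encoding of states as dicts and the TRANSFER map are my own, not from the notes); the recursion is written as a loop, which computes the same fixed point:

```python
# Operational semantics of the example CFG, one transfer function per node.
# A state is a dict from variable names to values; each transfer function
# returns (output state, successor node).

def f2(s): return (dict(s, a=1), 3)              # 2: a = 1
def f3(s): return (dict(s, b=1), 4)              # 3: b = 1
def f4(s): return (s, 5) if s['a'] < 3 else (s, 6)   # 4: a < 3
def f5(s): return (dict(s, a=s['a'] + s['b']), 4)    # 5: a = a+b
def f6(s): return (dict(s, c=s['a'] + s['b']), 7)    # 6: c = a+b

TRANSFER = {1: lambda s: (s, 2), 2: f2, 3: f3, 4: f4, 5: f5, 6: f6}

def interp(s, n):
    """Run the CFG from node n in state s; node 7 is the exit node."""
    while n != 7:
        s, n = TRANSFER[n](s)
    return s

print(interp({'a': 0, 'b': 0, 'c': 0}, 1))   # final state: a=3, b=1, c=4
```

Note that this program reaches the same final state for every initial state, since a, b, and c are all assigned before being used.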

### Collecting Semantics

While the operational semantics discussed above is defined in terms of the program's CFG, it has two properties that are undesirable as the basis for an abstract interpretation:

1. It is still just a function from a program's input state to its final state; the result of applying the operational semantics tells us nothing about the intermediate states that arise at each CFG node.
2. It maps a particular initial state to the corresponding final state. We want a semantics that tells us what can happen for every possible initial state.
The advantage of abstract interpretation compared to the kind of dataflow analysis we studied in CS 701 is that it provides a guarantee about the relationship between the program's semantics and the analysis results. To obtain that advantage, we need a semantics that includes information about the set of states that can arise at each CFG node given any possible initial state. That kind of semantics is called a collecting semantics.

We will define a collecting semantics that maps CFG nodes to sets of states; i.e., for each CFG node n, the collecting semantics tells us what states can arise just before n is executed. The "approximate semantics" that we define using abstract interpretation will compute, for each CFG node, (a finite representation of) a superset of the set of states computed for that node by the collecting semantics. By showing that our abstract interpretation really does compute a superset of the possible states that can arise at each CFG node, we show that it is consistent with the program's actual semantics.

Because the collecting semantics involves sets of states, we need to define transfer functions whose inputs and outputs are sets of states. We'll define one function fn→m for each CFG edge n→m. That transfer function will be defined in terms of the (original) transfer function fn defined for the CFG node n:

fn→m = λS.{s' | s∈S and fn(s) = (s', m)}

For example, the transfer function for edge 2→3 of the example CFG given above would be defined as follows:

λS.{s[a ← 1] | s ∈ S}

TEST YOURSELF #3

What is the transfer function (for the collecting semantics) for edge 4→5 of the example CFG?

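The lifting of a node transfer function fn to an edge transfer function fn→m can be sketched as a higher-order function. In this illustrative Python fragment (names and the tuple encoding of states are assumptions, not from the notes), a state is a hashable tuple (a, b, c) so that sets of states are ordinary Python sets:

```python
# Lift a node transfer function fn to the edge transfer function fn→m:
#   fn→m = λS.{s' | s ∈ S and fn(s) = (s', m)}

def lift(fn, m):
    def edge_fn(S):
        return {s2 for s in S for (s2, n2) in [fn(s)] if n2 == m}
    return edge_fn

# Node 4's transfer function on tuple states (a, b, c): branch on a < 3.
def f4(s):
    return (s, 5) if s[0] < 3 else (s, 6)

f4_to_5 = lift(f4, 5)
S = {(1, 1, 0), (2, 1, 0), (3, 1, 0)}
print(f4_to_5(S))   # only the states with a < 3 flow along edge 4→5
```

The edge function for 4→6 is obtained the same way, with the filter keeping the states where a < 3 is false.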

Our collecting semantics will be of type CFG-node → set-of-states. The (recursive) definition is given below. It defines the set of states that holds just before node n to be the union of the sets of states produced by applying the transfer functions of all of n's in-edges to the sets of states that hold just before the sources of those in-edges execute.

 recColl = λn. if isEnterNode(n)
               then { all states }
               else let P = preds(n) in ∪p ∈ P fp→n(recColl(p))

And here's the non-recursive definition:

 coll = fix(λF.λn. if isEnterNode(n)
                   then { all states }
                   else let P = preds(n) in ∪p ∈ P fp→n(F(p)))

For our example program, we can actually find coll by iterating up from bottom. The elements of concrete poset C are sets of states (each with a value for variables a, b, and c) and the ordering is subset. This means that the bottom element of the poset is the empty set, and the bottom function is the one that ignores its input and returns the empty set. Below is a table that shows the computation of coll. We use the notation [ v1 v2 v3 ] to mean a state in which a=v1, b=v2, and c=v3. A tuple with a star, e.g., [1 * *], represents an infinite set of states, including all possible values in place of the star (so [ * * * ] represents all states, and [1 * *] represents all states in which the only constraint is that a=1).

The values computed for iterations 10 and 11 are the same, so the row for iteration 10 of the table defines function coll.

| Iteration # | Node 1 | Node 2 | Node 3 | Node 4 | Node 5 | Node 6 | Node 7 |
|---|---|---|---|---|---|---|---|
| 0 | ∅ | ∅ | ∅ | ∅ | ∅ | ∅ | ∅ |
| 1 | [ * * * ] | ∅ | ∅ | ∅ | ∅ | ∅ | ∅ |
| 2 | [ * * * ] | [ * * * ] | ∅ | ∅ | ∅ | ∅ | ∅ |
| 3 | [ * * * ] | [ * * * ] | [ 1 * * ] | ∅ | ∅ | ∅ | ∅ |
| 4 | [ * * * ] | [ * * * ] | [ 1 * * ] | [ 1 1 * ] | ∅ | ∅ | ∅ |
| 5 | [ * * * ] | [ * * * ] | [ 1 * * ] | [ 1 1 * ] | [ 1 1 * ] | ∅ | ∅ |
| 6 | [ * * * ] | [ * * * ] | [ 1 * * ] | [ 1 1 * ] [ 2 1 * ] | [ 1 1 * ] | ∅ | ∅ |
| 7 | [ * * * ] | [ * * * ] | [ 1 * * ] | [ 1 1 * ] [ 2 1 * ] | [ 1 1 * ] [ 2 1 * ] | ∅ | ∅ |
| 8 | [ * * * ] | [ * * * ] | [ 1 * * ] | [ 1 1 * ] [ 2 1 * ] [ 3 1 * ] | [ 1 1 * ] [ 2 1 * ] | ∅ | ∅ |
| 9 | [ * * * ] | [ * * * ] | [ 1 * * ] | [ 1 1 * ] [ 2 1 * ] [ 3 1 * ] | [ 1 1 * ] [ 2 1 * ] | [ 3 1 * ] | ∅ |
| 10 | [ * * * ] | [ * * * ] | [ 1 * * ] | [ 1 1 * ] [ 2 1 * ] [ 3 1 * ] | [ 1 1 * ] [ 2 1 * ] | [ 3 1 * ] | [ 3 1 4 ] |
| 11 | [ * * * ] | [ * * * ] | [ 1 * * ] | [ 1 1 * ] [ 2 1 * ] [ 3 1 * ] | [ 1 1 * ] [ 2 1 * ] | [ 3 1 * ] | [ 3 1 4 ] |
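The fixed-point iteration can also be run mechanically. The sketch below is illustrative: since the real collecting semantics starts from the *infinite* set [ * * * ], the code restricts the initial values of a, b, and c to a small finite sample so that every set stays finite; on this program that loses nothing, because every reachable state is eventually determined by the assignments.

```python
# Fixed-point computation of coll on the example CFG, with states encoded
# as tuples (a, b, c) and the set of "all states" cut down to a finite sample.
from itertools import product

def f1(s): return (s, 2)
def f2(s): return ((1, s[1], s[2]), 3)                     # a = 1
def f3(s): return ((s[0], 1, s[2]), 4)                     # b = 1
def f4(s): return (s, 5) if s[0] < 3 else (s, 6)           # a < 3
def f5(s): return ((s[0] + s[1], s[1], s[2]), 4)           # a = a+b
def f6(s): return ((s[0], s[1], s[0] + s[1]), 7)           # c = a+b

NODE_FN = {1: f1, 2: f2, 3: f3, 4: f4, 5: f5, 6: f6}
PREDS = {2: [1], 3: [2], 4: [3, 5], 5: [4], 6: [4], 7: [6]}

ALL_STATES = set(product(range(-2, 5), repeat=3))   # finite stand-in for [ * * * ]

coll = {n: set() for n in range(1, 8)}              # start from bottom (empty sets)
changed = True
while changed:
    changed = False
    for n in range(1, 8):
        new = ALL_STATES if n == 1 else {
            s2 for p in PREDS[n] for s in coll[p]
            for (s2, n2) in [NODE_FN[p](s)] if n2 == n}
        if new != coll[n]:
            coll[n], changed = new, True

print(coll[7])   # every state reaching the exit has a=3, b=1, c=4
```

As in the table, node 4 accumulates {[ 1 1 * ], [ 2 1 * ], [ 3 1 * ]} and node 7 ends up with the single state [ 3 1 4 ].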

TEST YOURSELF #4

What property of the example program allows us to compute coll? What modification to the program would cause the fixed-point iteration to be infinite (and thus not computable)?


## Abstract Interpretation

To define an abstract interpretation we need to do the following:

• Define the abstract domain A, the abstraction function α, and the concretization function γ.
• Show that α and γ form a Galois connection.
• For each CFG edge n→m, define an abstract transfer function f#n→m.
• Show that the abstract transfer functions are consistent with the concrete ones; i.e., for each abstract f# and corresponding concrete f:
1. start with an arbitrary concrete-domain item c
2. let c' = f(c)
3. let a = α(c)
4. let a' = f#(a)
5. let c'' = γ(a')
6. show that c' ⊆ c''

This proof obligation is illustrated in the diagram below; the ⊆ relationship that must be proved (step 6) is shown using a purple line in the concrete domain.

Given an abstract interpretation, we can define the abstract semantics recursively or non-recursively, as we did for the collecting semantics. The definitions given below define the abstract semantics as a mapping CFG-node → abstract state. The abstract state that holds at CFG node n (a safe approximation to the set of concrete states that hold just before n executes) is the join of the abstract states produced by applying the abstract transfer functions of all of node n's incoming CFG edges to the abstract states that hold before those edges' source nodes.

 recAbs = λn. if isEnterNode(n)
              then α({all states})
              else let P = preds(n) in ⊔p ∈ P f#p→n(recAbs(p))

And here's the non-recursive definition:

 abs = fix(λF.λn. if isEnterNode(n)
                  then α({all states})
                  else let P = preds(n) in ⊔p ∈ P f#p→n(F(p)))

## Example: Constant Propagation

Here is a definition of constant propagation:

• The elements of the abstract domain A are abstract states that map variables to values, including the special value ? (which means that the corresponding set of concrete states includes states that map the variable to different values). The abstract domain also includes a special bottom element ⊥.

The ordering ⊑ of the abstract domain is based on the underlying flat ordering of individual values, in which ? is the top element and all other values are incomparable. Given two abstract states, a1 and a2, a1 ⊑ a2 iff

• a1 is ⊥, or
• every variable x mapped to a non-? value in a2 is mapped to the same value in a1.

• The concrete domain is the one defined earlier, whose elements are sets of states (each with a value for every variable), and whose ordering is subset (i.e., S1 ⊆ S2 iff S1 is a subset of S2).

• The abstraction function maps the empty set to ⊥; it maps every non-empty set S of concrete states to a single abstract state: For each variable x, if x has the same value v in every concrete state in S, then it is mapped to v in the abstract state. Otherwise, it is mapped to the special value ?.

• The concretization function is the obvious dual of the abstraction function: It maps ⊥ to the empty set. Given an abstract state a in which no variable is mapped to ?, the concrete set of states S contains just one state, the state that maps each variable to the same value that a does. Otherwise, S is an infinite set: every state s in S maps each variable x that has a non-? value v in a to v, and the variables mapped to ? take on all possible combinations of values.

• It should be clear that, as defined, α and γ form a Galois connection.

• The abstract transfer functions are essentially the ones we used when defining constant propagation in CS 701. All abstract transfer functions are strict: if their input is bottom, then their output is also bottom. Otherwise, transfer function f#n→m is defined as follows:
• If node n doesn't modify any variables, then f#n→m is the identity function.
• If node n represents x = y + z, then f#n→m is defined as follows:

λs.s[x ← lookup(s, y) ⊕ lookup(s, z)]

where s is an abstract state, and ⊕ returns ? if either of its arguments is ? (and otherwise is the same as regular +).
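Putting the pieces together, here is an illustrative Python sketch of constant propagation on the example CFG (the encoding of abstract states as dicts, and the EDGE_FN names, are my own). An abstract state maps each variable to an int or '?'; bottom is None; as in the 701-style analysis, the transfer functions for both out-edges of the branch node 4 are the identity, so both successors are considered.

```python
# Constant propagation over the example CFG, iterating up from bottom.

def join(a1, a2):
    """Join of two abstract states; None is bottom."""
    if a1 is None: return a2
    if a2 is None: return a1
    return {x: (a1[x] if a1[x] == a2[x] else '?') for x in a1}

def assign(a, x, v):
    """Strict abstract assignment."""
    return None if a is None else dict(a, **{x: v})

def abs_add(v1, v2):
    """Abstract +: '?' if either argument is '?'."""
    return '?' if '?' in (v1, v2) else v1 + v2

EDGE_FN = {   # one abstract transfer function per CFG edge
    (1, 2): lambda a: a,
    (2, 3): lambda a: assign(a, 'a', 1),
    (3, 4): lambda a: assign(a, 'b', 1),
    (4, 5): lambda a: a,
    (4, 6): lambda a: a,
    (5, 4): lambda a: None if a is None else assign(a, 'a', abs_add(a['a'], a['b'])),
    (6, 7): lambda a: None if a is None else assign(a, 'c', abs_add(a['a'], a['b'])),
}

PREDS = {2: [1], 3: [2], 4: [3, 5], 5: [4], 6: [4], 7: [6]}
TOP = {'a': '?', 'b': '?', 'c': '?'}        # α({all states})

abs_state = {n: None for n in range(1, 8)}  # start from bottom
changed = True
while changed:
    changed = False
    for n in range(1, 8):
        new = TOP if n == 1 else None
        for p in PREDS.get(n, []):
            new = join(new, EDGE_FN[(p, n)](abs_state[p]))
        if new != abs_state[n]:
            abs_state[n], changed = new, True

print(abs_state[7])   # b is the constant 1; a and c are '?'
```

Because the analysis does not interpret the branch condition at node 4, the values 1, 2, and 3 for a are joined to '?' around the loop; only b is found to be constant at the exit node.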

TEST YOURSELF #5

A Galois insertion is a stronger relationship than a Galois connection. Functions α and γ form a Galois insertion iff

1. α and γ form a Galois connection, and
2. for all a in the abstract domain: α(γ(a)) = a

Show that functions α and γ defined above for constant propagation form a Galois insertion (by proving point 2 above).


## Comparison with CS 701-style Dataflow Analysis

How does abstract interpretation compare with the kind of dataflow analysis studied in CS 701? One thing that may look like a significant difference but in fact is not, is the way the analyses are actually carried out. In 701, we did the following:

• Define a complete lattice L with no infinite descending chains. The elements of L are the dataflow facts.
• Specify one lattice element as the special "initial" value.
• For each CFG edge n→m, define a monotonic function fn→m of type L → L. The function for the edge out of the enter node ignores its input and produces the special initial value as its output. (Note: We sometimes defined the functions on CFG nodes rather than on CFG edges. There are examples where putting the functions on the nodes is more convenient, and other examples where putting the functions on the edges is more convenient. In both cases, the functions capture the same semantics, so there is not really a significant difference.)
• Define a cross-product lattice LX whose tuples have as many items as there are CFG nodes. Also define a (monotonic) function F of type LX → LX, that uses functions fm→n to define the nth "slot" of a tuple.
• To solve a dataflow problem, iterate down from top; i.e., start with the top element of lattice LX, then apply function F repeatedly until there is no change.

Abstract interpretation is similar: To ensure termination of an iterative algorithm, the abstract domain must be a complete lattice with no infinite ascending chains, and the abstract transfer functions must still be monotonic. The solution is computed by iterating up from bottom, instead of down from top, but since a complete lattice is "symmetric", this is just a cosmetic difference.

The real difference between the two approaches is that for 701-style dataflow analysis, we define the lattice elements and the CFG-node dataflow functions based only on intuition. Therefore, there is no guarantee that the solution to a dataflow problem has any relationship to the program's semantics. In contrast, since part of abstract interpretation involves showing relationships between the concrete and abstract semantics, we do have such guarantees.

The price we pay is that it is not always clear how to define dataflow problems of interest using abstract interpretation. For example, problems like reaching definitions require knowing more than just the sets of states that can arise at each CFG node: we also need to know which other CFG nodes assigned the values to the variables. This is usually done by defining an instrumented collecting semantics, which keeps additional information (like the label of the CFG node that most recently assigned to a variable) in a state. While this allows reaching definitions to be defined, it may seem rather ad hoc.