Research Summary
My research concerns a core issue in Computer Science: how to develop
correct, reliable, and secure software.
To address this issue, my group works on creating better tools for
manipulating programs and analyzing their execution properties.
The work uses ideas that come primarily from the fields of programming languages,
compilers, and software engineering, but there are connections to databases,
machine learning, and even numerical analysis.
Whenever possible, our tools take the form of frameworks that apply to general
classes of problems.
The benefit of developing a solution in the form of a framework is that
it allows the work to address "all of next year's problems, not just
the one [currently] in front of [your] face"
[H86, p. 7].
Many of the tools that we work on are based on static analysis,
which provides a way to obtain information about the possible
states that a program reaches during execution, but without actually
running the program on specific inputs. Instead, static-analysis
techniques explore a program's behavior for all possible inputs
and all possible states that the program can reach. To make this feasible,
the program is "run in the aggregate"—i.e., on descriptors
that represent collections of many states.
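As a toy illustration of running a program "in the aggregate", the following sketch (ours, not from any of the tools described below; all names are illustrative) executes an addition over intervals, where each interval descriptor stands for many concrete values at once:

```python
# Toy example of "running in the aggregate": instead of single integers,
# the analysis manipulates intervals, each standing for a set of values.

class Interval:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi

    def __add__(self, other):
        # Abstract addition: sound for every pair of concrete values
        # drawn from the two intervals.
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def join(self, other):
        # Least upper bound: used where control-flow paths merge.
        return Interval(min(self.lo, other.lo), max(self.hi, other.hi))

    def __repr__(self):
        return f"[{self.lo}, {self.hi}]"

x = Interval(0, 10)   # x could be anything in [0, 10]
y = Interval(1, 2)    # y is 1 or 2
print(x + y)          # [1, 12]: covers all 22 concrete combinations
print(x.join(y))      # [0, 10]: merge of two paths' facts
```

One pass over such descriptors accounts for every concrete input at once, which is what makes exploring all possible states feasible.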
Most of our current efforts are focused on the following topics:
A summary of some of my past research work, including work on
frameworks for solving other classes of problems, can be found
here.
In our work on static analysis of stripped executables, we have devised methods
that—without relying on symbol-table or debugging
information—identify variables, data
objects, and types, and track how they are manipulated by the
executable. This information provides answers to such questions as
"what could this dereference operation access?", "what function
could be called at this indirect call site?", etc., which can help
human security analysts understand the workings of COTS components,
plugins, mobile code, and DLLs, as well as memory snapshots of worms
and virus-infected code.
[More]
The TSL system (TSL stands for Transformer Specification Language)
provides a framework for creating
correct-by-construction implementations of the
state-transformation functions needed in
state-space-exploration tools that analyze machine code.
From a single specification of the concrete
semantics of a machine-code instruction set, TSL automatically
generates state-transformation functions needed for
static analysis, dynamic analysis, symbolic analysis, or
any combination of the three.
The generated implementations are guaranteed to be mutually
consistent, and also to be consistent with an instruction-set
emulator that is generated from the same specification of the
concrete semantics.
[More]
Our work on weighted pushdown systems led to a new framework for
interprocedural dataflow analysis that supports a strictly richer set
of queries than previous approaches
(a problem with a 35-year history). We have also
shown that WPDSs are a useful formalism in a completely separate
application domain, namely, access control of shared computing
resources in distributed systems.
[More]
Since 2002, I have been interested in a connection
between abstract interpretation and logic—in
particular, how to harness decision procedures to obtain algorithms
for several fundamental primitives used in abstract interpretation.
In [RSY04],
Mooly Sagiv, Greta Yorsh, and I gave the first method that makes
such a connection: we showed that, for a broad class of situations, given
a formula φ in some logic L, it is possible to find the best value a in
a given abstract domain Abs that over-approximates the
meaning of φ (i.e., [[φ]] ⊆ γ(a)).
Since 2011, my student Aditya Thakur and I have been pursuing
several new insights on this fundamental question.
[More]
Our work on policy weaving focuses on a declarative approach to
system security.
In our approach, a policy specification is a first-class notion—e.g.,
the policy specification for a given application is a single
document, so that all elements of the policy are gathered together in
one place, rather than scattered throughout the codebase.
A security policy is implemented in terms of a collection of
code snippets that invoke policy-enforcement primitives at different
points in the program;
however, users are freed from the task of introducing such snippets
manually by a policy-weaving tool, which takes as input
(i) a program P, and
(ii) a policy R (written in some suitable policy-specification language),
and rewrites P to enforce R.
[More]
Analysis of Executables
In our work on static analysis of stripped executables, we devised
methods that—without relying on symbol-table or debugging
information—recover intermediate representations similar to
those that a compiler's intermediate phases create for a program
written in a high-level language.
The goal is to provide a platform that an analyst can
use to understand the workings of COTS components, plugins, mobile
code, and DLLs, as well as memory snapshots of worms and
virus-infected code.
Our initial work concerned x86 executables.
Starting from a stripped x86 executable, our algorithms identify
variables, data objects, and types, and track how they are manipulated
by the executable. This information provides answers to such
questions as "what could this dereference operation access?", "what
function could be called at this indirect call site?", etc.
This work required overcoming numerous obstacles.
In particular, to track the flow of values through memory, it is
necessary to determine the set of addresses accessed by each
operation. This is difficult because
- While some memory operations use explicit memory addresses in the
  instruction (easy), others use indirect addressing via address
  expressions (difficult).
- Arithmetic on addresses is pervasive. For instance, even when the
  value of a local variable is loaded from its slot in an activation
  record, address arithmetic is performed.
- There is no explicit notion of type at the hardware level: any
  integer value can be used as an address value.
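The second point above can be made concrete with a small simulation (hypothetical; the register names and frame layout are illustrative): even a plain local-variable load computes its address arithmetically from the frame pointer, so an analysis must reason about integer arithmetic just to discover which memory cell is accessed:

```python
# Simulating the load of a local variable from an activation record.
# On x86, "mov eax, [ebp-8]" reads the slot 8 bytes below the frame
# pointer: the address is *computed*, not named, so the analysis must
# track arithmetic on addresses to resolve the access.

memory = {}          # address -> value
ebp = 0x1000         # frame pointer of the current activation record

# Prologue stores local variable 'n' in its slot.
memory[ebp - 8] = 42

# The load: address arithmetic (ebp - 8), then a dereference.
addr = ebp - 8
eax = memory[addr]
print(eax)           # 42
```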
An additional challenge is that for many programs, symbol-table and
debugging information is entirely absent. Even if it is present,
the executable may be untrustworthy, in which case
symbol-table and debugging information cannot be relied upon.
In 2003, my student Gogul Balakrishnan and I developed a
static-analysis algorithm for x86 machine code that is able to track
the flow of values through memory
[BR04].
The algorithm—called value-set analysis
(VSA)—recovers information about the contents of memory
locations and how they are manipulated by the executable
[BR04,
RBL06,
BR07,
B07,
RB08,
BR10].
A key feature of VSA is that it tracks address-valued and
integer-valued quantities simultaneously. VSA is related to
pointer-analysis algorithms that have been developed for programs
written in high-level languages, which determine an over-approximation
of the set of variables whose addresses each pointer variable can
hold:
VSA determines an over-approximation of the set of addresses that
each data object can hold at each program point.
At the same time, VSA is similar to range analysis and other numeric
static-analysis algorithms that over-approximate the integer values
that each variable can hold:
VSA determines an over-approximation of the set of integer values
that each data object can hold at each program point.
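The following toy (our illustration in Python; VSA's actual abstract domain of strided intervals over memory regions is considerably richer) conveys the flavor of an abstraction that serves for addresses and integers alike:

```python
# Toy "strided interval": the set {lo, lo+s, lo+2s, ..., hi}.
# Such sets can describe address values and integer values uniformly;
# e.g., the addresses touched by a loop over a 4-byte array form a
# stride-4 interval.

from math import gcd

class StridedInterval:
    def __init__(self, stride, lo, hi):
        self.stride, self.lo, self.hi = stride, lo, hi

    def concretize(self):
        if self.stride == 0:
            return {self.lo}
        return set(range(self.lo, self.hi + 1, self.stride))

    def __add__(self, other):
        # Sound (over-approximating) abstract addition.
        return StridedInterval(gcd(self.stride, other.stride),
                               self.lo + other.lo,
                               self.hi + other.hi)

# Base address of an array, plus index offsets 0, 4, 8, ..., 36:
base = StridedInterval(0, 0x4000, 0x4000)
offsets = StridedInterval(4, 0, 36)
addrs = (base + offsets).concretize()
print(sorted(addrs)[:3])   # [16384, 16388, 16392]
```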
We incorporated VSA, along with algorithms to identify
variables, data objects, and types, in CodeSurfer/x86
[BGRT05]—the
first program-slicing tool for machine code that can help
with understanding dependences across memory updates and memory
accesses.
CodeSurfer/x86 is the outcome of a joint project between the Univ. of
Wisconsin and GrammaTech, Inc.
CodeSurfer/x86 makes use of both
IDAPro,
a disassembly toolkit, and GrammaTech's
CodeSurfer
system, a toolkit for building program-analysis and inspection tools.
CodeSurfer/x86 is a tool for code understanding and code inspection that
supports both a graphical user interface (GUI) and a scripting language for
accessing a program's system dependence graph (SDG)
[HRB90],
as well as other information stored in
CodeSurfer's intermediate representations (IRs).
The facilities of CodeSurfer/x86 provide an
analyst with a powerful and flexible platform for investigating the
properties and behaviors of an x86 executable, using either
the GUI, the scripting language, or GrammaTech's
Path
Inspector, which is a tool that uses a sophisticated pattern-matching
engine to answer questions about the flow of execution in a program.
The GUI supports browsing ("surfing") of an SDG, along with
a variety of operations for making queries about the SDG—such as
slicing
[HRB90]
and chopping
[RR95].
The GUI allows a user to navigate through a program's source code
using these dependences in a manner analogous to navigating the World
Wide Web.
CodeSurfer/x86's scripting language provides a programmatic interface to these
operations, as well as to lower-level information, such as the
individual nodes and edges of the program's SDG, call graph, and
control-flow graph, and a node's sets of used, killed, and
possibly-killed variables.
The scripting language can thus be used to extend CodeSurfer/x86's
capabilities: one writes programs that traverse CodeSurfer's IRs to
implement additional program analyses.
Later, we extended CodeSurfer/x86 to create the first automatic
program-verification tool for stripped executables: it allows one to
check that a stripped executable conforms to an API-usage rule
(specified as a finite-state machine)
[BRKLLMGYCT05, BR08].
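Such an API-usage rule can be pictured as follows (a toy of our own, not the tool's code): the rule is a finite-state machine over API events, and verification amounts to showing that no execution can drive the machine into its error state. For example, a "no read after close" rule for a file handle:

```python
# A toy API-usage rule as a finite-state machine:
# "opened" --close--> "closed"; a "read" while "closed" is an error.
# Verification checks that no path through the program drives the
# FSM into its error state.

TRANSITIONS = {
    ("opened", "read"): "opened",
    ("opened", "close"): "closed",
    ("closed", "read"): "error",   # use-after-close
}

def check(trace, state="opened"):
    for op in trace:
        state = TRANSITIONS.get((state, op), state)
        if state == "error":
            return False
    return True

print(check(["read", "close"]))          # True: conforms to the rule
print(check(["read", "close", "read"]))  # False: read after close
```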
More recently, our work on the TSL system
has allowed us to create similar systems for other instruction sets
automatically, starting from a specification of the concrete
semantics of a machine-code instruction set.
Transformer Specification Language (TSL)
In 2008, my student Junghee Lim and I created the TSL system
[LR08a,
LR12],
a meta-tool to help in the creation of tools for analyzing machine
code. (TSL stands for Transformer Specification
Language.)
From a single specification of the concrete
semantics of a machine-code instruction set, TSL automatically
generates correct-by-construction implementations of the
state-transformation functions needed in
state-space-exploration tools that use static analysis, dynamic
analysis, symbolic analysis, or any combination of the three
[LR08a,
LLR09,
LLR09b].
The TSL language is a strongly typed, first-order functional language
with a datatype-definition mechanism for defining recursive datatypes,
plus deconstruction by means of pattern matching. Writing a TSL
specification for an instruction set is similar to writing an
interpreter in first-order ML. For instance, the specification of an
instruction set's concrete semantics is written as a TSL function
state interpInstr(instruction I, state S) {...};
where instruction and state are user-defined
datatypes that represent the instructions and the semantic states,
respectively.
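For readers unfamiliar with TSL's ML-like style, the following Python analogue (hypothetical; all datatype and function names are illustrative) shows the shape of such a specification: user-defined datatypes for instructions and states, and an interpreter defined by cases on the instruction:

```python
# Hypothetical analogue of a TSL concrete-semantics specification:
# user-defined datatypes for instructions and states, plus an
# interp_instr function defined by cases on the instruction form.

from dataclasses import dataclass

@dataclass(frozen=True)
class State:
    regs: tuple                  # e.g., (('eax', 5), ('ebx', 7))

@dataclass(frozen=True)
class Mov:
    dst: str
    src: str

@dataclass(frozen=True)
class AddImm:
    dst: str
    imm: int

def get(state, r):
    return dict(state.regs)[r]

def set_(state, r, v):
    regs = dict(state.regs)
    regs[r] = v
    return State(tuple(sorted(regs.items())))

def interp_instr(instr, state):
    # One case per instruction form, as in the TSL interpInstr function.
    if isinstance(instr, Mov):
        return set_(state, instr.dst, get(state, instr.src))
    if isinstance(instr, AddImm):
        return set_(state, instr.dst, get(state, instr.dst) + instr.imm)
    raise ValueError("unknown instruction")

s = State((("eax", 1), ("ebx", 7)))
s = interp_instr(Mov("eax", "ebx"), s)
s = interp_instr(AddImm("eax", 3), s)
print(get(s, "eax"))   # 10
```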
TSL's meta-language provides a fixed set of base-types; a fixed set of
arithmetic, bitwise, relational, and logical operators; and a facility
for defining map-types. The meanings of the meta-language constructs can be
redefined by supplying alternative interpretations of them.
When semantic reinterpretation is performed in this
way—namely, on the operations of the meta-language—it
is independent of any given instruction set. Consequently,
once a reinterpretation has been defined that reinterprets
TSL in a manner appropriate for some state-space-exploration
method, the same reinterpretation can be applied to
each instruction set whose semantics has been specified
in TSL.
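The reinterpretation idea can be sketched as follows (our own Python illustration, not TSL itself): the instruction semantics is written once against a small set of meta-operations, and supplying a different implementation of those operations retargets the same semantics, e.g., from concrete execution to interval analysis:

```python
# Semantic reinterpretation in miniature: the instruction semantics is
# written once, against abstract "meta-operations" (const, add).
# Supplying a different implementation of the meta-operations retargets
# the same semantics -- here, concrete execution vs. interval analysis.

class Concrete:
    def const(self, n): return n
    def add(self, a, b): return a + b

class Intervals:
    def const(self, n): return (n, n)
    def add(self, a, b): return (a[0] + b[0], a[1] + b[1])

def interp_add_imm(ops, val, imm):
    # Shared semantics of "add register, imm": written once,
    # meaningful under every interpretation of the meta-operations.
    return ops.add(val, ops.const(imm))

print(interp_add_imm(Concrete(), 5, 3))         # 8
print(interp_add_imm(Intervals(), (0, 10), 3))  # (3, 13)
```

Because the reinterpretation touches only the meta-operations, the same pair of classes would serve unchanged for any instruction set specified against them.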
The reinterpretation mechanism causes TSL to act as a
"Yacc-like" tool for automatically generating correct-by-construction
implementations of the state-transformation functions needed in
state-space-exploration tools.
Moreover, because all components are generated from
a single specification of the concrete
semantics of an instruction set,
the generated implementations are guaranteed to be mutually
consistent (and also to be consistent with an instruction-set emulator
that is generated from the same specification of the concrete
semantics).
Weighted Pushdown Systems and their Application to Interprocedural Dataflow Analysis
This work has established new connections between interprocedural
dataflow analysis and model checking of pushdown systems (PDSs).
Various connections between dataflow analysis and model checking
have been established in past work,
e.g., [S91,
S93,
S98,
EK99,
CC00];
however, with one exception ([EK99]),
past work has shed light only on the relationship between
model checking and bit-vector dataflow-analysis problems, such
as live-variable analysis and partial-redundancy elimination.
In contrast, the results obtained in our work
[RSJ03,
RSJM05]
apply to
(i) bit-vector problems, (ii) the one non-bit-vector problem
addressed in [EK99],
as well as (iii) certain dataflow-analysis problems that cannot be expressed
as bit-vector problems,
such as linear constant propagation
[SRH96]
and affine-relation analysis
[MOS04].
In general, the approach can be applied
to any distributive dataflow-analysis problem for which the
domain of transfer functions has no infinite descending chains.
(Safe solutions are also obtained for problems that are
monotonic but not distributive.)
The contributions of the work can be summarized as follows:
- Conventional dataflow-analysis algorithms merge together the
  values for all states associated with the same program point,
  regardless of the states' calling context.
  With the dataflow-analysis algorithm obtained via weighted PDSs,
  dataflow queries can be posed with respect to a regular language
  of stack configurations.
  Conventional merged dataflow information can also be obtained
  by issuing appropriate queries;
  thus, the new approach provides a strictly richer framework
  for interprocedural dataflow analysis than is provided by
  conventional interprocedural dataflow-analysis algorithms.
- Because the algorithm for solving path problems in weighted PDSs
  can provide a witness set of paths, it is possible to provide
  an explanation of why the answer to a dataflow query has
  the value reported.
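To convey the flavor of a non-bit-vector weight domain (a hand-rolled toy, not code from the libraries mentioned below), linear-constant-propagation weights can be modeled as affine transfer functions, composed along paths ("extend") and joined across alternative paths ("combine"):

```python
# Toy weight domain for linear constant propagation, in the WPDS style:
# a weight is an affine transfer function f(x) = a*x + b, encoded as
# (a, b), or UNKNOWN.  "extend" composes weights along a path;
# "combine" joins the weights of alternative paths.

UNKNOWN = "unknown"

def extend(f, g):
    # Path composition: apply f, then g, i.e., (g o f)(x) = g(f(x)).
    if f == UNKNOWN or g == UNKNOWN:
        return UNKNOWN
    (a1, b1), (a2, b2) = f, g
    return (a2 * a1, a2 * b1 + b2)

def combine(f, g):
    # Join over alternative paths: keep the weight only if both agree.
    return f if f == g else UNKNOWN

# Path 1: x = x + 1, then x = 2*x   =>  x -> 2x + 2
p1 = extend((1, 1), (2, 0))
# Path 2: x = 2*x, then x = x + 2   =>  x -> 2x + 2
p2 = extend((2, 0), (1, 2))
print(combine(p1, p2))   # (2, 2): both paths induce x -> 2x + 2
```

The key property is distributivity: combine-then-extend equals extend-then-combine, which is what lets the WPDS saturation algorithm summarize paths without losing precision.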
The algorithms described in
[RSJ03,
RSJM05], along with
the improvements described in
[LBR05,
LR06,
LKRT07,
LR08b],
have been implemented in three libraries for use in creating
program-analysis tools (one implemented in C
[Schwoon04],
and two implemented in C++
[KRML04,
KLR07]).
These libraries have been used to create prototype implementations of
context-sensitive interprocedural dataflow analyses for uninitialized
variables, live variables, linear constant propagation, and the
detection of affine relationships.
We have also used pushdown systems to create a novel algorithm for program slicing
[AHJR12].
Weighted pushdown systems are also a useful formalism in a completely
separate application domain, namely, access control of shared computing resources
in distributed systems
[JR02,
SJRS03,
JR04,
JSWR06,
WJRSS06].
Symbolic Methods for Abstract Interpretation
Since 2002, I have been interested in symbolic methods for abstract
interpretation. The work aims to bridge the gap between (i) the use of logic
for specifying program semantics and performing program analysis, and
(ii) abstract interpretation
[RSY04,
YRS04,
TR12a,
TR12b,
TER12,
ELSAR13,
TLLR13].
In [RSY04],
Mooly Sagiv, Greta Yorsh, and I gave the first method that makes
such a connection: we showed that, for a broad class of situations, given
a formula φ in some logic L, it is possible to find the best value a in
a given abstract domain Abs that over-approximates the
meaning of φ (i.e., [[φ]] ⊆ γ(a)).
We call this problem symbolic abstraction.
If you think of logic L as being a rich language for
expressing meanings, and abstract domain Abs as
corresponding to a logic fragment L' that is less
expressive than L, then symbolic abstraction addresses a
fundamental approximation problem:
Given φ ∈ L, find the strongest
consequence of φ that is expressible in logic fragment
L'.
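The algorithm of [RSY04] can be sketched as a solver-driven loop (below, a brute-force search over a tiny universe stands in for a real decision procedure, and the interval domain stands in for Abs):

```python
# Sketch of the RSY symbolic-abstraction loop: start from bottom, and
# while the "solver" can find a model of phi outside gamma(a), join
# that model's abstraction into a.  A brute-force search over a small
# universe stands in for a real decision procedure.

UNIVERSE = range(-20, 21)

def alpha_hat(phi):
    a = None                               # bottom of the interval domain
    while True:
        # "Solver" query: is there a model of phi not covered by gamma(a)?
        m = next((x for x in UNIVERSE
                  if phi(x) and (a is None or not a[0] <= x <= a[1])),
                 None)
        if m is None:
            return a                       # a over-approximates [[phi]]
        # Join beta(m) = [m, m] into a.
        a = (m, m) if a is None else (min(a[0], m), max(a[1], m))

# phi(x): x is even and 3 <= x*x <= 50; models are {-6, -4, -2, 2, 4, 6}
best = alpha_hat(lambda x: x % 2 == 0 and 3 <= x * x <= 50)
print(best)   # (-6, 6): the best interval over-approximating the models
```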
Since 2011, my student Aditya Thakur and I have been pursuing the
insight that generalized versions of an old, and not widely used, method
for validity checking of propositional-logic formulas, called
Stålmarck's method, hold great promise for symbolic abstraction.
The symbolic methods for abstract interpretation that we have
subsequently developed based on this insight
[TR12a,
TR12b,
TER12,
TLLR13]
offer much promise for building more powerful program-analysis tools.
These methods (i) allow more precise implementations
of abstract-interpretation primitives to be created, and (ii) drastically
reduce the time taken to implement such primitives because correctness
is ensured by construction.
In particular, they provide a way to create correct-by-construction abstract
interpreters that also attain the fundamental limits on precision that
abstract-interpretation theory establishes.
For instance, in [TLLR13],
we described a systematic method for implementing an abstract interpreter
that solves the following problem:
Given program P, and an abstract domain A,
find the most-precise inductive A-invariant for P.
Moreover, our generalized-Stålmarck methods are "dual-use":
In addition to providing methods for building improved
abstract-interpretation tools, they also provide methods for building
improved logic solvers that use abstract interpretation to speed up
the search that a solver carries out.
In particular, we showed how Stålmarck's method is an instantiation of
a generic framework parameterized by an abstract domain over Booleans;
consequently, different instantiations of the framework lead to new
decision procedures for propositional logic
[TR12b].
Moreover, this abstraction-based view allows us to lift
Stålmarck's method from propositional logic to richer logics:
to obtain a method for richer logics, instantiate the parameterized
version of Stålmarck's method with richer abstract domains
[TR12a].
We call such a decision-procedure design, which is parameterized by an abstract
domain, a Satisfiability Modulo Abstraction (SMA) solver
[RT14].
Recently, we created an SMA solver for separation logic
[TBR14].
Separation logic (SL) is an expressive logic for reasoning about
heap structures in programs, and provides a mechanism for concisely describing program
states by explicitly localizing facts that hold in separate regions of
the heap.
SL is undecidable in general, but by using an abstract domain of
shapes [SRW02]
we were able to design a semi-decision procedure for SL.
Policy Weaving
System security is a cross-cutting issue.
Unfortunately, it is hard to devise a formalism that allows
system-wide policies to be both (i) easily understood and (ii) easily
implemented and maintained. In particular, when the elements of
security policies are expressed directly in the code—e.g., via
calls on security primitives, as in Decentralized Information Flow
Control (DIFC) systems, such as Asbestos, HiStar, and Flume, or in
capability-based systems, such as Capsicum—the
policy implementation is spread throughout the codebase of an
application. With such a structure, it is difficult to understand what
overall policy the application actually implements, which complicates
documenting the policy, debugging the policy, and updating the policy
as the codebase evolves in response to changes in requirements and the
threat landscape.
In contrast, the focus of our work is on a declarative approach to system security.
In our approach, a policy specification is a first-class notion—e.g.,
the policy specification for a given application is a single
document, so that all elements of the policy are gathered together in
one place, rather than scattered throughout the codebase.
A security policy is still implemented in terms of a collection of
code snippets to invoke policy-enforcement primitives at different
points in the program.
However, users are freed from the task of introducing such snippets
manually.
The goal of our work is to create tools to support policy
weaving, which will handle the job of rewriting the program to place
code at appropriate locations so that suitable security primitives are
invoked at appropriate moments to enforce the policy.
A policy-weaving tool takes as input
(i) a program P, and
(ii) a policy R (written in some suitable policy-specification language),
and rewrites P to enforce R.
A policy-weaving tool supports a declarative approach to security.
The policy R can be stated as a single, easily-understood
document, and it is the obligation of the policy-weaving tool to find
a way to place code for invoking enforcement primitives so that all
executions of P satisfy R.
We also work on meta-tools to support policy weaving.
In particular, one of our goals is to create a policy-weaver generator.
Such a tool takes as input specifications of
- the semantics of a programming language L
- the semantics of the policy-enforcement primitives to be used
and generates a policy weaver.
Our vision is similar to aspect-oriented programming (in that a
cross-cutting issue is specified separately and
then automatically woven into the fabric of the
implementation).
However, the approach that we envision goes beyond
aspect-oriented programming in an important way: whereas conventional
aspect-oriented programming systems use syntactically-oriented
mechanisms and allow aspect code to be placed in a rather limited set
of locations, our approach is based on programming-language and
policy semantics to identify automatically
the program locations where policy-enforcement primitives need to be placed.
Our approach also goes beyond simple finite-state reference monitors:
policies have visibility into such entities as
processes, principals, and communication channels.
A key challenge is to design a policy weaver that can reason about the
potentially-unbounded set of states that a program may reach, and to
choose system primitives and parameters from a potentially unbounded
number of options to keep the program from reaching states that constitute a
violation of the policy provided by the programmer.
Our approach to designing a weaver generator is based on the theory of
two-player safety games.
Such a game is played by two players, called the Attacker and the
Defender, on a graph in which each node is labeled as an Attacker
node or a Defender node.
The state of the game is a node of the graph.
The goal of the Attacker is to reach an accepting node;
the goal of the Defender is to prevent the Attacker from ever
reaching an accepting node.
The game is played in turns:
on each turn, if the state is an Attacker state, then the Attacker
can change the state to any adjacent node of the graph, and
analogously for Defender states and Defender-initiated state changes.
A winning Defender strategy for a game is a
decision tree of choices that the Defender can make so that an
accepting state is never reached.
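The computation of a winning region can be sketched as a classical attractor fixpoint (a generic illustration, not the weaver's actual algorithm): compute the set of nodes from which the Attacker can force a visit to an accepting node; the Defender wins from every node outside that set:

```python
# Solving a two-player safety game by computing the Attacker's
# attractor: the set of nodes from which the Attacker can force a
# visit to an accepting node.  The Defender wins everywhere else.

def attacker_wins(nodes, edges, attacker_nodes, accepting):
    succ = {n: [v for (u, v) in edges if u == n] for n in nodes}
    win = set(accepting)
    changed = True
    while changed:
        changed = False
        for n in nodes:
            if n in win:
                continue
            outs = succ[n]
            if n in attacker_nodes:
                # Attacker needs just one move into the winning set.
                forced = any(v in win for v in outs)
            else:
                # Defender is trapped only if every move loses.
                forced = bool(outs) and all(v in win for v in outs)
            if forced:
                win.add(n)
                changed = True
    return win

# Tiny game: from Attacker node 'a' both moves exist, but the Defender
# at 'd' can always step to the safe node 's' instead of 'bad'.
nodes = {"a", "d", "s", "bad"}
edges = {("a", "d"), ("a", "s"), ("d", "s"), ("d", "bad"), ("s", "s")}
win = attacker_wins(nodes, edges, {"a"}, {"bad"})
print("a" in win)   # False: the Defender has a winning strategy from 'a'
```

The Defender's winning strategy is read off directly: at each Defender node outside the attractor, pick any successor that is also outside it.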
The core of a policy weaver created by our weaver generator is an
algorithm to find a winning Defender strategy
[HJR12].
A policy weaver generated by the policy-weaver generator works as follows:
given program P and policy R, the policy weaver constructs a game in
which
- the Attacker represents the threat model (i.e., all possible system attacks),
- the Defender represents possible instrumentations of the program, and
- accepting nodes represent system states reached when the instrumented program
violates the policy.
If the weaver finds a winning Defender strategy for the game, it uses
the strategy to instrument the program to satisfy the policy.
We have also been working on applying policy weaving to JavaScript
programs.
JavaScript is the most popular language used to create dynamic content
for web browsers.
Moreover, the nature of the Web has resulted in an ecosystem in which
JavaScript programs and libraries, often written by untrusted third
parties, are regularly incorporated into applications and combined in
every way imaginable.
While this practice has obvious benefits in terms of convenience and
not "reinventing the wheel," it creates glaring security holes because
of a complete lack of isolation between modules in a JavaScript
environment.
We chose this context in which to apply our policy-weaving techniques
because of the clear, unmet need for a security mechanism, and because
of the challenge presented by JavaScript's highly dynamic, untyped
nature.
We have created the JAM policy weaver [FJJRPSY12],
coupled with the JAMScript enforcement mechanism.
("JAM" stands for JAvaScript Model checker.)
JAM can be used by a website administrator to apply a security
policy—in the form of a finite-state automaton—that
specifies a set of disallowed execution traces of a JavaScript
program.
The program is analyzed statically with respect to the policy, and an
instrumented program that is guaranteed to not violate the policy
specification is produced.
This analysis consists of a counterexample-guided,
abstraction-refinement loop that repeatedly creates an incrementally
more precise abstract model of the program.
Because this model represents a superset of the program's actual
behaviors, we can say that any execution traces that are not
realizable in the model also cannot happen in the concrete program.
Static analysis can be expensive, especially in the context of a
highly dynamic language such as JavaScript.
Thus, the algorithm used in JAM provides for a system-resource cutoff threshold,
which terminates the analysis when the threshold is reached.
This case still results in a safely instrumented program, although the
program may contain some amount of spurious instrumentation, resulting
in extra runtime overhead.
In this way, JAM can be tuned to balance the trade-off between
static-analysis overhead and dynamic overhead.
The instrumentation that JAM weaves into a JavaScript program comes in
the form of transactions.
Potentially malicious statements in the untrusted code (i.e., those
that could not be proven safe by the static analysis) are wrapped in
transaction blocks, which are speculatively executed in a manner that
records all actions (reads, writes, and function calls) and suppresses
all effects of the actions.
This approach allows the effects of the statements to be evaluated
with respect to the security policy, at run-time inside the transaction.
The program is terminated—without the effects of the actions
being visible externally—if it is determined that any prefix
of the sequence of actions will violate the policy.
If no policy-violating action sequence is observed, the transaction is
committed, and execution continues as before.
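The mechanism can be modeled roughly as follows (a simplified sketch of ours; JAMScript's actual implementation is more involved): actions are logged, writes are buffered, every prefix of the log is checked against the policy, and the buffered effects become visible only on commit:

```python
# Simplified model of JAMScript-style transactions: actions inside the
# transaction are recorded and their write effects buffered; every
# prefix of the log is checked against the policy, and the writes
# become visible only if the transaction commits.

def run_transaction(actions, policy_violates, store):
    log, buffered = [], {}
    for act in actions:
        log.append(act)
        if policy_violates(log):        # check each prefix as it grows
            return "aborted", store     # suppress all buffered effects
        if act[0] == "write":
            _, var, val = act
            buffered[var] = val         # effect held back, not visible
    store = dict(store, **buffered)     # commit: effects become visible
    return "committed", store

# Example policy: no write to 'cookie' after a call to 'send'.
def violates(log):
    sent = False
    for act in log:
        if act == ("call", "send"):
            sent = True
        elif act[0] == "write" and act[1] == "cookie" and sent:
            return True
    return False

print(run_transaction([("write", "x", 1)], violates, {}))
# ('committed', {'x': 1})
print(run_transaction([("call", "send"), ("write", "cookie", 0)],
                      violates, {}))
# ('aborted', {})
```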