Past Research Accomplishments
Capsule Summaries
During my 46-year career in Computer Science, the overriding goal of my
research has been to find new ways to help programmers create
correct, reliable, and secure software. To address these issues,
I developed a number of new methods to manipulate programs and analyze
their properties. Often these methods have taken the form of
frameworks that apply to general classes of problems, e.g.,
- creating interactive program-development environments
[RT88]
- solving graph-reachability problems
[R98]
- analyzing programs that manipulate linked data structures
[SRW02]
- solving interprocedural dataflow-analysis problems
[RSJM05]
- analyzing machine code
[LR08a]
- synthesizing programs
[KHDR21].
As Richard Hamming has put it, the benefit of developing a solution
in the form of a framework is that it allows the work to address
"all of next year's problems, not just the one [currently] in front of [your] face"
[H86, p. 7].
My research interests have spanned program-development environments,
software engineering, dataflow analysis, and more recently,
computer security, software verification, and program synthesis, with some excursions
into other subjects, as well.
The key achievements can be summarized as follows:
- In 1978-79,
Tim Teitelbaum and I created the Cornell Program
Synthesizer [TR81],
the first example of an interactive program-development environment
for a strongly-typed, block-structured, imperative programming
language. It provided integrated, screen-oriented facilities
to create, edit, execute, monitor, single-step, and debug
programs. The Cornell Program Synthesizer was similar
to modern program-development environments, such as
Microsoft Visual Studio
and Eclipse,
but pre-dated them by over two decades.
[More]
- From 1980 to 1987,
Tim Teitelbaum and I worked on methods to
create programming tools similar to the Cornell Program
Synthesizer automatically, but for languages other than the PL/I
subset that the Cornell Program Synthesizer supported.
We developed an editor-generator tool, the Synthesizer Generator
[RT88],
which automatically creates an editor for a given language
L from a description of L—including a
description of the properties that the editor should check
incrementally as L-documents are developed. Among other
applications, the Synthesizer Generator was used to create some
of the first screen-oriented, interactive, automated proof assistants
[RA84,
G88,
GMP90].
[More]
- In 1988,
Susan Horwitz and I, together with our student David Binkley, showed how to
perform context-sensitive interprocedural slicing
[HRB88].
Our work reignited interest in program slicing in the research
community, which continues to this day. It also led to
the development of a number of commercial program-slicing products.
Commercial slicers include GrammaTech's
CodeSurfer for C and C++,
which, under a license from the Wisconsin Alumni Research Foundation,
is derived from the ANSI C slicer that we developed at Wisconsin, and IBM's
Rational Transformation Workbench,
which was developed to address such code-renovation issues as
the Year 2000 Problem ("Y2K bug") and conversion of financial software
to support adoption of the Euro.
[More]
- In the early 90s,
my student G. Ramalingam and I investigated
incremental algorithms for dynamic graph problems. The goal
was to use the solution to one problem instance
G1 to find the solution to a nearby problem
instance G2, where G2 is
created from G1 by inserting or deleting a
small number of nodes or edges. In 1992, we developed an
incremental algorithm for the dynamic shortest-hyperpath
problem in directed hypergraphs [RR96b].
The algorithm has been used by AT&T in their Open Shortest Path
First (OSPF) Internet intra-domain routing protocol, achieving a
15-fold
to
20-fold
speed-up in the heuristic search performed by the protocol.
More recently, the ideas used in [RR96b]
were combined with the A* search algorithm from the field of AI
planning to create several
incremental versions of A* search.
These algorithms were used to solve planning problems and generate dynamically
constrained trajectories for the full-size SUV of CMU's Tartan Racing Team,
which won the DARPA 2007 Urban Challenge race for autonomous vehicles.
[More]
- From 1993 to 1997,
Susan Horwitz, Mooly Sagiv, and I showed
that the problem of context-free-language reachability
(CFL-reachability) in directed graphs is a key concept in
program analysis
[R98].
(CFL-reachability is the variant of reachability in which a path
from node s to node t only counts as a valid
s-t connection if the path's labels form a word
in a given context-free language.) We showed that a large
class of program-analysis problems could be solved by
transforming them into instances of CFL-reachability.
The idea is now a standard concept in the programming-languages
community, and a CFL-reachability solver is one of the
key analysis components found in most systems for software
model checking and bug detection.
[More]
- In 1999,
Mooly Sagiv, Reinhard Wilhelm, and I developed a
tunable framework for analyzing and establishing properties of
programs that manipulate linked data structures
[SRW02].
The work addressed the problem of analyzing programs that use
dynamic allocation and freeing of storage cells and destructive
updating of structure fields. In such programs, data
structures can grow and shrink dynamically, with no fixed upper
bound on their size or number. Our framework provides a way to
create finite-sized descriptors of memory configurations that
(i) abstract away certain details, but (ii) retain enough key
information so that an analyzer can identify interesting
node-linkage properties that hold (a.k.a. "shape" properties)
[LARSW00,
JLRS04,
LRS06,
JLRS10].
Our work galvanized the programming-languages research
community to finally tackle the problem of pointers, aliasing,
and linked data structures in program-analysis and
program-verification tools, and ignited an enormous effort on
the topic over the last ten years.
[More]
- In 2003,
Somesh Jha, Stefan Schwoon, and I developed a new
framework for interprocedural dataflow analysis
that supports a strictly richer set of queries than previous
approaches to interprocedural dataflow-analysis
(a problem that
goes back to 1974).
The algorithms that we developed
[RSJM05],
along with improvements developed with my students Akash Lal
and Nicholas Kidd
[LRB05,
LR06,
LKRT07,
LR08b],
have been implemented in two libraries for use in creating
program-analysis tools
[KRML04,
KLR07].
[More]
- In 2003,
my student Gogul Balakrishnan and I developed a
static-analysis algorithm for x86 machine code
that—without relying on symbol-table or debugging
information—is able to track the flow of values through memory
[BR04,
BR10].
We used it to create the first program-slicing tool for machine
code that could help with understanding dependences across
memory updates and memory accesses
[BGRT05].
Later, we extended the tool to create the first automatic
program-verification tool for stripped executables: it allows
one to check that a stripped executable conforms to an
API-usage rule (specified as a finite-state machine)
[BRKLLMGYCT05,
BR08].
[More]
- In 2008,
my student Junghee Lim and I created the TSL system
[LR08a,
LR12].
(TSL stands for Transformer Specification Language.)
From a single specification of the concrete
semantics of a machine-code instruction set, TSL automatically
generates correct-by-construction implementations of the
state-transformation functions needed in
state-space-exploration tools that use static analysis, dynamic
analysis, symbolic analysis, or any combination of the three
[LR08a,
LLR09,
LLR10,
TLLBDEAR10].
The generated implementations are guaranteed to be mutually
consistent, and also to be consistent with an instruction-set
emulator that is generated from the same specification of the
concrete semantics.
[More]
- From 2005 to 2009,
my students Akash Lal and Nicholas Kidd and I developed
methods for analyzing properties of concurrent programs, via
model checking. (Our collaborators included Sagar Chaki, Ed
Clarke, Julian Dolby, Peter Lammich, Tayssir Touili, and
Mandana Vaziri.)
We investigated two different kinds of model-checking problems,
and for each problem the technique that we developed was
about 30 times faster than the best previous technique for the
problem.
One line of research focused on
"context-bounded
analysis": a program's behavior is explored for a fixed
number of context switches (but running threads are allowed to
perform an a priori unbounded amount of work between
context switches). We obtained several fundamental results,
including a way in which any context-bounded-analysis problem can
be transformed into an equivalent sequential problem (so that
existing tools for sequential analysis can then be applied)
[LR08c,
LR09].
Another line of research focused on the problem
of detecting data races that may involve
multiple data locations. (Problems can arise when
a program is supposed to maintain an invariant over multiple
data locations, but accesses/updates are not protected
correctly by locks.) We showed that for a model of
concurrent programs that permits (i) reentrant locks,
(ii) an unbounded number of context switches, and (iii) an
unbounded number of lock acquisitions, the multi-location
data-race-detection problem is decidable
[KLTR09].
[More]
- Since 2002,
I have been interested in a connection between
abstract interpretation and logic—in particular, how to harness
decision procedures to obtain algorithms for several fundamental
primitives used in abstract interpretation
[RSY04,
YRS04,
TR12a,
TR12b,
TER12,
ELSAR13,
TLLR13].
In [RSY04],
Mooly Sagiv, Greta Yorsh, and I gave the first method that makes
such a connection: we showed that, for a broad class of situations, given
a formula φ in some logic L, it is possible to find the best value a in
a given abstract domain Abs that over-approximates the
meaning of φ (i.e., [[φ]] ⊆ γ(a)).
From 2011 to 2016, my student Aditya Thakur and I pursued
several new insights on this fundamental question.
Several of the methods that we have developed are "dual-use".
That is, in addition to providing methods for building improved
abstract-interpretation tools, they also provide methods for building
improved logic solvers that use abstract interpretation to speed up
the search that a solver carries out.
[More]
- From 2010 to 2014,
Somesh Jha, Bill Harris, Rich Joiner, and I worked on policy weaving,
a declarative approach to system security.
In our approach, a policy specification is a first-class notion—e.g.,
the policy specification for a given application is a single
document, so that all elements of the policy are gathered together in
one place, rather than scattered throughout the codebase.
A security policy is implemented in terms of a collection of
code snippets that invoke policy-enforcement primitives at different
points in the program;
however, users are freed from the task of introducing such snippets
manually by a policy-weaving tool, which takes as input
(i) a program P, and
(ii) a policy R (written in some suitable policy-specification language),
and rewrites P to enforce R.
[More]
- Starting in 2011, Emma Turetsky, Prathmesh Prabhu,
and I worked on a method to improve the dataflow-analysis algorithm of Esparza et al.,
which applies Newton's method to program-analysis problems:
Esparza et al. gave a new way to find fixed-points of
systems of equations over semirings.
As in its real-valued counterpart, each iteration of their
method solves a simpler ``linearized'' problem.
However, there is an important difference between the
dataflow-analysis and numerical-analysis contexts: when Newton's
method is used on numerical-analysis problems, multiplicative
commutativity is relied on to rearrange expressions of the form
``a * X * b'' into ``X * a * b.'' Equations of the form
``X = X * a * b'' correspond to path problems described by regular
languages. In contrast, when Newton's method is used for
interprocedural dataflow analysis, the ``multiplication'' operation
involves function composition, and hence is non-commutative:
``X = a * X * b'' cannot be rearranged into ``X = X * a * b,'' and
thus the equations correspond to path problems described by linear
context-free languages (LCFLs).
We used some algebraic sleight-of-hand to turn a class of linear
context-free (LCFL) path problems into regular-language path
problems
[RTP17,
RTP16].
This result is surprising because a reasonable sanity
check—formal-language theory—suggests that it should
be impossible: after all, the LCFL languages are a strict superset
of the regular languages.
Zak Kincaid, Jason Breck, John Cyphert, and I have continued
this line of research, and developed a number of additional
techniques
[KBFBR17,
KCBR18,
KBCR19,
CBKR19,
BCKR20,
KRC21].
[More]
- Since 2019, I have been collaborating
with Loris D'Antoni and his students (Qinheping Hu, Jinwoo Kim, Kanghee Park, and
others) on a new framework for program synthesis, which we call
``semantics-guided synthesis'' (SemGuS).
The SemGuS framework captures the core computational problem of
program synthesis, without being tied to a specific solution,
implementation, or domain of application.
That is, it is both domain-agnostic and solver-agnostic.
What the SemGuS work brings to the table is the idea of
parameterizing the core synthesis algorithm with the semantics
of the primitives out of which the target program is to be
constructed.
To achieve this goal, it is necessary to
distill out the essence of many program-synthesis problems into
a specification formalism that is grounded in formal methods
(e.g., automata and logic) and is agnostic to any specific
domain of application. This degree of abstraction also opens
the opportunity to lift certain synthesis algorithms and ideas
to a higher level that makes these algorithms reusable across
different tools.
[More]
- Meghana Sistla, Swarat Chaudhuri, and I
are working on a data structure that can provide highly compressed representations
of Boolean and pseudo-Boolean functions.
Our data structure, called a CFLOBDD, is a decision diagram—and thus related
to a Binary Decision Diagram (BDD)—but with one feature that is quite
different from anything found in BDDs (and the many variants of BDDs that have been developed).
A BDD can be considered to be a special form of bounded-size, branching, but non-looping
program.
From that standpoint, a CFLOBDD can be considered to be a bounded-size, branching,
but non-looping program in which a certain form of procedure call is permitted.
In the best case, CFLOBDDs can obtain double-exponential compression
compared to decision trees—i.e., an extra exponential factor over (quasi-)BDDs.
Even when such best-case compression is not possible, such ``procedure calls'' can
allow there to be additional sharing of structure beyond what is possible in BDDs:
a BDD can only share sub-DAGs, whereas a procedure call in a CFLOBDD can also share
the ``middle of a DAG.''
We have found that CFLOBDDs can be quite good for simulating quantum circuits.
[More]
Along the way, we made several contributions to the field of
incremental computation
[RTD83,
RR94a,
RR94b,
RR96a,
RR96b,
RSL03,
RSL10,
EGR10];
introduced canonical abstraction, a new family of
abstractions for use in program analysis and verification
[SRW02];
made numerous innovations in techniques for program analysis
[RR95,
HRS95,
RT96,
MR98,
RBRSW05,
LRB05,
LR06,
BR06,
GR07,
LKRT07,
LTKR08,
LR08b,
KLR08,
LR08c,
KRDV09,
LR09,
KLTR09];
developed techniques to
convert C programs to C++ programs that
take advantage of the improved features of C++,
such as templates and classes;
worked on methods to address the Year 2000 problem;
and identified several interesting connections between program-analysis
problems and ideas from databases
[R94,
RSL03,
RSL10,
EGR10],
machine learning
[RSY04,
YRS04,
LRS05,
TER12],
and access control of shared computing resources in distributed systems
[SJRS03,
JR04,
JSWR06].
[Back to the top]
Interactive Program-Development Environments
The Cornell Program Synthesizer
The Cornell Program Synthesizer [TR81],
which Tim Teitelbaum
and I started working on in 1978, was the first example of an
interactive program-development environment with integrated, screen-oriented
facilities for a strongly-typed, block-structured, imperative
programming language. It had a tightly
integrated combination of facilities for programmers to create, edit,
execute, monitor, single-step, and debug their PL/C programs. In other words,
by 1979, we had created the first version of an interactive
programming environment similar to modern tools, such as
Microsoft Visual Studio
and Eclipse.
At the time that it was created (1978-79), the Cornell Program Synthesizer
was a radically new kind of programming tool. Along with the Mentor
system of Donzeau-Gouge, Kahn, Huet, and Lang
[DGKHL80],
it showed that the kind of direct-manipulation programming tools that had been
pioneered in the SmallTalk
and InterLisp systems
could be created for a mainstream programming language (i.e., one that was
strongly-typed, block-structured, and had imperative programming constructs).
The Cornell Program Synthesizer was used successfully in introductory
programming classes by about 20,000 students over a five-year period
at Cornell, Rutgers, Princeton, UCLA, and several other universities.
Technology Transfer
The Cornell Program Synthesizer was distributed to researchers at
about 110 universities, research institutes, and industry
laboratories.
Through lectures that were given about the system and copies that were
distributed, it is possible to trace its direct influence on several
commercial products, including DEC's
Language Sensitive Editor,
Rational's Ada programming environment,
and Symantec's
LightspeedC and LightspeedPascal programming environments.
It is more difficult to connect the dots to the similar products that
hundreds of thousands or millions of people use on a daily basis today;
however, together with the initial paper about the system
[TR81],
the Cornell Program Synthesizer was very influential: as a model for
programming tools of the future, it helped to spark widespread
interest in the question of how to create better tools for developing,
analyzing, testing, and debugging programs.
[Back to the top]
The Synthesizer Generator
My Ph.D. dissertation [R84],
which was honored with the 1983
ACM Doctoral Dissertation Award,
investigated the problem of automating the creation of programming
tools similar to the Cornell Program Synthesizer, but for languages
other than PL/I.
The goal of the research was to create an editor-generator tool to
enable editors for different languages to be created automatically
from a language description. In the solution I proposed, language
descriptions are specified by giving
- a context-free grammar for the textual input language,
- a regular-tree grammar for the language's underlying abstract syntax,
- sets of "attributes" (annotations) to be attached to nodes of the
language's abstract-syntax trees,
- rules that describe how information from one node's attributes
affects the attributes of other nodes, and
- pretty-printing rules that describe how abstract-syntax trees
and their attributes are displayed.
In essence, each editor can be thought of as a specialized
"spreadsheet" for documents written in the given language. After an
editing operation changes the document's underlying abstract-syntax
tree, some of the
tree's attributes may no longer be consistent with one another;
attributes must be updated to reestablish consistency (similar to the
way values are updated in a spreadsheet program). A major result of
my Ph.D. research was an optimal algorithm that I devised to solve the
attribute-updating problem
[R82,
RTD83,
R84].
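To give a flavor of the spreadsheet analogy, here is a minimal
change-propagation sketch in Python. It is not the optimal
attribute-updating algorithm of [R82, RTD83]; the attributes, rules,
and dependence relation below are invented purely for illustration.

# Minimal sketch of spreadsheet-style attribute updating (illustrative only).
def propagate(rules, dependents, values, changed):
    """
    rules:      attr -> function computing that attribute from current values
    dependents: attr -> attributes that read it
    values:     current attribute values (updated in place)
    changed:    attributes whose defining input was just edited
    """
    worklist = set(changed)
    while worklist:
        a = worklist.pop()
        for b in dependents.get(a, ()):
            new_val = rules[b](values)
            if new_val != values[b]:
                values[b] = new_val
                worklist.add(b)        # b changed, so reconsider b's dependents

# Attributes for a toy "declared type vs. use type" check in an editor buffer.
values = {'decl_type': 'int', 'use_type': 'int', 'type_error': False}
rules = {'type_error': lambda v: v['decl_type'] != v['use_type']}
dependents = {'decl_type': ['type_error'], 'use_type': ['type_error']}

values['use_type'] = 'float'           # the user edits the use site
propagate(rules, dependents, values, {'use_type'})
print(values['type_error'])            # True -> the editor flags the inconsistency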
As part of my Ph.D. research, I developed a prototype implementation
of an editor-generating system that made use of the ideas described
above. I spent 1982-85 as a post-doc at Cornell (in residence during
1982-83 at INRIA, the national institute for computer science in
France) collaborating with Tim Teitelbaum on the design and
implementation of the Synthesizer Generator, a full-scale version of
such a system
[RT84,
RT87,
RT88a].
The Synthesizer Generator creates a language-specific editor from an
input specification that defines a language's syntax, display format,
transformation rules for restructuring "documents" written in the
language, and attribution rules that codify context-sensitive
relationships between document elements. From this specification, the
Synthesizer Generator creates a specialized editor for manipulating
documents according to these rules. The Synthesizer Generator has
proven to be of great utility in a wide range of applications,
including program editors, text-preparation systems, and tools that
verify the correctness of proofs (for several different kinds of
mathematical and programming logics
[RA84,
G88,
GMP90]).
In 2010, our 1984 paper about the Synthesizer Generator
[RT84] received an
ACM SIGSOFT Retrospective Impact Paper Award.
Technology Transfer
Beginning in August 1985, the Synthesizer Generator was made available
to universities and industrial research organizations.
Since distribution began it has been licensed
to about 330 sites in 23 countries. Two books describing the system,
which Teitelbaum and I co-authored, were published by Springer-Verlag
in 1988 [RT88a,
RT88b].
Also in 1988, Teitelbaum and I founded a company
(GrammaTech, Inc.) to
commercialize the Synthesizer Generator technology.
GrammaTech subsequently created two products for the Ada programming language
based on the Synthesizer Generator,
Ada-ASSURED
and Ada-Utilities,
and, since the mid-90's, these have been used in the development of almost every major
Department of Defense system written in Ada.
[Back to the top]
Software Engineering
Program Slicing
Starting in 1986, I carried out a great deal of work exploring how
program slicing
(an operation originally proposed in 1979 by the late Mark Weiser)
can serve as the basis for semantics-based program-manipulation operations.
The slice of a program with respect to a set of program elements
S is
a projection of the program that includes only program elements that
might affect (either directly or transitively) the values of the
variables used at members of S. Slicing allows one to find
semantically meaningful decompositions of programs, where the
decompositions consist of elements that are not textually contiguous.
Program slicing is a fundamental operation that can aid in solving
many software-engineering problems
[HR92].
For instance, it has applications in program understanding, maintenance, debugging,
testing,
differencing [HR91],
specialization [RT96],
reuse, and merging of program variants
[HPR89].
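To make the idea concrete, here is a minimal Python sketch of backward
slicing viewed as reachability in a dependence graph. Real slicers
build program/system dependence graphs and, in the interprocedural
case, restrict attention to realizable paths [HRB88]; the
five-statement example below is invented.

from collections import deque

def backward_slice(dependences, criterion):
    """
    dependences: dict mapping each program element to the elements it
                 depends on (data or control dependences).
    criterion:   set of program elements of interest.
    Returns every element that might affect the criterion.
    """
    in_slice = set(criterion)
    worklist = deque(criterion)
    while worklist:
        elem = worklist.popleft()
        for dep in dependences.get(elem, ()):
            if dep not in in_slice:
                in_slice.add(dep)
                worklist.append(dep)
    return in_slice

# 1: x = read();  2: y = read();  3: z = x + 1;  4: w = y * 2;  5: print(z)
deps = {3: {1}, 4: {2}, 5: {3}}
print(sorted(backward_slice(deps, {5})))   # [1, 3, 5]; statements 2 and 4 are sliced away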
The projects that I carried out with my students and co-workers were
aimed at these and related applications.
(Click here
for the home page of the Wisconsin Program-Slicing Project.)
The conference paper on interprocedural slicing that
Susan Horwitz,
David Binkley,
and I wrote [HRB88]
was selected for inclusion in a special SIGPLAN collection of the
50 most influential papers from the SIGPLAN Conference on Programming
Language Design and Implementation from 1979 to 1999:
-
Horwitz, S., Reps, T., and Binkley, D.,
Interprocedural slicing using dependence graphs.
20 Years of the ACM SIGPLAN Conference on Programming Language Design
and Implementation (1979 - 1999): A Selection, K.S. McKinley, ed.,
ACM SIGPLAN Notices 39, 4 (April 2004), 232-243.
A journal version of
[HRB88]
appeared as
[HRB90].
(Of all my papers, this is the most highly-cited;
e.g., see
Google Scholar
and CiteseerX.)
A retrospective on [HRB88]
was published as
-
Horwitz, S., Reps, T., and Binkley, D.,
Retrospective: Interprocedural slicing using dependence graphs.
20 Years of the ACM SIGPLAN Conference on Programming Language Design
and Implementation (1979 - 1999): A Selection, K.S. McKinley, ed.,
ACM SIGPLAN Notices 39, 4 (April 2004), 229-231.
[retrospective (PS);
retrospective (PDF)]
In 2011, our 1994 paper "Speeding up slicing"
[RHSR94] received an
ACM SIGSOFT Retrospective Impact Paper Award.
In more recent work, my students, colleagues, and I have developed even
better methods for program analysis and slicing based on a formalism called
weighted pushdown systems.
Technology Transfer
In collaboration with Genevieve Rosay and Susan Horwitz,
I built a system, called the
Wisconsin Program-Slicing Tool,
that implemented slicing of ANSI C programs.
Although the Wisconsin Program-Slicing Tool
was distributed from 1996-2000 for not-for-profit
research purposes under license from the University of Wisconsin,
it is no longer being distributed.
However, there is a commercial tool available, named
CodeSurfer,
that is derived from the Wisconsin implementation and is available from
GrammaTech, Inc.,
which licensed the technology from the Wisconsin Alumni Research Foundation (a
not-for-profit foundation that uses royalty income from the
technologies it licenses to support research at the University of
Wisconsin).
GrammaTech has improved on the Wisconsin implementation considerably
(both in terms of speed and space efficiency), and extended it to
support C++.
GrammaTech also offers
CodeSurfer to academic researchers
on very favorable terms.
In 2007, Tavis Ormandy of Gentoo Linux said the following about CodeSurfer:
"Our team is responsible for investigating and verifying security problems
in any of the 10,000+ open source programs we distribute. CodeSurfer has
made a dramatic impact on the time it takes us to isolate and understand
security flaws, allowing us to traverse complex and unfamiliar codebases
easily and trace problems to their source."
A group at SAIC/NASA had this to say about CodeSurfer:
"Our group analyzes many mission-critical software projects to
reduce defects and ensure that the software meets requirements. We
conducted a formal study to see if CodeSurfer could improve our
software inspections. We found that CodeSurfer reduced inspection
time by an average of 30%. In addition, when using CodeSurfer the
number of defects found more than doubled."
Slicing was also used in many commercial tools that were developed to address
such code-renovation issues as the Year 2000 Problem ("Y2K bug") and
conversion of financial software to support adoption of the Euro.
One example is IBM's
Rational Transformation Workbench.
(A list of more than 20 such tools can be found
here.)
[Back to the top]
Techniques for Software Renovation
My student Mike Siff and I investigated techniques for converting C programs
to C++ programs that take advantage of the improved features
of C++, such as templates and classes.
For instance, in
[SR96]
we considered the problem of software generalization: Given a program
component C, create a parameterized program component
C' such that C' is usable in a wider variety of
syntactic contexts than C. Furthermore, C' should
be a semantically meaningful generalization of C; namely,
there must exist an instantiation of C' that is equivalent in
functionality to C.
We gave an algorithm that generalizes C functions via type inference:
the original functions operate on specific data types; the result of
generalization is a collection of C++ function templates that operate
on parameterized types.
We also described a general technique for identifying modules in legacy
code [SR99].
The method is based on concept analysis—a branch of lattice
theory that can be used to identify similarities among a set of
objects based on their attributes. We showed how concept analysis can
identify potential modules using both "positive" and "negative"
information, and presented an algorithmic framework to construct a
lattice of concepts from a program, where each concept represents a
potential module.
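The following Python sketch illustrates the underlying idea of concept
analysis on a toy object/attribute table; the table is invented, and
this brute-force closure computation is only meant to convey what a
"concept" is, not the algorithmic framework of [SR99].

from itertools import combinations

def concepts(table):
    """table: object -> set of attributes.  Returns all formal concepts."""
    objs = list(table)
    attrs = set().union(*table.values())

    def common_attrs(os):
        return set.intersection(*(table[o] for o in os)) if os else set(attrs)

    def objs_with(ats):
        return frozenset(o for o in objs if ats <= table[o])

    found = set()
    for r in range(len(objs) + 1):
        for os in combinations(objs, r):
            intent = frozenset(common_attrs(os))
            found.add((objs_with(intent), intent))   # close the pair to make it maximal
    return found

# Functions and the types they operate on (invented example).
table = {'push': {'Stack'}, 'pop': {'Stack'},
         'enqueue': {'Queue'}, 'dequeue': {'Queue'},
         'clear': {'Stack', 'Queue'}}
for extent, intent in sorted(concepts(table), key=lambda c: len(c[0])):
    print(sorted(extent), sorted(intent))
# e.g., ['clear', 'pop', 'push'] share {'Stack'} and ['clear', 'dequeue', 'enqueue']
# share {'Queue'}: two candidate modules.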
The development of better techniques for software renovation
was the subject of Mike Siff's Ph.D. dissertation
[S98].
[Back to the top]
The Use of Program Profiling for Software Maintenance
In 1996, I was asked by the Defense Advanced Research Projects Agency
(DARPA) to plan a project aimed at reducing the impact of the Year
2000 problem on the Department of Defense. DARPA was particularly
interested in whether there were "any techniques in the research
community that could be applied to the Y2K problem and have impact
beyond present commercial Y2K products and services." I spent a
semester as a kind of talent scout for DARPA to see what ideas might
be worth pursuing. The most exciting of the ideas that turned up
concerned a method for using path profiling as a heuristic to locate
some of the sites in a program where there are problematic date
manipulations [RBDL97]
(and more generally as an aid for testing and debugging).
A path profiler instruments a program so that the number of times each
different loop-free path executes is accumulated during an execution
run. With such an instrumented program, each run of the program
generates a path spectrum for the execution—a distribution of the
paths that were executed during that run. A path spectrum is a
finite, easily obtainable characterization of a program's execution on
a dataset, and provides a behavior signature for a run of the program.
Our techniques are based on the idea of comparing path spectra from
different runs of the program. When different runs produce different
spectra, the spectral differences can be used to identify paths in the
program along which control diverges in the two runs. By choosing
input datasets to hold all factors constant except one, the divergence
can be attributed to this factor. The point of divergence itself may
not be the cause of the underlying problem, but it provides a starting
place for a programmer to begin exploring.
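The following Python sketch shows just the spectrum-comparison step on
invented data; a real path profiler is responsible for producing the
spectra, and the path names below are hypothetical.

def spectrum_diff(spectrum_a, spectrum_b):
    """Paths executed in exactly one of two runs (spectra map path ids to counts)."""
    executed_a = {p for p, n in spectrum_a.items() if n > 0}
    executed_b = {p for p, n in spectrum_b.items() if n > 0}
    return executed_a - executed_b, executed_b - executed_a

# Run A: inputs with 19xx dates; Run B: identical inputs except 20xx dates.
run_a = {'main.1': 12, 'main.2': 3, 'fmt_date.1': 15}
run_b = {'main.1': 12, 'main.2': 3, 'fmt_date.2': 15}
print(spectrum_diff(run_a, run_b))
# ({'fmt_date.1'}, {'fmt_date.2'}) -> control diverged inside fmt_date:
# a good place to start looking for date-dependent code.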
One application of this technique is in the "Year 2000 Problem" (i.e.,
the problem of fixing computer systems that use only 2-digit year
fields in date-valued data). In this context, path-spectrum
comparison provides a heuristic for identifying paths in a program
that are good candidates for being date-dependent computations.
A few years later, we found out that essentially the same idea had been discovered
earlier by J. Collofello and L. Cousin [CC87],
as well as by N. Wilde
(who has done extensive research on the subject, and uses the term
"software reconnaissance",
e.g., see [W94,
WC96]).
[Back to the top]
Dataflow Analysis
Interprocedural Dataflow Analysis and Context-Free-Language Reachability
Work that I carried out with Susan Horwitz, Mooly Sagiv, and David
Melski established some interesting connections between the work on
interprocedural program slicing and a number of other program-analysis
problems, such as classical interprocedural dataflow analysis.
In particular, we showed that a large class of program-analysis
problems can be solved by transforming them into instances of a
special kind of
graph-reachability problem,
called context-free-language reachability (CFL-reachability)
[RHS95,
HRS95,
R95b,
R96,
MR98,
R98,
R00].
CFL-reachability is the variant of reachability in which a path
from node s to node t only counts as a valid
s-t connection if the path's labels form a word
in a given context-free language.
The notion of CFL-reachability was originated by
Mihalis Yannakakis,
who developed it in the context of
graph-theoretic methods in database theory.
Context-free-language reachability problems can be solved in
polynomial time by algorithms similar to the ones we originally developed
for interprocedural slicing [HRB90,
RHSR94].
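The following Python sketch is a generic (and unoptimized) worklist
solver for CFL-reachability, applied to a toy matched-call/return
example. The grammar encoding and the graph are invented for
illustration; production systems used in real analyses are generated
from the program being analyzed.

from collections import defaultdict

def cfl_reachability(nodes, edges, productions, epsilon_nts):
    """
    edges:        set of (u, label, v) for the labeled input graph
    productions:  list of (lhs, rhs) with rhs a tuple of 1 or 2 symbols
    epsilon_nts:  nonterminals X with X -> epsilon, so (u, X, u) holds for every u
    Returns all facts (u, X, v): u reaches v along a path derivable from X.
    """
    facts, worklist = set(), []

    def add(fact):
        if fact not in facts:
            facts.add(fact)
            worklist.append(fact)

    for u, lbl, v in edges:
        add((u, lbl, v))
    for X in epsilon_nts:
        for u in nodes:
            add((u, X, u))

    out_edges = defaultdict(set)   # u -> {(label, v)} for facts already processed
    in_edges = defaultdict(set)    # v -> {(label, u)} for facts already processed
    unary = [(lhs, rhs[0]) for lhs, rhs in productions if len(rhs) == 1]
    binary = [(lhs, rhs[0], rhs[1]) for lhs, rhs in productions if len(rhs) == 2]

    while worklist:
        u, Y, v = worklist.pop()
        out_edges[u].add((Y, v))
        in_edges[v].add((Y, u))
        for lhs, rhs0 in unary:
            if rhs0 == Y:
                add((u, lhs, v))
        for lhs, A, B in binary:
            if A == Y:                                  # X -> Y B: need (v, B, w)
                for lbl, w in list(out_edges[v]):
                    if lbl == B:
                        add((u, lhs, w))
            if B == Y:                                  # X -> A Y: need (w, A, u)
                for lbl, w in list(in_edges[u]):
                    if lbl == A:
                        add((w, lhs, v))
    return facts

# "Matched" paths: call_i / ret_i edges must nest properly.
# Grammar:  M -> epsilon | e | M M | A_i ret_i ,   A_i -> call_i M
nodes = ['s', 'c1', 'enter', 'exit', 'r1', 't']
edges = {('s', 'e', 'c1'), ('c1', 'call_1', 'enter'), ('enter', 'e', 'exit'),
         ('exit', 'ret_1', 'r1'), ('r1', 'e', 't'), ('exit', 'ret_2', 't')}
productions = [('M', ('e',)), ('M', ('M', 'M')),
               ('A1', ('call_1', 'M')), ('M', ('A1', 'ret_1')),
               ('A2', ('call_2', 'M')), ('M', ('A2', 'ret_2'))]
facts = cfl_reachability(nodes, edges, productions, epsilon_nts={'M'})
print(('s', 'M', 't') in facts)      # True:  the call_1 ... ret_1 path is matched
print(('enter', 'M', 't') in facts)  # False: every enter-to-t path has an unmatched return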
David Melski and I studied the relationship [MR98]
between CFL-reachability and the program-analysis formalism of set constraints, which was
introduced by John Reynolds
and subsequently studied by
Alex Aiken and
Nevin Heintze.
The PowerPoint presentation prepared for the tutorial that I gave on
this material at PLDI '00 can be found
here.
In more recent work, my students, colleagues, and I have developed even
better methods for interprocedural dataflow analysis based on a formalism called
weighted pushdown systems.
Technology Transfer
In addition to the CFL-reachability-based slicing algorithms that are
used in
CodeSurfer,
the work on program analysis via CFL-reachability had direct impact on
many systems for software model checking and bug detection:
a CFL-reachability solver is one of the key analysis components used in most
such systems. Among them is the well-known
SLAM tool from
Microsoft, which checks C programs against temporal safety properties
to find defects in Windows device drivers—thereby addressing one
of Windows' main weak points.
In particular, as described in
[BR01],
one of the key algorithms used in SLAM is based on the interprocedural dataflow-analysis
algorithm given in [RHS95].
Due to the importance of the problem that it addressed,
the SLAM work even got on Bill Gates's radar screen.
For instance, in a keynote address at the Windows Hardware Engineering
Conference in 2002, Gates touted the work on SLAM as follows:
"… software verification … has been the Holy Grail of Computer
Science for many decades, but now in some very key areas, for
example, driver verification, we're building tools that can do
actual proof about the software and how it works in order to
guarantee the reliability."
(See also
Gates's address at OOPSLA 2002.)
Thus, as one of the key technologies in a tool that has helped to make
Windows much more stable in recent years, CFL-reachability has had a
significant impact on the computing experience of hundreds of millions
of people worldwide.
[Back to the top]
Static Program Analysis Via 3-Valued Logic (Shape Analysis)
The technique for static program analysis using 3-valued logic
that Mooly Sagiv, Reinhard Wilhelm, and I developed
[SRW02,
RSW04]
addresses the problem of analyzing programs that use dynamic
allocation and freeing of storage cells and destructive updating of
structure fields—features found in all modern programming
languages, including C, C++, and Java. In such programs, data
structures can grow and shrink dynamically, with no fixed upper bound
on their size or number. In the case of thread-based languages, such
as Java, the number of threads can also grow and shrink dynamically.
We developed a way to create finite-sized descriptors of
memory configurations that (i) abstract away certain details, but (ii)
retain enough key information so that the analysis can identify
interesting properties that hold.
We used these ideas to create a framework for program
analysis that is capable of expressing rich properties of memory
configurations, and confirming such behavioral properties as
- when the input to a list-insert program
is an acyclic list, the output is an acyclic list, and
- when the input to a list-reversal program that uses
destructive-update operations is an acyclic list, the output is an
acyclic list.
The framework can be instantiated in different ways to create
different program-analysis algorithms that provide answers to
different questions, with varying degrees of efficiency and precision.
The work was originally motivated by the problem of
shape analysis
for programs written in languages that permit destructive updating
of dynamically allocated storage.
The aim of shape analysis is to discover information
(in a conservative way) about the possible "shapes"
of the dynamically allocated data structures to which a program's pointer
variables can point.
As we investigated this problem more deeply, we discovered
methods that allowed us to deal with properties of memory configurations
other than just pure shape properties.
For instance, recent work has concerned numeric properties
[GDDRS04].
The key aspect of our approach is the way in which it makes use of
2-valued and 3-valued logic: 2-valued and 3-valued
logical structures—i.e., collections of relations—are used to
represent concrete and abstract stores, respectively;
individuals represent entities such as memory cells, threads,
locks, etc.;
unary and binary relations encode the contents of
variables, pointer-valued structure fields, and other aspects of
memory states; and first-order formulas with transitive closure are used
to specify properties such as sharing, cyclicity, reachability, etc.
Formulas are also used to specify how the store is affected by the
execution of the different kinds of statements in the programming
language.
The analysis framework can be instantiated in different ways by
varying the relation symbols of the logic, and, in particular, by varying
which of the unary relations control how nodes are folded
together. The specified set of relations
determines the set of properties that will be tracked,
and consequently what properties of stores can be discovered to
hold at different points in the program by the corresponding instance
of the analysis.
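The following Python sketch illustrates just the node-folding step
(canonical abstraction) on an invented four-cell list: concrete nodes
that agree on the chosen unary abstraction predicates are folded into
a single summary node, and predicate values become 1/2 ("unknown")
where the folded nodes disagree. TVLA's treatment of instrumentation
relations, transformers, and 3-valued formula evaluation is, of
course, far richer.

from collections import defaultdict

def canonical_abstraction(nodes, unary, binary, abstraction_preds):
    """
    nodes:  concrete node ids
    unary:  pred -> {node: 0 or 1}
    binary: pred -> {(node, node): 0 or 1}   (absent pairs mean 0)
    abstraction_preds: unary predicates used to decide which nodes to fold
    """
    canon_name = {n: tuple(unary[p].get(n, 0) for p in abstraction_preds)
                  for n in nodes}

    def join3(vals):
        vals = set(vals)
        return vals.pop() if len(vals) == 1 else 0.5   # 1/2 = "don't know"

    abs_unary = {}
    for p, tbl in unary.items():
        groups = defaultdict(list)
        for n in nodes:
            groups[canon_name[n]].append(tbl.get(n, 0))
        abs_unary[p] = {g: join3(vs) for g, vs in groups.items()}

    abs_binary = {}
    for p, tbl in binary.items():
        groups = defaultdict(list)
        for u in nodes:
            for v in nodes:
                groups[(canon_name[u], canon_name[v])].append(tbl.get((u, v), 0))
        abs_binary[p] = {g: join3(vs) for g, vs in groups.items()}
    return set(canon_name.values()), abs_unary, abs_binary

# A 4-cell acyclic list pointed to by x:  x -> n0 -> n1 -> n2 -> n3
nodes = ['n0', 'n1', 'n2', 'n3']
unary = {'x': {'n0': 1}}                                   # "pointed to by x"
binary = {'next': {('n0', 'n1'): 1, ('n1', 'n2'): 1, ('n2', 'n3'): 1}}
abs_nodes, au, ab = canonical_abstraction(nodes, unary, binary, ['x'])
print(abs_nodes)     # two abstract nodes: the x-node and one summary node
print(ab['next'])    # 'next' involving the summary node is 1/2 where cells disagreed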
As a methodology for verifying properties of programs, the advantages
of the 3-valued-logic approach are:
-
No loop invariants are required.
-
No theorem provers are involved, and thus every abstract execution
step must terminate.
-
The method is based on
abstract interpretation
and satisfies conditions that guarantee that the entire process
always terminates.
-
The method applies to programs that manipulate pointers and
heap-allocated data structures.
-
The method eliminates the need for the user to write the usual
proofs required with abstract interpretation—i.e., to
demonstrate that the abstract descriptors that the analyzer
manipulates correctly model the actual heap-allocated data
structures that the program manipulates.
A PowerPoint presentation on some of this material can be found
here.
A prototype implementation that implements this approach has been
created, called TVLA
(Three-Valued-Logic Analyzer).
In addition to
[SRW02,
RSW04],
some of the other key papers about this work include those on
interprocedural shape analysis
[JLRS04,
JLRS08],
numeric properties together with shape properties
[GDDRS04],
applications to program verification
[LRSW00,
LRS06,
LRS07,
ARRSY07],
the use of ideas from incremental computing and databases
to construct abstract transformers
[RSL03,
RSL10,
EGR10],
and the use of machine learning to identify appropriate abstractions
[LRS05].
The Ph.D. dissertations of Alexey Loginov
[L06]
and Denis Gopan
[G07]
developed improved techniques for static analysis based on 3-valued logic.
[Back to the top]
Interprocedural Dataflow Analysis and Weighted Pushdown Systems
The framework for interprocedural dataflow analysis that
Somesh Jha, Stefan Schwoon, and I developed, based on
weighted pushdown systems [RSJM05],
supports a strictly richer set of queries than previous approaches to
interprocedural dataflow-analysis (a problem that has a
35-year
history).
In general, the approach can be applied to any distributive
dataflow-analysis problem for which the domain of transfer functions
has no infinite descending chains (including problems that cannot
be expressed as bit-vector problems, such as linear constant propagation
[SRH96]
and affine-relation analysis
[MOS04]).
Safe solutions are also obtained for problems that are monotonic but
not distributive.
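As a point of reference, the following Python sketch shows an ordinary
iterative solver for a distributive dataflow problem (a toy
single-variable constant propagation with invented transfer
functions). A weighted pushdown system additionally encodes the
call/return structure of the program, which is what makes the
stack-qualified queries and witness traces described below possible;
none of that machinery appears in this sketch.

def solve(cfg_succ, transfer, meet, init, entry, nodes):
    """
    cfg_succ: node -> successors;  transfer[(u, v)]: dataflow fact -> dataflow fact
    meet:     combines facts arriving along different paths;  init: fact at entry
    Returns the meet-over-all-paths solution when the transfer functions are distributive.
    """
    val = {n: None for n in nodes}        # None = "not yet reached"
    val[entry] = init
    worklist = [entry]
    while worklist:
        u = worklist.pop()
        for v in cfg_succ.get(u, ()):
            new = transfer[(u, v)](val[u])
            combined = new if val[v] is None else meet(val[v], new)
            if combined != val[v]:
                val[v] = combined
                worklist.append(v)
    return val

# Facts about variable x: an integer constant, or 'NAC' ("not a constant").
meet = lambda a, b: a if a == b else 'NAC'
transfer = {(1, 2): lambda f: 5,                                 # x = 5  (one branch)
            (1, 3): lambda f: 7,                                 # x = 7  (other branch)
            (2, 3): lambda f: f,                                 # skip
            (3, 4): lambda f: 'NAC' if f == 'NAC' else f + 1}    # x = x + 1
print(solve({1: [2, 3], 2: [3], 3: [4]}, transfer, meet,
            init='NAC', entry=1, nodes=[1, 2, 3, 4]))
# {1: 'NAC', 2: 5, 3: 'NAC', 4: 'NAC'}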
The benefits of the approach can be summarized as follows:
- Conventional dataflow-analysis algorithms merge together the
values for all states associated with the same program point,
regardless of the states' calling context. With the
dataflow-analysis algorithm obtained via WPDSs, dataflow
queries can be posed with respect to a regular language of
stack configurations. (Conventional merged dataflow
information can also be obtained by issuing appropriate
queries.)
- Because the algorithm for solving weighted path problems on
WPDSs can provide a set of witness paths that justify an
answer, it is possible to provide an explanation of why the
answer to a dataflow query has the value reported.
Weighted pushdown systems are also a useful formalism in a completely
separate application domain, namely, access control of shared computing resources
in distributed systems
[JR02,
SJRS03,
JR04,
JSWR06,
WJRSS06].
Technology Transfer
The algorithms described in
[RSJ03,
RSJM05], along with
the improvements described in
[LRB05,
LR06,
LKRT07,
LR08b],
have been implemented in two libraries for use in creating
program-analysis tools
[KRML04,
KLR07].
These libraries have been used to create prototype implementations of
context-sensitive interprocedural dataflow analyses for uninitialized
variables, live variables, linear constant propagation, and the
detection of affine relationships, as well as many other problems.
[Back to the top]
Analysis of Executables
In our work on static analysis of stripped executables, we devised
methods that—without relying on symbol-table or debugging
information—recover intermediate representations that are
similar to those that the intermediate phases of a compiler create for a
program written in a high-level language.
The goal is to provide a platform that an analyst can
use to understand the workings of COTS components, plugins, mobile
code, and DLLs, as well as memory snapshots of worms and
virus-infected code.
Starting from a stripped executable, our algorithms identify variables, data
objects, and types, and track how they are manipulated by the
executable. This information provides answers to such questions as
"what could this dereference operation access?", "what function
could be called at this indirect call site?", etc.
Our initial work concerned x86 executables.
In 2003, my student Gogul Balakrishnan and I developed a
static-analysis algorithm for x86 machine code that is able to track
the flow of values through memory
[BR04].
We incorporated this algorithm, along with algorithms to identify
variables, data objects, and types, in CodeSurfer/x86
[BGRT05]—the
first program-slicing tool for machine code that can help
with understanding dependences across memory updates and memory
accesses.
Later, we extended CodeSurfer/x86 to create the first automatic
program-verification tool for stripped executables: it allows one to
check that a stripped executable conforms to an API-usage rule
(specified as a finite-state machine)
[BRKLLMGYCT05, BR08].
More recently, our work on the TSL system
has allowed us to create similar systems for other instruction sets
automatically, starting from a specification of the concrete
semantics of a machine-code instruction set.
Static analysis of machine code was the subject of Gogul
Balakrishnan's Ph.D. dissertation
[B07].
See also our ACM TOPLAS paper about the work
[BR10].
(The 2004 paper about the work [BR04]
received the ETAPS Best-Paper Award for 2004 from the
European Association for Programming Languages and Systems (EAPLS).)
[Back to the top]
Transformer Specification Language (TSL)
In 2008, my student Junghee Lim and I created the TSL system
[LR08a,
LR12],
a meta-tool to help in the creation of tools for analyzing machine
code. (TSL stands for Transformer Specification
Language.)
From a single specification of the concrete
semantics of a machine-code instruction set, TSL automatically
generates correct-by-construction implementations of the
state-transformation functions needed in
state-space-exploration tools that use static analysis, dynamic
analysis, symbolic analysis, or any combination of the three
[LR08a,
LLR09a,
LLR10,
TLLBDEAR10].
The TSL language is a strongly typed, first-order functional language
with a datatype-definition mechanism for defining recursive datatypes,
plus deconstruction by means of pattern matching. Writing a TSL
specification for an instruction set is similar to writing an
interpreter in first-order ML. For instance, the specification of an
instruction set's concrete semantics is written as a TSL function
state interpInstr(instruction I, state S) {...};
where instruction and state are user-defined
datatypes that represent the instructions and the semantic states,
respectively.
TSL's meta-language provides a fixed set of base-types; a fixed set of
arithmetic, bitwise, relational, and logical operators; and a facility
for defining map-types. The meanings of the meta-language constructs can be
redefined by supplying alternative interpretations of them.
When semantic reinterpretation is performed in this
way—namely, on the operations of the meta-language—it
is independent of any given instruction set. Consequently,
once a reinterpretation has been defined that reinterprets
TSL in a manner appropriate for some state-space-exploration
method, the same reinterpretation can be applied to
each instruction set whose semantics has been specified
in TSL.
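The following Python sketch conveys the reinterpretation idea in
miniature; it is not TSL (TSL is a strongly typed first-order
functional language), and the two-instruction "instruction set" and
the sign domain below are invented for illustration.

class ConcreteInterp:
    """Concrete interpretation of the base operations (yields an emulator)."""
    def const(self, n): return n
    def add(self, a, b): return a + b

class SignInterp:
    """Sign-domain reinterpretation (yields an abstract transformer)."""
    # abstract values: '-', '0', '+', and 'T' (unknown sign)
    def const(self, n): return '0' if n == 0 else ('+' if n > 0 else '-')
    def add(self, a, b):
        if a == '0': return b
        if b == '0': return a
        return a if a == b else 'T'

def interp_instr(instr, state, I):
    """One interpInstr-style transformer, parameterized by interpretation I."""
    if instr[0] == 'mov':                 # ('mov', dst, constant)
        _, dst, n = instr
        return {**state, dst: I.const(n)}
    if instr[0] == 'add':                 # ('add', dst, src1, src2)
        _, dst, s1, s2 = instr
        return {**state, dst: I.add(state[s1], state[s2])}
    raise ValueError(instr[0])

# The same "specification" yields both a concrete emulator ...
print(interp_instr(('add', 'eax', 'eax', 'ebx'),
                   {'eax': 3, 'ebx': -4}, ConcreteInterp()))   # {'eax': -1, 'ebx': -4}
# ... and an abstract transformer for a sign analysis:
print(interp_instr(('add', 'eax', 'eax', 'ebx'),
                   {'eax': '+', 'ebx': '-'}, SignInterp()))    # {'eax': 'T', 'ebx': '-'}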
The reinterpretation mechanism causes TSL to act as a
"Yacc-like" tool for automatically generating correct-by-construction
implementations of the state-transformation functions needed in
state-space-exploration tools.
Moreover, because all components are generated from
a single specification of the concrete
semantics of an instruction set,
the generated implementations are guaranteed to be mutually
consistent (and also to be consistent with an instruction-set emulator
that is generated from the same specification of the concrete
semantics).
TSL was the subject of Junghee Lim's Ph.D. dissertation
[L11].
(The 2008 paper about TSL [LR08a]
received the ETAPS Best-Paper Award for 2008 from the
European Association for Programming Languages and Systems (EAPLS).)
[Back to the top]
Symbolic Methods for Abstract Interpretation
Since 2002, I have been interested in symbolic methods for abstract
interpretation. The work aims to bridge the gap between (i) the use of logic
for specifying program semantics and performing program analysis, and
(ii) abstract interpretation
[RSY04,
YRS04,
TR12a,
TR12b,
TER12,
ELSAR13,
TLLR13].
In [RSY04],
Mooly Sagiv, Greta Yorsh, and I gave the first method that makes
such a connection: we showed that, for a broad class of situations, given
a formula φ in some logic L, it is possible to find the best value a in
a given abstract domain Abs that over-approximates the
meaning of φ (i.e., [[φ]] ⊆ γ(a)).
We call this problem symbolic abstraction.
If you think of logic L as being a rich language for
expressing meanings, and abstract domain Abs as
corresponding to a logic fragment L' that is less
expressive than L, then symbolic abstraction addresses a
fundamental approximation problem:
Given φ ∈ L, find the strongest
consequence of φ that is expressible in logic fragment
L'.
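The following Python sketch (using the z3 solver) follows the general
counterexample-guided recipe of [RSY04], specialized to an interval
domain over a single integer variable; the formula is an invented
example, and the actual procedure handles much richer logics and
abstract domains.

from z3 import Int, Solver, And, Not, BoolVal, sat

x = Int('x')
phi = And(x >= 2, x <= 9, x != 5)        # the formula to abstract

def gamma(a):
    """Concretization of an interval, as a formula over x."""
    if a is None:                        # bottom: no concrete states
        return BoolVal(False)
    lo, hi = a
    return And(x >= lo, x <= hi)

def join(a, v):
    """Least interval containing interval a and the point v."""
    return (v, v) if a is None else (min(a[0], v), max(a[1], v))

a = None                                 # start at bottom
s = Solver()
while True:
    s.push()
    s.add(phi, Not(gamma(a)))            # is some state of phi not yet covered by a?
    if s.check() != sat:
        s.pop()
        break                            # no: a is the best interval over-approximation
    m = s.model()
    a = join(a, m[x].as_long())          # yes: join that state into a
    s.pop()

print(a)                                 # (2, 9)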
From 2011 to 2016, my student Aditya Thakur and I pursued the
insight that generalized versions of an old, and not widely used, method
for validity checking of propositional-logic formulas, called
Stålmarck's method, hold great promise for symbolic abstraction.
The symbolic methods for abstract interpretation that we have
subsequently developed based on this insight
[TR12a,
TR12b,
TER12,
TLLR13]
offer much promise for building more powerful program-analysis tools.
These methods (i) allow more precise implementations
of abstract-interpretation primitives to be created, and (ii) drastically
reduce the time taken to implement such primitives because correctness
is ensured by construction.
In particular, they provide a way to create correct-by-construction abstract
interpreters that also attain the fundamental limits on precision that
abstract-interpretation theory establishes.
For instance, in [TLLR13],
we described a systematic method for implementing an abstract interpreter
that solves the following problem:
Given program P, and an abstract domain A,
find the most-precise inductive A-invariant for P.
Moreover, our generalized-Stålmarck methods are "dual-use":
In addition to providing methods for building improved
abstract-interpretation tools, they also provide methods for building
improved logic solvers that use abstract interpretation to speed up
the search that a solver carries out.
In particular, we showed how Stålmarck's method is an instantiation of
a generic framework parameterized by an abstract domain over Booleans;
consequently, different instantiations of the framework lead to new
decision procedures for propositional logic
[TR12b].
Moreover, this abstraction-based view allows us to lift
Stålmarck's method from propositional logic to richer logics:
to obtain a method for richer logics, instantiate the parameterized
version of Stålmarck's method with richer abstract domains
[TR12a].
We call such a decision-procedure design, which is parameterized by an abstract
domain, a Satisfiability Modulo Abstraction (SMA) solver
[RT14].
Recently, we created an SMA solver for separation logic
[TBR14].
Separation logic (SL) is an expressive logic for reasoning about
heap structures in programs, and provides a mechanism for concisely describing program
states by explicitly localizing facts that hold in separate regions of
the heap.
SL is undecidable in general, but by using an abstract domain of
shapes [SRW02]
we were able to design a semi-decision procedure for SL.
The design of algorithms for symbolic abstraction and their
application was the topic of Aditya Thakur's Ph.D. dissertation
[T14].
[Back to the top]
Newton's Method for Program Analysis
Recently, Esparza et al. generalized Newton's method—the classical
numerical-analysis algorithm for finding roots of real-valued
functions—to a method for finding fixed-points of systems of
equations over semirings. Their method provides a new way to solve
interprocedural dataflow-analysis problems. As in its real-valued
counterpart, each iteration of their method solves a simpler
``linearized'' problem.
However, there is an important difference between the
dataflow-analysis and numerical-analysis contexts: when Newton's
method is used on numerical-analysis problems, multiplicative
commutativity is relied on to rearrange expressions of the form
``a * X * b'' into ``X * a * b.'' Equations of the form
``X = X * a * b'' correspond to path problems described by regular
languages. In contrast, when Newton's method is used for
interprocedural dataflow analysis, the ``multiplication'' operation
involves function composition, and hence is non-commutative:
``X = a * X * b'' cannot be rearranged into ``X = X * a * b,'' and
thus the equations correspond to path problems described by linear
context-free languages (LCFLs).
What we contributed in
[RTP17,
RTP16]
is a way to convert each LCFL problem into a
regular-language problem. Given that the language {aⁱbⁱ} is the
canonical example of an LCFL that is not regular, this transformation
sounds impossible. The secret is that we require the
semiring to support a few additional operations (what we call
``transpose,'' ``tensor product,'' and ``detensor'').
Moreover, one has such operations for the
so-called ``predicate-abstraction problems'' (an important family of
dataflow-analysis problems used in essentially all modern-day software
model checkers). In predicate-abstraction problems,
- a semiring value is a square Boolean matrix
- the product operation is Boolean matrix multiplication
- the sum operation is pointwise "or"
- transpose (written as a superscript, e.g., aᵗ) is matrix transpose, and
- tensor product is Kronecker product of square Boolean matrices
The key step is to take each equation right-hand side of the form
``X = a1 * X * b1 + a2 * X * b2'' and turn it into
``Z = Z * (a1ᵗ tensor b1) + Z * (a2ᵗ tensor b2).''
The reason this works is that linear paths in the Z system mimic
derivation trees in the X system, e.g.,
(a1ᵗ tensor b1) * (a2ᵗ tensor b2) = (a1ᵗ * a2ᵗ) tensor (b1 * b2)
                                  = (a2 * a1)ᵗ tensor (b1 * b2),
in which a1 is matched with b1 (innermost) and a2 is matched with b2
(outermost), which produces the kind of mirrored symmetry that we need
to properly track the values that arise in the X system, which have
such symmetric correlations. (More precisely, the detensor operation
performs
detensor(aᵗ tensor b) = a * b,
so that
detensor((a2 * a1)ᵗ tensor (b1 * b2)) = a2 * a1 * b1 * b2,
which has the mirrored symmetry found in the values of derivation
trees in the X system.)
The Z system's paths encode all and only the derivation trees of the X
system. To use this idea to solve an LCFL path problem precisely,
there is one further requirement: the ``detensor'' operation must
distribute over the tensored-sum operation. This property causes the
detensor of the sum-over-all-Z-paths to equal the desired
sum-over-all-X-tree-valuations.
It turns out that such a distributive detensor operation exists for
Kronecker products of Boolean matrices, and thus all the pieces fit
together for the predicate-abstraction problems.
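The following Python/numpy sketch checks the two identities above (the
mixed-product rearrangement and the detensor equation) on random
Boolean matrices; the function names, and the reading of detensor as
an "or" over the diagonal row blocks of the Kronecker product, are
mine and introduced only for this illustration.

import numpy as np

rng = np.random.default_rng(0)
n = 3
def rand_bool():
    return rng.integers(0, 2, size=(n, n)).astype(bool)

def bmul(X, Y):
    """Boolean matrix product."""
    return (X.astype(int) @ Y.astype(int)) > 0

def tensor(X, Y):
    """Kronecker product over the Boolean semiring."""
    return np.kron(X.astype(int), Y.astype(int)).astype(bool)

def detensor(Z):
    """Recover a*b from a^t tensor b: 'or' over the rows indexed (m, m)."""
    diag_rows = [m * n + m for m in range(n)]
    return Z[diag_rows, :].any(axis=0).reshape(n, n)

a1, b1, a2, b2 = rand_bool(), rand_bool(), rand_bool(), rand_bool()

lhs = bmul(tensor(a1.T, b1), tensor(a2.T, b2))   # one Z-system path step followed by another
rhs = tensor(bmul(a2, a1).T, bmul(b1, b2))       # (a2*a1)^t tensor (b1*b2)
print(np.array_equal(lhs, rhs))                  # True: the mixed-product rearrangement

mirrored = bmul(bmul(bmul(a2, a1), b1), b2)      # a2*a1*b1*b2, the derivation-tree value
print(np.array_equal(detensor(rhs), mirrored))   # True: detensor recovers it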
In addition to its use in Newtonian program analysis, we know of two
other applications of the rearrangement trick:
it was originally used in
[LTKR08]
in a method for context-bounded analysis of concurrent programs;
and although we used a different method in our work on abstract error projection
[LKRT07],
the rearrangement trick could have been used, and would have improved the result.
Zak Kincaid, Jason Breck, John Cyphert, and I have continued this line of research, and
developed a number of additional techniques that have been incorporated in a system
called ICRA (Interprocedural Compositional Recurrence Analysis),
which finds invariants in numeric programs
[KBFBR17,
KCBR18,
KBCR19,
CBKR19,
BCKR20,
KRC21].
That work introduces versions of ``transpose,'' ``tensor product,'' and ``detensor''
for use with logical formulas.
[Back to the top]
Program Synthesis
Program synthesis refers to the task of finding a program within a
given search space that meets a given behavioral specification
(typically a logical formula or a set of input-output examples).
The general problem is as follows:
Given a specification S that relates input values and output
values, create a program P that satisfies S—i.e., for all
inputs x, the input/output pair (x, P(x)) satisfies S.
The benefit of creating software via a synthesis tool is that when
the tool produces a program that satisfies S, the program is
correct by construction.
In our work, we are developing a new synthesis framework, called ``semantics-guided synthesis''
(SemGuS) that attempts to be both language-agnostic and solver-agnostic.
The goal of SemGuS is to provide a general, logical framework that expresses
the core computational problem of program synthesis, without being tied to
a specific solution or implementation.
What the SemGuS work brings to the table is the idea of
parameterizing the core synthesis algorithm with the semantics
of the primitives out of which the target program is to be
constructed—hence the name ``semantics-guided synthesis.''
In addition to a syntactic search space and a behavioral
specification, SemGuS also allows the user to define the semantics of
constructs in the grammar in terms of a set of inference rules.
The SemGuS framework is described in
[KHDR21,
DHKR21],
and the SemGuS page is here: semgus.org.
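The following Python sketch conveys the "semantics as a parameter"
idea in miniature: a tiny bottom-up enumerative synthesizer in which
the semantics of each production is supplied as data. The grammar,
semantics, and examples are invented, and SemGuS itself specifies
semantics via logical inference rules rather than Python functions.

import itertools

SEMANTICS = {
    'x':    lambda env: env['x'],
    'y':    lambda env: env['y'],
    '0':    lambda env: 0,
    '1':    lambda env: 1,
    'plus': lambda env, a, b: a + b,
    'sub':  lambda env, a, b: a - b,
}

def evaluate(term, env):
    op, *args = term
    vals = [evaluate(a, env) for a in args]
    return SEMANTICS[op](env, *vals) if vals else SEMANTICS[op](env)

def synthesize(examples, max_size=5):
    """Return a smallest term agreeing with all (input-env, output) examples."""
    terms_by_size = {1: [(leaf,) for leaf in ('x', 'y', '0', '1')]}
    for size in range(1, max_size + 1):
        for t in terms_by_size.get(size, []):
            if all(evaluate(t, env) == out for env, out in examples):
                return t
        new = []                          # build terms of size size+1
        for op in ('plus', 'sub'):
            for ls in range(1, size):
                rs = size - ls
                for l, r in itertools.product(terms_by_size.get(ls, []),
                                              terms_by_size.get(rs, [])):
                    new.append((op, l, r))
        terms_by_size[size + 1] = new
    return None

examples = [({'x': 1, 'y': 2}, 3), ({'x': 5, 'y': 7}, 12)]
print(synthesize(examples))               # ('plus', ('x',), ('y',))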
Along the way, we obtained new results on how to show that a program-synthesis
problem is ``unrealizable'' (i.e., the problem has no solution).
While many program synthesizers are good at finding a solution when one exists,
they typically fail, or do not terminate, when the given synthesis problem
is unrealizable.
The lack of knowledge about this problem was an important lacuna in the field:
the key to an efficient synthesis algorithm is pruning of the search space;
in particular, when synthesis subproblems are unrealizable, one wants
that part of the search space to be pruned so that the synthesizer will never again
look for a solution there—because none is to be found.
Our work has started to fill this gap by introducing multiple approaches
for proving that synthesis problems are unrealizable
[HBCDR19,
HCDR21b].
We also developed techniques for synthesizing programs that meet a
user-specified asymptotic-complexity bound
[HCDR21a].
[Back to the top]
Model Checking Concurrent Programs
Context-Bounded Analysis
In context-bounded analysis, a program's behavior is explored for a
fixed number of context switches; however, running threads are allowed to
perform an a priori unbounded amount of work between context
switches. Context-bounded analysis was introduced by Shaz Qadeer and
Jakob Rehof; however, their algorithm had two drawbacks: (i) it was limited
to Boolean abstract domains (i.e., predicate-abstraction domains), and
(ii) it performed certain enumerations of reachable configurations
during the process of state-space exploration. In 2008, Akash Lal,
Tayssir Touili, Nicholas Kidd, and I showed how symbolic methods
could be employed in context-bounded analysis. In particular,
we gave transducer-based techniques that permit the
explicit-enumeration step of the Qadeer-Rehof algorithm to be avoided
[LTKR08].
Akash Lal and I showed how any context-bounded-analysis problem can be
transformed into an equivalent sequential problem; after applying the
transformation, any existing tool for sequential analysis can then be
applied [LR08c,
LR09].
When we compared the performance of the latter analysis
technique against a state-of-the-art tool for verifying
concurrent device drivers (Daniel Kroening's DDVerify,
using SMV as its analysis engine), we found that the
median improvement obtained was a 30-fold
speedup! (See Figure 7.9 on page 205 of
[L09].)
Context-bounded analysis was the subject of Akash Lal's
Ph.D. dissertation
[L09].
For this work, Akash was a co-recipient of the 2009 ACM SIGPLAN
John C. Reynolds Doctoral Dissertation Award.
Akash was also named to MIT Technology Review's
2011 India TR-35 list.
In 2023,
Akash and I, along with M. Musuvathi, S. Qadeer, and J. Rehof,
received the CAV Award
for the "introduction of context-bounded analysis and its application to systematic testing of concurrent programs."
[Back to the top]
Detecting Atomicity Violations
Our work on detecting atomicity violations focused on the problem
of detecting data races that may involve multiple data locations.
Problems can arise when a program is supposed to maintain an invariant
over multiple data locations, but accesses/updates are not protected
correctly by locks.
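The following Python fragment (illustrative only; the tool itself targets
Java) shows the kind of bug in question: every individual access is
protected by a lock, yet the invariant over the pair of locations can
still be observed in a broken state.

```python
import threading

balance_a, balance_b = 100, 0        # intended invariant: balance_a + balance_b == 100
lock_a, lock_b = threading.Lock(), threading.Lock()

def transfer(amount):
    """Each access is lock-protected, but the two updates are not
    performed atomically as a pair."""
    global balance_a, balance_b
    with lock_a:
        balance_a -= amount
    # A thread running audit() at this point sees the invariant broken:
    # balance_a + balance_b == 100 - amount.
    with lock_b:
        balance_b += amount

def audit():
    with lock_a, lock_b:
        return balance_a + balance_b   # may not equal 100
```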
The tool that we built analyzes Java programs.
Originally, the analysis engine in the tool employed a semi-decision
procedure; however, after introducing two problem transformations
that were motivated by the desire to reduce the imprecision caused by the
abstractions that the tool applies
[KLR08,
KRDV09],
we discovered that for the model of concurrent programs
we were working with—which permits
(i) reentrant locks,
(ii) an unbounded number of context switches, and (iii) an
unbounded number of lock acquisitions—the problem is decidable
[KLTR09].
(The couplings between different processes are sufficiently weak
that the undecidability of checking emptiness of the intersection
of two context-free languages does not come into play.)
When we compared the performance of the decision procedure against
the semi-decision procedure, we found that the decision procedure
was 34 times faster overall (i.e., comparing the sum of the
running times on all queries, with a 300-second timeout).
Moreover, with the 300-second timeout threshold, for each query
where the semi-decision procedure timed out (roughly 49% of
the queries), the decision procedure succeeded within the
allotted time.
Detection of atomicity violations was the subject of
Nicholas Kidd's Ph.D. dissertation
[K09].
[Back to the top]
Policy Weaving
System security is a cross-cutting issue.
Unfortunately, it is hard to devise a formalism that allows
system-wide policies to be both (i) easily understood and (ii) easily
implemented and maintained. In particular, when the elements of
security policies are expressed directly in the code—e.g., via
calls on security primitives, as in Decentralized Information Flow
Control (DIFC) systems, such as Asbestos, HiStar, and Flume, or in
capability-based systems, such as Capsicum—the
policy implementation is spread throughout the codebase of an
application. With such a structure, it is difficult to understand what
overall policy the application actually implements, which complicates
documenting the policy, debugging the policy, and updating the policy
as the codebase evolves in response to changes in requirements and the
threat landscape.
In contrast, the focus of our work is on a declarative approach to system security.
In our approach, a policy specification is a first-class notion—e.g.,
the policy specification for a given application is a single
document, so that all elements of the policy are gathered together in
one place, rather than scattered throughout the codebase.
A security policy is still implemented in terms of a collection of
code snippets to invoke policy-enforcement primitives at different
points in the program.
However, users are freed from the task of introducing such snippets
manually.
The goal of our work is to create tools to support policy
weaving, which take over the job of rewriting the program: they place
code at appropriate locations so that suitable security primitives are
invoked at the right moments to enforce the policy.
A policy-weaving tool takes as input
(i) a program P, and
(ii) a policy R (written in some suitable policy-specification language),
and rewrites P to enforce R.
A policy-weaving tool supports a declarative approach to security.
The policy R can be stated as a single, easily-understood
document, and it is the obligation of the policy-weaving tool to find
a way to place code for invoking enforcement primitives so that all
executions of P satisfy R.
We also work on meta-tools to support policy weaving.
In particular, one of our goals is to create a policy-weaver generator.
Such a tool takes as input specifications of
- the semantics of a programming language L
- the semantics of the policy-enforcement primitives to be used
and generates a policy weaver.
Our vision is similar to aspect-oriented programming (in that a
cross-cutting issue is specified separately and
then automatically woven into the fabric of the
implementation).
However, the approach that we envision goes beyond
aspect-oriented programming in an important way: whereas conventional
aspect-oriented programming systems use syntactically-oriented
mechanisms and allow aspect code to be placed in a rather limited set
of locations, our approach is based on programming-language and
policy semantics to identify automatically
the program locations where policy-enforcement primitives need to be placed.
Our approach also goes beyond simple finite-state reference monitors:
policies have visibility into such entities as
processes, principals, and communication channels.
A key challenge is to design a policy weaver that can reason about the
potentially-unbounded set of states that a program may reach, and to
choose system primitives and parameters from a potentially unbounded
number of options to keep the program from reaching states that constitute a
violation of the policy provided by the programmer.
Our approach to designing a weaver generator is based on the theory of
two-player safety games.
Such a game is played by two players, called the Attacker and the
Defender, on a graph in which each node is labeled as an Attacker
node or a Defender node.
The state of the game is a node of the graph.
The goal of the Attacker is to reach a state labeled as an
accepting node;
the goal of the Defender is to thwart the Attacker by preventing
the Attacker from ever reaching an accepting node.
The game is played in turns:
on each turn, if the state is an Attacker state, then the Attacker
can change the state to any adjacent node of the graph, and
analogously for Defender states and Defender-initiated state changes.
A winning Defender strategy for a game is a
decision tree of choices that the Defender can make so that an
accepting state is never reached.
The core of a policy weaver created by our weaver generator is an
algorithm to find a winning Defender strategy
[HJR12].
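To make the game-theoretic core concrete, the following sketch solves a
finite two-player safety game by the standard backward "attractor"
computation. It shows what a winning Defender strategy looks like, but it
is not the algorithm of [HJR12], which must cope with infinite state
spaces and unbounded instrumentation choices.

```python
from collections import deque

def solve_safety_game(nodes, edges, attacker_nodes, accepting):
    """Return (attr, strategy): `attr` is the set of nodes from which the
    Attacker can force a visit to an accepting node, and `strategy` maps
    each safe Defender node to one successor that stays outside `attr`.
    Assumes every node has at least one successor and no parallel edges."""
    preds = {v: [] for v in nodes}
    for u in nodes:
        for v in edges[u]:
            preds[v].append(u)

    attr = set(accepting)                           # Attacker already wins here
    remaining = {v: len(edges[v]) for v in nodes}   # Defender successors not yet in attr
    work = deque(attr)
    while work:
        v = work.popleft()
        for u in preds[v]:
            if u in attr:
                continue
            if u in attacker_nodes:                 # one bad successor suffices
                attr.add(u)
                work.append(u)
            else:
                remaining[u] -= 1
                if remaining[u] == 0:               # every Defender move is bad
                    attr.add(u)
                    work.append(u)

    strategy = {v: next(s for s in edges[v] if s not in attr)
                for v in nodes
                if v not in attacker_nodes and v not in attr}
    return attr, strategy
```

In the weaver setting, Defender moves correspond to possible
instrumentations of the program, and a computed strategy tells the weaver
which enforcement choices keep every execution away from the accepting
(policy-violating) states.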
A policy weaver generated by the policy-weaver generator works as follows:
given program P and policy R, the policy weaver constructs a game in
which
- the Attacker represents the threat model (i.e., all possible system attacks),
- the Defender represents possible instrumentations of the program, and
- accepting nodes represent system states reached when the instrumented program
violates the policy.
If the weaver finds a winning Defender strategy for the game, it uses
the strategy to instrument the program to satisfy the policy.
We have also been working on applying policy weaving to JavaScript
programs.
JavaScript is the most popular language used to create dynamic content
for web browsers.
Moreover, the nature of the Web has resulted in an ecosystem in which
JavaScript programs and libraries, often written by untrusted third
parties, are regularly incorporated into applications and combined in
every way imaginable.
While this practice has obvious benefits in terms of convenience and
not "reinventing the wheel," it creates glaring security holes because
of a complete lack of isolation between modules in a JavaScript
environment.
We chose this context to apply our policy-weaving techniques because
of the obviously unfilled need for a security mechanism, and because
of the challenge presented by JavaScript's highly dynamic, untyped
nature.
We have created the JAM policy weaver [FJJRPSY12],
coupled with the JAMScript enforcement mechanism.
("JAM" stands for JAvascript Model checker.)
JAM can be used by a website administrator to apply a security
policy—in the form of a finite-state automaton—that
specifies a set of disallowed execution traces of a JavaScript
program.
The program is analyzed statically with respect to the policy, and an
instrumented program that is guaranteed to not violate the policy
specification is produced.
This analysis consists of a counterexample-guided,
abstraction-refinement loop that repeatedly creates an incrementally
more precise abstract model of the program.
Because this model represents a superset of the program's actual
behaviors, any execution trace that is not
realizable in the model also cannot occur in the concrete program.
Static analysis can be expensive, especially in the context of a
highly dynamic language such as JavaScript.
Thus, the algorithm used in JAM provides for a system-resource cutoff threshold,
which terminates the analysis when the threshold is reached.
This case still results in a safely instrumented program, although the
program may contain some amount of spurious instrumentation, resulting
in extra runtime overhead.
In this way, JAM can be tuned to balance the trade-off between
static-analysis overhead and dynamic overhead.
The instrumentation that JAM weaves into a JavaScript program comes in
the form of transactions.
Potentially malicious statements in the untrusted code (i.e., those
that could not be proven safe by the static analysis) are wrapped in
transaction blocks, which are speculatively executed in a manner that
records all actions (reads, writes, and function calls) and suppresses
all effects of the actions.
This approach allows the effects of the statements to be evaluated
with respect to the security policy, at run-time inside the transaction.
The program is terminated—without the effects of the actions
being visible externally—if it is determined that any prefix
of the sequence of actions will violate the policy.
If no policy-violating action sequence is observed, the transaction is
committed, and execution continues as before.
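As a rough illustration of the run-time decision (the policy states and
action names below are invented, and this is not JAMScript's actual API),
committing or aborting a transaction amounts to running the recorded
action sequence through the policy automaton and checking whether any
prefix reaches a disallowed state:

```python
# Hypothetical policy: flag traces that read a cookie and later send a request.
POLICY = {
    ("start", "read_cookie"): "tainted",
    ("tainted", "send_request"): "violation",
}
VIOLATION_STATES = {"violation"}

def transaction_ok(recorded_actions):
    """Return True iff no prefix of the recorded action sequence is disallowed."""
    state = "start"
    for action in recorded_actions:
        state = POLICY.get((state, action), state)  # irrelevant actions leave the state unchanged
        if state in VIOLATION_STATES:
            return False     # abort: the suppressed effects are discarded
    return True              # commit: the suppressed effects are applied

print(transaction_ok(["read_cookie", "set_title", "send_request"]))  # False: aborted
```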
[Back to the top]
Incremental Computing
The attribute-updating algorithm that I developed as part of my
Ph.D. research
[R82,
RTD83,
R84]
is an example of an "incremental updating algorithm"—an
algorithm that makes use of the solution to one problem instance
I1 to find the solution to a nearby problem instance
I2, where I2 is created by making
a small change to problem instance I1.
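As a small example of the flavor of such algorithms (a minimal sketch,
not the Ramalingam-Reps algorithms cited below), the following routine
repairs single-source shortest-path distances after one edge weight
decreases, re-examining only vertices whose distances can actually
improve:

```python
import heapq

def decrease_edge(dist, succs, u, v, new_w):
    """Update shortest-path distances `dist` (a dict with float('inf') for
    unreached vertices) after the weight of edge (u, v) in the adjacency
    map `succs[u] = [(v, w), ...]` is decreased to `new_w`.
    Assumes nonnegative edge weights."""
    succs[u] = [(x, (new_w if x == v else w)) for (x, w) in succs[u]]
    if dist[u] + new_w >= dist[v]:
        return dist                        # nothing improves; we are done
    dist[v] = dist[u] + new_w
    heap = [(dist[v], v)]
    while heap:
        d, x = heapq.heappop(heap)
        if d > dist[x]:
            continue                       # stale queue entry
        for (y, w) in succs[x]:
            if dist[x] + w < dist[y]:      # propagate only genuine improvements
                dist[y] = dist[x] + w
                heapq.heappush(heap, (dist[y], y))
    return dist
```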
Over the years, I have encountered such problems in a variety of
contexts, including
interactive program-development environments
[R82,
RTD83,
R84,
RMT86,
R88],
dynamic graph problems
[RR94a,
RR94b,
RR96a,
RR96b],
the construction of abstract transformers for program-analysis
problems
[RSL03,
RSL10,
EGR10],
and propagation strategies in dataflow-analysis solvers
[RSJM05].
My interest in incremental computing also led me to look at ways of
transforming a program that computes a function
F(x) on floating-point numbers into programs that
compute good approximations to the derivative F'(x), the
first divided difference
F[x0,x1], Taylor-series
expansions of F(x), and higher-order divided differences of
F(x), as well as multivariate versions of such transformations
[RR00,
RR03]
(see computational differentiation and computational divided differencing).
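For reference, the first divided difference mentioned above is

    F[x0,x1] = (F(x0) - F(x1)) / (x0 - x1),

which tends to F'(x0) as x1 approaches x0. Evaluating the right-hand side
directly loses accuracy to cancellation when x0 and x1 are close, which is
one reason transformed programs for computing it are of interest.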
Incremental computing was the subject of G. Ramalingam's
Ph.D. dissertation
[R96].
Technology Transfer
The incremental algorithm for the dynamic shortest-hyperpath
problem in directed hypergraphs that my student G. Ramalingam and I
developed [RR96b]
has been used by AT&T to improve the performance of Open Shortest Path
First (OSPF), the most commonly used Internet intra-domain routing
protocol.
The incremental shortest-hyperpath algorithm is used as a subroutine
in a heuristic search algorithm that optimizes OSPF weights (for a
given set of projected demands) and thereby reduces congestion
[FT04].
The speed-up obtained by using our incremental algorithm was reported to be a factor of 15 to 20.
In 2004, Mikkel Thorup told me that AT&T was using their results to adjust
capacities in the AT&T network when they had to schedule link downtimes
for maintenance. Apparently this happens quite frequently: he indicated
that on the order of a dozen links had to be serviced each day.
More recently, the ideas used in [RR96b]
were combined with the A* search algorithm from the field of AI
planning to create several
incremental versions of A* search.
These algorithms were used to solve planning problems and generate dynamically
constrained trajectories for the full-size SUV of CMU's Tartan Racing Team,
which won the DARPA 2007 Urban Challenge race for autonomous vehicles.
[Back to the top]
Computational Differentiation and Computational Divided Differencing
Tools for computational differentiation transform a program that
computes a numerical function F(x) into a related program that
computes F'(x) (the derivative of F).
Louis Rall (of the UW Mathematics Department) and I
described how techniques similar to those used in computational-differentiation
tools can be used to implement other program transformations—in
particular, a variety of transformations for computational divided
differencing
[RR00,
RR03],
including multivariate versions of such transformations.
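A minimal sketch of the underlying idea, using operator overloading on
dual numbers in Python (illustrative only; the tools discussed above
transform whole programs and handle far more than + and *):

```python
class Dual:
    """A (value, derivative) pair propagated through arithmetic."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__
    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val * other.val,
                    self.val * other.dot + self.dot * other.val)
    __rmul__ = __mul__

def F(x):                  # any straight-line code over + and *
    return 3 * x * x + 2 * x + 1

y = F(Dual(4.0, 1.0))      # seed dx/dx = 1
print(y.val, y.dot)        # F(4) = 57.0, F'(4) = 26.0
```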
[Back to the top]
Representing Boolean and Pseudo-Boolean Functions
In summer 2018, Meghana Sistla
visited the University of Wisconsin, supported by the
S.N. Bose Scholars Program,
and we worked to resurrect an idea of mine from the late 90s that had been
rejected by the USPTO
in 2002.
The idea was for a plug-compatible replacement for BDDs for
representing Boolean and pseudo-Boolean functions, called CFLOBDDs
(for "Context-Free Language Ordered Binary Decision Diagrams").
The key insight behind CFLOBDDs is that sub-decision-diagrams can be
treated as standalone "procedures," which allows additional reuse
opportunities compared to BDDs.
In the best case, one can obtain double-exponential compression
compared to decision trees—i.e., an extra exponential factor over (quasi-)BDDs.
Even when such best-case compression is not possible, the ability to have
"procedure calls" in CFLOBDDs allows additional sharing
of structure beyond what is possible in BDDs:
a BDD can only share sub-DAGs, whereas a procedure call in a CFLOBDD can also share
the "middle of a DAG."
Meghana joined Swarat Chaudhuri's
research group at UT-Austin, and we have been continuing the work with Swarat.
Twenty-five years ago, I never found a compelling application in which CFLOBDDs beat BDDs,
but in the last few years we have found that CFLOBDDs can be quite good for
simulating quantum circuits [SCR24a].
The implementation is available at this
[link].
Our Quasimodo
tool for quantum simulation—which supports BDDs, Weighted BDDs, and CFLOBDDs—is
available on GitHub at this [link],
and described in a CAV23 tool paper
[SCR23].
We have also developed a weighted version of CFLOBDDs, called WCFLOBDDs
[SCR24b].
[Back to the top]