Past Research Accomplishments
Capsule Summaries
During my 45year career in Computer Science, the overriding goal of my
research has been to find new ways to help programmers create
correct, reliable, and secure software. To address these issues,
I developed a number of new methods to manipulate programs and analyze
their properties. Often these have taken the form of
frameworks that apply to general classes of problems, e.g.,
 creating interactive programdevelopment environments
[RT88]
 solving graphreachability problems
[R98]
 analyzing programs that manipulate linked data structures
[SRW02]
 solving interprocedural dataflowanalysis problems
[RSJM05]
 analyzing machine code
[LR08a]
 synthesizing programs
[KHDR21].
As Richard Hamming has put it, the benefit of developing a solution
in the form of a framework is that it allows the work to address
"all of next year's problems, not just the one [currently] in front of [your] face"
[H86, p. 7].
My research interests have spanned programdevelopment environments,
software engineering, dataflow analysis, and more recently,
computer security, software verification, and program synthesis, with some excursions
into other subjects, as well.
The key achievements can be summarized as follows:
 In 197879,
Tim Teitelbaum and I created the Cornell Program
Synthesizer [TR81],
the first example of an interactive programdevelopment environment
for a stronglytyped, blockstructured, imperative programming
language. It provided integrated, screenoriented facilities
to create, edit, execute, monitor, singlestep, and debug
programs. The Cornell Program Synthesizer was similar
to modern programdevelopment environments, such as
Microsoft Visual Studio
and Eclipse,
but predated them by over two decades.
[More]
 From 1980 to 1987,
Tim Teitelbaum and I worked on methods to
create programming tools similar to the Cornell Program
Synthesizer automatically, but for languages other than the PL/I
subset that the Cornell Program Synthesizer supported.
We developed an editorgenerator tool, the Synthesizer Generator
[RT88],
which automatically creates an editor for a given language
L from a description of L—including a
description of the properties that the editor should check
incrementally as Ldocuments are developed. Among other
applications, the Synthesizer Generator was used to create some
of the first screenoriented, interactive, automated proof assistants
[RA84,
G88,
GMP90].
[More]
 In 1988,
Susan Horwitz and I, together with our student David Binkley, showed how to
perform contextsensitive interprocedural slicing
[HRB88].
Our work reignited interest in program slicing in the research
community, which continues to this day. It also led to
the development of a number of commercial programslicing products.
Commercial slicers include GrammaTech's
CodeSurfer for C and C++,
which, under a license from the Wisconsin Alumni Research Foundation,
is derived from the ANSI C slicer that we developed at Wisconsin, and IBM's
Rational Transformation Workbench,
which was developed to address such coderenovation issues as
the Year 2000 Problem ("Y2K bug") and conversion of financial software
to support adoption of the Euro.
[More]
 In the early 90s,
my student G. Ramalingam and I investigated
incremental algorithms for dynamic graph problems. The goal
was to use the solution to one problem instance
G_{1} to find the solution to a nearby problem
instance G_{2}, where G_{2} is
created from G_{1} by inserting or deleting a
small number of nodes or edges. In 1992, we developed an
incremental algorithm for the dynamic shortesthyperpath
problem in directed hypergraphs [RR96b].
The algorithm has been used by AT&T in their Open Shortest Path
First (OSPF) Internet intradomain routing protocol, achieving a
15fold
to
20fold
speedup in the heuristic search performed by the protocol.
More recently, the ideas used in [RR96b]
were combined with the A^{*} search algorithm from the field of AI
planning to create several
incremental versions of A^{*} search.
These algorithms were used to solve planning problems and generate dynamically
constrained trajectories for the fullsize SUV of CMU's Tartan Racing Team,
which won the DARPA 2007 Urban Challenge race for autonomous vehicles.
[More]
 From 1993 to 1997,
Susan Horwitz, Mooly Sagiv, and I showed
that the problem of contextfreelanguage reachability
(CFLreachability) in directed graphs is a key concept in
program analysis
[R98].
(CFLreachability is the variant of reachability in which a path
from node s to node t only counts as a valid
st connection if the path's labels form a word
in a given contextfree language.) We showed that a large
class of programanalysis problems could be solved by
transforming them into instances of CFLreachability.
The idea is now a standard concept in the programminglanguages
community, and a CFLreachability solver is one of the
key analysis components found in most systems for software
model checking and bug detection.
[More]
 In 1999,
Mooly Sagiv, Reinhard Wilhelm, and I developed a
tunable framework for analyzing and establishing properties of
programs that manipulate linked data structures
[SRW02].
The work addressed the problem of analyzing programs that use
dynamic allocation and freeing of storage cells and destructive
updating of structure fields. In such programs, data
structures can grow and shrink dynamically, with no fixed upper
bound on their size or number. Our framework provides a way to
create finitesized descriptors of memory configurations that
(i) abstract away certain details, but (ii) retain enough key
information so that an analyzer can identify interesting
nodelinkage properties that hold (a.k.a. "shape" properties)
[LARSW00,
JLRS04,
LRS06,
JLRS10].
Our work galvanized the programminglanguages research
community to finally tackle the problem of pointers, aliasing,
and linked data structures in programanalysis and
programverification tools, and ignited an enormous effort on
the topic over the last ten years.
[More]
 In 2003,
Somesh Jha, Stefan Schwoon, and I developed a new
framework for interprocedural dataflow analysis
that supports a strictly richer set of queries than previous
approaches to interprocedural dataflowanalysis
(a problem that has a
35year history).
The algorithms that we developed
[RSJM05],
along with improvements developed with my students Akash Lal
and Nicholas Kidd
[LRB05,
LR06,
LKRT07,
LR08b],
have been implemented in two libraries for use in creating
programanalysis tools
[KRML04,
KLR07].
[More]
 In 2003,
my student Gogul Balakrishnan and I developed a
staticanalysis algorithm for x86 machine code
that—without relying on symboltable or debugging
information—is able to track the flow of values through memory
[BR04,
BR10].
We used it to create the first programslicing tool for machine
code that could help with understanding dependences across
memory updates and memory accesses
[BGRT05].
Later, we extended the tool to create the first automatic
programverification tool for stripped executables: it allows
one to check that a stripped executable conforms to an
APIusage rule (specified as a finitestate machine)
[BRKLLMGYCT05,
BR08].
[More]
 In 2008,
my student Junghee Lim and I created the TSL system
[LR08a,
LR12].
(TSL stands for Transformer Specification Language.)
From a single specification of the concrete
semantics of a machinecode instruction set, TSL automatically
generates correctbyconstruction implementations of the
statetransformation functions needed in
statespaceexploration tools that use static analysis, dynamic
analysis, symbolic analysis, or any combination of the three
[LR08a,
LLR09,
LLR10,
TLLBDEAR10].
The generated implementations are guaranteed to be mutually
consistent, and also to be consistent with an instructionset
emulator that is generated from the same specification of the
concrete semantics.
[More]
 From 2005 to 2009,
my students Akash Lal, Nicholas Kidd, and I developed
methods for analyzing properties of concurrent programs, via
model checking. (Our collaborators included Sagar Chaki, Ed
Clarke, Julian Dolby, Peter Lammich, Tayssir Touili, and
Mandana Vaziri.)
We investigated two different kinds of modelchecking problems,
and for each problem the technique that we developed was
about 30 times faster than the best previous technique for the
problem.
One line of research focused on
"contextbounded
analysis": a program's behavior is explored for a fixed
number of context switches (but running threads are allowed to
perform an a priori unbounded amount of work between
context switches). We obtained several fundamental results,
including a way in which any contextboundedanalysis problem can
be transformed into an equivalent sequential problem (so that
existing tools for sequential analysis can then be applied)
[LR08c,
LR09].
Another line of research focused on the problem
of detecting data races that may involve
multiple data locations. (Problems can arise when
a program is supposed to maintain an invariant over multiple
data locations, but accesses/updates are not protected
correctly by locks.) We showed that for a model of
concurrent programs that permits (i) reentrant locks,
(ii) an unbounded number of context switches, and (iii) an
unbounded number of lock acquisitions, the multilocation
dataracedetection problem is decidable
[KLTR09].
[More]
 Since 2002,
I have been interested in an interesting connection between
abstract interpretation and logic—in particular, how to harness
decision procedures to obtain algorithms for several fundamental
primitives used in abstract interpretation
[RSY04,
YRS04,
TR12a,
TR12b,
TER12,
ELSAR13,
TLLR13].
In [RSY04],
Mooly Sagiv, Greta Yorsh, and I gave the first method that makes
such a connection: we showed that, for a broad class of situations, given
a formula φ in some logic L, it is possible to find the best value a in
a given abstract domain Abs that overapproximates the
meaning of φ (i.e., [[φ]] ⊆ γ(a)).
From 2011 to 2016, my student Aditya Thakur and I pursued
several new insights on this fundamental question.
Several of the methods that we have developed are "dualuse".
That is, in addition to providing methods for building improved
abstractinterpretation tools, they also provide methods for building
improved logic solvers that use abstract interpretation to speed up
the search that a solver carries out.
[More]
 From 2010 to 2014,
Somesh Jha, Bill Harris, Rich Joiner, and I worked on policy weaving,
a declarative approach to system security.
In our approach, a policy specification is a firstclass notion—e.g.,
the policy specification for a given application is a single
document, so that all elements of the policy are gathered together in
one place, rather than scattered throughout the codebase.
A security policy is implemented in terms of a collection of
code snippets that invoke policyenforcement primitives at different
points in the program;
however, users are freed from the task of introducing such snippets
manually by a policyweaving tool, which takes as input
(i) a program P, and
(ii) a policy R (written in some suitable policyspecification language),
and rewrites P to enforce R.
[More]
 Starting in 2011, Emma Turetsky, Prathmesh Prabhu,
and I worked on a method to improve the dataflowanalysis algorithm of Esparza et al.,
which provides a way to apply Newton's Method
to programanalysis problems. They gave a new way to find fixedpoints of
systems of equations over semirings.
As in its realvalued counterpart, each iteration of their
method solves a simpler ``linearized'' problem.
However, there is an important difference between the
dataflowanalysis and numericalanalysis contexts: when Newton's
method is used on numericalanalysis problems, multiplicative
commutativity is relied on to rearrange expressions of the form
``a * X * b'' into ``X * a * b.'' Equations of the form
``X = X * a * b'' correspond to path problems described by regular
languages. In contrast, when Newton's method is used for
interprocedural dataflow analysis, the ``multiplication'' operation
involves function composition, and hence is noncommutative:
``X = a * X * b'' cannot be rearranged into ``X = X * a * b,'' and
thus the equations correspond to path problems described by linear
contextfree languages (LCFLs).
We used some algebraic slightofhand to turn a class of linear
contextfree (LCFL) path problems into regularlanguage path
problems
[RTP17,
RTP16].
This result is surprising because a reasonable sanity
check—formallanguage theory—suggests that it should
be impossible: after all, the LCFL languages are a strict superset
of the regular languages.
Zak Kincaid, Jason Breck, John Cyphert, and I have continued
this line of research, and developed a number of additional
techniques
[KBFBR17,
KCBR18,
KBCR19,
CBKR19,
BCKR20,
KRC21].
[More]
 Since 2019, I have been collaborating
with Loris D'Antoni and his students (Qinheping Hu, Jinwoo Kim, Kanghee Park, and
others) on a new framework for program synthesis, which we call
``semanticsguided synthesis'' (SemGuS).
The SemGuS framework captures the core computational problem of
program synthesis, without being tied to a specific solution,
implementation, or domain of application.
That is, it is both domainagnostic and solveragnostic.
What the SemGuS work brings to the table is the idea of
parameterizing the core synthesis algorithm with the semantics
of the primitives out of which the target program is to be
constructed.
To achieve this goal, it is necessary to
distill out the essence of many programsynthesis problems into
a specification formalism that is ground in formal methods
(e.g., automata and logic) and is agnostic to any specific
domain of application. This degree of abstraction also opens
the opportunity to lift certain synthesis algorithms and ideas
to a higher level that makes these algorithms reusable across
different tools.
[More]

Meghana Sistla,
Swarat Chaudhuri, and I
are working on data structures that can provide highly compressed representations
of Boolean and pseudoBoolean functions.
In the best case, one can obtain doubleexponential compression
compared to decision trees—i.e., an extra exponential factor over (quasi)BDDs.
We have found that CFLOBDDs can be quite good for simulating quantum circuits.
[More]
Along the way, we made several contributions to the field of
incremental computation
[RTD83,
RR94a,
RR94b,
RR96a,
RR96b,
RSL03,
RSL10,
EGR10];
introduced canonical abstraction, a new family of
abstractions for use in program analysis and verification
[SRW02];
made numerous innovations in techniques for program analysis
[RR95,
HRS95,
RT96,
MR98,
RBRSW05,
LRB05,
LR06,
BR06,
GR07,
LKRT07,
LTKR08,
LR08b,
KLR08,
LR08c,
KRDV09,
LR09,
KLTR09];
developed techniques to
convert C programs to C++ programs that
take advantage of the improved features of C++,
such as templates and classes;
worked on methods to address the Year 2000 problem;
and identified several interesting connections between programanalysis
problems and ideas from databases
[R94,
RSL03,
RSL10,
EGR10],
machine learning
[RSY04,
YRS04,
LRS05,
TER12],
and access control of shared computing resources in distributed systems
[SJRS03,
JR04,
JSWR06].
[Back to the top]
Interactive ProgramDevelopment Environments
The Cornell Program Synthesizer [TR81],
which Tim Teitelbaum
and I started working on in 1978, was the first example of an
interactive programdevelopment environment with integrated, screenoriented
facilities for a stronglytyped, blockstructured, imperative
programming language. It had a tightly
integrated combination of facilities for programmers to create, edit,
execute, monitor, singlestep, and debug their PL/C programs. In other words,
by 1979, we had created the first version of an interactive
programming environment similar to modern tools, such as
Microsoft Visual Studio
and Eclipse.
At the time that it was created (197879), the Cornell Program Synthesizer
was a radically new kind of programming tool. Along with the Mentor
system of DonzeauGouge, Kahn, Huet, and Lang
[DGKHL80],
it showed that the kind of directmanipulation programming tools that had been
pioneered in the SmallTalk
and InterLisp systems
could be created for a mainstream programming language (i.e., one that was
stronglytyped, blockstructured, and had imperative programming constructs).
The Cornell Program Synthesizer was used successfully in introductory
programming classes by about 20,000 students over a fiveyear period
at Cornell, Rutgers, Princeton, UCLA, and several other universities.
The Cornell Program Synthesizer was distributed to researchers at
about 110 universities, research institutes, and industry
laboratories.
Through lectures that were given about the system and copies that were
distributed, it is possible to trace its direct influence on several
commercial products, including DEC's
Language Sensitive Editor,
Rational's Ada programming environment,
and Symantec's
LightspeedC and LightspeedPascal programming environments.
It is more difficult to connect the dots to the similar products that
hundreds of thousands or millions of people use on a daily basis today;
however, together with the initial paper about the system
[TR81],
the Cornell Program Synthesizer was very influential: as a model for
programming tools of the future, it helped to spark widespread
interest in the question of how to create better tools for developing,
analyzing, testing, and debugging programs.
[Back to the top]
My Ph.D. dissertation [R84],
which was honored with the 1983
ACM Doctoral Dissertation Award,
investigated the problem of automating the creation of programming
tools similar to the Cornell Program Synthesizer, but for languages
other than PL/I.
The goal of the research was to create an editorgenerator tool to
enable editors for different languages to be created automatically
from a language description. In the solution I proposed, language
descriptions are specified by giving
 a contextfree grammar for the textual input language,
 a regulartree grammar for the language's underlying abstract syntax,
 sets of "attributes" (annotations) to be attached to nodes of the
language's abstractsyntax trees,
 rules that describe how information from one node's attributes
affects the attributes of other nodes, and
 prettyprinting rules that describe how abstractsyntax trees
and their attributes are displayed.
In essence, each editor can be thought of as a specialized
"spreadsheet" for documents written in the given language. After an
editing operation changes the document's underlying abstractsyntax
tree, some of the
tree's attributes may no longer be consistent with one another;
attributes must be updated to reestablish consistency (similar to the
way values are updated in a spreadsheet program). A major result of
my Ph.D. research was an optimal algorithm that I devised to solve the
attributeupdating problem
[R82,
RTD83,
R84].
As part of my Ph.D. research, I developed a prototype implementation
of an editorgenerating system that made use of the ideas described
above. I spent 198285 as a postdoc at Cornell (in residence during
198283 at INRIA, the national institute for computer science in
France) collaborating with Tim Teitelbaum on the design and
implementation of the Synthesizer Generator, a fullscale version of
such a system
[RT84,
RT87,
RT88a].
The Synthesizer Generator creates a languagespecific editor from an
input specification that defines a language's syntax, display format,
transformation rules for restructuring "documents" written in the
language, and attribution rules that codify contextsensitive
relationships between document elements. From this specification, the
Synthesizer Generator creates a specialized editor for manipulating
documents according to these rules. The Synthesizer Generator has
proven to be of great utility in a wide range of applications,
including program editors, textpreparation systems, and tools that
verify the correctness of proofs (for several different kinds of
mathematical and programming logics
[RA84,
G88,
GMP90]).
In 2010, our 1984 paper about the Synthesizer Generator
[RT84] received an
ACM SIGSOFT Retrospective Impact Paper Award.
Beginning in August 1985, the Synthesizer Generator was made available
to universities and industrial research organizations.
Since distribution began it has been licensed
to about 330 sites in 23 countries. Two books describing the system,
which Teitelbaum and I coauthored, were published by SpringerVerlag
in 1988 [RT88a,
RT88b].
Also in 1988, Teitelbaum and I founded a company
(GrammaTech, Inc.) to
commercialize the Synthesizer Generator technology.
GrammaTech subsequently created two products for the Ada programming language
based on the Synthesizer Generator,
AdaASSURED
and AdaUtilities,
and, since the mid90's, these have been used in the development of almost every major
Department of Defense system written in Ada.
[Back to the top]
Software Engineering
Starting in 1986, I carried out a great deal of work exploring how
program slicing
(an operation originally proposed in 1979 by the late Mark Weiser)
can serve as the basis for semanticsbased programmanipulation operations.
The slice of a program with respect to a set of program elements
S is
a projection of the program that includes only program elements that
might affect (either directly or transitively) the values of the
variables used at members of S. Slicing allows one to find
semantically meaningful decompositions of programs, where the
decompositions consist of elements that are not textually contiguous.
Program slicing is a fundamental operation that can aid in solving
many softwareengineering problems
[HR92].
For instance, it has applications in program understanding, maintenance, debugging,
testing,
differencing [HR91],
specialization [RT96],
reuse, and merging of program variants
[HPR89].
The projects that I carried out with my students and coworkers were aimed at
(Click here
for the home page of the Wisconsin ProgramSlicing Project.)
The conference paper on interprocedural slicing that
Susan Horwitz,
David Binkley,
and I wrote [HRB88]
was selected for inclusion in a special SIGPLAN collection of the
50 most influential papers from the SIGPLAN Conference on Programming
Language Design and Implementation from 1979 to 1999:

Horwitz, S., Reps, T., and Binkley, D.,
Interprocedural slicing using dependence graphs.
20 Years of the ACM SIGPLAN Conference on Programming Language Design
and Implementation (1979  1999): A Selection, K.S. McKinley, ed.,
ACM SIGPLAN Notices 39, 4 (April 2004), 232243.
A journal version of
[HRB88]
appeared as
[HRB90].
(Of all my papers, this is the most highlycited;
e.g., see
Google Scholar
and CiteseerX.)
A retrospective on [HRB88]
was published as

Horwitz, S., Reps, T., and Binkley, D.,
Retrospective: Interprocedural slicing using dependence graphs.
20 Years of the ACM SIGPLAN Conference on Programming Language Design
and Implementation (1979  1999): A Selection, K.S. McKinley, ed.,
ACM SIGPLAN Notices 39, 4 (April 2004), 229231.
[retrospective (PS);
retrospective (PDF)]
In 2011, our 1994 paper "Speeding up slicing"
[RHSR94] received an
ACM SIGSOFT Retrospective Impact Paper Award.
In more recent work, my students, colleagues, and I have developed even
better methods for program analysis and slicing based on a formalism called
weighted pushdown systems.
In collaboration with Genevieve Rosay and Susan Horwitz,
I built a system, called the
Wisconsin ProgramSlicing Tool,
that implemented slicing of ANSI C programs.
Although the Wisconsin ProgramSlicing Tool
was distributed from 19962000 for notforprofit
research purposes under license from the University of Wisconsin,
it is no longer being distributed.
However, there is a commercial tool available, named
CodeSurfer,
that is derived from the Wisconsin implementation and is available from
GrammaTech, Inc.,
which licensed the technology from the Wisconsin Alumni Research Foundation (a
notforprofit foundation that uses royalty income from the
technologies it licenses to support research at the University of
Wisconsin).
GrammaTech has improved on the Wisconsin implementation considerably
(both in terms of speed and space efficiency), and extended it to
support C++.
GrammaTech also offers
CodeSurfer to academic researchers
on very favorable terms.
In 2007, Tavis Ormandy of Gentoo Linux said the following about CodeSurfer:
"Our team is responsible for investigating and verifying security problems
in any of the 10,000+ open source programs we distribute. CodeSurfer has
made a dramatic impact on the time it takes us to isolate and understand
security flaws, allowing us to traverse complex and unfamiliar codebases
easily and trace problems to their source."
A group at SAIC/NASA had this to say about CodeSurfer:
"Our group analyzes many missioncritical software projects to
reduce defects and ensure that the software meets requirements. We
conducted a formal study to see if CodeSurfer could improve our
software inspections. We found that CodeSurfer reduced inspection
time by an average of 30%. In addition, when using CodeSurfer the
number of defects found more than doubled."
Slicing was also used in many commercial tools that were developed to address
such coderenovation issues as the Year 2000 Problem ("Y2K bug") and
conversion of financial software to support adoption of the Euro.
One example is IBM's
Rational Transformation Workbench.
(A list of more than 20 such tools can be found
here.)
[Back to the top]
My student Mike Siff and I investigated techniques for converting C programs
to C++ programs that take advantage of the improved features
of C++, such as templates and classes.
For instance, in
[SR96]
we considered the problem of software generalization: Given a program
component C, create a parameterized program component
C' such that C' is usable in a wider variety of
syntactic contexts than C. Furthermore, C' should
be a semantically meaningful generalization of C; namely,
there must exist an instantiation of C' that is equivalent in
functionality to C.
We gave an algorithm that generalizes C functions via type inference:
the original functions operate on specific data types; the result of
generalization is a collection of C++ function templates that operate
on parameterized types.
We also described a general technique for identifying modules in legacy
code [SR99].
The method is based on concept analysis—a branch of lattice
theory that can be used to identify similarities among a set of
objects based on their attributes. We showed how concept analysis can
identify potential modules using both "positive" and "negative"
information, and presented an algorithmic framework to construct a
lattice of concepts from a program, where each concept represents a
potential module.
The development of better techniques for software renovation
was the subject of Mike Siff's Ph.D. dissertation
[S98].
[Back to the top]
In 1996, I was asked by the Defense Advanced Research Projects Agency
(DARPA) to plan a project aimed at reducing the impact of the Year
2000 problem on the Department of Defense. DARPA was particularly
interested in whether there were "any techniques in the research
community that could be applied to the Y2K problem and have impact
beyond present commercial Y2K products and services." I spent a
semester as a kind of talent scout for DARPA to see what ideas might
be worth pursuing. The most exciting of the ideas that turned up
concerned a method for using path profiling as a heuristic to locate
some of the sites in a program where there are problematic date
manipulations [RBDL97]
(and more generally as an aid for testing and debugging).
A path profiler instruments a program so that the number of times each
different loopfree path executes is accumulated during an execution
run. With such an instrumented program, each run of the program
generates a path spectrum for the execution—a distribution of the
paths that were executed during that run. A path spectrum is a
finite, easily obtainable characterization of a program's execution on
a dataset, and provides a behavior signature for a run of the program.
Our techniques are based on the idea of comparing path spectra from
different runs of the program. When different runs produce different
spectra, the spectral differences can be used to identify paths in the
program along which control diverges in the two runs. By choosing
input datasets to hold all factors constant except one, the divergence
can be attributed to this factor. The point of divergence itself may
not be the cause of the underlying problem, but provides a starting
place for a programmer to begin his exploration.
One application of this technique is in the "Year 2000 Problem" (i.e.,
the problem of fixing computer systems that use only 2digit year
fields in datevalued data). In this context, pathspectrum
comparison provides a heuristic for identifying paths in a program
that are good candidates for being datedependent computations.
A few years later, we found out that essentially the same idea had been discovered
earlier by J. Collofello and L. Cousin [CC87],
as well as by N. Wilde
(who has done extensive research on the subject, and uses the term
"software reconnaissance",
e.g., see [W94,
WC96]).
[Back to the top]
Dataflow Analysis
Interprocedural Dataflow Analysis and ContextFreeLanguage Reachability
Work that I carried out with Susan Horwitz, Mooly Sagiv, and David
Melski established some interesting connections between the work on
interprocedural program slicing and a number of other programanalysis
problems, such as classical interprocedural dataflow analysis.
In particular, we showed that a large class of programanalysis
problems can be solved by transforming them into instances of a
special kind of
graphreachability problem,
called contextfreelanguage reachability (CFLreachability)
[RHS95,
HRS95,
R95b,
R96,
MR98,
R98,
R00].
CFLreachability is the variant of reachability in which a path
from node s to node t only counts as a valid
st connection if the path's labels form a word
in a given contextfree language.
The notion of CFLreachability was originated by
Mihalis Yannakakis,
who developed it in the context of
graphtheoretic methods in database theory.
Contextfreelanguage reachability problems can be solved in
polynomial time by algorithms similar to the ones we originally developed
for interprocedural slicing [HRB90,
RHSR94].
David Melski and I studied the relationship [MR98]
between CFLreachability and the programanalysis formalism of set constraints, which was
introduced by John Reynolds
and subsequently studied by
Alex Aiken and
Nevin Heintze.
The PowerPoint presentation prepared for the tutorial that I gave on
this material at PLDI '00 can be found
here.
In more recent work, my students, colleagues, and I have developed even
better methods for interprocedural dataflow analysis based on a formalism called
weighted pushdown systems.
In addition to the CFLreachabilitybased slicing algorithms that are
used in
CodeSurfer,
the work on program analysis via CFLreachability had direct impact on
many systems for software model checking and bug detection:
a CFLreachability solver is one of the key analysis components used in most
such systems. Among them is the wellknown
SLAM tool from
Microsoft, which checks C programs against temporal safety properties
to find defects in Windows device drivers—thereby addressing one
of Windows' main weak points.
In particular, as described in
[BR01],
one of the key algorithms used in SLAM is based on the interprocedural dataflowanalysis
algorithm given in [RHS95].
Due to the importance of the problem that it addressed,
the SLAM work even got on Bill Gates's radar screen.
For instance, in a keynote address at the Windows Hardware Engineering
Conference in 2002, Gates touted the work on SLAM as follows:
"… software verification … has been the Holy Grail of Computer
Science for many decades, but now in some very key areas, for
example, driver verification, we're building tools that can do
actual proof about the software and how it works in order to
guarantee the reliability."
(See also
Gates's address at OOPSLA 2002.)
Thus, as one of the key technologies in a tool that has helped to make
Windows much more stable in recent years, CFLreachability has had a
significant impact on the computing experience of hundreds of millions
of people worldwide.
[Back to the top]
The technique for static program analysis using 3valued logic
that Mooly Sagiv, Reinhard Wilhelm, and I developed
[SRW02,
RSW04]
addresses the problem of analyzing programs that use dynamic
allocation and freeing of storage cells and destructive updating of
structure fields—features found in all modern programming
languages, including C, C++, and Java. In such programs, data
structures can grow and shrink dynamically, with no fixed upper bound
on their size or number. In the case of threadbased languages, such
as Java, the number of threads can also grow and shrink dynamically.
We developed a way to create finitesized descriptors of
memory configurations that (i) abstract away certain details, but (ii)
retain enough key information so that the analysis can identify
interesting properties that hold.
We used these ideas to create a framework for program
analysis that is capable of expressing rich properties of memory
configurations, and confirming such behavioral properties as
 when the input to a listinsert program
is an acyclic list, the output is an acyclic list, and
 when the input to a listreversal program that uses
destructiveupdate operations is an acyclic list, the output is an
acyclic list.
The framework can be instantiated in different ways to create
different programanalysis algorithms that provide answers to
different questions, with varying degrees of efficiency and precision.
The work was originally motivated by the problem of
shape analysis
for programs written in languages that permit destructive updating
of dynamically allocated storage.
The aim of shape analysis is to discover information
(in a conservative way) about the possible "shapes"
of the dynamically allocated data structures to which a program's pointer
variables can point.
As we investigated this problem more deeply, we discovered
methods that allowed us to deal with properties of memory configurations
other than just pure shape properties.
For instance, recent work has concerned numeric properties
[GDDRS04].
The key aspect of our approach is the way in which it makes use of
2valued and 3valued logic: 2valued and 3valued
logical structures—i.e., collections of relations—are used to
represent concrete and abstract stores, respectively;
individuals represent entities such as memory cells, threads,
locks, etc.;
unary and binary relations encode the contents of
variables, pointervalued structure fields, and other aspects of
memory states; and firstorder formulas with transitive closure are used
to specify properties such as sharing, cyclicity, reachability, etc.
Formulas are also used to specify how the store is affected by the
execution of the different kinds of statements in the programming
language.
The analysis framework can be instantiated in different ways by
varying the relation symbols of the logic, and, in particular, by varying
which of the unary relations control how nodes are folded
together. The specified set of relations
determines the set of properties that will be tracked,
and consequently what properties of stores can be discovered to
hold at different points in the program by the corresponding instance
of the analysis.
As a methodology for verifying properties of programs, the advantages
of the 3valuedlogic approach are:

No loop invariants are required.

No theorem provers are involved, and thus every abstract execution
step must terminate.

The method is based on
abstract interpretation
and satisfies conditions that guarantee that the entire process
always terminates.

The method applies to programs that manipulate pointers and
heapallocated data structures.

The method eliminates the need for the user to write the usual
proofs required with abstract interpretation—i.e., to
demonstrate that the abstract descriptors that the analyzer
manipulates correctly model the actual heapallocated data
structures that the program manipulates.
A PowerPoint presentation on some of this material can be found
here.
A prototype implementation that implements this approach has been
created, called TVLA
(ThreeValuedLogic Analyzer).
In addition to
[SRW02,
RSW04],
some of the other key papers about this work include those on
interprocedural shape analysis
[JLRS04,
JLRS08],
numeric properties together with shape properties
[GDDRS04],
applications to program verification
[LRSW00,
LRS06,
LRS07,
ARRSY07],
the use of ideas from incremental computing and databases
to construct abstract transformers
[RSL03,
RSL10,
EGR10],
and the use of machine learning to identify appropriate abstractions
[LRS05].
The Ph.D. dissertations of Alexey Loginov
[L06]
and Denis Gopan
[G07]
developed improved techniques for static analysis based on 3valued logic.
[Back to the top]
The framework for interprocedural dataflow analysis that
Somesh Jha, Stefan Schwoon, and I developed, based on
weighted pushdown systems [RSJM05],
supports a strictly richer set of queries than previous approaches to
interprocedural dataflowanalysis (a problem that has a
35year
history).
In general, the approach can be applied to any distributive
dataflowanalysis problem for which the domain of transfer functions
has no infinite descending chains (including problems that cannot
be expressed as bitvector problems, such as linear constant propagation
[SRH96]
and affinerelation analysis
[MOS04]).
Safe solutions are also obtained for problems that are monotonic but
not distributive.
The benefits of the approach can be summarized as follows:
 Conventional dataflowanalysis algorithms merge together the
values for all states associated with the same program point,
regardless of the states' calling context. With the
dataflowanalysis algorithm obtained via WPDSs, dataflow
queries can be posed with respect to a regular language of
stack configurations. (Conventional merged dataflow
information can also be obtained by issuing appropriate
queries.)
 Because the algorithm for solving weighted path problems on
WPDSs can provide a set of witness paths that justify an
answer, it is possible to provide an explanation of why the
answer to a dataflow query has the value reported.
Weighted pushdown systems are also a useful formalism in a completely
separate application domain, namely, access control of shared computing resources
in distributed systems
[JR02,
SJRS03,
JR04,
JSWR06,
WJRSS06].
The algorithms described in
[RSJ03,
RSJM05], along with
the improvements described in
[LRB05,
LR06,
LKRT07,
LR08b],
have been implemented in two libraries for use in creating
programanalysis tools
[KRML04,
KLR07].
These libraries have been used to create prototype implementations of
contextsensitive interprocedural dataflow analyses for uninitialized
variables, live variables, linear constant propagation, and the
detection of affine relationships, as well as many other problems.
[Back to the top]
In our work on static analysis of stripped executables, we devised
methods that—without relying on symboltable or debugging
information—recover intermediate representations that are
similar to those that the intermediate phases of a compiler create for a
program written in a highlevel language.
The goal is to provide a platform that an analyst can
use to understand the workings of COTS components, plugins, mobile
code, and DLLs, as well as memory snapshots of worms and
virusinfected code.
Starting from a stripped executable, our algorithms identify variables, data
objects, and types, and track how they are manipulated by the
executable. This information provides answers to such questions as
"what could this dereference operation access?", "what function
could be called at this indirect call site?", etc.
Our initial work concerned x86 executables.
In 2003, my student Gogul Balakrishnan and I developed a
staticanalysis algorithm for x86 machine code that is able to track
the flow of values through memory
[BR04].
We incorporated this algorithm, along with algorithms to identify
variables, data objects, and types, in CodeSurfer/x86
[BGRT05]—the
first programslicing tool for machine code that can help
with understanding dependences across memory updates and memory
accesses.
Later, we extended CodeSurfer/x86 to create the first automatic
programverification tool for stripped executables: it allows one to
check that a stripped executable conforms to an APIusage rule
(specified as a finitestate machine)
[BRKLLMGYCT05, BR08].
More recently, our work on the TSL system
has allowed us to create similar systems for other instruction sets
automatically, starting from a specification of the concrete
semantics of a machinecode instruction set.
Static analysis of machine code was the subject of Gogul
Balakrishnan's Ph.D. dissertation
[B07].
See also our ACM TOPLAS paper about the work
[BR10].
(The 2004 paper about the work [BR04]
received the ETAPS BestPaper Award for 2004 from the
European Association for Programming Languages and Systems (EAPLS).)
[Back to the top]
In 2008, my student Junghee Lim and I created the TSL system
[LR08a,
LR12],
a metatool to help in the creation of tools for analyzing machine
code. (TSL stands for Transformer Specification
Language.)
From a single specification of the concrete
semantics of a machinecode instruction set, TSL automatically
generates correctbyconstruction implementations of the
statetransformation functions needed in
statespaceexploration tools that use static analysis, dynamic
analysis, symbolic analysis, or any combination of the three
[LR08a,
LLR09a,
LLR10,
TLLBDEAR10].
The TSL language is a strongly typed, firstorder functional language
with a datatypedefinition mechanism for defining recursive datatypes,
plus deconstruction by means of pattern matching. Writing a TSL
specification for an instruction set is similar to writing an
interpreter in firstorder ML. For instance, the specification of an
instruction set's concrete semantics is written as a TSL function
state interpInstr(instruction I, state S) {...};
where instruction and state are userdefined
datatypes that represent the instructions and the semantic states,
respectively.
TSL's metalanguage provides a fixed set of basetypes; a fixed set of
arithmetic, bitwise, relational, and logical operators; and a facility
for defining maptypes. The meanings of the metalanguage constructs can be
redefined by supplying alternative interpretations of them.
When semantic reinterpretation is performed in this
way—namely, on the operations of the metalanguage—it
is independent of any given instruction set. Consequently,
once a reinterpretation has been defined that reinterprets
TSL in a manner appropriate for some statespaceexploration
method, the same reinterpretation can be applied to
each instruction set whose semantics has been specified
in TSL.
The reinterpretation mechanism causes TSL to act as a
"Yacclike" tool for automatically generating correctbyconstruction
implementations of the statetransformation functions needed in
statespaceexploration tools.
Moreover, because all components are generated from
a single specification of the concrete
semantics of an instruction set,
the generated implementations are guaranteed to be mutually
consistent (and also to be consistent with an instructionset emulator
that is generated from the same specification of the concrete
semantics).
TSL was the subject of Junghee Lim's Ph.D. dissertation
[L11].
(The 2008 paper about TSL [LR08a]
received the ETAPS BestPaper Award for 2008 from the
European Association for Programming Languages and Systems (EAPLS).)
[Back to the top]
Since 2002, I have been interested in symbolic methods for abstract
interpretation. The work aims to bridge the gap between (i) the use of logic
for specifying program semantics and performing program analysis, and
(ii) abstract interpretation
[RSY04,
YRS04,
TR12a,
TR12b,
TER12,
ELSAR13,
TLLR13].
In [RSY04],
Mooly Sagiv, Greta Yorsh, and I gave the first method that makes
such a connection: we showed that, for a broad class of situations, given
a formula φ in some logic L, it is possible to find the best value a in
a given abstract domain Abs that overapproximates the
meaning of φ (i.e., [[φ]] ⊆ γ(a)).
We call this problem symbolic abstraction.
If you think of logic L as being a rich language for
expressing meanings, and abstract domain Abs as
corresponding to a logic fragment L' that is less
expressive than L, then symbolic abstraction addresses a
fundamental approximation problem:
Given φ ∈ L, find the strongest
consequence of φ that is expressible in logic fragment
L'.
From 2011 to 2016, my student Aditya Thakur and I pursued the
insight that generalized versions of an old, and not widely used, method
for validity checking of propositionallogic formulas, called
Stålmarck's method, hold great promise for symbolic abstraction.
The symbolic methods for abstract interpretation that we have
subsequently developed based on this insight
[TR12a,
TR12b,
TER12,
TLLR13]
offer much promise for building more powerful programanalysis tools.
These methods (i) allow more precise implementations
of abstractinterpretation primitives to be created, and (ii) drastically
reduce the time taken to implement such primitives because correctness
is ensured by construction.
In particular, they provide a way to create correctbyconstruction abstract
interpreters that also attain the fundamental limits on precision that
abstractinterpretation theory establishes.
For instance, in [TLLR13],
we described a systematic method for implementing an abstract interpreter
that solves the following problem:
Given program P, and an abstract domain A,
find the mostprecise inductive Ainvariant for P.
Moreover, our generalizedStålmarck methods are "dualuse":
In addition to providing methods for building improved
abstractinterpretation tools, they also provide methods for building
improved logic solvers that use abstract interpretation to speed up
the search that a solver carries out.
In particular, we showed how Stålmarck's method is an instantiation of
a generic framework parameterized by an abstract domain over Booleans;
consequently, different instantiations of the framework lead to new
decision procedures for propositional logic
[TR12b].
Moreover, this abstractionbased view allows us to lift
Stålmarck's method from propositional logic to richer logics:
to obtain a method for richer logics, instantiate the parameterized
version of Stålmarck's method with richer abstract domains
[TR12a].
We call such a decisionprocedure design, which is parameterized by an abstract
domain, a Satisfiability Modulo Abstraction (SMA) solver
[RT14].
Recently, we created an SMA solver for separation logic
[TBR14].
Separation logic (SL) is an expressive logic for reasoning about
heap structures in programs, and provides a mechanism for concisely describing program
states by explicitly localizing facts that hold in separate regions of
the heap.
SL is undecidable in general, but by using an abstract domain of
shapes [SRW02]
we were able to design a semidecision procedure for SL.
The design of algorithms for symbolic abstraction and their
application was the topic of Aditya Thakur's Ph.D. dissertation
[T14].
[Back to the top]
Recently, Esparza et al. generalized Newton's method—the classical
numericalanalysis algorithm for finding roots of realvalued
functions—to a method for finding fixedpoints of systems of
equations over semirings. Their method provides a new way to solve
interprocedural dataflowanalysis problems. As in its realvalued
counterpart, each iteration of their method solves a simpler
``linearized'' problem.
However, there is an important difference between the
dataflowanalysis and numericalanalysis contexts: when Newton's
method is used on numericalanalysis problems, multiplicative
commutativity is relied on to rearrange expressions of the form
``a * X * b'' into ``X * a * b.'' Equations of the form
``X = X * a * b'' correspond to path problems described by regular
languages. In contrast, when Newton's method is used for
interprocedural dataflow analysis, the ``multiplication'' operation
involves function composition, and hence is noncommutative:
``X = a * X * b'' cannot be rearranged into ``X = X * a * b,'' and
thus the equations correspond to path problems described by linear
contextfree languages (LCFLs).
What we contributed in
[RTP17,
RTP16]
is a way to convert each LCFL problem into a
regularlanguage problem. Given that the language {a^{i}b^{i}} is the
canonical example of an LCFL that is not regular, this transformation
sounds impossible. The secret is that we require the
semiring to support a few additional operations (what we call
``transpose,'' ``tensor product,'' and ``detensor'').
Moreover, one has such operations for the
socalled ``predicateabstraction problems'' (an important family of
dataflowanalysis problems used in essentially all modernday software
model checkers). In predicateabstraction problems,
 a semiring value is a square Boolean matrix
 the product operation is Boolean matrix multiplication
 the sum operation is pointwise "or"
 transpose (denoted by ^{t}) is matrix transpose, and
 tensor product is Kronecker product of square Boolean matrices
The key step is to take each equation righthand side of the form
``X = a1 * X * b1 + a2 * X * b2'' and turn it into ``Z = Z * (a1^{t} tensor b1) + Z * (a2^{t} tensor b2).'' The
reason this works is that linear paths in the Z system mimic
derivation trees in the X system, e.g.,
(a1^{t} tensor b1) * (a2^{t} tensor b2) = (a1^{t} * a2^{t}) tensor (b1 * b2)
= (a2 * a1)^{t} tensor (b1 * b2)
^ ^ ^ ^
   
 ++ 
++
which produces the kind of mirrored symmetry that we need to properly
track the values that arise in the X system, which have such symmetric
correlations. (More precisely, the detensor operation performs
detensor(a^{t} tensor b) = a * b,
so that
detensor((a2 * a1)^{t} tensor (b1 * b2)) = a2 * a1 * b1 * b2,
which has the mirrored symmetry found in the values of derivation
trees in the X system.)
The Z system's paths encode all and only the derivation trees of the X
system. To use this idea to solve an LCFL path problem precisely,
there is one further requirement: the ``detensor'' operation must
distribute over the tensoredsum operation. This property causes the
detensor of the sumoverallZpaths to equal the desired
sumoverallXtreevaluations.
It turns out that such a distributive detensor operation exists for
Kronecker products of Boolean matrices, and thus all the pieces fit
together for the predicateabstraction problems.
In addition to the use in Newtonian Program Analysis, we know of two
other applications of the rearrangement trick:
It was originally used in
LTKR08
in a method for contextbounded analysis of concurrent programs;
although we used a different method in our work on abstract error projection
LKRT07,
the rearrangement trick could have been used, and would have improved the result.
Zak Kincaid, Jason Breck, John Cyphert, and I have continued this line of research, and
developed a number of additional techniques that have been incorporated in a system
called ICRA (Interprocedural Compositional Recurrence Analysis),
which finds invariants in numeric programs
[KBFBR17,
KCBR18,
KBCR19,
CBKR19,
BCKR20,
KRC21].
That work introduces versions of ``transpose,'' ``tensor product,'' and ``detensor''
for use with logical formulas.
[Back to the top]
Program Synthesis
Program synthesis refers to the task of finding a program within a
given search space that meets a given behavioral specification
(typically a logical formula or a set of inputoutput examples).
The general problem is as follows:
Given a specification S that relates input values and output
values, create a program P that satisfies S—i.e., for all
inputs x, the input/output pair (x, P(x)) satisfies S.
The benefit of creating software via a synthesis tool is that when
the tool produces a program that satisfies S, the program is
correct by construction.
In our work, we are developing a new synthesis framework, called ``semanticsguided synthesis''
(SemGuS) that attempts to be both languageagnostic and solveragnostic.
The goal of SemGuS is to provide a general, logical framework that expresses
the core computational problem of program synthesis, without being tied to
a specific solution or implementation.
What the SemGuS work brings to the table is the idea of
parameterizing the core synthesis algorithm with the semantics
of the primitives out of which the target program is to be
constructed—hence the name ``semanticsguided synthesis.''
In addition to a syntactic search space and a behavioral
specification, SemGuS also allows the user to define the semantics of
constructs in the grammar in terms of a set of inference rules.
The SemGuS framework is described in
[KHDR21,
DHKR21],
and the SemGuS page is here: semgus.org.
SemGuS is another nice illustration of the benefit of developing a
solution in the form of a framework: quoting Richard Hamming, it
allows the work to address "all of next year's problems, not just the
one [currently] in front of [your] face"
[H86, p. 7].
Along the way, we obtained new results on how to show that a programsynthesis
problem is ``unrealizable'' (i.e., the problem has no solution).
While many program synthesizers are good at finding a solution when one exists,
they typically fail, or do not terminate, when the given synthesis problem
is unrealizable.
The lack of knowledge about this problem was an important lacuna in the field:
the key to an efficient synthesis algorithm is pruning of the search space;
in particular, when synthesis subproblems are unrealizable, one wants
that part of the search space to be pruned so that the synthesizer will never again
look for a solution there—because none is to be found.
Our work has started to fill this gap by introducing multiple approaches
for proving that synthesis problems are unrealizable
[HBCDR19,
HCDR21b].
We also developed techniques for synthesizing programs that meet a
userspecified asymptoticcomplexity bound
[HCDR21a].
[Back to the top]
Model Checking Concurrent Programs
In contextbounded analysis, a program's behavior is explored for a
fixed number of context switches; however, running threads are allowed to
perform an a priori unbounded amount of work between context
switches. Contextbounded analysis was introduced by Shaz Qadeer and
Jakob Rehof; however, their algorithm had two drawbacks: (i) it was limited
to Boolean abstract domains (i.e., predicateabstraction domains), and
(ii) it performed certain enumerations of reachable configurations
during the process of statespace exploration. In 2008, Akash Lal,
Tayssir Touili, Nicholas Kidd, and I showed how symbolic methods
could be employed in contextbounded analysis. In particular,
we gave transducerbased techniques that permit the
explicitenumeration step of the QadeerRehof algorithm to be avoided
[LTKR08].
Akash Lal and I showed how any contextboundedanalysis problem can be
tranformed into an equivalent sequential problem; after applying the
transformation, any existing tool for sequential analysis can then be
applied [LR08c,
LR09].
When we compared the performance of the latter analysis
technique against a stateoftheart tool for verifying
concurrent device drivers (Daniel Kroening's DDVerify,
using SMV as its analysis engine), we found that the
median improvement obtained was a 30fold
speedup! (See Figure 7.9 on page 205 of
[L09].)
Contextbounded analysis was the subject of Akash Lal's
Ph.D. dissertation
[L09].
For this work, Akash was a corecipient of the 2009 ACM SIGPLAN
John C. Reynolds Doctoral Dissertation Award.
Akash was also named to MIT Technology Review's
2011 India TR35 list.
In 2023,
Akash and I, along with M. Musuvathi, S. Qadeer, and J. Rehof,
received the 2023 CAV Award
for the ``introduction of contextbounded analysis and its application to systematic testing of concurrent programs.''
[Back to the top]
Our work on detecting atomicity violations focused on the problem
of detecting data races that may involve multiple data locations.
Problems can arise when a program is supposed to maintain an invariant
over multiple data locations, but accesses/updates are not protected
correctly by locks.
The tool that we built analyzes Java programs.
Originally, the analysis engine in the tool employed a semidecision
procedure; however, after introducing two problem transformations
that were motivated by the desire to reduce the imprecision caused by the
abstractions that the tool applies
[KLR08,
KRDV09],
we discovered that for the model of concurrent programs
we were working with—which permits
(i) reentrant locks,
(ii) an unbounded number of context switches, and (iii) an
unbounded number of lock acquisitions—the problem is decidable
[KLTR09].
(The couplings between different processes are sufficiently weak
that the undecidability of checking emptiness of the intersection
of two contextfree languages does not come into play.)
When we compared the performance of the decision procedure against
the semidecision procedure, we found that the decision procedure
was 34 times faster overall (i.e., comparing the sum of the
running times on all queries, with a 300second timeout).
Moreover, with the 300second timeout threshold, for each query
where the semidecision procedure timed out (roughly 49% of
the queries), the decison procedure succeeded within the
allotted time.
Detection of atomicity violations was the subject of
Nicholas Kidd's Ph.D. dissertation
[K09].
[Back to the top]
Policy Weaving
System security is a crosscutting issue.
Unfortunately, it is hard to devise a formalism that allows
systemwide policies to be both (i) easily understood and (ii) easily
implemented and maintained. In particular, when the elements of
security policies are expressed directly in the code—e.g., via
calls on security primitives, as in Decentralized Information Flow
Control (DIFC) systems, such as Asbestos, HiStar, and Flume, or in
capabilitybased systems, such as Capsicum—the
policy implementation is spread throughout the codebase of an
application. With such a structure, it is difficult to understand what
overall policy the application actually implements, which complicates
documenting the policy, debugging the policy, and updating the policy
as the codebase evolves in response to changes in requirements and the
threat landscape.
In contrast, the focus of our work is on a declarative approach to system security.
In our approach, a policy specification is a firstclass notion—e.g.,
the policy specification for a given application is a single
document, so that all elements of the policy are gathered together in
one place, rather than scattered throughout the codebase.
A security policy is still implemented in terms of a collection of
code snippets to invoke policyenforcement primitives at different
points in the program.
However, users are freed from the task of introducing such snippets
manually.
The goal of our work is to create tools to support policy
weaving, which will handle the job of rewriting the program to place
code at appropriate locations so that suitable security primitives are
invoked at appropriate moments to enforce the policy.
A policyweaving tool takes as input
(i) a program P, and
(ii) a policy R (written in some suitable policyspecification language),
and rewrites P to enforce R.
A policyweaving tool supports a declarative approach to security.
The policy R can be stated as a single, easilyunderstood
document, and it is the obligation of the policyweaving tool to find
a way to place code for invoking enforcement primitives so that all
executions of P satisfy R.
We also work on metatools to support policy weaving.
In particular, one of our goals is to create a policyweaver generator.
Such a tool takes as input specifications of
 the semantics of a programming language L
 the semantics of the policyenforcement primitives to be used
and generates a policy weaver.
Our vision is similar to aspectoriented programming (in that a
crosscutting issue is specified separately and
then automatically woven into the fabric of the
implementation).
However, the approach that we envision goes beyond
aspectoriented programming in an important way: whereas conventional
aspectoriented programming systems use syntacticallyoriented
mechanisms and allow aspect code to be placed in a rather limited set
of locations, our approach is based on programminglanguage and
policy semantics to identify automatically
the program locations where policyenforcement primitives need to be placed.
Our approach also goes beyond simple finitestate reference monitors:
policies have visibility into such entities as
processes, principals, and communication channels.
A key challenge is to design a policy weaver that can reason about the
potentiallyunbounded set of states that a program may reach, and to
choose system primitives and parameters from a potentially unbounded
number of options to keep the program from reaching states that constitute a
violation of the policy provided by the programmer.
Our approach to designing a weaver generator is based on the theory of
twoplayer safety games.
Such a game is played by two players, called the Attacker and the
Defender, on a graph in which each node is labeled as an Attacker
node or a Defender node.
The state of the game is a node of the graph.
The goal of the Attacker is to reach a state labeled as an
accepting node;
the goal of the Defender is to thwart the Attacker by preventing
the Attacker from ever reaching an accepting node.
The game is played in turns:
on each turn, if the state is an Attacker state, then the Attacker
can change the state to any adjacent node of the graph, and
analogously for Defender states and Defenderinitiated state changes.
A winning Defender strategy for a game is a
decision tree of choices that the Defender can make so that an
accepting state is never reached.
The core of a policy weaver created by our weaver generator is an
algorithm to find a winning Defender strategy
[HJR12].
A policy weaver generated by the policyweaver generator works as follows:
given program P and policy R, the policy weaver constructs a game in
which
 the Attacker represents the threat model (i.e., all possible system attacks),
 the Defender represents possible instrumentations of the program, and
 accepting nodes represent system states reached when the instrumented program
violates the policy.
If the weaver finds a winning Defender strategy for the game, it uses
the strategy to instrument the program to satisfy the policy.
We have also been working on applying policy weaving to JavaScript
programs.
JavaScript is the most popular language used to create dynamic content
for web browsers.
Moreover, the nature of the Web has resulted in an ecosystem in which
JavaScript programs and libraries, often written by untrusted third
parties, are regularly incorporated into applications and combined in
every way imaginable.
While this practice has obvious benefits in terms of convenience and
not ``reinventing the wheel,'' it creates glaring security holes because
of a complete lack of isolation between modules in a JavaScript
environment.
We chose this context to apply our policyweaving techniques because
of the obviously unfilled need for a security mechanism, and because
of the challenge presented by JavaScript's highly dynamic, untyped
nature.
We have created the the JAM policy weaver [FJJRPSY12],
coupled with the JAMScript enforcement mechanism.
(``JAM'' stands for JAvascipt Model checker.)
JAM can be used by a website administrator to apply a security
policy—in the form of a finitestate automaton—that
specifies a set of disallowed execution traces of a JavaScript
program.
The program is analyzed statically with respect to the policy, and an
instrumented program that is guaranteed to not violate the policy
specification is produced.
This analysis consists of a counterexampleguided,
abstractionrefinement loop that repeatedly creates an incrementally
more precise abstract model of the program.
Because this model represents a superset of the program's actual
behaviors, we can say that any execution traces that are not
realizable in the model also cannot happen in the concrete program.
Static analysis can be expensive, especially in the context of a
highly dynamic language such as JavaScript.
Thus, the algorithm used in JAM provides for a systemresource cutoff threshold,
which terminates the analysis when the threshold is reached.
This case still results in a safely instrumented program, although the
program may contain some amount of spurious instrumentation, resulting
in extra runtime overhead.
In this way, JAM can be tuned to balance the tradeoff between
staticanalysis overhead and dynamic overhead.
The instrumentation that JAM weaves into a JavaScript program comes in
the form of transactions.
Potentially malicious statements in the untrusted code (i.e., those
that could not be proven safe by the static analysis) are wrapped in
transaction blocks, which are speculatively executed in a manner that
records all actions (reads, writes, and function calls) and suppresses
all effects of the actions.
This approach allows the effects of the statements to be evaluated
with respect to the security policy, at runtime inside the transaction.
The program is terminated—without the effects of the actions
being visible externally—if it is determined that any prefix
of the sequence of actions will violate the policy.
If no policyviolating action sequence is observed, the transaction is
committed, and execution continues as before.
[Back to the top]
Incremental Computing
The attributeupdating algorithm that I developed as part of my
Ph.D. research
[R82,
RTD83,
R84]
is an example of an "incremental updating algorithm"—an
algorithm that makes use of the solution to one problem instance
I_{1} to find the solution to a nearby problem instance
I_{2}, where I_{2} is created by making
a small change to problem instance I_{1}.
Over the years, I have encountered such problems in a variety of
contexts, including
interactive programdevelopment environments
[R82,
RTD83,
R84,
RMT86,
R88],
dynamic graph problems
[RR94a,
RR94b,
RR96a,
RR96b],
the construction of abstract transformers for programanalysis
problems
[RSL03,
RSL10,
EGR10],
and propagation strategies in dataflowanalysis solvers
[RSJM05].
My interest in incremental computing also led me to look at ways of
transforming a program that computes a function
F(x) on floatingpoint numbers into programs that
compute good approximations to the derivative F'(x), the
first divided difference
F[x_{0},x_{1}], Taylorseries
expansions of F(x), and higherorder divided differences of
F(x), as well as multivariate versions of such transformations
[RR00,
RR03]
(see computational differentiation and computational divided differencing).
Incremental computing was the subject of G. Ramalingam's
Ph.D. dissertation
[R96].
The incremental algorithm for the dynamic shortesthyperpath
problem in directed hypergraphs that my student G. Ramalingam and I
developed [RR96b]
has been used by AT&T to improve the performance of Open Shortest Path
First (OSPF), the most commonly used Internet intradomain routing
protocol.
The incremental shortesthyperpath algorithm is used as a subroutine
in a heuristic search algorithm for optimizing OSPF weights (for a
given set of projected demands) and thereby reduce congestion
[FT04].
The speedup obtained by using our incremental algorithm was reported to be a factor of
15
to
20.
In 2004, Mikkel Thorup told me that AT&T was using their results to adjust
capacities in the AT&T network when they had to schedule link downtimes
for maintenance. Apparently this happens quite frequently: he indicated
that on the order of a dozen links had to be serviced each day.
More recently, the ideas used in [RR96b]
were combined with the A^{*} search algorithm from the field of AI
planning to create several
incremental versions of A^{*} search.
These algorithms were used to solve planning problems and generate dynamically
constrained trajectories for the fullsize SUV of CMU's Tartan Racing Team,
which won the DARPA 2007 Urban Challenge race for autonomous vehicles.
[Back to the top]
Computational Differentiation and Computational Divided Differencing
Tools for computational differentiation transform a program that
computes a numerical function F(x) into a related program that
computes F'(x) (the derivative of F).
Louis Rall (of the UW Mathematics Department) and I
described how techniques similar to those used in computationaldifferentiation
tools can be used to implement other program transformations—in
particular, a variety of transformations for computational divided
differencing
[RR00,
RR03],
including multivariate versions of such transformations.
[Back to the top]
Representing Boolean and PseudoBoolean Functions
In summer 2018, Meghana Sistla
visited the University of Wisconsin, supported by the
S.N. Bose Scholars Program,
and we worked to resurrect an idea of mine from the late 90s that had been
rejected by the USPTO
in 2002.
The idea was for a plugcompatible replacement for BDDs for
representing Boolean and pseudoBoolean functions, called CFLOBDDs
(for ``ContextFree Language Ordered Binary Decision Diagrams'').
The key insight behind CFLOBDDs is that subdecisiondiagrams can be
treated as standalone ``procedures,'' which allows additional reuse
opportunities compared to BDDs.
In the best case, one can obtain doubleexponential compression
compared to decision trees—i.e., an extra exponential factor over (quasi)BDDs.
Meghana joined Swarat Chaudhuri's
research group at UTAustin, and we have been continuing the work with Swarat.
25 years ago, I never found a compelling application where CFLOBDDs beat BDDs,
but in the last few years we have found that CFLOBDDs can be quite good for
simulating quantum circuits [SCR24].
The implementation is available at this
[link].
Our Quasimodo
tool for quantum simulation, which supports BDDs, Weighted BDDs, and CFLOBDDs,
is available on GitHub at this link
[link],
and described in a CAV23 tool paper
[SCR23a].
We have also developed a weighted version of CFLOBDDs, called WCFLOBDDs
[SCR23b].
[Back to the top]
