Thomas W. Reps -- Past Research Accomplishments

Thomas W. Reps

Honorary Fellow
J. Barkley Rosser Professor & Rajiv and Ritu Batra Chair Emeritus

Computer Sciences Department
University of Wisconsin-Madison
1210 West Dayton Street
Madison, WI 53706-1685
USA


Past Research Accomplishments

Capsule Summaries

During my 46-year career in Computer Science, the overriding goal of my research has been to find new ways to help programmers create correct, reliable, and secure software. To address these issues, I developed a number of new methods to manipulate programs and analyze their properties. Often these methods have taken the form of frameworks that apply to general classes of problems, e.g.,

  • creating interactive program-development environments [RT88]
  • solving graph-reachability problems [R98]
  • analyzing programs that manipulate linked data structures [SRW02]
  • solving interprocedural dataflow-analysis problems [RSJM05]
  • analyzing machine code [LR08a]
  • synthesizing programs [KHDR21].
As Richard Hamming put it, the benefit of developing a solution in the form of a framework is that it allows the work to address "all of next year's problems, not just the one [currently] in front of [your] face" [H86, p. 7].

My research interests have spanned program-development environments, software engineering, dataflow analysis, and more recently, computer security, software verification, and program synthesis, with some excursions into other subjects, as well. The key achievements can be summarized as follows:

  • In 1978-79, Tim Teitelbaum and I created the Cornell Program Synthesizer [TR81], the first example of an interactive program-development environment for a strongly-typed, block-structured, imperative programming language. It provided integrated, screen-oriented facilities to create, edit, execute, monitor, single-step, and debug programs. The Cornell Program Synthesizer was similar to modern program-development environments, such as Microsoft Visual Studio and Eclipse, but pre-dated them by over two decades. [More]

  • From 1980 to 1987, Tim Teitelbaum and I worked on methods to create programming tools similar to the Cornell Program Synthesizer automatically, but for languages other than the PL/I subset that the Cornell Program Synthesizer supported. We developed an editor-generator tool, the Synthesizer Generator [RT88], which automatically creates an editor for a given language L from a description of L—including a description of the properties that the editor should check incrementally as L-documents are developed. Among other applications, the Synthesizer Generator was used to create some of the first screen-oriented, interactive, automated proof assistants [RA84, G88, GMP90]. [More]

  • In 1988, Susan Horwitz and I, together with our student David Binkley, showed how to perform context-sensitive interprocedural slicing [HRB88]. Our work reignited interest in program slicing in the research community, which continues to this day. It also led to the development of a number of commercial program-slicing products. Commercial slicers include GrammaTech's CodeSurfer for C and C++, which, under a license from the Wisconsin Alumni Research Foundation, is derived from the ANSI C slicer that we developed at Wisconsin, and IBM's Rational Transformation Workbench, which was developed to address such code-renovation issues as the Year 2000 Problem ("Y2K bug") and conversion of financial software to support adoption of the Euro. [More]

  • In the early 90s, my student G. Ramalingam and I investigated incremental algorithms for dynamic graph problems. The goal was to use the solution to one problem instance G1 to find the solution to a nearby problem instance G2, where G2 is created from G1 by inserting or deleting a small number of nodes or edges. In 1992, we developed an incremental algorithm for the dynamic shortest-hyperpath problem in directed hypergraphs [RR96b]. The algorithm has been used by AT&T in their Open Shortest Path First (OSPF) Internet intra-domain routing protocol, achieving a 15-fold to 20-fold speed-up in the heuristic search performed by the protocol.

    More recently, the ideas used in [RR96b] were combined with the A* search algorithm from the field of AI planning to create several incremental versions of A* search. These algorithms were used to solve planning problems and generate dynamically constrained trajectories for the full-size SUV of CMU's Tartan Racing Team, which won the DARPA 2007 Urban Challenge race for autonomous vehicles. [More]

  • From 1993 to 1997, Susan Horwitz, Mooly Sagiv, and I showed that the problem of context-free-language reachability (CFL-reachability) in directed graphs is a key concept in program analysis [R98]. (CFL-reachability is the variant of reachability in which a path from node s to node t only counts as a valid s-t connection if the path's labels form a word in a given context-free language.) We showed that a large class of program-analysis problems could be solved by transforming them into instances of CFL-reachability. The idea is now a standard concept in the programming-languages community, and a CFL-reachability solver is one of the key analysis components found in most systems for software model checking and bug detection. [More]

  • In 1999, Mooly Sagiv, Reinhard Wilhelm, and I developed a tunable framework for analyzing and establishing properties of programs that manipulate linked data structures [SRW02]. The work addressed the problem of analyzing programs that use dynamic allocation and freeing of storage cells and destructive updating of structure fields. In such programs, data structures can grow and shrink dynamically, with no fixed upper bound on their size or number. Our framework provides a way to create finite-sized descriptors of memory configurations that (i) abstract away certain details, but (ii) retain enough key information so that an analyzer can identify interesting node-linkage properties that hold (a.k.a. "shape" properties) [LARSW00, JLRS04, LRS06, JLRS10].

    Our work galvanized the programming-languages research community to finally tackle the problem of pointers, aliasing, and linked data structures in program-analysis and program-verification tools, and ignited an enormous effort on the topic over the last ten years. [More]

  • In 2003, Somesh Jha, Stefan Schwoon, and I developed a new framework for interprocedural dataflow analysis that supports a strictly richer set of queries than previous approaches to interprocedural dataflow-analysis (a problem that has a 35-year history). The algorithms that we developed [RSJM05], along with improvements developed with my students Akash Lal and Nicholas Kidd [LRB05, LR06, LKRT07, LR08b], have been implemented in two libraries for use in creating program-analysis tools [KRML04, KLR07]. [More]

  • In 2003, my student Gogul Balakrishnan and I developed a static-analysis algorithm for x86 machine code that—without relying on symbol-table or debugging information—is able to track the flow of values through memory [BR04, BR10]. We used it to create the first program-slicing tool for machine code that could help with understanding dependences across memory updates and memory accesses [BGRT05]. Later, we extended the tool to create the first automatic program-verification tool for stripped executables: it allows one to check that a stripped executable conforms to an API-usage rule (specified as a finite-state machine) [BRKLLMGYCT05, BR08]. [More]

  • In 2008, my student Junghee Lim and I created the TSL system [LR08a, LR12]. (TSL stands for Transformer Specification Language.) From a single specification of the concrete semantics of a machine-code instruction set, TSL automatically generates correct-by-construction implementations of the state-transformation functions needed in state-space-exploration tools that use static analysis, dynamic analysis, symbolic analysis, or any combination of the three [LR08a, LLR09, LLR10, TLLBDEAR10]. The generated implementations are guaranteed to be mutually consistent, and also to be consistent with an instruction-set emulator that is generated from the same specification of the concrete semantics. [More]

  • From 2005 to 2009, my students Akash Lal, Nicholas Kidd, and I developed methods for analyzing properties of concurrent programs, via model checking. (Our collaborators included Sagar Chaki, Ed Clarke, Julian Dolby, Peter Lammich, Tayssir Touili, and Mandana Vaziri.) We investigated two different kinds of model-checking problems, and for each problem the technique that we developed was about 30 times faster than the best previous technique for the problem.

    One line of research focused on "context-bounded analysis": a program's behavior is explored for a fixed number of context switches (but running threads are allowed to perform an a priori unbounded amount of work between context switches). We obtained several fundamental results, including a way in which any context-bounded-analysis problem can be transformed into an equivalent sequential problem (so that existing tools for sequential analysis can then be applied) [LR08c, LR09]. Another line of research focused on the problem of detecting data races that may involve multiple data locations. (Problems can arise when a program is supposed to maintain an invariant over multiple data locations, but accesses/updates are not protected correctly by locks.) We showed that for a model of concurrent programs that permits (i) reentrant locks, (ii) an unbounded number of context switches, and (iii) an unbounded number of lock acquisitions, the multi-location data-race-detection problem is decidable [KLTR09]. [More]

  • Since 2002, I have been intrigued by a connection between abstract interpretation and logic—in particular, how to harness decision procedures to obtain algorithms for several fundamental primitives used in abstract interpretation [RSY04, YRS04, TR12a, TR12b, TER12, ELSAR13, TLLR13]. In [RSY04], Mooly Sagiv, Greta Yorsh, and I gave the first method that makes such a connection: we showed that, for a broad class of situations, given a formula φ in some logic L, it is possible to find the best value a in a given abstract domain Abs that over-approximates the meaning of φ (i.e., [[φ]] ⊆ γ(a)).

    From 2011 to 2016, my student Aditya Thakur and I pursued several new insights on this fundamental question. Several of the methods that we have developed are "dual-use". That is, in addition to providing methods for building improved abstract-interpretation tools, they also provide methods for building improved logic solvers that use abstract interpretation to speed up the search that a solver carries out. [More]

  • From 2010 to 2014, Somesh Jha, Bill Harris, Rich Joiner, and I worked on policy weaving, a declarative approach to system security. In our approach, a policy specification is a first-class notion—e.g., the policy specification for a given application is a single document, so that all elements of the policy are gathered together in one place, rather than scattered throughout the codebase. A security policy is implemented in terms of a collection of code snippets that invoke policy-enforcement primitives at different points in the program; however, users are freed from the task of introducing such snippets manually by a policy-weaving tool, which takes as input (i) a program P, and (ii) a policy R (written in some suitable policy-specification language), and rewrites P to enforce R. [More]

  • Starting in 2011, Emma Turetsky, Prathmesh Prabhu, and I worked on a method to improve the dataflow-analysis algorithm of Esparza et al., which provides a way to apply Newton's Method to program-analysis problems. They gave a new way to find fixed-points of systems of equations over semirings. As in its real-valued counterpart, each iteration of their method solves a simpler ``linearized'' problem. However, there is an important difference between the dataflow-analysis and numerical-analysis contexts: when Newton's method is used on numerical-analysis problems, multiplicative commutativity is relied on to rearrange expressions of the form ``a * X * b'' into ``X * a * b.'' Equations of the form ``X = X * a * b'' correspond to path problems described by regular languages. In contrast, when Newton's method is used for interprocedural dataflow analysis, the ``multiplication'' operation involves function composition, and hence is non-commutative: ``X = a * X * b'' cannot be rearranged into ``X = X * a * b,'' and thus the equations correspond to path problems described by linear context-free languages (LCFLs).

    We used some algebraic sleight-of-hand to turn a class of LCFL path problems into regular-language path problems [RTP17, RTP16]. This result is surprising because a reasonable sanity check—formal-language theory—suggests that it should be impossible: after all, the LCFLs are a strict superset of the regular languages.

    Zak Kincaid, Jason Breck, John Cyphert, and I have continued this line of research, and developed a number of additional techniques [KBFBR17, KCBR18, KBCR19, CBKR19, BCKR20, KRC21]. [More]

  • Since 2019, I have been collaborating with Loris D'Antoni and his students (Qinheping Hu, Jinwoo Kim, Kanghee Park, and others) on a new framework for program synthesis, which we call ``semantics-guided synthesis'' (SemGuS). The SemGuS framework captures the core computational problem of program synthesis, without being tied to a specific solution, implementation, or domain of application. That is, it is both domain-agnostic and solver-agnostic. What the SemGuS work brings to the table is the idea of parameterizing the core synthesis algorithm with the semantics of the primitives out of which the target program is to be constructed.

    To achieve this goal, it is necessary to distill out the essence of many program-synthesis problems into a specification formalism that is grounded in formal methods (e.g., automata and logic) and is agnostic to any specific domain of application. This degree of abstraction also opens the opportunity to lift certain synthesis algorithms and ideas to a higher level that makes these algorithms reusable across different tools. [More]

  • Meghana Sistla, Swarat Chaudhuri, and I are working on a data structure that can provide highly compressed representations of Boolean and pseudo-Boolean functions. Our data structure, called a CFLOBDD, is a decision diagram—and thus related to a Binary Decision Diagram (BDD)—but with one feature that is quite different from anything found in BDDs and their variants. A BDD can be considered to be a special form of bounded-size, branching, but non-looping program. From that standpoint, a CFLOBDD can be considered to be a bounded-size, branching, but non-looping program in which a certain form of procedure call is permitted. In the best case, CFLOBDDs can obtain double-exponential compression compared to decision trees—i.e., an extra exponential factor over (quasi-)BDDs. Even when such best-case compression is not possible, such ``procedure calls'' can allow additional sharing of structure beyond what is possible in BDDs: a BDD can only share sub-DAGs, whereas a procedure call in a CFLOBDD can also share the ``middle of a DAG.'' We have found that CFLOBDDs can be quite good for simulating quantum circuits. [More]

Along the way, we made several contributions to the field of incremental computation [RTD83, RR94a, RR94b, RR96a, RR96b, RSL03, RSL10, EGR10]; introduced canonical abstraction, a new family of abstractions for use in program analysis and verification [SRW02]; made numerous innovations in techniques for program analysis [RR95, HRS95, RT96, MR98, RBRSW05, LRB05, LR06, BR06, GR07, LKRT07, LTKR08, LR08b, KLR08, LR08c, KRDV09, LR09, KLTR09]; developed techniques to convert C programs to C++ programs that take advantage of the improved features of C++, such as templates and classes; worked on methods to address the Year 2000 problem; and identified several interesting connections between program-analysis problems and ideas from databases [R94, RSL03, RSL10, EGR10], machine learning [RSY04, YRS04, LRS05, TER12], and access control of shared computing resources in distributed systems [SJRS03, JR04, JSWR06].

[Back to the top]


Interactive Program-Development Environments

The Cornell Program Synthesizer

The Cornell Program Synthesizer [TR81], which Tim Teitelbaum and I started working on in 1978, was the first example of an interactive program-development environment with integrated, screen-oriented facilities for a strongly-typed, block-structured, imperative programming language. It had a tightly integrated combination of facilities for programmers to create, edit, execute, monitor, single-step, and debug their PL/C programs. In other words, by 1979, we had created the first version of an interactive programming environment similar to modern tools, such as Microsoft Visual Studio and Eclipse.

At the time that it was created (1978-79), the Cornell Program Synthesizer was a radically new kind of programming tool. Along with the Mentor system of Donzeau-Gouge, Kahn, Huet, and Lang [DGKHL80], it showed that the kind of direct-manipulation programming tools that had been pioneered in the SmallTalk and InterLisp systems could be created for a mainstream programming language (i.e., one that was strongly-typed, block-structured, and had imperative programming constructs).

The Cornell Program Synthesizer was used successfully in introductory programming classes by about 20,000 students over a five-year period at Cornell, Rutgers, Princeton, UCLA, and several other universities.

Technology Transfer

The Cornell Program Synthesizer was distributed to researchers at about 110 universities, research institutes, and industry laboratories. Through lectures that were given about the system and copies that were distributed, it is possible to trace its direct influence on several commercial products, including DEC's Language Sensitive Editor, Rational's Ada programming environment, and Symantec's LightspeedC and LightspeedPascal programming environments. It is more difficult to connect the dots to the similar products that hundreds of thousands or millions of people use on a daily basis today; however, together with the initial paper about the system [TR81], the Cornell Program Synthesizer was very influential: as a model for programming tools of the future, it helped to spark widespread interest in the question of how to create better tools for developing, analyzing, testing, and debugging programs.

[Back to the top]

The Synthesizer Generator

My Ph.D. dissertation [R84], which was honored with the 1983 ACM Doctoral Dissertation Award, investigated the problem of automating the creation of programming tools similar to the Cornell Program Synthesizer, but for languages other than PL/I. The goal of the research was to create an editor-generator tool to enable editors for different languages to be created automatically from a language description. In the solution I proposed, language descriptions are specified by giving

  • a context-free grammar for the textual input language,
  • a regular-tree grammar for the language's underlying abstract syntax,
  • sets of "attributes" (annotations) to be attached to nodes of the language's abstract-syntax trees,
  • rules that describe how information from one node's attributes affects the attributes of other nodes, and
  • pretty-printing rules that describe how abstract-syntax trees and their attributes are displayed.
In essence, each editor can be thought of as a specialized "spreadsheet" for documents written in the given language. After an editing operation changes the document's underlying abstract-syntax tree, some of the tree's attributes may no longer be consistent with one another; attributes must be updated to reestablish consistency (similar to the way values are updated in a spreadsheet program). A major result of my Ph.D. research was an optimal algorithm that I devised to solve the attribute-updating problem [R82, RTD83, R84].
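To make the "spreadsheet" analogy concrete, here is a toy Python sketch of attribute reevaluation after an edit (the Node class, the single synthesized "type" attribute, and the stopping rule are deliberately simplistic; the optimal change-propagation algorithm of [R82, RTD83, R84] handles arbitrary attribute dependences and is considerably more involved):

    class Node:
        """A tiny abstract-syntax-tree node with one synthesized attribute, 'type'."""
        def __init__(self, kind, *children):
            self.kind, self.children = kind, list(children)
            self.type = None

    def evaluate(node):
        """(Re)compute the 'type' attribute of a subtree, bottom-up."""
        if node.kind in ('int', 'real'):
            node.type = node.kind
        else:                                     # interior nodes are 'plus' nodes in this toy
            child_types = [evaluate(c) for c in node.children]
            node.type = 'real' if 'real' in child_types else 'int'
        return node.type

    def edit_and_repropagate(node, new_kind, ancestors):
        """Change a leaf, then push the change upward, stopping as soon as an
        ancestor's attribute value is unchanged (the incremental-update idea)."""
        node.kind = new_kind
        evaluate(node)
        for a in ancestors:                       # innermost ancestor first
            old = a.type
            a.type = 'real' if 'real' in (c.type for c in a.children) else 'int'
            if a.type == old:
                break                             # consistency reestablished; stop early

    left, right = Node('int'), Node('int')
    root = Node('plus', left, right)
    evaluate(root)                                # root.type == 'int'
    edit_and_repropagate(left, 'real', ancestors=[root])
    print(root.type)                              # 'real'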

As part of my Ph.D. research, I developed a prototype implementation of an editor-generating system that made use of the ideas described above. I spent 1982-85 as a post-doc at Cornell (in residence during 1982-83 at INRIA, the national institute for computer science in France) collaborating with Tim Teitelbaum on the design and implementation of the Synthesizer Generator, a full-scale version of such a system [RT84, RT87, RT88a]. The Synthesizer Generator creates a language-specific editor from an input specification that defines a language's syntax, display format, transformation rules for restructuring "documents" written in the language, and attribution rules that codify context-sensitive relationships between document elements. From this specification, the Synthesizer Generator creates a specialized editor for manipulating documents according to these rules. The Synthesizer Generator has proven to be of great utility in a wide range of applications, including program editors, text-preparation systems, and tools that verify the correctness of proofs (for several different kinds of mathematical and programming logics [RA84, G88, GMP90]).

In 2010, our 1984 paper about the Synthesizer Generator [RT84] received an ACM SIGSOFT Retrospective Impact Paper Award.

Technology Transfer

Beginning in August 1985, the Synthesizer Generator was made available to universities and industrial research organizations. Since distribution began it has been licensed to about 330 sites in 23 countries. Two books describing the system, which Teitelbaum and I co-authored, were published by Springer-Verlag in 1988 [RT88a, RT88b]. Also in 1988, Teitelbaum and I founded a company (GrammaTech, Inc.) to commercialize the Synthesizer Generator technology. GrammaTech subsequently created two products for the Ada programming language based on the Synthesizer Generator, Ada-ASSURED and Ada-Utilities, and, since the mid-90's, these have been used in the development of almost every major Department of Defense system written in Ada.

[Back to the top]


Software Engineering

Program Slicing

Starting in 1986, I carried out a great deal of work exploring how program slicing (an operation originally proposed in 1979 by the late Mark Weiser) can serve as the basis for semantics-based program-manipulation operations. The slice of a program with respect to a set of program elements S is a projection of the program that includes only program elements that might affect (either directly or transitively) the values of the variables used at members of S. Slicing allows one to find semantically meaningful decompositions of programs, where the decompositions consist of elements that are not textually contiguous. Program slicing is a fundamental operation that can aid in solving many software-engineering problems [HR92]. For instance, it has applications in program understanding, maintenance, debugging, testing, differencing [HR91], specialization [RT96], reuse, and merging of program variants [HPR89]. The projects that I carried out with my students and co-workers were aimed at these and related applications.

(Click here for the home page of the Wisconsin Program-Slicing Project.)

The conference paper on interprocedural slicing that Susan Horwitz, David Binkley, and I wrote [HRB88] was selected for inclusion in a special SIGPLAN collection of the 50 most influential papers from the SIGPLAN Conference on Programming Language Design and Implementation from 1979 to 1999:

  • Horwitz, S., Reps, T., and Binkley, D., Interprocedural slicing using dependence graphs. 20 Years of the ACM SIGPLAN Conference on Programming Language Design and Implementation (1979 - 1999): A Selection, K.S. McKinley, ed., ACM SIGPLAN Notices 39, 4 (April 2004), 232-243.

A journal version of [HRB88] appeared as [HRB90]. (Of all my papers, this is the most highly-cited; e.g., see Google Scholar and CiteseerX.)

A retrospective on [HRB88] was published as

  • Horwitz, S., Reps, T., and Binkley, D., Retrospective: Interprocedural slicing using dependence graphs. 20 Years of the ACM SIGPLAN Conference on Programming Language Design and Implementation (1979 - 1999): A Selection, K.S. McKinley, ed., ACM SIGPLAN Notices 39, 4 (April 2004), 229-231. [retrospective (PS); retrospective (PDF)]

In 2011, our 1994 paper "Speeding up slicing" [RHSR94] received an ACM SIGSOFT Retrospective Impact Paper Award.

In more recent work, my students, colleagues, and I have developed even better methods for program analysis and slicing based on a formalism called weighted pushdown systems.

Technology Transfer

In collaboration with Genevieve Rosay and Susan Horwitz, I built a system, called the Wisconsin Program-Slicing Tool, that implemented slicing of ANSI C programs. Although the Wisconsin Program-Slicing Tool was distributed from 1996-2000 for not-for-profit research purposes under license from the University of Wisconsin, it is no longer being distributed. However, there is a commercial tool available, named CodeSurfer, that is derived from the Wisconsin implementation and is available from GrammaTech, Inc., which licensed the technology from the Wisconsin Alumni Research Foundation (a not-for-profit foundation that uses royalty income from the technologies it licenses to support research at the University of Wisconsin).

GrammaTech has improved on the Wisconsin implementation considerably (both in terms of speed and space efficiency), and extended it to support C++. GrammaTech also offers CodeSurfer to academic researchers on very favorable terms. In 2007, Tavis Ormandy of Gentoo Linux said the following about CodeSurfer:

"Our team is responsible for investigating and verifying security problems in any of the 10,000+ open source programs we distribute. CodeSurfer has made a dramatic impact on the time it takes us to isolate and understand security flaws, allowing us to traverse complex and unfamiliar codebases easily and trace problems to their source."
A group at SAIC/NASA had this to say about CodeSurfer:
"Our group analyzes many mission-critical software projects to reduce defects and ensure that the software meets requirements. We conducted a formal study to see if CodeSurfer could improve our software inspections. We found that CodeSurfer reduced inspection time by an average of 30%. In addition, when using CodeSurfer the number of defects found more than doubled."

Slicing was also used in many commercial tools that were developed to address such code-renovation issues as the Year 2000 Problem ("Y2K bug") and conversion of financial software to support adoption of the Euro. One example is IBM's Rational Transformation Workbench. (A list of more than 20 such tools can be found here.)

[Back to the top]

Techniques for Software Renovation

My student Mike Siff and I investigated techniques for converting C programs to C++ programs that take advantage of the improved features of C++, such as templates and classes. For instance, in [SR96] we considered the problem of software generalization: Given a program component C, create a parameterized program component C' such that C' is usable in a wider variety of syntactic contexts than C. Furthermore, C' should be a semantically meaningful generalization of C; namely, there must exist an instantiation of C' that is equivalent in functionality to C. We gave an algorithm that generalizes C functions via type inference: the original functions operate on specific data types; the result of generalization is a collection of C++ function templates that operate on parameterized types.

We also described a general technique for identifying modules in legacy code [SR99]. The method is based on concept analysis—a branch of lattice theory that can be used to identify similarities among a set of objects based on their attributes. We showed how concept analysis can identify potential modules using both "positive" and "negative" information, and presented an algorithmic framework to construct a lattice of concepts from a program, where each concept represents a potential module.
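The core of concept analysis can be conveyed with a brute-force Python sketch (the toy relation between functions and the global structures they use is invented; the algorithm we actually used to build the concept lattice is more efficient than this exhaustive enumeration):

    from itertools import combinations

    def concepts(objects, attrs_of):
        """Enumerate the formal concepts of a finite object/attribute relation.
        A concept is a maximal pair (extent, intent): the extent is exactly the set
        of objects having every attribute in the intent, and vice versa."""
        all_attrs = set().union(*attrs_of.values())
        def common_attrs(objs):
            return set.intersection(*(attrs_of[o] for o in objs)) if objs else all_attrs
        def objects_with(attrs):
            return frozenset(o for o in objects if attrs <= attrs_of[o])

        found = set()
        for k in range(len(objects) + 1):
            for objs in combinations(objects, k):          # brute force over object subsets
                intent = frozenset(common_attrs(objs))
                found.add((objects_with(intent), intent))  # close the pair into a concept
        return found

    # Toy "legacy program": which functions use which global data structures.
    attrs_of = {'parse':     {'ast'},
                'typecheck': {'ast', 'symtab'},
                'emit':      {'symtab', 'code'}}
    for extent, intent in sorted(concepts(list(attrs_of), attrs_of), key=lambda c: len(c[0])):
        print(sorted(extent), sorted(intent))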

The development of better techniques for software renovation was the subject of Mike Siff's Ph.D. dissertation [S98].

[Back to the top]

The Use of Program Profiling for Software Maintenance

In 1996, I was asked by the Defense Advanced Research Projects Agency (DARPA) to plan a project aimed at reducing the impact of the Year 2000 problem on the Department of Defense. DARPA was particularly interested in whether there were "any techniques in the research community that could be applied to the Y2K problem and have impact beyond present commercial Y2K products and services." I spent a semester as a kind of talent scout for DARPA to see what ideas might be worth pursuing. The most exciting of the ideas that turned up concerned a method for using path profiling as a heuristic to locate some of the sites in a program where there are problematic date manipulations [RBDL97] (and more generally as an aid for testing and debugging).

A path profiler instruments a program so that the number of times each different loop-free path executes is accumulated during an execution run. With such an instrumented program, each run of the program generates a path spectrum for the execution—a distribution of the paths that were executed during that run. A path spectrum is a finite, easily obtainable characterization of a program's execution on a dataset, and provides a behavior signature for a run of the program.

Our techniques are based on the idea of comparing path spectra from different runs of the program. When different runs produce different spectra, the spectral differences can be used to identify paths in the program along which control diverges in the two runs. By choosing input datasets that hold all factors constant except one, one can attribute the divergence to that factor. The point of divergence itself may not be the cause of the underlying problem, but it provides a starting place for a programmer to begin exploring.
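In outline, the comparison step is just a difference of two path distributions; the following Python sketch conveys the idea (the string encoding of paths and the two toy runs are invented for illustration; a real path profiler assigns compact integer identifiers to loop-free paths):

    from collections import Counter

    def path_spectrum(executed_paths):
        """A path spectrum: the distribution of loop-free paths executed in one run."""
        return Counter(executed_paths)

    def spectral_diff(spectrum_a, spectrum_b):
        """Paths taken in one run but not the other: candidate points at which the
        two runs' control flow diverges."""
        return (sorted(set(spectrum_a) - set(spectrum_b)),
                sorted(set(spectrum_b) - set(spectrum_a)))

    # Two runs of the same (hypothetical) program on inputs that differ only in the year.
    run_dec_1999 = ['entry>check_date>ok>exit', 'entry>check_date>ok>exit']
    run_jan_2000 = ['entry>check_date>ok>exit', 'entry>check_date>rollover>exit']
    only_1999, only_2000 = spectral_diff(path_spectrum(run_dec_1999),
                                         path_spectrum(run_jan_2000))
    print(only_2000)   # ['entry>check_date>rollover>exit']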

One application of this technique is in the "Year 2000 Problem" (i.e., the problem of fixing computer systems that use only 2-digit year fields in date-valued data). In this context, path-spectrum comparison provides a heuristic for identifying paths in a program that are good candidates for being date-dependent computations.

A few years later, we found out that essentially the same idea had been discovered earlier by J. Collofello and L. Cousin [CC87], as well as by N. Wilde (who has done extensive research on the subject, and uses the term "software reconnaissance", e.g., see [W94, WC96]).

[Back to the top]


Dataflow Analysis

Interprocedural Dataflow Analysis and Context-Free-Language Reachability

Work that I carried out with Susan Horwitz, Mooly Sagiv, and David Melski established some interesting connections between the work on interprocedural program slicing and a number of other program-analysis problems, such as classical interprocedural dataflow analysis. In particular, we showed that a large class of program-analysis problems can be solved by transforming them into instances of a special kind of graph-reachability problem, called context-free-language reachability (CFL-reachability) [RHS95, HRS95, R95b, R96, MR98, R98, R00]. CFL-reachability is the variant of reachability in which a path from node s to node t only counts as a valid s-t connection if the path's labels form a word in a given context-free language.

The notion of CFL-reachability was originated by Mihalis Yannakakis, who developed it in the context of graph-theoretic methods in database theory. Context-free-language reachability problems can be solved in polynomial time by algorithms similar to the ones we originally developed for interprocedural slicing [HRB90, RHSR94].
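For readers unfamiliar with the notion, the following Python sketch shows a naive worklist (saturation) algorithm for all-pairs CFL-reachability; the grammar normal form and the small matched-string example are only for illustration, and no attempt is made at the asymptotic efficiency of the algorithms cited above:

    from collections import deque

    def cfl_reachability(edges, productions):
        """All-pairs CFL-reachability by saturation.
        edges:       set of (u, label, v) triples, with terminal labels
        productions: rules in the normal form ('A',) for A -> epsilon,
                     ('A', 'B') for A -> B, and ('A', 'B', 'C') for A -> B C."""
        nodes = {n for u, _, v in edges for n in (u, v)}
        R = set(edges) | {(n, p[0], n) for p in productions if len(p) == 1 for n in nodes}
        work = deque(R)
        while work:
            u, X, v = work.popleft()
            new = set()
            for p in productions:
                if len(p) == 2 and p[1] == X:                  # A -> X
                    new.add((u, p[0], v))
                elif len(p) == 3:
                    A, B, C = p
                    if B == X:                                 # A -> X C
                        new |= {(u, A, w) for (v2, Y, w) in R if v2 == v and Y == C}
                    if C == X:                                 # A -> B X
                        new |= {(w, A, v) for (w, Y, u2) in R if u2 == u and Y == B}
            for edge in new - R:
                R.add(edge)
                work.append(edge)
        return R

    # S -> epsilon | a T,  T -> S b   (S generates the matched strings a^n b^n)
    grammar = [('S',), ('S', 'a', 'T'), ('T', 'S', 'b')]
    edges = {(1, 'a', 2), (2, 'a', 3), (3, 'b', 4), (4, 'b', 5)}
    print((1, 'S', 5) in cfl_reachability(edges, grammar))     # True:  the 1-to-5 path spells "aabb"
    print((2, 'S', 5) in cfl_reachability(edges, grammar))     # False: "abb" is not in L(S)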

David Melski and I studied the relationship [MR98] between CFL-reachability and the program-analysis formalism of set constraints, which was introduced by John Reynolds and subsequently studied by Alex Aiken and Nevin Heintze.

The PowerPoint presentation prepared for the tutorial that I gave on this material at PLDI '00 can be found here.

In more recent work, my students, colleagues, and I have developed even better methods for interprocedural dataflow analysis based on a formalism called weighted pushdown systems.

Technology Transfer

In addition to the CFL-reachability-based slicing algorithms that are used in CodeSurfer, the work on program analysis via CFL-reachability had direct impact on many systems for software model checking and bug detection: a CFL-reachability solver is one of the key analysis components used in most such systems. Among them is the well-known SLAM tool from Microsoft, which checks C programs against temporal safety properties to find defects in Windows device drivers—thereby addressing one of Windows' main weak points. In particular, as described in [BR01], one of the key algorithms used in SLAM is based on the interprocedural dataflow-analysis algorithm given in [RHS95]. Due to the importance of the problem that it addressed, the SLAM work even got on Bill Gates's radar screen. For instance, in a keynote address at the Windows Hardware Engineering Conference in 2002, Gates touted the work on SLAM as follows:

"… software verification … has been the Holy Grail of Computer Science for many decades, but now in some very key areas, for example, driver verification, we're building tools that can do actual proof about the software and how it works in order to guarantee the reliability."

(See also Gates's address at OOPSLA 2002.) Thus, as one of the key technologies in a tool that has helped to make Windows much more stable in recent years, CFL-reachability has had a significant impact on the computing experience of hundreds of millions of people worldwide.

[Back to the top]

Static Program Analysis Via 3-Valued Logic (Shape Analysis)

The technique for static program analysis using 3-valued logic that Mooly Sagiv, Reinhard Wilhelm, and I developed [SRW02, RSW04] addresses the problem of analyzing programs that use dynamic allocation and freeing of storage cells and destructive updating of structure fields—features found in all modern programming languages, including C, C++, and Java. In such programs, data structures can grow and shrink dynamically, with no fixed upper bound on their size or number. In the case of thread-based languages, such as Java, the number of threads can also grow and shrink dynamically.

We developed a way to create finite-sized descriptors of memory configurations that (i) abstract away certain details, but (ii) retain enough key information so that the analysis can identify interesting properties that hold. We used these ideas to create a framework for program analysis that is capable of expressing rich properties of memory configurations, and confirming such behavioral properties as

  • when the input to a list-insert program is an acyclic list, the output is an acyclic list, and
  • when the input to a list-reversal program that uses destructive-update operations is an acyclic list, the output is an acyclic list.
The framework can be instantiated in different ways to create different program-analysis algorithms that provide answers to different questions, with varying degrees of efficiency and precision.

The work was originally motivated by the problem of shape analysis for programs written in languages that permit destructive updating of dynamically allocated storage. The aim of shape analysis is to discover information (in a conservative way) about the possible "shapes" of the dynamically allocated data structures to which a program's pointer variables can point. As we investigated this problem more deeply, we discovered methods that allowed us to deal with properties of memory configurations other than just pure shape properties. For instance, recent work has concerned numeric properties [GDDRS04].

The key aspect of our approach is the way in which it makes use of 2-valued and 3-valued logic: 2-valued and 3-valued logical structures—i.e., collections of relations—are used to represent concrete and abstract stores, respectively; individuals represent entities such as memory cells, threads, locks, etc.; unary and binary relations encode the contents of variables, pointer-valued structure fields, and other aspects of memory states; and first-order formulas with transitive closure are used to specify properties such as sharing, cyclicity, reachability, etc. Formulas are also used to specify how the store is affected by the execution of the different kinds of statements in the programming language.

The analysis framework can be instantiated in different ways by varying the relation symbols of the logic, and, in particular, by varying which of the unary relations control how nodes are folded together. The specified set of relations determines the set of properties that will be tracked, and consequently what properties of stores can be discovered to hold at different points in the program by the corresponding instance of the analysis.
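The folding step itself is simple enough to sketch in Python for a tiny instance, here an acyclic four-cell list pointed to by variable x (the encoding is heavily simplified; in particular, it omits the instrumentation relations, such as reachability, sharing, and cyclicity, that give the real analysis its precision):

    from itertools import product

    HALF = 0.5    # the third, "unknown" truth value of 3-valued logic

    def canonical_abstraction(universe, unary, binary, abstraction_preds):
        """Fold a concrete 2-valued structure into a bounded 3-valued structure:
        individuals that agree on the chosen abstraction predicates are merged, and
        any relation whose value differs among merged individuals becomes 1/2."""
        name = {u: tuple(unary[p][u] for p in abstraction_preds) for u in universe}
        abs_universe = sorted(set(name.values()))

        def join(values):
            vals = set(values)
            return vals.pop() if len(vals) == 1 else HALF

        abs_unary = {p: {a: join(unary[p][u] for u in universe if name[u] == a)
                         for a in abs_universe}
                     for p in unary}
        abs_binary = {r: {(a, b): join(binary[r][u, v]
                                       for u, v in product(universe, universe)
                                       if name[u] == a and name[v] == b)
                          for a, b in product(abs_universe, abs_universe)}
                      for r in binary}
        return abs_universe, abs_unary, abs_binary

    # Concrete store: x -> cell 1 -> 2 -> 3 -> 4 (an acyclic list of four cells).
    cells = [1, 2, 3, 4]
    unary = {'x': {1: True, 2: False, 3: False, 4: False}}    # "pointed to by x"
    binary = {'n': {(u, v): v == u + 1 for u in cells for v in cells}}
    print(canonical_abstraction(cells, unary, binary, abstraction_preds=['x']))

The three cells not pointed to by x are folded into a single summary individual, and the n relation between the head and the summary individual (and within the summary individual) gets the indefinite value 1/2.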

As a methodology for verifying properties of programs, the advantages of the 3-valued-logic approach are:

  1. No loop invariants are required.
  2. No theorem provers are involved, and thus every abstract execution step must terminate.
  3. The method is based on abstract interpretation and satisfies conditions that guarantee that the entire process always terminates.
  4. The method applies to programs that manipulate pointers and heap-allocated data structures.
  5. The method eliminates the need for the user to write the usual proofs required with abstract interpretation—i.e., to demonstrate that the abstract descriptors that the analyzer manipulates correctly model the actual heap-allocated data structures that the program manipulates.
A PowerPoint presentation on some of this material can be found here.

A prototype implementation that implements this approach has been created, called TVLA (Three-Valued-Logic Analyzer).

In addition to [SRW02, RSW04], some of the other key papers about this work include those on interprocedural shape analysis [JLRS04, JLRS08], numeric properties together with shape properties [GDDRS04], applications to program verification [LRSW00, LRS06, LRS07, ARRSY07], the use of ideas from incremental computing and databases to construct abstract transformers [RSL03, RSL10, EGR10], and the use of machine learning to identify appropriate abstractions [LRS05].

The Ph.D. dissertations of Alexey Loginov [L06] and Denis Gopan [G07] developed improved techniques for static analysis based on 3-valued logic.

[Back to the top]

Interprocedural Dataflow Analysis and Weighted Pushdown Systems

The framework for interprocedural dataflow analysis that Somesh Jha, Stefan Schwoon, and I developed, based on weighted pushdown systems [RSJM05], supports a strictly richer set of queries than previous approaches to interprocedural dataflow-analysis (a problem that has a 35-year history). In general, the approach can be applied to any distributive dataflow-analysis problem for which the domain of transfer functions has no infinite descending chains (including problems that cannot be expressed as bit-vector problems, such as linear constant propagation [SRH96] and affine-relation analysis [MOS04]). Safe solutions are also obtained for problems that are monotonic but not distributive.

The benefits of the approach can be summarized as follows:

  • Conventional dataflow-analysis algorithms merge together the values for all states associated with the same program point, regardless of the states' calling context. With the dataflow-analysis algorithm obtained via WPDSs, dataflow queries can be posed with respect to a regular language of stack configurations. (Conventional merged dataflow information can also be obtained by issuing appropriate queries.)
  • Because the algorithm for solving weighted path problems on WPDSs can provide a set of witness paths that justify an answer, it is possible to provide an explanation of why the answer to a dataflow query has the value reported.
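Weighted pushdown systems add a pushdown (call/return) component on top of a weight domain; the weight-domain ingredient by itself can be conveyed with a small Python sketch (the gen/kill encoding of transfer functions and the three-node example are invented for illustration, and a real WPDS solver must additionally handle pushdown rules and stack-qualified queries):

    from collections import deque

    VARS = frozenset({'x', 'y'})

    # A transfer function "lambda S. (S - kill) | gen" is represented by the pair (kill, gen).
    ONE  = (frozenset(), frozenset())          # identity function (the semiring 1)
    ZERO = (VARS, frozenset())                 # constant-empty function (the semiring 0)

    def extend(f, g):
        """Function composition, g after f: the semiring 'product'."""
        (k1, g1), (k2, g2) = f, g
        return (k1 | k2, (g1 - k2) | g2)

    def combine(f, g):
        """Pointwise union of results: the semiring 'sum' (join)."""
        (k1, g1), (k2, g2) = f, g
        return (k1 & k2, g1 | g2)

    def solve(edges, entry):
        """Join-over-all-paths weight for every node, by a standard worklist iteration."""
        W = {entry: ONE}
        work = deque([entry])
        while work:
            u = work.popleft()
            for (src, w, dst) in edges:
                if src != u:
                    continue
                new = combine(W.get(dst, ZERO), extend(W[u], w))
                if new != W.get(dst, ZERO):
                    W[dst] = new
                    work.append(dst)
        return W

    # Possibly-uninitialized variables for:  n1: x = 0;  n2: if (...) y = 1;  n3: use(x, y)
    edges = [('n1', (frozenset({'x'}), frozenset()), 'n2'),   # "x = 0" kills x
             ('n2', (frozenset({'y'}), frozenset()), 'n3'),   # branch on which "y = 1" kills y
             ('n2', ONE, 'n3')]                               # branch that skips "y = 1"
    kill, gen = solve(edges, 'n1')['n3']
    print((VARS - kill) | gen)      # frozenset({'y'}): only y may still be uninitialized at n3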

Weighted pushdown systems are also a useful formalism in a completely separate application domain, namely, access control of shared computing resources in distributed systems [JR02, SJRS03, JR04, JSWR06, WJRSS06].

Technology Transfer

The algorithms described in [RSJ03, RSJM05], along with the improvements described in [LRB05, LR06, LKRT07, LR08b], have been implemented in two libraries for use in creating program-analysis tools [KRML04, KLR07]. These libraries have been used to create prototype implementations of context-sensitive interprocedural dataflow analyses for uninitialized variables, live variables, linear constant propagation, and the detection of affine relationships, as well as many other problems.

[Back to the top]

Analysis of Executables

In our work on static analysis of stripped executables, we devised methods that—without relying on symbol-table or debugging information—recover intermediate representations that are similar to those that the intermediate phases of a compiler create for a program written in a high-level language. The goal is to provide a platform that an analyst can use to understand the workings of COTS components, plugins, mobile code, and DLLs, as well as memory snapshots of worms and virus-infected code.

Starting from a stripped executable, our algorithms identify variables, data objects, and types, and track how they are manipulated by the executable. This information provides answers to such questions as "what could this dereference operation access?", "what function could be called at this indirect call site?", etc. Our initial work concerned x86 executables. In 2003, my student Gogul Balakrishnan and I developed a static-analysis algorithm for x86 machine code that is able to track the flow of values through memory [BR04]. We incorporated this algorithm, along with algorithms to identify variables, data objects, and types, in CodeSurfer/x86 [BGRT05]—the first program-slicing tool for machine code that can help with understanding dependences across memory updates and memory accesses.

Later, we extended CodeSurfer/x86 to create the first automatic program-verification tool for stripped executables: it allows one to check that a stripped executable conforms to an API-usage rule (specified as a finite-state machine) [BRKLLMGYCT05, BR08].

More recently, our work on the TSL system has allowed us to create similar systems for other instruction sets automatically, starting from a specification of the concrete semantics of a machine-code instruction set.

Static analysis of machine code was the subject of Gogul Balakrishnan's Ph.D. dissertation [B07]. See also our ACM TOPLAS paper about the work [BR10]. (The 2004 paper about the work [BR04] received the ETAPS Best-Paper Award for 2004 from the European Association for Programming Languages and Systems (EAPLS).)

[Back to the top]

Transformer Specification Language (TSL)

In 2008, my student Junghee Lim and I created the TSL system [LR08a, LR12], a meta-tool to help in the creation of tools for analyzing machine code. (TSL stands for Transformer Specification Language.) From a single specification of the concrete semantics of a machine-code instruction set, TSL automatically generates correct-by-construction implementations of the state-transformation functions needed in state-space-exploration tools that use static analysis, dynamic analysis, symbolic analysis, or any combination of the three [LR08a, LLR09a, LLR10, TLLBDEAR10].

The TSL language is a strongly typed, first-order functional language with a datatype-definition mechanism for defining recursive datatypes, plus deconstruction by means of pattern matching. Writing a TSL specification for an instruction set is similar to writing an interpreter in first-order ML. For instance, the specification of an instruction set's concrete semantics is written as a TSL function

        state interpInstr(instruction I, state S) {...};
where instruction and state are user-defined datatypes that represent the instructions and the semantic states, respectively.

TSL's meta-language provides a fixed set of base-types; a fixed set of arithmetic, bitwise, relational, and logical operators; and a facility for defining map-types. The meanings of the meta-language constructs can be redefined by supplying alternative interpretations of them. When semantic reinterpretation is performed in this way—namely, on the operations of the meta-language—it is independent of any given instruction set. Consequently, once a reinterpretation has been defined that reinterprets TSL in a manner appropriate for some state-space-exploration method, the same reinterpretation can be applied to each instruction set whose semantics has been specified in TSL.

The reinterpretation mechanism causes TSL to act as a "Yacc-like" tool for automatically generating correct-by-construction implementations of the state-transformation functions needed in state-space-exploration tools. Moreover, because all components are generated from a single specification of the concrete semantics of an instruction set, the generated implementations are guaranteed to be mutually consistent (and also to be consistent with an instruction-set emulator that is generated from the same specification of the concrete semantics).
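TSL itself is an ML-like specification language, but the essence of semantic reinterpretation can be conveyed with a toy Python analogue (the two interpretation classes and the single ADD instruction below are invented for this page and are not TSL):

    class ConcreteInterp:
        """Interpret the meta-language operators over 32-bit machine integers."""
        def add(self, a, b):
            return (a + b) & 0xFFFFFFFF

    class IntervalInterp:
        """Reinterpret the same operators over intervals [lo, hi] (wrap-around ignored)."""
        def add(self, a, b):
            return (a[0] + b[0], a[1] + b[1])

    def interp_add(I, state, dst, src):
        """Semantics of a toy 'ADD dst, src' instruction, written once against the
        operator interface I; supplying different I's yields different transformers."""
        new_state = dict(state)
        new_state[dst] = I.add(state[dst], state[src])
        return new_state

    print(interp_add(ConcreteInterp(), {'eax': 5, 'ebx': 7}, 'eax', 'ebx'))
    print(interp_add(IntervalInterp(), {'eax': (0, 5), 'ebx': (7, 7)}, 'eax', 'ebx'))
    # {'eax': 12, 'ebx': 7}           -- concrete execution (emulator)
    # {'eax': (7, 12), 'ebx': (7, 7)} -- abstract transformer for an interval analysis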

TSL was the subject of Junghee Lim's Ph.D. dissertation [L11]. (The 2008 paper about TSL [LR08a] received the ETAPS Best-Paper Award for 2008 from the European Association for Programming Languages and Systems (EAPLS).)

[Back to the top]

Symbolic Methods for Abstract Interpretation

Since 2002, I have been interested in symbolic methods for abstract interpretation. The work aims to bridge the gap between (i) the use of logic for specifying program semantics and performing program analysis, and (ii) abstract interpretation [RSY04, YRS04, TR12a, TR12b, TER12, ELSAR13, TLLR13]. In [RSY04], Mooly Sagiv, Greta Yorsh, and I gave the first method that makes such a connection: we showed that, for a broad class of situations, given a formula φ in some logic L, it is possible to find the best value a in a given abstract domain Abs that over-approximates the meaning of φ (i.e., [[φ]] ⊆ γ(a)).

We call this problem symbolic abstraction. If you think of logic L as being a rich language for expressing meanings, and abstract domain Abs as corresponding to a logic fragment L' that is less expressive than L, then symbolic abstraction addresses a fundamental approximation problem:

Given φ ∈ L, find the strongest consequence of φ that is expressible in logic fragment L'.
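For a small finite domain, the flavor of the procedure from [RSY04] can be sketched in a few lines of Python (using the Z3 SMT solver as the decision procedure; the sign domain and the helper names are chosen purely for illustration, and the loop terminates only because the domain has no infinite ascending chains):

    from z3 import Int, Solver, Or, And, Not, BoolVal, sat

    NEG, ZERO, POS = 'neg', 'zero', 'pos'     # the sign domain: abstract values are subsets

    def gamma_hat(a, x):
        """A formula whose models are exactly the concretization of abstract value a."""
        disjuncts = ([x < 0] if NEG in a else []) + \
                    ([x == 0] if ZERO in a else []) + \
                    ([x > 0] if POS in a else [])
        return Or(disjuncts) if disjuncts else BoolVal(False)

    def beta(model, x):
        """Abstraction of a single concrete value."""
        v = model[x].as_long()
        return {NEG} if v < 0 else {ZERO} if v == 0 else {POS}

    def alpha_hat(phi, x):
        """Best sign-domain over-approximation of [[phi]], found by repeatedly asking
        the solver for a model of phi that lies outside the current abstract value."""
        a = set()                                   # start at bottom
        while True:
            s = Solver()
            s.add(And(phi, Not(gamma_hat(a, x))))
            if s.check() != sat:
                return a                            # nothing escapes gamma(a): a is best
            a |= beta(s.model(), x)                 # otherwise, join in the escapee

    x = Int('x')
    print(alpha_hat(And(x > 3, x < 10), x))         # {'pos'}
    print(alpha_hat(Or(x == 0, x == -2), x))        # {'neg', 'zero'} (set order may vary)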

From 2011 to 2016, my student Aditya Thakur and I pursued the insight that generalized versions of an old, and not widely used, method for validity checking of propositional-logic formulas, called Stålmarck's method, hold great promise for symbolic abstraction. The symbolic methods for abstract interpretation that we have subsequently developed based on this insight [TR12a, TR12b, TER12, TLLR13] offer much promise for building more powerful program-analysis tools. These methods (i) allow more precise implementations of abstract-interpretation primitives to be created, and (ii) drastically reduce the time taken to implement such primitives because correctness is ensured by construction. In particular, they provide a way to create correct-by-construction abstract interpreters that also attain the fundamental limits on precision that abstract-interpretation theory establishes. For instance, in [TLLR13], we described a systematic method for implementing an abstract interpreter that solves the following problem:

Given program P, and an abstract domain A, find the most-precise inductive A-invariant for P.

Moreover, our generalized-Stålmarck methods are "dual-use":

In addition to providing methods for building improved abstract-interpretation tools, they also provide methods for building improved logic solvers that use abstract interpretation to speed up the search that a solver carries out.
In particular, we showed how Stålmarck's method is an instantiation of a generic framework parameterized by an abstract domain over Booleans; consequently, different instantiations of the framework lead to new decision procedures for propositional logic [TR12b]. Moreover, this abstraction-based view allows us to lift Stålmarck's method from propositional logic to richer logics: to obtain a method for richer logics, instantiate the parameterized version of Stålmarck's method with richer abstract domains [TR12a]. We call such a decision-procedure design, which is parameterized by an abstract domain, a Satisfiability Modulo Abstraction (SMA) solver [RT14].

Recently, we created an SMA solver for separation logic [TBR14]. Separation logic (SL) is an expressive logic for reasoning about heap structures in programs, and provides a mechanism for concisely describing program states by explicitly localizing facts that hold in separate regions of the heap. SL is undecidable in general, but by using an abstract domain of shapes [SRW02] we were able to design a semi-decision procedure for SL.

The design of algorithms for symbolic abstraction and their application was the topic of Aditya Thakur's Ph.D. dissertation [T14].

[Back to the top]

Newton's Method for Program Analysis

Recently, Esparza et al. generalized Newton's method—the classical numerical-analysis algorithm for finding roots of real-valued functions—to a method for finding fixed-points of systems of equations over semirings. Their method provides a new way to solve interprocedural dataflow-analysis problems. As in its real-valued counterpart, each iteration of their method solves a simpler ``linearized'' problem.

However, there is an important difference between the dataflow-analysis and numerical-analysis contexts: when Newton's method is used on numerical-analysis problems, multiplicative commutativity is relied on to rearrange expressions of the form ``a * X * b'' into ``X * a * b.'' Equations of the form ``X = X * a * b'' correspond to path problems described by regular languages. In contrast, when Newton's method is used for interprocedural dataflow analysis, the ``multiplication'' operation involves function composition, and hence is non-commutative: ``X = a * X * b'' cannot be rearranged into ``X = X * a * b,'' and thus the equations correspond to path problems described by linear context-free languages (LCFLs).

What we contributed in [RTP17, RTP16] is a way to convert each LCFL problem into a regular-language problem. Given that the language {aibi} is the canonical example of an LCFL that is not regular, this transformation sounds impossible. The secret is that we require the semiring to support a few additional operations (what we call ``transpose,'' ``tensor product,'' and ``detensor''). Moreover, one has such operations for the so-called ``predicate-abstraction problems'' (an important family of dataflow-analysis problems used in essentially all modern-day software model checkers). In predicate-abstraction problems,

  • a semiring value is a square Boolean matrix
  • the product operation is Boolean matrix multiplication
  • the sum operation is pointwise "or"
  • transpose (denoted by t) is matrix transpose, and
  • tensor product is Kronecker product of square Boolean matrices
The key step is to take each equation of the form ``X = a1 * X * b1 + a2 * X * b2'' and turn it into ``Z = Z * (a1t tensor b1) + Z * (a2t tensor b2).'' The reason this works is that linear paths in the Z system mimic derivation trees in the X system, e.g.,
  
  (a1t tensor b1) * (a2t tensor b2) = (a1t * a2t) tensor (b1 * b2)
                                    = (a2 * a1)t tensor (b1 * b2),
in which a1 is paired with b1 and a2 is paired with b2. This pairing produces the kind of mirrored symmetry that we need to properly track the values that arise in the X system, which have such symmetric correlations. (More precisely, the detensor operation performs
  detensor(at tensor b) = a * b,
so that
  detensor((a2 * a1)t tensor (b1 * b2)) = a2 * a1 * b1 * b2,
which has the mirrored symmetry found in the values of derivation trees in the X system.)

The Z system's paths encode all and only the derivation trees of the X system. To use this idea to solve an LCFL path problem precisely, there is one further requirement: the ``detensor'' operation must distribute over the tensored-sum operation. This property causes the detensor of the sum-over-all-Z-paths to equal the desired sum-over-all-X-tree-valuations.

It turns out that such a distributive detensor operation exists for Kronecker products of Boolean matrices, and thus all the pieces fit together for the predicate-abstraction problems.
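The underlying algebra, namely (A tensor B) * (C tensor D) = (A * C) tensor (B * D) together with (A * B)t = Bt * At, can be spot-checked on random Boolean matrices with a few lines of NumPy (the bmm and tensor helpers below are only for this illustration):

    import numpy as np

    def bmm(A, B):
        """Boolean matrix multiplication: the semiring extend operation."""
        return (A.astype(int) @ B.astype(int)) > 0

    def tensor(A, B):
        """Kronecker (tensor) product of Boolean matrices."""
        return np.kron(A.astype(int), B.astype(int)) > 0

    rng = np.random.default_rng(0)
    a1, b1, a2, b2 = (rng.integers(0, 2, (4, 4)) > 0 for _ in range(4))

    lhs = bmm(tensor(a1.T, b1), tensor(a2.T, b2))   # (a1t tensor b1) * (a2t tensor b2)
    rhs = tensor(bmm(a2, a1).T, bmm(b1, b2))        # (a2 * a1)t tensor (b1 * b2)
    assert (lhs == rhs).all()                       # the mirrored pairing checks out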

In addition to its use in Newtonian program analysis, we know of two other applications of the rearrangement trick: it was originally used in [LTKR08] in a method for context-bounded analysis of concurrent programs; and although we used a different method in our work on abstract error projection [LKRT07], the rearrangement trick could have been used there, and would have improved the result.

Zak Kincaid, Jason Breck, John Cyphert, and I have continued this line of research, and developed a number of additional techniques that have been incorporated in a system called ICRA (Interprocedural Compositional Recurrence Analysis), which finds invariants in numeric programs [KBFBR17, KCBR18, KBCR19, CBKR19, BCKR20, KRC21]. That work introduces versions of ``transpose,'' ``tensor product,'' and ``detensor'' for use with logical formulas.

[Back to the top]


Program Synthesis

Program synthesis refers to the task of finding a program within a given search space that meets a given behavioral specification (typically a logical formula or a set of input-output examples). The general problem is as follows:

Given a specification S that relates input values and output values, create a program P that satisfies S—i.e., for all inputs x, the input/output pair (x, P(x)) satisfies S.
The benefit of creating software via a synthesis tool is that when the tool produces a program that satisfies S, the program is correct by construction.
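To make the flavor of the problem concrete, here is a minimal enumerative synthesizer in Python (the search space, expressions over x, 0, 1, +, and *, and the size bound are invented for this example; it conveys none of the sophistication of modern synthesis tools):

    from itertools import count

    def evaluate(p, x):
        """Evaluate an expression AST: 'x', an integer constant, or (op, left, right)."""
        if p == 'x':
            return x
        if isinstance(p, int):
            return p
        op, left, right = p
        l, r = evaluate(left, x), evaluate(right, x)
        return l + r if op == '+' else l * r

    def synthesize(examples, max_size=7):
        """Enumerate expressions over {x, 0, 1, +, *} in order of increasing size and
        return the first one consistent with every (input, output) example."""
        layers = {1: ['x', 0, 1]}                     # layers[k] = all ASTs with k nodes
        for size in count(1):
            if size > max_size:
                return None
            if size not in layers:
                layers[size] = [(op, l, r)
                                for i in range(1, size - 1)    # split the remaining nodes
                                for l in layers[i]
                                for r in layers[size - 1 - i]
                                for op in ('+', '*')]
            for candidate in layers[size]:
                if all(evaluate(candidate, x) == y for x, y in examples):
                    return candidate

    # Synthesize a program from the behavioral specification f(0)=1, f(1)=2, f(2)=5, f(3)=10.
    print(synthesize([(0, 1), (1, 2), (2, 5), (3, 10)]))   # e.g., ('+', 1, ('*', 'x', 'x'))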

In our work, we are developing a new synthesis framework, called ``semantics-guided synthesis'' (SemGuS) that attempts to be both language-agnostic and solver-agnostic. The goal of SemGuS is to provide a general, logical framework that expresses the core computational problem of program synthesis, without being tied to a specific solution or implementation. What the SemGuS work brings to the table is the idea of parameterizing the core synthesis algorithm with the semantics of the primitives out of which the target program is to be constructed—hence the name ``semantics-guided synthesis.'' In addition to a syntactic search space and a behavioral specification, SemGuS also allows the user to define the semantics of constructs in the grammar in terms of a set of inference rules. The SemGuS framework is described in [KHDR21, DHKR21], and the SemGuS page is here: semgus.org.

Along the way, we obtained new results on how to show that a program-synthesis problem is ``unrealizable'' (i.e., the problem has no solution). While many program synthesizers are good at finding a solution when one exists, they typically fail, or do not terminate, when the given synthesis problem is unrealizable. The lack of knowledge about this problem was an important lacuna in the field: the key to an efficient synthesis algorithm is pruning of the search space; in particular, when synthesis subproblems are unrealizable, one wants that part of the search space to be pruned so that the synthesizer will never again look for a solution there—because none is to be found. Our work has started to fill this gap by introducing multiple approaches for proving that synthesis problems are unrealizable [HBCDR19, HCDR21b].

We also developed techniques for synthesizing programs that meet a user-specified asymptotic-complexity bound [HCDR21a].

[Back to the top]


Model Checking Concurrent Programs

Context-Bounded Analysis

In context-bounded analysis, a program's behavior is explored for a fixed number of context switches; however, running threads are allowed to perform an a priori unbounded amount of work between context switches. Context-bounded analysis was introduced by Shaz Qadeer and Jakob Rehof, but their algorithm had two drawbacks: (i) it was limited to Boolean abstract domains (i.e., predicate-abstraction domains), and (ii) it performed certain enumerations of reachable configurations during the process of state-space exploration. In 2008, Akash Lal, Tayssir Touili, Nicholas Kidd, and I showed how symbolic methods could be employed in context-bounded analysis. In particular, we gave transducer-based techniques that permit the explicit-enumeration step of the Qadeer-Rehof algorithm to be avoided [LTKR08]. Akash Lal and I showed how any context-bounded-analysis problem can be transformed into an equivalent sequential problem; after applying the transformation, any existing tool for sequential analysis can then be applied [LR08c, LR09].

When we compared the performance of the latter analysis technique against a state-of-the-art tool for verifying concurrent device drivers (Daniel Kroening's DDVerify, using SMV as its analysis engine), we found that the median improvement obtained was a 30-fold speedup! (See Figure 7.9 on page 205 of [L09].)

Context-bounded analysis was the subject of Akash Lal's Ph.D. dissertation [L09]. For this work, Akash was a co-recipient of the 2009 ACM SIGPLAN John C. Reynolds Doctoral Dissertation Award. Akash was also named to MIT Technology Review's 2011 India TR-35 list. In 2023, Akash and I, along with M. Musuvathi, S. Qadeer, and J. Rehof, received the 2023 CAV Award for the ``introduction of context-bounded analysis and its application to systematic testing of concurrent programs.''

[Back to the top]

Detecting Atomicity Violations

Our work on detecting atomicity violations focused on the problem of detecting data races that may involve multiple data locations. Problems can arise when a program is supposed to maintain an invariant over multiple data locations, but accesses/updates are not protected correctly by locks. The tool that we built analyzes Java programs. Originally, the analysis engine in the tool employed a semi-decision procedure; however, after introducing two problem transformations that were motivated by the desire to reduce the imprecision caused by the abstractions that the tool applies [KLR08, KRDV09], we discovered that for the model of concurrent programs we were working with—which permits (i) reentrant locks, (ii) an unbounded number of context switches, and (iii) an unbounded number of lock acquisitions—the problem is decidable [KLTR09]. (The couplings between different processes are sufficiently weak that the undecidability of checking emptiness of the intersection of two context-free languages does not come into play.)
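The Python fragment below is an invented rendering of the bug pattern (the tool itself analyzes Java): each location is protected by a lock, yet the invariant that spans both locations can be observed in a broken state because the two updates are not performed atomically.

    import threading

    balance_lock, audit_lock = threading.Lock(), threading.Lock()
    balance, withdrawn = 100, 0          # intended invariant: balance + withdrawn == 100

    def withdraw(amount):
        global balance, withdrawn
        with balance_lock:               # each update is individually protected ...
            balance -= amount
        # ... but another thread running audit() at this point sees the invariant
        # violated, because the two updates were not grouped into one atomic section.
        with audit_lock:
            withdrawn += amount

    def audit():
        with balance_lock, audit_lock:
            assert balance + withdrawn == 100, "multi-location invariant violated"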

When we compared the performance of the decision procedure against the semi-decision procedure, we found that the decision procedure was 34 times faster overall (i.e., comparing the sum of the running times on all queries, with a 300-second timeout). Moreover, with the 300-second timeout threshold, for each query where the semi-decision procedure timed out (roughly 49% of the queries), the decision procedure succeeded within the allotted time.

Detection of atomicity violations was the subject of Nicholas Kidd's Ph.D. dissertation [K09].

[Back to the top]


Policy Weaving

System security is a cross-cutting issue. Unfortunately, it is hard to devise a formalism that allows system-wide policies to be both (i) easily understood and (ii) easily implemented and maintained. In particular, when the elements of security policies are expressed directly in the code—e.g., via calls on security primitives, as in Decentralized Information Flow Control (DIFC) systems, such as Asbestos, HiStar, and Flume, or in capability-based systems, such as Capsicum—the policy implementation is spread throughout the codebase of an application. With such a structure, it is difficult to understand what overall policy the application actually implements, which complicates documenting the policy, debugging the policy, and updating the policy as the codebase evolves in response to changes in requirements and the threat landscape.

In contrast, the focus of our work is on a declarative approach to system security. In our approach, a policy specification is a first-class notion—e.g., the policy specification for a given application is a single document, so that all elements of the policy are gathered together in one place, rather than scattered throughout the codebase. A security policy is still implemented in terms of a collection of code snippets to invoke policy-enforcement primitives at different points in the program. However, users are freed from the task of introducing such snippets manually.

The goal of our work is to create tools to support policy weaving, which will handle the job of rewriting the program to place code at appropriate locations so that suitable security primitives are invoked at appropriate moments to enforce the policy. A policy-weaving tool takes as input (i) a program P, and (ii) a policy R (written in some suitable policy-specification language), and rewrites P to enforce R. A policy-weaving tool supports a declarative approach to security. The policy R can be stated as a single, easily-understood document, and it is the obligation of the policy-weaving tool to find a way to place code for invoking enforcement primitives so that all executions of P satisfy R.

We also work on meta-tools to support policy weaving. In particular, one of our goals is to create a policy-weaver generator. Such a tool takes as input specifications of

  1. the semantics of a programming language L
  2. the semantics of the policy-enforcement primitives to be used
and generates a policy weaver.

Our vision is similar to aspect-oriented programming (in that a cross-cutting issue is specified separately and then automatically woven into the fabric of the implementation). However, the approach that we envision goes beyond aspect-oriented programming in an important way: whereas conventional aspect-oriented programming systems use syntactically-oriented mechanisms and allow aspect code to be placed in a rather limited set of locations, our approach is based on programming-language and policy semantics to identify automatically the program locations where policy-enforcement primitives need to be placed. Our approach also goes beyond simple finite-state reference monitors: policies have visibility into such entities as processes, principals, and communication channels. A key challenge is to design a policy weaver that can reason about the potentially-unbounded set of states that a program may reach, and to choose system primitives and parameters from a potentially unbounded number of options to keep the program from reaching states that constitute a violation of the policy provided by the programmer.

Our approach to designing a weaver generator is based on the theory of two-player safety games. Such a game is played by two players, called the Attacker and the Defender, on a graph in which each node is labeled as an Attacker node or a Defender node. The state of the game is a node of the graph. The goal of the Attacker is to reach a state labeled as an accepting node; the goal of the Defender is to thwart the Attacker by preventing the Attacker from ever reaching an accepting node. The game is played in turns: on each turn, if the state is an Attacker state, then the Attacker can change the state to any adjacent node of the graph, and analogously for Defender states and Defender-initiated state changes. A winning Defender strategy for a game is a decision tree of choices that the Defender can make so that an accepting state is never reached.
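For a finite, explicitly represented game graph, the winning regions can be computed with a standard backward fixpoint (an "attractor" computation), as in the sketch below; the small game is invented for illustration, and the actual weaver generator of [HJR12] works with much richer game representations.

    # Nodes from which the Attacker can force a visit to an accepting node;
    # the Defender has a winning strategy from every other node.
    def attacker_wins_from(edges, attacker_nodes, accepting):
        win, changed = set(accepting), True
        while changed:
            changed = False
            for n in set(edges) - win:
                succs = edges[n]
                if n in attacker_nodes:
                    forced = any(s in win for s in succs)   # Attacker chooses the move
                else:
                    forced = all(s in win for s in succs)   # Defender must avoid them all
                if forced:
                    win.add(n)
                    changed = True
        return win

    # Invented game: a and b are Attacker nodes; c and d are Defender nodes; d is accepting.
    edges = {"a": ["b", "c"], "b": ["d"], "c": ["a", "c"], "d": ["d"]}
    bad = attacker_wins_from(edges, attacker_nodes={"a", "b"}, accepting={"d"})
    print("Defender wins from:", set(edges) - bad)          # {'c'}: stay on the self-loop forever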

The core of a policy weaver created by our weaver generator is an algorithm to find a winning Defender strategy [HJR12]. A policy weaver generated by the policy-weaver generator works as follows: given program P and policy R, the policy weaver constructs a game in which

  • the Attacker represents the threat model (i.e., all possible system attacks),
  • the Defender represents possible instrumentations of the program, and
  • accepting nodes represent system states reached when the instrumented program violates the policy.
If the weaver finds a winning Defender strategy for the game, it uses the strategy to instrument the program to satisfy the policy.

We have also been working on applying policy weaving to JavaScript programs. JavaScript is the most popular language used to create dynamic content for web browsers. Moreover, the nature of the Web has resulted in an ecosystem in which JavaScript programs and libraries, often written by untrusted third parties, are regularly incorporated into applications and combined in every way imaginable. While this practice has obvious benefits in terms of convenience and not ``reinventing the wheel,'' it creates glaring security holes because of a complete lack of isolation between modules in a JavaScript environment. We chose this context to apply our policy-weaving techniques because of the obviously unfilled need for a security mechanism, and because of the challenge presented by JavaScript's highly dynamic, untyped nature.

We have created the JAM policy weaver [FJJRPSY12], coupled with the JAMScript enforcement mechanism. (``JAM'' stands for JAvaScript Model checker.) JAM can be used by a website administrator to apply a security policy—in the form of a finite-state automaton—that specifies a set of disallowed execution traces of a JavaScript program. The program is analyzed statically with respect to the policy, and an instrumented program that is guaranteed not to violate the policy specification is produced. This analysis consists of a counterexample-guided, abstraction-refinement loop that repeatedly creates an incrementally more precise abstract model of the program. Because this model represents a superset of the program's actual behaviors, any execution trace that is not realizable in the model also cannot occur in the concrete program.

Static analysis can be expensive, especially in the context of a highly dynamic language such as JavaScript. Thus, the algorithm used in JAM provides for a system-resource cutoff threshold, which terminates the analysis when the threshold is reached. This case still results in a safely instrumented program, although the program may contain some amount of spurious instrumentation, resulting in extra runtime overhead. In this way, JAM can be tuned to balance the trade-off between static-analysis overhead and dynamic overhead.

The instrumentation that JAM weaves into a JavaScript program comes in the form of transactions. Potentially malicious statements in the untrusted code (i.e., those that could not be proven safe by the static analysis) are wrapped in transaction blocks, which are speculatively executed in a manner that records all actions (reads, writes, and function calls) and suppresses all effects of the actions. This approach allows the effects of the statements to be evaluated with respect to the security policy, at run-time inside the transaction. The program is terminated—without the effects of the actions being visible externally—if it is determined that any prefix of the sequence of actions will violate the policy. If no policy-violating action sequence is observed, the transaction is committed, and execution continues as before.
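The sketch below is a schematic Python rendering of that transactional idea, reduced to writes only; the real mechanism, JAMScript, runs inside the JavaScript engine, and all names here are invented.

    class PolicyViolation(Exception):
        pass

    def run_transaction(body, policy_ok, state):
        """Speculatively run `body`, logging its writes and suppressing their effects;
        commit only if no prefix of the action log violates the policy."""
        log = []                       # recorded actions
        shadow = dict(state)           # effects go to a copy, not to the real state

        def write(name, value):
            log.append(("write", name, value))
            if not policy_ok(log):     # check each prefix as the log grows
                raise PolicyViolation(log)
            shadow[name] = value

        body(write, shadow)            # speculative execution against the shadow state
        state.update(shadow)           # no violation observed: commit the effects
        return log

    # Policy: the untrusted code must never write to "document.cookie".
    policy_ok = lambda log: all(name != "document.cookie" for kind, name, value in log)

    state = {"x": 0}
    run_transaction(lambda write, s: write("x", 42), policy_ok, state)
    print(state)                       # {'x': 42}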

[Back to the top]


Incremental Computing

The attribute-updating algorithm that I developed as part of my Ph.D. research [R82, RTD83, R84] is an example of an "incremental updating algorithm"—an algorithm that makes use of the solution to one problem instance I1 to find the solution to a nearby problem instance I2, where I2 is created by making a small change to problem instance I1. Over the years, I have encountered such problems in a variety of contexts, including interactive program-development environments [R82, RTD83, R84, RMT86, R88], dynamic graph problems [RR94a, RR94b, RR96a, RR96b], the construction of abstract transformers for program-analysis problems [RSL03, RSL10, EGR10], and propagation strategies in dataflow-analysis solvers [RSJM05]. My interest in incremental computing also led me to look at ways of transforming a program that computes a function F(x) on floating-point numbers into programs that compute good approximations to the derivative F'(x), the first divided difference F[x0,x1], Taylor-series expansions of F(x), and higher-order divided differences of F(x), as well as multivariate versions of such transformations [RR00, RR03] (see computational differentiation and computational divided differencing).
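The sketch below is meant only to convey the flavor of incremental updating (it is not the algorithm of [RR96a, RR96b]): when one edge weight decreases, an existing shortest-path solution is repaired by propagating changes outward from the affected edge rather than re-solving from scratch. The graph and distances are invented.

    import heapq

    def decrease_edge(graph, dist, u, v, new_w):
        """graph: {node: {succ: weight}}; dist: shortest distances from a fixed source.
        Lower the weight of edge (u, v) and repair dist incrementally."""
        graph[u][v] = new_w
        if dist[u] + new_w >= dist[v]:
            return                               # old solution still valid: nothing to do
        dist[v] = dist[u] + new_w
        heap = [(dist[v], v)]                    # propagate only from the affected node
        while heap:
            d, x = heapq.heappop(heap)
            if d > dist[x]:
                continue
            for y, w in graph[x].items():
                if d + w < dist[y]:
                    dist[y] = d + w
                    heapq.heappush(heap, (dist[y], y))

    graph = {"s": {"a": 5, "b": 10}, "a": {"b": 4}, "b": {}}
    dist = {"s": 0, "a": 5, "b": 9}              # distances from "s", computed earlier
    decrease_edge(graph, dist, "s", "a", 1)
    print(dist)                                  # {'s': 0, 'a': 1, 'b': 5}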

Incremental computing was the subject of G. Ramalingam's Ph.D. dissertation [R96].

Technology Transfer

The incremental algorithm for the dynamic shortest-hyperpath problem in directed hypergraphs that my student G. Ramalingam and I developed [RR96b] has been used by AT&T to improve the performance of Open Shortest Path First (OSPF), the most commonly used Internet intra-domain routing protocol. The incremental shortest-hyperpath algorithm is used as a subroutine in a heuristic search algorithm that optimizes OSPF weights (for a given set of projected demands) and thereby reduces congestion [FT04]. The speed-up obtained by using our incremental algorithm was reported to be a factor of 15 to 20. In 2004, Mikkel Thorup told me that AT&T was using these results to adjust capacities in the AT&T network when link downtimes had to be scheduled for maintenance. Apparently this happens quite frequently: he indicated that on the order of a dozen links had to be serviced each day.

More recently, the ideas used in [RR96b] were combined with the A* search algorithm from the field of AI planning to create several incremental versions of A* search. These algorithms were used to solve planning problems and generate dynamically constrained trajectories for the full-size SUV of CMU's Tartan Racing Team, which won the 2007 DARPA Urban Challenge race for autonomous vehicles.

[Back to the top]


Computational Differentiation and Computational Divided Differencing

Tools for computational differentiation transform a program that computes a numerical function F(x) into a related program that computes F'(x) (the derivative of F). Louis Rall (of the UW Mathematics Department) and I described how techniques similar to those used in computational-differentiation tools can be used to implement other program transformations—in particular, a variety of transformations for computational divided differencing [RR00, RR03], including multivariate versions of such transformations.
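A standard way to convey the basic idea is forward-mode differentiation with "dual numbers," as in the invented sketch below; the transformations described in [RR00, RR03] operate at the source level and also cover divided differencing, which this sketch does not attempt.

    class Dual:
        """A value paired with its derivative with respect to x."""
        def __init__(self, val, dot=0.0):
            self.val, self.dot = val, dot
        def __add__(self, o):
            o = o if isinstance(o, Dual) else Dual(o)
            return Dual(self.val + o.val, self.dot + o.dot)
        __radd__ = __add__
        def __mul__(self, o):
            o = o if isinstance(o, Dual) else Dual(o)
            return Dual(self.val * o.val,
                        self.dot * o.val + self.val * o.dot)   # product rule
        __rmul__ = __mul__

    def F(x):
        return 3 * x * x + 2 * x + 1     # an ordinary numerical program

    d = F(Dual(2.0, 1.0))                # seed dx/dx = 1 at x = 2
    print(d.val, d.dot)                  # F(2) = 17.0 and F'(2) = 14.0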

[Back to the top]


Representing Boolean and Pseudo-Boolean Functions

In summer 2018, Meghana Sistla visited the University of Wisconsin, supported by the S.N. Bose Scholars Program, and we worked to resurrect an idea of mine from the late 90s that had been rejected by the USPTO in 2002. The idea was for a plug-compatible replacement for BDDs for representing Boolean and pseudo-Boolean functions, called CFLOBDDs (for ``Context-Free Language Ordered Binary Decision Diagrams''). The key insight behind CFLOBDDs is that sub-decision-diagrams can be treated as standalone ``procedures,'' which allows additional reuse opportunities compared to BDDs. In the best case, one can obtain double-exponential compression compared to decision trees—i.e., an extra exponential factor over (quasi-)BDDs. Even when such best-case compression is not possible, the ability to have ``procedure calls'' in CFLOBDDs can allow there to be additional sharing of structure beyond what is possible in BDDs: a BDD can only share sub-DAGs, whereas a procedure call in a CFLOBDD can also share the ``middle of a DAG.''

Meghana joined Swarat Chaudhuri's research group at UT-Austin, and we have been continuing the work with Swarat. Twenty-five years ago, I never found a compelling application in which CFLOBDDs beat BDDs, but in the last few years we have found that CFLOBDDs can be quite good for simulating quantum circuits [SCR24a]. The implementation is available at this [link].

Our Quasimodo tool for quantum simulation—which supports BDDs, Weighted BDDs, and CFLOBDDs—is available on GitHub at this [link], and described in a CAV23 tool paper [SCR23].

We have also developed a weighted version of CFLOBDDs, called WCFLOBDDs [SCR24b].

[Back to the top]
