Research Summary
My research concerns a core issue in Computer Science: how to develop
correct, reliable, and secure software.
To address this issue, my group works on creating better tools for
manipulating programs and analyzing their execution properties.
The work uses ideas that come primarily from the fields of programming languages,
compilers, and software engineering, but there are connections to databases,
machine learning, and even numerical analysis.
Whenever possible, our tools take the form of frameworks that apply to general
classes of problems.
The benefit of developing a solution in the form of a framework is that
it allows the work to address "all of next year's problems, not just
the one [currently] in front of [your] face"
[H86, p. 7].
Many of the tools that we work on are based on static analysis,
which provides a way to obtain information about the possible
states that a program reaches during execution, but without actually
running the program on specific inputs. Instead, static-analysis
techniques explore a program's behavior for all possible inputs
and all possible states that the program can reach. To make this feasible,
the program is "run in the aggregate"—i.e., on descriptors
that represent collections of many states.
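As a toy illustration of running a program "in the aggregate", the following sketch (ours, not from any of the tools described below; all names are illustrative) executes an addition over intervals, where each interval descriptor stands for many concrete values at once:

```python
# Toy example of "running in the aggregate": instead of single integers,
# the analysis manipulates intervals, each standing for a set of values.

class Interval:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi

    def __add__(self, other):
        # Abstract addition: sound for every pair of concrete values
        # drawn from the two intervals.
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def join(self, other):
        # Least upper bound: used where control-flow paths merge.
        return Interval(min(self.lo, other.lo), max(self.hi, other.hi))

    def __repr__(self):
        return f"[{self.lo}, {self.hi}]"

x = Interval(0, 10)   # x could be anything in [0, 10]
y = Interval(1, 2)    # y is 1 or 2
print(x + y)          # [1, 12]: covers all 22 concrete combinations
print(x.join(y))      # [0, 10]: merge of two paths' facts
```

One pass over such descriptors accounts for every concrete input at once, which is what makes exploring all possible states feasible.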
Most of our current efforts are focused on the following topics:
A summary of some of my past research work, including work on
frameworks for solving other classes of problems, can be found
here.
In our work on static analysis of stripped executables, we have devised methods
that—without relying on symbol-table or debugging
information—identify variables, data
objects, and types, and track how they are manipulated by the
executable. This information provides answers to such questions as
"what could this dereference operation access?", "what function
could be called at this indirect call site?", etc., which can help
human security analysts understand the workings of COTS components,
plugins, mobile code, and DLLs, as well as memory snapshots of worms
and virus-infected code.
[More]
The TSL system (TSL stands for Transformer Specification Language)
provides a framework for creating
correct-by-construction implementations of the
state-transformation functions needed in
state-space-exploration tools that analyze machine code.
From a single specification of the concrete
semantics of a machine-code instruction set, TSL automatically
generates state-transformation functions needed for
static analysis, dynamic analysis, symbolic analysis, or
any combination of the three.
The generated implementations are guaranteed to be mutually
consistent, and also to be consistent with an instruction-set
emulator that is generated from the same specification of the
concrete semantics.
[More]
Our work on weighted pushdown systems led to a new framework for
interprocedural dataflow analysis that supports a strictly richer set
of queries than previous approaches
(a problem with a 35-year history). We have also
shown that WPDSs are a useful formalism in a completely separate
application domain, namely, access control of shared computing
resources in distributed systems.
[More]
Since 2002, I have been interested in a connection
between abstract interpretation and logic—in
particular, how to harness decision procedures to obtain algorithms
for several fundamental primitives used in abstract interpretation.
In [RSY04],
Mooly Sagiv, Greta Yorsh, and I gave the first method that makes
such a connection: we showed that, for a broad class of situations, given
a formula φ in some logic L, it is possible to find the best value a in
a given abstract domain Abs that over-approximates the
meaning of φ (i.e., [[φ]] ⊆ γ(a)).
Since 2011, my student Aditya Thakur and I have been pursuing
several new insights on this fundamental question.
[More]
Our work on policy weaving focuses on a declarative approach to
system security.
In our approach, a policy specification is a first-class notion—e.g.,
the policy specification for a given application is a single
document, so that all elements of the policy are gathered together in
one place, rather than scattered throughout the codebase.
A security policy is implemented in terms of a collection of
code snippets that invoke policy-enforcement primitives at different
points in the program;
however, users are freed from the task of introducing such snippets
manually by a policy-weaving tool, which takes as input
(i) a program P, and
(ii) a policy R (written in some suitable policy-specification language),
and rewrites P to enforce R.
[More]
Analysis of Executables
In our work on static analysis of stripped executables, we devised
methods that—without relying on symbol-table or debugging
information—recover intermediate representations similar to
those that a compiler's intermediate phases create for a program
written in a high-level language.
The goal is to provide a platform that an analyst can
use to understand the workings of COTS components, plugins, mobile
code, and DLLs, as well as memory snapshots of worms and
virus-infected code.
Our initial work concerned x86 executables.
Starting from a stripped x86 executable, our algorithms identify
variables, data objects, and types, and track how they are manipulated
by the executable. This information provides answers to such
questions as "what could this dereference operation access?", "what
function could be called at this indirect call site?", etc.
This work required overcoming numerous obstacles.
In particular, to track the flow of values through memory, it is
necessary to determine the set of addresses accessed by each
operation. This is difficult because
- While some memory operations use explicit memory addresses in the
  instruction (easy), others use indirect addressing via address
  expressions (difficult).
- Arithmetic on addresses is pervasive. For instance, even when the
  value of a local variable is loaded from its slot in an activation
  record, address arithmetic is performed.
- There is no explicit notion of type at the hardware level: any
  integer value can be used as an address value.
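The second point above can be made concrete with a small simulation (hypothetical; the register names and frame layout are illustrative): even a plain local-variable load computes its address arithmetically from the frame pointer, so an analysis must reason about integer arithmetic just to discover which memory cell is accessed:

```python
# Simulating the load of a local variable from an activation record.
# On x86, "mov eax, [ebp-8]" reads the slot 8 bytes below the frame
# pointer: the address is *computed*, not named, so the analysis must
# track arithmetic on addresses to resolve the access.

memory = {}          # address -> value
ebp = 0x1000         # frame pointer of the current activation record

# Prologue stores local variable 'n' in its slot.
memory[ebp - 8] = 42

# The load: address arithmetic (ebp - 8), then a dereference.
addr = ebp - 8
eax = memory[addr]
print(eax)           # 42
```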
An additional challenge is that for many programs, symbol-table and
debugging information is entirely absent. Even if it is present,
the executable may be untrustworthy, in which case
symbol-table and debugging information cannot be relied upon.
In 2003, my student Gogul Balakrishnan and I developed a
static-analysis algorithm for x86 machine code that is able to track
the flow of values through memory
[BR04].
The algorithm—called value-set analysis
(VSA)—recovers information about the contents of memory
locations and how they are manipulated by the executable
[BR04,
RBL06,
BR07,
B07,
RB08,
BR10].
A key feature of VSA is that it tracks address-valued and
integer-valued quantities simultaneously. VSA is related to
pointer-analysis algorithms that have been developed for programs
written in high-level languages, which determine an over-approximation
of the set of variables whose addresses each pointer variable can
hold:
VSA determines an over-approximation of the set of addresses that
each data object can hold at each program point.
At the same time, VSA is similar to range analysis and other numeric
static-analysis algorithms that over-approximate the integer values
that each variable can hold:
VSA determines an over-approximation of the set of integer values
that each data object can hold at each program point.
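The following toy (our illustration in Python; VSA's actual abstract domain of strided intervals over memory regions is considerably richer) conveys the flavor of an abstraction that serves for addresses and integers alike:

```python
# Toy "strided interval": the set {lo, lo+s, lo+2s, ..., hi}.
# Such sets can describe address values and integer values uniformly;
# e.g., the addresses touched by a loop over a 4-byte array form a
# stride-4 interval.

from math import gcd

class StridedInterval:
    def __init__(self, stride, lo, hi):
        self.stride, self.lo, self.hi = stride, lo, hi

    def concretize(self):
        if self.stride == 0:
            return {self.lo}
        return set(range(self.lo, self.hi + 1, self.stride))

    def __add__(self, other):
        # Sound (over-approximating) abstract addition.
        return StridedInterval(gcd(self.stride, other.stride),
                               self.lo + other.lo,
                               self.hi + other.hi)

# Base address of an array, plus index offsets 0, 4, 8, ..., 36:
base = StridedInterval(0, 0x4000, 0x4000)
offsets = StridedInterval(4, 0, 36)
addrs = (base + offsets).concretize()
print(sorted(addrs)[:3])   # [16384, 16388, 16392]
```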
We incorporated VSA, along with algorithms to identify
variables, data objects, and types, in CodeSurfer/x86
[BGRT05]—the
first program-slicing tool for machine code that can help
with understanding dependences across memory updates and memory
accesses.
CodeSurfer/x86 is the outcome of a joint project between the Univ. of
Wisconsin and GrammaTech, Inc.
CodeSurfer/x86 makes use of both
IDAPro,
a disassembly toolkit, and GrammaTech's
CodeSurfer
system, a toolkit for building program-analysis and inspection tools.
CodeSurfer/x86 is a tool for code understanding and code inspection that
supports both a graphical user interface (GUI) and a scripting language for
accessing a program's system dependence graph (SDG)
[HRB90],
as well as other information stored in
CodeSurfer's intermediate representations (IRs).
The facilities of CodeSurfer/x86 provide an
analyst with a powerful and flexible platform for investigating the
properties and behaviors of an x86 executable, using either
the GUI, the scripting language, or GrammaTech's
Path
Inspector, which is a tool that uses a sophisticated pattern-matching
engine to answer questions about the flow of execution in a program.
The GUI supports browsing ("surfing") of an SDG, along with
a variety of operations for making queries about the SDG—such as
slicing
[HRB90]
and chopping
[RR95].
The GUI allows a user to navigate through a program's source code
using these dependences in a manner analogous to navigating the World
Wide Web.
CodeSurfer/x86's scripting language provides a programmatic interface to these
operations, as well as to lower-level information, such as the
individual nodes and edges of the program's SDG, call graph, and
control-flow graph, and a node's sets of used, killed, and
possibly-killed variables.
The scripting language can thus be used to extend CodeSurfer/x86's
capabilities: one writes programs that traverse CodeSurfer's IRs to
implement additional program analyses.
Later, we extended CodeSurfer/x86 to create the first automatic
program-verification tool for stripped executables: it allows one to
check that a stripped executable conforms to an API-usage rule
(specified as a finite-state machine)
[BRKLLMGYCT05, BR08].
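Such an API-usage rule can be pictured as follows (a toy of our own, not the tool's code): the rule is a finite-state machine over API events, and verification amounts to showing that no execution can drive the machine into its error state. For example, a "no read after close" rule for a file handle:

```python
# A toy API-usage rule as a finite-state machine:
# "opened" --close--> "closed"; a "read" while "closed" is an error.
# Verification checks that no path through the program drives the
# FSM into its error state.

TRANSITIONS = {
    ("opened", "read"): "opened",
    ("opened", "close"): "closed",
    ("closed", "read"): "error",   # use-after-close
}

def check(trace, state="opened"):
    for op in trace:
        state = TRANSITIONS.get((state, op), state)
        if state == "error":
            return False
    return True

print(check(["read", "close"]))          # True: conforms to the rule
print(check(["read", "close", "read"]))  # False: read after close
```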
More recently, our work on the TSL system
has allowed us to create similar systems for other instruction sets
automatically, starting from a specification of the concrete
semantics of a machine-code instruction set.
Transformer Specification Language (TSL)
In 2008, my student Junghee Lim and I created the TSL system
[LR08a,
LR12],
a meta-tool to help in the creation of tools for analyzing machine
code. (TSL stands for Transformer Specification
Language.)
From a single specification of the concrete
semantics of a machine-code instruction set, TSL automatically
generates correct-by-construction implementations of the
state-transformation functions needed in
state-space-exploration tools that use static analysis, dynamic
analysis, symbolic analysis, or any combination of the three
[LR08a,
LLR09,
LLR09b].
The TSL language is a strongly typed, first-order functional language
with a datatype-definition mechanism for defining recursive datatypes,
plus deconstruction by means of pattern matching. Writing a TSL
specification for an instruction set is similar to writing an
interpreter in first-order ML. For instance, the specification of an
instruction set's concrete semantics is written as a TSL function
state interpInstr(instruction I, state S) {...};
where instruction and state are user-defined
datatypes that represent the instructions and the semantic states,
respectively.
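For readers unfamiliar with TSL's ML-like style, the following Python analogue (hypothetical; all datatype and function names are illustrative) shows the shape of such a specification: user-defined datatypes for instructions and states, and an interpreter defined by cases on the instruction:

```python
# Hypothetical analogue of a TSL concrete-semantics specification:
# user-defined datatypes for instructions and states, plus an
# interp_instr function defined by cases on the instruction form.

from dataclasses import dataclass

@dataclass(frozen=True)
class State:
    regs: tuple                  # e.g., (('eax', 5), ('ebx', 7))

@dataclass(frozen=True)
class Mov:
    dst: str
    src: str

@dataclass(frozen=True)
class AddImm:
    dst: str
    imm: int

def get(state, r):
    return dict(state.regs)[r]

def set_(state, r, v):
    regs = dict(state.regs)
    regs[r] = v
    return State(tuple(sorted(regs.items())))

def interp_instr(instr, state):
    # One case per instruction form, as in the TSL interpInstr function.
    if isinstance(instr, Mov):
        return set_(state, instr.dst, get(state, instr.src))
    if isinstance(instr, AddImm):
        return set_(state, instr.dst, get(state, instr.dst) + instr.imm)
    raise ValueError("unknown instruction")

s = State((("eax", 1), ("ebx", 7)))
s = interp_instr(Mov("eax", "ebx"), s)
s = interp_instr(AddImm("eax", 3), s)
print(get(s, "eax"))   # 10
```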
TSL's meta-language provides a fixed set of base-types; a fixed set of
arithmetic, bitwise, relational, and logical operators; and a facility
for defining map-types. The meanings of the meta-language constructs can be
redefined by supplying alternative interpretations of them.
When semantic reinterpretation is performed in this
way—namely, on the operations of the meta-language—it
is independent of any given instruction set. Consequently,
once a reinterpretation has been defined that reinterprets
TSL in a manner appropriate for some state-space-exploration
method, the same reinterpretation can be applied to
each instruction set whose semantics has been specified
in TSL.
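The reinterpretation idea can be sketched as follows (our own Python illustration, not TSL itself): the instruction semantics is written once against a small set of meta-operations, and supplying a different implementation of those operations retargets the same semantics, e.g., from concrete execution to interval analysis:

```python
# Semantic reinterpretation in miniature: the instruction semantics is
# written once, against abstract "meta-operations" (const, add).
# Supplying a different implementation of the meta-operations retargets
# the same semantics -- here, concrete execution vs. interval analysis.

class Concrete:
    def const(self, n): return n
    def add(self, a, b): return a + b

class Intervals:
    def const(self, n): return (n, n)
    def add(self, a, b): return (a[0] + b[0], a[1] + b[1])

def interp_add_imm(ops, val, imm):
    # Shared semantics of "add register, imm": written once,
    # meaningful under every interpretation of the meta-operations.
    return ops.add(val, ops.const(imm))

print(interp_add_imm(Concrete(), 5, 3))         # 8
print(interp_add_imm(Intervals(), (0, 10), 3))  # (3, 13)
```

Because the reinterpretation touches only the meta-operations, the same pair of classes would serve unchanged for any instruction set specified against them.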
The reinterpretation mechanism causes TSL to act as a
"Yacc-like" tool for automatically generating correct-by-construction
implementations of the state-transformation functions needed in
state-space-exploration tools.
Moreover, because all components are generated from
a single specification of the concrete
semantics of an instruction set,
the generated implementations are guaranteed to be mutually
consistent (and also to be consistent with an instruction-set emulator
that is generated from the same specification of the concrete
semantics).
Weighted Pushdown Systems and their Application to Interprocedural Dataflow Analysis
This work has established new connections between interprocedural
dataflow analysis and model checking of pushdown systems (PDSs).
Various connections between dataflow analysis and model checking
have been established in past work,
e.g., [S91,
S93,
S98,
EK99,
CC00];
however, with one exception ([EK99]),
past work has shed light only on the relationship between
model checking and bit-vector dataflow-analysis problems, such
as live-variable analysis and partial-redundancy elimination.
In contrast, the results obtained in our work
[RSJ03,
RSJM05]
apply to
(i) bit-vector problems, (ii) the one non-bit-vector problem
addressed in [EK99],
as well as (iii) certain dataflow-analysis problems that cannot be expressed
as bit-vector problems,
such as linear constant propagation
[SRH96]
and affine-relation analysis
[MOS04].
In general, the approach can be applied
to any distributive dataflow-analysis problem for which the
domain of transfer functions has no infinite descending chains.
(Safe solutions are also obtained for problems that are
monotonic but not distributive.)
The contributions of the work can be summarized as follows:
- Conventional dataflow-analysis algorithms merge together the
  values for all states associated with the same program point,
  regardless of the states' calling context.
  With the dataflow-analysis algorithm obtained via weighted PDSs,
  dataflow queries can be posed with respect to a regular language
  of stack configurations.
  Conventional merged dataflow information can also be obtained
  by issuing appropriate queries;
  thus, the new approach provides a strictly richer framework
  for interprocedural dataflow analysis than is provided by
  conventional interprocedural dataflow-analysis algorithms.
- Because the algorithm for solving path problems in weighted PDSs
  can provide a witness set of paths, it is possible to provide
  an explanation of why the answer to a dataflow query has
  the value reported.
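To convey the flavor of a non-bit-vector weight domain (a hand-rolled toy, not code from the libraries mentioned below), linear-constant-propagation weights can be modeled as affine transfer functions, composed along paths ("extend") and joined across alternative paths ("combine"):

```python
# Toy weight domain for linear constant propagation, in the WPDS style:
# a weight is an affine transfer function f(x) = a*x + b, encoded as
# (a, b), or UNKNOWN.  "extend" composes weights along a path;
# "combine" joins the weights of alternative paths.

UNKNOWN = "unknown"

def extend(f, g):
    # Path composition: apply f, then g, i.e., (g o f)(x) = g(f(x)).
    if f == UNKNOWN or g == UNKNOWN:
        return UNKNOWN
    (a1, b1), (a2, b2) = f, g
    return (a2 * a1, a2 * b1 + b2)

def combine(f, g):
    # Join over alternative paths: keep the weight only if both agree.
    return f if f == g else UNKNOWN

# Path 1: x = x + 1, then x = 2*x   =>  x -> 2x + 2
p1 = extend((1, 1), (2, 0))
# Path 2: x = 2*x, then x = x + 2   =>  x -> 2x + 2
p2 = extend((2, 0), (1, 2))
print(combine(p1, p2))   # (2, 2): both paths induce x -> 2x + 2
```

The key property is distributivity: combine-then-extend equals extend-then-combine, which is what lets the WPDS saturation algorithm summarize paths without losing precision.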
The algorithms described in
[RSJ03,
RSJM05], along with
the improvements described in
[LBR05,
LR06,
LKRT07,
LR08b],
have been implemented in three libraries for use in creating
program-analysis tools (one implemented in C
[Schwoon04],
and two implemented in C++
[KRML04,
KLR07]).
These libraries have been used to create prototype implementations of
context-sensitive interprocedural dataflow analyses for uninitialized
variables, live variables, linear constant propagation, and the
detection of affine relationships.
We have also used pushdown systems to create a novel algorithm for program slicing
[AHJR12].
Weighted pushdown systems are also a useful formalism in a completely
separate application domain, namely, access control of shared computing resources
in distributed systems
[JR02,
SJRS03,
JR04,
JSWR06,
WJRSS06].
Symbolic Methods for Abstract Interpretation
Since 2002, I have been interested in symbolic methods for abstract
interpretation. The work aims to bridge the gap between (i) the use of logic
for specifying program semantics and performing program analysis, and
(ii) abstract interpretation
[RSY04,
YRS04,
TR12a,
TR12b,
TER12,
ELSAR13,
TLLR13].
In [RSY04],
Mooly Sagiv, Greta Yorsh, and I gave the first method that makes
such a connection: we showed that, for a broad class of situations, given
a formula φ in some logic L, it is possible to find the best value a in
a given abstract domain Abs that over-approximates the
meaning of φ (i.e., [[φ]] ⊆ γ(a)).
We call this problem symbolic abstraction.
If you think of logic L as being a rich language for
expressing meanings, and abstract domain Abs as
corresponding to a logic fragment L' that is less
expressive than L, then symbolic abstraction addresses a
fundamental approximation problem:
Given φ ∈ L, find the strongest
consequence of φ that is expressible in logic fragment
L'.
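The algorithm of [RSY04] can be sketched as a solver-driven loop (below, a brute-force search over a tiny universe stands in for a real decision procedure, and the interval domain stands in for Abs):

```python
# Sketch of the RSY symbolic-abstraction loop: start from bottom, and
# while the "solver" can find a model of phi outside gamma(a), join
# that model's abstraction into a.  A brute-force search over a small
# universe stands in for a real decision procedure.

UNIVERSE = range(-20, 21)

def alpha_hat(phi):
    a = None                               # bottom of the interval domain
    while True:
        # "Solver" query: is there a model of phi not covered by gamma(a)?
        m = next((x for x in UNIVERSE
                  if phi(x) and (a is None or not a[0] <= x <= a[1])),
                 None)
        if m is None:
            return a                       # a over-approximates [[phi]]
        # Join beta(m) = [m, m] into a.
        a = (m, m) if a is None else (min(a[0], m), max(a[1], m))

# phi(x): x is even and 3 <= x*x <= 50; models are {-6, -4, -2, 2, 4, 6}
best = alpha_hat(lambda x: x % 2 == 0 and 3 <= x * x <= 50)
print(best)   # (-6, 6): the best interval over-approximating the models
```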
Since 2011, my student Aditya Thakur and I have been pursuing the
insight that generalized versions of an old, and not widely used, method
for validity checking of propositional-logic formulas, called
Stålmarck's method, hold great promise for symbolic abstraction.
The symbolic methods for abstract interpretation that we have
subsequently developed based on this insight
[TR12a,
TR12b,
TER12,
TLLR13]
offer much promise for building more powerful program-analysis tools.
These methods (i) allow more precise implementations
of abstract-interpretation primitives to be created, and (ii) drastically
reduce the time taken to implement such primitives because correctness
is ensured by construction.
In particular, they provide a way to create correct-by-construction abstract
interpreters that also attain the fundamental limits on precision that
abstract-interpretation theory establishes.
For instance, in [TLLR13],
we described a systematic method for implementing an abstract interpreter
that solves the following problem:
Given program P, and an abstract domain A,
find the most-precise inductive A-invariant for P.
Moreover, our generalized-Stålmarck methods are "dual-use":
In addition to providing methods for building improved
abstract-interpretation tools, they also provide methods for building
improved logic solvers that use abstract interpretation to speed up
the search that a solver carries out.
In particular, we showed how Stålmarck's method is an instantiation of
a generic framework parameterized by an abstract domain over Booleans;
consequently, different instantiations of the framework lead to new
decision procedures for propositional logic
[TR12b].
Moreover, this abstraction-based view allows us to lift
Stålmarck's method from propositional logic to richer logics:
to obtain a method for richer logics, instantiate the parameterized
version of Stålmarck's method with richer abstract domains
[TR12a].
We call such a decision-procedure design, which is parameterized by an abstract
domain, a Satisfiability Modulo Abstraction (SMA) solver
[RT14].
Recently, we created an SMA solver for separation logic
[TBR14].
Separation logic (SL) is an expressive logic for reasoning about
heap structures in programs, and provides a mechanism for concisely describing program
states by explicitly localizing facts that hold in separate regions of
the heap.
SL is undecidable in general, but by using an abstract domain of
shapes [SRW02]
we were able to design a semi-decision procedure for SL.
Policy Weaving
System security is a cross-cutting issue.
Unfortunately, it is hard to devise a formalism that allows
system-wide policies to be both (i) easily understood and (ii) easily
implemented and maintained. In particular, when the elements of
security policies are expressed directly in the code—e.g., via
calls on security primitives, as in Decentralized Information Flow
Control (DIFC) systems, such as Asbestos, HiStar, and Flume, or in
capability-based systems, such as Capsicum—the
policy implementation is spread throughout the codebase of an
application. With such a structure, it is difficult to understand what
overall policy the application actually implements, which complicates
documenting the policy, debugging the policy, and updating the policy
as the codebase evolves in response to changes in requirements and the
threat landscape.
In contrast, the focus of our work is on a declarative approach to system security.
In our approach, a policy specification is a first-class notion—e.g.,
the policy specification for a given application is a single
document, so that all elements of the policy are gathered together in
one place, rather than scattered throughout the codebase.
A security policy is still implemented in terms of a collection of
code snippets to invoke policy-enforcement primitives at different
points in the program.
However, users are freed from the task of introducing such snippets
manually.
The goal of our work is to create tools to support policy
weaving, which will handle the job of rewriting the program to place
code at appropriate locations so that suitable security primitives are
invoked at appropriate moments to enforce the policy.
A policy-weaving tool takes as input
(i) a program P, and
(ii) a policy R (written in some suitable policy-specification language),
and rewrites P to enforce R.
A policy-weaving tool supports a declarative approach to security.
The policy R can be stated as a single, easily-understood
document, and it is the obligation of the policy-weaving tool to find
a way to place code for invoking enforcement primitives so that all
executions of P satisfy R.
We also work on meta-tools to support policy weaving.
In particular, one of our goals is to create a policy-weaver generator.
Such a tool takes as input specifications of
- the semantics of a programming language L
- the semantics of the policy-enforcement primitives to be used
and generates a policy weaver.
Our vision is similar to aspect-oriented programming (in that a
cross-cutting issue is specified separately and
then automatically woven into the fabric of the
implementation).
However, the approach that we envision goes beyond
aspect-oriented programming in an important way: whereas conventional
aspect-oriented programming systems use syntactically-oriented
mechanisms and allow aspect code to be placed in a rather limited set
of locations, our approach is based on programming-language and
policy semantics to identify automatically
the program locations where policy-enforcement primitives need to be placed.
Our approach also goes beyond simple finite-state reference monitors:
policies have visibility into such entities as
processes, principals, and communication channels.
A key challenge is to design a policy weaver that can reason about the
potentially-unbounded set of states that a program may reach, and to
choose system primitives and parameters from a potentially unbounded
number of options to keep the program from reaching states that constitute a
violation of the policy provided by the programmer.
Our approach to designing a weaver generator is based on the theory of
two-player safety games.
Such a game is played by two players, called the Attacker and the
Defender, on a graph in which each node is labeled as an Attacker
node or a Defender node.
The state of the game is a node of the graph.
The goal of the Attacker is to reach an accepting node;
the goal of the Defender is to prevent the Attacker from ever
reaching an accepting node.
The game is played in turns:
on each turn, if the state is an Attacker state, then the Attacker
can change the state to any adjacent node of the graph, and
analogously for Defender states and Defender-initiated state changes.
A winning Defender strategy for a game is a
decision tree of choices that the Defender can make so that an
accepting state is never reached.
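The computation of a winning region can be sketched as a classical attractor fixpoint (a generic illustration, not the weaver's actual algorithm): compute the set of nodes from which the Attacker can force a visit to an accepting node; the Defender wins from every node outside that set:

```python
# Solving a two-player safety game by computing the Attacker's
# attractor: the set of nodes from which the Attacker can force a
# visit to an accepting node.  The Defender wins everywhere else.

def attacker_wins(nodes, edges, attacker_nodes, accepting):
    succ = {n: [v for (u, v) in edges if u == n] for n in nodes}
    win = set(accepting)
    changed = True
    while changed:
        changed = False
        for n in nodes:
            if n in win:
                continue
            outs = succ[n]
            if n in attacker_nodes:
                # Attacker needs just one move into the winning set.
                forced = any(v in win for v in outs)
            else:
                # Defender is trapped only if every move loses.
                forced = bool(outs) and all(v in win for v in outs)
            if forced:
                win.add(n)
                changed = True
    return win

# Tiny game: from Attacker node 'a' both moves exist, but the Defender
# at 'd' can always step to the safe node 's' instead of 'bad'.
nodes = {"a", "d", "s", "bad"}
edges = {("a", "d"), ("a", "s"), ("d", "s"), ("d", "bad"), ("s", "s")}
win = attacker_wins(nodes, edges, {"a"}, {"bad"})
print("a" in win)   # False: the Defender has a winning strategy from 'a'
```

The Defender's winning strategy is read off directly: at each Defender node outside the attractor, pick any successor that is also outside it.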
The core of a policy weaver created by our weaver generator is an
algorithm to find a winning Defender strategy
[HJR12].
A policy weaver generated by the policy-weaver generator works as follows:
given program P and policy R, the policy weaver constructs a game in
which
- the Attacker represents the threat model (i.e., all possible system attacks),
- the Defender represents possible instrumentations of the program, and
- accepting nodes represent system states reached when the instrumented program
violates the policy.
If the weaver finds a winning Defender strategy for the game, it uses
the strategy to instrument the program to satisfy the policy.
We have also been working on applying policy weaving to JavaScript
programs.
JavaScript is the most popular language used to create dynamic content
for web browsers.
Moreover, the nature of the Web has resulted in an ecosystem in which
JavaScript programs and libraries, often written by untrusted third
parties, are regularly incorporated into applications and combined in
every way imaginable.
While this practice has obvious benefits in terms of convenience and
not "reinventing the wheel," it creates glaring security holes because
of a complete lack of isolation between modules in a JavaScript
environment.
We chose this context in which to apply our policy-weaving techniques
because of the clear, unmet need for a security mechanism, and because
of the challenge presented by JavaScript's highly dynamic, untyped
nature.
We have created the JAM policy weaver [FJJRPSY12],
coupled with the JAMScript enforcement mechanism.
("JAM" stands for JAvaScript Model checker.)
JAM can be used by a website administrator to apply a security
policy—in the form of a finite-state automaton—that
specifies a set of disallowed execution traces of a JavaScript
program.
The program is analyzed statically with respect to the policy, and an
instrumented program that is guaranteed to not violate the policy
specification is produced.
This analysis consists of a counterexample-guided,
abstraction-refinement loop that repeatedly creates an incrementally
more precise abstract model of the program.
Because this model represents a superset of the program's actual
behaviors, we can say that any execution traces that are not
realizable in the model also cannot happen in the concrete program.
Static analysis can be expensive, especially in the context of a
highly dynamic language such as JavaScript.
Thus, the algorithm used in JAM provides for a system-resource cutoff threshold,
which terminates the analysis when the threshold is reached.
This case still results in a safely instrumented program, although the
program may contain some amount of spurious instrumentation, resulting
in extra runtime overhead.
In this way, JAM can be tuned to balance the trade-off between
static-analysis overhead and dynamic overhead.
The instrumentation that JAM weaves into a JavaScript program comes in
the form of transactions.
Potentially malicious statements in the untrusted code (i.e., those
that could not be proven safe by the static analysis) are wrapped in
transaction blocks, which are speculatively executed in a manner that
records all actions (reads, writes, and function calls) and suppresses
all effects of the actions.
This approach allows the effects of the statements to be evaluated
with respect to the security policy, at run-time inside the transaction.
The program is terminated—without the effects of the actions
being visible externally—if it is determined that any prefix
of the sequence of actions will violate the policy.
If no policy-violating action sequence is observed, the transaction is
committed, and execution continues as before.
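The mechanism can be modeled roughly as follows (a simplified sketch of ours; JAMScript's actual implementation is more involved): actions are logged, writes are buffered, every prefix of the log is checked against the policy, and the buffered effects become visible only on commit:

```python
# Simplified model of JAMScript-style transactions: actions inside the
# transaction are recorded and their write effects buffered; every
# prefix of the log is checked against the policy, and the writes
# become visible only if the transaction commits.

def run_transaction(actions, policy_violates, store):
    log, buffered = [], {}
    for act in actions:
        log.append(act)
        if policy_violates(log):        # check each prefix as it grows
            return "aborted", store     # suppress all buffered effects
        if act[0] == "write":
            _, var, val = act
            buffered[var] = val         # effect held back, not visible
    store = dict(store, **buffered)     # commit: effects become visible
    return "committed", store

# Example policy: no write to 'cookie' after a call to 'send'.
def violates(log):
    sent = False
    for act in log:
        if act == ("call", "send"):
            sent = True
        elif act[0] == "write" and act[1] == "cookie" and sent:
            return True
    return False

print(run_transaction([("write", "x", 1)], violates, {}))
# ('committed', {'x': 1})
print(run_transaction([("call", "send"), ("write", "cookie", 0)],
                      violates, {}))
# ('aborted', {})
```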