Section 1

Fall 2016

**Toward a Static Analysis of Programs with Postselection via Pseudodistributions**

*Andrew Morgan*

Postselection is the power of a randomized algorithm to make
assertions (in the programming sense) about its randomness, and one is
typically interested in how such a program behaves when the randomness
is *conditioned* on passing all the assertions. We investigate a
static analysis framework for a simple programming language, with no
loops and where all variables take values in {0,1}. The goal of the
analysis is to compute the expected value of low-degree polynomials of
the program variables, conditioned on all assertions passing. This
could be done by computing all the low-degree moments of the
distribution.

In order to make this problem tractable, we relax the need to compute the exact moments, and instead aim to compute a family of "plausible" moments---moments that may not correspond to any true distribution, but nevertheless "look right" from the perspective of low-degree polynomials. For instance, if the relationship "Z == X && Y" holds, then the equation E[Z] = E[XY] of moments should hold true. These "pseudomoments" correspond to the notion of a pseudodistribution used in the Lasserre semidefinite programming hierarchy.
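As a concrete, hypothetical illustration of the moment constraint above: on {0,1}-valued variables, Z == X && Y means z = xy pointwise, so E[Z] = E[XY] holds automatically for any true distribution; a pseudodistribution is only required to respect such low-degree identities (together with positivity of the moment matrix) without necessarily being a distribution. A minimal Python sketch of the check, on an invented example distribution:

```python
# Sketch: checking the degree-2 moment constraint E[Z] = E[XY]
# for a program where Z == X && Y, over an explicit distribution on {0,1}^2.
from itertools import product

def moments(dist):
    """Given dist: {(x, y): prob}, return (E[Z], E[XY]) with Z = X AND Y."""
    e_z = sum(p * (x & y) for (x, y), p in dist.items())
    e_xy = sum(p * x * y for (x, y), p in dist.items())
    return e_z, e_xy

# Uniform distribution over {0,1}^2: E[Z] = E[XY] = 1/4.
uniform = {(x, y): 0.25 for x, y in product([0, 1], repeat=2)}
e_z, e_xy = moments(uniform)
assert abs(e_z - e_xy) < 1e-12
```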

In this work, we make partial but significant progress toward realizing such a framework, including (1) a generic algorithm that reduces the static analysis problem to devising an appropriate "gluing" operation and a choice of some parameters, as well as (2) a complete parameterization of all the choices available to the "gluing" operation for degree-2 pseudodistributions for read-once programs, among some other ideas.

**Generating Ranking Functions for Large Programs**

*Zi Wang*

This project investigates automating the Floyd/Naur/Hoare (FNH) proof method for large programs. The FNH method is used to verify the termination of semialgebraic programs. Floyd (1967) proposed the ranking-function method, which can be used to prove the termination of a loop program. Subsequently, researchers have investigated how to generate ranking functions automatically. Podelski and Rybalchenko (2004) proposed a complete method for generating linear ranking functions, and Cousot (2005) proposed a method for synthesizing quadratic ranking functions based on semidefinite programming (SDP).
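As a toy illustration of the ranking-function method (the loop and candidate function below are an invented example, not from the talk): for the loop `while x > 0: x = x - 2`, the candidate f(x) = x is positive on the guard and decreases by at least one per iteration, which certifies termination. The two conditions can be spot-checked on sampled states:

```python
# Sketch of Floyd's ranking-function argument for the invented loop
#   while x > 0: x = x - 2
# Candidate ranking function f(x) = x: bounded below on the guard and
# strictly decreasing across the body, which certifies termination.

def guard(x):
    return x > 0

def body(x):
    return x - 2

def f(x):
    return x

# Check the two ranking conditions on a sample of states.
for x in range(-5, 100):
    if guard(x):
        assert f(x) > 0                  # bounded below on the guard
        assert f(body(x)) <= f(x) - 1    # decreases by at least 1
```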

In this project, we studied the synthesis of ranking functions using a recent optimization technique, the Alternating Direction Method of Multipliers (ADMM). Compared to a common SDP solver (which uses an interior-point algorithm), ADMM offers relatively weak guarantees of optimality; however, its running time is significantly lower than that of the interior-point method. ADMM has recently become widely used in the machine learning and big-data communities because of its scalability. Moreover, since our problem is a feasibility problem, an optimal solution is not necessary. ADMM is therefore well suited to synthesizing ranking functions for large programs.
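To give a feel for ADMM used as a feasibility solver, here is a minimal invented example (not the talk's actual formulation): finding a point in the intersection of two intervals by alternating projections coordinated through a scaled dual variable.

```python
# Toy sketch of ADMM as a feasibility solver: find a point in the
# intersection of A = [3, 10] and B = [0, 4] (i.e., in [3, 4]).

def proj(lo, hi, v):
    """Projection onto the interval [lo, hi]."""
    return max(lo, min(hi, v))

x = z = u = 0.0
for _ in range(50):
    x = proj(3, 10, z - u)   # x-update: project onto A
    z = proj(0, 4, x + u)    # z-update: project onto B
    u = u + (x - z)          # scaled dual update

assert 3 <= x <= 4 and abs(x - z) < 1e-9
```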

**Partial Evaluation of M(UMPS) Scripts**

*Luke Matthews*

The M(UMPS) programming language is ubiquitous in the Electronic Medical Records software industry and largely unknown outside of it. The language has a number of interesting features: it is dynamically typed, allows variables to be used without being declared, supports inter-process data structures, has significant whitespace, supports some indirection, and lacks reserved words (relying on context).

For this project, I implemented an online partial evaluator for a subset of the M programming language's constructs. I utilized an existing M-function capable of generating an AST for an input M-Routine (a collection of functions and subroutines). The evaluator walks the AST and tracks each variable's static/dynamic state according to the language's semantics; it handles enough language constructs to support evaluation of non-trivial M-programs.
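A minimal sketch of the online static/dynamic bookkeeping, on an invented toy expression language rather than the real MUMPS AST: static subexpressions are folded eagerly, while anything depending on a dynamic variable is residualized.

```python
# Sketch of online partial evaluation for a tiny expression language.
# Expressions: ('const', n), ('var', name), ('add', e1, e2).

def specialize(expr, static_env):
    kind = expr[0]
    if kind == 'const':
        return expr
    if kind == 'var':
        name = expr[1]
        if name in static_env:                 # static: replace by its value
            return ('const', static_env[name])
        return expr                            # dynamic: residualize
    if kind == 'add':
        a = specialize(expr[1], static_env)
        b = specialize(expr[2], static_env)
        if a[0] == 'const' and b[0] == 'const':
            return ('const', a[1] + b[1])      # both static: fold now
        return ('add', a, b)                   # otherwise emit residual code
    raise ValueError(kind)

# (x + (2 + 3)) with x dynamic specializes to (x + 5).
prog = ('add', ('var', 'x'), ('add', ('const', 2), ('const', 3)))
assert specialize(prog, {}) == ('add', ('var', 'x'), ('const', 5))
```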

The presentation will provide a quick overview of the M-language, the subset of features supported by the evaluator, a discussion of the evaluator's design and the motivations behind the choices, and potential strategies for evaluating the remaining language features.

**Evaluation Strategies for SQL Queries**

*Hakan Memisoglu*

Over the last decade, databases have come to be expected to handle more sophisticated workloads, such as millions of transactions per second and sub-second responses to analytical queries running over gigabytes of data. The most widely used database systems cannot handle such heavy workloads, so database systems are undergoing a major revamping: in-memory architectures are replacing the old disk-based architectures to handle these complex transactional and analytical workloads. This major shift forces system designers to cope with a new bottleneck, memory bandwidth, since the previous biggest bottleneck, disk bandwidth, is no longer a concern. Processing huge amounts of data at memory speed requires programmers to use sophisticated lock-free data structures, cache-aware algorithms, and lightweight database components. However, the most important performance aspect of a database system is how it evaluates a given query. In this work, I will give a brief survey of the history of query evaluation and of recent trends (block processing, compilation) for processing modern workloads. I will also compare the performance characteristics of different evaluation techniques.

The first models I will describe are the tuple-level and block-level Volcano processing models. I will then describe how to partially evaluate (compile) queries, and consider the different approaches to compilation. The first is operator-centric query compilation, in which the physical operators in the query plan are compiled separately from templates. I will then discuss the state-of-the-art data-centric approach, in which the query plan is compiled by restructuring the query into a sequence of nested for-loops using a JIT library.
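The contrast can be sketched in a few lines of Python (an invented toy example; real systems work in C++ with an actual JIT): the Volcano model composes pull-based iterators, while the data-centric approach fuses the same plan into one tight loop.

```python
# Toy contrast of tuple-at-a-time Volcano iterators with a "compiled"
# fused loop for: SELECT a + b FROM t WHERE a > 1.
table = [(1, 10), (2, 20), (3, 30)]

# Volcano style: each operator is a pull-based iterator (here, a generator).
def scan(rows):
    yield from rows

def filter_(child, pred):
    for row in child:
        if pred(row):
            yield row

def project(child, fn):
    for row in child:
        yield fn(row)

volcano = list(project(filter_(scan(table), lambda r: r[0] > 1),
                       lambda r: r[0] + r[1]))

# Data-centric style: the same plan restructured into one tight loop,
# which is roughly what a JIT compiler would emit.
fused = []
for a, b in table:
    if a > 1:
        fused.append(a + b)

assert volcano == fused == [22, 33]
```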

Finally, I will briefly describe the implementations I developed for these techniques and compare their runtime performance on a set of simple SQL-like queries. Some assembly and IR snippets of queries produced by the JIT library will also be shown at the end.

**Array Bounds Checking for C Programs Using Static Analysis**

*Sizhou Ma and Jongho Lee*

In this project, we have developed a static-analysis tool for C programs, which aims to verify that a program is free of array out-of-bounds errors, or to generate warnings where potential out-of-bounds accesses might occur. The pentagons abstract domain is used for the analysis. The tool is implemented in C++ with the help of the LLVM framework.
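A minimal sketch of the idea behind the pentagons domain (an invented example, not our actual C++ implementation): each variable carries an interval plus symbolic strict-inequality facts x < y, and an access a[i] into an array of symbolic length n is proved safe either symbolically or from the intervals.

```python
# Sketch of a bounds check in the pentagons domain: intervals per
# variable plus a set of symbolic strict inequalities x < y.
intervals = {'i': (0, 100), 'n': (10, 50)}
strict_lt = {('i', 'n')}          # e.g., learned from the loop guard i < n

def access_safe(idx, length):
    """Is a[idx] safe given an array of symbolic length `length`?"""
    lo, hi = intervals[idx]
    if lo < 0:
        return False              # possible negative index
    if (idx, length) in strict_lt:
        return True               # symbolic i < n proves the upper bound
    return hi < intervals[length][0]  # fall back to the intervals

assert access_safe('i', 'n')      # safe: i >= 0 and i < n held symbolically
```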

Our presentation will be composed of three parts. First, a brief introduction to the pentagons domain will be given. Second, details of how the pentagons domain is used to analyze C source code will be discussed. Finally, some C test code will be shown.

**Extended Access Graphs: A New Tool for Specification Inference and Program Understanding**

*Jordan Henkel*

Programs are complex objects. Due to their complexity, numerous representations, each focusing on specific traits of a program, have been designed over time. Some representations are lightweight and relatively easy to compute (such as CFGs), whereas others are more complex and can take significant time to extract (such as System Dependence Graphs). I will introduce a lightweight, low-complexity structure that focuses on the observed usages of complex data in programs. I label this structure an Extended Access Graph (EAG), as it represents a composite view of "accesses" to structured data in a program. Extended Access Graphs are generated for each procedure in a program, and they can be sliced, given a program element, to obtain finite automata. The languages of these automata describe an over-approximation of the allowable usages of the chosen program element. Furthermore, these finite automata can be combined into Nested Word Automata representing the allowable interprocedural usages of the chosen program element.

The presentation will focus on motivating the usefulness of Extended Access Graphs and outlining how they are computed in the intraprocedural case. The beginning of the talk will introduce Access Paths as the inspiration for the concept of an Access Graph. Following this, relevant background and some challenges related to analysis with Access Paths will be introduced. In addition, my presentation will outline the steps necessary to transform a piece of real C code into an EAG (including some of the practical challenges) and eventually into a finite automaton. The presentation will conclude with an outline of a possible interprocedural extension of this work via the use of Nested Word Automata.

**Solving Recurrences via the Operational Calculus**

*John Cyphert*

In this talk I will present OCRS, an implementation of a symbolic solver for linear, first-order, non-homogeneous recurrences. The basic techniques of this system are based on the operational-calculus algebra defined by Lothar Berg in *Introduction to the Operational Calculus*. OCRS harnesses the properties of this algebra by first converting a given recurrence in elementary algebra into a problem in the domain described by Berg, which OCRS then seeks to solve. If a solution is found, OCRS converts it back to elementary algebra, yielding a closed-form solution to the recurrence.

I will also present the computer-algebra techniques that OCRS uses in the process of solving a recurrence relation, including algebraic expansion, symbolic simplification, and partial-fraction decomposition. I will conclude the talk with some applications of OCRS, such as the computation of loop summaries, as well as potential extensions of OCRS. These extensions would enable the solving of recurrences of arbitrary order, non-linear recurrences, and multivariate recurrences.
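As a hypothetical example of the kind of answer such a solver produces (the specific recurrence below is invented for illustration): the first-order linear non-homogeneous recurrence a(n+1) = c·a(n) + d with c ≠ 1 has the closed form a(n) = c^n·a0 + d·(c^n − 1)/(c − 1), which can be checked against direct iteration.

```python
# Closed form for a(n+1) = c*a(n) + d with a(0) = a0 and c != 1,
# cross-checked against naive iteration of the recurrence.

def closed_form(c, d, a0, n):
    return c**n * a0 + d * (c**n - 1) // (c - 1)

def iterate(c, d, a0, n):
    a = a0
    for _ in range(n):
        a = c * a + d
    return a

# e.g. a(n+1) = 2*a(n) + 1 with a(0) = 0 has closed form a(n) = 2^n - 1.
for n in range(10):
    assert closed_form(2, 1, 0, n) == iterate(2, 1, 0, n) == 2**n - 1
```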

**Abstract Interpretation of Neural Networks**

*Samuel Drews and Ashok Vijaya Mukilan*

Neural Networks are increasingly used to solve complex tasks, but the trained models are largely incomprehensible; consequently, there is growing concern about the disparity between the widespread use of neural networks and our understanding of how they make decisions. To that end, we aim to develop a technique that provides simple descriptions of neural network behavior. Specifically, we focus on over-approximating the set of inputs that result in a specific output for a rectified linear neural network used for binary classification. We approach this problem through the lens of symbolic abstraction.

We solve the general problem of symbolic abstraction in the domain of bounded convex polytopes: given a linear real arithmetic (LRA) formula phi, we want to find the tightest over-approximation of phi in the form of a bounded convex polytope. Because the networks use piecewise-linear (rectified linear) activations, regions in their decision space can be represented as LRA formulae, making this approach applicable. The best over-approximation of a region is found by starting from an initial polytope lower in the lattice than the target and growing it until the target is reached. The technique is sound and complete, it terminates, and it can detect when the LRA formula is unbounded, i.e., when there is no over-approximation in this bounded abstract domain. We evaluate our approach on a number of neural networks learned from online datasets.
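A small invented example of why such regions are LRA-representable: once each ReLU's on/off status is fixed, the network is affine in its input, so each activation pattern carves out a polytope on which the output is given by linear constraints.

```python
# Toy one-hidden-layer ReLU network on 2D inputs (weights invented).
# For a fixed activation pattern, the network agrees with a single
# affine function, so the region is cut out by linear inequalities.
W = [(1.0, -1.0), (0.5, 1.0)]     # hidden-layer weights (2 neurons)
b = [0.0, -1.0]                   # hidden-layer biases
V = (1.0, -2.0)                   # output weights

def pre(j, x):
    """Pre-activation of hidden neuron j."""
    return W[j][0] * x[0] + W[j][1] * x[1] + b[j]

def net(x):
    return sum(V[j] * max(0.0, pre(j, x)) for j in range(2))

def pattern(x):
    """Which neurons are 'on' at input x."""
    return tuple(pre(j, x) > 0 for j in range(2))

# Two inputs with the same activation pattern lie in the same polytope,
# on which the network equals one affine function of the input.
x1, x2 = (2.0, 0.5), (3.0, 1.0)
assert pattern(x1) == pattern(x2) == (True, True)
lin = lambda x: V[0] * pre(0, x) + V[1] * pre(1, x)  # affine on this region
assert abs(net(x1) - lin(x1)) < 1e-12
assert abs(net(x2) - lin(x2)) < 1e-12
```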

**Program Synthesis on MapReduce**

*Haruki Yamaguchi*

As Big Data has become a societal necessity, the use of distributed systems to process such information has become widespread. The focus of this project is to ease the creation of programs for distributed systems; in particular, program synthesis for MapReduce. In contrast with [1], which takes a higher-order sketch as the user input, this project takes tables (i.e., an SQL database) as part of the input, in order to let users with little knowledge of MapReduce bypass writing actual code and instead provide only a set of small sample inputs, tables, and outputs. The fundamental idea is to map the input to the table and/or to apply predefined basic aggregations to produce the corresponding outputs. Consequently, this project mainly applies approaches from data science, such as data extraction, matching, and integration.
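A toy sketch of the enumerative flavor of this idea (the table, output, and aggregation library below are invented for illustration): given a sample input and the desired output, search a small library of predefined aggregations for one that reproduces the output.

```python
# Invented example: which predefined aggregation explains the user's
# input/output pair?
table = [3, 1, 4, 1, 5]
desired_output = 14

aggregations = {
    'sum': sum,
    'count': len,
    'max': max,
    'min': min,
}

# Keep every aggregation consistent with the example.
matches = [name for name, fn in aggregations.items()
           if fn(table) == desired_output]
assert matches == ['sum']
```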

[1] Calvin Smith, Aws Albarghouthi: MapReduce program synthesis. PLDI 2016: 326-340

**Extracting Network Signatures from Malware Automatically**

*Kausik Subramanian and Arjun Gurumurthy*

Malware is one of the primary threats to modern-day networks. While the most prevalent form of malware detection is endhost-based (e.g., matching code hashes), a detection mechanism integrated directly into the network offers a much-needed additional layer of protection. Toward this goal, we propose automatic generation of accurate network signatures of malware (informally, a network signature is the set of messages received by the malware to initiate attacks) using program analysis techniques and SMT solvers. These network signatures can be used by intrusion detection systems to detect malware traffic in the network. We present some preliminary work toward this goal.

**An Interactive Feedback Generator for Computer Science Students**

*Qinheping Hu and Michael Vaughn*

One of the main challenges computer science educators face is grading and providing constructive feedback to students in a timely manner. With the increasing enrollment in CS programs, as well as the advent of MOOCs, the prospect of using program repair techniques has become the subject of a growing body of research. In this work we study and implement a debugger that automatically repairs programs.

Unlike existing repair tools, which depend on instructor-supplied specifications, our repair tool lets students interactively discover bugs and obtain repair suggestions that are semantically and syntactically similar to the student's original program. We will first present background on the synthesis tool Sketch and the program repair scheme Qlose. With this background, we will then illustrate the built-in interpreter for converting a buggy Java program into a Sketch script whose synthesis results correspond to possible repairs of the buggy program.

We will next describe improvements we made to the implementation of our debugger. For our qualitative evaluation, we will demonstrate a linear-combination based expression-repair scheme, as well as a technique for handling interprocedural repairs. For the quantitative evaluation, we will discuss several techniques for reducing the size of the search space, including dynamic slicing as well as several heuristic-based dynamic analysis techniques which we believe are a good fit for programs commonly implemented by first-year CS students.
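A toy version of the linear-combination repair idea (the tests and coefficient grid below are invented for illustration): a buggy expression over x and y is replaced by a hole c0 + c1·x + c2·y, and small coefficient choices are enumerated until all tests pass.

```python
# Invented example: repairing the body of an average-of-two function by
# searching over linear combinations c0 + c1*x + c2*y against tests.
from itertools import product

tests = [((2, 4), 3), ((0, 10), 5), ((6, 6), 6)]   # ((x, y), expected)

def repairs():
    """Yield coefficient triples that satisfy every test."""
    for c0, c1, c2 in product([0, 0.5, 1, 2], repeat=3):
        if all(c0 + c1 * x + c2 * y == want for (x, y), want in tests):
            yield (c0, c1, c2)

# The intended repair (x + y) / 2 corresponds to (0, 0.5, 0.5).
assert (0, 0.5, 0.5) in set(repairs())
```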