Project Midpoint Review: Do you really need to read() all of that?

URL: http://pages.cs.wisc.edu/~tjepsen/flow/progress.html

In the initial proposal, I said I would accomplish the following milestones in the first 8 weeks:

I think I have started working on these milestones in the reverse order. I will outline what I have accomplished so far, and then explain what is left to be done for the skipped milestones.

Accomplishments

I made a tool (aka a telescope) to track data flowing into a program from a file on disk. I track the data at a byte granularity. For each byte, I record the:

The tool is written in C++ using Intel's dynamic binary instrumentation tool, Pin. There are about 500 LOC in my implementation.
An important design decision (that I am still sorting out) is when to create a snapshot of the program's memory state. Right now I create a snapshot when:

Below are some visualizations of the tool's output.


Tail a file

Here I read the last 3 bytes of a file using the UNIX tail command:
tail -c 3 ./alphanum.txt
This is the last snapshot of memory. I was interested to see that tail didn't read the whole file into memory; instead it called lseek() to jump near the end of the file and read only the last few bytes. On the left-hand side are the bytes as they appear in the file, and on the right is where they flowed to in the program's memory. Note that the colors have no significance; they are just there as a visual aid.

Click here to view it interactively

Multiple bytes flowing to same location

I wrote a C program to read() two bytes from a file, and add them together, saving the result in memory. Here you can see that the two bytes flow from the disk into the same location in memory.

char a, b;
read(fd, &a, 1);   /* fd: an already-open file descriptor */
read(fd, &b, 1);
char x = a + b;    /* both tainted bytes flow into x */
Click here to view it interactively

Parsing a key=value file

I wrote a simple C program to load a file that looks like this:

key1=value1
key2=value2
needle=haystack
foo=bar

and store the key/value pairs in an array of structs. Here is the state of memory at snapshot #6, when a tainted byte was stored by a mov instruction.

Click here to view it interactively

Parsing an XML file with xmllint

Here I use xmllint to load an XML file and evaluate an XPath expression on it:
xmllint --xpath "//A/C/text()" ./file.xml

Click here to view it interactively

Which instructions access a byte from disk?

Now we look at data from disk as it is accessed by the program's instructions. On the left is the location of the bytes in the file, and on the right is the address of the instruction in the program.


Parsing a key=value file

Here we can see that some instructions only access the key, while some only access the value:

Click here to view it interactively


What next?

Better Debugging via Output Tracing and Callstack-Sensitive Slicing examines the output of a program and determines the 'source' (callstack) of each output byte. I think it would be interesting to do the inverse: find the program slice that each input byte generates.

Since we know what instructions touch each byte, we know which bytes affect control flow. One thing that may be interesting to know is how much you can change the input without affecting the program's behaviour. To determine this, I propose finding constraints on each byte of input. For example, a constraint on the byte '<' in an XML file is that it has to equal '<'. Maybe a constraint on other bytes in the XML file is that they cannot equal '<'.

In A file is not a file: understanding the I/O behavior of Apple desktop applications, Harter et al. examine file access patterns. In particular, they look at a document file and manually infer which parts of the file are used for storing which types of data. I think it might be possible to automate this process using data flow analysis.

Some machine learning techniques could be used for classifying input bytes. For example, we could create a model where each byte has the following features:

And then similar bytes can be clustered together based on their features. The last visualization shows why this could be feasible: all the keys are touched by one instruction ('0x400732'), whereas all the values are touched by a different instruction. This alone is enough to classify these two groups of bytes differently.

Most of the focus so far has been on looking at snapshots of the program's execution state. Maybe we should also consider the time component. If you click on one of the links above and use your keyboard arrow keys to navigate through the snapshots, you can see how the memory layout evolves over time. There may be something interesting to look at there, but I don't know what!

Another thing that has to be addressed is deciding which programs to use for evaluation. I have used some toy programs written in C, as well as more complex ones like xmllint. I still need to find more programs that use files differently: different file formats, access patterns, file sizes, etc. One idea is the large files used in scientific computing.