Authorship Attribution

The following materials supplement the paper "Who Wrote This Code? Identifying the Authors of Program Binaries" [PDF], currently under submission. The paper describes techniques for extracting a broad set of binary code features and uses these features to address two authorship attribution problems: (1) authorship classification, which predicts the most likely author of a program from a set of known candidates, and (2) authorship clustering, an unsupervised task that groups programs by stylistic similarity. We provide the following:

Extended paper. The technical report version of our paper [PDF] includes a fuller description of the data sets, additional evaluation details, and a specification of the graph collapse algorithm for supergraphlet features.
Prototype feature extraction code. We provide the feature extraction utilities with which we formatted the program binaries for our experiments, as well as notes on usage.
Data sets. We use data sets derived from the Google Code Jam programming competition and from an operating systems course at the University of Wisconsin. Due to privacy concerns, we are not able to distribute the source code for the latter data set; the Code Jam programs can be obtained from the Code Jam archives. In place of source code, we provide data files that represent the programs using the binary code features described in the paper. Our data can be downloaded below.

Unless otherwise noted, the code published on this website is copyright Barton Miller and the Paradyn Project and is licensed under the LGPL.

Download

The feature extraction code is available.

Our experimental data are available in two formats. Data files containing the reduced set of features we use in evaluation (after feature selection) can be obtained in data-exp.tgz (10 MB). The full feature representation is in data-full.tgz (32 MB).
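
As a convenience, the following is a minimal sketch of how one might inspect and unpack either archive with Python's standard tarfile module. It assumes only that the .tgz files are ordinary gzip-compressed tar archives; the helper names list_archive and extract_archive are hypothetical, and nothing here presumes a particular layout or file format inside the archives.

```python
import tarfile
from pathlib import Path


def list_archive(archive_path):
    """Return the names of the regular files stored in a gzip-compressed tar archive."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return [member.name for member in tar.getmembers() if member.isfile()]


def extract_archive(archive_path, dest_dir):
    """Extract the archive into dest_dir and return the resolved destination path.

    Members whose paths would resolve outside dest_dir are rejected
    before anything is written.
    """
    dest = Path(dest_dir).resolve()
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar.getmembers():
            target = (dest / member.name).resolve()
            if target != dest and dest not in target.parents:
                raise ValueError(f"unsafe path in archive: {member.name}")
        tar.extractall(dest)
    return dest
```

For example, list_archive("data-exp.tgz") returns the file names in the reduced-feature archive without unpacking it, and extract_archive("data-full.tgz", "data-full") unpacks the full feature representation into a local directory (the directory name is only an illustration).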

Contact

Contact Nathan Rosenblum at UW-Madison Computer Sciences.