The following materials supplement the paper "Who Wrote This Code? Identifying the Authors of Program Binaries" [PDF], currently under submission. The paper describes techniques for extracting a broad set of binary code features and uses these features to address two authorship attribution problems: (1) authorship classification, which predicts the most likely author of a program from a set of known candidates, and (2) authorship clustering, an unsupervised task that groups programs by stylistic similarity. We provide the feature extraction code and the experimental data used in our evaluation.

Unless otherwise noted, the code published on this website is copyright Barton Miller and the Paradyn Project and is licensed under the LGPL.

Download

The feature extraction code is available.

Our experimental data are available in two formats. The reduced set of features used in our evaluation (after feature selection) is in data-exp.tgz (10 MB). The full feature representation is in data-full.tgz (32 MB).
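Both archives are gzip-compressed tar files. As a convenience, the following is a minimal sketch of how one might inspect and unpack them with Python's standard `tarfile` module. The helper names `list_members` and `extract_archive` are hypothetical, and the sketch assumes nothing about the internal layout of the archives, which is not documented on this page.

```python
import tarfile
from pathlib import Path


def list_members(archive_path):
    """Return the names of all members in a gzip-compressed tar archive."""
    with tarfile.open(archive_path, mode="r:gz") as archive:
        return archive.getnames()


def extract_archive(archive_path, dest_dir):
    """Unpack the archive into dest_dir, refusing members that would escape it."""
    dest = Path(dest_dir).resolve()
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, mode="r:gz") as archive:
        # Check every member path before writing anything to disk.
        for member in archive.getmembers():
            target = (dest / member.name).resolve()
            if dest != target and dest not in target.parents:
                raise ValueError(f"unsafe path in archive: {member.name}")
        archive.extractall(dest)
    return dest


if __name__ == "__main__":
    # Archive names are taken from this page; they are only processed if
    # they have already been downloaded into the current directory.
    for name in ("data-exp.tgz", "data-full.tgz"):
        if Path(name).exists():
            print(name, len(list_members(name)), "members")
            extract_archive(name, Path(name).stem)
```

Run from the directory containing the downloaded files, this unpacks each archive into a directory named after it (for example, `data-exp`).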

Contact

Contact Nathan Rosenblum at UW-Madison Computer Sciences.