We collected a benchmark dataset spanning five domains, each with 20 documents (100 documents in total). The dataset simulates a collection of sensitive documents that an adversary might target. The domains are:
1. Medical records: anatomy case studies from the University of Michigan's medical school at http://medical.med.umich.edu/courseinfo/clinical_index.html.
2. CIA: declassified CIA documents from Gale's Declassified Documents Reference System at http://galenet.galegroup.com/servlet/DDRS.
3. Email: messages from the Enron email corpus at http://www.cs.cmu.edu/~enron/.
4. Stock: annual reports aggregated by InvestorCalendar at http://investorcalendar.ar.wilink.com.
5. Switchboard (SW): telephone conversation transcripts from the free section of the Switchboard corpus on LDC Online.
Table 1 presents statistics and examples for each domain.