
Datasets

We collected a benchmark dataset of 100 documents: 20 from each of five domains. The dataset simulates a collection of sensitive documents that might interest an adversary. The domains are:

1. Medical records: anatomy case studies from the University of Michigan's medical school at http://medical.med.umich.edu/courseinfo/clinical_index.html.

2. CIA: declassified CIA documents from Gale's Declassified Documents Reference System at http://galenet.galegroup.com/servlet/DDRS.

3. Email: messages from the Enron email corpus at http://www.cs.cmu.edu/~enron/.

4. Stock: annual reports aggregated by InvestorCalendar at http://investorcalendar.ar.wilink.com.

5. Switchboard (SW): telephone conversation transcripts from the free section of the Switchboard corpus on LDC Online.

Table 1 presents statistics and examples of each domain.
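
As a rough illustration of how the benchmark is organized, the five domains above could be recorded in a small registry like the one below. This is only a sketch: the names `Domain`, `DOMAINS`, `DOCS_PER_DOMAIN`, and `total_documents` are hypothetical and do not come from the original work, and the registry holds only the descriptions and sources listed above, not the documents themselves.

```python
from dataclasses import dataclass

# Stated in the text: each domain contributes 20 documents.
DOCS_PER_DOMAIN = 20


@dataclass(frozen=True)
class Domain:
    """Hypothetical record describing one domain of the benchmark."""
    name: str
    description: str
    source: str


# Names, descriptions, and sources are copied from the list above.
DOMAINS = [
    Domain("Medical records",
           "anatomy case studies",
           "http://medical.med.umich.edu/courseinfo/clinical_index.html"),
    Domain("CIA",
           "declassified CIA documents",
           "http://galenet.galegroup.com/servlet/DDRS"),
    Domain("Email",
           "messages from the Enron email corpus",
           "http://www.cs.cmu.edu/~enron/"),
    Domain("Stock",
           "annual reports aggregated by InvestorCalendar",
           "http://investorcalendar.ar.wilink.com"),
    Domain("Switchboard (SW)",
           "telephone conversation transcripts",
           "LDC online, free section of Switchboard"),
]


def total_documents(domains=DOMAINS, per_domain=DOCS_PER_DOMAIN):
    """Return the benchmark size implied by the per-domain count."""
    return len(domains) * per_domain


if __name__ == "__main__":
    for d in DOMAINS:
        print(f"{d.name}: {DOCS_PER_DOMAIN} documents ({d.description})")
    print("Total:", total_documents())
```

Keeping the per-domain count in one constant makes the overall size (five domains of 20 documents each) easy to check against the description above.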



Nathanael Fillmore 2008-07-18