Data Sets
The following collections are machine learning data sets that I created over the course of my research. Feel free to use them in your own work, and please cite the papers listed in the README files where appropriate.
You may also be interested in my software.
- 20 Newsgroups.
  [amil-newsgroups.tar.gz, 1.9 MB]
  A semi-synthetic text classification corpus for the multiple-instance learning setting.
- SIVAL.
  [amil-sival.tar.gz, 118.2 MB]
  An instance-labeled content-based image retrieval (CBIR) repository. For use with mixed-granularity multiple-instance learning.
- SIVAL-Times.
  [sival-times.tar.gz, 9 KB]
  Metadata regarding the annotation process of the SIVAL repository.
- SigIE.
  [sigie.tar.gz, 65 KB]
  An information extraction (IE) corpus of email signature lines, annotated for twelve address book fields (e.g., email, phone, jobtitle). Also contains metadata regarding the annotation process.
- Speculative Text.
  [spec.tar.gz, 194 KB]
  A collection of PubMed abstracts for subjectivity analysis, annotated for speculative vs. definite language. Also contains metadata regarding the annotation process.