Data Sets

The following collections are machine learning data sets that I have created over the course of my research. Feel free to use them in your own work; where appropriate, please cite the papers listed in each data set's README file.

20 Newsgroups. A semi-synthetic text classification corpus for the multiple-instance learning setting.
amil-newsgroups.tar.gz (1.9 MB)
SIVAL. An instance-labeled content-based image retrieval (CBIR) repository, for use with mixed-granularity multiple-instance learning.
amil-sival.tar.gz (118.2 MB)
SIVAL-Times. Metadata regarding the annotation process of the SIVAL repository.
sival-times.tar.gz (9 KB)
SigIE. An information extraction (IE) corpus of email signature lines, annotated for twelve address book fields (e.g., email, phone, jobtitle). Also contains metadata regarding the annotation process.
sigie.tar.gz (65 KB)
Speculative Text. A collection of PubMed abstracts for subjectivity analysis, annotated for speculative vs. definite language. Also contains metadata regarding the annotation process.
spec.tar.gz (194 KB)

« Burr Settles Homepage