Data Sets
        
        The following collections are machine learning data sets that I created over the course of my research. Feel free to use them in your own work, and please cite the papers listed in the README files where appropriate.
        
        
- 20 Newsgroups. A semi-synthetic text classification corpus for the multiple-instance learning setting.
 amil-newsgroups.tar.gz (1.9mb)
- SIVAL. An instance-labeled content-based image retrieval (CBIR) repository. For use with mixed-granularity multiple-instance learning.
 amil-sival.tar.gz (118.2mb)
- SIVAL-Times. Metadata regarding the annotation process of the SIVAL repository.
 sival-times.tar.gz (9kb)
- SigIE. An information extraction (IE) corpus of email signature lines, annotated for twelve address book fields (e.g., email, phone, jobtitle). Also contains metadata regarding the annotation process.
 sigie.tar.gz (65kb)
- Speculative Text. A collection of PubMed abstracts for subjectivity analysis, annotated for speculative vs. definite language. Also contains metadata regarding the annotation process.
 spec.tar.gz (194kb)
« Burr Settles Homepage