The following collections are machine learning data sets that I have created over the course of my research. Feel free to use them in your own work; if you do, please cite the papers listed in the README files where appropriate.
- 20 Newsgroups. A semi-synthetic text classification corpus for the multiple-instance learning setting.
- SIVAL. An instance-labeled content-based image retrieval (CBIR) repository. For use with mixed-granularity multiple-instance learning.
- SIVAL-Times. Metadata regarding the annotation process of the SIVAL repository.
- SigIE. An information extraction (IE) corpus of email signature lines, annotated for twelve address book fields (e.g., email, phone, jobtitle). Also contains metadata regarding the annotation process.
- Speculative Text. A collection of PubMed abstracts for subjectivity analysis, annotated for speculative vs. definite language. Also contains metadata regarding the annotation process.