Publications by Year
BibTeX entries for current publications: settles.bib
2009
-
G. Druck, B. Settles, and A. McCallum.
Active Learning by Labeling Features.
In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), to appear. ACL Press, 2009.
In natural language tasks, features can often be intuitively labeled (e.g., in extracting information from apartment classifieds, "WORD=deposit" might indicate the label "lease," or "WORD=pets" indicate "restrictions"). We introduce novel query algorithms and user labeling interfaces for feature-based active learning in such domains. [pdf]
-
B. Settles.
Active Learning Literature Survey.
Computer Sciences Technical Report 1648, University of Wisconsin-Madison. 2009.
An introduction to active learning and a survey of the literature. This paper outlines the various learning scenarios, query strategy frameworks, variants, application domains, and related work published over the past few decades. [pdf]
-
B. Settles.
A Software Tool for Biomedical Information Extraction (and Beyond).
In V. Prince and M. Roche (Eds.), Information Retrieval in Biomedicine: Natural Language Processing for Knowledge Integration, pages 326-335. IGI Global Press, 2009.
An overview of biomedical named entity recognition with conditional random fields using ABNER (see the Bioinformatics 2005 paper below). Includes a survey of higher-level information management tasks (relation extraction, information retrieval, automatic database curation, etc.) that have been built on top of ABNER.
2008
-
B. Settles.
Curious Machines: Active Learning with Structured Instances.
PhD thesis, University of Wisconsin-Madison. 2008.
My PhD thesis on active learning for structured input representations (e.g., sequence labeling and multiple-instance learning tasks) and queries with potentially varying annotation costs. Also introduces the information density (ID) and expected gradient length (EGL) active learning frameworks. [pdf]
-
B. Settles and M. Craven.
An Analysis of Active Learning Strategies for Sequence Labeling Tasks.
In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1069-1078. ACL Press, 2008.
Active learning has not been well studied for structured prediction tasks such as information extraction. This paper expands the frontier of query strategies for sequence models (CRFs, HMMs, PCFGs, etc.) into several new query frameworks, and presents a large-scale empirical evaluation of these algorithms on eight benchmark data sets. [pdf]
-
B. Settles, M. Craven, and L. Friedland.
Active Learning with Real Annotation Costs.
In Proceedings of the NIPS Workshop on Cost-Sensitive Learning. 2008.
Do annotation costs vary across instances? Among annotators? Can these costs be accurately predicted? What impact might this have on active learning in practice? This paper addresses these questions with a detailed empirical study of real-world annotation costs, and presents a novel approach to cost-sensitive active learning by modeling unknown annotation costs directly. [pdf][data]
-
B. Settles, M. Craven, and S. Ray.
Multiple-Instance Active Learning.
In Advances in Neural Information Processing Systems (NIPS), volume 20, pages 1289-1296. MIT Press, 2008.
In multiple-instance (MI) learning, instances are organized into bags, which can be labeled inexpensively but ambiguously. In some MI problems, finer-granularity instance labels can be obtained, which are less ambiguous but more costly. This paper motivates a novel active learning framework for MI learners that allows them to query and learn from labels at mixed levels of granularity. [pdf][code][data]
2007
-
A. Goldberg, D. Andrzejewski, J. Van Gael, B. Settles, X. Zhu, and M. Craven.
Ranking Biomedical Passages for Relevance and Diversity.
In Proceedings of the Fifteenth Text Retrieval Conference (TREC). 2007.
An information retrieval system for biomedical text, focused on query generation and result ranking using a PageRank-style algorithm. The proposed ranker encourages both relevance and diversity in top-ranked items by turning retrieved items into absorbing states on a graph. [pdf][code]
2006
-
T. Brow, B. Settles, and M. Craven.
Classifying Biomedical Articles by Making Localized Decisions.
In Proceedings of the Fourteenth Text Retrieval Conference (TREC). 2006.
This paper presents a variety of machine learning approaches that exploit document-passage relationships both in classification and in learning. Results support our hypothesis that, for some text classification tasks, only certain passages of text are relevant to the task at hand. [pdf]
2005
-
B. Settles.
ABNER: An Open Source Tool for Automatically Tagging Genes, Proteins, and Other Entity Names in Text.
Bioinformatics, 21(14):3191-3192. 2005.
An introduction to ABNER, a state-of-the-art, open-source, biomedical information extraction tool written in Java. It works stand-alone or as an API for inclusion in more sophisticated information management systems. [pdf][software]
-
B. Settles and M. Craven.
Exploiting Zone Information, Syntactic Features, and Informative Terms in Gene Ontology Annotation from Biomedical Documents.
In Proceedings of the Thirteenth Text Retrieval Conference (TREC). 2005.
A system that predicts Gene Ontology (GO) annotations for research articles using a two-tier machine learning approach. First, articles are segmented into "zones" (abstract, introduction, conclusion, etc.) and classified using automatically induced syntactic and semantic features. Second, zone-level predictions are aggregated into overall document labelings. This was one of the top-performing systems at the TREC Genomics track. [pdf]
2004
-
B. Settles.
Biomedical Named Entity Recognition Using Conditional Random Fields and Rich Feature Sets.
In Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and Its Applications (NLPBA), pages 104-107. 2004.
This paper motivates biomedical named entity recognition using conditional random fields (CRFs) with a variety of orthographic and automatically induced semantic features. It was one of the top-performing approaches in the NLPBA shared task evaluation. [pdf]