ABNER: A Biomedical Named Entity Recognizer
ABNER is a software tool for molecular biology text analysis. It began as a user-friendly interface for a system developed as part of the NLPBA/BioNLP 2004 Shared Task challenge. The details of that system are described in the paper below (Settles, 2004).
At ABNER's core is a statistical machine learning system using linear-chain conditional random fields (CRFs) with a variety of orthographic and contextual features. Version 1.5 includes two models trained on the NLPBA and BioCreative corpora, for which performance is roughly state of the art (F1 scores of 70.5 and 69.9, respectively; see Performance below). The new version also includes a Java API that lets users incorporate ABNER into their own systems, as well as train and use models on other data.
Features
- Simultaneously recognize multiple named entities (2 trained models included).
- Intuitive, interactive user interface.
- Optional built-in tokenization and sentence segmentation algorithms, robust to wrapped lines and biomedical abbreviations.
- Open text files and save annotations (SGML, IOB, and ABNER formats supported).
- Directory-recursive batch annotation of text files.
- Java API to incorporate ABNER into customized biomedical text applications.
- API includes routines for training ABNER on new corpora.
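To give a feel for what IOB-style annotations encode, here is a minimal sketch that turns per-token IOB tags back into entity spans. It does not use the ABNER API; the class name `IobSpans`, the `decode` helper, and the tag labels (`B-DNA`, `I-PROTEIN`, and so on) are assumptions made for illustration, and ABNER's actual IOB file layout may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of ABNER): groups IOB-tagged tokens into entities.
class IobSpans {

    // Returns one "TYPE: text" string per entity, in order of appearance.
    // tokens[i] is the i-th token and tags[i] its tag: "B-X", "I-X", or "O".
    static List<String> decode(String[] tokens, String[] tags) {
        List<String> entities = new ArrayList<>();
        StringBuilder current = null;   // text of the entity being built, if any
        String currentType = null;      // its type, e.g. "DNA"

        for (int i = 0; i < tokens.length; i++) {
            String tag = tags[i];
            boolean continues = tag.startsWith("I-")
                    && current != null
                    && tag.substring(2).equals(currentType);
            if (continues) {
                current.append(' ').append(tokens[i]);
                continue;
            }
            // Any other tag ends the entity that was open, if there was one.
            if (current != null) {
                entities.add(currentType + ": " + current);
                current = null;
                currentType = null;
            }
            // "B-X" starts a new entity; a stray "I-X" is treated the same way.
            if (tag.startsWith("B-") || tag.startsWith("I-")) {
                currentType = tag.substring(2);
                current = new StringBuilder(tokens[i]);
            }
        }
        if (current != null) {
            entities.add(currentType + ": " + current);
        }
        return entities;
    }

    public static void main(String[] args) {
        String[] tokens = {"IL-2", "gene", "expression", "requires", "NF-kappa", "B", "."};
        String[] tags   = {"B-DNA", "I-DNA", "O", "O", "B-PROTEIN", "I-PROTEIN", "O"};
        System.out.println(decode(tokens, tags));
    }
}
```

Running `main` prints `[DNA: IL-2 gene, PROTEIN: NF-kappa B]`. Treating a stray `I-` tag as the start of an entity is just one lenient choice; a stricter decoder could reject such input instead.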
Download and Legalese
Documentation for the API is available here (javadoc).
Java source code is available as a gzipped tarball: abner-1.5.tar.gz (32.8kb).
Note: You don't need the source to access the API, just make sure "abner.jar" is in your classpath.
This software © 2004 by Burr Settles, Department of Computer Sciences, University of Wisconsin-Madison. It is provided "as is," with no representations or warranties of any kind. ABNER is now "open source" and released under the terms of the Common Public License. You are free to use the code under those terms. Of course, an acknowledgement is always a good idea:
B. Settles. ABNER: An open source tool for automatically tagging genes, proteins, and other entity names in text. Bioinformatics, 21(14):3191-3192. 2005.
Here is a BibTeX entry if you dig that sort of thing:
@article{settles.bioinf05,
Author = {B. Settles},
Journal = {Bioinformatics},
Number = {14},
Pages = {3191--3192},
Title = {{ABNER}: An open source tool for automatically tagging genes,
proteins, and other entity names in text},
Volume = {21},
Year = 2005}
System Requirements
The bundled ABNER application is platform-independent, though it requires the Java 2 (J2SE) environment to be installed. It has been tested on Linux, Windows XP, Solaris, and Mac OSX. A modern processor (500MHz+) and 256MB+ of RAM are recommended. Note: If you plan to modify and compile the source code yourself, you will first need the Java SDK 1.4, MALLET 0.3.1, and JLex installed and working.
Performance
The following are complete results for the two trained models included with ABNER v1.5 on their corresponding evaluation corpora, using exact boundary matching. The first table is for the NLPBA model and the second for the BioCreative model. ("S-F1" refers to soft F1 scores, where a prediction still counts if at least one boundary is correct and the other is off by no more than one word.)
| Entity | Recall | Precision | F1 | (S-F1) |
|---|---|---|---|---|
| Protein | 77.8 | 68.1 | 72.6 | (84.9) |
| DNA | 63.1 | 67.2 | 65.1 | (76.1) |
| RNA | 61.9 | 61.3 | 61.6 | (78.5) |
| Cell Line | 58.2 | 53.9 | 56.0 | (68.2) |
| Cell Type | 65.6 | 79.8 | 72.0 | (82.1) |
| Overall | 72.0 | 69.1 | 70.5 | (82.0) |
| Entity | Recall | Precision | F1 | (S-F1) |
|---|---|---|---|---|
| Protein | 65.9 | 74.5 | 69.9 | (83.7) |
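The F1 column is the harmonic mean of recall and precision, F1 = 2PR / (P + R). As a quick sanity check, the sketch below recomputes it from the recall and precision values in the tables above. The class and method names are made up for this example and are not part of ABNER.

```java
import java.util.Locale;

// Hypothetical helper (not part of ABNER): recomputes F1 from recall and precision.
class F1Check {

    // Harmonic mean of precision and recall; both given as percentages.
    static double f1(double recall, double precision) {
        if (precision + recall == 0.0) {
            return 0.0; // avoid dividing by zero when both scores are zero
        }
        return 2.0 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        // {recall, precision} pairs copied from the tables above.
        String[] names = {"Protein", "DNA", "RNA", "Cell Line", "Cell Type",
                          "Overall (NLPBA)", "Protein (BioCreative)"};
        double[][] scores = {
            {77.8, 68.1}, {63.1, 67.2}, {61.9, 61.3}, {58.2, 53.9},
            {65.6, 79.8}, {72.0, 69.1}, {65.9, 74.5}
        };
        for (int i = 0; i < names.length; i++) {
            double value = f1(scores[i][0], scores[i][1]);
            System.out.println(String.format(Locale.ROOT, "%-22s F1 = %.1f", names[i], value));
        }
    }
}
```

Rounded to one decimal place, the recomputed values agree with the F1 column in both tables. The soft (S-F1) scores cannot be derived this way, since they depend on a different matching criterion.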
History
- March 2005 - ABNER v1.5. (Combined NLPBA and BioCreative models, improved performance, made tokenization optional, introduced API, released open source.)
- July 2004 - YAGI v1.0. (A command-line tool trained on the BioCreative corpus.)
- June 2004 - ABNER v1.0. (GUI wrapper for the original NLPBA system.)
Bugs
If you encounter a java.lang.OutOfMemoryError, you may need to increase your JVM's memory allocation (100MB seems to work well). To do this at the command line: java -Xmx100m -jar abner.jar
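If you are running ABNER from your own Java program rather than from the bundled jar, it can help to confirm that the heap setting actually took effect. The sketch below only uses the standard `Runtime` class; the class name `HeapCheck` is made up for this example.

```java
// Hypothetical helper (not part of ABNER): reports the JVM's maximum heap size.
class HeapCheck {

    // Maximum memory the JVM will attempt to use, in megabytes.
    // With -Xmx100m this is typically at or a little below 100.
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMegabytes() + " MB");
    }
}
```

Launching it with the same `-Xmx` flag you pass to your application (for example `java -Xmx100m HeapCheck`) shows the limit the JVM is really using.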
Similar Software
I am aware of a few other publicly available biomedical NER programs. ABNER is known to perform as well as or better than these on comparable corpora, and is open source with a customizable API.
- GAPSCORE (Chang et al., 2004) - http://bionlp.stanford.edu/gapscore/
- LingPipe (Alias-i Inc., 2003) - http://www.alias-i.com/lingpipe/
- AbGene (Tanabe & Wilbur, 2002) - ftp://ftp.ncbi.nlm.nih.gov/pub/tanabe/AbGene/
- KeX (Fukuda et al., 1998) - http://www.hgc.jp/service/tooldoc/KeX/intro.html
Acknowledgements
Thanks to Mark Craven for his advice, and to Andrew McCallum and Aron Culotta for answering questions about MALLET (the toolkit that implements the CRF). Research related to this software was supported by NLM grant T15-LM007359 and NIH grant R01-LM07050.
References
- B. Settles. ABNER: An Open Source Tool for Automatically Tagging Genes, Proteins, and Other Entity Names in Text. Bioinformatics, 21(14):3191-3192. 2005.
- B. Settles. Biomedical Named Entity Recognition Using Conditional Random Fields and Rich Feature Sets. In Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications (NLPBA), pages 104-107. 2004.
- J. Lafferty, A. McCallum, and F. Pereira. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of the International Conference on Machine Learning (ICML), pages 282-289. 2001.