ABNER: A Biomedical Named Entity Recognizer

Version 1.5 released! The new version sports performance improvements, more customizability, plus a new API to train ABNER on other corpora and incorporate them into your systems.

ABNER is a software tool for molecular biology text analysis. It began as a user-friendly interface for a system developed as part of the NLPBA/BioNLP 2004 Shared Task challenge. The details of that system are described in the paper below (Settles, 2004).

At ABNER's core is a statistical machine learning system using linear-chain conditional random fields (CRFs) with a variety of orthographic and contextual features. Version 1.5 includes two models trained on the NLPBA and BioCreative corpora, for which performance is roughly state of the art (F1 scores of 70.5 and 69.9, respectively; see Performance below). The new version also includes a Java API allowing users to incorporate ABNER into their systems, as well as train and use models for other data.

Features.

Simultaneously recognize multiple named entities (2 trained models included).
Intuitive, interactive user interface.
Optional built-in tokenization and sentence segmentation algorithms, robust to wrapped lines and biomedical abbreviations.
Open text files and save annotations (SGML, IOB, and ABNER formats supported).
Directory-recursive batch annotation of text files.
Java API to incorporate ABNER into customized biomedical text applications.
API includes routines for training ABNER on new corpora.
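
For readers unfamiliar with IOB annotations, here is a minimal sketch of how B-/I-/O tags map back to entity spans. It does not use the ABNER API, and it makes no assumption about the exact file layout ABNER writes. The class IobDecoder, its Span type, and the label strings are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class IobDecoder {
    /** One entity mention: token offsets [start, end) plus its type label. */
    static final class Span {
        final int start;
        final int end;
        final String type;

        Span(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }

        @Override
        public String toString() {
            return type + "[" + start + "," + end + ")";
        }
    }

    /** Converts one sentence's tags (e.g. "B-PROTEIN", "I-PROTEIN", "O") into spans. */
    static List<Span> decode(String[] tags) {
        List<Span> spans = new ArrayList<>();
        int start = -1;
        String type = null; // type of the currently open span, or null if none
        // Run one step past the end with a sentinel "O" so the last span is closed.
        for (int i = 0; i <= tags.length; i++) {
            String tag = i < tags.length ? tags[i] : "O";
            boolean continues = type != null && tag.equals("I-" + type);
            if (type != null && !continues) {
                spans.add(new Span(start, i, type));
                type = null;
            }
            // "B-" always opens a span; a stray "I-" with no open span is
            // treated leniently as the start of a new one.
            if (tag.startsWith("B-") || (tag.startsWith("I-") && type == null)) {
                start = i;
                type = tag.substring(2);
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        String[] tags = {"B-PROTEIN", "I-PROTEIN", "O", "B-DNA"};
        System.out.println(decode(tags));
    }
}
```

Token offsets are half-open, so a span's length is simply end minus start.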

Download and Legalese.

ABNER v1.5 (March 2005) is available as a bundled Java archive: abner.jar (9.5 MB).
To run it, execute this command from a terminal (the bracketed heap-size option is optional): java [-Xmx100m] -jar abner.jar

Documentation for the API is available here (javadoc).
Java source code is available as a gzipped tarball: abner-1.5.tar.gz (32.8 KB).
Note: You don't need the source to access the API; just make sure "abner.jar" is in your classpath.

This software © 2004 by Burr Settles, Department of Computer Sciences, University of Wisconsin-Madison. It is provided "as is," with no representations or warranties of any kind. ABNER is now "open source" and released under the terms of the Common Public License. You are free to use the code under those terms. Of course, an acknowledgement is always a good idea:

B. Settles (2005). ABNER: an open source tool for automatically tagging genes, proteins, and other entity names in text. Bioinformatics, 21(14):3191-3192.

Here is a BibTeX entry if you dig that sort of thing:

@article{settles.bioinf05,
    Author = {B. Settles},
    Journal = {Bioinformatics},
    Number = {14},
    Pages = {3191--3192},
    Title = {{ABNER}: An open source tool for automatically tagging genes, 
        proteins, and other entity names in text},
    Volume = {21},
    Year = 2005}

System Requirements.

The bundled ABNER application is platform-independent, though it requires the Java 2 (J2SE) environment to be installed. It has been tested on Linux, Windows XP, Solaris, and Mac OS X. A modern processor (500MHz+) and 256MB+ of RAM are recommended. Note: If you plan to modify and compile the source code yourself, you will first need the Java SDK 1.4, MALLET 0.3.1, and JLex to be installed and working.

Performance.

The following are complete results for the two trained models included with ABNER v1.5 on their corresponding evaluation corpora using exact boundary matching. ("S-F1" refers to soft F1 scores where at least one boundary is correct, but a one-word error on one side is tolerated.)

Entity	Recall	Precision	F1	(S-F1)
Protein	77.8	68.1	72.6	(84.9)
DNA	63.1	67.2	65.1	(76.1)
RNA	61.9	61.3	61.6	(78.5)
Cell Line	58.2	53.9	56.0	(68.2)
Cell Type	65.6	79.8	72.0	(82.1)
Overall	72.0	69.1	70.5	(82.0)

NLPBA model. Five entities trained on 18,546 sentences, evaluated on 3,856.

Entity	Recall	Precision	F1	(S-F1)
Protein	65.9	74.5	69.9	(83.7)

BioCreative model. One entity (subsuming genes and gene products) trained on 7,500 sentences, evaluated on 2,500.
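
To make the scoring concrete, here is a minimal sketch of exact and soft boundary matching and of how F1 follows from precision and recall. It is not the evaluation script used for the tables above. The class BoundaryScoring and its methods are hypothetical, and the soft-match rule (one boundary identical, the other off by at most one token) is an assumed reading of the S-F1 description.

```java
import java.util.Arrays;
import java.util.List;

final class BoundaryScoring {
    /** An entity mention covering tokens [start, end) with a type label. */
    static final class Span {
        final int start;
        final int end;
        final String type;

        Span(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    /** Exact matching: same type and both boundaries identical. */
    static boolean exactMatch(Span p, Span g) {
        return p.type.equals(g.type) && p.start == g.start && p.end == g.end;
    }

    /** Soft matching (assumed reading): one boundary identical, the other off by at most one token. */
    static boolean softMatch(Span p, Span g) {
        if (!p.type.equals(g.type)) {
            return false;
        }
        int startError = Math.abs(p.start - g.start);
        int endError = Math.abs(p.end - g.end);
        return (startError == 0 && endError <= 1) || (endError == 0 && startError <= 1);
    }

    /** F1 is the harmonic mean of precision and recall. */
    static double f1(double precision, double recall) {
        double sum = precision + recall;
        return sum == 0.0 ? 0.0 : 2.0 * precision * recall / sum;
    }

    /** Returns {recall, precision, F1} as percentages; each gold span can be matched once. */
    static double[] score(List<Span> predicted, List<Span> gold, boolean soft) {
        boolean[] used = new boolean[gold.size()];
        int correct = 0;
        for (Span p : predicted) {
            for (int i = 0; i < gold.size(); i++) {
                Span g = gold.get(i);
                boolean hit = soft ? softMatch(p, g) : exactMatch(p, g);
                if (!used[i] && hit) {
                    used[i] = true;
                    correct++;
                    break;
                }
            }
        }
        double recall = gold.isEmpty() ? 0.0 : 100.0 * correct / gold.size();
        double precision = predicted.isEmpty() ? 0.0 : 100.0 * correct / predicted.size();
        return new double[] {recall, precision, f1(precision, recall)};
    }

    public static void main(String[] args) {
        List<Span> gold = Arrays.asList(new Span(0, 2, "PROTEIN"), new Span(5, 6, "DNA"));
        // The first prediction overshoots the right boundary by one token.
        List<Span> predicted = Arrays.asList(new Span(0, 3, "PROTEIN"), new Span(5, 6, "DNA"));
        System.out.printf("exact F1 = %.1f%n", score(predicted, gold, false)[2]);
        System.out.printf("soft F1 = %.1f%n", score(predicted, gold, true)[2]);
        // Overall NLPBA row from the table: precision 69.1, recall 72.0.
        System.out.printf("overall NLPBA F1 = %.1f%n", f1(69.1, 72.0));
    }
}
```

In the toy data, the overshooting PROTEIN prediction counts as an error under exact matching but as a hit under soft matching, which is why S-F1 is always at least as high as F1.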

History.

March 2005 - ABNER v1.5. (Combined NLPBA and BioCreative models, improved performance, made tokenization optional, introduced API, released open source.)
July 2004 - YAGI v1.0. (A command-line tool trained on the BioCreative corpus.)
June 2004 - ABNER v1.0. (GUI wrapper for the original NLPBA system.)

Bugs.

If you encounter a java.lang.OutOfMemoryError, you may need to increase your JVM's memory allocation (100MB seems to work well). To do this at the command line: java -Xmx100m -jar abner.jar

Otherwise, I know of no bugs (yet). If you discover any, or would like to contribute fixes and/or functional improvements, please contact: bsettles@cs.wisc.edu.

Similar Software.

I am aware of a few other publicly available biomedical NER programs. ABNER is known to perform as well as or better than these on comparable corpora, and is open-source with a customizable API.

GAPSCORE (Chang et al., 2004) - http://bionlp.stanford.edu/gapscore/
LingPipe (Alias-i Inc., 2003) - http://www.alias-i.com/lingpipe/
AbGene (Tanabe & Wilbur, 2002) - ftp://ftp.ncbi.nlm.nih.gov/pub/tanabe/AbGene/
KeX (Fukuda et al., 1998) - http://www.hgc.jp/service/tooldoc/KeX/intro.html

Acknowledgements.

Thanks to Mark Craven for his advice, and Andrew McCallum and Aron Culotta for answering questions about MALLET (the toolkit that implements the CRF). Research related to this software was supported by NLM grant 5T15LM007359 and NIH grant R01 LM07050-01.

References.

B. Settles (2005). ABNER: an open source tool for automatically tagging genes, proteins, and other entity names in text. Bioinformatics, 21(14):3191-3192.
B. Settles (2004). Biomedical Named Entity Recognition Using Conditional Random Fields and Rich Feature Sets. In Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications (NLPBA), Geneva, Switzerland, pages 104-107.
J. Lafferty, A. McCallum, & F. Pereira (2001). Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of the International Conference on Machine Learning (ICML), Williamstown, MA, USA, pages 282-289.