ABNER: A Biomedical Named Entity Recognizer

Version 1.5 released! The new version sports performance improvements, more customizability, plus a new API to train ABNER on other corpora and incorporate them into your systems.

ABNER is a software tool for molecular biology text analysis. It began as a user-friendly interface for a system developed as part of the NLPBA/BioNLP 2004 Shared Task challenge. The details of that system are described in the paper below (Settles, 2004).

At ABNER's core is a statistical machine learning system using linear-chain conditional random fields (CRFs) with a variety of orthographic and contextual features. Version 1.5 includes two models trained on the NLPBA and BioCreative corpora, for which performance is roughly state of the art (F1 scores of 70.5 and 69.9, respectively; details are in the Performance section below). The new version also includes a Java API that lets users incorporate ABNER into their own systems, as well as train and use models for other data. Here's a pretty screenshot:

[ABNER screenshot]
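To give a feel for what "orthographic features" means, here is a minimal sketch of token-level feature extraction. It is not ABNER's actual feature set or code: the class name, method name, and feature labels (INITCAPS, ALLCAPS, HASDIGIT, HASDASH) are all hypothetical, and the sketch assumes Java 8 or later rather than the Java 1.4 environment ABNER was built with.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; this class is NOT part of the ABNER API.
// It shows the kind of surface-form features a CRF tagger can attach to a token.
class OrthographicFeatures {

    // Returns the names of the (assumed) orthographic features that fire for a token.
    static List<String> extract(String token) {
        List<String> features = new ArrayList<>();
        if (token.isEmpty()) {
            return features;
        }
        boolean hasLetter = token.chars().anyMatch(Character::isLetter);
        if (Character.isUpperCase(token.charAt(0))) {
            features.add("INITCAPS");   // first character is upper case
        }
        if (hasLetter && token.equals(token.toUpperCase(Locale.ROOT))) {
            features.add("ALLCAPS");    // no lower-case letters at all
        }
        if (token.chars().anyMatch(Character::isDigit)) {
            features.add("HASDIGIT");   // contains at least one digit
        }
        if (token.indexOf('-') >= 0) {
            features.add("HASDASH");    // contains a hyphen
        }
        return features;
    }

    public static void main(String[] args) {
        for (String token : new String[] {"IL-2", "kinase", "p53"}) {
            System.out.println(token + " -> " + extract(token));
        }
    }
}
```

In a real tagger, features like these would be combined with contextual ones (for example, the features of neighboring tokens) before being passed to the CRF.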

Features

Download and Legalese

Download Version 1.5
March 2005 [Java archive, 9.5 MB]

Documentation for the API is available here (javadoc).
Java source code is available as a gzipped tarball: abner-1.5.tar.gz (32.8 KB).
Note: You don't need the source to access the API; just make sure "abner.jar" is in your classpath.

This software © 2004 by Burr Settles, Department of Computer Sciences, University of Wisconsin-Madison. It is provided "as is," with no representations or warranties of any kind. ABNER is now "open source" and released under the terms of the Common Public License. You are free to use the code under those terms. Of course, an acknowledgement is always a good idea:

B. Settles. ABNER: An open source tool for automatically tagging genes, proteins, and other entity names in text. Bioinformatics, 21(14):3191-3192. 2005.

Here is a BibTeX entry if you dig that sort of thing:

@article{settles.bioinf05,
    Author = {B. Settles},
    Journal = {Bioinformatics},
    Number = {14},
    Pages = {3191--3192},
    Title = {{ABNER}: An open source tool for automatically tagging genes, 
        proteins, and other entity names in text},
    Volume = {21},
    Year = 2005}

System Requirements

The bundled ABNER application is platform-independent, though it requires the Java 2 (J2SE) environment to be installed. It has been tested on Linux, Windows XP, Solaris, and Mac OS X. A modern processor (500MHz+) and 256MB+ of RAM are recommended. Note: If you plan to modify and compile the source code yourself, you will first need the Java SDK 1.4, MALLET 0.3.1, and JLex installed and working.

Performance

The following are complete results for the two trained models included with ABNER v1.5 on their corresponding evaluation corpora, using exact boundary matching. ("S-F1" refers to soft F1 scores, which require at least one boundary to be correct but tolerate a one-word error on one side.)

NLPBA model. Five entities trained on 18,546 sentences, evaluated on 3,856.

Entity      Recall  Precision  F1     (S-F1)
Protein     77.8    68.1       72.6   (84.9)
DNA         63.1    67.2       65.1   (76.1)
RNA         61.9    61.3       61.6   (78.5)
Cell Line   58.2    53.9       56.0   (68.2)
Cell Type   65.6    79.8       72.0   (82.1)
Overall     72.0    69.1       70.5   (82.0)
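The F1 column above is the harmonic mean of the recall and precision columns. The sketch below recomputes two rows of the NLPBA table as a sanity check; the class and method names are made up for illustration and are not part of ABNER.

```java
import java.util.Locale;

// Illustrative helper (not part of ABNER): recomputes F1 from recall and precision.
class F1Check {

    // F1 is the harmonic mean of precision and recall.
    // Inputs and output are percentages, as in the table.
    static double f1(double recall, double precision) {
        if (recall + precision == 0.0) {
            return 0.0; // avoid division by zero when both are zero
        }
        return 2.0 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        // Recall and precision values are taken from the NLPBA table.
        System.out.printf(Locale.ROOT, "Protein: %.1f%n", f1(77.8, 68.1)); // prints "Protein: 72.6"
        System.out.printf(Locale.ROOT, "Overall: %.1f%n", f1(72.0, 69.1)); // prints "Overall: 70.5"
    }
}
```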

BioCreative model. One entity trained on 7,500 sentences, evaluated on 2,500.

Entity      Recall  Precision  F1     (S-F1)
Protein     65.9    74.5       69.9   (83.7)
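The difference between the exact and soft scores comes down to how a predicted entity span is compared with the gold-standard span. The following is only a sketch of one reading of the soft criterion: it assumes spans are given as inclusive token offsets, and that "soft" means one boundary matches exactly while the other is off by at most one token. The evaluation script actually used may differ in its details, and a full scorer would also require the entity types to match.

```java
// Illustrative sketch (not ABNER's evaluation code) of exact vs. soft span matching.
// Spans are inclusive token offsets: [start, end].
class BoundaryMatch {

    // Exact matching: both boundaries must agree.
    static boolean exact(int predStart, int predEnd, int goldStart, int goldEnd) {
        return predStart == goldStart && predEnd == goldEnd;
    }

    // Assumed soft matching: one boundary agrees exactly and the other
    // is off by at most one token. Exact matches also count as soft matches.
    static boolean soft(int predStart, int predEnd, int goldStart, int goldEnd) {
        boolean leftOk = predStart == goldStart;
        boolean rightOk = predEnd == goldEnd;
        if (leftOk && rightOk) {
            return true;
        }
        if (leftOk) {
            return Math.abs(predEnd - goldEnd) <= 1;
        }
        if (rightOk) {
            return Math.abs(predStart - goldStart) <= 1;
        }
        return false;
    }

    public static void main(String[] args) {
        // Gold span covers tokens 3..5; the prediction runs one token too long on the right.
        System.out.println("exact: " + exact(3, 6, 3, 5)); // prints "exact: false"
        System.out.println("soft:  " + soft(3, 6, 3, 5));  // prints "soft:  true"
    }
}
```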

History

Bugs

If you encounter a java.lang.OutOfMemoryError, you may need to increase your JVM's memory allocation (100MB seems to work well). To do this at the command line: java -Xmx100m -jar abner.jar
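If you are calling ABNER from your own Java program rather than launching the jar directly, it can help to confirm how much heap the JVM was actually given. The small class below is a generic check using the standard Runtime API; the class name is arbitrary and it is not part of ABNER.

```java
// Generic JVM check (not part of ABNER): reports the maximum heap size
// the running JVM will attempt to use, as controlled by the -Xmx option.
class HeapCheck {

    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMegabytes() + " MB");
    }
}
```

When run with -Xmx100m, the reported figure should be close to 100, though some JVMs report slightly less than the requested value.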

Similar Software

I am aware of a few other publicly available biomedical NER programs. ABNER is known to perform as well as or better than these on comparable corpora, and it is open source with a customizable API.

Acknowledgements

Thanks to Mark Craven for his advice, and to Andrew McCallum and Aron Culotta for answering questions about MALLET (the toolkit that implements the CRF). Research related to this software was supported by NLM grant T15-LM007359 and NIH grant R01-LM07050.

References