ABNER: A Biomedical Named Entity Recognizer

Version 1.5 released! The new version sports performance improvements, more customizability, and a new API for training ABNER on other corpora and incorporating it into your own systems.

ABNER is a software tool for molecular biology text analysis. It began as a user-friendly interface for a system developed as part of the NLPBA/BioNLP 2004 Shared Task challenge. The details of that system are described in the paper below (Settles, 2004).

At ABNER's core is a statistical machine learning system using linear-chain conditional random fields (CRFs) with a variety of orthographic and contextual features. Version 1.5 includes two models trained on the NLPBA and BioCreative corpora, for which performance is roughly state of the art (F1 scores of 70.5 and 69.9 respectively; see Performance below). The new version also includes a Java API that lets users incorporate ABNER into their own systems, as well as train and use models on other data. Here's a pretty screenshot:

[ABNER screenshot]
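
To give a feel for what "orthographic and contextual features" can mean for a token-level tagger, here is a minimal sketch. The class name, the feature names (INITCAPS, HASDIGIT, W-1=, and so on), and the regular expressions are illustrative assumptions; they are not taken from ABNER's source and do not reproduce its actual feature set.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of the ABNER API.
class OrthographicFeatures {

    // Surface-form (orthographic) features of a single token.
    static List<String> extract(String token) {
        List<String> features = new ArrayList<>();
        if (token.matches("[A-Z].*")) features.add("INITCAPS");     // starts with a capital
        if (token.matches("[A-Z]+")) features.add("ALLCAPS");       // only capital letters
        if (token.matches(".*[0-9].*")) features.add("HASDIGIT");   // contains a digit
        if (token.matches(".*-.*")) features.add("HASDASH");        // contains a hyphen
        return features;
    }

    // Orthographic features plus simple contextual features:
    // the lowercased neighboring words, when they exist.
    static List<String> extractWithContext(String[] tokens, int i) {
        List<String> features = extract(tokens[i]);
        if (i > 0) features.add("W-1=" + tokens[i - 1].toLowerCase());
        if (i < tokens.length - 1) features.add("W+1=" + tokens[i + 1].toLowerCase());
        return features;
    }

    public static void main(String[] args) {
        String[] sentence = {"The", "IL-2", "gene", "binds", "NF-kappaB"};
        for (int i = 0; i < sentence.length; i++) {
            System.out.println(sentence[i] + " -> " + extractWithContext(sentence, i));
        }
    }
}
```

In a CRF, binary features like these would be paired with candidate labels and weighted during training; the sketch only covers the feature extraction step.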

Features.

Download and Legalese.

ABNER v1.5 (March 2005) is available as a bundled Java archive: abner.jar (9.5 MB).
To run it, execute this command from a terminal (the bracketed memory option is optional): java [-Xmx100m] -jar abner.jar

Documentation for the API is available here (javadoc).
Java source code is available as a gzipped tarball: abner-1.5.tar.gz (32.8 KB).
Note: You don't need the source to access the API, just make sure "abner.jar" is in your classpath.
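
If you are unsure whether "abner.jar" is actually visible to your program, a small reflection check like the one below can help. This is a sketch, not part of ABNER: the default class name "abner.Tagger" is an assumption about the API's entry point, so consult the javadoc and pass a different name as the first argument if needed.

```java
// Hypothetical diagnostic for illustration only; not part of the ABNER distribution.
class AbnerClasspathCheck {

    // Returns true if a class with the given fully qualified name can be loaded
    // from the current classpath.
    static boolean isAvailable(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "abner.Tagger" is an assumed class name; override it on the command line.
        String name = args.length > 0 ? args[0] : "abner.Tagger";
        System.out.println(name + (isAvailable(name) ? " found" : " NOT found")
                + " on the classpath");
    }
}
```

After compiling, run it with the jar on the classpath, for example: java -cp abner.jar:. AbnerClasspathCheck (use ";" instead of ":" as the separator on Windows).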

This software © 2004 by Burr Settles, Department of Computer Sciences, University of Wisconsin-Madison. It is provided "as is," with no representations or warranties of any kind. ABNER is now "open source" and released under the terms of the Common Public License. You are free to use the code under those terms. Of course, an acknowledgement is always a good idea:

B. Settles (2005). ABNER: an open source tool for automatically tagging genes, proteins, and other entity names in text. Bioinformatics, 21(14):3191-3192.

Here is a BibTeX entry if you dig that sort of thing:

@article{settles.bioinf05,
    Author = {B. Settles},
    Journal = {Bioinformatics},
    Number = {14},
    Pages = {3191--3192},
    Title = {{ABNER}: An open source tool for automatically tagging genes, 
        proteins, and other entity names in text},
    Volume = {21},
    Year = 2005}

System Requirements.

The bundled ABNER application is platform-independent, though it requires the Java 2 (J2SE) environment to be installed. It has been tested on Linux, Windows XP, Solaris, and Mac OS X. A modern processor (500MHz+) and 256MB+ of RAM are recommended. Note: If you plan to modify and compile the source code yourself, you will first need the Java SDK 1.4, MALLET 0.3.1, and JLex to be installed and working.

Performance.

The following are complete results for the two trained models included with ABNER v1.5 on their corresponding evaluation corpora using exact boundary matching. ("S-F1" refers to soft F1 scores where at least one boundary is correct, but a one-word error on one side is tolerated.)

Entity       Recall   Precision   F1     (S-F1)
Protein      77.8     68.1        72.6   (84.9)
DNA          63.1     67.2        65.1   (76.1)
RNA          61.9     61.3        61.6   (78.5)
Cell Line    58.2     53.9        56.0   (68.2)
Cell Type    65.6     79.8        72.0   (82.1)
Overall      72.0     69.1        70.5   (82.0)

NLPBA model. Five entities trained on 18,546 sentences, evaluated on 3,856.

Entity       Recall   Precision   F1     (S-F1)
Protein      65.9     74.5        69.9   (83.7)

BioCreative model. One entity (subsuming genes and gene products) trained on 7,500 sentences, evaluated on 2,500.
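
For readers who want to check the arithmetic, F1 is the harmonic mean of precision and recall: F1 = 2PR / (P + R). The sketch below recomputes the F1 column from the recall and precision figures reported above. The class and method names are made up for this example, and the S-F1 column cannot be recomputed this way because it depends on per-entity boundary decisions not given here.

```java
// Hypothetical helper for illustration only; not part of the ABNER API.
class F1Check {

    // Harmonic mean of precision and recall (both given as percentages).
    static double f1(double recall, double precision) {
        if (recall + precision == 0.0) {
            return 0.0; // avoid division by zero when nothing was found or predicted
        }
        return 2.0 * precision * recall / (precision + recall);
    }

    // Round to one decimal place, matching the precision of the tables.
    static double round1(double x) {
        return Math.round(x * 10.0) / 10.0;
    }

    public static void main(String[] args) {
        String[] names = {"Protein", "DNA", "RNA", "Cell Line", "Cell Type", "Overall",
                          "BioCreative Protein"};
        double[][] recallPrecision = {
            {77.8, 68.1}, {63.1, 67.2}, {61.9, 61.3}, {58.2, 53.9},
            {65.6, 79.8}, {72.0, 69.1}, {65.9, 74.5}
        };
        for (int i = 0; i < names.length; i++) {
            double score = round1(f1(recallPrecision[i][0], recallPrecision[i][1]));
            System.out.println(names[i] + ": F1 = " + score);
        }
    }
}
```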

History.

Bugs.

If you encounter a java.lang.OutOfMemoryError, you may need to increase your JVM's memory allocation (100MB seems to work well). To do this at the command line: java -Xmx100m -jar abner.jar
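
To confirm that the -Xmx setting took effect, you can ask the JVM for its maximum heap size using the standard Runtime API. The following is a small standalone sketch (the class name is made up for this example) and is independent of ABNER itself.

```java
// Hypothetical diagnostic for illustration only; not part of the ABNER distribution.
class HeapCheck {

    // Maximum amount of memory the JVM will attempt to use, in megabytes.
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap available to this JVM: " + maxHeapMegabytes() + " MB");
    }
}
```

Running it as java -Xmx100m HeapCheck should report a value close to 100; the exact figure varies slightly between JVM implementations.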

Otherwise, I know of no bugs (yet). If you discover any, or would like to contribute fixes and/or functional improvements, please contact: bsettles@cs.wisc.edu.

Similar Software.

I am aware of a few other publicly available biomedical NER programs. ABNER is known to perform as well as or better than them on comparable corpora, and is open source with a customizable API.

Acknowledgements.

Thanks to Mark Craven for his advice, and Andrew McCallum and Aron Culotta for answering questions about MALLET (the toolkit that implements the CRF). Research related to this software was supported by NLM grant 5T15LM007359 and NIH grant R01 LM07050-01.


References.