ABNER is a software tool for molecular biology text analysis. It began as a user-friendly interface for a system developed as part of the NLPBA/BioNLP 2004 Shared Task challenge. The details of that system are described in the paper below (Settles, 2004).
At ABNER's core is a statistical machine learning system using linear-chain conditional random fields (CRFs) with a variety of orthographic and contextual features. Version 1.5 includes two models trained on the NLPBA and BioCreative corpora, for which performance is roughly state of the art (F1 scores of 70.5 and 69.9 respectively; details below). The new version also includes a Java API allowing users to incorporate ABNER into their own systems, as well as train and use models for other data.
ABNER v1.5 (March 2005) is available as a bundled Java archive: abner.jar (9.5 MB).
To run it, execute this command from a terminal: java [-Xmx100m] -jar abner.jar
Documentation for the API is available here (javadoc).
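Because the API ships inside abner.jar, a quick way to confirm that the jar is visible to your own program is to ask the JVM whether one of its classes can be loaded. The sketch below is only an illustration: the default class name abner.Tagger is an assumption about the package layout (check the javadoc for the actual names), and ClasspathCheck is a hypothetical helper, not part of ABNER.

```java
// Hypothetical helper: reports whether a named class can be found on the
// current classpath. It is not part of ABNER.
final class ClasspathCheck {

    // Returns true if the class can be located, without running its
    // static initializers (initialize = false).
    static boolean isLoadable(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "abner.Tagger" is an assumed class name; pass another as args[0].
        String name = args.length > 0 ? args[0] : "abner.Tagger";
        if (isLoadable(name)) {
            System.out.println(name + " found on the classpath.");
        } else {
            System.out.println(name + " not found; is abner.jar on the classpath?");
        }
    }
}
```

After compiling, it could be run with something like java -cp abner.jar:. ClasspathCheck (use ; instead of : as the separator on Windows).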
Java source code is available as a gzipped tarball: abner-1.5.tar.gz (32.8 KB).
Note: You don't need the source to access the API; just make sure "abner.jar" is in your classpath.
This software © 2004 by Burr Settles, Department of Computer Sciences, University of Wisconsin-Madison. It is provided "as is," with no representations or warranties of any kind. ABNER is now "open source" and released under the terms of the Common Public License. You are free to use the code under those terms. Of course, an acknowledgement is always a good idea:
B. Settles (2005). ABNER: an open source tool for automatically tagging genes, proteins, and other entity names in text. Bioinformatics, 21(14):3191-3192.
Here is a BibTeX entry if you dig that sort of thing:
@article{settles.bioinf05,
  Author = {B. Settles},
  Journal = {Bioinformatics},
  Number = {14},
  Pages = {3191--3192},
  Title = {{ABNER}: An open source tool for automatically tagging genes, proteins, and other entity names in text},
  Volume = {21},
  Year = 2005
}
The bundled ABNER application is platform-independent, though it requires the Java 2 (J2SE) environment to be installed. It has been tested on Linux, Windows XP, Solaris, and Mac OS X. A modern processor (500MHz+) and 256MB+ of RAM are recommended. Note: If you plan to modify and compile the source code yourself, you will first need the Java SDK 1.4, MALLET 0.3.1, and JLex installed and working.
The following are complete results for the two trained models included with ABNER v1.5 on their corresponding evaluation corpora using exact boundary matching. The first table is for the NLPBA model (overall F1 of 70.5) and the second is for the BioCreative model (F1 of 69.9). ("S-F1" refers to soft F1 scores, where at least one boundary must be correct but a one-word error on one side is tolerated.)
Entity | Recall | Precision | F1 | (S-F1) |
---|---|---|---|---|
Protein | 77.8 | 68.1 | 72.6 | (84.9) |
DNA | 63.1 | 67.2 | 65.1 | (76.1) |
RNA | 61.9 | 61.3 | 61.6 | (78.5) |
Cell Line | 58.2 | 53.9 | 56.0 | (68.2) |
Cell Type | 65.6 | 79.8 | 72.0 | (82.1) |
Overall | 72.0 | 69.1 | 70.5 | (82.0) |
Entity | Recall | Precision | F1 | (S-F1) |
---|---|---|---|---|
Protein | 65.9 | 74.5 | 69.9 | (83.7) |
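
The F1 column in these tables is the harmonic mean of precision and recall, F1 = 2PR / (P + R). As a sanity check, the minimal sketch below recomputes it from the recall and precision values listed above. F1Check is a hypothetical name used only for this illustration; it is not part of ABNER or its evaluation scripts.

```java
// Hypothetical helper: recomputes F1 from the recall and precision
// columns reported in the tables. It is not part of ABNER.
final class F1Check {

    // Balanced F-measure: harmonic mean of precision and recall.
    static double f1(double precision, double recall) {
        if (precision + recall == 0.0) {
            return 0.0; // avoid dividing by zero when both are zero
        }
        return 2.0 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        String[] rows = {
            "NLPBA Protein", "NLPBA DNA", "NLPBA RNA", "NLPBA Cell Line",
            "NLPBA Cell Type", "NLPBA Overall", "BioCreative Protein"
        };
        // Each entry is {recall, precision}, copied from the tables.
        double[][] scores = {
            {77.8, 68.1}, {63.1, 67.2}, {61.9, 61.3}, {58.2, 53.9},
            {65.6, 79.8}, {72.0, 69.1}, {65.9, 74.5}
        };
        for (int i = 0; i < rows.length; i++) {
            double recall = scores[i][0];
            double precision = scores[i][1];
            System.out.printf("%-20s F1 = %.1f%n", rows[i], f1(precision, recall));
        }
    }
}
```

Rounded to one decimal place, the recomputed values agree with the F1 column. The soft (S-F1) scores cannot be derived this way, since they depend on a different matching criterion.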
If you encounter a java.lang.OutOfMemoryError, you may need to increase your JVM's memory allocation (100MB seems to work well). To do this at the command line: java -Xmx100m -jar abner.jar
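If you are unsure whether the -Xmx setting is taking effect, one way to check is to ask the JVM for its maximum heap size. The sketch below uses the standard Runtime.maxMemory() call; HeapCheck is a hypothetical name for this illustration and is not part of ABNER.

```java
// Hypothetical helper: prints the maximum heap the JVM will try to use.
// It is not part of ABNER.
final class HeapCheck {

    // Maximum heap size in megabytes, as reported by the running JVM.
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap available to this JVM: "
                + maxHeapMegabytes() + " MB");
    }
}
```

Running it as java -Xmx100m HeapCheck should report a figure close to 100 MB; the exact number can be slightly lower depending on the JVM.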
Otherwise, I know of no bugs (yet). If you discover any, or would like to contribute fixes and/or functional improvements, please contact: bsettles@cs.wisc.edu.
I am aware of a few other publicly available biomedical NER programs. ABNER is known to perform as well as or better than them on comparable corpora, and is open source with a customizable API.
Thanks to Mark Craven for his advice, and to Andrew McCallum and Aron Culotta for answering questions about MALLET (the toolkit that implements the CRF). Research related to this software was supported by NLM grant 5T15LM007359 and NIH grant R01 LM07050-01.