abner
Class Trainer

java.lang.Object
  extended by abner.Trainer

public class Trainer
extends java.lang.Object

The Trainer class trains a CRF to extract entities from a customized dataset. The input file must be tokenized with one sentence per line, with a "|" (vertical pipe) separating each word/token from its label. The first token of an entity name should have a label beginning with "B-", all other tokens of that entity should have labels beginning with "I-", and non-entity tokens should be labeled "O":

   IL-2|B-DNA gene|I-DNA expression|O and|O NF-kappa|B-PROTEIN B|I-PROTEIN activation|O ...
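
As an illustration of the format above, here is a minimal sketch that joins tokens and labels into one training line. The class PipeFormat and its method toLine are hypothetical helpers, not part of ABNER, and the sketch assumes that token|label pairs are separated by single spaces, as in the example line.

```java
import java.util.StringJoiner;

// Hypothetical helper, not part of the ABNER API.
public class PipeFormat {

    /**
     * Joins parallel token/label arrays into one line of the form
     * "token|label token|label ...". Assumes pairs are space-separated.
     */
    public static String toLine(String[] tokens, String[] labels) {
        if (tokens.length != labels.length) {
            throw new IllegalArgumentException("tokens and labels differ in length");
        }
        StringJoiner line = new StringJoiner(" ");
        for (int i = 0; i < tokens.length; i++) {
            // A token containing whitespace would be split into two pairs on reading.
            if (tokens[i].isEmpty() || tokens[i].matches(".*\\s.*")) {
                throw new IllegalArgumentException("bad token at index " + i);
            }
            line.add(tokens[i] + "|" + labels[i]);
        }
        return line.toString();
    }

    public static void main(String[] args) {
        String[] tokens = {"IL-2", "gene", "expression"};
        String[] labels = {"B-DNA", "I-DNA", "O"};
        // Prints: IL-2|B-DNA gene|I-DNA expression|O
        System.out.println(toLine(tokens, labels));
    }
}
```

Each sentence of the corpus would be written this way on its own line of the training file.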
   


Constructor Summary
Trainer()
           
 
Method Summary
 void train(java.lang.String trainFile, java.lang.String modelFile)
          Takes input trainFile (format described above), and saves a trained linear-chain CRF on the data using ABNER's default feature set in the corresponding output modelFile.
 void train(java.lang.String trainFile, java.lang.String modelFile, java.lang.String[] tags)
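          Identical to the other train routine, but the set of tags (e.g. "PROTEIN", "DNA", etc.) allows the model to periodically output progress in terms of precision/recall/f1 during training.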
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

Trainer

public Trainer()


Method Detail

train

public void train(java.lang.String trainFile,
                  java.lang.String modelFile)

Takes input trainFile (format described above), and saves a trained linear-chain CRF on the data using ABNER's default feature set in the corresponding output modelFile.

Warning: training can take several hours, or even days, to complete, depending on corpus size and number of entity tags.


train

public void train(java.lang.String trainFile,
                  java.lang.String modelFile,
                  java.lang.String[] tags)

Identical to the other train routine, but the set of tags (e.g. "PROTEIN", "DNA", etc.) allows the model to periodically output progress in terms of precision/recall/F1 during training. Note: do not use "B-" or "I-" prefixes in the tags array.
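
Since the tags array must list entity names without the "B-" or "I-" prefixes, one way to build it is to scan the training lines and strip the prefixes. The following is a minimal sketch under that assumption; TagCollector and collectTags are hypothetical names and are not part of ABNER.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of the ABNER API.
public class TagCollector {

    /**
     * Collects the distinct entity tags used in pipe-formatted training
     * lines, with the "B-" / "I-" prefixes removed, in sorted order.
     */
    public static String[] collectTags(Iterable<String> lines) {
        Set<String> tags = new TreeSet<>();
        for (String line : lines) {
            for (String pair : line.trim().split("\\s+")) {
                if (pair.isEmpty()) {
                    continue; // blank line
                }
                // Use the last '|' so a token that itself contains '|' still parses.
                int bar = pair.lastIndexOf('|');
                if (bar < 0) {
                    throw new IllegalArgumentException("missing '|' in: " + pair);
                }
                String label = pair.substring(bar + 1);
                if (label.startsWith("B-") || label.startsWith("I-")) {
                    tags.add(label.substring(2));
                } else if (!label.equals("O")) {
                    throw new IllegalArgumentException("unexpected label: " + label);
                }
            }
        }
        return tags.toArray(new String[0]);
    }
}
```

The resulting array could then be passed as the third argument, for example new Trainer().train(trainFile, modelFile, tags), where trainFile and modelFile are paths chosen by the caller.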