This archive contains the data files for the Component, Function and
Process tasks described in:

S. Ray & M. Craven (2005). Supervised versus Multiple-Instance
Learning: An Empirical Comparison.  Appears in the Proceedings of the
22nd International Conference on Machine Learning, Bonn, Germany.

These data files were generated while building the system described in:

S. Ray & M. Craven (2005).  Learning Statistical Models for Annotating
Proteins with Function Information using Biomedical Text.  Appears in
BMC Bioinformatics, Vol 6 (Suppl 1).

In particular, they are used to train the classifiers in the final
step of that system (in that work, Naive Bayes classifiers were used
and the multiple-instance aspect was not considered). The task is to
decide whether a given <protein, document> pair should be annotated
with some Gene Ontology (GO) code. As input, we have paragraphs of
documents, each paragraph described by a feature vector. Features used
are word occurrence frequencies and some statistics about the nature of
the protein-GO code interaction for each paragraph. Each document
corresponds to a bag and each paragraph to an instance in a bag. The
hypothesis is that a bag should be annotated with a GO code if and only
if at least one paragraph in it supports this annotation. Conversely, if
no paragraph supports such an annotation, the document should not be
annotated.
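
The bag-labeling hypothesis above can be sketched in a few lines of
Python. This is only an illustration of the rule as stated, not code
from the system described in the papers; the function name bag_label
and the boolean per-paragraph inputs are assumptions made for the
example.

```python
from typing import Iterable


def bag_label(instance_supports: Iterable[bool]) -> bool:
    """Label a bag (document) from its instances (paragraphs).

    Hypothetical helper: each input value says whether one paragraph
    supports the GO annotation. The bag is positive if and only if at
    least one paragraph supports it.
    """
    return any(instance_supports)


# One supporting paragraph is enough to annotate the document.
print(bag_label([False, True, False]))
# No supporting paragraph means the document is not annotated.
print(bag_label([False, False]))
```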

The data files are in C4.5 format. For each of the Component, Function
and Process hierarchies in GO, there is a train/test pair of data files
and a names file listing the features used. There are two index
features: bag_id and instance_id. Instances with the same bag_id
belong to the same bag.
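
As a rough sketch of how instances might be grouped into bags, the
Python below reads comma-separated C4.5-style data lines and collects
them by bag_id. It assumes, purely for illustration, that bag_id is
the first field on each line; the actual column order is given by the
names file and should be checked there. The function name
group_into_bags and its parameters are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_into_bags(
    lines: Iterable[str], bag_col: int = 0, sep: str = ","
) -> Dict[str, List[List[str]]]:
    """Group data lines into bags keyed by bag_id.

    bag_col is an assumed column position for bag_id; check the names
    file for the real one. Fields are kept as strings.
    """
    bags: Dict[str, List[List[str]]] = defaultdict(list)
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        fields = [field.strip() for field in line.split(sep)]
        bags[fields[bag_col]].append(fields)
    return dict(bags)


# Made-up example lines, not taken from the archive.
example = ["b1,i1,0.5,0", "b1,i2,0.1,1", "", "b2,i1,0.0,0"]
for bag_id, instances in group_into_bags(example).items():
    print(bag_id, len(instances))
```

To read a real file, pass an open file object as lines, since
iterating over it yields one line at a time.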

Send questions to sray@biostat.wisc.edu.

