KDD Cup 2001 Winners
The KDD Cup summary presentation from KDD-2001 is available in PowerPoint, PostScript, or PDF. Winners' presentations are available below.
KDD Cup 2001 Honorable Mention
Chairs
Tasks
KDD Cup 2001 involves three tasks, based on two datasets. The two training
datasets are available from the links below, as zip files.
The first dataset is a little over half a gigabyte
when uncompressed and comes as a single text file, with one row per record
and fields separated by commas. The second is a little over 7 megabytes
uncompressed. It includes a single text file with all the data; again, the
format is one row per record with comma-separated fields. But this data
set is quite relational in nature, so improved accuracy may be possible by
constructing more complex features or using a relational data mining
technique (see the README file that comes with it). Nevertheless, we've
tried to pre-compute some of the interesting relations as added fields, so
that standard feature-vector algorithms can compete well. For both datasets,
"names" files also are included that give the names of the fields; the names
are "meaningful" only for the second dataset. For both datasets a README
file is included that describes the nature of the task. The README files
are repeated at the bottom of this page for those who wish to read about
the data/task before choosing to download the data.
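As a rough illustration of the row format described above (one record per row, comma-separated fields), the following Python sketch reads records and keeps only the positions of the fields equal to 1, which may help with a file as large as the first dataset. The function name read_sparse_rows, the label_column parameter, and the assumption that the class label is the first field are illustrative and are not taken from the README; check the names file before relying on them. The sample rows are made up.

```python
import csv
import io

def read_sparse_rows(lines, label_column=0):
    """Yield (label, set of indices of fields equal to '1') for each row.

    ASSUMPTION: the class label sits in `label_column` (default: first field).
    """
    for fields in csv.reader(lines):
        if not fields:
            continue  # skip blank lines
        label = fields[label_column].strip()
        values = fields[:label_column] + fields[label_column + 1:]
        active_bits = {i for i, v in enumerate(values) if v.strip() == "1"}
        yield label, active_bits

# Made-up sample in the assumed layout: label, then binary features.
sample = io.StringIO("A,0,1,0,1\nI,0,0,0,0\n")
rows = list(read_sparse_rows(sample))
```

For a real file, an open file handle can be passed in place of the StringIO object, so the rows are processed one at a time rather than loaded all at once.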
Training Data
Dataset 1: Prediction of Molecular Bioactivity for Drug Design -- Binding to Thrombin
Dataset 2: Prediction of Gene/Protein Function and Localization
Test Data
Each test set comes in a zip file with the test data and a README file.
The README file describes the format and manner in which predictions
for that test set should be submitted. Each person may submit only one
prediction file per task (for a total of at most 3 submissions per person).
Test Data for Dataset 2 (both tasks): Function and Localization
Answers
The following are the keys that were used for scoring. Several points
are worth noting regarding Function and Localization keys. First,
submissions varied widely in the use of punctuation, case, and spelling for
function and localization names. Because of this variation, we had our
code remove punctuation and compare only a prefix of each name long enough
to distinguish it from all others; each name was then converted into a
shorter standard form. These shorter forms are the ones given in the keys
below. We also hand-checked the entries and their converted forms. Second,
one gene in the test set had two localizations (contradicting our earlier
statement that each gene had only one localization).
For this gene, the predicted localization was counted correct if it matched *either* of
the correct localizations. Third, one function appeared in a test set gene
but in no training set gene. This of course made it impossible to
get 100% accuracy, but everyone was subject to this same constraint, and
we think it just goes with the territory of a real-world task :-)
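The name normalization described above (strip punctuation, then keep just enough of a prefix to tell each name apart) can be sketched in Python as follows. This is a minimal reconstruction under assumptions, not the scoring code that was actually used: the function names are hypothetical, lowercasing and whitespace removal are added guesses about how case and spacing were handled, and the example names are made up for illustration.

```python
import string

def normalize(name):
    """Lowercase a name and drop punctuation and whitespace."""
    drop = set(string.punctuation) | set(string.whitespace)
    return "".join(ch for ch in name.lower() if ch not in drop)

def shortest_unique_prefixes(names):
    """Map each normalized name to the shortest prefix no other name shares."""
    normalized = sorted({normalize(n) for n in names})
    prefixes = {}
    for name in normalized:
        others = [o for o in normalized if o != name]
        for length in range(1, len(name) + 1):
            prefix = name[:length]
            if not any(o.startswith(prefix) for o in others):
                prefixes[name] = prefix
                break
        else:
            # The whole name is a prefix of another name; keep it in full.
            prefixes[name] = name
    return prefixes

def canonicalize(submitted, prefixes):
    """Return the standard short form matching a submitted name, or None."""
    cleaned = normalize(submitted)
    matches = [p for p in prefixes.values() if cleaned.startswith(p)]
    # Prefer the longest match so a name that is a prefix of another
    # name does not capture the longer one.
    return max(matches, key=len) if matches else None

# Illustrative names only; the real keys define the actual standard forms.
prefixes = shortest_unique_prefixes(
    ["nucleus", "cytoplasm", "cytoskeleton", "mitochondria", "ER"]
)
```

With this scheme a submission such as "Cyto-plasm" and one such as "cytoplasm." both map to the same short form, which is the effect the paragraph above describes.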
Schedule
Answers to Questions of General Interest from Question Period 1
Answers to Questions of General Interest from Question Period 2
Further Details
Description of Dataset 1: Prediction of Molecular Bioactivity for Drug Design -- Binding to Thrombin
Drugs are typically small organic molecules that achieve their desired
activity by binding to a target site on a receptor. The first step in
the discovery of a new drug is usually to identify and isolate the
receptor to which it should bind, followed by testing many small
molecules for their ability to bind to the target site. This leaves
researchers with the task of determining what separates the active
(binding) compounds from the inactive (non-binding) ones. Such a
determination can then be used in the design of new compounds that not
only bind, but also have all the other properties required for a drug
(solubility, oral absorption, lack of side effects, appropriate duration
of action, toxicity, etc.).
The present training data set consists of 1909 compounds tested for their ability to bind to a target site on thrombin, a key receptor in blood clotting. The chemical structures of these compounds are not necessary for our analysis and are not included. Of these compounds, 42 are active (bind well) and the others are inactive. Each compound is described by a single feature vector consisting of a class value (A for active, I for inactive) and 139,351 binary features, which describe three-dimensional properties of the molecule. The definitions of the individual bits are not included: we don't know what each individual bit means, only that they are generated in an internally consistent manner for all 1909 compounds.
Biological activity in general, and receptor binding affinity in particular, correlate with various structural and physical properties of small organic molecules. The task is to determine which of these properties are critical in this case and to learn to accurately predict the class value. To simulate the real-world drug design environment, the test set contains 636 additional compounds that were in fact generated based on the assay results recorded for the training set.
In evaluating the accuracy, a differential cost model will be used, so that the sum of the costs of the actives will be equal to the sum of the costs of the inactives. In other words, it is just as important to minimize your error rate on the actives as it is to minimize your error rate on the inactives, even though the training set contains more inactives than actives (and the test set might also).
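One way to read the differential cost model described above is as the average of the per-class accuracies, so that the actives as a group weigh as much as the inactives as a group. The Python sketch below implements that reading; it is an interpretation, not the organizers' scoring code, and the function name balanced_accuracy and the toy labels are illustrative.

```python
def balanced_accuracy(true_labels, predicted_labels, classes=("A", "I")):
    """Average the accuracy computed separately on each class.

    ASSUMPTION: equal total cost per class is taken to mean the
    unweighted mean of per-class accuracies.
    """
    per_class = []
    for cls in classes:
        indices = [i for i, t in enumerate(true_labels) if t == cls]
        if not indices:
            continue  # class absent from the true labels
        correct = sum(1 for i in indices if predicted_labels[i] == cls)
        per_class.append(correct / len(indices))
    return sum(per_class) / len(per_class)

# Toy example: 2 actives and 8 inactives.
truth = ["A", "A"] + ["I"] * 8
all_inactive = ["I"] * 10            # always predict the majority class
half_right = ["A", "I"] + ["I"] * 8  # finds one of the two actives

score_majority = balanced_accuracy(truth, all_inactive)
score_half = balanced_accuracy(truth, half_right)
```

Under this measure, always predicting "inactive" scores 0.5 on the toy labels even though 80% of its predictions are correct, which shows why a classifier cannot do well here by ignoring the rare actives.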
We thank DuPont Pharmaceuticals for graciously providing this data set
for the KDD Cup 2001 competition. All publications referring to
analysis of this data set should acknowledge DuPont Pharmaceuticals
Research Laboratories and KDD Cup 2001.
Description of Dataset 2: Prediction of Gene/Protein Function and Localization
The genomes of several organisms have now been completely sequenced, including
the human genome -- depending on one's definition of "completely" :-). Interest
within bioinformatics is therefore shifting somewhat away from sequencing, to
learning about the genes encoded in the sequence. Genes code for proteins, and
these proteins tend to localize in various parts of cells and interact with one
another, in order to perform crucial functions. The present data set consists of
a variety of details about the various genes of one particular type of organism.
Gene names have been anonymized and a subset of the genes have been withheld for
testing. The two tasks are to predict the functions and localizations of the
proteins encoded by the genes. A gene/protein can have more than one function,
but (at least in this data set) only one localization. The other information from
which function and localization can be predicted includes the class of the
gene/protein, the phenotype (observable characteristics) of individuals with a
mutation in the gene (and hence in the protein), and the other proteins with
which each protein is known to interact.
The full data set is in Full_File.data. But please notice that the task is quite "relational." For example, one might wish to learn a rule that says a gene G has function F if G interacts with another gene G1 that has function F. We have made an effort to build such features into Full_File.data. (For example, for each gene we give the number of interacting genes with a given function -- these features are probably useful for predicting at least one or two of the functions). But participants may wish to construct their own additional features or to use a relational data mining algorithm. While this certainly can be done from Full_File.data, it may be easier to do this from the relational tables that we used to build Full_File.data. These are in Genes_relation.data and Interactions_relation.data. Each of the data files has a corresponding names file as well.
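The kind of relational feature mentioned above (for each gene, the number of interacting genes with a given function) can be sketched in Python as follows. The sketch works on in-memory structures rather than on the actual files, because the column layout of Genes_relation.data and Interactions_relation.data is not reproduced here. The function name, the gene identifiers, and the function names are hypothetical, and treating interactions as symmetric is an assumption.

```python
from collections import Counter, defaultdict

def neighbor_function_counts(interactions, functions):
    """For each gene, count how many interacting genes have each function.

    interactions: iterable of (gene_a, gene_b) pairs, treated as symmetric.
    functions: dict mapping a gene to the set of its known functions.
    """
    neighbors = defaultdict(set)
    for a, b in interactions:
        if a == b:
            continue  # ignore self-interactions
        neighbors[a].add(b)
        neighbors[b].add(a)

    counts = {}
    for gene, partners in neighbors.items():
        counter = Counter()
        for partner in partners:
            # Genes with unknown functions (e.g. test genes) add nothing.
            counter.update(functions.get(partner, ()))
        counts[gene] = counter
    return counts

# Hypothetical toy data.
interactions = [("G1", "G2"), ("G1", "G3")]
functions = {"G2": {"transport"}, "G3": {"transport", "metabolism"}}
counts = neighbor_function_counts(interactions, functions)
```

Each resulting count can be added to a gene's feature vector as one column per function, which lets a standard feature-vector learner approximate the rule "G has function F if G interacts with a gene that has function F".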
Detailed knowledge of the biology should not be necessary for this application. So much so that we nearly anonymized all the other fields along with the gene field. In the end we decided to leave the other fields alone, since this might make the data set more interesting. One word of caution: your predictor for function should not use localization, and your predictor for localization should not use function, since *both* fields will be withheld from the test genes when they are provided. Also note that, because a gene may have more than one function, we will test for correct prediction of every (gene, function) pair. By the time we provide the test data, we will provide a full specification of the format for submission of your predictions.
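One plausible reading of testing every (gene, function) pair is to treat each pair as a yes/no decision and measure the fraction of decisions that are right. The Python sketch below follows that reading. It is an assumption about the scoring, not the official procedure, which is defined by the README shipped with the test data; the function name and the toy genes and functions are hypothetical.

```python
def pair_accuracy(true_functions, predicted_functions, all_functions):
    """Fraction of (gene, function) pairs whose yes/no status is predicted correctly.

    true_functions, predicted_functions: dicts mapping gene -> set of functions.
    all_functions: every function name considered in the task.
    """
    correct = 0
    total = 0
    for gene, truth in true_functions.items():
        predicted = predicted_functions.get(gene, set())
        for function in all_functions:
            total += 1
            if (function in truth) == (function in predicted):
                correct += 1
    return correct / total

# Hypothetical toy data: two genes, three possible functions.
truth = {"G1": {"f1", "f2"}, "G2": {"f3"}}
predicted = {"G1": {"f1"}, "G2": {"f2", "f3"}}
score = pair_accuracy(truth, predicted, ["f1", "f2", "f3"])
```

In the toy data, four of the six pairs are judged correctly: one pair is missed for G1 (f2) and one is wrongly asserted for G2 (f2).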