This page lists questions that were asked during Question Period 1, together with answers. Only questions of general interest appear here, without duplication; numerous other questions are omitted, as well as numerous other phrasings of questions that already appear here.
I would like to ask whether one can focus on the analysis of just one of the
two datasets in order to participate in the KDD Cup.
A: Certainly. You may submit predictions for one, two, or all three tasks. Three winners will be identified, one for each of the three tasks: (1) prediction of active compounds for Dataset 1, (2) prediction of function for Dataset 2, (3) prediction of localization for Dataset 2.
Q: Regarding Dataset 2, if the test data contain a table similar to
Interactions_relation, will it include interactions between two test genes
(test gene A - test gene B) in addition to interactions (test gene -
training gene), or the latter kind
of interactions only?
A: The test data will contain interactions of both types, test genes with training genes and test genes with other test genes.
Regarding Dataset 1, can we expect the ratio of the number of active compounds
to the number of inactive compounds in the final test set to be
roughly the same as the ratio in the given training set?
A: No, and to keep this similar to the real-world scenario, we don't want to say what the ratio is. But being generous people :-), we will give away 2 items of information. (1) The compounds in the test set were made after chemists saw the activity results for the training set, so as you might expect the test set has a higher fraction of actives than did the training set. (2) Nevertheless, the test set still has more inactives than actives. We realize our method of testing makes the task tougher than it would be with the standard assumption that test data are drawn according to the same distribution as the training data. But this is a common scenario for those working in the pharmaceutical industry.
How will the entries be judged? Accuracy, speed of
computation, conciseness of rule?
A: Entirely by test set accuracy, or 1 - error. But please note that for Dataset 1, error will be the sum of the error rate on actual actives and the error rate on actual inactives. Thus, for example, if there are 10 actives and 100 inactives in the test set, then each active will effectively count 10 times as much as each inactive.
How do we submit our entries? Do we ftp our algorithm to you or
just simply the test results? What are the constraints? Processing window?
A: You will send us simply the test results in a text file. We will specify the format of this file when we provide the test data. You may arrive at your predictions for the test data in any way you like (we impose no constraints).
What is an essential gene and what is a complex?
A: An essential gene is one without which the organism dies. Some proteins are complexes of several peptides (each encoded by a single gene). So if several genes have the same complex, it means they code for different parts of the same protein. Your data mining system should be able to make good use of these fields without your having to give them any kind of special treatment based on domain knowledge. You may just treat them as nominal (discrete-valued) fields with the possible values listed in the Genes_relation.names file. The same is true for phenotype, class, motif, etc.
For Dataset 2, what is the meaning of the Corr (real-valued) field for
two interacting genes?
A: This is the correlation between gene expression patterns for the two genes. A correlation far from 0 implies that these genes are likely to influence one another strongly.
For Dataset 1, assume that the test set contains NA active substances
and NI inactive ones. My procedure correctly classifies NAcor of the NA
active substances and NIcor of the NI inactive substances. Is it correct
that the measure of error of my procedure would be
Err = (NA - NAcor)/NA + (NI - NIcor)/NI ?
Is it Err I should minimize?
A: Yes. The person/group that minimizes this on the test set wins. This is equivalent to minimizing the ordinary error with differential costs. For example, if the test set contains 10 actives and 100 inactives, we want to maximize accuracy (minimize ordinary error) where each active counts 10 times but each inactive only once. This is the standard accuracy with differential misclassification costs as is used throughout data mining and machine learning.
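As an illustration of the error measure discussed above, here is a minimal Python sketch. The function name `summed_class_error` and the label strings "A" (active) and "I" (inactive) are assumptions made for this example only; they are not the official file format or scoring code.

```python
def summed_class_error(y_true, y_pred, active_label="A"):
    """Err = (NA - NAcor)/NA + (NI - NIcor)/NI for two-class predictions.

    y_true and y_pred are equal-length sequences of labels. Any label other
    than active_label is treated as inactive (an assumption of this sketch).
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    n_active = sum(1 for t in y_true if t == active_label)
    n_inactive = len(y_true) - n_active
    if n_active == 0 or n_inactive == 0:
        raise ValueError("both classes must be present in y_true")

    # Actives predicted as inactive: NA - NAcor.
    wrong_active = sum(
        1 for t, p in zip(y_true, y_pred)
        if t == active_label and p != active_label
    )
    # Inactives predicted as active: NI - NIcor.
    wrong_inactive = sum(
        1 for t, p in zip(y_true, y_pred)
        if t != active_label and p == active_label
    )
    return wrong_active / n_active + wrong_inactive / n_inactive
```

With 10 actives and 100 inactives, missing one active adds as much to Err as misclassifying 10 inactives, which is the "each active counts 10 times" weighting described in the answer. Note that predicting every compound as inactive gives Err = 1, the same as predicting every compound as active.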
I was looking at the Thrombin data set, and found that 593 of the 1909
samples are all zero (i.e., none of the bits are 1). Included in this
set of 593 are two Active compounds. Is this correct, or could it be
A: This is correct. Clearly one cannot get 100% training set accuracy because of this, but one probably can remove the two all-zero actives from the training set without incurring any disadvantage. I would not suggest removing any of the all-zero inactives, since their contributions to the frequencies of the different attributes are important (at least for frequency-based approaches such as decision/classification trees).
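The filtering suggested in this answer could be sketched as follows. The in-memory representation, a list of (label, set of indices of bits that are 1) pairs, and the label strings "A" and "I" are assumptions for illustration and do not reflect the actual layout of the Thrombin data file.

```python
def drop_all_zero_actives(samples, active_label="A"):
    """Remove samples that are labeled active but have no bits set.

    samples is a list of (label, on_bits) pairs, where on_bits is a set of
    feature indices whose value is 1. All-zero inactives are kept, since
    they still contribute to the attribute frequencies.
    """
    return [
        (label, on_bits)
        for label, on_bits in samples
        if on_bits or label != active_label
    ]
```

For example, given one all-zero active, one all-zero inactive, and one ordinary active, only the all-zero active is dropped.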
A small question w.r.t. dataset 2. Some attributes have missing values (now denoted by a '?'), but there are also 'Unknown' values. What is the difference between these?
A: There is no difference between 'unknown' and missing values. In some cases, the experimentalist assigned the particular gene to the class 'unknown'. In other cases where the class was not known it was left blank. In the data cleansing step both should be assigned to missing values.
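A minimal sketch of that cleansing step in Python is shown below. The helper name `normalize_missing`, the choice of `None` as the missing marker, and the case-insensitive match on "unknown" are assumptions of this example, not part of the dataset documentation.

```python
# Tokens treated as "missing": the '?' marker, the 'Unknown' class,
# and blank fields.
MISSING_TOKENS = {"?", "unknown", ""}


def normalize_missing(value):
    """Map '?', 'Unknown' (any letter case), and blank fields to None.

    Any other value is returned with surrounding whitespace stripped.
    """
    if value is None:
        return None
    stripped = value.strip()
    if stripped.lower() in MISSING_TOKENS:
        return None
    return stripped
```

Applying this to every nominal field before training gives a single representation for missing values, so a learner does not treat 'Unknown' as an informative category of its own.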
Is there an error concerning gene G239017? I found two localizations given for this gene.
A: According to the classification scheme used, a gene can have multiple localizations. Actually the localizations refer to the gene product (protein) rather than the gene itself (e.g. a cytoplasmic transcription factor that moves into the nucleus given a specific signal). Nevertheless, almost all of the proteins have only one localization, and we will ensure that each protein in the test set has only one localization.
Just for completeness: will all functions and localizations of
test examples come from the set of functions and localizations
of training examples? There could be at least one different case,
indicated to me by the occurrence of
"TRANSPOSABLE ELEMENTS VIRAL AND PLASMID PROTEINS "
in the file Genes_relation.names as one possible value for the
function attribute, while I did not find this value for any training
A: It is possible that some cases are not represented in the training set. This reflects the real situation where genes of a given class have not been found or confirmed yet.
I am a bit puzzled by some issues of reflexive and symmetrical
interaction relations contained in Interactions_relation.data. As far as
I can see, there are 44 cases of genes interacting with themselves. (Why
not all?) Moreover, I found 14 gene pairs, where gene#1 interacts with
gene#2, and also gene#2 with gene#1. (Again, why are these cases just
sporadic?) Could you maybe give some background information about
A: Interactions are not necessarily reflexive. Certain protein molecules bind to form homo-dimers. All interactions are symmetrical, however. If gene1 interacts with gene2, then gene2 interacts with gene1. We have tried to list those interactions only once. However, in some cases both pairs made it to the final table. These should be considered as duplicate records.
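One way to treat the reversed pairs as duplicates is sketched below. It assumes each record has already been parsed into a tuple whose first two elements are the two gene identifiers, followed by any remaining fields; that layout and the function name are assumptions for illustration, not a description of Interactions_relation.data.

```python
def dedupe_interactions(rows):
    """Keep the first record for each unordered gene pair.

    rows is an iterable of tuples (gene1, gene2, *other_fields). A record
    (g2, g1, ...) seen after (g1, g2, ...) is dropped as a duplicate.
    Self-interactions (g, g, ...) are kept once.
    """
    seen = set()
    unique = []
    for gene1, gene2, *rest in rows:
        key = frozenset((gene1, gene2))  # order-independent pair key
        if key in seen:
            continue
        seen.add(key)
        unique.append((gene1, gene2, *rest))
    return unique
```

If a learner needs both directions of every interaction, the deduplicated list can then be expanded symmetrically in a consistent way, rather than relying on the sporadic reversed pairs in the file.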
On the thrombin data set, is there any particular order in the way the 139351
binary features were generated?