KDD Cup 2001 Q & A (Question Period 1)

This page lists questions that were asked during Question Period 1, together with answers. Only questions of general interest appear here, without duplication; numerous other questions are omitted, as well as numerous other phrasings of questions that already appear here.

Q: I would like to ask whether one can focus on the analysis of just one of the two datasets in order to participate in the KDD Cup.
A: Certainly. You may submit predictions for one, two, or all three tasks. Three winners will be identified, one for each of the three tasks: (1) prediction of active compounds for Dataset 1, (2) prediction of function for Dataset 2, (3) prediction of localization for Dataset 2.

Q: Regarding Dataset 2, if the test data contain a table similar to Interactions_relation, will it include interactions between two test genes (test gene A - test gene B) in addition to interactions between a test gene and a training gene, or only the latter kind?
A: The test data will contain interactions of both types, test genes with training genes and test genes with other test genes.

Q: Regarding Dataset 1, can we expect the ratio of the number of active compounds to the number of inactive compounds in the final test set to be roughly the same as the ratio in the given training set?
A: No, and to keep this similar to the real-world scenario, we don't want to say what the ratio is. But being generous people :-), we will give away 2 items of information. (1) The compounds in the test set were made after chemists saw the activity results for the training set, so as you might expect the test set has a higher fraction of actives than did the training set. (2) Nevertheless, the test set still has more inactives than actives. We realize our method of testing makes the task tougher than it would be with the standard assumption that test data are drawn according to the same distribution as the training data. But this is a common scenario for those working in the pharmaceutical industry.

Q: How will the entries be judged? Accuracy, speed of computation, conciseness of rule?
A: Entirely by test set accuracy, or 1 - error. But please note that for Dataset 1, error will be the sum of error on actual actives and error on actual inactives. Thus, for example, if there are 10 actives and 100 inactives in the test set, then each active will effectively count 10 times as much as each inactive.

Q: How do we submit our entries? Do we ftp our algorithm to you or just simply the test results? What are the constraints? Processing window? Hardware?
A: You will send us simply the test results in a text file. We will specify the format of this file when we provide the test data. You may arrive at your predictions for the test data in any way you like (we impose no constraints).

Q: What is an essential gene and what is a complex?
A: An essential gene is one without which the organism dies. Some proteins are complexes of several peptides (each encoded by a single gene). So if several genes have the same complex, it means they code for different parts of the same protein. Your data mining system should be able to make good use of these fields without your having to give them any kind of special treatment based on domain knowledge. You may just treat them as nominal (discrete-valued) fields with the possible values listed in the Genes_relation.names file. The same is true for phenotype, class, motif, etc.

Q: For Dataset 2, what is the meaning of the Corr (real-valued) field for two interacting genes?
A: This is the correlation between gene expression patterns for the two genes. A correlation far from 0 implies that these genes are likely to influence one another strongly.

Q: For Dataset 1, assume that the test set contains NA active substances and NI inactive ones. My procedure correctly classifies NAcor of the NA active substances and NIcor of the NI inactive substances. Is it correct that the error measure for my procedure would be Err = (NA - NAcor)/NA + (NI - NIcor)/NI? Is Err the quantity I should minimize?
A: Yes. The person/group that minimizes this on the test set wins. This is equivalent to minimizing the ordinary error with differential costs. For example, if the test set contains 10 actives and 100 inactives, we want to maximize accuracy (minimize ordinary error) where each active counts 10 times but each inactive only once. This is the standard accuracy with differential misclassification costs as is used throughout data mining and machine learning.
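
As an illustration of the formula above, here is a minimal Python sketch of Err. The function name and the class labels "A" (active) and "I" (inactive) are assumptions made for this example; they are not part of any official scoring code.

```python
def balanced_error(y_true, y_pred, active="A", inactive="I"):
    """Err = (NA - NAcor)/NA + (NI - NIcor)/NI.

    The labels "A" and "I" are assumed here for illustration only.
    """
    n_active = sum(1 for t in y_true if t == active)
    n_inactive = sum(1 for t in y_true if t == inactive)
    if n_active == 0 or n_inactive == 0:
        raise ValueError("need at least one active and one inactive example")

    # Count the misclassified examples within each actual class.
    wrong_active = sum(
        1 for t, p in zip(y_true, y_pred) if t == active and p != active
    )
    wrong_inactive = sum(
        1 for t, p in zip(y_true, y_pred) if t == inactive and p != inactive
    )
    return wrong_active / n_active + wrong_inactive / n_inactive


# Made-up example: 10 actives and 100 inactives, with 1 active
# and 10 inactives misclassified.
y_true = ["A"] * 10 + ["I"] * 100
y_pred = ["I"] + ["A"] * 9 + ["A"] * 10 + ["I"] * 90
err = balanced_error(y_true, y_pred)
```

In this made-up example one missed active and ten missed inactives contribute equally to Err (1/10 and 10/100), which is the "each active counts 10 times" weighting described in the answer.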

Q: I was looking at the Thrombin data set, and found that 593 of the 1909 samples are all zero (i.e., none of the bits are 1). Included in this set of 593 are two Active compounds. Is this correct, or could it be a mistake?
A: This is correct. Clearly one cannot get 100% training set accuracy because of this, but one probably can remove the two all-zero actives from the training set without incurring any disadvantage. I would not suggest removing any of the all-zero inactives, since their contributions to the frequencies of the different attributes are important (at least for frequency-based approaches such as decision/classification trees).
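
For those who want to try the suggestion above, the following Python sketch drops all-zero actives while keeping all-zero inactives. It assumes the training data have already been loaded as a list of 0/1 feature rows with a parallel list of labels, and that actives are labeled "A"; the function name and that label are hypothetical.

```python
def drop_all_zero_actives(rows, labels, active="A"):
    """Remove active examples whose feature bits are all 0.

    All-zero inactives are kept, as recommended in the answer above.
    The label "A" is an assumption made for this sketch.
    """
    kept_rows, kept_labels = [], []
    for row, label in zip(rows, labels):
        if label == active and not any(row):
            continue  # all-zero active: skip it
        kept_rows.append(row)
        kept_labels.append(label)
    return kept_rows, kept_labels


# Tiny made-up example with 3 features per compound.
rows = [[0, 0, 0], [0, 1, 0], [0, 0, 0], [1, 0, 0]]
labels = ["A", "A", "I", "I"]
clean_rows, clean_labels = drop_all_zero_actives(rows, labels)
```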

Q: A small question w.r.t. dataset 2. Some attributes have missing values (now denoted by a '?'), but there are also 'Unknown' values. What is the difference between these?
A: There is no difference between 'Unknown' and missing values. In some cases, the experimentalist assigned the particular gene to the class 'Unknown'. In other cases where the class was not known, it was left blank. In the data cleansing step, both should be treated as missing values.
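
One possible way to do this in a cleansing step is sketched below in Python. The helper name, the case-insensitive matching, and the choice of None as the missing marker are assumptions made for this example.

```python
# Tokens treated as "missing". Matching is case-insensitive here,
# which is an assumption of this sketch.
MISSING_TOKENS = {"?", "unknown", ""}


def normalize_missing(value):
    """Map '?', 'Unknown', and blank fields to None; keep other values."""
    if value is None:
        return None
    text = value.strip()
    return None if text.lower() in MISSING_TOKENS else text


# Made-up field values for illustration.
raw = ["?", "Unknown", "  ", "nucleus"]
cleaned = [normalize_missing(v) for v in raw]
```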

Q: Is there an error concerning gene G239017? I found two localizations given for this gene.
A: According to the classification scheme used, a gene can have multiple localizations. Actually the localizations refer to the gene product (protein) rather than the gene itself (e.g. a cytoplasmic transcription factor that moves into the nucleus given a specific signal). Nevertheless, almost all of the proteins have only one localization, and we will ensure that each protein in the test set has only one localization.

Q: Just for completeness: will all functions and localizations of test examples come from the set of functions and localizations of training examples? There could be at least one different case, indicated to me by the occurrence of "TRANSPOSABLE ELEMENTS VIRAL AND PLASMID PROTEINS " in the file Genes_relation.names as one possible value for the function attribute, while I did not find this value for any training example.
A: It is possible that some cases are not represented in the training set. This reflects the real situation where genes of a given class have not been found or confirmed yet.

Q: I am a bit puzzled by some issues of reflexive and symmetrical interaction relations contained in Interactions_relation.data. As far as I can see, there are 44 cases of genes interacting with themselves. (Why not all?) Moreover, I found 14 gene pairs, where gene#1 interacts with gene#2, and also gene#2 with gene#1. (Again, why are these cases just sporadic?) Could you maybe give some background information about these matters?
A: Interactions are not necessarily reflexive: only certain protein molecules bind to copies of themselves, forming homo-dimers. All interactions are symmetric, however: if gene1 interacts with gene2, then gene2 interacts with gene1. We have tried to list each interaction only once, but in some cases both orderings made it into the final table. These should be considered duplicate records.
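
A minimal Python sketch of removing such duplicates follows. It assumes the interactions have been reduced to pairs of gene identifiers (any other fields in the table are ignored here), and the gene IDs in the example are made up.

```python
def dedupe_interactions(pairs):
    """Collapse (a, b) and (b, a) into a single undirected record.

    Self-interactions such as (a, a) are kept, since they are
    legitimate records. Order of first appearance is preserved.
    """
    seen = set()
    unique = []
    for a, b in pairs:
        key = (a, b) if a <= b else (b, a)  # canonical ordering of the pair
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


# Hypothetical gene IDs, for illustration only.
pairs = [
    ("G000002", "G000001"),
    ("G000001", "G000002"),  # same interaction, reversed order
    ("G000003", "G000003"),  # self-interaction
]
unique_pairs = dedupe_interactions(pairs)
```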

Q: On the thrombin data set, is there any particular order in the way the 139351 binary features were generated?
A: No.

Last Changed: June 12, 2001 by dpage@cs.wisc.edu