This page lists questions that were asked during Question Period 2, together with answers. Only questions of general interest appear here; numerous other questions have been omitted, as have alternative phrasings of questions that already appear on this page.
Q: For function prediction, the README file gives as an example of a
function '"Auxotrophies, carbon and"'. But this is not a function, it
is a phenotype. What's up with this?
A: This was my (David's) mistake. I wanted to give an example that
involved the double-quotes, but I mistakenly copied out the wrong
string. I should have used as the example '"CELL GROWTH CELL DIVISION AND DNA SYNTHESIS "'. So don't let this confuse you. But don't worry if
you omitted the double-quotes or comma (see next answer).
Q: The double-quotes in some of the function names look odd. Are we
supposed to include them? And what about the blanks at the end of some
of the strings, such as in '"CELL GROWTH CELL DIVISION AND DNA SYNTHESIS   "'? And what about commas? In the Genes_relation.data
file, there was a comma in this function after CELL GROWTH. What should we do?
A: Use the function names as they appear in the .data file, e.g., use
'"CELL GROWTH CELL DIVISION AND DNA SYNTHESIS  "'. But actually
we have written our scoring code so it can handle the case where you
omit the double quotes, and so that it only looks at enough of the
string to distinguish it from the other functions, so that no one
should be penalized for differences in punctuation or decisions
about trailing blanks, etc.
Q: Regarding Dataset 2, you suggested using features which depend
on functions/localizations of the genes that gene G interacts with.
While I certainly can do this for the training set, it will be impossible
to do for the test set, since I will not know the functions/localizations
of the test genes... Is there something I do not understand?
A: For any gene G in the test set, you will know function and localization
for the training set genes with which G interacts. So if a test set gene G
interacts with a training set gene G1 that has function F, then you might
infer the test set gene G has function F. You can also "pull yourself up
by your bootstraps" as follows. If G interacts with another test set gene G2,
and you have a high-confidence prediction that G2 has function F, you might
infer that G has function F.
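The neighbor-based inference described above can be sketched as follows. This is a minimal illustration, not our scoring or prediction code; the data structures (`interactions`, `train_functions`) and all gene names are hypothetical.

```python
def predict_functions(gene, interactions, train_functions):
    """Predict functions for a test gene by pooling the known
    functions of the training-set genes it interacts with."""
    predicted = set()
    for neighbor in interactions.get(gene, []):
        # Only training-set neighbors appear in train_functions;
        # test-set neighbors contribute nothing here.
        predicted |= train_functions.get(neighbor, set())
    return predicted

# Hypothetical example: test gene G interacts with training gene G1.
interactions = {"G": ["G1", "G2"]}
train_functions = {"G1": {"METABOLISM"}}  # G2 is a test gene, so unknown
print(predict_functions("G", interactions, train_functions))
```

The bootstrapping idea extends this: once you have a high-confidence prediction for a test-set neighbor such as G2, you could add its predicted functions to the pool on a second pass.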
Q: Will you weight the accuracy score for Task 2 and Task 3 in the same way
you do for Task 1?
A: No. Let's go through these in detail, starting with Task 3 because it
is easier. Every gene (actually, protein) in the test set has exactly one
localization. For each gene, your prediction is either correct or incorrect.
Accuracy is simply the fraction of localizations that are correctly
predicted. (A non-prediction for a gene counts as an incorrect prediction.)
The highest accuracy wins. Now let's
go to Task 2. Because most proteins have multiple functions, we consider
how many of the possible (protein,function) pairs are correctly predicted.
If you include a (protein,function) pair that is known to be correct, i.e., that
appears in our key, this is a true positive. If you include a pair that is
not in our key, this is a false positive. If you fail to include a pair
that appears in our key, this is a false negative. And you get credit for
each pair that you do not predict that also does not appear in our key -- this
is a true
negative. The accuracy of your predictions is just the standard
(true positive + true negative)/(true positive + true negative + false positive + false negative).
It is worth noting that we very seriously considered using a second
scoring function, a weighted accuracy, for this task. But we decided not
to use a second function because we saw no compelling reason to assume that
errors of omission are any more or less costly than errors of commission
for this task.
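The pair-based accuracy above can be written out explicitly. This is just a sketch of the formula, not our actual scoring code, and the gene/function names in the example are made up.

```python
def pair_accuracy(predicted_pairs, key_pairs, genes, functions):
    """Accuracy over all (gene, function) pairs: you get credit both
    for pairs you predict that are in the key (true positives) and
    for pairs you omit that are not in the key (true negatives)."""
    tp = fp = fn = tn = 0
    for g in genes:
        for f in functions:
            predicted = (g, f) in predicted_pairs
            in_key = (g, f) in key_pairs
            if predicted and in_key:
                tp += 1
            elif predicted and not in_key:
                fp += 1
            elif not predicted and in_key:
                fn += 1
            else:
                tn += 1
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical example: 2 genes x 2 functions = 4 possible pairs.
# One TP, one FN, two TNs, so accuracy is 3/4.
predicted = {("g1", "F1")}
key = {("g1", "F1"), ("g2", "F2")}
print(pair_accuracy(predicted, key, ["g1", "g2"], ["F1", "F2"]))  # → 0.75
```

Note that because most (gene, function) pairs are absent from the key, true negatives dominate, so even an empty prediction file scores fairly high; the interesting comparison is how far above that baseline you can get.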
Q: Regarding Task 3, if my model fails to predict localization for some
gene, how should I specify this?
A: Just don't include any entry for that gene in your results file.
But because of the scoring function, you might as well just guess a
localization for that gene (e.g., the localization that appears most
often in the training set).
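The suggested fallback can be sketched in a couple of lines. The `train_localizations` mapping and its contents are hypothetical.

```python
from collections import Counter

def default_localization(train_localizations):
    """Return the localization that appears most often in the
    training set, as a guess for genes the model can't classify."""
    return Counter(train_localizations.values()).most_common(1)[0][0]

# Hypothetical training data: "nucleus" is the majority class.
train_localizations = {"g1": "nucleus", "g2": "nucleus", "g3": "cytoplasm"}
print(default_localization(train_localizations))  # → nucleus
```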
Q: Regarding the test set for Dataset 2, if we use the composite variables
with the number of interactions... On the training set, I assume that these
variables only take a count of the number of interactions with training
genes. What about on the test set? It seems they should count only the
interactions with the training set genes, in order for the numbers to
be comparable. If this is not the case, could you please create a version
of the test set in which it is?
A: These variables (even when appearing in the test set) do indeed count
only interactions with training set genes, in order to maintain consistency.
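If you compute such features yourself, the convention described above can be sketched like this; the variable names (`interactions`, `train_genes`) are hypothetical.

```python
def interaction_count(gene, interactions, train_genes):
    """Count a gene's interactions with training-set genes only,
    so the feature is comparable between training and test sets."""
    return sum(1 for g in interactions.get(gene, []) if g in train_genes)

# Hypothetical example: test gene T1 interacts with two training
# genes and one test gene; only the training genes are counted.
interactions = {"T1": ["g1", "g2", "T2"]}
train_genes = {"g1", "g2", "g3"}
print(interaction_count("T1", interactions, train_genes))  # → 2
```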
Q: Regarding Task 1, in the first question period, you made reference to
the fact that there would be more inactives than actives in the test set.
I just want to make sure that you're sticking by that statement and that
I can count on that fact.
A: We're sticking by this statement -- we're using exactly the test set
that we had in mind all along. There are more inactives than actives.
But because the test set molecules were synthesized after the chemists
looked at the
activity levels of the training set molecules, you can expect there's a
higher fraction of actives in the test set than in the training set.
As we mentioned before, this makes matters tougher than under the
fairly standard assumption that the test set is drawn according to the
same distribution as the training set (or that it's a held-out set
drawn randomly, uniformly, without replacement). But we're using it
because, as stated before, it models the real world setting where
this type of task arises.
Can the data mining systems do better than the chemists alone (can they
make a contribution that will be useful to the chemists)? If your
predictions are strong on this test set, it indicates that your model
would have been useful to the chemists in choosing the next round of compounds
to make.