This page lists questions that were asked during Question Period 2, together with answers. Only questions of general interest appear here; numerous other questions have been omitted, as have alternative phrasings of questions that already appear on this page.
Q: For function prediction, the README file gives as an example of a
function '"Auxotrophies, carbon and"'. But this is not a function, it
is a phenotype. What's up with this?
A: This was my (David's) mistake. I wanted to give an example that
involved the double-quotes, but I mistakenly copied out the wrong
string. I should have used as the example '"CELL GROWTH CELL DIVISION AND DNA SYNTHESIS "'. So don't let this confuse you. But don't worry if
you omitted the double-quotes or comma (see next answer).
Q: The double-quotes in some of the function names look odd. Are we
supposed to include them? And what about the blanks at the end of some
of the strings, such as in '"CELL GROWTH CELL DIVISION AND DNA SYNTHESIS   "'? And what about commas? In the Genes_relation.data
file, there was a comma in this function after CELL GROWTH. What should we do?
A: Use the function names as they appear in the .data file, e.g., use
'"CELL GROWTH CELL DIVISION AND DNA SYNTHESIS  "'. But actually
we have written our scoring code so it can handle the case where you
omit the double quotes, and so that it only looks at enough of the
string to distinguish it from the other functions, so that no one
should be penalized for differences in punctuation or decisions
about trailing blanks, etc.
Q: Regarding Dataset 2, you suggested using features which depend
on functions/localizations of the genes that gene G interacts with.
While I certainly can do this for the training set, it will be impossible
to do for the test set, since I will not know the functions/localizations
of the test genes... Is there something I do not understand?
A: For any gene G in the test set, you will know function and localization
for the training set genes with which G interacts. So if a test set gene G
interacts with a training set gene G1 that has function F, then you might
infer the test set gene G has function F. You can also "pull yourself up
by your bootstraps" as follows. If G interacts with another test set gene G2,
and you have a high-confidence prediction that G2 has function F, you might
infer that G has function F.
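The neighbor-based inference described above can be sketched as follows. This is a minimal illustration, not our scoring or prediction code; the data structures (`interactions`, `train_functions`) and all gene names are hypothetical.

```python
def predict_functions(gene, interactions, train_functions):
    """Predict functions for a test gene by pooling the known
    functions of the training-set genes it interacts with."""
    predicted = set()
    for neighbor in interactions.get(gene, []):
        # Only training-set neighbors appear in train_functions;
        # test-set neighbors contribute nothing here.
        predicted |= train_functions.get(neighbor, set())
    return predicted

# Hypothetical example: test gene G interacts with training gene G1.
interactions = {"G": ["G1", "G2"]}
train_functions = {"G1": {"METABOLISM"}}  # G2 is a test gene, so unknown
print(predict_functions("G", interactions, train_functions))
```

The bootstrapping idea extends this: once you have a high-confidence prediction for a test-set neighbor such as G2, you could add its predicted functions to the pool on a second pass.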
Q: Will you weight the accuracy score for Task 2 and Task 3 in the same way
you do for Task 1?
A: No. Let's go through these in detail, starting with Task 3 because it
is easier. Every gene (actually, protein) in the test set has exactly one
localization. For each gene, your prediction is either correct or incorrect.
Accuracy is simply the fraction of localizations that are correctly
predicted. (A non-prediction for a gene counts as an incorrect prediction.)
The highest accuracy wins. Now let's
go to Task 2. Because most proteins have multiple functions, we consider
how many of the possible (protein,function) pairs are correctly predicted.
If you include a (protein,function) pair that is known to be correct, i.e., that
appears in our key, this is a true positive. If you include a pair that is
not in our key, this is a false positive. If you fail to include a pair
that appears in our key, this is a false negative. And you get credit for
each pair that you do not predict that also does not appear in our key -- this
is a true
negative. The accuracy of your predictions is just the standard
(true positive + true negative)/(true positive + true negative + false positive + false negative).
It is worth noting that we very seriously considered using a second
scoring function, a weighted accuracy, for this task. But we decided not
to use a second function because we saw no compelling reason to assume that
errors of omission are any more or less costly than errors of commission
for this task.
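The pair-based accuracy above can be written out explicitly. This is just a sketch of the formula, not our actual scoring code, and the gene/function names in the example are made up.

```python
def pair_accuracy(predicted_pairs, key_pairs, genes, functions):
    """Accuracy over all (gene, function) pairs: you get credit both
    for pairs you predict that are in the key (true positives) and
    for pairs you omit that are not in the key (true negatives)."""
    tp = fp = fn = tn = 0
    for g in genes:
        for f in functions:
            predicted = (g, f) in predicted_pairs
            in_key = (g, f) in key_pairs
            if predicted and in_key:
                tp += 1
            elif predicted and not in_key:
                fp += 1
            elif not predicted and in_key:
                fn += 1
            else:
                tn += 1
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical example: 2 genes x 2 functions = 4 possible pairs.
# One TP, one FN, two TNs, so accuracy is 3/4.
predicted = {("g1", "F1")}
key = {("g1", "F1"), ("g2", "F2")}
print(pair_accuracy(predicted, key, ["g1", "g2"], ["F1", "F2"]))  # → 0.75
```

Note that because most (gene, function) pairs are absent from the key, true negatives dominate, so even an empty prediction file scores fairly high; the interesting comparison is how far above that baseline you can get.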
Q: Regarding Task 3, if my model fails to predict localization for some
gene, how should I specify this?
A: Just don't include any entry for that gene in your results file.
But because of the scoring function, you might as well just guess a
localization for that gene (e.g., the localization that appears most
often in the training set).
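The suggested fallback can be sketched in a couple of lines. The `train_localizations` mapping and its contents are hypothetical.

```python
from collections import Counter

def default_localization(train_localizations):
    """Return the localization that appears most often in the
    training set, as a guess for genes the model can't classify."""
    return Counter(train_localizations.values()).most_common(1)[0][0]

# Hypothetical training data: "nucleus" is the majority class.
train_localizations = {"g1": "nucleus", "g2": "nucleus", "g3": "cytoplasm"}
print(default_localization(train_localizations))  # → nucleus
```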
Q: Regarding the test set for Dataset 2, if we use the composite variables
with the number of interactions... On the training set, I assume that these
variables only take a count of the number of interactions with training
genes. What about on the test set? It seems they should count only the
interactions with the training set genes, in order for the numbers to
be comparable. If this is not the case, could you please create a version
of the test set in which it is?
A: These variables (even when appearing in the test set) do indeed count
only interactions with training set genes, in order to maintain consistency.
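If you compute such features yourself, the convention described above can be sketched like this; the variable names (`interactions`, `train_genes`) are hypothetical.

```python
def interaction_count(gene, interactions, train_genes):
    """Count a gene's interactions with training-set genes only,
    so the feature is comparable between training and test sets."""
    return sum(1 for g in interactions.get(gene, []) if g in train_genes)

# Hypothetical example: test gene T1 interacts with two training
# genes and one test gene; only the training genes are counted.
interactions = {"T1": ["g1", "g2", "T2"]}
train_genes = {"g1", "g2", "g3"}
print(interaction_count("T1", interactions, train_genes))  # → 2
```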
Q: Regarding Task 1, in the first question period, you made reference to
the fact that there would be more inactives than actives in the test set.
I just want to make sure that you're sticking by that statement and that
I can count on that fact.
A: We're sticking by this statement -- we're using exactly the test set
that we had in mind all along. There are more inactives than actives.
But because the test set molecules were synthesized after the chemists
looked at the
activity levels of the training set molecules, you can expect there's a
higher fraction of actives in the test set than in the training set.
As we mentioned before, this makes matters tougher than under the
fairly standard assumption that the test set is drawn according to the
same distribution as the training set (or that it's a held-out set
drawn randomly, uniformly, without replacement). But we're using it
because, as stated before, it models the real world setting where
this type of task arises.
Can the data mining systems do better than the chemists alone (can they
make a contribution that will be useful to the chemists)? If your
predictions are strong on this test set, it indicates that your model
would have been useful to the chemists in choosing the next round of compounds
to make.