Call the "simulated set" the output of rsem-simulate-reads, and call the
"input set" the reconstructed input that I give to collectContigs. All paths
are relative to
/tier2/deweylab/nathanae/rnaseq-eval/data-sep7.single-0.001-sim.

By running doublecheck_collectContigs_input.py we find that:

num_not_found0 = 23328
num_not_found1 = 0
num_matches = 40130
num_unmatched0 = 5169
num_unmatched1 = 5169

Here:
- num_not_found0 is the number of simulated reads whose sequence does not
  match any element of the input set.
- num_not_found1 is the number of input reads whose sequence does not match
  any element of the simulated set.
- num_matches is the number of reads with exactly the same sequence and
  metadata in the simulated set and the input set.
- num_unmatched0 is the number of simulated reads whose sequence matches at
  least one element of the input set, but the metadata doesn't match.
- num_unmatched1 is the number of input reads whose sequence matches at least
  one element of the simulated set, but the metadata doesn't match.


QUESTION 1. Why are there simulated reads whose sequences match an element of
the input set, but whose metadata don't match? I.e., why is num_unmatched0 not
0?

We pick the following read from the simulated set whose sequence matched one
in the input set, but whose metadata didn't match.

$ cat doublecheck_collectContigs.unmatched0.txt | grep 49904_0_17296_141
49904_0_17296_141       CCGTGGTCTTCGGCGGAAGCAACCCTCACCGATCAAGCGCTTGAGAAGGGCCACAAAGGAGGCCCAACGCATGGAG

This matches the following read from the input set:

$ cat doublecheck_collectContigs.unmatched1.txt | grep CCGTGGTCTTCGGCGGAAGCAACCCTCACCGATCAAGCGCTTGAGAAGGGCCACAAAGGAGGCCCAACGCATGGAG
49904_0_29255_187       CCGTGGTCTTCGGCGGAAGCAACCCTCACCGATCAAGCGCTTGAGAAGGGCCACAAAGGAGGCCCAACGCATGGAG

Let's look for the simulated read's id in the BAM file, since this is where
the reconstructed metadata comes from:

$ samtools view expression.rsem-expr-d/expression.sorted.bam | grep 49904_0_17296_141
49904_0_17296_141       0       XM_001475749.2  142     1       76M     *       0       0       CCGTGGTCTTCGGCGGAAGCAACCCTCACCGATCAAGCGCTTGAGAAGGGCCACAAAGGAGGCCCAACGCATGGAG  ?########################################################################### ZW:f:0.15749
49904_0_17296_141       0       XM_003086467.1  143     0       76M     *       0       0       CCGTGGTCTTCGGCGGAAGCAACCCTCACCGATCAAGCGCTTGAGAAGGGCCACAAAGGAGGCCCAACGCATGGAG  ?########################################################################### ZW:f:1.68051e-24
49904_0_17296_141       0       NM_009091.2     188     8       76M     *       0       0       CCGTGGTCTTCGGCGGAAGCAACCCTCACCGATCAAGCGCTTGAGAAGGGCCACAAAGGAGGCCCAACGCATGGAG  ?########################################################################### ZW:f:0.84251

Hypothesis: we are for some reason wrongly choosing the NM_009091.2 match
instead of the XM_001475749.2 match, and this is what causes the incorrect
metadata.

I don't remember what all these fields mean, so let's look at the bam file in
python.

In [6]: matches = [rec for rec in pysam.Samfile("/tier2/deweylab/nathanae/rnaseq-eval/data-sep7.single-0.001-sim/expression.rsem-expr-d/expression.sorted.bam", "rb") if rec.qname == "49904_0_17296_141"]

In [12]: for m in matches: print m
   49904_0_17296_141       0       17295   141     1       [(0, 76)]       -1      -1      76      CCGTGGTCTTCGGCGGAAGCAACCCTCACCGATCAAGCGCTTGAGAAGGGCCACAAAGGAGGCCCAACGCATGGAG  ?########################################################################### [('ZW', 0.15749017894268036)]
   49904_0_17296_141       0       17296   142     0       [(0, 76)]       -1      -1      76      CCGTGGTCTTCGGCGGAAGCAACCCTCACCGATCAAGCGCTTGAGAAGGGCCACAAAGGAGGCCCAACGCATGGAG  ?########################################################################### [('ZW', 1.6805095975314162e-24)]
   49904_0_17296_141       0       29254   187     8       [(0, 76)]       -1      -1      76      CCGTGGTCTTCGGCGGAAGCAACCCTCACCGATCAAGCGCTTGAGAAGGGCCACAAAGGAGGCCCAACGCATGGAG  ?########################################################################### [('ZW', 0.8425098061561584)]

In [14]: [m.mapq for m in matches]
Out[14]: [1L, 0L, 8L]

So the reason that we chose the NM_009091.2 match is that its alignment to the
simulated read has a higher score according to RSEM than the "true" match does.

But the metadata would have been correct if we had chosen the first one (note
that we set sid=rec.rname+1, and 17296==17295+1).

[Sidenote: Here are the two isoforms:

$ dl-fasta-subset XM_001475749.2 < ../mouse.rna.fa | dl-fasta-reformat -w 0
>XM_001475749.2 PREDICTED: Mus musculus 40S ribosomal protein S15-like, transcript variant 1 (LOC100046223), mRNA.
CAAGATGGCCGAAGTGGAGCAGAAGAAGAAGCGCACCTTCCGCAAGTTCACCTACCGTGGCGTAGACCTCGACCAACTGCTCGACATGTCCTATGAGCAACTGATGCAGCTGTACAGCGCCCGCCAGAGGAGGCGCCTGAACCGTGGTCTTCGGCGGAAGCAACACTCACTGCTCAAGCGCTTGAGAAAGGCCAAAAAGGAGGCACCACCCATGGAGAAGCCTGAGGTGGTGAAGACGCACCTGAGGGACATGATCATCCTGCCGGAGATGGTGGGTAGCATGGTGGGCATGTACAACGGCAAGACCTTCAACCAGGTGGAGATCAAACCAGAGATGATCGGCCACTACCTGGGCGAGTTCTCCATCACCTACAAACCCGTGAAGCACGGCCGGCCCGGGATCGGTGCCACCCACTCCTCCCGATTCATCCCCCTCAAGTAGCCGAGGCCAATAAAGACTGGTTTTGGTCCCTGGAAAAAAAAAAAAAGAA

$ dl-fasta-subset NM_009091.2 < ../mouse.rna.fa | dl-fasta-reformat -w 0
>NM_009091.2 Mus musculus ribosomal protein S15 (Rps15), mRNA.
GATAACTGCGCAAGCGCAGCCCGTACTTCCTTTTCCGAGTAACCGCCAAGATGGCCGAAGTGGAGCAGAAGAAGAAGCGCACCTTCCGCAAGTTCACCTACCGTGGCGTAGACCTCGACCAACTGCTCGACATGTCCTATGAGCAACTGATGCAGCTGTACAGCGCCCGCCAGAGGAGGCGCCTGAACCGTGGTCTTCGGCGGAAGCAACACTCACTGCTCAAGCGCTTGAGAAAGGCCAAAAAGGAGGCACCACCCATGGAGAAGCCTGAGGTGGTGAAGACGCACCTGAGGGACATGATCATCCTGCCGGAGATGGTGGGTAGCATGGTGGGCGTGTACAACGGCAAGACCTTCAACCAGGTGGAGATCAAACCAGAGATGATCGGCCACTACCTGGGCGAGTTCTCCATCACCTACAAACCCGTGAAGCACGGCCGGCCCGGGATCGGTGCCACCCACTCCTCCCGATTCATCCCCCTCAAGTAGCCGAGGCCAATAAAGACTGGTTTTGGTCCCTGGA

Below are listed:
- the substring of XM_001475749.2 from 141 to 141+76
- the substring of NM_009091.2 from 187 to 187+76
- the simulated read
CCGTGGTCTTCGGCGGAAGCAACACTCACTGCTCAAGCGCTTGAGAAAGGCCAAAAAGGAGGCACCACCCATGGAG
CCGTGGTCTTCGGCGGAAGCAACACTCACTGCTCAAGCGCTTGAGAAAGGCCAAAAAGGAGGCACCACCCATGGAG
CCGTGGTCTTCGGCGGAAGCAACCCTCACCGATCAAGCGCTTGAGAAGGGCCACAAAGGAGGCCCAACGCATGGAG
Note that the first two, i.e. the two isoform subsequences, are exactly the
same. They differ at 8 bases from the read.

So although I don't know why RSEM gave the alignment to NM_009091.2 a higher
score than that to XM_001475749.2, it is at least not completely surprising
that RSEM did not prefer the "true" alignment.]

----

Let's doublecheck the above with another random read from the simulated set
whose sequence matched an element of the input set, but whose metadata didn't
match, say 53001_1_18063_140.

$ cat doublecheck_collectContigs.unmatched0.txt | grep 53001_1_18063_140
53001_1_18063_140       GATCAGAATGAATGATCCCATGAAACTACCGGACCAAGTTTGTTTTTGTTGTCCGGGATCCGAGATGTGTTTTCCT

$ cat doublecheck_collectContigs.unmatched1.txt | grep GATCAGAATGAATGATCCCATGAAACTACCGGACCAAGTTTGTTTTTGTTGTCCGGGATCCGAGATGTGTTTTCCT
53001_1_18062_140       GATCAGAATGAATGATCCCATGAAACTACCGGACCAAGTTTGTTTTTGTTGTCCGGGATCCGAGATGTGTTTTCCT

So everything matches except the SID is off by one.

In [20]: matches = [rec for rec in pysam.Samfile("/tier2/deweylab/nathanae/rnaseq-eval/data-sep7.single-0.001-sim/expression.rsem-expr-d/expression.sorted.bam", "rb") if rec.qname == "53001_1_18063_140"]

In [21]: for m in matches: print m
   ....:
   53001_1_18063_140       0       4293    3466    0       [(0, 76)]       -1      -1      76      GATCAGAATGAATGATCCCATGAAACTACCGGACCAAGTTTGTTTTTGTTGTCCGGGATCCGAGATGTGTTTTCCT  CCCCCCCDBCCB@############################################################### [('ZW', 0.0002902685373555869)]
   53001_1_18063_140       16      18061   423     3       [(0, 76)]       -1      -1      76      AGGAAAACACATCTCGGATCCCGGACAACAAAAACAAACTTGGTCCGGTAGTTTCATGGGATCATTCATTCTGATC  ###############################################################@BCCBDCCCCCCC [('ZW', 0.4998548626899719)]
   53001_1_18063_140       16      18062   423     3       [(0, 76)]       -1      -1      76      AGGAAAACACATCTCGGATCCCGGACAACAAAAACAAACTTGGTCCGGTAGTTTCATGGGATCATTCATTCTGATC  ###############################################################@BCCBDCCCCCCC [('ZW', 0.4998548626899719)]

In [28]: [m.rname+1 for m in matches]
Out[28]: [4294, 18062, 18063]

The SID associated with each alignment is the Out[28] above. So the problem is
that we are choosing the second alignment instead of the third one. Note that

In [29]: [m.mapq for m in matches]
Out[29]: [0L, 3L, 3L]

the quality of these two alignments is equal according to RSEM. So again it
seems like what we did was reasonable. 

[Sidenote: The isoform with SID 18062 is XR_105779.1 and the isofrom with SID
18063 is XR_106980.1. We check

$ dl-fasta-subset XR_105779.1 < ../mouse.rna.fa | dl-fasta-reverse-complement | dl-fasta-reformat -w 0
>XR_105779.1 PREDICTED: Mus musculus hypothetical LOC100503063 (LOC100503063), partial miscRNA.
GCAACCTAGGCTTGTGATCAGAATGAATGATCCCAAGAAACTACTTGACCAAGTTTGTTTTTGTTGTCCTGGATTCGAGATGTGTTTTCTTCCTTGCTCTAAGACTGTTGATGTATGAGTGTGAAGAGGCTACAGAAACAATGCCCAGATTTTCACTGTAACTTTCCCTTTGCCCACACTGTAGAACTTGAGATTGTTCACTGATAGTGCGTCTTTAGTAAGGATGTGTTAAAATATAGCAGTCTTTTTAAAAGATTATGCAGTTCTCTATTTATTGTGCTGTGCCTGGTCCTAAGTGCAGCCGGTTAAACAAGTTTCATATGTATTTTTCCAGTGTTAAATCTCATACCTATTCCCTTTGGAAACTTCCATCCTGAACAATGAATAGAAGAGGCTATATAAATTGCCTCCTTATCCTTAAGATTTCACTATCTTTATGTTAAGAGTAATGTATAATTATTAAAATCTATGAAAAATAAAAAGTGGATTTAAATTAAAAAAAAAAAAAAAAAAG

$ dl-fasta-subset XR_106980.1 < ../mouse.rna.fa | dl-fasta-reverse-complement | dl-fasta-reformat -w 0
>XR_106980.1 PREDICTED: Mus musculus hypothetical LOC100503063 (LOC100503063), partial miscRNA.
GCAACCTAGGCTTGTGATCAGAATGAATGATCCCAAGAAACTACTTGACCAAGTTTGTTTTTGTTGTCCTGGATTCGAGATGTGTTTTCTTCCTTGCTCTAAGACTGTTGATGTATGAGTGTGAAGAGGCTACAGAAACAATGCCCAGATTTTCACTGTAACTTTCCCTTTGCCCACACTGTAGAACTTGAGATTGTTCACTGATAGTGCGTCTTTAGTAAGGATGTGTTAAAATATAGCAGTCTTTTTAAAAGATTATGCAGTTCTCTATTTATTGTGCTGTGCCTGGTCCTAAGTGCAGCCGGTTAAACAAGTTTCATATGTATTTTTCCAGTGTTAAATCTCATACCTATTCCCTTTGGAAACTTCCATCCTGAACAATGAATAGAAGAGGCTATATAAATTGCCTCCTTATCCTTAAGATTTCACTATCTTTATGTTAAGAGTAATGTATAATTATTAAAATCTATGAAAAATAAAAAGTGGATTTAAATTAAAAAAAAAAAAAAAAAAG

Note that these two isoforms have identical sequences so it is quite reasonable
that we might choose one instead of the other.]


QUESTION 2. Why are there input reads whose sequences match an element of the
simulated set, but whose metadata don't match? I.e., why is num_unmatched1 not
0? Also, why does num_unmatched0 == num_unmatched1?

This is basically just becuase equality (and non-equality) of the metadata is
symmetric.


QUESTION 3. Why are there simulated reads whose sequences don't match any
element of the input set? I.e., why is num_not_found0 not 0?

This can happen if the read doesn't align to any isoform, or if all its
alignments have a low score according to RSEM. Indeed, looking at
doublecheck_collectContigs.not_found0.txt, which is a list of the simulated
reads whose sequences don't match any element of the input set, we see that
they are mostly noise reads or involve the poly(A) tail.

$ cat doublecheck_collectContigs.not_found0.txt | wc -l
23328
$ cat doublecheck_collectContigs.not_found0.txt | cut -f 1 | grep ^[0-9]*_0_0_0$ | wc -l
18864
$ cat doublecheck_collectContigs.not_found0.txt | grep -v ^[0-9]*_0_0_0$'\t' | grep $'\t'TTTTT | wc -l
1684
$ cat doublecheck_collectContigs.not_found0.txt | grep -v ^[0-9]*_0_0_0$'\t' | grep AAAAA$ | wc -l
498

So of the 23328 total not_found0 reads, 
- 18864/23328=0.81 of them are noise reads
- (1684+498)/23328=0.09 of them are not noise reads but probably involve the poly(A) tail
- 0.1 of them are other

This seems fairly reasonable, but can we explain the 0.1 of reads in the
"other" category? If we look for alignments involving reads in the "other"
category, we do find quite a few of them (sidenote: some reads in the "other"
category don't have any alignments at all; for example read 22348_1_21613_1821
is like this.):

$ for i in $(cat doublecheck_collectContigs.not_found0.txt | grep -v ^[0-9]*_0_0_0$'\t' | grep -v AAAAA$ | grep -v $'\t'TTTTT | cut -f 1); do samtools view expression.rsem-expr-d/expression.sorted.bam | grep ^$i$'\t'; done | wc -l
2575

However, almost all these alignments have MAPQ (rsem quality score) equal to 0:

$ for i in $(cat doublecheck_collectContigs.not_found0.txt | grep -v ^[0-9]*_0_0_0$'\t' | grep -v AAAAA$ | grep -v $'\t'TTTTT | cut -f 1); do samtools view -q 1 expression.rsem-expr-d/expression.sorted.bam | grep ^$i$'\t'; done | wc -l
1

The single matching alignment is:
0_0_23229_2613  0       NM_008624.3     2614   100      76M     *       0 0 GACGCTTTTGCTAGCTAGAGATCAAACTTCCTACTCACCAAACTGCCACCCGCCCCGATTTTTTTTTTTTTTTCTG CCBBCCCCCCBDB@@ACCCCCCDCCCCCCCDCCDCCBBCCCCC::??CCCCC8;:CCC@ACCCCCCCCCCCCCCCC ZW:f:1

Based on this alignment, the read 0_0_23229_2613 should have been included in
the input set. It was omitted because of a bug in my code: I started looking
for reads in some data structure with read indices starting at 1, but I should
have started at 0; the read above has index 0, so it was skipped. I have fixed
this bug.