Call the "simulated set" the output of rsem-simulate-reads, and call the "input set" the reconstructed input that I give to collectContigs. All paths are relative to /tier2/deweylab/nathanae/rnaseq-eval/data-sep7.single-0.001-sim. By running doublecheck_collectContigs_input.py we find that: num_not_found0 = 23328 num_not_found1 = 0 num_matches = 40130 num_unmatched0 = 5169 num_unmatched1 = 5169 Here: - num_not_found0 is the number of simulated reads whose sequence does not match any element of the input set. - num_not_found1 is the number of input reads whose sequence does not match any element of the simulated set. - num_matches is the number of reads with exactly the same sequence and metadata in the simulated set and the input set. - num_unmatched0 is the number of simulated reads whose sequence matches at least one element of the input set, but the metadata doesn't match. - num_unmatched1 is the number of input reads whose sequence matches at least one element of the simulated set, but the metadata doesn't match. QUESTION 1. Why are there simulated reads whose sequences match an element of the input set, but whose metadata don't match? I.e., why is num_unmatched0 not 0? We pick the following read from the simulated set whose sequence matched one in the input set, but whose metadata didn't match. $ cat doublecheck_collectContigs.unmatched0.txt | grep 49904_0_17296_141 49904_0_17296_141 CCGTGGTCTTCGGCGGAAGCAACCCTCACCGATCAAGCGCTTGAGAAGGGCCACAAAGGAGGCCCAACGCATGGAG This matches the following read from the input set: $ cat doublecheck_collectContigs.unmatched1.txt | grep CCGTGGTCTTCGGCGGAAGCAACCCTCACCGATCAAGCGCTTGAGAAGGGCCACAAAGGAGGCCCAACGCATGGAG 49904_0_29255_187 CCGTGGTCTTCGGCGGAAGCAACCCTCACCGATCAAGCGCTTGAGAAGGGCCACAAAGGAGGCCCAACGCATGGAG Let's look for the simulated read's id in the BAM file, since this is where the reconstructed metadata comes from: $ samtools view expression.rsem-expr-d/expression.sorted.bam | grep 49904_0_17296_141 49904_0_17296_141 0 XM_001475749.2 142 1 76M * 0 0 CCGTGGTCTTCGGCGGAAGCAACCCTCACCGATCAAGCGCTTGAGAAGGGCCACAAAGGAGGCCCAACGCATGGAG ?########################################################################### ZW:f:0.15749 49904_0_17296_141 0 XM_003086467.1 143 0 76M * 0 0 CCGTGGTCTTCGGCGGAAGCAACCCTCACCGATCAAGCGCTTGAGAAGGGCCACAAAGGAGGCCCAACGCATGGAG ?########################################################################### ZW:f:1.68051e-24 49904_0_17296_141 0 NM_009091.2 188 8 76M * 0 0 CCGTGGTCTTCGGCGGAAGCAACCCTCACCGATCAAGCGCTTGAGAAGGGCCACAAAGGAGGCCCAACGCATGGAG ?########################################################################### ZW:f:0.84251 Hypothesis: we are for some reason wrongly choosing the NM_009091.2 match instead of the XM_001475749.2 match, and this is what causes the incorrect metadata. I don't remember what all these fields mean, so let's look at the bam file in python. In [6]: matches = [rec for rec in pysam.Samfile("/tier2/deweylab/nathanae/rnaseq-eval/data-sep7.single-0.001-sim/expression.rsem-expr-d/expression.sorted.bam", "rb") if rec.qname == "49904_0_17296_141"] In [12]: for m in matches: print m 49904_0_17296_141 0 17295 141 1 [(0, 76)] -1 -1 76 CCGTGGTCTTCGGCGGAAGCAACCCTCACCGATCAAGCGCTTGAGAAGGGCCACAAAGGAGGCCCAACGCATGGAG ?########################################################################### [('ZW', 0.15749017894268036)] 49904_0_17296_141 0 17296 142 0 [(0, 76)] -1 -1 76 CCGTGGTCTTCGGCGGAAGCAACCCTCACCGATCAAGCGCTTGAGAAGGGCCACAAAGGAGGCCCAACGCATGGAG ?########################################################################### [('ZW', 1.6805095975314162e-24)] 49904_0_17296_141 0 29254 187 8 [(0, 76)] -1 -1 76 CCGTGGTCTTCGGCGGAAGCAACCCTCACCGATCAAGCGCTTGAGAAGGGCCACAAAGGAGGCCCAACGCATGGAG ?########################################################################### [('ZW', 0.8425098061561584)] In [14]: [m.mapq for m in matches] Out[14]: [1L, 0L, 8L] So the reason that we chose the NM_009091.2 match is that its alignment to the simulated read has a higher score according to RSEM than the "true" match does. But the metadata would have been correct if we had chosen the first one (note that we set sid=rec.rname+1, and 17296==17295+1). [Sidenote: Here are the two isoforms: $ dl-fasta-subset XM_001475749.2 < ../mouse.rna.fa | dl-fasta-reformat -w 0 >XM_001475749.2 PREDICTED: Mus musculus 40S ribosomal protein S15-like, transcript variant 1 (LOC100046223), mRNA. CAAGATGGCCGAAGTGGAGCAGAAGAAGAAGCGCACCTTCCGCAAGTTCACCTACCGTGGCGTAGACCTCGACCAACTGCTCGACATGTCCTATGAGCAACTGATGCAGCTGTACAGCGCCCGCCAGAGGAGGCGCCTGAACCGTGGTCTTCGGCGGAAGCAACACTCACTGCTCAAGCGCTTGAGAAAGGCCAAAAAGGAGGCACCACCCATGGAGAAGCCTGAGGTGGTGAAGACGCACCTGAGGGACATGATCATCCTGCCGGAGATGGTGGGTAGCATGGTGGGCATGTACAACGGCAAGACCTTCAACCAGGTGGAGATCAAACCAGAGATGATCGGCCACTACCTGGGCGAGTTCTCCATCACCTACAAACCCGTGAAGCACGGCCGGCCCGGGATCGGTGCCACCCACTCCTCCCGATTCATCCCCCTCAAGTAGCCGAGGCCAATAAAGACTGGTTTTGGTCCCTGGAAAAAAAAAAAAAGAA $ dl-fasta-subset NM_009091.2 < ../mouse.rna.fa | dl-fasta-reformat -w 0 >NM_009091.2 Mus musculus ribosomal protein S15 (Rps15), mRNA. GATAACTGCGCAAGCGCAGCCCGTACTTCCTTTTCCGAGTAACCGCCAAGATGGCCGAAGTGGAGCAGAAGAAGAAGCGCACCTTCCGCAAGTTCACCTACCGTGGCGTAGACCTCGACCAACTGCTCGACATGTCCTATGAGCAACTGATGCAGCTGTACAGCGCCCGCCAGAGGAGGCGCCTGAACCGTGGTCTTCGGCGGAAGCAACACTCACTGCTCAAGCGCTTGAGAAAGGCCAAAAAGGAGGCACCACCCATGGAGAAGCCTGAGGTGGTGAAGACGCACCTGAGGGACATGATCATCCTGCCGGAGATGGTGGGTAGCATGGTGGGCGTGTACAACGGCAAGACCTTCAACCAGGTGGAGATCAAACCAGAGATGATCGGCCACTACCTGGGCGAGTTCTCCATCACCTACAAACCCGTGAAGCACGGCCGGCCCGGGATCGGTGCCACCCACTCCTCCCGATTCATCCCCCTCAAGTAGCCGAGGCCAATAAAGACTGGTTTTGGTCCCTGGA Below are listed: - the substring of XM_001475749.2 from 141 to 141+76 - the substring of NM_009091.2 from 187 to 187+76 - the simulated read CCGTGGTCTTCGGCGGAAGCAACACTCACTGCTCAAGCGCTTGAGAAAGGCCAAAAAGGAGGCACCACCCATGGAG CCGTGGTCTTCGGCGGAAGCAACACTCACTGCTCAAGCGCTTGAGAAAGGCCAAAAAGGAGGCACCACCCATGGAG CCGTGGTCTTCGGCGGAAGCAACCCTCACCGATCAAGCGCTTGAGAAGGGCCACAAAGGAGGCCCAACGCATGGAG Note that the first two, i.e. the two isoform subsequences, are exactly the same. They differ at 8 bases from the read. So although I don't know why RSEM gave the alignment to NM_009091.2 a higher score than that to XM_001475749.2, it is at least not completely surprising that RSEM did not prefer the "true" alignment.] ---- Let's doublecheck the above with another random read from the simulated set whose sequence matched an element of the input set, but whose metadata didn't match, say 53001_1_18063_140. $ cat doublecheck_collectContigs.unmatched0.txt | grep 53001_1_18063_140 53001_1_18063_140 GATCAGAATGAATGATCCCATGAAACTACCGGACCAAGTTTGTTTTTGTTGTCCGGGATCCGAGATGTGTTTTCCT $ cat doublecheck_collectContigs.unmatched1.txt | grep GATCAGAATGAATGATCCCATGAAACTACCGGACCAAGTTTGTTTTTGTTGTCCGGGATCCGAGATGTGTTTTCCT 53001_1_18062_140 GATCAGAATGAATGATCCCATGAAACTACCGGACCAAGTTTGTTTTTGTTGTCCGGGATCCGAGATGTGTTTTCCT So everything matches except the SID is off by one. In [20]: matches = [rec for rec in pysam.Samfile("/tier2/deweylab/nathanae/rnaseq-eval/data-sep7.single-0.001-sim/expression.rsem-expr-d/expression.sorted.bam", "rb") if rec.qname == "53001_1_18063_140"] In [21]: for m in matches: print m ....: 53001_1_18063_140 0 4293 3466 0 [(0, 76)] -1 -1 76 GATCAGAATGAATGATCCCATGAAACTACCGGACCAAGTTTGTTTTTGTTGTCCGGGATCCGAGATGTGTTTTCCT CCCCCCCDBCCB@############################################################### [('ZW', 0.0002902685373555869)] 53001_1_18063_140 16 18061 423 3 [(0, 76)] -1 -1 76 AGGAAAACACATCTCGGATCCCGGACAACAAAAACAAACTTGGTCCGGTAGTTTCATGGGATCATTCATTCTGATC ###############################################################@BCCBDCCCCCCC [('ZW', 0.4998548626899719)] 53001_1_18063_140 16 18062 423 3 [(0, 76)] -1 -1 76 AGGAAAACACATCTCGGATCCCGGACAACAAAAACAAACTTGGTCCGGTAGTTTCATGGGATCATTCATTCTGATC ###############################################################@BCCBDCCCCCCC [('ZW', 0.4998548626899719)] In [28]: [m.rname+1 for m in matches] Out[28]: [4294, 18062, 18063] The SID associated with each alignment is the Out[28] above. So the problem is that we are choosing the second alignment instead of the third one. Note that In [29]: [m.mapq for m in matches] Out[29]: [0L, 3L, 3L] the quality of these two alignments is equal according to RSEM. So again it seems like what we did was reasonable. [Sidenote: The isoform with SID 18062 is XR_105779.1 and the isofrom with SID 18063 is XR_106980.1. We check $ dl-fasta-subset XR_105779.1 < ../mouse.rna.fa | dl-fasta-reverse-complement | dl-fasta-reformat -w 0 >XR_105779.1 PREDICTED: Mus musculus hypothetical LOC100503063 (LOC100503063), partial miscRNA. GCAACCTAGGCTTGTGATCAGAATGAATGATCCCAAGAAACTACTTGACCAAGTTTGTTTTTGTTGTCCTGGATTCGAGATGTGTTTTCTTCCTTGCTCTAAGACTGTTGATGTATGAGTGTGAAGAGGCTACAGAAACAATGCCCAGATTTTCACTGTAACTTTCCCTTTGCCCACACTGTAGAACTTGAGATTGTTCACTGATAGTGCGTCTTTAGTAAGGATGTGTTAAAATATAGCAGTCTTTTTAAAAGATTATGCAGTTCTCTATTTATTGTGCTGTGCCTGGTCCTAAGTGCAGCCGGTTAAACAAGTTTCATATGTATTTTTCCAGTGTTAAATCTCATACCTATTCCCTTTGGAAACTTCCATCCTGAACAATGAATAGAAGAGGCTATATAAATTGCCTCCTTATCCTTAAGATTTCACTATCTTTATGTTAAGAGTAATGTATAATTATTAAAATCTATGAAAAATAAAAAGTGGATTTAAATTAAAAAAAAAAAAAAAAAAG $ dl-fasta-subset XR_106980.1 < ../mouse.rna.fa | dl-fasta-reverse-complement | dl-fasta-reformat -w 0 >XR_106980.1 PREDICTED: Mus musculus hypothetical LOC100503063 (LOC100503063), partial miscRNA. GCAACCTAGGCTTGTGATCAGAATGAATGATCCCAAGAAACTACTTGACCAAGTTTGTTTTTGTTGTCCTGGATTCGAGATGTGTTTTCTTCCTTGCTCTAAGACTGTTGATGTATGAGTGTGAAGAGGCTACAGAAACAATGCCCAGATTTTCACTGTAACTTTCCCTTTGCCCACACTGTAGAACTTGAGATTGTTCACTGATAGTGCGTCTTTAGTAAGGATGTGTTAAAATATAGCAGTCTTTTTAAAAGATTATGCAGTTCTCTATTTATTGTGCTGTGCCTGGTCCTAAGTGCAGCCGGTTAAACAAGTTTCATATGTATTTTTCCAGTGTTAAATCTCATACCTATTCCCTTTGGAAACTTCCATCCTGAACAATGAATAGAAGAGGCTATATAAATTGCCTCCTTATCCTTAAGATTTCACTATCTTTATGTTAAGAGTAATGTATAATTATTAAAATCTATGAAAAATAAAAAGTGGATTTAAATTAAAAAAAAAAAAAAAAAAG Note that these two isoforms have identical sequences so it is quite reasonable that we might choose one instead of the other.] QUESTION 2. Why are there input reads whose sequences match an element of the simulated set, but whose metadata don't match? I.e., why is num_unmatched1 not 0? Also, why does num_unmatched0 == num_unmatched1? This is basically just becuase equality (and non-equality) of the metadata is symmetric. QUESTION 3. Why are there simulated reads whose sequences don't match any element of the input set? I.e., why is num_not_found0 not 0? This can happen if the read doesn't align to any isoform, or if all its alignments have a low score according to RSEM. Indeed, looking at doublecheck_collectContigs.not_found0.txt, which is a list of the simulated reads whose sequences don't match any element of the input set, we see that they are mostly noise reads or involve the poly(A) tail. $ cat doublecheck_collectContigs.not_found0.txt | wc -l 23328 $ cat doublecheck_collectContigs.not_found0.txt | cut -f 1 | grep ^[0-9]*_0_0_0$ | wc -l 18864 $ cat doublecheck_collectContigs.not_found0.txt | grep -v ^[0-9]*_0_0_0$'\t' | grep $'\t'TTTTT | wc -l 1684 $ cat doublecheck_collectContigs.not_found0.txt | grep -v ^[0-9]*_0_0_0$'\t' | grep AAAAA$ | wc -l 498 So of the 23328 total not_found0 reads, - 18864/23328=0.81 of them are noise reads - (1684+498)/23328=0.09 of them are not noise reads but probably involve the poly(A) tail - 0.1 of them are other This seems fairly reasonable, but can we explain the 0.1 of reads in the "other" category? If we look for alignments involving reads in the "other" category, we do find quite a few of them (sidenote: some reads in the "other" category don't have any alignments at all; for example read 22348_1_21613_1821 is like this.): $ for i in $(cat doublecheck_collectContigs.not_found0.txt | grep -v ^[0-9]*_0_0_0$'\t' | grep -v AAAAA$ | grep -v $'\t'TTTTT | cut -f 1); do samtools view expression.rsem-expr-d/expression.sorted.bam | grep ^$i$'\t'; done | wc -l 2575 However, almost all these alignments have MAPQ (rsem quality score) equal to 0: $ for i in $(cat doublecheck_collectContigs.not_found0.txt | grep -v ^[0-9]*_0_0_0$'\t' | grep -v AAAAA$ | grep -v $'\t'TTTTT | cut -f 1); do samtools view -q 1 expression.rsem-expr-d/expression.sorted.bam | grep ^$i$'\t'; done | wc -l 1 The single matching alignment is: 0_0_23229_2613 0 NM_008624.3 2614 100 76M * 0 0 GACGCTTTTGCTAGCTAGAGATCAAACTTCCTACTCACCAAACTGCCACCCGCCCCGATTTTTTTTTTTTTTTCTG CCBBCCCCCCBDB@@ACCCCCCDCCCCCCCDCCDCCBBCCCCC::??CCCCC8;:CCC@ACCCCCCCCCCCCCCCC ZW:f:1 Based on this alignment, the read 0_0_23229_2613 should have been included in the input set. It was omitted because of a bug in my code: I started looking for reads in some data structure with read indices starting at 1, but I should have started at 0; the read above has index 0, so it was skipped. I have fixed this bug.