+---------------------------------+ | RSEM collectContigs: ws3 vs ws4 | +---------------------------------+ [Note: Filenames in this section are relative to /u/nathanae/summer11/rnaseq-eval/git/data-jul26/ws3ws4analysis.] ---- The assemblies ws3.fa and ws4.fa (produced by collectContigs with windowsize set to 3 and 4) differ only in that ws4.fa splits up one contig from ws3.fa. These contigs are as follows. In ws3.fa: >scaffold_25_contig_4 scaffold_25 tid=9388 pos=672 length=2123 number_of_contigs=12 CTGACCCAGATGAGGTGGCCAGGAGATGGGGAAAGAGGAAAAACAAACCTAAGATGAATT ATGAGAAACTGAGCCGTGGCCTTCGCTACTATTATGACAAAAATATCATCCACAAGACGG CGGGCAAGCGCTACGTATACCGCTTTGTGTGCGACCTGCAGAGCCTGCTGGGATACACCC CTGAAG In ws4.fa: >scaffold_25_contig_4 scaffold_25 tid=9388 pos=672 length=879 number_of_contigs=6 CTGACCCAGATGAGGTGGCCAGGAGATGGGGAAAGAGGAAAAACAAACCTAAGATGAATT ATGAGAAACTGAGCCGTGGCCTTCGCTACTATTATGACAAAAATATCATCCAC >scaffold_26_contig_0 scaffold_26 tid=9388 pos=1548 length=1247 number_of_contigs=7 CACAAGACGGCGGGCAAGCGCTACGTATACCGCTTTGTGTGCGACCTGCAGAGCCTGCTG GGATACACCCCTGAAG If we put each contig's sequence all on one line, we see: ws3: >scaffold_25_contig_4 CTGACCCAGATGAGGTGGCCAGGAGATGGGGAAAGAGGAAAAACAAACCTAAGATGAATTATGAGAAACTGAGCCGTGGCCTTCGCTACTATTATGACAAAAATATCATCCACAAGACGGCGGGCAAGCGCTACGTATACCGCTTTGTGTGCGACCTGCAGAGCCTGCTGGGATACACCCCTGAAG ws4: >scaffold_25_contig_4 CTGACCCAGATGAGGTGGCCAGGAGATGGGGAAAGAGGAAAAACAAACCTAAGATGAATTATGAGAAACTGAGCCGTGGCCTTCGCTACTATTATGACAAAAATATCATCCAC ws4: >scaffold_26_contig_0 ..............................................................................................................CACAAGACGGCGGGCAAGCGCTACGTATACCGCTTTGTGTGCGACCTGCAGAGCCTGCTGGGATACACCCCTGAAG So it is reasonable to conjecture that there are two reads, one of which ends with ...TATCATCCAC and one of which starts with CACAAGACGG..., so that the overlap is exactly three bases - so collectContigs joins the reads into one contig if windowSize is 3, but does not do so if windowSize is 4. ---- Indeed, by running: - cat sample_1.fq |grep -B 10 ATCATCCAC$ - cat sample_2.fq |grep -B 10 ATCATCCAC$ - cat sample_1.fq |dl-fastq-to-fasta |dl-fasta-reverse-complement |grep -B 10 ATCATCCAC$ - cat sample_2.fq |dl-fastq-to-fasta |dl-fasta-reverse-complement |grep -B 10 ATCATCCAC$ we see that exactly one read matches the end of ws4's scaffold_25_contig_4, namely the reverse complemented >SRR203276.44266061 length=76 GAAAAACAAACCTAAGATGAATTATGAGAAACTGAGCCGTGGCCTTCGCTACTATTATGA CAAAAATATCATCCAC and by running similar commands we see that exactly one read matches the beginning of ws4's scaffold_26_contig_0, namely @SRR203276.5739758 length=76 CACAAGACGGCGGGCAAGCGCTACGTATACCGCTTTGTGTGCGACCTGCAGAGCCTGCTGGGATACACCCCTGAAG ---- Next, we look at expression levels of the contigs mentioned above in ws3.fa and ws4.fa, as estimated by RSEM. For ws3 (ws3.expression.isoforms.results), by looking at scaffold_25_contig_0 0.00 0 0.00 0 scaffold_25_contig_1 0.00 0 0.00 0 scaffold_25_contig_10 0.00 0 0.00 0 scaffold_25_contig_2 0.00 0 0.00 0 scaffold_25_contig_3 0.00 0 0.00 0 scaffold_25_contig_4 0.00 0 0.00 0 scaffold_25_contig_5 0.00 0 0.00 0 scaffold_25_contig_6 2.00 0.00533127944549518 2.00 0.00778555709739731 scaffold_25_contig_7 0.00 0 0.00 0 scaffold_25_contig_8 1.00 0.000187000963952137 1.00 0.000364116893390841 scaffold_25_contig_9 0.00 0 0.00 0 we see that scaffold_25_contig_4 is considered unexpressed. For ws4 (ws4.expression.isoforms.results), by looking at scaffold_25_contig_0 0.00 0 0.00 0 scaffold_25_contig_1 0.00 0 0.00 0 scaffold_25_contig_2 0.00 0 0.00 0 scaffold_25_contig_3 0.00 0 0.00 0 scaffold_25_contig_4 0.00 0 0.00 0 scaffold_26_contig_0 0.00 0 0.00 0 scaffold_26_contig_1 0.00 0 0.00 0 scaffold_26_contig_2 2.00 0.00533099260219044 2.00 0.00778414123753352 scaffold_26_contig_3 0.00 0 0.00 0 scaffold_26_contig_4 1.00 0.00018700369857571 1.00 0.000364075588489139 scaffold_26_contig_5 0.00 0 0.00 0 scaffold_26_contig_6 0.00 0 0.00 0 we see that again both scaffold_26_contig_4 and scaffold_26_contig_0 are considered unexpressed. There are minor differences of expression of other contigs in ws3 versus ws4, but no substantial ones as far as I can see. ---- So the situation is that - ws3 and ws4 differ by just one contig being broken up in ws4; this appears correct based on the window size and the fact that just one - that contig (and the broken-up pieces) have no expression; this appears correct based on the fact that - ws3 has higher loglikelihood than ws4 +----------------------------------------+ | Trinity: pfl90 vs pfl100 (with mcl20) | +----------------------------------------+ [Note: Filenames in this section are relative to /u/nathanae/summer11/rnaseq-eval/git/data-jul26/pfl90pfl100analysis.] The pfl90 assembly has 211 contigs while the pfl100 assembly has 210 contigs. By sorting then diffing, we see that the sequence differences occur on the following lines: $ diff pfl90.fa.sorted pfl100.fa.sorted |head -n 20 4d3 < AA 66,67c65 < AACTCTTCTTGAATTCTACAGCAGGAGCAAGCGGTGGTGCTTACGAACACAGATATGTAG < AACTCTTCTTGAATTCTACAGCAGGAGCAAGCGGTGGTGCTTACGAACACAGATATGTAG --- > AACTCTTCTTGAATTCTACAGCAGGAGCAAGCGGTGGTGCTTAC 249d246 < AGCAAATATGCAACACAGATATGTAGAGCTCTTCTTGAATTCTACAGCAGGAGCAAGCGG 1410,1411c1407 < TGGTGCTTACGAACACAGATATGTAGAA < TGGTGCTTACGAACACAGATATGTAGAA --- > TGGTGCTTAC By looking in pfl90.fa and pfl100.fa, we see that the above differences correspond to the following contigs in pfl90.fa: >trinity_203 comp12_c0_seq1_FPKM_all:2837.677_FPKM_rel:1974.036_len:148_path:[0,28,156,158,157] AGCAAATATGCAACACAGATATGTAGAGCTCTTCTTGAATTCTACAGCAGGAGCAAGCGG TGGTGCTTACGAACACAGATATGTAGAACTCTTCTTGAATTCTACAGCAGGAGCAAGCGG TGGTGCTTACGAACACAGATATGTAGAA >trinity_204 comp12_c0_seq2_FPKM_all:829.993_FPKM_rel:414.996_len:88_path:[0,158,157] AGCAAATATGCAACACAGATATGTAGAGCTCTTCTTGAATTCTACAGCAGGAGCAAGCGG TGGTGCTTACGAACACAGATATGTAGAA >trinity_205 comp12_c0_seq3_FPKM_all:1304.274_FPKM_rel:601.973_len:182_path:[112,70,28,156,158,157] TGTCTTTTTTTAAGCTAACCTTGTATGCCTTTTCTCTCATTTCAGAACACAGATATGTAG AACTCTTCTTGAATTCTACAGCAGGAGCAAGCGGTGGTGCTTACGAACACAGATATGTAG AACTCTTCTTGAATTCTACAGCAGGAGCAAGCGGTGGTGCTTACGAACACAGATATGTAG AA being replaced by the following contigs in pfl100.fa: >trinity_178 comp12_c0_seq1_FPKM_all:3230.587_FPKM_rel:3090.126_len:130_path:[0,28,70,156] AGCAAATATGCAACACAGATATGTAGAGCTCTTCTTGAATTCTACAGCAGGAGCAAGCGG TGGTGCTTACGAACACAGATATGTAGAACTCTTCTTGAATTCTACAGCAGGAGCAAGCGG TGGTGCTTAC >trinity_179 comp12_c0_seq2_FPKM_all:1053.452_FPKM_rel:877.877_len:104_path:[112,70,156] TGTCTTTTTTTAAGCTAACCTTGTATGCCTTTTCTCTCATTTCAGAACACAGATATGTAG AACTCTTCTTGAATTCTACAGCAGGAGCAAGCGGTGGTGCTTAC Notes: - pfl100's trinity_178 is a prefix of pfl90's trinity_203 - pfl90's trinity_204 is like trinity_203, but missing a repeated section - pfl100's trinity_179 is a prefix of pfl90's trinity_205 Note that the paired fragment length is the "maximum length expected between fragment pairs". Right now it is not clear to me why changing this parameter leads to the above results. ---- By looking at pfl90.expression.isoforms.results we see that trinity_205 0.00 0 0.00 0 trinity_204 0.00 0 0.00 0 trinity_203 0.00 0 0.00 0 i.e., trinity_203, trinity_204, and trinity_205 all have zero expression. Similarly, by looking at pfl100.expression.isoforms.results, we see that the relevant transcripts have zero expression: trinity_179 0.00 0 0.00 0 trinity_178 0.00 0 0.00 0