+---------------------------------+
| RSEM collectContigs: ws3 vs ws4 |
+---------------------------------+

[Note: Filenames in this section are relative to
/u/nathanae/summer11/rnaseq-eval/git/data-jul26/ws3ws4analysis.]

----

The assemblies ws3.fa and ws4.fa (produced by collectContigs with windowsize
set to 3 and 4) differ only in that ws4.fa splits up one contig from ws3.fa.

These contigs are as follows. In ws3.fa: 
  >scaffold_25_contig_4 scaffold_25 tid=9388 pos=672 length=2123 number_of_contigs=12
  CTGACCCAGATGAGGTGGCCAGGAGATGGGGAAAGAGGAAAAACAAACCTAAGATGAATT
  ATGAGAAACTGAGCCGTGGCCTTCGCTACTATTATGACAAAAATATCATCCACAAGACGG
  CGGGCAAGCGCTACGTATACCGCTTTGTGTGCGACCTGCAGAGCCTGCTGGGATACACCC
  CTGAAG
In ws4.fa:
  >scaffold_25_contig_4 scaffold_25 tid=9388 pos=672 length=879 number_of_contigs=6
  CTGACCCAGATGAGGTGGCCAGGAGATGGGGAAAGAGGAAAAACAAACCTAAGATGAATT
  ATGAGAAACTGAGCCGTGGCCTTCGCTACTATTATGACAAAAATATCATCCAC
  >scaffold_26_contig_0 scaffold_26 tid=9388 pos=1548 length=1247 number_of_contigs=7
  CACAAGACGGCGGGCAAGCGCTACGTATACCGCTTTGTGTGCGACCTGCAGAGCCTGCTG
  GGATACACCCCTGAAG

If we put each contig's sequence all on one line, we see:
  ws3: >scaffold_25_contig_4 CTGACCCAGATGAGGTGGCCAGGAGATGGGGAAAGAGGAAAAACAAACCTAAGATGAATTATGAGAAACTGAGCCGTGGCCTTCGCTACTATTATGACAAAAATATCATCCACAAGACGGCGGGCAAGCGCTACGTATACCGCTTTGTGTGCGACCTGCAGAGCCTGCTGGGATACACCCCTGAAG
  ws4: >scaffold_25_contig_4 CTGACCCAGATGAGGTGGCCAGGAGATGGGGAAAGAGGAAAAACAAACCTAAGATGAATTATGAGAAACTGAGCCGTGGCCTTCGCTACTATTATGACAAAAATATCATCCAC
  ws4: >scaffold_26_contig_0 ..............................................................................................................CACAAGACGGCGGGCAAGCGCTACGTATACCGCTTTGTGTGCGACCTGCAGAGCCTGCTGGGATACACCCCTGAAG

So it is reasonable to conjecture that there are two reads, one of which ends
with ...TATCATCCAC and one of which starts with CACAAGACGG..., so that the
overlap is exactly three bases - so collectContigs joins the reads into one
contig if windowSize is 3, but does not do so if windowSize is 4.

----

Indeed, by running:
- cat sample_1.fq |grep -B 10 ATCATCCAC$
- cat sample_2.fq |grep -B 10 ATCATCCAC$
- cat sample_1.fq |dl-fastq-to-fasta |dl-fasta-reverse-complement |grep -B 10 ATCATCCAC$
- cat sample_2.fq |dl-fastq-to-fasta |dl-fasta-reverse-complement |grep -B 10 ATCATCCAC$
we see that exactly one read matches the end of ws4's scaffold_25_contig_4,
namely the reverse complemented
  >SRR203276.44266061 length=76
  GAAAAACAAACCTAAGATGAATTATGAGAAACTGAGCCGTGGCCTTCGCTACTATTATGA
  CAAAAATATCATCCAC
and by running similar commands we see that exactly one read matches the
beginning of ws4's scaffold_26_contig_0, namely
  @SRR203276.5739758 length=76
  CACAAGACGGCGGGCAAGCGCTACGTATACCGCTTTGTGTGCGACCTGCAGAGCCTGCTGGGATACACCCCTGAAG

----

Next, we look at expression levels of the contigs mentioned above in ws3.fa
and ws4.fa, as estimated by RSEM. For ws3 (ws3.expression.isoforms.results),
by looking at

scaffold_25_contig_0   0.00      0                     0.00      0
scaffold_25_contig_1   0.00      0                     0.00      0
scaffold_25_contig_10  0.00      0                     0.00      0
scaffold_25_contig_2   0.00      0                     0.00      0
scaffold_25_contig_3   0.00      0                     0.00      0
scaffold_25_contig_4   0.00      0                     0.00      0
scaffold_25_contig_5   0.00      0                     0.00      0
scaffold_25_contig_6   2.00      0.00533127944549518   2.00      0.00778555709739731
scaffold_25_contig_7   0.00      0                     0.00      0
scaffold_25_contig_8   1.00      0.000187000963952137  1.00      0.000364116893390841
scaffold_25_contig_9   0.00      0                     0.00      0

we see that scaffold_25_contig_4 is considered unexpressed. For ws4
(ws4.expression.isoforms.results), by looking at

scaffold_25_contig_0  0.00      0                     0.00      0
scaffold_25_contig_1  0.00      0                     0.00      0
scaffold_25_contig_2  0.00      0                     0.00      0
scaffold_25_contig_3  0.00      0                     0.00      0
scaffold_25_contig_4  0.00      0                     0.00      0
scaffold_26_contig_0  0.00      0                     0.00      0
scaffold_26_contig_1  0.00      0                     0.00      0
scaffold_26_contig_2  2.00      0.00533099260219044   2.00      0.00778414123753352
scaffold_26_contig_3  0.00      0                     0.00      0
scaffold_26_contig_4  1.00      0.00018700369857571   1.00      0.000364075588489139
scaffold_26_contig_5  0.00      0                     0.00      0
scaffold_26_contig_6  0.00      0                     0.00      0

we see that again both scaffold_26_contig_4 and scaffold_26_contig_0 are
considered unexpressed.

There are minor differences of expression of other contigs in ws3 versus ws4,
but no substantial ones as far as I can see.

----

So the situation is that 
- ws3 and ws4 differ by just one contig being broken up in ws4; this appears
  correct based on the window size and the fact that just one 
- that contig (and the broken-up pieces) have no expression; this appears
  correct based on the fact that 
- ws3 has higher loglikelihood than ws4


+----------------------------------------+
|  Trinity: pfl90 vs pfl100 (with mcl20) |
+----------------------------------------+

[Note: Filenames in this section are relative to
/u/nathanae/summer11/rnaseq-eval/git/data-jul26/pfl90pfl100analysis.]

The pfl90 assembly has 211 contigs while the pfl100 assembly has 210 contigs.
By sorting then diffing, we see that the sequence differences occur on the
following lines:

$ diff pfl90.fa.sorted pfl100.fa.sorted |head -n 20
4d3
< AA
66,67c65
< AACTCTTCTTGAATTCTACAGCAGGAGCAAGCGGTGGTGCTTACGAACACAGATATGTAG
< AACTCTTCTTGAATTCTACAGCAGGAGCAAGCGGTGGTGCTTACGAACACAGATATGTAG
---
> AACTCTTCTTGAATTCTACAGCAGGAGCAAGCGGTGGTGCTTAC
249d246
< AGCAAATATGCAACACAGATATGTAGAGCTCTTCTTGAATTCTACAGCAGGAGCAAGCGG
1410,1411c1407
< TGGTGCTTACGAACACAGATATGTAGAA
< TGGTGCTTACGAACACAGATATGTAGAA
---
> TGGTGCTTAC

By looking in pfl90.fa and pfl100.fa, we see that the above differences
correspond to the following contigs in pfl90.fa:

  >trinity_203 comp12_c0_seq1_FPKM_all:2837.677_FPKM_rel:1974.036_len:148_path:[0,28,156,158,157]
  AGCAAATATGCAACACAGATATGTAGAGCTCTTCTTGAATTCTACAGCAGGAGCAAGCGG
  TGGTGCTTACGAACACAGATATGTAGAACTCTTCTTGAATTCTACAGCAGGAGCAAGCGG
  TGGTGCTTACGAACACAGATATGTAGAA
  >trinity_204 comp12_c0_seq2_FPKM_all:829.993_FPKM_rel:414.996_len:88_path:[0,158,157]
  AGCAAATATGCAACACAGATATGTAGAGCTCTTCTTGAATTCTACAGCAGGAGCAAGCGG
  TGGTGCTTACGAACACAGATATGTAGAA
  >trinity_205 comp12_c0_seq3_FPKM_all:1304.274_FPKM_rel:601.973_len:182_path:[112,70,28,156,158,157]
  TGTCTTTTTTTAAGCTAACCTTGTATGCCTTTTCTCTCATTTCAGAACACAGATATGTAG
  AACTCTTCTTGAATTCTACAGCAGGAGCAAGCGGTGGTGCTTACGAACACAGATATGTAG
  AACTCTTCTTGAATTCTACAGCAGGAGCAAGCGGTGGTGCTTACGAACACAGATATGTAG
  AA

being replaced by the following contigs in pfl100.fa:

  >trinity_178 comp12_c0_seq1_FPKM_all:3230.587_FPKM_rel:3090.126_len:130_path:[0,28,70,156]
  AGCAAATATGCAACACAGATATGTAGAGCTCTTCTTGAATTCTACAGCAGGAGCAAGCGG
  TGGTGCTTACGAACACAGATATGTAGAACTCTTCTTGAATTCTACAGCAGGAGCAAGCGG
  TGGTGCTTAC
  >trinity_179 comp12_c0_seq2_FPKM_all:1053.452_FPKM_rel:877.877_len:104_path:[112,70,156]
  TGTCTTTTTTTAAGCTAACCTTGTATGCCTTTTCTCTCATTTCAGAACACAGATATGTAG
  AACTCTTCTTGAATTCTACAGCAGGAGCAAGCGGTGGTGCTTAC

Notes:
- pfl100's trinity_178 is a prefix of pfl90's trinity_203
- pfl90's trinity_204 is like trinity_203, but missing a repeated section
- pfl100's trinity_179 is a prefix of pfl90's  trinity_205

Note that the paired fragment length is the "maximum length expected between
fragment pairs". Right now it is not clear to me why changing this parameter
leads to the above results.

----

By looking at pfl90.expression.isoforms.results we see that
  trinity_205  0.00      0                     0.00      0
  trinity_204  0.00      0                     0.00      0
  trinity_203  0.00      0                     0.00      0
i.e., trinity_203, trinity_204, and trinity_205 all have zero expression.
Similarly, by looking at pfl100.expression.isoforms.results, we see that
the relevant transcripts have zero expression:
  trinity_179  0.00      0                     0.00      0
  trinity_178  0.00      0                     0.00      0