[All paths are relative to
/u/nathanae/summer11/rnaseq-eval/git/data-jul26/ws3ws4analysis.]

Here we are trying to figure out why there are zero-expression contigs in the
output of collectContigs as run by make_rsem_oracleset.py. We continue our
investigation of ws3 and ws4; here, ws3's scaffold_25_contig_4 has zero
expression and is split up, in ws4, into scaffold_25_contig_4 and
scaffold_26_contig_0, both of which have zero expression.

Based on the analysis in difference.txt, it looks like SRR203276.44266061 and
SRR203276.5739758 are the two reads from sample_(1|2).fq that the above
contigs are made from.

When we ran make_rsem_oracleset.py (which runs collectContigs) we gave
--rsem-expression=/tier2/deweylab/nathanae/rnaseq-eval/data-jul26/expression.rsem-expr-workingdir/expression
as a parameter. The bam file in this directory was used by
make_rsem_oracleset.py to determine which part of the genome each read best
aligns to, and then that info was given to collectContigs.

If we run
  samtools view /tier2/deweylab/nathanae/rnaseq-eval/data-jul26/expression.rsem-expr-workingdir/expression.sorted.bam|grep SRR203276.44266061
we see
  SRR203276.44266061      163     NM_011808.2     1217    100     76M     =       1476    335     CCGCTGCCCTGCCCAACCACAAGCCCAAGGGCACCTTCAAGGACTATGTGCGTGACCGTGCTGACCTCAACAAGGA CCCCCCCCCCCCCCCCCCCCCCCCCCCCCDCCCCCCCCCCCDBCCCCCBCCCDCCCBCBCDDDDABDBDBDBBDBB    ZW:f:1
  SRR203276.44266061      83      NM_011808.2     1476    100     76M     =       1217    -335    GAAAAACAAACCTAAGATGAATTATGAGAAACTGAGCCGTGGCCTTCGCTACTATTATGACAAAAATATCATCCAC B@@@CBCCDBBCACCDCBDCCBBCACCCCCDCDCCDCCCCCCCCCCCCCCCCCCCCCCB@BBCCCCCCCCCCCCCC    ZW:f:1
i.e., this read (both pairs) aligns to isoform NM_011808.2. Let's look for
this isoform in other assemblies. It does not appear in ../oracleset.fa. That
is not surprising, because:
- If we look at
    dl-fasta-subset NM_011808.2 < /tier2/deweylab/nathanae/rnaseq-eval/data-jul26/sample.rsem-workingdir/gbff.fa
  and we pick the following 25-mer from the output
    ATGACAAAAATATCATCCACAAGAC
  and we check for it or its reverse complement in sample_1.fq and
  sample_2.fq:
    ~/summer11/rnaseq-eval/git/fastq_grep.sh ATGACAAAAATATCATCCACAAGAC ../sample_1.fq ../sample_2.fq
  we find that there are no matches.
Furthermore, NM_011808.2 does not occur (as the best match) in any of the
assemblies produced by either collectContigs or trinity. 

So what exactly is happening right now again?
- (WH1) We align the reads to the refseq gbff isoforms. We find that read
  SRR203276.44266061 aligns to isoform NM_011808.2 (both elements of the pair
  align). This is found using rsem's bowtie defaults. [In fact, both elements
  of the pair are exact substrings of the isoform or its rc - see below.]
- (WH2) We run collectContigs with window size 3. We find a contig, namely ws3's
  scaffold_25_contig_4. 
- (WH3) We align the reads to the contig of ws3's assembly. We find that no
  reads align to contig scaffold_25_contig_4, according to bowtie run again
  with rsem's bowtie defaults.
[The alignment for (WH1) is in
/tier2/deweylab/nathanae/rnaseq-eval/data-jul26/expression.rsem-expr-workingdir/expression.sorted.bam,
and the alignment for (WH3) is in ./ws3.expression.sorted.bam (and
./ws3.expression.sam.gz).]

The sequences of all these reads/isoforms/contigs are as follows:
- SRR203276.44266061:
  $ cat ../sample_1.fq |dl-fastq-to-fasta|dl-fasta-subset SRR203276.44266061
  >SRR203276.44266061 length=76
  GTGGATGATATTTTTGTCATAATAGTAGCGAAGGCCACGGCTCAGTTTCTCATAATTCATCTTAGGTTTGTTTTTC
  $ cat ../sample_1.fq |dl-fastq-to-fasta|dl-fasta-subset SRR203276.44266061|dl-fasta-reverse-complement
  >SRR203276.44266061 length=76
  GAAAAACAAACCTAAGATGAATTATGAGAAACTGAGCCGTGGCCTTCGCTACTATTATGACAAAAATATCATCCAC
  $ cat ../sample_2.fq |dl-fastq-to-fasta|dl-fasta-subset SRR203276.44266061
  >SRR203276.44266061 length=76
  CCGCTGCCCTGCCCAACCACAAGCCCAAGGGCACCTTCAAGGACTATGTGCGTGACCGTGCTGACCTCAACAAGGA
  $ cat ../sample_2.fq |dl-fastq-to-fasta|dl-fasta-subset SRR203276.44266061|dl-fasta-reverse-complement
  >SRR203276.44266061 length=76
  TCCTTGTTGAGGTCAGCACGGTCACGCACATAGTCCTTGAAGGTGCCCTTGGGCTTGTGGTTGGGCAGGGCAGCGG
- NM_011808.2 is very long. It begins:
  >NM_011808.2
  CGGAGGCGAGCTGCACTGCGCCAGCGCCGGCCAGCCGTGCGAGCGAGCAAGGGAGCGAGCGCCCCGGACGGAGGAGGGAGCGAGCGCCCGGGACAGAGGAGCGAGCGGGCGGGCGAGGCG
  and ends:
  ATTTTTTATATACTGTAGGCTGATTTCATATTGTATTTTAAACTGTGTGAAATTAAAGAAACAAAGAAATTCATTCATAA
  In the middle somewhere occurs the fragment:
  TATGTGCGTGACCGTGCTGACCTCAACAAGGACAAGCCTGTCATTCCTGCTGCTGCCCTGGCTGGCTACACAGGAAGTGGGCCGATCCAGCTGTGGCAGTTTCTTCTGGAATTACTCACTGATAAGTCTTGTCAGTCCTTTATCAGCTGGACAGGAGATGGCTGGGAATTCAAGCTTTCTGACCCAGATGAGGTGGCCAGGAGATGGGGAAAGAGGAAAAACAAACCTAAGATGAATTATGAGAAACTGAGCCGTGGCCTTCGCTACTATTATGACAAAAATATCATCCACAAGACGGCGGGCAAGCGCTACGTATACCGCTTTGTGTGCGACCTGCAGAGCCTGCTGGGATACACCCCTGAAGAGCTGCACGCCATGCTGGATGTAAAGCCGGATGCTGACTAGTCATGGACAGACGCGCAGAAGGAAGGGGCTGGGGGAACCCTGCTGAGACCTTTCAAAGAGCAACCCTGTTGGTTGGACTCTTCATTTTTAATTGTTATTCAATGTTTTATTTTCCAGAACTCATTTTTCACATTC
  To see the whole sequence, do 
    dl-fasta-subset NM_011808.2 < /tier2/deweylab/nathanae/rnaseq-eval/data-jul26/sample.rsem-workingdir/reference.transcripts.fa |less
  Note: it is actually important to include the poly(A) tail. Search in
  /tier2/deweylab/nathanae/rnaseq-eval/data-jul26/sample.rsem-workingdir/reference.seq
  or use ./NM_011808.2.fasta for this.
- ws3's scaffold_25_contig_4:
  $ dl-fasta-subset scaffold_25_contig_4 <ws3.fa
  >scaffold_25_contig_4 scaffold_25 tid=9388 pos=672 length=2123 number_of_contigs=12
  CTGACCCAGATGAGGTGGCCAGGAGATGGGGAAAGAGGAAAAACAAACCTAAGATGAATTATGAGAAACTGAGCCGTGGCCTTCGCTACTATTATGACAAAAATATCATCCACAAGACGGCGGGCAAGCGCTACGTATACCGCTTTGTGTGCGACCTGCAGAGCCTGCTGGGATACACCCCTGAAG

Some notes on the above:
- ws3's scaffold_25_contig_4 is an exact substring of the fragment given above
  of the isoform NM_011808.2.
- The reverse complemented version of the read SRR203276.44266061 from
  sample_1.fq is an exact substring of scaffold_25_contig_4.
- Neither the read SRR203276.44266061 from sample_2.fq nor its reverse
  complement is an exact substring of scaffold_25_contig_4.
  However:
    $ dl-fasta-subset NM_011808.2 < /tier2/deweylab/nathanae/rnaseq-eval/data-jul26/sample.rsem-workingdir/reference.transcripts.fa|~/summer11/rnaseq-eval/git/fasta-join|grep CCGCTGCCCTGCCCAACCACAAGCCCAAGGGCACCTTCAAGGACTATGTGCGTGACCGTGCTGACCTCAACAAGGA|wc -l
    1
  i.e., this read (the non-reverse complemented version) does match the
  isoform NM_011808.2 somewhere else.

Maybe the above explains why in (WH3) above there are no alignments of reads
to scaffold_25_contig_4, even though in (WH1) there were alignments of reads
to the isoform NM_011808.2:
- if only one element of the pair of reads aligns to scaffold_25_contig_4,
  while the other element of the pair aligns somewhere else within
  NM_011808.2, then bowtie would align that pair to NM_011808.2 but not to
  scaffold_25_contig_4.
If this happened (or neither element of the pair aligned) not just for the
read SRR203276.44266061 but for all reads that align to NM_011808.2, then
there could be no expression for scaffold_25_contig_4 even though there is
NM_011808.2.

Indeed:
- From 
    $ zcat ws3.expression.sam.gz|grep bowtie
    @PG     ID:Bowtie       VN:0.12.7       CL:"bowtie -q --phred33-quals -n 2 -e 99999999 -l 25 -I 1 -X 1000 -p 8 -a -m 200 -S /tier2/deweylab/nathanae/rnaseq-eval/data-jul26//trinity-gridsearch/rsem_oracleset-summary.ws3.rsem-eval-workingdir/reference -1 /tier2/deweylab/nathanae/rnaseq-eval/data-jul26/sample_1.fq -2 /tier2/deweylab/nathanae/rnaseq-eval/data-jul26/sample_2.fq"
  it appears that bowtie was run in paired-read mode (options -1 and -2 are
  present), so bowtie does require that both reads align: from the manual,
    """
    A valid paired-end alignment satisfies these criteria:
    1. Both mates have a valid alignment according to the alignment policy
    defined by the -v/-n/-e/-l options.
    2. The relative orientation and position of the mates satisfy the
    constraints defined by the -I/-X/--fr/--rf/--ff options.
    """
  So indeed, it is correct that SRR203276.44266061, at least, doesn't align to
  ws3's scaffold_25_contig_4 but does align to the isoform NM_011808.2.
- From
    $ samtools view /tier2/deweylab/nathanae/rnaseq-eval/data-jul26/expression.rsem-expr-workingdir/expression.sorted.bam|grep NM_011808.2|cut -f 1 |sort|uniq|wc -l
    52
    $ samtools view /tier2/deweylab/nathanae/rnaseq-eval/data-jul26/expression.rsem-expr-workingdir/expression.sorted.bam|grep NM_011808.2|cut -f 1 |wc -l
    104
  we see that there are 52 pairs of reads  (out of 68627 total pairs) that
  align to NM_011808.2 using rsem's standard bowtie options.
  From
    $ samtools view /tier2/deweylab/nathanae/rnaseq-eval/data-jul26/expression.rsem-expr-workingdir/expression.sorted.bam|grep NM_011808.2|column - t|cut -f 6|grep 76M|wc -l
    104
  we see that they all have 76 matches and nothing else - i.e., all the reads
  that align to NM_011808.2 at all are exact substrings of it.
  Next, call the output of 
    $ samtools view /tier2/deweylab/nathanae/rnaseq-eval/data-jul26/expression.rsem-expr-workingdir/expression.sorted.bam|grep NM_011808.2|column - t|less -S -#2
  "the sam output for NM_011808.2".
  Note that:
  - The sequence
    CGGGGCAATAGCTCCTGAGCAGACTTTCAGAAGGGAGAGGTCTTCTCAGAAGCCTGTTGGACCTGGCTTGCAGAG
    (reads SRR203276.57814 and SRR203276.50638906) starts in fact at position
    (1-based) 1884 of NM_011808.2's sequence. In the sam output for
    NM_011808.2, this sequence occurs in lines with 83 in column 2 (FLAG) and
    1884 in column 4 (POS).
  - The sequence
    AAAAAAACTTAAGTGTCCGTTTTTTTTTTTTTTTTTAATAAAAAAAAAAAAAAATTTCCTTTTGTTTTGGGGTTT
    (read SRR203276.8083540) does not occur in NM_011808.2's sequence. Neither
    does its reverse complement
    AAACCCCAAAACAAAAGGAAATTTTTTTTTTTTTTTATTAAAAAAAAAAAAAAAAACGGACACTTAAGTTTTTTT
    However, the 25-base fragment AAAAAAACTTAAGTGTCCGTTTTTT occurs in
    NM_011808.2's sequence starting at position 1962. The rest of the sequence
    roughly matches. In the sam output for NM_011808.2, the sequence occurs in
    a line with 163 in column 2 (FLAG) 1962 in column 4 (POS).
  - For every pair of reads, one element of the pair has 163 in column 2, and
    one element of the pair has 83 in column 2.
  - ws3's scaffold_25_contig_4 has sequence
CTGACCCAGATGAGGTGGCCAGGAGATGGGGAAAGAGGAAAAACAAACCTAAGATGAATTATGAGAAACTGAGCCGTGGCCTTCGCTACTATTATGACAAAAATATCATCCACAAGACGGCGGGCAAGCGCTACGTATACCGCTTTGTGTGCGACCTGCAGAGCCTGCTGGGATACACCCCTGAAG
    which occurs in NM_011808.2's sequence starting at position 1439 and goes
    for 186 bases to position 186-1+1439=1624.
  Our hypothesis is that for every pair of reads in the sam output for
  NM_011808.2, at most one read is aligned to NM_011808.2 in the interval [1439,1624].
  Since all the reads are 76 bases long and start at the position given in
  column 4, we can check this hypothesis as follows:
    fn=/tier2/deweylab/nathanae/rnaseq-eval/data-jul26/expression.rsem-expr-workingdir/expression.sorted.bam
    pairs=`samtools view $fn|grep NM_011808.2|cut -f 1 |sort|uniq`
    reads="83 163"
    for i in $pairs; do
      for read in $reads; do
        echo -n "$i $read"
        startpos=`samtools view $fn|grep NM_011808.2|grep ^$i$'\t'$read|cut -f 4`
        endpos=`expr 76 - 1 + $startpos`
        if [ "$startpos" -ge "1439" ]; then
          if [ "$endpos" -le "1624" ]; then
            echo -n " match"
          fi
        fi
        echo
      done
    done
  This script prints two lines per read pair, one for each element of the
  pair. Each line has "match" next to it if the sequence is aligned to
  NM_011808.2 in the interval [1439,1624], i.e., would be aligned to ws3's
  scaffold_25_contig_4. If our hypothesis is correct, at most one of each
  pair's two lines should have a match next to it. This is indeed what
  happens.
- It remains to establish that the read pair elements that do align to
  NM_011808.2 in the interval [1439,1624] do sufficiently cover that region to
  justify collectContigs (with windowSize 3) making a contig there. The reads
  that align to the interval are:
    SRR203276.17204380 83 match
    SRR203276.44266061 83 match
    SRR203276.5739758 163 match
  These reads are:
    $ cat ../sample_1.fq|dl-fastq-to-fasta|dl-fasta-subset SRR203276.17204380 SRR203276.44266061|dl-fasta-reformat -w 0
    >SRR203276.17204380 length=76
    CGGCTCAGTTTCTCATAATTCATCTTAGGTTTGTTTTTCCTCTTTCCCCATCTCCTGGCCACCTCATCTGGGTCAG
    >SRR203276.44266061 length=76
    GTGGATGATATTTTTGTCATAATAGTAGCGAAGGCCACGGCTCAGTTTCTCATAATTCATCTTAGGTTTGTTTTTC
    $ cat ../sample_2.fq|dl-fastq-to-fasta|dl-fasta-subset SRR203276.5739758|dl-fasta-reformat -w 0
    >SRR203276.5739758 length=76
    CACAAGACGGCGGGCAAGCGCTACGTATACCGCTTTGTGTGCGACCTGCAGAGCCTGCTGGGATACACCCCTGAAG
  the _1.fq reads have reverse complement
    $ cat ../sample_1.fq|dl-fastq-to-fasta|dl-fasta-subset SRR203276.17204380 SRR203276.44266061|dl-fasta-reverse-complement|dl-fasta-reformat -w 0
    >SRR203276.17204380 length=76
    CTGACCCAGATGAGGTGGCCAGGAGATGGGGAAAGAGGAAAAACAAACCTAAGATGAATTATGAGAAACTGAGCCG
    >SRR203276.44266061 length=76
    GAAAAACAAACCTAAGATGAATTATGAGAAACTGAGCCGTGGCCTTCGCTACTATTATGACAAAAATATCATCCAC
  We make an alignment as follows:
  1 CTGACCCAGATGAGGTGGCCAGGAGATGGGGAAAGAGGAAAAACAAACCTAAGATGAATTATGAGAAACTGAGCCGTGGCCTTCGCTACTATTATGACAAAAATATCATCCACAAGACGGCGGGCAAGCGCTACGTATACCGCTTTGTGTGCGACCTGCAGAGCCTGCTGGGATACACCCCTGAAG
  2 CTGACCCAGATGAGGTGGCCAGGAGATGGGGAAAGAGGAAAAACAAACCTAAGATGAATTATGAGAAACTGAGCCG
  3                                      GAAAAACAAACCTAAGATGAATTATGAGAAACTGAGCCGTGGCCTTCGCTACTATTATGACAAAAATATCATCCAC
  4                                                                                                               CACAAGACGGCGGGCAAGCGCTACGTATACCGCTTTGTGTGCGACCTGCAGAGCCTGCTGGGATACACCCCTGAAG
  Here line 1 is ws3's scaffold_25_contig_4, lines 2 and 3 are the matches
  from sample_1.fq, reverse complemented, and line 4 is the match from
  sample_2.fq. All the matches are exact. We see that the contig is indeed covered
  with at least a 3-base overlap (windowSize 3) by reads, so collectContigs is
  justified to make this contig.