[All paths are relative to /u/nathanae/summer11/rnaseq-eval/git/data-jul26/ws3ws4analysis.] Here we are trying to figure out why there are zero-expression contigs in the output of collectContigs as run by make_rsem_oracleset.py. We continue our investigation of ws3 and ws4; here, ws3's scaffold_25_contig_4 has zero expression and is split up, in ws4, into scaffold_25_contig_4 and scaffold_26_contig_0, both of which have zero expression. Based on the analysis in difference.txt, it looks like SRR203276.44266061 and SRR203276.5739758 are the two reads from sample_(1|2).fq that the above contigs are made from. When we ran make_rsem_oracleset.py (which runs collectContigs) we gave --rsem-expression=/tier2/deweylab/nathanae/rnaseq-eval/data-jul26/expression.rsem-expr-workingdir/expression as a parameter. The bam file in this directory was used by make_rsem_oracleset.py to determine which part of the genome each read best aligns to, and then that info was given to collectContigs. If we run samtools view /tier2/deweylab/nathanae/rnaseq-eval/data-jul26/expression.rsem-expr-workingdir/expression.sorted.bam|grep SRR203276.44266061 we see SRR203276.44266061 163 NM_011808.2 1217 100 76M = 1476 335 CCGCTGCCCTGCCCAACCACAAGCCCAAGGGCACCTTCAAGGACTATGTGCGTGACCGTGCTGACCTCAACAAGGA CCCCCCCCCCCCCCCCCCCCCCCCCCCCCDCCCCCCCCCCCDBCCCCCBCCCDCCCBCBCDDDDABDBDBDBBDBB ZW:f:1 SRR203276.44266061 83 NM_011808.2 1476 100 76M = 1217 -335 GAAAAACAAACCTAAGATGAATTATGAGAAACTGAGCCGTGGCCTTCGCTACTATTATGACAAAAATATCATCCAC B@@@CBCCDBBCACCDCBDCCBBCACCCCCDCDCCDCCCCCCCCCCCCCCCCCCCCCCB@BBCCCCCCCCCCCCCC ZW:f:1 i.e., this read (both pairs) aligns to isoform NM_011808.2. Let's look for this isoform in other assemblies. It does not appear in ../oracleset.fa. That is not surprising, because: - If we look at dl-fasta-subset NM_011808.2 < /tier2/deweylab/nathanae/rnaseq-eval/data-jul26/sample.rsem-workingdir/gbff.fa and we pick the following 25-mer from the output ATGACAAAAATATCATCCACAAGAC and we check for it or its reverse complement in sample_1.fq and sample_2.fq: ~/summer11/rnaseq-eval/git/fastq_grep.sh ATGACAAAAATATCATCCACAAGAC ../sample_1.fq ../sample_2.fq we find that there are no matches. Furthermore, NM_011808.2 does not occur (as the best match) in any of the assemblies produced by either collectContigs or trinity. So what exactly is happening right now again? - (WH1) We align the reads to the refseq gbff isoforms. We find that read SRR203276.44266061 aligns to isoform NM_011808.2 (both elements of the pair align). This is found using rsem's bowtie defaults. [In fact, both elements of the pair are exact substrings of the isoform or its rc - see below.] - (WH2) We run collectContigs with window size 3. We find a contig, namely ws3's scaffold_25_contig_4. - (WH3) We align the reads to the contig of ws3's assembly. We find that no reads align to contig scaffold_25_contig_4, according to bowtie run again with rsem's bowtie defaults. [The alignment for (WH1) is in /tier2/deweylab/nathanae/rnaseq-eval/data-jul26/expression.rsem-expr-workingdir/expression.sorted.bam, and the alignment for (WH3) is in ./ws3.expression.sorted.bam (and ./ws3.expression.sam.gz).] The sequences of all these reads/isoforms/contigs are as follows: - SRR203276.44266061: $ cat ../sample_1.fq |dl-fastq-to-fasta|dl-fasta-subset SRR203276.44266061 >SRR203276.44266061 length=76 GTGGATGATATTTTTGTCATAATAGTAGCGAAGGCCACGGCTCAGTTTCTCATAATTCATCTTAGGTTTGTTTTTC $ cat ../sample_1.fq |dl-fastq-to-fasta|dl-fasta-subset SRR203276.44266061|dl-fasta-reverse-complement >SRR203276.44266061 length=76 GAAAAACAAACCTAAGATGAATTATGAGAAACTGAGCCGTGGCCTTCGCTACTATTATGACAAAAATATCATCCAC $ cat ../sample_2.fq |dl-fastq-to-fasta|dl-fasta-subset SRR203276.44266061 >SRR203276.44266061 length=76 CCGCTGCCCTGCCCAACCACAAGCCCAAGGGCACCTTCAAGGACTATGTGCGTGACCGTGCTGACCTCAACAAGGA $ cat ../sample_2.fq |dl-fastq-to-fasta|dl-fasta-subset SRR203276.44266061|dl-fasta-reverse-complement >SRR203276.44266061 length=76 TCCTTGTTGAGGTCAGCACGGTCACGCACATAGTCCTTGAAGGTGCCCTTGGGCTTGTGGTTGGGCAGGGCAGCGG - NM_011808.2 is very long. It begins: >NM_011808.2 CGGAGGCGAGCTGCACTGCGCCAGCGCCGGCCAGCCGTGCGAGCGAGCAAGGGAGCGAGCGCCCCGGACGGAGGAGGGAGCGAGCGCCCGGGACAGAGGAGCGAGCGGGCGGGCGAGGCG and ends: ATTTTTTATATACTGTAGGCTGATTTCATATTGTATTTTAAACTGTGTGAAATTAAAGAAACAAAGAAATTCATTCATAA In the middle somewhere occurs the fragment: TATGTGCGTGACCGTGCTGACCTCAACAAGGACAAGCCTGTCATTCCTGCTGCTGCCCTGGCTGGCTACACAGGAAGTGGGCCGATCCAGCTGTGGCAGTTTCTTCTGGAATTACTCACTGATAAGTCTTGTCAGTCCTTTATCAGCTGGACAGGAGATGGCTGGGAATTCAAGCTTTCTGACCCAGATGAGGTGGCCAGGAGATGGGGAAAGAGGAAAAACAAACCTAAGATGAATTATGAGAAACTGAGCCGTGGCCTTCGCTACTATTATGACAAAAATATCATCCACAAGACGGCGGGCAAGCGCTACGTATACCGCTTTGTGTGCGACCTGCAGAGCCTGCTGGGATACACCCCTGAAGAGCTGCACGCCATGCTGGATGTAAAGCCGGATGCTGACTAGTCATGGACAGACGCGCAGAAGGAAGGGGCTGGGGGAACCCTGCTGAGACCTTTCAAAGAGCAACCCTGTTGGTTGGACTCTTCATTTTTAATTGTTATTCAATGTTTTATTTTCCAGAACTCATTTTTCACATTC To see the whole sequence, do dl-fasta-subset NM_011808.2 < /tier2/deweylab/nathanae/rnaseq-eval/data-jul26/sample.rsem-workingdir/reference.transcripts.fa |less Note: it is actually important to include the poly(A) tail. Search in /tier2/deweylab/nathanae/rnaseq-eval/data-jul26/sample.rsem-workingdir/reference.seq or use ./NM_011808.2.fasta for this. - ws3's scaffold_25_contig_4: $ dl-fasta-subset scaffold_25_contig_4 scaffold_25_contig_4 scaffold_25 tid=9388 pos=672 length=2123 number_of_contigs=12 CTGACCCAGATGAGGTGGCCAGGAGATGGGGAAAGAGGAAAAACAAACCTAAGATGAATTATGAGAAACTGAGCCGTGGCCTTCGCTACTATTATGACAAAAATATCATCCACAAGACGGCGGGCAAGCGCTACGTATACCGCTTTGTGTGCGACCTGCAGAGCCTGCTGGGATACACCCCTGAAG Some notes on the above: - ws3's scaffold_25_contig_4 is an exact substring of the fragment given above of the isoform NM_011808.2. - The reverse complemented version of the read SRR203276.44266061 from sample_1.fq is an exact substring of scaffold_25_contig_4. - Neither the read SRR203276.44266061 from sample_2.fq nor its reverse complement is an exact substring of scaffold_25_contig_4. However: $ dl-fasta-subset NM_011808.2 < /tier2/deweylab/nathanae/rnaseq-eval/data-jul26/sample.rsem-workingdir/reference.transcripts.fa|~/summer11/rnaseq-eval/git/fasta-join|grep CCGCTGCCCTGCCCAACCACAAGCCCAAGGGCACCTTCAAGGACTATGTGCGTGACCGTGCTGACCTCAACAAGGA|wc -l 1 i.e., this read (the non-reverse complemented version) does match the isoform NM_011808.2 somewhere else. Maybe the above explains why in (WH3) above there are no alignments of reads to scaffold_25_contig_4, even though in (WH1) there were alignments of reads to the isoform NM_011808.2: - if only one element of the pair of reads aligns to scaffold_25_contig_4, while the other element of the pair aligns somewhere else within NM_011808.2, then bowtie would align that pair to NM_011808.2 but not to scaffold_25_contig_4. If this happened (or neither element of the pair aligned) not just for the read SRR203276.44266061 but for all reads that align to NM_011808.2, then there could be no expression for scaffold_25_contig_4 even though there is NM_011808.2. Indeed: - From $ zcat ws3.expression.sam.gz|grep bowtie @PG ID:Bowtie VN:0.12.7 CL:"bowtie -q --phred33-quals -n 2 -e 99999999 -l 25 -I 1 -X 1000 -p 8 -a -m 200 -S /tier2/deweylab/nathanae/rnaseq-eval/data-jul26//trinity-gridsearch/rsem_oracleset-summary.ws3.rsem-eval-workingdir/reference -1 /tier2/deweylab/nathanae/rnaseq-eval/data-jul26/sample_1.fq -2 /tier2/deweylab/nathanae/rnaseq-eval/data-jul26/sample_2.fq" it appears that bowtie was run in paired-read mode (options -1 and -2 are present), so bowtie does require that both reads align: from the manual, """ A valid paired-end alignment satisfies these criteria: 1. Both mates have a valid alignment according to the alignment policy defined by the -v/-n/-e/-l options. 2. The relative orientation and position of the mates satisfy the constraints defined by the -I/-X/--fr/--rf/--ff options. """ So indeed, it is correct that SRR203276.44266061, at least, doesn't align to ws3's scaffold_25_contig_4 but does align to the isoform NM_011808.2. - From $ samtools view /tier2/deweylab/nathanae/rnaseq-eval/data-jul26/expression.rsem-expr-workingdir/expression.sorted.bam|grep NM_011808.2|cut -f 1 |sort|uniq|wc -l 52 $ samtools view /tier2/deweylab/nathanae/rnaseq-eval/data-jul26/expression.rsem-expr-workingdir/expression.sorted.bam|grep NM_011808.2|cut -f 1 |wc -l 104 we see that there are 52 pairs of reads (out of 68627 total pairs) that align to NM_011808.2 using rsem's standard bowtie options. From $ samtools view /tier2/deweylab/nathanae/rnaseq-eval/data-jul26/expression.rsem-expr-workingdir/expression.sorted.bam|grep NM_011808.2|column - t|cut -f 6|grep 76M|wc -l 104 we see that they all have 76 matches and nothing else - i.e., all the reads that align to NM_011808.2 at all are exact substrings of it. Next, call the output of $ samtools view /tier2/deweylab/nathanae/rnaseq-eval/data-jul26/expression.rsem-expr-workingdir/expression.sorted.bam|grep NM_011808.2|column - t|less -S -#2 "the sam output for NM_011808.2". Note that: - The sequence CGGGGCAATAGCTCCTGAGCAGACTTTCAGAAGGGAGAGGTCTTCTCAGAAGCCTGTTGGACCTGGCTTGCAGAG (reads SRR203276.57814 and SRR203276.50638906) starts in fact at position (1-based) 1884 of NM_011808.2's sequence. In the sam output for NM_011808.2, this sequence occurs in lines with 83 in column 2 (FLAG) and 1884 in column 4 (POS). - The sequence AAAAAAACTTAAGTGTCCGTTTTTTTTTTTTTTTTTAATAAAAAAAAAAAAAAATTTCCTTTTGTTTTGGGGTTT (read SRR203276.8083540) does not occur in NM_011808.2's sequence. Neither does its reverse complement AAACCCCAAAACAAAAGGAAATTTTTTTTTTTTTTTATTAAAAAAAAAAAAAAAAACGGACACTTAAGTTTTTTT However, the 25-base fragment AAAAAAACTTAAGTGTCCGTTTTTT occurs in NM_011808.2's sequence starting at position 1962. The rest of the sequence roughly matches. In the sam output for NM_011808.2, the sequence occurs in a line with 163 in column 2 (FLAG) 1962 in column 4 (POS). - For every pair of reads, one element of the pair has 163 in column 2, and one element of the pair has 83 in column 2. - ws3's scaffold_25_contig_4 has sequence CTGACCCAGATGAGGTGGCCAGGAGATGGGGAAAGAGGAAAAACAAACCTAAGATGAATTATGAGAAACTGAGCCGTGGCCTTCGCTACTATTATGACAAAAATATCATCCACAAGACGGCGGGCAAGCGCTACGTATACCGCTTTGTGTGCGACCTGCAGAGCCTGCTGGGATACACCCCTGAAG which occurs in NM_011808.2's sequence starting at position 1439 and goes for 186 bases to position 186-1+1439=1624. Our hypothesis is that for every pair of reads in the sam output for NM_011808.2, at most one read is aligned to NM_011808.2 in the interval [1439,1624]. Since all the reads are 76 bases long and start at the position given in column 4, we can check this hypothesis as follows: fn=/tier2/deweylab/nathanae/rnaseq-eval/data-jul26/expression.rsem-expr-workingdir/expression.sorted.bam pairs=`samtools view $fn|grep NM_011808.2|cut -f 1 |sort|uniq` reads="83 163" for i in $pairs; do for read in $reads; do echo -n "$i $read" startpos=`samtools view $fn|grep NM_011808.2|grep ^$i$'\t'$read|cut -f 4` endpos=`expr 76 - 1 + $startpos` if [ "$startpos" -ge "1439" ]; then if [ "$endpos" -le "1624" ]; then echo -n " match" fi fi echo done done This script prints two lines per read pair, one for each element of the pair. Each line has "match" next to it if the sequence is aligned to NM_011808.2 in the interval [1439,1624], i.e., would be aligned to ws3's scaffold_25_contig_4. If our hypothesis is correct, at most one of each pair's two lines should have a match next to it. This is indeed what happens. - It remains to establish that the read pair elements that do align to NM_011808.2 in the interval [1439,1624] do sufficiently cover that region to justify collectContigs (with windowSize 3) making a contig there. The reads that align to the interval are: SRR203276.17204380 83 match SRR203276.44266061 83 match SRR203276.5739758 163 match These reads are: $ cat ../sample_1.fq|dl-fastq-to-fasta|dl-fasta-subset SRR203276.17204380 SRR203276.44266061|dl-fasta-reformat -w 0 >SRR203276.17204380 length=76 CGGCTCAGTTTCTCATAATTCATCTTAGGTTTGTTTTTCCTCTTTCCCCATCTCCTGGCCACCTCATCTGGGTCAG >SRR203276.44266061 length=76 GTGGATGATATTTTTGTCATAATAGTAGCGAAGGCCACGGCTCAGTTTCTCATAATTCATCTTAGGTTTGTTTTTC $ cat ../sample_2.fq|dl-fastq-to-fasta|dl-fasta-subset SRR203276.5739758|dl-fasta-reformat -w 0 >SRR203276.5739758 length=76 CACAAGACGGCGGGCAAGCGCTACGTATACCGCTTTGTGTGCGACCTGCAGAGCCTGCTGGGATACACCCCTGAAG the _1.fq reads have reverse complement $ cat ../sample_1.fq|dl-fastq-to-fasta|dl-fasta-subset SRR203276.17204380 SRR203276.44266061|dl-fasta-reverse-complement|dl-fasta-reformat -w 0 >SRR203276.17204380 length=76 CTGACCCAGATGAGGTGGCCAGGAGATGGGGAAAGAGGAAAAACAAACCTAAGATGAATTATGAGAAACTGAGCCG >SRR203276.44266061 length=76 GAAAAACAAACCTAAGATGAATTATGAGAAACTGAGCCGTGGCCTTCGCTACTATTATGACAAAAATATCATCCAC We make an alignment as follows: 1 CTGACCCAGATGAGGTGGCCAGGAGATGGGGAAAGAGGAAAAACAAACCTAAGATGAATTATGAGAAACTGAGCCGTGGCCTTCGCTACTATTATGACAAAAATATCATCCACAAGACGGCGGGCAAGCGCTACGTATACCGCTTTGTGTGCGACCTGCAGAGCCTGCTGGGATACACCCCTGAAG 2 CTGACCCAGATGAGGTGGCCAGGAGATGGGGAAAGAGGAAAAACAAACCTAAGATGAATTATGAGAAACTGAGCCG 3 GAAAAACAAACCTAAGATGAATTATGAGAAACTGAGCCGTGGCCTTCGCTACTATTATGACAAAAATATCATCCAC 4 CACAAGACGGCGGGCAAGCGCTACGTATACCGCTTTGTGTGCGACCTGCAGAGCCTGCTGGGATACACCCCTGAAG Here line 1 is ws3's scaffold_25_contig_4, lines 2 and 3 are the matches from sample_1.fq, reverse complemented, and line 4 is the match from sample_2.fq. All the matches are exact. We see that the contig is indeed covered with at least a 3-base overlap (windowSize 3) by reads, so collectContigs is justified to make this contig.