Basic assembly statistics

We define:

N25: the number $L$ s.t. 25% of all bases in the assembly are in a contig of length less than $L$.
N50: the number $L$ s.t. 50% of all bases in the assembly are in a contig of length less than $L$.
N75: the number $L$ s.t. 75% of all bases in the assembly are in a contig of length less than $L$.
longest_length: the length of the longest contig in the assembly.
mean_length: the mean contig length in the assembly.
median_length: the median contig length in the assembly.
shortest_length: the length of the shortest contig in the assembly.
num_contigs: the number of contigs in the assembly.
num_contigs_longer_than_$n$_bp: the number of contigs in the assembly longer than $n$ bp, for $n$ = 100 and 300.

Precision-recall statistics

Let $A$ be an assembly and $B$ be an oracleset. By $s\in A$, we mean that $s$ is a contig in $A$. By $s[i]$ we mean the $i$th base in $s$.

Fix min_frac_identity parameter $\alpha$ $\geq$ 0.8. If desired, fix max_frac_indel parameter $\beta$.

We use the following procedure to compute nucleotide-level precision and recall statistics.

If $\beta$ is not given, set $\beta=1$.
Run blat -minIdentity=80 B.fa A.fa A_to_B.psl to align the sequences in $A$ to those in $B$, excluding those alignments with less than 80% identity according to blat.
Compute the following quantities based on these alignments:
- num_mismatching_bases: Total number of mismatching (nonidentical) bases in the alignments. If this is too high, something might be wrong.
- num_raw_matching_bases: Total number of matching (identical) bases in all the alignments, with bases counted multiple times if they are involved in multiple alignments.
- num_uniq_matching_bases: Total number of bases in B that match at least one aligned base in A. Each base in B is counted only once, no matter how many bases in A it aligns to.
We compute these quantities as follows:
- For each $t\in B$, initialize mask($t$) as a string of all 0's, the same length as $t$.
- Initialize num_raw_matching_bases = num_uniq_matching_bases = num_mismatching_bases = 0.
- For each alignment $q \to t$ where $q\in A$ and $t\in B$:
  - If num_identity($q\to t$)/length'($q$) < $\alpha$ or if num_indel($q\to t$)/length'($q$) > $\beta$, then skip this alignment. (Here, length'($q$) is the number of non-$\mathtt N$ bases in $q$.)
  - Otherwise: We can think of the alignment $q\to t$ as a set of pairs $(k,l)$ where $k$ indexes $q$ and $l$ indexes $t$. For each pair $(k,l)$ in the alignment:
    - If $q[k] = t[l]$ and $t[l] \neq \mathtt N$:
      - Increment num_raw_matching_bases.
      - If mask($t$)$[l] = 0$:
        
        Increment num_uniq_matching_bases.
        Set mask($t$)$[l] = 1$.
    - Else: Increment num_mismatching_bases.
Let num_bases_in_$A = \sum_{q\in A}$ length($q$) and num_bases_in_$B = \sum_{t\in B}$ length($t$). (Here, length($q$) is the number of bases in $q$, including $\mathtt N$'s.)
Let nucl_precision_at_$\alpha$_and_$\beta$ = num_uniq_matching_bases / num_bases_in_$A$.
Let nucl_recall_at_$\alpha$_and_$\beta$ = num_uniq_matching_bases / num_bases_in_$B$.

We use the following procedure to compute transcript-level precision and recall statistics:

If $\beta$ is not given, set $\beta=1$.
Run blat -minIdentity=80 A.fa B.fa B_to_A.psl to align the sequences in $B$ to those in $A$, excluding those alignments with less than 80% identity according to blat.
Compute the following quantities based on these alignments:
- num_raw_alignments: Total number of alignments.
- num_uniq_alignments: Number of elements of B that are involved in at least one alignment.
We compute these quantities as follows:
- For each $q \in B$, initialize mask($q$) = 0.
- Initialize num_raw_alignments = num_uniq_alignments = 0.
- For each alignment $q \to t$ where $q\in B$ and $t\in A$:
  - If num_identity($q\to t$)/length'($q$) < $\alpha$ or if num_indel($q\to t$)/length'($q$) > $\beta$, then skip this alignment. (Here, length'($q$) is the number of non-$\mathtt N$ bases in $q$.)
  - Otherwise: Mark the current oracleset element $q$ as having an alignment, and increment the number of oracleset elements with alignments as appropriate. In detail:
    - Increment num_raw_alignments.
    - If mask($q$) = 0:
      - Increment num_uniq_alignments.
      - Set mask($q$) = 1.
Let num_contigs_in_$A = |A|$ and num_contigs_in_$B = |B|$.
Let transcript_precision_at_$\alpha$_and_$\beta$ = num_uniq_alignments / num_contigs_in_$A$.
Let transcript_recall_at_$\alpha$_and_$\beta$ = num_uniq_alignments / num_contigs_in_$B$.

RSEM approx columns

These columns come from expression.approx in RSEM's output.

rsem.approx.approx	I believe that this is the log model evidence, log P(D), computed using a convex approximation.
rsem.approx.bic	Log model evidence, log P(D), computed using BIC.
rsem.approx.loglikelihood	I believe that this is the log likelihood, log P(D\|\theta), computed at the MAP \theta.
rsem.approx.loglikelihood.penalty	This is the BIC penalty.

It should be the case that rsem.approx.bic = rsem.approx.loglikelihood - rsem.approx.loglikelihood.penalty.

RSEM eval columns

These columns come from expression.eval in RSEM's output.

rsem.eval.lognumer.minus.logdenom	I believe that this is \log P(D\|\theta') + \log P(\theta') - \log P(\theta'\|D), where \theta' is a posterior mean estimate
rsem.eval.logprior	I believe that this is \log P(\theta')
rsem.eval.loglikelihood	I believe that this is \log P(D\|\theta')
rsem.eval.logdenom	I believe that this is \log P(\theta'\|D)

RSEM prior columns

These columns come from expression.prior in RSEM's output.

rsem.prior.log.prob.M	This is \log P(M).
rsem.prior.log.prob.L.given.M	This is \log P(L\|M).
rsem.prior.log.prob.Sequences.given.L.and.M	This is \log P(Sequences\|L, M).
rsem.prior.log.prob.A	This is \log P(A) = \log P(M) + \log P(L\|M) + \log P(Sequences\|L, M).
rsem.eval.loglikelihood.plus.rsem.prior.log.prob.A	This is \log P(A) + rsem.eval.loglikelihood
rsem.approx.loglikelihood.plus.rsem.prior.log.prob.A	This is \log P(A) + rsem.approx.loglikelihood
rsem.approx.approx.plus.rsem.prior.log.prob.A	This is \log P(A) + rsem.approx.approx
rsem.approx.bic.plus.rsem.prior.log.prob.A	This is \log P(A) + rsem.approx.bic

RSEM ss columns

These columns come from expression.ss in RSEM's output.

rsem.ss.mean.num.reads.per.transcript	Mean number of reads aligning to each transcript (based on countvs).
rsem.ss.median.num.reads.per.transcript	Median number of reads aligning to each transcript (based on countvs).
rsem.ss.num.transcripts.with.zero.reads	Number of transcripts with no reads aligning to them (based on countvs).
rsem.ss.num.matching.bases	Number of matching bases, based on the qpro profiles.
rsem.ss.num.mismatching.bases	Number of mismatching bases, based on the qpro profiles.