Basic assembly statistics

We define: $ \def\match{\mathrm{match}} \def\matches{\mathrm{matches}} \def\leftmatches{\mathrm{left\_matches}} \def\proj{\mathrm{proj}} \def\calA{\mathcal A} \def\calB{\mathcal B} \def\fracIdentity{\mathrm{frac\_identity}} \def\fracIndel{\mathrm{frac\_indel}} \def\mask{\mathrm{mask}} \def\fracOnes{\mathrm{frac\_ones}} \def\numOnes{\mathrm{num\_ones}} \def\length{\mathrm{length}} \def\N{\mathbb{N}} \def\precision{\mathrm{precision}} \def\recall{\mathrm{recall}} \def\F1{\mathrm{F1}} \def\ok{\mathrm{ok}} \def\compare{\mathrm{compare}} $

Weighted_2 precision and recall statistics

In general, we may define To get a specific precision-recall pair, we need to define "probability" and "matches". We consider the following transcript-level definitions of "probability". Here $\proj_\calA(\mathcal S) = \cup_{t\in B} \{s \in \calA : (s,t) \in \mathcal S\}$, $\proj_\calB$ is defined similarly, and $\matches$ is defined below. The distribution of $A$ and $B$ is given by: where $\tau_\calA$ and $\tau_\calB$ are the transcript-level expression levels, according to RSEM.

We consider the following definitions of "matches".

The above definitions are used to compute weighted_2_transcript_1_{precision, recall, F1}_at_$\alpha$_and_$\beta$ and weighted_2_transcript_2_{precision, recall, F1}_at_$\alpha$_and_$\beta$, as follows: With these definitions, the precision and recall measure how well individual oracleset elements are recovered by individual assembly elements.

We may also consider the following definitions of "matches", which lead to precision and recall statistics that measure how well individual oracleset elements are recovered by collections of (possibly) several assembly elements.

The elements $b^*_i(a)$ are defined as follows: For each $b\in\calB$, the mask $\mask_i(b)$ is defined as follows: The function $\fracOnes$ is defined as follows: The weighted_2_transcript_$i$_{precision, recall, F1}_at_$\alpha$_and_$\beta$ are computed for $i\in\{3,4\}$ as for $i\in\{1,2\}$ above.

Finally, we can consider an alternative definition of "probability", focused on nucleotide-level expression instead of transcript-level expression. Let $(a,k)\in\calA\times\N$ refer to contig $a$'s $k$th base, and define $(b,l)\in\calB\times\N$ similarly.

The distribution of $(A,K)$ and $(B,L)$ is given by: where $\tau_\calA$ and $\tau_\calB$ are the transcript-level expression levels, according to RSEM.

We define $\matches$ as follows. For each $i\in\{3,4\}$, let

We also define unmatched versions. Let

This gives us, for $i\in\{1,2,3,4\}$:

Computation of the above

Main procedure: Procecure to compute $b_i^*(a)$: Procecure to compute $\mask_i(b)$: Procedure to compute $\matches_1(\calA,\calB)$: Procedure to compute $\matches_2(\calA,\calB)$: Procedure to compute $\matches_i(\calA,\calB)$ for $i\in\{3,4\}$: Procedure to compute $\matches_i(\calA\times\N,\calB\times\N)$ for $i\in\{3,4\}$: Procedure to compute weighted_2_transcript_i_{precision, recall, F1}_at_$\alpha$_and_$\beta$, for $i\in\{1,2,3,4\}$: Procedure to compute weighted_2_nucleotide_i_{precision, recall, F1}_at_$\alpha$_and_$\beta$, for $i\in\{1,2,3,4\}$:

Precision-recall statistics

Let $A$ be an assembly and $B$ be an oracleset. By $s\in A$, we mean that $s$ is a contig in $A$. By $s[i]$ we mean the $i$th base in $s$.

For any alignment $q\to t$, define:

Fix min_frac_identity parameter $\alpha$ $\geq$ 0.8. If desired, fix max_frac_indel parameter $\beta$.

Nucleotide-level statistics

We use the following procedure to compute nucleotide-level precision and recall statistics.

Transcript-level statistics

We use the following procedure to compute transcript-level precision and recall statistics:

Transcript_2-level statistics

We use the following procedure to compute transcript_2-level precision and recall statistics:

Transcript_3-level statistics

We use the following procedure to compute transcript_3-level precision and recall statistics. We reuse the mask($t$) vectors and the frac_identity vector created to compute the nucleotide-level statistics.

Motivation for the weighted variants above

For the unweighted versions, we defined: One can interpret these definitions as follows For the weighted versions, we instead define: In other words (using notation from the next subsection): Here "matches" is defined differently for the transcript, transcript_2, and transcript_3 statistics.

Similarly, for the unweighted versions, we defined:

One can interpret these definitions as follows For the weighted versions, we instead define:

RSEM approx columns

These columns come from expression.approx in RSEM's output.
rsem.approx.approx I believe that this is the log model evidence, log P(D), computed using a convex approximation.
rsem.approx.bic Log model evidence, log P(D), computed using BIC.
rsem.approx.loglikelihood I believe that this is the log likelihood, log P(D|\theta), computed at the MAP \theta.
rsem.approx.loglikelihood.penalty This is the BIC penalty.
It should be the case that rsem.approx.bic = rsem.approx.loglikelihood - rsem.approx.loglikelihood.penalty.

RSEM eval columns

These columns come from expression.eval in RSEM's output.
rsem.eval.lognumer.minus.logdenom I believe that this is \log P(D|\theta') + \log P(\theta') - \log P(\theta'|D), where \theta' is a posterior mean estimate
rsem.eval.logprior I believe that this is \log P(\theta')
rsem.eval.loglikelihood I believe that this is \log P(D|\theta')
rsem.eval.logdenom I believe that this is \log P(\theta'|D)

RSEM prior columns

These columns come from expression.prior in RSEM's output.
rsem.prior.log.prob.M This is \log P(M).
rsem.prior.log.prob.L.given.M This is \log P(L|M).
rsem.prior.log.prob.Sequences.given.L.and.M This is \log P(Sequences|L, M).
rsem.prior.log.prob.A This is \log P(A) = \log P(M) + \log P(L|M) + \log P(Sequences|L, M).
rsem.eval.loglikelihood.plus.rsem.prior.log.prob.A This is \log P(A) + rsem.eval.loglikelihood
rsem.approx.loglikelihood.plus.rsem.prior.log.prob.A This is \log P(A) + rsem.approx.loglikelihood
rsem.approx.approx.plus.rsem.prior.log.prob.A This is \log P(A) + rsem.approx.approx
rsem.approx.bic.plus.rsem.prior.log.prob.A This is \log P(A) + rsem.approx.bic

RSEM ss columns

These columns come from expression.ss in RSEM's output.
rsem.ss.mean.num.reads.per.transcript Mean number of reads aligning to each transcript (based on countvs).
rsem.ss.median.num.reads.per.transcript Median number of reads aligning to each transcript (based on countvs).
rsem.ss.num.transcripts.with.zero.reads Number of transcripts with no reads aligning to them (based on countvs).
rsem.ss.num.matching.bases Number of matching bases, based on the qpro profiles.
rsem.ss.num.mismatching.bases Number of mismatching bases, based on the qpro profiles.