Basic assembly statistics

We define:

Precision-recall statistics

Let $A$ be an assembly and $B$ be an oracleset. By $s\in A$, we mean that $s$ is a contig in $A$. By $s[i]$ we mean the $i$th base in $s$.

Fix the min_frac_identity parameter $\alpha \geq 0.8$. If desired, fix the max_frac_indel parameter $\beta$.

We use the following procedure to compute nucleotide-level precision and recall statistics.

We use the following procedure to compute transcript-level precision and recall statistics:

RSEM approx columns

These columns come from expression.approx in RSEM's output.
rsem.approx.approx: I believe that this is the log model evidence, $\log P(D)$, computed using a convex approximation.
rsem.approx.bic: Log model evidence, $\log P(D)$, computed using BIC.
rsem.approx.loglikelihood: I believe that this is the log likelihood, $\log P(D|\theta)$, computed at the MAP $\theta$.
rsem.approx.loglikelihood.penalty: This is the BIC penalty.
It should be the case that rsem.approx.bic = rsem.approx.loglikelihood - rsem.approx.loglikelihood.penalty.
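
The identity above can be checked mechanically. The following is a minimal sketch, assuming the columns of one assembly have already been loaded into a Python dict keyed by column name; the helper name `bic_identity_holds`, the tolerances, and the numeric values are hypothetical and chosen only for illustration.

```python
import math


def bic_identity_holds(row, rel_tol=1e-9, abs_tol=1e-6):
    """Check that bic == loglikelihood - penalty, up to floating-point tolerance.

    `row` is assumed to be a dict mapping column names to floats.
    """
    expected = (
        row["rsem.approx.loglikelihood"]
        - row["rsem.approx.loglikelihood.penalty"]
    )
    return math.isclose(
        row["rsem.approx.bic"], expected, rel_tol=rel_tol, abs_tol=abs_tol
    )


# Made-up values, not taken from any real RSEM run.
example = {
    "rsem.approx.loglikelihood": -1000.0,
    "rsem.approx.loglikelihood.penalty": 25.5,
    "rsem.approx.bic": -1025.5,
}
print(bic_identity_holds(example))
```

A tolerance-based comparison is used because the values are floating-point numbers that may have been rounded when written to the output file.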

RSEM eval columns

These columns come from expression.eval in RSEM's output.
rsem.eval.lognumer.minus.logdenom: I believe that this is $\log P(D|\theta') + \log P(\theta') - \log P(\theta'|D)$, where $\theta'$ is a posterior mean estimate.
rsem.eval.logprior: I believe that this is $\log P(\theta')$.
rsem.eval.loglikelihood: I believe that this is $\log P(D|\theta')$.
rsem.eval.logdenom: I believe that this is $\log P(\theta'|D)$.
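
If the interpretations above are correct, the first column should be recoverable from the other three. The following is a minimal sketch of that relationship, again assuming a dict keyed by column name; the function name `lognumer_minus_logdenom` and the numeric values are hypothetical, and the relationship itself is only as reliable as the tentative descriptions it is derived from.

```python
def lognumer_minus_logdenom(row):
    """Recompute log P(D|theta') + log P(theta') - log P(theta'|D).

    Assumes the column descriptions in this section are correct.
    """
    return (
        row["rsem.eval.loglikelihood"]
        + row["rsem.eval.logprior"]
        - row["rsem.eval.logdenom"]
    )


# Made-up values, not taken from any real RSEM run.
example = {
    "rsem.eval.loglikelihood": -500.0,
    "rsem.eval.logprior": -20.0,
    "rsem.eval.logdenom": -30.0,
}
print(lognumer_minus_logdenom(example))  # -490.0
```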

RSEM prior columns

These columns come from expression.prior in RSEM's output.
rsem.prior.log.prob.M: This is $\log P(M)$.
rsem.prior.log.prob.L.given.M: This is $\log P(L|M)$.
rsem.prior.log.prob.Sequences.given.L.and.M: This is $\log P(\text{Sequences}|L, M)$.
rsem.prior.log.prob.A: This is $\log P(A) = \log P(M) + \log P(L|M) + \log P(\text{Sequences}|L, M)$.
rsem.eval.loglikelihood.plus.rsem.prior.log.prob.A: This is $\log P(A)$ + rsem.eval.loglikelihood.
rsem.approx.loglikelihood.plus.rsem.prior.log.prob.A: This is $\log P(A)$ + rsem.approx.loglikelihood.
rsem.approx.approx.plus.rsem.prior.log.prob.A: This is $\log P(A)$ + rsem.approx.approx.
rsem.approx.bic.plus.rsem.prior.log.prob.A: This is $\log P(A)$ + rsem.approx.bic.
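
The combined columns follow directly from the sums listed above. The following is a minimal sketch of how they could be recomputed from the individual terms, assuming a dict keyed by column name; the helper name `add_prior_columns` and the numeric values are hypothetical.

```python
# The three terms whose sum is log P(A).
PRIOR_TERMS = [
    "rsem.prior.log.prob.M",
    "rsem.prior.log.prob.L.given.M",
    "rsem.prior.log.prob.Sequences.given.L.and.M",
]

# The score columns to which log P(A) is added.
SCORE_COLUMNS = [
    "rsem.eval.loglikelihood",
    "rsem.approx.loglikelihood",
    "rsem.approx.approx",
    "rsem.approx.bic",
]


def add_prior_columns(row):
    """Return a copy of `row` with log P(A) and the four combined columns."""
    out = dict(row)
    log_prob_a = sum(row[name] for name in PRIOR_TERMS)
    out["rsem.prior.log.prob.A"] = log_prob_a
    for name in SCORE_COLUMNS:
        out[name + ".plus.rsem.prior.log.prob.A"] = row[name] + log_prob_a
    return out


# Made-up values, not taken from any real RSEM run.
example = {
    "rsem.prior.log.prob.M": -10.0,
    "rsem.prior.log.prob.L.given.M": -20.0,
    "rsem.prior.log.prob.Sequences.given.L.and.M": -70.0,
    "rsem.eval.loglikelihood": -500.0,
    "rsem.approx.loglikelihood": -510.0,
    "rsem.approx.approx": -520.0,
    "rsem.approx.bic": -530.0,
}
result = add_prior_columns(example)
print(result["rsem.prior.log.prob.A"])  # -100.0
print(result["rsem.approx.bic.plus.rsem.prior.log.prob.A"])  # -630.0
```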

RSEM ss columns

These columns come from expression.ss in RSEM's output.
rsem.ss.mean.num.reads.per.transcript: Mean number of reads aligning to each transcript (based on countvs).
rsem.ss.median.num.reads.per.transcript: Median number of reads aligning to each transcript (based on countvs).
rsem.ss.num.transcripts.with.zero.reads: Number of transcripts with no reads aligning to them (based on countvs).
rsem.ss.num.matching.bases: Number of matching bases, based on the qpro profiles.
rsem.ss.num.mismatching.bases: Number of mismatching bases, based on the qpro profiles.