Basic assembly statistics
We define:
- N25: the number $L$ s.t. 25% of all bases in the assembly are in a contig of length less than $L$.
- N50: the number $L$ s.t. 50% of all bases in the assembly are in a contig of length less than $L$.
- N75: the number $L$ s.t. 75% of all bases in the assembly are in a contig of length less than $L$.
- longest_length: the length of the longest contig in the assembly.
- mean_length: the mean contig length in the assembly.
- median_length: the median contig length in the assembly.
- shortest_length: the length of the shortest contig in the assembly.
- num_contigs: the number of contigs in the assembly.
- num_contigs_longer_than_$n$_bp: the number of contigs in the assembly longer than $n$ bp, for $n$ = 100 and 300.
Precision-recall statistics
Let $A$ be an assembly and $B$ be an oracleset. By $s\in A$, we mean that $s$ is a contig in $A$. By $s[i]$ we mean the $i$th base in $s$.
Fix min_frac_identity parameter $\alpha$ $\geq$ 0.8. If desired, fix
max_frac_indel parameter $\beta$.
We use the following procedure to compute nucleotide-level precision and
recall statistics.
- If $\beta$ is not given, set $\beta=1$.
- Run blat -minIdentity=80 B.fa A.fa A_to_B.psl to align the
sequences in $A$ to those in $B$, excluding those alignments with less than 80%
identity according to blat.
- Compute the following quantities based on these alignments:
- num_mismatching_bases: Total number of mismatching (nonidentical) bases
in the alignments. If this is too high,
something might be wrong.
- num_raw_matching_bases: Total number of matching (identical) bases in
all the alignments, with bases counted multiple
times if they are involved in multiple
alignments.
- num_uniq_matching_bases: Total number of bases in B that match at least
one aligned base in A. Each base in B is counted
only once, no matter how many bases in A it
aligns to.
- We compute these quantities as follows:
- For each $t\in B$, initialize mask($t$) as a string of all 0's, the same length as $t$.
- Initialize num_raw_matching_bases = num_uniq_matching_bases = num_mismatching_bases = 0.
- For each alignment $q \to t$ where $q\in A$ and $t\in B$:
- If num_identity($q\to t$)/length'($q$) < $\alpha$ or if
num_indel($q\to t$)/length'($q$) > $\beta$, then skip this alignment.
(Here, length'($q$) is the number of non-$\mathtt N$ bases in $q$.)
- Otherwise: We can think of the alignment $q\to t$ as a set of pairs
$(k,l)$ where $k$ indexes $q$ and $l$ indexes $t$. For each pair $(k,l)$ in
the alignment:
- If $q[k] = t[l]$ and $t[l] \neq \mathtt N$:
- Increment num_raw_matching_bases.
- If mask($t$)$[l] = 0$:
- Increment num_uniq_matching_bases.
- Set mask($t$)$[l] = 1$.
- Else: Increment num_mismatching_bases.
- Let num_bases_in_$A = \sum_{q\in A}$ length($q$) and
num_bases_in_$B = \sum_{t\in B}$ length($t$).
(Here, length($q$) is the number of bases in $q$, including $\mathtt N$'s.)
- Let nucl_precision_at_$\alpha$_and_$\beta$ = num_uniq_matching_bases / num_bases_in_$A$.
- Let nucl_recall_at_$\alpha$_and_$\beta$ = num_uniq_matching_bases / num_bases_in_$B$.
We use the following procedure to compute transcript-level precision and
recall statistics:
- If $\beta$ is not given, set $\beta=1$.
- Run blat -minIdentity=80 A.fa B.fa B_to_A.psl to align the
sequences in $B$ to those in $A$, excluding those alignments with less than 80%
identity according to blat.
- Compute the following quantities based on these alignments:
- num_raw_alignments: Total number of alignments.
- num_uniq_alignments: Number of elements of B that are involved in at
least one alignment.
- We compute these quantities as follows:
- For each $q \in B$, initialize mask($q$) = 0.
- Initialize num_raw_alignments = num_uniq_alignments = 0.
- For each alignment $q \to t$ where $q\in B$ and $t\in A$:
- If num_identity($q\to t$)/length'($q$) < $\alpha$ or if
num_indel($q\to t$)/length'($q$) > $\beta$, then skip this alignment.
(Here, length'($q$) is the number of non-$\mathtt N$ bases in $q$.)
- Otherwise: Mark the current oracleset element $q$ as having an
alignment, and increment the number of oracleset elements with alignments
as appropriate. In detail:
- Increment num_raw_alignments.
- If mask($q$) = 0:
- Increment num_uniq_alignments.
- Set mask($q$) = 1.
- Let num_contigs_in_$A = |A|$ and
num_contigs_in_$B = |B|$.
- Let transcript_precision_at_$\alpha$_and_$\beta$ = num_uniq_alignments / num_contigs_in_$A$.
- Let transcript_recall_at_$\alpha$_and_$\beta$ = num_uniq_alignments / num_contigs_in_$B$.
RSEM approx columns
These columns come from expression.approx in RSEM's output.
rsem.approx.approx |
I believe that this is the log model evidence, log P(D), computed using a convex approximation. |
rsem.approx.bic |
Log model evidence, log P(D), computed using BIC. |
rsem.approx.loglikelihood |
I believe that this is the log likelihood, log P(D|\theta), computed at the MAP \theta. |
rsem.approx.loglikelihood.penalty |
This is the BIC penalty. |
It should be the case that rsem.approx.bic = rsem.approx.loglikelihood - rsem.approx.loglikelihood.penalty.
RSEM eval columns
These columns come from expression.eval in RSEM's output.
rsem.eval.lognumer.minus.logdenom |
I believe that this is \log P(D|\theta') + \log P(\theta') - \log
P(\theta'|D), where \theta' is a posterior mean estimate |
rsem.eval.logprior |
I believe that this is \log P(\theta') |
rsem.eval.loglikelihood |
I believe that this is \log P(D|\theta') |
rsem.eval.logdenom |
I believe that this is \log P(\theta'|D) |
RSEM prior columns
These columns come from expression.prior in RSEM's output.
rsem.prior.log.prob.M |
This is \log P(M). |
rsem.prior.log.prob.L.given.M |
This is \log P(L|M). |
rsem.prior.log.prob.Sequences.given.L.and.M |
This is \log P(Sequences|L, M). |
rsem.prior.log.prob.A |
This is \log P(A) = \log P(M) + \log P(L|M) + \log P(Sequences|L, M). |
rsem.eval.loglikelihood.plus.rsem.prior.log.prob.A |
This is \log P(A) + rsem.eval.loglikelihood |
rsem.approx.loglikelihood.plus.rsem.prior.log.prob.A |
This is \log P(A) + rsem.approx.loglikelihood |
rsem.approx.approx.plus.rsem.prior.log.prob.A |
This is \log P(A) + rsem.approx.approx |
rsem.approx.bic.plus.rsem.prior.log.prob.A |
This is \log P(A) + rsem.approx.bic |
RSEM ss columns
These columns come from expression.ss in RSEM's output.
rsem.ss.mean.num.reads.per.transcript |
Mean number of reads aligning to each transcript (based on countvs). |
rsem.ss.median.num.reads.per.transcript |
Median number of reads aligning to each transcript (based on countvs). |
rsem.ss.num.transcripts.with.zero.reads |
Number of transcripts with no reads aligning to them (based on countvs). |
rsem.ss.num.matching.bases |
Number of matching bases, based on the qpro profiles. |
rsem.ss.num.mismatching.bases |
Number of mismatching bases, based on the qpro profiles. |