Basic assembly statistics
We define:
- N25: the number $L$ s.t. 25% of all bases in the assembly are in a contig of length less than $L$.
- N50: the number $L$ s.t. 50% of all bases in the assembly are in a contig of length less than $L$.
- N75: the number $L$ s.t. 75% of all bases in the assembly are in a contig of length less than $L$.
- longest_length: the length of the longest contig in the assembly.
- mean_length: the mean contig length in the assembly.
- median_length: the median contig length in the assembly.
- shortest_length: the length of the shortest contig in the assembly.
- num_contigs: the number of contigs in the assembly.
- num_contigs_longer_than_$n$_bp: the number of contigs in the assembly longer than $n$ bp, for $n$ = 100 and 300.
$
\def\match{\mathrm{match}}
\def\matches{\mathrm{matches}}
\def\leftmatches{\mathrm{left\_matches}}
\def\proj{\mathrm{proj}}
\def\calA{\mathcal A}
\def\calB{\mathcal B}
\def\fracIdentity{\mathrm{frac\_identity}}
\def\fracIndel{\mathrm{frac\_indel}}
\def\mask{\mathrm{mask}}
\def\fracOnes{\mathrm{frac\_ones}}
\def\numOnes{\mathrm{num\_ones}}
\def\length{\mathrm{length}}
\def\N{\mathbb{N}}
\def\precision{\mathrm{precision}}
\def\recall{\mathrm{recall}}
\def\F1{\mathrm{F1}}
\def\ok{\mathrm{ok}}
\def\compare{\mathrm{compare}}
$
Weighted_2 precision and recall statistics
In general, we may define
- precision = probability that an assembly element matches an oracleset element, where assembly elements are considered random and oracleset elements are not.
- recall = probability that an oracleset element matches an assembly element, where oracleset elements are considered random and assembly elements are not.
To get a specific precision-recall pair, we need to define "probability" and "matches".
We consider the following transcript-level definitions of "probability".
- The probability that an assembly element matches an oracleset element is defined to be
$P(A \in \proj_{\calA}(\matches(\calA, \calB)))$, where $A$ is a random variable taking values in $\calA$.
- The probability that an oraclset element matches an assembly element is defined to be
$P(B \in \proj_{\calB}(\matches(\calA, \calB)))$, where $B$ is a random variable taking values in $\calB$.
Here $\proj_\calA(\mathcal S) = \cup_{t\in B} \{s \in \calA : (s,t) \in
\mathcal S\}$, $\proj_\calB$ is defined similarly, and $\matches$ is defined
below. The distribution of $A$ and $B$ is given by:
- $P(A = a) = \tau_\calA(a)$ for all $a\in\calA$.
- $P(B = b) = \tau_\calB(b)$ for all $b\in\calB$.
where $\tau_\calA$ and $\tau_\calB$ are the transcript-level expression levels,
according to RSEM.
We consider the following definitions of "matches".
- $\matches_{1,\alpha,\beta}(\calA, \calB) = \{(a,b) \in \calA\times\calB : \fracIdentity(a\to b) \geq \alpha, \fracIndel(a\to b) \leq \beta\}$, i.e., $(a,b)\in\calA\times\calB$ are a match if $a$ aligns to $b$ sufficiently well.
- $\matches_{2,\alpha,\beta}(\calA, \calB) = \{(a,b) \in \calA\times\calB : \fracIdentity(b\to a) \geq \alpha, \fracIndel(b\to a) \leq \beta\}$, i.e., $(a,b)\in\calA\times\calB$ are a match if $b$ aligns to $a$ sufficiently well.
The above definitions are used to compute
weighted_2_transcript_1_{precision, recall, F1}_at_$\alpha$_and_$\beta$ and
weighted_2_transcript_2_{precision, recall, F1}_at_$\alpha$_and_$\beta$,
as follows:
- weighted_2_transcript_$i$_precision_at_$\alpha$_and_$\beta = \sum_{a\in\calA} 1\left(a\in\proj_\calA(\matches_{i,\alpha,\beta}(\calA,\calB))\right) \tau_\calA(a)$
- weighted_2_transcript_$i$_recall_at_$\alpha$_and_$\beta = \sum_{b\in\calB} 1\left(b\in\proj_\calB(\matches_{i,\alpha,\beta}(\calA,\calB))\right) \tau_\calB(b)$
With these definitions, the precision and recall measure how well individual
oracleset elements are recovered by individual assembly elements.
We may also consider the following definitions of "matches", which lead to
precision and recall statistics that measure how well individual oracleset
elements are recovered by collections of (possibly) several assembly elements.
- $\matches_{3,\alpha,\beta}(\calA, \calB) = \{(a,b^*_3(a)) : a\in\calA, b^*_3(a) \neq \varnothing, \fracOnes(\mask_3(b^*_3(a))) \geq \alpha\}$.
- $\matches_{4,\alpha,\beta}(\calA, \calB) = \{(a,b^*_4(a)) : a\in\calA, b^*_4(a) \neq \varnothing, \fracOnes(\mask_4(b^*_4(a))) \geq \alpha\}$.
The elements $b^*_i(a)$ are defined as follows:
- For each $a\in\calA$, let $b^*_3(a)$ be the element of $\calB$ that maximizes (i) $\fracIdentity(a\to b)$, (ii) if tied, $-\fracIndel(a\to b)$, (iii) if still tied, $\nu_\calB(b)$, all subject to $\fracIdentity(a\to b) \geq \alpha$ and $\fracIndel(a\to b) \leq \beta$. If the feasible set is empty, let $b^*_3(a) = \varnothing$.
- For each $a\in\calA$, let $b^*_4(a)$ be the element of $\calB$ that maximizes (i) $\nu_\calB(b)$, (i) if tied, $\fracIdentity(a\to b)$, (iii) if still tied, $-\fracIndel(a\to b)$, all subject to $\fracIdentity(a\to b) \geq \alpha$ and $\fracIndel(a\to b) \leq \beta$. If the feasible set is empty, let $b^*_4(a) = \varnothing$.
For each $b\in\calB$, the mask $\mask_i(b)$ is defined as follows:
- $\mask_i(b)$ is a string of equal length to $b$, where $\mask_i(b)[l] = 1$ if there exists $a\in A$ such that (i) $b = b^*_i(a)$, (ii) the alignment $a \to b$ takes $k \mapsto l$ for some index $0\leq k\leq\length(a)$, (iii) $a[k] = b[l]$, and (iv) $b[l] \neq \mathtt{N}$; and $\mask_i(b)[l] = 0$ otherwise.
The function $\fracOnes$ is defined as follows:
- $\numOnes(\mask_i(b)) = \sum_{l = 1}^{\length(b)} \mask_i(b)$.
- $\length'(b) = \sum_{l=1}^{\length(b)} 1(b[l] \neq \mathtt{N})$.
- $\fracOnes(\mask_i(b)) = \numOnes(\mask_i(b)) / \length'(b)$.
The weighted_2_transcript_$i$_{precision, recall, F1}_at_$\alpha$_and_$\beta$ are computed for $i\in\{3,4\}$ as for $i\in\{1,2\}$ above.
Finally, we can consider an alternative definition of "probability", focused
on nucleotide-level expression instead of transcript-level expression. Let
$(a,k)\in\calA\times\N$ refer to contig $a$'s $k$th base, and define
$(b,l)\in\calB\times\N$ similarly.
- The probability that an assembly nucleotide matches an oracleset nucleotide is defined to be
$P((A,K) \in \proj_{\calA\times\N}(\matches(\calA\times\N, \calB\times\N)))$, where $(A,K)$ is a a random element of $\calA\times\N$.
- The probability that an oraclset nucleotide matches an assembly nucleotide is defined to be
$P((B,L) \in \proj_{\calB\times\N}(\matches(\calA\times\N, \calB\times\N)))$, where $(B,L)$ is a a random element of $\calB\times\N$.
The distribution of $(A,K)$ and $(B,L)$ is given by:
- $P(A=a, K=k) = \nu_\calA(a)/\length(a)$ for all $a\in\calA$.
- $P(B=b, L=l) = \nu_\calB(b)/\length(b)$ for all $b\in\calB$.
where $\tau_\calA$ and $\tau_\calB$ are the transcript-level expression levels,
according to RSEM.
We define $\matches$ as follows. For each $i\in\{3,4\}$, let
- $\matches_{i,\alpha,\beta}(\calA\times\N, \calB\times\N) = \{((a,k),(b^*_i(a),l)) : (a,b^*_i(a))\in\matches_{i,\alpha,\beta}(\calA,\calB), a[k] = b[l], b[l] \neq \mathtt{N}\}$.
We also define unmatched versions. Let
- $\matches_{1,\alpha,\beta}(\calA\times\N, \calB\times\N) =
\{((a,k),(b,l)) \in(\calA\times\N)\times(\calB\times\N) : \fracIdentity(a\to b) \geq \alpha, \fracIndel(a\to b) \leq \beta, (a\to b)(k) = l, a[k] = b[l], b[l] \neq \mathtt{N}\}$.
- $\matches_{2,\alpha,\beta}(\calA\times\N, \calB\times\N) =
\{((a,k),(b,l)) \in(\calA\times\N)\times(\calB\times\N) : \fracIdentity(b\to a) \geq \alpha, \fracIndel(b\to a) \leq \beta, (b\to a)(l) = k, a[k] = b[l], b[l] \neq \mathtt{N}\}$.
This gives us, for $i\in\{1,2,3,4\}$:
- weighted_2_nucleotide_$i$_precision_at_$\alpha$_and_$\beta = \sum_{a\in\calA} \sum_{k=1}^{\length(a)} 1\left((a,k)\in\proj_{\calA\times\N}(\matches_{i,\alpha,\beta}(\calA\times\N,\calB\times\N))\right) \nu_\calA(a)/\length(a)$
- weighted_2_nucleotide_$i$_recall_at_$\alpha$_and_$\beta = \sum_{b\in\calB} \sum_{l=1}^{\length(b)} 1\left((b,l)\in\proj_{\calB\times\N}(\matches_{i,\alpha,\beta}(\calA\times\N,\calB\times\N))\right) \nu_\calB(b)/\length(b)$
Computation of the above
Main procedure:
- Load $\calA$ and $\calB$.
- Load $\tau_\calA$ and $\tau_\calB$.
- Compute $b_i^*(a)$.
- Compute $\mask_i(b)$.
- Compute $\matches_{i,\alpha,\beta}(\calA,\calB)$.
- Compute $\matches_{i,\alpha,\beta}(\calA\times\N,\calB\times\N)$.
- Compute weighted_2_transcript_i_{precision, recall, F1}_at_$\alpha$_and_$\beta$.
- Compute weighted_2_nucleotide_i_{precision, recall, F1}_at_$\alpha$_and_$\beta$.
Procecure to compute $b_i^*(a)$:
- Let $\compare_3(a, b_1, b_2)$ return $b_1$ or $b_2$ so as to maximize (i) $\fracIdentity(a\to b)$, (ii) if tied, $-\fracIndel(a\to b)$, (iii) if still tied, $\nu_\calB(b)$.
- Let $\compare_4(a, b_1, b_2)$ return $b_1$ or $b_2$ so as to maximize (i) $\nu_\calB(b)$, (i) if tied, $\fracIdentity(a\to b)$, (iii) if still tied, $-\fracIndel(a\to b)$.
- Init $b_i^*(a) = \varnothing$.
- For each alignment $a\to b$:
- If $\fracIdentity(a\to b) \geq \alpha$ and $\fracIndel(a\to b) \leq \beta$:
- Let $b_i^*(a) = \compare_i(a, b_i^*(a), b)$.
Procecure to compute $\mask_i(b)$:
- Init $\mask_i(b)$ as a string of $\length(b)$ 0's.
- For each alignment $a\to b$:
- If $b = b_i^*(a)$:
- For $(k,l) \in a\to b$:
- If $a[k] = b[l]$ and $b[l] \neq \mathtt{N}$:
Procedure to compute $\matches_1(\calA,\calB)$:
- Init $\proj_\calA \matches_1(\calA,\calB) = \varnothing$.
- Init $\proj_\calB \matches_1(\calA,\calB) = \varnothing$.
- For each alignment $a\to b$:
- If $\fracIdentity(a\to b) \geq \alpha$ and $\fracIndel(a\to b) \leq \beta$:
- Add $a$ to $\proj_\calA \matches_1(\calA,\calB)$.
- Add $b$ to $\proj_\calB \matches_1(\calA,\calB)$.
Procedure to compute $\matches_2(\calA,\calB)$:
- Init $\proj_\calA \matches_2(\calA,\calB) = \varnothing$.
- Init $\proj_\calB \matches_2(\calA,\calB) = \varnothing$.
- For each alignment $b\to a$:
- If $\fracIdentity(b\to a) \geq \alpha$ and $\fracIndel(b\to a) \leq \beta$:
- Add $a$ to $\proj_\calA \matches_2(\calA,\calB)$.
- Add $b$ to $\proj_\calB \matches_2(\calA,\calB)$.
Procedure to compute $\matches_i(\calA,\calB)$ for $i\in\{3,4\}$:
- Init $\proj_\calA \matches_i(\calA,\calB) = \varnothing$.
- Init $\proj_\calB \matches_i(\calA,\calB) = \varnothing$.
- For each alignment $a\to b$:
- If $b = b_i^*(a)$ and $\fracOnes(\mask_i(b)) \geq \alpha$:
- Add $a$ to $\proj_\calA \matches_i(\calA,\calB)$.
- Add $b$ to $\proj_\calB \matches_i(\calA,\calB)$.
Procedure to compute $\matches_i(\calA\times\N,\calB\times\N)$ for $i\in\{3,4\}$:
- Init $\proj_{\calA\times\N} \matches_i(\calA\times\N,\calB\times\N) = \varnothing$.
- Init $\proj_{\calB\times\N} \matches_i(\calA\times\N,\calB\times\N) = \varnothing$.
- For each alignment $a\to b$:
- If $b = b_i^*(a)$ and $\fracOnes(\mask_i(b)) \geq \alpha$, i.e., if $(a,b_i^*(a))\in\matches_i(\calA,\calB)$:
- For $(k,l) \in a\to b$:
- If $a[k] = b[l]$ and $b[l] \neq \mathtt{N}$:
- Add $(a,k)$ to $\proj_{\calA\times\N} \matches_i(\calA\times\N,\calB\times\N)$.
- Add $(b,l)$ to $\proj_{\calB\times\N} \matches_i(\calA\times\N,\calB\times\N)$.
Procedure to compute weighted_2_transcript_i_{precision, recall, F1}_at_$\alpha$_and_$\beta$, for $i\in\{1,2,3,4\}$:
- weighted_2_transcript_$i$_precision_at_$\alpha$_and_$\beta = \sum_{a\in\proj_\calA(\matches_{i,\alpha,\beta}(\calA,\calB))} \tau_\calA(a)$
- weighted_2_transcript_$i$_recall_at_$\alpha$_and_$\beta = \sum_{b\in\proj_\calB(\matches_{i,\alpha,\beta}(\calA,\calB))} \tau_\calB(b)$
- weighted_2_transcript_$i$_F1_at_$\alpha$_and_$\beta$ = harmonic mean
Procedure to compute weighted_2_nucleotide_i_{precision, recall, F1}_at_$\alpha$_and_$\beta$, for $i\in\{1,2,3,4\}$:
- weighted_2_nucleotide_$i$_precision_at_$\alpha$_and_$\beta = \sum_{(a,k)\in\proj_{\calA\times\N}(\matches_{i,\alpha,\beta}(\calA\times\N,\calB\times\N))} \nu_\calA(a)/\length(a)$
- weighted_2_nucleotide_$i$_recall_at_$\alpha$_and_$\beta = \sum_{(b,l)\in\proj_{\calB\times\N}(\matches_{i,\alpha,\beta}(\calA\times\N,\calB\times\N))} \nu_\calB(b)/\length(b)$
- weighted_2_nucleotide_$i$_F1_at_$\alpha$_and_$\beta$ = harmonic mean
Precision-recall statistics
Let $A$ be an assembly and $B$ be an oracleset. By $s\in A$, we mean that $s$ is a contig in $A$. By $s[i]$ we mean the $i$th base in $s$.
For any alignment $q\to t$, define:
- length($q$) = the number of bases in $q$, including $\mathtt N$'s.
- length'($q$) = the number of non-$\mathtt N$ bases in $q$.
- frac_identity($q\to t$) = num_identity($q\to t$)/length'($q$).
- frac_indel($q\to t$) = num_indel($q\to t$)/length'($q$).
- $\nu_A(q)$ = nucleotide-level expression of $q$ within $A$ according to RSEM.
- $\tau_A(q)$ = transcript-level expression of $q$ within $A$ according to RSEM.
Fix min_frac_identity parameter $\alpha$ $\geq$ 0.8. If desired, fix
max_frac_indel parameter $\beta$.
Nucleotide-level statistics
We use the following procedure to compute nucleotide-level precision and
recall statistics.
- If $\beta$ is not given, set $\beta=1$.
- Run blat -minIdentity=80 B.fa A.fa A_to_B.psl to align the
sequences in $A$ to those in $B$, excluding those alignments with less than 80%
identity according to blat.
- For each $q\in A$, find the element $t^*\in B$ such that $q \to t^*$ is a
better alignment than $q \to t$ for all other $t\in B$. We do this as follows,
letting $t^*(q)$ denote the best $t$ found for $q$ so far, initialized as
$\varnothing$.
- For each alignment $q \to t$ where $q\in A$ and $t\in B$:
- If frac_identity($q\to t$) < $\alpha$
or frac_indel($q\to t$) > $\beta$, then skip this alignment.
- If $t^*(q) = \varnothing$
or frac_identity($q\to t$) > frac_identity($q\to t^*(q)$)
or (frac_identity($q\to t$) = frac_identity($q\to t^*(q)$)
and frac_indel($q\to t$) < frac_indel($q\to t^*(q)$)),
then set $t^*(q) = t$.
- Compute the following quantities based on the alignments $\{q \to t^*(q) : q\in A, t^*(q) \neq \varnothing\}$:
- num_mismatching_bases: Total number of mismatching (nonidentical) bases
in the alignments. If this is too high,
something might be wrong.
- num_raw_matching_bases: Total number of matching (identical) bases in
all the alignments, with bases counted multiple
times if they are involved in multiple
alignments.
- num_uniq_matching_bases: Total number of bases in B that match at least
one aligned base in A. Each base in B is counted
only once, no matter how many bases in A it
aligns to.
We compute these quantities as follows:
- For each $q\in A$, initialize frac_identity($q$) = 0.
- For each $t\in B$, initialize mask($t$) as a string of all 0's, the same length as $t$.
- Initialize num_raw_matching_bases = num_uniq_matching_bases = num_mismatching_bases = 0.
- For each alignment $q \to t$ where $q\in A$ and $t\in B$:
- If $t \neq t^*(q)$, skip this alignment.
- Otherwise: We can think of the alignment $q\to t$ as a set of pairs
$(k,l)$ where $k$ indexes $q$ and $l$ indexes $t$. For each pair $(k,l)$ in
the alignment:
- If $q[k] = t[l]$ and $t[l] \neq \mathtt N$:
- Increment num_raw_matching_bases.
- If mask($t$)$[l] = 0$:
- Increment num_uniq_matching_bases.
- Set mask($t$)$[l] = 1$.
- Else: Increment num_mismatching_bases.
Set frac_identity($q$) = frac_identity($q \to t$).
NLS:
- Let num_bases_in_$A = \sum_{q\in A}$ length($q$) and
num_bases_in_$B = \sum_{t\in B}$ length($t$).
- Let nucl_precision_at_$\alpha$_and_$\beta$ = num_uniq_matching_bases / num_bases_in_$A$.
- Let nucl_recall_at_$\alpha$_and_$\beta$ = num_uniq_matching_bases / num_bases_in_$B$.
WNLS:
- Let frac_identity($t$) = $\sum_{l=1}^{\mathrm{length}(t)}$ mask($t$)$[l]$ / length($t$), for each $t\in B$.
- Let weighted_nucl_precision_at_$\alpha$_and_$\beta$ = $\sum_{q \in A}$ frac_identity($q$) $\nu_A$($q$).
- Let weighted_nucl_recall_at_$\alpha$_and_$\beta$ = $\sum_{t \in B}$ frac_identity($t$) $\nu_B$($t$).
Transcript-level statistics
We use the following procedure to compute transcript-level precision and
recall statistics:
- If $\beta$ is not given, set $\beta=1$.
- Run blat -minIdentity=80 A.fa B.fa B_to_A.psl to align the
sequences in $B$ to those in $A$, excluding those alignments with less than 80%
identity according to blat.
- Create a bipartite graph in which
- The vertices are the elements of $B$ and $A$.
- There is an edge between $q\in B$ and $t\in A$ if there is a good enough alignment $q\to t$.
We do this as follows:
- For each alignment $q \to t$ where $q\in B$ and $t\in A$:
- If frac_identity($q\to t$) < $\alpha$ or
if frac_indel($q\to t$) > $\beta$, then skip this alignment.
- Otherwise: add an edge between $q$ and $t$.
- Compute the maximum cardinality matching of the above bipartite graph. This is done using the LEMON library.
TLS1:
- Let $m$ be the cardinality of the maximum cardinality matching.
- Let num_contigs_in_$A = |A|$ and
num_contigs_in_$B = |B|$.
- Let transcript_precision_at_$\alpha$_and_$\beta$ = $m$ / num_contigs_in_$A$.
- Let transcript_recall_at_$\alpha$_and_$\beta$ = $m$ / num_contigs_in_$B$.
WTLS1:
- Let $E$ be the set of edges in the maximum cardinality matching. We think
of $E$ as a set of pairs $(q, t)$ where $q\in B$ and $t\in A$.
- Let weighted_transcript_precision_at_$\alpha$_and_$\beta$ = $\sum_{(q,t) \in E} \tau_A(t)$.
- Let weighted_transcript_recall_at_$\alpha$_and_$\beta$ = $\sum_{(q,t) \in E} \tau_B(q)$.
Transcript_2-level statistics
We use the following procedure to compute transcript_2-level precision and
recall statistics:
- If $\beta$ is not given, set $\beta=1$.
- Run blat -minIdentity=80 B.fa A.fa A_to_B.psl to align the
sequences in $A$ to those in $B$, excluding those alignments with less than 80%
identity according to blat.
- Create a bipartite graph in which
- The vertices are the elements of $A$ and $B$.
- There is an edge between $q\in A$ and $t\in B$ if there is a good enough alignment $q\to t$.
We do this as follows:
- For each alignment $q \to t$ where $q\in A$ and $t\in B$:
- If frac_identity($q\to t$) < $\alpha$ or
if frac_indel($q\to t$) > $\beta$, then skip this alignment.
- Otherwise: add an edge between $q$ and $t$.
- Compute the maximum cardinality matching of the above bipartite graph. This is done using the LEMON library.
TLS2:
- Let $m$ be the cardinality of the maximum cardinality matching.
- Let num_contigs_in_$A = |A|$ and
num_contigs_in_$B = |B|$.
- Let transcript_2_precision_at_$\alpha$_and_$\beta$ = $m$ / num_contigs_in_$A$.
- Let transcript_2_recall_at_$\alpha$_and_$\beta$ = $m$ / num_contigs_in_$B$.
WTLS2:
- Let $E$ be the set of edges in the maximum cardinality matching. We think
of $E$ as a set of pairs $(q, t)$ where $q\in A$ and $t\in B$.
- Let weighted_transcript_2_precision_at_$\alpha$_and_$\beta$ = $\sum_{(q,t) \in E} \tau_A(q)$.
- Let weighted_transcript_2_recall_at_$\alpha$_and_$\beta$ = $\sum_{(q,t) \in E} \tau_B(t)$.
Transcript_3-level statistics
We use the following procedure to compute transcript_3-level precision and
recall statistics. We reuse the mask($t$) vectors and the frac_identity
vector created to compute the nucleotide-level statistics.
- If $\beta$ is not given, set $\beta=1$.
- Initialize num_covered_masks to 0.
- Initialize is_covered($t$) = 0 for each $t\in B$.
- Initialize is_covered($q$) = 0 for each $q\in A$.
- For each $t \in B$:
- Let num_ones be the number of '1' entries in mask($t$).
- Let frac_ones = num_ones / length($t$).
- If frac_ones $\geq$ $\alpha$, increment num_covered_masks and set is_covered($t$) = 1.
- For each $q \in A$: If frac_identity($q$) $\geq$ $\alpha$, set
is_covered($q$) = 1.
TLS3:
- Let num_contigs_in_$A = |A|$ and
num_contigs_in_$B = |B|$.
- Let transcript_3_precision_at_$\alpha$_and_$\beta$ = num_covered_masks / num_contigs_in_$A$.
- Let transcript_3_recall_at_$\alpha$_and_$\beta$ = num_covered_masks / num_contigs_in_$B$.
WLS3:
- Let weighted_transcript_3_precision_at_$\alpha$_and_$\beta$ = $\sum_{q\in A}$ is_covered($q$) $\tau_A(q)$.
- Let weighted_transcript_3_recall_at_$\alpha$_and_$\beta$ = $\sum_{t\in B}$ is_covered($t$) $\tau_B(t)$.
Motivation for the weighted variants above
For the unweighted versions, we defined:
- transcript precision = number of matches between contigs and oracleset elements / total number of contigs.
- transcript recall = number of matches between contigs and oracleset elements / total number of oracleset elements.
One can interpret these definitions as follows
- transcript precision = probability that a randomly selected contig matches an oracleset element, under a uniform distribution over contigs.
- transcript recall = probability that a randomly selected oracleset element matches a contig, under a uniform distribution over oracleset elements.
For the weighted versions, we instead define:
- transcript precision = probability that a randomly selected contig matches an oracleset element, under the distribution over contigs determined by rsem-calculate-expression.
- transcript recall = probability that a randomly selected oracleset element matches a contig, under the distribution over oracleset elements determined by rsem-calculate-expression.
In other words (using notation from the next subsection):
- transcript precision = $\sum_{s \in A} \tau_A(s) 1(s$ matches $t$ for some $t\in B$).
- transcript recall = $\sum_{t \in B} \tau_B(t) 1(s$ matches $t$ for some $s\in A$).
Here "matches" is defined differently for the transcript, transcript_2, and
transcript_3 statistics.
Similarly, for the unweighted versions, we defined:
- nucleotide precision = number of matches between assembly bases and oracleset bases / total number of assembly bases.
- nucleotide recall = number of matches between assembly bases and oracleset bases / total number of oracleset bases.
One can interpret these definitions as follows
- nucleotide precision = probability that a randomly selected assembly base matches an oracleset base, under a uniform distribution over assembly bases.
- nucleotide recall = probability that a randomly selected oracleset base matches an assembly base, under a uniform distribution over oracleset bases.
For the weighted versions, we instead define:
- nucleotide precision = probability that a randomly selected assembly base matches an oracleset base, under the distribution over assembly bases determined by rsem-calculate-expression.
- nucleotide recall = probability that a randomly selected oracleset base matches a assembly base, under the distribution over oracleset bases determined by rsem-calculate-expression.
RSEM approx columns
These columns come from expression.approx in RSEM's output.
rsem.approx.approx |
I believe that this is the log model evidence, log P(D), computed using a convex approximation. |
rsem.approx.bic |
Log model evidence, log P(D), computed using BIC. |
rsem.approx.loglikelihood |
I believe that this is the log likelihood, log P(D|\theta), computed at the MAP \theta. |
rsem.approx.loglikelihood.penalty |
This is the BIC penalty. |
It should be the case that rsem.approx.bic = rsem.approx.loglikelihood - rsem.approx.loglikelihood.penalty.
RSEM eval columns
These columns come from expression.eval in RSEM's output.
rsem.eval.lognumer.minus.logdenom |
I believe that this is \log P(D|\theta') + \log P(\theta') - \log
P(\theta'|D), where \theta' is a posterior mean estimate |
rsem.eval.logprior |
I believe that this is \log P(\theta') |
rsem.eval.loglikelihood |
I believe that this is \log P(D|\theta') |
rsem.eval.logdenom |
I believe that this is \log P(\theta'|D) |
RSEM prior columns
These columns come from expression.prior in RSEM's output.
rsem.prior.log.prob.M |
This is \log P(M). |
rsem.prior.log.prob.L.given.M |
This is \log P(L|M). |
rsem.prior.log.prob.Sequences.given.L.and.M |
This is \log P(Sequences|L, M). |
rsem.prior.log.prob.A |
This is \log P(A) = \log P(M) + \log P(L|M) + \log P(Sequences|L, M). |
rsem.eval.loglikelihood.plus.rsem.prior.log.prob.A |
This is \log P(A) + rsem.eval.loglikelihood |
rsem.approx.loglikelihood.plus.rsem.prior.log.prob.A |
This is \log P(A) + rsem.approx.loglikelihood |
rsem.approx.approx.plus.rsem.prior.log.prob.A |
This is \log P(A) + rsem.approx.approx |
rsem.approx.bic.plus.rsem.prior.log.prob.A |
This is \log P(A) + rsem.approx.bic |
RSEM ss columns
These columns come from expression.ss in RSEM's output.
rsem.ss.mean.num.reads.per.transcript |
Mean number of reads aligning to each transcript (based on countvs). |
rsem.ss.median.num.reads.per.transcript |
Median number of reads aligning to each transcript (based on countvs). |
rsem.ss.num.transcripts.with.zero.reads |
Number of transcripts with no reads aligning to them (based on countvs). |
rsem.ss.num.matching.bases |
Number of matching bases, based on the qpro profiles. |
rsem.ss.num.mismatching.bases |
Number of mismatching bases, based on the qpro profiles. |