$
\def\match{\mathrm{match}}
\def\matches{\mathrm{matches}}
\def\leftmatches{\mathrm{left\_matches}}
\def\proj{\mathrm{proj}}
\def\calA{\mathcal A}
\def\calB{\mathcal B}
\def\fracIdentity{\mathrm{frac\_identity}}
\def\fracIndel{\mathrm{frac\_indel}}
\def\mask{\mathrm{mask}}
\def\fracOnes{\mathrm{frac\_ones}}
\def\numOnes{\mathrm{num\_ones}}
\def\length{\mathrm{length}}
\def\N{\mathbb{N}}
\def\precision{\mathrm{precision}}
\def\recall{\mathrm{recall}}
\def\F1{\mathrm{F1}}
\def\ok{\mathrm{ok}}
\def\compare{\mathrm{compare}}
$
Better weighted precision and recall statistics
In general, we may define
- precision = probability that an assembly element matches an oracleset element, where assembly elements are considered random and oracleset elements are not.
- recall = probability that an oracleset element matches an assembly element, where oracleset elements are considered random and assembly elements are not.
To get a specific precision-recall pair, we need to define "probability" and "matches".
We consider the following transcript-level definitions of "probability".
- The probability that an assembly element matches an oracleset element is defined to be
$P(A \in \proj_{\calA}(\matches(\calA, \calB)))$, where $A$ is a random variable taking values in $\calA$.
- The probability that an oraclset element matches an assembly element is defined to be
$P(B \in \proj_{\calB}(\matches(\calA, \calB)))$, where $B$ is a random variable taking values in $\calB$.
Here $\proj_\calA(\mathcal S) = \cup_{t\in B} \{s \in \calA : (s,t) \in
\mathcal S\}$, $\proj_\calB$ is defined similarly, and $\matches$ is defined
below. The distribution of $A$ and $B$ is given by:
- $P(A = a) = \tau_\calA(a)$ for all $a\in\calA$.
- $P(B = b) = \tau_\calB(b)$ for all $b\in\calB$.
where $\tau_\calA$ and $\tau_\calB$ are the transcript-level expression levels,
according to RSEM.
We consider the following definitions of "matches".
- $\matches_{1,\alpha,\beta}(\calA, \calB) = \{(a,b) \in \calA\times\calB : \fracIdentity(a\to b) \geq \alpha, \fracIndel(a\to b) \leq \beta\}$, i.e., $(a,b)\in\calA\times\calB$ are a match if $a$ aligns to $b$ sufficiently well.
- $\matches_{2,\alpha,\beta}(\calA, \calB) = \{(a,b) \in \calA\times\calB : \fracIdentity(b\to a) \geq \alpha, \fracIndel(b\to a) \leq \beta\}$, i.e., $(a,b)\in\calA\times\calB$ are a match if $b$ aligns to $a$ sufficiently well.
The above definitions are used to compute
weighted_2_transcript_1_{precision, recall, F1}_at_$\alpha$_and_$\beta$ and
weighted_2_transcript_2_{precision, recall, F1}_at_$\alpha$_and_$\beta$,
as follows:
- weighted_2_transcript_$i$_precision_at_$\alpha$_and_$\beta = \sum_{a\in\calA} 1\left(a\in\proj_\calA(\matches_{i,\alpha,\beta}(\calA,\calB))\right) \tau_\calA(a)$
- weighted_2_transcript_$i$_recall_at_$\alpha$_and_$\beta = \sum_{b\in\calB} 1\left(b\in\proj_\calB(\matches_{i,\alpha,\beta}(\calA,\calB))\right) \tau_\calB(b)$
With these definitions, the precision and recall measure how well individual
oracleset elements are recovered by individual assembly elements.
We may also consider the following definitions of "matches", which lead to
precision and recall statistics that measure how well individual oracleset
elements are recovered by collections of (possibly) several assembly elements.
- $\matches_{3,\alpha,\beta}(\calA, \calB) = \{(a,b^*_3(a)) : a\in\calA, b^*_3(a) \neq \varnothing, \fracOnes(\mask_3(b^*_3(a))) \geq \alpha\}$.
- $\matches_{4,\alpha,\beta}(\calA, \calB) = \{(a,b^*_4(a)) : a\in\calA, b^*_4(a) \neq \varnothing, \fracOnes(\mask_4(b^*_4(a))) \geq \alpha\}$.
The elements $b^*_i(a)$ are defined as follows:
- For each $a\in\calA$, let $b^*_3(a)$ be the element of $\calB$ that maximizes (i) $\fracIdentity(a\to b)$, (ii) if tied, $-\fracIndel(a\to b)$, (iii) if still tied, $\nu_\calB(b)$, all subject to $\fracIdentity(a\to b) \geq \alpha$ and $\fracIndel(a\to b) \leq \beta$. If the feasible set is empty, let $b^*_3(a) = \varnothing$.
- For each $a\in\calA$, let $b^*_4(a)$ be the element of $\calB$ that maximizes (i) $\nu_\calB(b)$, (i) if tied, $\fracIdentity(a\to b)$, (iii) if still tied, $-\fracIndel(a\to b)$, all subject to $\fracIdentity(a\to b) \geq \alpha$ and $\fracIndel(a\to b) \leq \beta$. If the feasible set is empty, let $b^*_4(a) = \varnothing$.
For each $b\in\calB$, the mask $\mask_i(b)$ is defined as follows:
- $\mask_i(b)$ is a string of equal length to $b$, where $\mask_i(b)[l] = 1$ if there exists $a\in A$ such that (i) $b = b^*_i(a)$, (ii) the alignment $a \to b$ takes $k \mapsto l$ for some index $0\leq k\leq\length(a)$, (iii) $a[k] = b[l]$, and (iv) $b[l] \neq \mathtt{N}$; and $\mask_i(b)[l] = 0$ otherwise.
The function $\fracOnes$ is defined as follows:
- $\numOnes(\mask_i(b)) = \sum_{l = 1}^{\length(b)} \mask_i(b)$.
- $\length'(b) = \sum_{l=1}^{\length(b)} 1(b[l] \neq \mathtt{N})$.
- $\fracOnes(\mask_i(b)) = \numOnes(\mask_i(b)) / \length'(b)$.
The weighted_2_transcript_$i$_{precision, recall, F1}_at_$\alpha$_and_$\beta$ are computed for $i\in\{3,4\}$ as for $i\in\{1,2\}$ above.
Finally, we can consider an alternative definition of "probability", focused
on nucleotide-level expression instead of transcript-level expression. Let
$(a,k)\in\calA\times\N$ refer to contig $a$'s $k$th base, and define
$(b,l)\in\calB\times\N$ similarly.
- The probability that an assembly nucleotide matches an oracleset nucleotide is defined to be
$P((A,K) \in \proj_{\calA\times\N}(\matches(\calA\times\N, \calB\times\N)))$, where $(A,K)$ is a a random element of $\calA\times\N$.
- The probability that an oraclset nucleotide matches an assembly nucleotide is defined to be
$P((B,L) \in \proj_{\calB\times\N}(\matches(\calA\times\N, \calB\times\N)))$, where $(B,L)$ is a a random element of $\calB\times\N$.
The distribution of $(A,K)$ and $(B,L)$ is given by:
- $P(A=a, K=k) = \nu_\calA(a)/\length(a)$ for all $a\in\calA$.
- $P(B=b, L=l) = \nu_\calB(b)/\length(b)$ for all $b\in\calB$.
where $\tau_\calA$ and $\tau_\calB$ are the transcript-level expression levels,
according to RSEM.
We define $\matches$ as follows. For each $i\in\{3,4\}$, let
- $\matches_{i,\alpha,\beta}(\calA\times\N, \calB\times\N) = \{((a,k),(b^*_i(a),l)) : (a,b^*_i(a))\in\matches_{i,\alpha,\beta}(\calA,\calB), a[k] = b[l], b[l] \neq \mathtt{N}\}$.
This gives us
weighted_2_nucleotide_3_{precision, recall, F1}_at_$\alpha$_and_$\beta$ and
weighted_2_nucleotide_4_{precision, recall, F1}_at_$\alpha$_and_$\beta$, as follows:
- weighted_2_nucleotide_$i$_precision_at_$\alpha$_and_$\beta = \sum_{a\in\calA} \sum_{k=1}^{\length(a)} 1\left((a,k)\in\proj_{\calA\times\N}(\matches_{i,\alpha,\beta}(\calA\times\N,\calB\times\N))\right) \nu_\calA(a)/\length(a)$
- weighted_2_nucleotide_$i$_recall_at_$\alpha$_and_$\beta = \sum_{b\in\calB} \sum_{l=1}^{\length(b)} 1\left((b,l)\in\proj_{\calB\times\N}(\matches_{i,\alpha,\beta}(\calA\times\N,\calB\times\N))\right) \nu_\calB(b)/\length(b)$
Computation of the above
Main procedure:
- Load $\calA$ and $\calB$.
- Load $\tau_\calA$ and $\tau_\calB$.
- Compute $b_i^*(a)$.
- Compute $\mask_i(b)$.
- Compute $\matches_{i,\alpha,\beta}(\calA,\calB)$.
- Compute $\matches_{i,\alpha,\beta}(\calA\times\N,\calB\times\N)$.
- Compute weighted_2_transcript_i_{precision, recall, F1}_at_$\alpha$_and_$\beta$.
- Compute weighted_2_nucleotide_i_{precision, recall, F1}_at_$\alpha$_and_$\beta$.
Procecure to compute $b_i^*(a)$:
- Let $\compare_3(a, b_1, b_2)$ return $b_1$ or $b_2$ so as to maximize (i) $\fracIdentity(a\to b)$, (ii) if tied, $-\fracIndel(a\to b)$, (iii) if still tied, $\nu_\calB(b)$.
- Let $\compare_4(a, b_1, b_2)$ return $b_1$ or $b_2$ so as to maximize (i) $\nu_\calB(b)$, (i) if tied, $\fracIdentity(a\to b)$, (iii) if still tied, $-\fracIndel(a\to b)$.
- Init $b_i^*(a) = \varnothing$.
- For each alignment $a\to b$:
- If $\fracIdentity(a\to b) \geq \alpha$ and $\fracIndel(a\to b) \leq \beta$:
- Let $b_i^*(a) = \compare_i(a, b_i^*(a), b)$.
Procecure to compute $\mask_i(b)$:
- Init $\mask_i(b)$ as a string of $\length(b)$ 0's.
- For each alignment $a\to b$:
- If $b = b_i^*(a)$:
- For $(k,l) \in a\to b$:
- If $a[k] = b[l]$ and $b[l] \neq \mathtt{N}$:
Procedure to compute $\matches_1(\calA,\calB)$:
- Init $\proj_\calA \matches_1(\calA,\calB) = \varnothing$.
- Init $\proj_\calB \matches_1(\calA,\calB) = \varnothing$.
- For each alignment $a\to b$:
- If $\fracIdentity(a\to b) \geq \alpha$ and $\fracIndel(a\to b) \leq \beta$:
- Add $a$ to $\proj_\calA \matches_1(\calA,\calB)$.
- Add $b$ to $\proj_\calB \matches_1(\calA,\calB)$.
Procedure to compute $\matches_2(\calA,\calB)$:
- Init $\proj_\calA \matches_2(\calA,\calB) = \varnothing$.
- Init $\proj_\calB \matches_2(\calA,\calB) = \varnothing$.
- For each alignment $b\to a$:
- If $\fracIdentity(b\to a) \geq \alpha$ and $\fracIndel(b\to a) \leq \beta$:
- Add $a$ to $\proj_\calA \matches_2(\calA,\calB)$.
- Add $b$ to $\proj_\calB \matches_2(\calA,\calB)$.
Procedure to compute $\matches_i(\calA,\calB)$ for $i\in\{3,4\}$:
- Init $\proj_\calA \matches_i(\calA,\calB) = \varnothing$.
- Init $\proj_\calB \matches_i(\calA,\calB) = \varnothing$.
- For each alignment $a\to b$:
- If $b = b_i^*(a)$ and $\fracOnes(\mask_i(b)) \geq \alpha$:
- Add $a$ to $\proj_\calA \matches_i(\calA,\calB)$.
- Add $b$ to $\proj_\calB \matches_i(\calA,\calB)$.
Procedure to compute $\matches_i(\calA\times\N,\calB\times\N)$ for $i\in\{3,4\}$:
- Init $\proj_{\calA\times\N} \matches_i(\calA\times\N,\calB\times\N) = \varnothing$.
- Init $\proj_{\calB\times\N} \matches_i(\calA\times\N,\calB\times\N) = \varnothing$.
- For each alignment $a\to b$:
- If $b = b_i^*(a)$ and $\fracOnes(\mask_i(b)) \geq \alpha$, i.e., if $(a,b_i^*(a))\in\matches_i(\calA,\calB)$:
- For $(k,l) \in a\to b$:
- If $a[k] = b[l]$ and $b[l] \neq \mathtt{N}$:
- Add $(a,k)$ to $\proj_{\calA\times\N} \matches_i(\calA\times\N,\calB\times\N)$.
- Add $(b,l)$ to $\proj_{\calB\times\N} \matches_i(\calA\times\N,\calB\times\N)$.
Procedure to compute weighted_2_transcript_i_{precision, recall, F1}_at_$\alpha$_and_$\beta$, for $i\in\{1,2,3,4\}$:
- weighted_2_transcript_$i$_precision_at_$\alpha$_and_$\beta = \sum_{a\in\proj_\calA(\matches_{i,\alpha,\beta}(\calA,\calB))} \tau_\calA(a)$
- weighted_2_transcript_$i$_recall_at_$\alpha$_and_$\beta = \sum_{b\in\proj_\calB(\matches_{i,\alpha,\beta}(\calA,\calB))} \tau_\calB(b)$
- weighted_2_transcript_$i$_F1_at_$\alpha$_and_$\beta = harmonic mean
Procedure to compute weighted_2_nucleotide_i_{precision, recall, F1}_at_$\alpha$_and_$\beta$, for $i\in\{1,2,3,4\}$:
- weighted_2_nucleotide_$i$_precision_at_$\alpha$_and_$\beta = \sum_{(a,k)\in\proj_{\calA\times\N}(\matches_{i,\alpha,\beta}(\calA\times\N,\calB\times\N))} \nu_\calA(a)/\length(a)$
- weighted_2_nucleotide_$i$_recall_at_$\alpha$_and_$\beta = \sum_{(b,l)\in\proj_{\calB\times\N}(\matches_{i,\alpha,\beta}(\calA\times\N,\calB\times\N))} \nu_\calB(b)/\length(b)$
- weighted_2_nucleotide_$i$_F1_at_$\alpha$_and_$\beta = harmonic mean