$ \def\match{\mathrm{match}} \def\matches{\mathrm{matches}} \def\leftmatches{\mathrm{left\_matches}} \def\proj{\mathrm{proj}} \def\calA{\mathcal A} \def\calB{\mathcal B} \def\fracIdentity{\mathrm{frac\_identity}} \def\fracIndel{\mathrm{frac\_indel}} \def\mask{\mathrm{mask}} \def\fracOnes{\mathrm{frac\_ones}} \def\numOnes{\mathrm{num\_ones}} \def\length{\mathrm{length}} \def\N{\mathbb{N}} \def\precision{\mathrm{precision}} \def\recall{\mathrm{recall}} \def\F1{\mathrm{F1}} \def\ok{\mathrm{ok}} \def\compare{\mathrm{compare}} $

Better weighted precision and recall statistics

In general, we may define precision and recall as follows:

To get a specific precision-recall pair, we need to define "probability" and "matches".

We consider the following transcript-level definitions of "probability":

Here $\proj_\calA(\mathcal S) = \cup_{t\in\calB} \{s \in \calA : (s,t) \in \mathcal S\}$, $\proj_\calB$ is defined similarly, and $\matches$ is defined below. The distribution of $A$ and $B$ is given by:

where $\tau_\calA$ and $\tau_\calB$ are the transcript-level expression levels, according to RSEM.
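As a hedged sketch, one reading consistent with the projection notation is that precision is the probability that a random assembly element $A$ takes part in a match, and recall is the probability that a random oracleset element $B$ does. The exact form below, including the normalization by the total expression level, is an assumption and not a quotation of the original formulas. If the RSEM values already sum to one, the denominators equal 1.

```latex
\begin{align*}
\precision &= P\bigl(A \in \proj_\calA(\matches(\calA,\calB))\bigr), \\
\recall    &= P\bigl(B \in \proj_\calB(\matches(\calA,\calB))\bigr), \\
\mathrm{F1} &= \frac{2 \cdot \precision \cdot \recall}{\precision + \recall}, \\
P(A = a)   &= \frac{\tau_\calA(a)}{\sum_{a' \in \calA} \tau_\calA(a')}, \qquad
P(B = b)    = \frac{\tau_\calB(b)}{\sum_{b' \in \calB} \tau_\calB(b')}.
\end{align*}
```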

We consider the following definitions of "matches".

The above definitions are used to compute weighted_2_transcript_1_{precision, recall, F1}_at_$\alpha$_and_$\beta$ and weighted_2_transcript_2_{precision, recall, F1}_at_$\alpha$_and_$\beta$, as follows:

With these definitions, the precision and recall measure how well individual oracleset elements are recovered by individual assembly elements.
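A minimal Python sketch of how a transcript-level weighted precision, recall, and F1 could be computed once a set of matched pairs is available. It assumes that precision is the expression-weighted fraction of assembly elements that appear in some match, and likewise for recall on the oracleset side. The function names (`project`, `weighted_fraction`, `precision_recall_f1`) and the dictionary inputs are hypothetical and chosen only for illustration.

```python
def project(matches, side):
    """Return the elements on one side of the matched pairs.

    side=0 selects assembly elements, side=1 selects oracleset elements.
    """
    return {pair[side] for pair in matches}


def weighted_fraction(tau, matched):
    """Probability mass of `matched` under the distribution proportional to `tau`.

    Assumption: expression levels are normalized by their total here.
    """
    total = sum(tau.values())
    if total == 0:
        return 0.0
    return sum(weight for elem, weight in tau.items() if elem in matched) / total


def precision_recall_f1(tau_a, tau_b, matches):
    """Compute (precision, recall, F1) from expression levels and matched pairs."""
    precision = weighted_fraction(tau_a, project(matches, 0))
    recall = weighted_fraction(tau_b, project(matches, 1))
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    tau_a = {"a1": 3.0, "a2": 1.0}
    tau_b = {"b1": 1.0, "b2": 1.0, "b3": 2.0}
    matches = {("a1", "b1"), ("a1", "b3")}
    print(precision_recall_f1(tau_a, tau_b, matches))
```

In this toy input, only `a1` is matched on the assembly side, and `b1` and `b3` on the oracleset side, so highly expressed elements dominate both statistics.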

We may also consider the following definitions of "matches", which lead to precision and recall statistics that measure how well individual oracleset elements are recovered by collections of (possibly) several assembly elements.

The elements $b^*_i(a)$ are defined as follows:

For each $b\in\calB$, the mask $\mask_i(b)$ is defined as follows:

The function $\fracOnes$ is defined as follows:

The weighted_2_transcript_$i$_{precision, recall, F1}_at_$\alpha$_and_$\beta$ are computed for $i\in\{3,4\}$ as for $i\in\{1,2\}$ above.
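To make the mask idea concrete, here is a small Python sketch under stated assumptions. It assumes that a mask is a 0/1 vector with one entry per base of $b$, that a one marks a base covered by some alignment, and that $\fracOnes$ is the number of ones divided by the mask length (suggested by the names $\numOnes$ and $\length$). The interval input and the helper names `build_mask`, `num_ones`, and `frac_ones` are hypothetical. How $\alpha$, $\beta$, and $b^*_i(a)$ select the alignments is not modeled here.

```python
def build_mask(length, intervals):
    """Return a 0/1 list with ones at positions covered by any interval.

    Intervals are half-open (start, end) pairs in 0-based coordinates.
    Parts of an interval outside [0, length) are ignored.
    """
    mask = [0] * length
    for start, end in intervals:
        for pos in range(max(start, 0), min(end, length)):
            mask[pos] = 1
    return mask


def num_ones(mask):
    """Count the covered positions."""
    return sum(mask)


def frac_ones(mask):
    """Fraction of covered positions; defined as 0.0 for an empty mask."""
    return num_ones(mask) / len(mask) if mask else 0.0


if __name__ == "__main__":
    mask = build_mask(10, [(0, 3), (2, 5), (8, 12)])
    print(mask, num_ones(mask), frac_ones(mask))
```

Overlapping intervals are counted once, so several assembly elements covering the same region of an oracleset element do not inflate the covered fraction.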

Finally, we can consider an alternative definition of "probability", focused on nucleotide-level expression instead of transcript-level expression. Let $(a,k)\in\calA\times\N$ refer to contig $a$'s $k$th base, and define $(b,l)\in\calB\times\N$ similarly.

The distribution of $(A,K)$ and $(B,L)$ is given by:

where $\tau_\calA$ and $\tau_\calB$ are the transcript-level expression levels, according to RSEM.

We define $\matches$ as follows. For each $i\in\{3,4\}$, let

This gives us weighted_2_nucleotide_3_{precision, recall, F1}_at_$\alpha$_and_$\beta$ and weighted_2_nucleotide_4_{precision, recall, F1}_at_$\alpha$_and_$\beta$, as follows:
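A hedged Python sketch of a nucleotide-level weighted fraction follows. It assumes that each base $(a,k)$ carries weight proportional to its element's transcript-level expression $\tau_\calA(a)$, so an element's total weight scales with expression times length, and that matched bases are given as a 0/1 mask per element. Both the weighting and the function name `nucleotide_weighted_fraction` are assumptions made for illustration.

```python
def nucleotide_weighted_fraction(tau, matched_masks):
    """Expression-weighted fraction of matched bases.

    tau: dict mapping element -> transcript-level expression level.
    matched_masks: dict mapping element -> 0/1 list, one entry per base,
        where 1 marks a matched base.
    Assumption: every base of an element has weight tau[element].
    """
    total = 0.0
    matched = 0.0
    for element, mask in matched_masks.items():
        weight = tau.get(element, 0.0)
        total += weight * len(mask)
        matched += weight * sum(mask)
    return matched / total if total > 0 else 0.0


if __name__ == "__main__":
    tau_a = {"a1": 2.0, "a2": 1.0}
    masks_a = {"a1": [1, 1, 0, 0], "a2": [1, 1]}
    print(nucleotide_weighted_fraction(tau_a, masks_a))
```

Applying the same function to the assembly side would give a precision-like value and to the oracleset side a recall-like value, and F1 would follow as their harmonic mean.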

Computation of the above

Main procedure:

Procedure to compute $b_i^*(a)$:

Procedure to compute $\mask_i(b)$:

Procedure to compute $\matches_1(\calA,\calB)$:

Procedure to compute $\matches_2(\calA,\calB)$:

Procedure to compute $\matches_i(\calA,\calB)$ for $i\in\{3,4\}$:

Procedure to compute $\matches_i(\calA\times\N,\calB\times\N)$ for $i\in\{3,4\}$:

Procedure to compute weighted_2_transcript_i_{precision, recall, F1}_at_$\alpha$_and_$\beta$, for $i\in\{1,2,3,4\}$:

Procedure to compute weighted_2_nucleotide_i_{precision, recall, F1}_at_$\alpha$_and_$\beta$, for $i\in\{1,2,3,4\}$: