next up previous
Next: Recovery from count BOWs Up: Document Recovery from Bag-of-Word Previous: Related work


Problem formulation

Let $ \mathbf{d}= w_1^n$ be a document with $ n$ words. We are interested in four types of indices:

1. Count BOW. The index is a vector $ \mathbf{x}= (x_1,...,x_V) \in
\mathbb{Z}_+^V$, where $ x_v$ is the number of times word $ v \in \mathrm{vocab}$ occurs in $ \mathbf{d}$ and $ V$ is the size of vocabulary $ \mathrm{vocab}$.

2. Stopwords-removed count BOW. The index $ \mathbf{x}= (x_1,...,x_V) \in
\mathbb{Z}_+^V$, where $ x_v$ is 0 if $ v\in
\mathrm{stopwords}$, and the number of times $ v$ occurs in $ \mathbf{d}$ otherwise.

3. Indicator BOW. The index $ \mathbf{x}= (x_1,...,x_V) \in \{0,1\}^V$, where $ x_v$ is $ 1$ if $ v$ occurs in $ \mathbf{d}$, and 0 otherwise.

4. Bigram count vector. The index $ \mathbf{x}= (x_{11},...,x_{VV}) \in
\mathbb{Z}_+^{V^2}$, where $ x_{ij}$ is the number of times the ordered pair $ v_i, v_j$ occurs in $ \mathbf{d}$.

For any index $ \mathbf{x}$, the feasible set $ F(\mathbf{x})$ is defined as the set of documents that are consistent with $ \mathbf{x}$, that is, the set of documents each having index $ \mathbf{x}$: $ F(\mathbf{x}) = \{\mathbf{d}: \mathbf{x}(\mathbf{d}) = \mathbf{x}\}$. The document recovery problem is to find the best document in the feasible set given an index: $ \mathbf{d}^* = \mathrm{arg} \max _{\mathbf{d}\in F(\mathbf{x})} \mathrm{score}(\mathbf{d})$ for some score function. For large $ F(\mathbf{x})$, $ \mathbf{d}^*$ may be impractical to compute exactly. In what follows, we define the feasible sets, score functions, and methods of approximating $ \mathbf{d}^*$ for each index type.


next up previous
Next: Recovery from count BOWs Up: Document Recovery from Bag-of-Word Previous: Related work
Nathanael Fillmore 2008-07-18