next up previous
Next: Recovery from indicator BOWs Up: Document Recovery from Bag-of-Word Previous: Baseline

Recovery from stopwords-removed count BOWs

In order for a document $ \mathbf{d}$ to be in the feasible set $ F(\mathbf{x})$ of a stopwords-removed count BOW $ \mathbf{x}$, each word $ v$ in the vocabulary must occur $ \mathbf{x}_v$ times in $ \mathbf{d}$, unless $ v$ is a stopword. If $ v$ is a stopword, it can occur any number of times in the document. As a result, we also do not know $ n$, the document length. This makes A$ ^*$ search difficult. For example, a short document that only uses words in $ \mathbf{x}$ may have a higher language model score than a longer document that also uses stopwords. Conversely, a long document that repeats a high-probability phrase of stopwords like ``and in the'' may have a higher per-word score than a moderate-length document that uses fewer stopwords. In fact, per-word score can increase monotonically with partial document length. To simplify the problem and reduce the size of the feasible set, we infer document length from $ \mathbf{x}$ and limit the feasible set to documents of that length.

To construct an estimator for document length $ n$, given a stopwords-removed count BOW, we calculate the average ratio of document length (including stopwords) to BOW length (excluding stopwords) $ \beta = \frac{1}{\vert D\vert} \sum_{\mathbf{d}_i \in D} \frac{n_i}{\mathbf{1}^\top \mathbf{x}_i}$ on a separate training set $ D$ of documents, where each document $ \mathbf{d}_i$ has length $ n_i$ and a stopwords-removed BOW $ \mathbf{x}_i$, and $ \mathbf{1}$ is the all-one vector. Our document-length estimator, given a BOW $ \mathbf{x}$ with unknown $ n$ and $ \mathbf{d}$, simply scales BOW length by $ \beta$: $ \hat{n}(\mathbf{x}) = \beta \mathbf{1}^\top \mathbf{x}$.

To find the best document $ \mathbf{d}^* \in F(\mathbf{x})$, we perform A$ ^*$ search as before, with one difference: successor states are generated by appending words drawn from the remaining BOW (without replacement), as before, but also from the stopwords list (with replacement). The heuristic $ h^{\mathrm{adm}}$ is still admissible, although it is farther from the true future path score than before, since $ h^{\mathrm{adm}}$ does not sum over stopwords' $ s^*$ scores.


next up previous
Next: Recovery from indicator BOWs Up: Document Recovery from Bag-of-Word Previous: Baseline
Nathanael Fillmore 2008-07-18