In order for a document $d$ to be in the feasible set of a stopwords-removed count BOW $\mathbf{c}$, each word $w$ in the vocabulary must occur exactly $c_w$ times in $d$, unless $w$ is a stopword. If $w$ is a stopword, it can occur any number of times in the document. As a result, we also do not know $n$, the document length. This makes A$^*$ search difficult. For example, a short document that only uses words in $\mathbf{c}$ may have a higher language model score than a longer document that also uses stopwords. Conversely, a long document that repeats a high-probability phrase of stopwords like ``and in the'' may have a higher per-word score than a moderate-length document that uses fewer stopwords. In fact, per-word score can increase monotonically with partial document length. To simplify the problem and reduce the size of the feasible set, we infer the document length from $\mathbf{c}$ and limit the feasible set to documents of that length.
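As a concrete illustration of the feasibility constraint, the following minimal sketch (the helper name \texttt{is\_feasible} and the tiny \texttt{STOPWORDS} set are illustrative, not part of our method) checks whether a tokenized document lies in the feasible set of a stopwords-removed count BOW:

\begin{verbatim}
from collections import Counter

STOPWORDS = {"the", "and", "in", "a", "of"}  # illustrative stopword set

def is_feasible(document_tokens, bow_counts):
    """A document is feasible for a stopwords-removed count BOW iff every
    non-stopword appears exactly as many times as its BOW count;
    stopwords may appear any number of times."""
    doc_counts = Counter(w for w in document_tokens if w not in STOPWORDS)
    return doc_counts == bow_counts

# Two documents of different lengths are both feasible for the same BOW.
bow = Counter({"cat": 1, "sat": 1, "mat": 1})
print(is_feasible(["cat", "sat", "mat"], bow))                      # True
print(is_feasible(["the", "cat", "sat", "in", "the", "mat"], bow))  # True
\end{verbatim}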
To construct an estimator for the document length $n$, given a stopwords-removed count BOW, we calculate the average ratio of document length (including stopwords) to BOW length (excluding stopwords) on a separate training set of $N$ documents, $r = \frac{1}{N} \sum_{i=1}^{N} \frac{n_i}{\mathbf{1}^\top \mathbf{c}_i}$, where each document $d_i$ has length $n_i$ and a stopwords-removed BOW $\mathbf{c}_i$, and $\mathbf{1}$ is the all-one vector. Our document-length estimator, given a BOW $\mathbf{c}$ with unknown $d$ and $n$, simply scales the BOW length by $r$: $\hat{n} = r \cdot \mathbf{1}^\top \mathbf{c}$.
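A minimal sketch of this estimator follows; the function names, the tiny training set, and the rounding to an integer length are illustrative assumptions rather than details of our experiments:

\begin{verbatim}
from collections import Counter

def length_ratio(training_docs, stopwords):
    """Average ratio of full document length (with stopwords) to
    stopwords-removed BOW length, over a set of tokenized documents."""
    ratios = []
    for tokens in training_docs:
        bow_len = sum(1 for w in tokens if w not in stopwords)
        if bow_len > 0:
            ratios.append(len(tokens) / bow_len)
    return sum(ratios) / len(ratios)

def estimate_length(bow_counts, ratio):
    """Scale the BOW length (sum of counts) by the trained ratio."""
    return round(ratio * sum(bow_counts.values()))

# Example usage with an assumed two-document training set.
stopwords = {"the", "and", "in", "a", "of"}
train = [["the", "cat", "sat", "in", "the", "hat"], ["a", "dog", "ran"]]
r = length_ratio(train, stopwords)          # (6/3 + 3/2) / 2 = 1.75
print(estimate_length(Counter({"cat": 1, "sat": 1, "mat": 1}), r))  # 5
\end{verbatim}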
To find the best document $\hat{d}$, we perform A$^*$ search as before, with one difference: successor states are generated by appending words drawn from the remaining BOW (without replacement), as before, but also from the stopwords list (with replacement). The heuristic $h$ is still admissible, although it is farther from the true future path score than before, since $h$ does not sum over stopwords' scores.
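A sketch of the modified successor generation might look as follows; the names \texttt{successors} and \texttt{remaining} and the stopword set are again illustrative, and scoring and the heuristic are omitted:

\begin{verbatim}
from collections import Counter

STOPWORDS = {"the", "and", "in", "a", "of"}  # illustrative stopword set

def successors(prefix, remaining):
    """Generate successor states for A* search over partial documents.

    Non-stopwords are drawn from the remaining BOW without replacement
    (their counts are decremented); stopwords are drawn from the
    stopword list with replacement (the remaining BOW is unchanged)."""
    for word, count in remaining.items():
        if count > 0:
            next_remaining = remaining.copy()
            next_remaining[word] -= 1
            yield prefix + [word], next_remaining
    for word in STOPWORDS:
        yield prefix + [word], remaining

# Example: expand the empty prefix for a small BOW.
for state, rem in successors([], Counter({"cat": 1, "sat": 1})):
    print(state, dict(rem))
\end{verbatim}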