
Recovery from indicator BOWs

In order for a document $\mathbf{d}$ to be in the feasible set $F(\mathbf{x})$ of an indicator BOW $\mathbf{x}$, each word $v$ in the vocabulary with $x_v = 1$ must occur at least once in $\mathbf{d}$, and each word $v$ with $x_v = 0$ must not occur in $\mathbf{d}$ at all. As before, the document length $n$ is unknown, so we must infer it.
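Concretely, membership in $F(\mathbf{x})$ can be checked as in the following sketch; the dictionary representation of $\mathbf{x}$ and the token-list representation of $\mathbf{d}$ are illustrative assumptions, not the paper's data structures.

```python
from collections import Counter

def in_feasible_set(doc, x):
    """True iff the document `doc` (a list of tokens over the vocabulary)
    lies in F(x) for the indicator BOW `x` (word type -> 0 or 1)."""
    counts = Counter(doc)
    # Each word with x_v = 1 must occur at least once;
    # each word with x_v = 0 must not occur at all.
    return all((counts[v] >= 1) == (x_v == 1) for v, x_v in x.items())
```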

We use support vector machine regression to infer the document length $\hat n$ from the indicator BOW $\mathbf{x}$. We train a function $f(\mathbf{x}) = n$ on a training set $D$ of documents, where each document $\mathbf{d}$ has length $n$ and indicator BOW $\mathbf{x}$. However, $\mathbf{x}$ itself is unsuitable as a feature vector: it has too many dimensions, so an estimator that uses $\mathbf{x}$ directly will suffer from data sparseness. Instead we create a ``count-count'' histogram $\mathbf{h} \in \mathbb{Z}_+^{11}$, where $h_j$ is the number of word types $v$ with $x_v = 1$ such that $c_{\mathrm{google}}(v)$, the Google 1T 5-gram corpus count of $v$, has $j$ digits (i.e., $10^{j-1} \leq c_{\mathrm{google}}(v) < 10^j$). The histogram $\mathbf{h}$ has 11 dimensions because no count in the Google corpus has more than 11 digits. For example, let $\mathbf{x} = (1, 0, 1, 1, 0, 1, 0, \ldots)$, where the four nonzero word types are ``aardvark'', ``an'', ``apple'', and ``ate'', with Google counts of 93030, 1369286818, 6878789, and 4763100, respectively. Then $\mathbf{h} = (0, 0, 0, 0, 1, 0, 2, 0, 0, 1, 0)$, since the count of ``aardvark'' has five digits, the counts of ``apple'' and ``ate'' have seven digits, and the count of ``an'' has ten digits. We use $\mathbf{h}$ together with the BOW length $\mathbf{1}^\top\mathbf{x}$ as our feature vector, and we use SVM$^{light}$ [Joachims1998] to train a support vector machine regression model $\hat f(\mathbf{h}, \mathbf{1}^\top\mathbf{x})$ with a linear kernel. To predict the document length given a new indicator BOW $\mathbf{x}$, we take the maximum of the regression prediction and the BOW length:

$\displaystyle \hat n = \max\bigl(\mathrm{round}\bigl(\hat f(\mathbf{h}, \mathbf{1}^\top \mathbf{x})\bigr),\; \mathbf{1}^\top \mathbf{x}\bigr).$
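The following sketch builds the count-count histogram and feature vector, and applies the max/round rule above, reproducing the worked example. The paper trains with SVM$^{light}$; here scikit-learn's LinearSVR stands in as an assumed substitute, and all container names are illustrative.

```python
from sklearn.svm import LinearSVR

def count_count_histogram(x, google_counts):
    """11-dimensional histogram h: h_j counts word types v with x_v = 1
    whose Google 1T 5-gram count c_google(v) has j digits."""
    h = [0] * 11
    for v, x_v in x.items():
        if x_v == 1:
            h[len(str(google_counts[v])) - 1] += 1
    return h

def features(x, google_counts):
    # Feature vector: h together with the BOW length 1^T x.
    return count_count_histogram(x, google_counts) + [sum(x.values())]

# Worked example from the text.
counts = {"aardvark": 93030, "an": 1369286818, "apple": 6878789, "ate": 4763100}
x = {v: 1 for v in counts}
assert count_count_histogram(x, counts) == [0, 0, 0, 0, 1, 0, 2, 0, 0, 1, 0]

def train_length_model(D, google_counts):
    """D is a list of (indicator BOW, true length n) pairs (illustrative)."""
    X = [features(bow, google_counts) for bow, _ in D]
    y = [n for _, n in D]
    return LinearSVR().fit(X, y)  # linear-kernel SVM regression stand-in

def predict_length(model, x, google_counts):
    f_hat = model.predict([features(x, google_counts)])[0]
    # n_hat = max(round(f_hat), 1^T x): a document must be at least as long
    # as the number of word types it is required to contain.
    return max(int(round(f_hat)), sum(x.values()))
```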

To find the best document $\mathbf{d}^* \in F(\mathbf{x})$, we perform A$^*$ search as for count BOWs, but when generating successor states we append words drawn from the remaining BOW with replacement rather than without replacement. Additionally, we require that each word type in the BOW be used at least once in the complete document. The heuristic $h^{\mathrm{adm}}$ must be changed slightly, since our BOW now contains 1s and 0s rather than counts: we require the summation in equation 1 to range over a remaining BOW $\mathbf{x}\backslash w_1^l$ in which each entry is 1 if $x_v = 1$ and $v \not\in w_1^l$, and 0 otherwise. Since word types that will be reused in the future partial path are not accounted for in this version of $h^{\mathrm{adm}}$, it lies farther from the true future path score than $h^{\mathrm{adm}}$ does in the case of a count BOW. Nevertheless, this $h^{\mathrm{adm}}$ is still admissible, since repeated word types only decrease the future path score.
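A minimal sketch of the two modified ingredients, under the same illustrative representations as above: the remaining BOW used by the modified $h^{\mathrm{adm}}$, and successor generation with replacement. The function names are hypothetical.

```python
def remaining_indicator_bow(x, prefix):
    """Remaining BOW (x with the word types of the partial document
    w_1^l removed): entry is 1 iff x_v = 1 and v has not yet appeared
    in `prefix`, and 0 otherwise."""
    seen = set(prefix)
    return {v: int(x_v == 1 and v not in seen) for v, x_v in x.items()}

def successors(x, prefix):
    """Successor states append any word type with x_v = 1, drawing with
    replacement, so a word type may recur in the document."""
    return [prefix + [v] for v, x_v in x.items() if x_v == 1]
```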


Nathanael Fillmore 2008-07-18