A search

Next: An admissible heuristic Up: Recovery from count BOWs Previous: Recovery from count BOWs

A search

For count BOWs, the feasible set $F(\mathbf{x})$ is the set of permutations on words in $\mathbf{x}$ . This set will be large for indices of even moderate size, so it is impractical to enumerate documents $\mathbf{d}\in F(\mathbf{x})$ . Instead, we approximately find the best document using memory-bounded A search [Russell and Norvig2002]. A search finds the minimal-cost path between a start and goal in a search space. It requires a measure of the cost from the start state to a particular state, and a heuristic estimate of the cost from the state to the goal state. If the estimate always underestimates the true cost to the goal, is called an admissible heuristic, and A search is guaranteed to find the minimal-cost path from start to goal.

To find $\mathbf{d}^*$ using A search, let each state in the search space represent a partial path , where is the length of the partial path. The start state consists of the start-of-sentence symbol $\mathtt{<}\mathrm{S}\mathtt{>}$ . The ``real'' goal state is the original document, but since this original document is unknown, the first complete path of length that is found is accepted as the goal. In the absence of search error, this will be the most likely length- path.

For our and , it is convenient to use score instead of cost; A search then maximizes path score instead minimizing path cost. The partial path score of a partial document is the log probability of the partial document under a language model: $g(w_1^l) \equiv \log p(w_1^l) = \sum_{i=1}^l \log p(w_i\vert w_1^{i-1})$ . We use the Google Web 1T 5-gram corpus and a 5-gram language model with stupid backoff [Brants
$\bgroupet al.\end{tex2html_bgroup}$
2007] to approximate the probabilities:

$\displaystyle p(w_i\vert w_{i-k+1}^{i-1}) = <tex2html_comment_mark>17 \begin{... ...) > 0 \ \alpha p(w_i\vert w_{i-k+2}^{i-1}) & \mathrm{otherwise} \end{cases}$

, and $p(w_i) = c(w_i)/\sum_{v\in \mathrm{vocab}} c(v)$ if

, where $c(\cdot)$ is the n-gram count in the corpus.

We present two heuristics for below: one that is admissible (i.e., always overestimates the score of the future path from the current state to the goal), and one that is not admissible but empirically performs better.

Next: An admissible heuristic Up: Recovery from count BOWs Previous: Recovery from count BOWs

Nathanael Fillmore 2008-07-18