For count BOWs, the feasible set
is the set of permutations on
words in
. This set will be large for indices of even
moderate size, so it is impractical to enumerate documents
. Instead, we approximately find the best document using memory-bounded
A
search [Russell and Norvig2002]. A
search finds the minimal-cost
path between a start and goal in a search space. It requires a measure
of
the cost from the start state to a particular state, and a heuristic estimate
of the cost from the state to the goal state. If the estimate
always
underestimates the true cost to the goal,
is called an admissible heuristic,
and A
search is guaranteed to find the minimal-cost path from start to
goal.
To find
using A
search, let each state in the search space
represent a partial path
, where
is the length of the partial path.
The start state consists of the start-of-sentence symbol
. The ``real''
goal state is the original document, but since this original document is
unknown, the first complete path of length
that is found is accepted as the
goal. In the absence of search error, this will be the most likely length-
path.
For our and
, it is convenient to use score instead of
cost; A
search then maximizes path score instead minimizing path
cost. The partial path score
of a partial document is the log probability
of the partial document under a language model:
.
We use the Google Web 1T 5-gram corpus and a 5-gram language model with stupid
backoff [Brants
2007] to approximate the probabilities:
![]() |
We present two heuristics for below: one that is admissible (i.e., always
overestimates the score of the future path from the current state to the goal),
and one that is not admissible but empirically performs better.