In order for a document $d$ to be in the feasible set of a
stopwords-removed count BOW $\mathbf{x}$, each word $w$
in the vocabulary must occur $x_w$ times in $d$, unless $w$
is a stopword. If $w$ is a stopword, it
can occur any number of times in the document. As a result, we also do not know
$n$, the document length. This makes A$^*$ search difficult.
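To make the constraint concrete, the following sketch (ours, not the implementation used here; the whitespace tokenization and small stopword list are assumptions for illustration) builds a stopwords-removed count BOW and tests feasible-set membership:
\begin{verbatim}
from collections import Counter

# Assumed small stopword list and whitespace tokenization, for illustration.
STOPWORDS = {"the", "a", "an", "and", "in", "of", "to"}

def count_bow(doc, stopwords=STOPWORDS):
    """Stopwords-removed count BOW: word -> number of occurrences in doc."""
    return Counter(w for w in doc if w not in stopwords)

def in_feasible_set(doc, bow, stopwords=STOPWORDS):
    """A document is feasible iff its non-stopword counts match bow exactly;
    stopwords may appear any number of times, so the length is unconstrained."""
    return count_bow(doc, stopwords) == bow

bow = count_bow("the cat sat on the mat".split())
print(in_feasible_set("on a mat the cat sat".split(), bow))  # True: stopwords differ freely
print(in_feasible_set("the cat sat".split(), bow))           # False: missing 'on', 'mat'
\end{verbatim}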
For example, a short document that only uses words in $\mathbf{x}$ may have a
higher language model score than a longer document that also uses stopwords.
Conversely, a long document that repeats a high-probability phrase of stopwords
like ``and in the'' may have a higher per-word score than a moderate-length
document that uses fewer stopwords. In fact, per-word score can increase
monotonically with partial document length. To simplify the problem and reduce
the size of the feasible set, we infer document length from $\mathbf{x}$ and limit
the feasible set to documents of that length.
To construct an estimator for document length $n$, given a stopwords-removed
count BOW, we calculate the average ratio of document length (including
stopwords) to BOW length (excluding stopwords),
$r = \frac{1}{|\mathcal{D}|} \sum_{d \in \mathcal{D}} \frac{n_d}{\mathbf{1}^\top \mathbf{x}_d}$,
on a separate training set $\mathcal{D}$
of documents, where each document $d \in \mathcal{D}$ has
length $n_d$ and a stopwords-removed BOW $\mathbf{x}_d$, and $\mathbf{1}$ is the
all-one vector. Our document-length estimator, given a BOW $\mathbf{x}$ with unknown
document $d$ and length $n$, simply scales BOW length by $r$:
$\hat{n}(\mathbf{x}) = r \cdot \mathbf{1}^\top \mathbf{x}$.
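A minimal sketch of this estimator follows, reusing the stopword-set convention of the earlier sketch; the per-document averaging mirrors the formula above, while rounding the scaled BOW length to an integer is our own assumption:
\begin{verbatim}
def fit_length_ratio(training_docs, stopwords):
    """r: average over training documents of
    (document length incl. stopwords) / (BOW length excl. stopwords)."""
    ratios = [
        len(doc) / sum(1 for w in doc if w not in stopwords)
        for doc in training_docs
        if any(w not in stopwords for w in doc)  # skip all-stopword documents
    ]
    return sum(ratios) / len(ratios)

def estimate_length(bow, r):
    """n-hat: BOW length (sum of counts) scaled by r, rounded to an integer
    (rounding is an assumption for illustration)."""
    return round(r * sum(bow.values()))
\end{verbatim}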
To find the best document $\hat{d}$, we perform A$^*$ search as
before, with one difference: successor states are generated by appending words
drawn from the remaining BOW (without replacement), as before, but also from
the stopwords list (with replacement). The heuristic $h$ is still
admissible, although it is farther from the true future path score than before,
since $h$ does not sum over stopwords' scores.
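The modified successor function might look like the following sketch, assuming a search state is a pair of (partial document, remaining BOW); the state representation and bookkeeping details are our assumptions rather than the actual implementation:
\begin{verbatim}
def successors(partial_doc, remaining_bow, stopwords):
    """Expand a state (partial_doc, remaining_bow) by one word: either a word
    from the remaining BOW (without replacement) or a stopword (with
    replacement, so the BOW is left untouched)."""
    for w, count in remaining_bow.items():
        new_bow = dict(remaining_bow)
        if count == 1:
            del new_bow[w]
        else:
            new_bow[w] = count - 1
        yield partial_doc + [w], new_bow
    for w in stopwords:
        yield partial_doc + [w], remaining_bow
\end{verbatim}
Since the feasible set is limited to documents of the estimated length, only partial documents of length up to $\hat{n}$ need to be expanded.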