In order for a document $d$ to be in the feasible set of a stopwords-removed count BOW $\mathbf{c}$, each word $w$ in the vocabulary must occur exactly $c_w$ times in $d$, unless $w$ is a stopword. If $w$ is a stopword, it can occur any number of times in the document. As a result, we also do not know $n$, the document length. This makes A$^*$ search difficult. For example, a short document that only uses words in $\mathbf{c}$ may have a higher language model score than a longer document that also uses stopwords. Conversely, a long document that repeats a high-probability phrase of stopwords like ``and in the'' may have a higher per-word score than a moderate-length document that uses fewer stopwords. In fact, per-word score can increase monotonically with partial document length. To simplify the problem and reduce the size of the feasible set, we infer the document length from $\mathbf{c}$ and limit the feasible set to documents of that length.
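As a concrete illustration of the feasibility constraint, the following minimal sketch (the helper name \texttt{is\_feasible} and the tiny \texttt{STOPWORDS} set are illustrative, not part of our method) checks whether a tokenized document lies in the feasible set of a stopwords-removed count BOW:

\begin{verbatim}
from collections import Counter

STOPWORDS = {"the", "and", "in", "a", "of"}  # illustrative stopword set

def is_feasible(document_tokens, bow_counts):
    """A document is feasible for a stopwords-removed count BOW iff every
    non-stopword appears exactly as many times as its BOW count;
    stopwords may appear any number of times."""
    doc_counts = Counter(w for w in document_tokens if w not in STOPWORDS)
    return doc_counts == bow_counts

# Two documents of different lengths are both feasible for the same BOW.
bow = Counter({"cat": 1, "sat": 1, "mat": 1})
print(is_feasible(["cat", "sat", "mat"], bow))                      # True
print(is_feasible(["the", "cat", "sat", "in", "the", "mat"], bow))  # True
\end{verbatim}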
To construct an estimator for the document length $n$, given a stopwords-removed count BOW, we calculate the average ratio of document length (including stopwords) to BOW length (excluding stopwords) on a separate training set of $N$ documents, $r = \frac{1}{N} \sum_{i=1}^{N} \frac{n_i}{\mathbf{1}^\top \mathbf{c}_i}$, where each document $d_i$ has length $n_i$ and a stopwords-removed BOW $\mathbf{c}_i$, and $\mathbf{1}$ is the all-one vector. Our document-length estimator, given a BOW $\mathbf{c}$ with unknown $d$ and $n$, simply scales the BOW length by $r$: $\hat{n} = r \cdot \mathbf{1}^\top \mathbf{c}$.
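A minimal sketch of this estimator follows; the function names, the tiny training set, and the rounding to an integer length are illustrative assumptions rather than details of our experiments:

\begin{verbatim}
from collections import Counter

def length_ratio(training_docs, stopwords):
    """Average ratio of full document length (with stopwords) to
    stopwords-removed BOW length, over a set of tokenized documents."""
    ratios = []
    for tokens in training_docs:
        bow_len = sum(1 for w in tokens if w not in stopwords)
        if bow_len > 0:
            ratios.append(len(tokens) / bow_len)
    return sum(ratios) / len(ratios)

def estimate_length(bow_counts, ratio):
    """Scale the BOW length (sum of counts) by the trained ratio."""
    return round(ratio * sum(bow_counts.values()))

# Example usage with an assumed two-document training set.
stopwords = {"the", "and", "in", "a", "of"}
train = [["the", "cat", "sat", "in", "the", "hat"], ["a", "dog", "ran"]]
r = length_ratio(train, stopwords)          # (6/3 + 3/2) / 2 = 1.75
print(estimate_length(Counter({"cat": 1, "sat": 1, "mat": 1}), r))  # 5
\end{verbatim}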
To find the best document $\hat{d}$, we perform A$^*$ search as before, with one difference: successor states are generated by appending words drawn from the remaining BOW (without replacement), as before, but also from the stopwords list (with replacement). The heuristic $h$ is still admissible, although it is farther from the true future path score than before, since $h$ does not sum over stopwords' scores.
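A sketch of the modified successor generation might look as follows; the names \texttt{successors} and \texttt{remaining} and the stopword set are again illustrative, and scoring and the heuristic are omitted:

\begin{verbatim}
from collections import Counter

STOPWORDS = {"the", "and", "in", "a", "of"}  # illustrative stopword set

def successors(prefix, remaining):
    """Generate successor states for A* search over partial documents.

    Non-stopwords are drawn from the remaining BOW without replacement
    (their counts are decremented); stopwords are drawn from the
    stopword list with replacement (the remaining BOW is unchanged)."""
    for word, count in remaining.items():
        if count > 0:
            next_remaining = remaining.copy()
            next_remaining[word] -= 1
            yield prefix + [word], next_remaining
    for word in STOPWORDS:
        yield prefix + [word], remaining

# Example: expand the empty prefix for a small BOW.
for state, rem in successors([], Counter({"cat": 1, "sat": 1})):
    print(state, dict(rem))
\end{verbatim}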