In order for a document $D$ to be in the feasible set of an indicator BOW $b$, each word $w$ in the vocabulary where $b_w = 1$ must occur at least once in $D$, and each word $w$ where $b_w = 0$ must not occur in $D$. As before, the document length $n$ is unknown, so we infer it instead.
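As a concrete illustration of this feasible set, the following sketch tests feasibility, assuming `doc_tokens` is a token list and `bow` maps each vocabulary word to 0 or 1 (both names are hypothetical):

```python
def is_feasible(doc_tokens, bow):
    """A document is feasible for an indicator BOW iff the set of word
    types it uses is exactly the set of indicated types: every w with
    bow[w] == 1 occurs at least once, and no w with bow[w] == 0 occurs."""
    indicated = {w for w, v in bow.items() if v == 1}
    return set(doc_tokens) == indicated
```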
We use support vector machine regression to infer document length from our indicator BOW $b$. We train a function $g(b) \approx n$ on a training set of documents, where each document has length $n$ and an indicator BOW $b$. However, $b$ itself is unsuitable as a feature vector: it has too many dimensions ($|V|$), so our estimator will suffer from data sparseness if it uses $b$ directly. Instead we create a ``count-count'' histogram $x = (x_1, \ldots, x_{11})$, where $x_i$ is the number of word types $w$ with $b_w = 1$ such that $c_w$, the Google 1T 5-gram corpus count of $w$, has $i$ digits ($10^{i-1} \le c_w < 10^i$). The histogram has 11 dimensions because no count in the Google corpus has more than 11 digits. For example, let $b$ be an indicator BOW whose four nonzero word types are ``aardvark'', ``an'', ``apple'', and ``ate'', with Google counts of 93030, 1369286818, 6878789, and 4763100, respectively. Then $x = (0, 0, 0, 0, 1, 0, 2, 0, 0, 1, 0)$, since the count of ``aardvark'' has five digits, the counts of ``apple'' and ``ate'' have seven digits, and the count of ``an'' has ten digits. We use $x$ along with the BOW length $\sum_{w \in V} b_w$ as our feature vector. We use SVM$^{light}$ [Joachims1998] to train a support vector machine regression model with a linear kernel. To predict document length given a new indicator BOW $b$, we take the maximum of the prediction and the BOW length: $\hat{n} = \max\left(g(b), \sum_{w \in V} b_w\right)$.
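A minimal sketch of this length estimator follows. It assumes a dict `google_counts` from word type to Google 1T count (a hypothetical name), and substitutes scikit-learn's `LinearSVR` for the SVM$^{light}$ package used above:

```python
from sklearn.svm import LinearSVR  # stand-in for SVM-light, linear kernel

def count_count_histogram(bow, google_counts):
    """11-dimensional count-count histogram: x[i-1] is the number of
    word types w with bow[w] == 1 whose Google count has i digits."""
    x = [0] * 11
    for w, v in bow.items():
        if v == 1:
            x[len(str(google_counts[w])) - 1] += 1
    return x

def features(bow, google_counts):
    # Histogram plus BOW length (the number of indicated word types).
    return count_count_histogram(bow, google_counts) + [sum(bow.values())]

def predict_length(reg, bow, google_counts):
    # Never predict fewer words than the BOW itself demands.
    return max(reg.predict([features(bow, google_counts)])[0],
               sum(bow.values()))

# Training: X = [features(b, google_counts) for b in train_bows],
#           y = true document lengths.
# reg = LinearSVR().fit(X, y)
```

On the four-word example above, `count_count_histogram` returns `[0, 0, 0, 0, 1, 0, 2, 0, 0, 1, 0]`.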
To find the best document $D^*$, we perform A* search as for count BOWs, but when generating successor states, we append words drawn from the remaining BOW with replacement rather than without replacement. Additionally, we require that each word type in the BOW be used at least once in the complete document. The heuristic $h$ must be changed slightly, since our BOW now contains 1s and 0s, not counts: we require the summation in equation 1 to range over a remaining BOW in which each entry is 1 if $b_w = 1$ and $w$ has not yet been used in the partial document, and 0 otherwise. Since word types that will be reused in the future partial path are not accounted for in this version of $h$, this version is farther from the true future path score than in the case of a count BOW. Nevertheless, this $h$ is still admissible, since repeated word types will only decrease the future path score.
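The following sketch shows how successor generation and the modified heuristic differ for indicator BOWs. Since equation 1 is not reproduced here, `best_word_score(w)` is an assumed upper bound on the per-word term summed in that equation:

```python
def successors(partial, bow):
    """Successor states: draw from the BOW *with* replacement, so any
    indicated word type may be appended even if already used."""
    for w, v in bow.items():
        if v == 1:
            yield partial + [w]

def remaining_bow(partial, bow):
    """Remaining BOW for the heuristic: entry 1 for indicated word
    types not yet used in the partial document, 0 otherwise."""
    used = set(partial)
    return {w: 1 for w, v in bow.items() if v == 1 and w not in used}

def heuristic(partial, bow, best_word_score):
    # Sum the per-word bound over the remaining BOW only. Repeated
    # future uses of a type are ignored; since repeats can only lower
    # the future path score, this overestimate keeps h admissible.
    return sum(best_word_score(w) for w in remaining_bow(partial, bow))
```

Under these assumptions, a state is complete once the partial document reaches the predicted length $\hat{n}$ and every indicated word type has been used at least once.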