In order for a document $d$ to be in the feasible set of an indicator BOW $\mathbf{b}$, each word $w$ in the vocabulary with $b_w = 1$ must occur at least once in $d$, and each word $w$ with $b_w = 0$ must not occur in $d$. As before, document length $n$ is unknown, so we infer it instead.
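For concreteness, the feasibility test can be sketched as follows, assuming the indicator BOW is represented as a mapping from word types to $\{0, 1\}$; the function name and data representation here are illustrative choices, not part of the original system.

```python
def is_feasible(document_tokens, indicator_bow):
    """Check whether a tokenized document lies in the feasible set of an
    indicator BOW (a dict mapping each vocabulary word to 0 or 1).
    Representations are assumed for illustration."""
    doc_types = set(document_tokens)
    for word, indicator in indicator_bow.items():
        if indicator == 1 and word not in doc_types:
            return False  # required word type never occurs
        if indicator == 0 and word in doc_types:
            return False  # forbidden word type occurs
    return True
```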
We use support vector machine regression to infer document length from our indicator BOW $\mathbf{b}$. We will train a function $f$ on a training set $T$ of documents, where each document $d_i$ has length $n_i$ and an indicator BOW $\mathbf{b}_i$. However, $\mathbf{b}$ itself is unsuitable as a feature vector: it has too many dimensions, so our estimator will suffer from data sparseness if it uses $\mathbf{b}$ directly. Instead we create a ``count-count'' histogram $\mathbf{g}$, where $g_j$ is the number of word types $w$ in the BOW (i.e., with $b_w = 1$) such that $c(w)$, the Google 1T 5-gram corpus count of $w$, has $j$ digits. The histogram $\mathbf{g}$ has 11 dimensions because no count in the Google corpus has more than 11 digits. For example, let $\mathbf{b}$ be an indicator BOW whose four nonzero word types are ``aardvark'', ``an'', ``apple'', and ``ate'', with Google counts of 93030, 1369286818, 6878789, and 4763100 respectively. Then $\mathbf{g} = (0, 0, 0, 0, 1, 0, 2, 0, 0, 1, 0)$, since the count of ``aardvark'' has five digits, the counts of ``apple'' and ``ate'' have seven digits, and the count of ``an'' has ten digits. We use $\mathbf{g}$ along with the BOW length $|\mathbf{b}| = \sum_w b_w$ as our feature vector. We use SVM$^{light}$ [Joachims1998] to train a support vector machine regression model with a linear kernel.
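A minimal sketch of the feature construction, reproducing the worked example above (the function and variable names are ours; the Google counts come from the text):

```python
def count_count_histogram(bow_types, google_count):
    """Build the 11-dimensional count-count histogram g, where g[j-1]
    counts the word types in the BOW whose Google 1T 5-gram corpus
    count has j digits (no count has more than 11 digits)."""
    g = [0] * 11
    for word in bow_types:
        digits = len(str(google_count[word]))
        g[digits - 1] += 1
    return g

# The example from the text: four nonzero word types and their counts.
counts = {"aardvark": 93030, "an": 1369286818,
          "apple": 6878789, "ate": 4763100}
g = count_count_histogram(counts.keys(), counts)
print(g)                      # [0, 0, 0, 0, 1, 0, 2, 0, 0, 1, 0]
features = g + [len(counts)]  # histogram plus BOW length |b|
```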
To predict document length given a new indicator BOW $\mathbf{b}$, we take the maximum of the regression prediction and the BOW length:
\[ \hat{n} = \max\bigl(f(\mathbf{g}, |\mathbf{b}|),\; |\mathbf{b}|\bigr) \]
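In code, this clamp is a one-liner; here `f` stands for the trained regressor, treated as an assumed callable on the feature vector:

```python
def predict_length(f, g, bow_size):
    """Clamp the regression estimate so the predicted length is never
    smaller than the number of required word types (the BOW length).
    Rounding the result to an integer is an implementation choice."""
    return max(f(g + [bow_size]), bow_size)
```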
To find the best document $d$, we perform A$^*$ search as for count BOWs, but when generating successor states, we append words drawn from the remaining BOW with replacement rather than without replacement. Additionally, we require that each word type in the BOW be used at least once in the complete document. The heuristic $h$ must be changed slightly, since our BOW now contains 1s and 0s, not counts: we require the summation in equation 1 to range over a remaining BOW in which each entry is 1 if $b_w = 1$ and $w$ does not occur in the partial document, and 0 otherwise. Since word types that will be reused in the future partial path are not accounted for in this version of $h$, this version is farther from the true future path score than $h$ is in the case of a count BOW. Nevertheless, this $h$ is still admissible, since repeated word types will only decrease the future path score.
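As a sketch, the modified heuristic might be computed as below; `best_score(w)` is a placeholder for the per-word term summed in equation 1 (not shown here), and all names are ours rather than the original system's.

```python
def indicator_heuristic(indicator_bow, partial_tokens, best_score):
    """Admissible heuristic for the indicator-BOW A* search: sum the
    best-case (log-probability, hence nonpositive) score best_score(w)
    of each word type that is required (b_w = 1) but absent from the
    partial document. Word types that will be repeated later are
    ignored, which only raises the estimate above the true future
    path score, so admissibility is preserved."""
    used = set(partial_tokens)
    remaining = [w for w, b in indicator_bow.items()
                 if b == 1 and w not in used]
    return sum(best_score(w) for w in remaining)
```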