The feasible set of a bigram count vector is highly constrained. Document
length can be recovered exactly:
, since
each word token in the original document except the last token is counted in
the first position of one bigram. Additionally, the bigram
must
occur
times in
for all
.
We use these constraints to make the problem easier to solve. To motivate our
discussion, suppose that a bigram count vector
has count 1 for each of
``the dog'', ``dog runs'', ``runs quickly'', and ``runs slowly''.
There is ambiguity: we do not know whether ``the dog runs
quickly'' or ``the dog runs slowly'' occurred in the original document. We do
still know that ``the dog runs'' occurred.
We call document fragments that are unambiguously determined by the bigram
count vector sticks. Before we start our search procedure proper, we
construct a stick for each bigram
such that
. We start with
. If we can
extend leftward without ambiguity, that is, if there exists a
such
that (a)
, (b) the bigram
will occur
no more than
times in the new stick, and (c)
for all
,
,
then we extend
. Similarly, if we can extend
rightward without ambiguity, we extend
. A
single stick can be extended both leftward and rightward. We repeatedly extend
each stick until an ambiguity is reached. We extend rightward as far as
possible before extending leftward.
Once we have our sticks, we search for the best document
using A
search without heuristic. We extend each partial path by
appending an unused stick to the partial path, and ignoring the beginning of
the stick if it overlaps with the end of the partial path.