Next: Experiments Up: Document Recovery from Bag-of-Word Previous: Recovery from indicator BOWs

Recovery from bigram count vectors

The feasible set of a bigram count vector is highly constrained. Document length can be recovered exactly: $n = \mathbf{1}^\top \mathbf{x}+ 1$ , since each word token in the original document except the last is counted in the first position of exactly one bigram. Additionally, each bigram $w_1,w_2$ must occur exactly $x_{w_1,w_2}$ times in $\mathbf{d}$ for all $\mathbf{d}\in F(\mathbf{x})$ .
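The two constraints can be illustrated with a minimal sketch. The function names `bigram_counts` and `recovered_length` are hypothetical, and the bigram count vector $\mathbf{x}$ is assumed to be stored sparsely as a mapping from word pairs to counts; the sketch assumes a document with at least one token.

```python
from collections import Counter

def bigram_counts(tokens):
    """Sparse bigram count vector x: maps (w1, w2) to its number of occurrences."""
    return Counter(zip(tokens, tokens[1:]))

def recovered_length(x):
    """Document length n = 1^T x + 1 (valid for documents with at least one token)."""
    return sum(x.values()) + 1

# Illustrative document (not from the original experiments).
doc = "the dog runs quickly".split()
x = bigram_counts(doc)
n = recovered_length(x)
```

Every document in $F(\mathbf{x})$ yields the same `x` under `bigram_counts`, and therefore the same `n`.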

We use these constraints to make the problem easier to solve. To motivate our discussion, suppose that a bigram count vector $\mathbf{x}$ has count 1 for each of ``the dog'', ``dog runs'', ``runs quickly'', and ``runs slowly''. There is ambiguity: we do not know whether ``the dog runs quickly'' or ``the dog runs slowly'' occurred in the original document. We do still know that ``the dog runs'' occurred.

We call document fragments that are unambiguously determined by the bigram count vector sticks. Before we start our search procedure proper, we construct a stick $t_i$ for each bigram $w_{i_1},w_{i_2}$ such that $x_{w_{i_1},w_{i_2}} > 0$ . We start with $t_i = w_{i_1},w_{i_2}$ . We can extend leftward without ambiguity if there exists a $w_{i_3}$ such that (a) $x_{w_{i_3},w_{i_1}} > 0$ , (b) the bigram $w_{i_3},w_{i_1}$ will occur no more than $x_{w_{i_3},w_{i_1}}$ times in the new stick, and (c) $x_{w_{i_4},w_{i_1}} = 0$ for all $w_{i_4} \in \mathrm{vocab}$ with $w_{i_4} \neq w_{i_3}$ ; in that case we extend $t_i = w_{i_3},w_{i_1},w_{i_2}$ . Similarly, if we can extend rightward without ambiguity, where $w_{i_3}$ is now the only word that can follow $w_{i_2}$ , we extend $t_i = w_{i_1},w_{i_2},w_{i_3}$ . A single stick can be extended both leftward and rightward. We repeatedly extend each stick until an ambiguity is reached, extending rightward as far as possible before extending leftward.
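The stick construction can be sketched as follows. This is one reading of conditions (a) through (c), not a definitive implementation: the name `build_sticks` and the sparse dictionary representation of $\mathbf{x}$ are assumptions, and the rightward rule is taken to be the mirror image of the leftward rule.

```python
from collections import Counter

def build_sticks(x):
    """Return a dict mapping each bigram with positive count to its stick."""
    succ, pred = {}, {}  # words that may follow / precede each word
    for (a, b), count in x.items():
        if count > 0:
            succ.setdefault(a, set()).add(b)
            pred.setdefault(b, set()).add(a)

    sticks = {}
    for seed, count in x.items():
        if count <= 0:
            continue
        stick = list(seed)
        used = Counter([seed])  # bigram occurrences inside the current stick
        # Extend rightward as far as possible first.
        while True:
            candidates = succ.get(stick[-1], set())
            if len(candidates) != 1:          # (a) and (c): exactly one option
                break
            (w,) = candidates
            bigram = (stick[-1], w)
            if used[bigram] + 1 > x[bigram]:  # (b): do not exceed the count
                break
            stick.append(w)
            used[bigram] += 1
        # Then extend leftward.
        while True:
            candidates = pred.get(stick[0], set())
            if len(candidates) != 1:
                break
            (w,) = candidates
            bigram = (w, stick[0])
            if used[bigram] + 1 > x[bigram]:
                break
            stick.insert(0, w)
            used[bigram] += 1
        sticks[seed] = tuple(stick)
    return sticks

# The motivating example: four bigrams, each with count 1.
x = {("the", "dog"): 1, ("dog", "runs"): 1,
     ("runs", "quickly"): 1, ("runs", "slowly"): 1}
sticks = build_sticks(x)
```

Under this reading, the stick seeded by ``the dog'' stops at ``the dog runs'' because two words can follow ``runs'', while the stick seeded by ``runs quickly'' grows leftward to ``the dog runs quickly''. Condition (b) also guarantees termination when the bigrams form a cycle.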

Once we have our sticks, we search for the best document $\mathbf{d}^* \in F(\mathbf{x})$ using A$^*$ search without a heuristic. We extend each partial path by appending an unused stick, ignoring the beginning of the stick if it overlaps with the end of the partial path.
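The search step can be sketched as a best-first search ordered by path cost alone, which is what a search with no heuristic amounts to. Several details here are assumptions rather than the procedure used in the paper: the names `merges`, `is_feasible`, `search`, and `toy_cost` are hypothetical, the cost is assumed never to decrease when a path is extended, stick reuse is limited only by the bigram counts rather than by tracking which sticks are unused, and every valid overlap length is tried. The toy trigram penalties stand in for a real language model score.

```python
import heapq
from collections import Counter

def is_feasible(tokens, x):
    """True if no bigram occurs in tokens more often than x allows."""
    counts = Counter(zip(tokens, tokens[1:]))
    return all(c <= x.get(bg, 0) for bg, c in counts.items())

def merges(path, stick):
    """Yield path extended by stick, for each overlap of path's end with stick's start."""
    for k in range(min(len(path), len(stick)) + 1):
        if k == 0 or path[-k:] == stick[:k]:
            merged = path + stick[k:]
            if len(merged) > len(path):
                yield merged

def search(x, sticks, cost):
    """Return the lowest-cost full-length document reachable by merging sticks."""
    n = sum(x.values()) + 1
    frontier = [(0.0, ())]  # (cost, partial path), cheapest first
    seen = set()
    while frontier:
        _, path = heapq.heappop(frontier)
        if path in seen:
            continue
        seen.add(path)
        if len(path) == n:  # feasible and full length: bigram counts match x
            return path
        for stick in sticks:
            for merged in merges(path, stick):
                if merged not in seen and is_feasible(merged, x):
                    heapq.heappush(frontier, (cost(merged), merged))
    return None

def toy_cost(tokens):
    """Hypothetical cost: sum of trigram penalties (lower is better)."""
    penalty = {("a", "b", "a"): 0.1, ("b", "a", "c"): 0.1, ("a", "c", "a"): 0.1}
    return sum(penalty.get(t, 1.0) for t in zip(tokens, tokens[1:], tokens[2:]))

x = {("a", "b"): 1, ("b", "a"): 1, ("a", "c"): 1, ("c", "a"): 1}
sticks = [("a", "b", "a"), ("a", "c", "a")]
best = search(x, sticks, toy_cost)
```

In this toy instance the two sticks overlap on the shared word, so merging them yields a five-token document whose bigram counts equal `x`. The sketch only reaches documents that can be assembled from whole sticks in this way.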


Nathanael Fillmore 2008-07-18