Document recovery is a difficult problem, and results are sensitive to document length. To compare our methods more effectively, we created synthetic documents of varying lengths from each original document and tested on them. Each synthetic document consists of contiguous real sentences from one original document, starting at a random sentence in the original. If the original document had fewer than sentences, we took the entire document; for , this occurred 65 out of 100 times.
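The sampling procedure described above can be sketched as follows. This is a minimal illustration, not the code used in our experiments: the function name `make_synthetic_document` and the parameter `length` (the target number of sentences) are hypothetical, and the sketch assumes the start position is drawn uniformly so that the window always fits inside the original document, a detail the description above does not specify.

```python
import random


def make_synthetic_document(sentences, length, rng=random):
    """Return `length` contiguous sentences taken from `sentences`.

    `sentences` is the original document as a list of sentence strings.
    If the original has fewer than `length` sentences, the entire
    document is returned, as described in the text.
    """
    if len(sentences) < length:
        return list(sentences)
    # Assumption: choose the start so the window never runs past the end.
    start = rng.randrange(len(sentences) - length + 1)
    return sentences[start:start + length]
```

Passing a seeded `random.Random` instance as `rng` makes the sampled sets reproducible across runs.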
To evaluate the results of document recovery on each set of synthetic
documents, we use BLEU-4 [Papineni et al. 2001], comparing the recovered
documents to the synthetic documents. All reported BLEU scores are averages
within a set of synthetic documents for a single and domain.
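For concreteness, the averaging step can be sketched as below. This is a simplified illustration under stated assumptions, not our evaluation script: `bleu4` is an unsmoothed, single-reference BLEU-4 (clipped n-gram precisions for n = 1 to 4, geometric mean, brevity penalty), each document is assumed to be scored as one token sequence, and the names `ngrams`, `bleu4`, and `mean_bleu4` are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) occurring in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu4(candidate, reference):
    """Unsmoothed single-reference BLEU-4 over two token lists."""
    if not candidate:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, 5):
        cand_counts = ngrams(candidate, n)
        ref_counts = ngrams(reference, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # no smoothing: any zero precision gives a zero score
        log_precision_sum += math.log(clipped / total)
    # Brevity penalty: penalize candidates no longer than the reference.
    if len(candidate) > len(reference):
        bp = 1.0
    else:
        bp = math.exp(1.0 - len(reference) / len(candidate))
    return bp * math.exp(log_precision_sum / 4)


def mean_bleu4(pairs):
    """Average BLEU-4 over (recovered, synthetic) token-list pairs."""
    scores = [bleu4(recovered, synthetic) for recovered, synthetic in pairs]
    return sum(scores) / len(scores)
```

Here each synthetic document serves as the reference and the recovered document as the candidate; `mean_bleu4` would be called once per set of synthetic documents.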
In all our experiments, we set and . We use the small stopword list in
[Manning and Schütze 1999], which has 114 word types, 57 uppercase and
57 lowercase.