Document recovery is a difficult problem, and results are sensitive to the
length of the documents. In order to compare our methods more effectively, we
created synthetic documents of varying lengths from each original document
and tested on these synthetic documents. Each synthetic document consists of
a fixed number of contiguous real sentences from one original document,
starting at a random sentence in the original. If the original document had
fewer sentences than that number, we took the entire document; for
, this occurred 65 out of 100
times.
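The sampling procedure can be sketched as follows. This is a minimal illustration rather than the code used in the experiments: the function name `make_synthetic_document` and the parameter `num_sentences` are hypothetical, and we assume the random start is chosen so that a full window of sentences always fits, which the description above does not specify.

```python
import random


def make_synthetic_document(sentences, num_sentences, rng=random):
    """Return up to num_sentences contiguous sentences from one document.

    sentences: list of sentence strings from a single original document.
    num_sentences: requested length of the synthetic document (hypothetical name).
    rng: the random module or a random.Random instance, for reproducibility.
    """
    if len(sentences) <= num_sentences:
        # The original is too short (or exactly long enough): take all of it.
        return list(sentences)
    # Assumption: pick a start index such that a full window fits.
    start = rng.randrange(len(sentences) - num_sentences + 1)
    return sentences[start:start + num_sentences]
```

Passing a seeded `random.Random` instance as `rng` makes the set of synthetic documents reproducible across runs.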
To evaluate the results of document recovery on each set of synthetic
documents, we use BLEU-4 [Papineni et al. 2001], comparing the recovered documents
to the synthetic documents. All reported BLEU scores are averages within a set
of synthetic documents for a single document length and domain.
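For concreteness, the evaluation can be sketched as below. This is an illustrative, unsmoothed single-reference BLEU-4 (geometric mean of clipped 1- to 4-gram precisions, multiplied by the brevity penalty), not necessarily the implementation used for the reported numbers. The function names `bleu4` and `mean_bleu4` are hypothetical, and we assume documents are already tokenized into lists of words.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu4(candidate, reference):
    """Unsmoothed BLEU-4 of one candidate token list against one reference."""
    if not candidate:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, 5):
        cand = ngrams(candidate, n)
        ref = ngrams(reference, n)
        total = sum(cand.values())
        # Clipped counts: an n-gram is credited at most as many times
        # as it occurs in the reference.
        matched = sum(min(count, ref[gram]) for gram, count in cand.items())
        if total == 0 or matched == 0:
            # Without smoothing, any zero precision gives a score of zero.
            return 0.0
        log_precision_sum += math.log(matched / total)
    # Brevity penalty: penalize candidates shorter than the reference.
    if len(candidate) > len(reference):
        penalty = 1.0
    else:
        penalty = math.exp(1.0 - len(reference) / len(candidate))
    return penalty * math.exp(log_precision_sum / 4.0)


def mean_bleu4(pairs):
    """Average BLEU-4 over (recovered, synthetic) token-list pairs."""
    scores = [bleu4(recovered, synthetic) for recovered, synthetic in pairs]
    return sum(scores) / len(scores)
```

Here each pair holds a recovered document and the synthetic document it was recovered from, and `mean_bleu4` corresponds to the per-set average described above.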
In all our experiments, we set
and
. We use the
small stopword list in [Manning and Schütze 1999], which has 114 word types,
57 uppercase and 57 lowercase.
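A stopword set with both case variants can be built as in the sketch below. We assume here that "uppercase" means the capitalized form of each lowercase entry (for example "The" for "the"), which the text does not state explicitly. The function name `build_stopword_set` is hypothetical, and the three words in the usage example are placeholders rather than the actual list.

```python
def build_stopword_set(lowercase_words):
    """Return a set containing each word and its capitalized variant."""
    stopwords = set()
    for word in lowercase_words:
        stopwords.add(word)
        # Assumption: the "uppercase" variant is the capitalized form.
        stopwords.add(word.capitalize())
    return stopwords


# Placeholder entries only; not the list from Manning and Schütze.
example_stopwords = build_stopword_set(["the", "of", "and"])
```

Under this assumption, a list of 57 lowercase entries that are all distinct from their capitalized forms yields 114 word types.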