 
 
 
 
 
   
Document recovery is a difficult problem, and results are sensitive to the
length of the documents. To compare our methods more effectively, we
created synthetic documents of varying lengths from each original document
and tested on these synthetic documents. Each synthetic document consists
of a fixed number of contiguous real sentences from one original document,
starting at a random sentence in the original. If the original document had
fewer sentences than that number, we took the entire document; for  , this
occurred 65 out of 100 times.
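The construction of a synthetic document can be sketched as follows. This is
a minimal illustration, not the code used in the experiments: the function
name make_synthetic and its parameters are hypothetical, the input is assumed
to be already split into sentences, and the start position is assumed to be
drawn uniformly from the positions that leave room for a full window (the
text only says the start is random).

```python
import random


def make_synthetic(sentences, length, rng=random):
    """Return `length` contiguous sentences taken from `sentences`.

    If the document has fewer than `length` sentences, the whole
    document is returned unchanged.
    """
    if len(sentences) < length:
        return list(sentences)
    # Assumption: choose the start uniformly among positions that
    # still allow a full window of `length` sentences.
    start = rng.randrange(len(sentences) - length + 1)
    return list(sentences[start:start + length])
```

Passing a seeded random.Random instance as rng makes the sampled documents
reproducible across runs.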
To evaluate the results of document recovery on each set of synthetic
documents, we use BLEU 4 [Papineni et al. 2001], comparing the recovered
documents to the synthetic documents. All reported BLEU scores are averages
within a set of synthetic documents for a single length and domain.
In all our experiments, we set  and  . We use the
small stopword list in [Manning and Schütze 1999], which has 114 word types,
57 uppercase and 57 lowercase.
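The averaging of BLEU scores within each set can be sketched as follows. This
is an illustrative sketch under stated assumptions: the function name
average_bleu and the record layout (domain, length, score) are hypothetical,
and the per-document BLEU scores are assumed to have been computed
beforehand by some BLEU implementation.

```python
from collections import defaultdict
from statistics import mean


def average_bleu(records):
    """Average per-document BLEU within each (domain, length) set.

    `records` is an iterable of (domain, length, bleu) tuples, one
    per recovered synthetic document.
    """
    groups = defaultdict(list)
    for domain, length, bleu in records:
        groups[(domain, length)].append(bleu)
    # One reported number per set: the mean over its documents.
    return {key: mean(scores) for key, scores in groups.items()}
```

Each key of the returned dictionary corresponds to one set of synthetic
documents, so a reported score is looked up by its domain and length.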