We have formulated the document recovery problem, proposed several initial approaches to solving it, and produced a benchmark test dataset, which we plan to share with the research community after review. Our results improve on the greedy baseline under all conditions. Most importantly, we have shown that if the original documents are short or the index is a bigram count vector, documents can be recovered from their indices with high BLEU4 scores. This has important implications for security and privacy, since an index may reveal the document from which it was built.
In future research, we would like to explore syntactic constraints, both to limit the feasible set and to improve our score function. We would also like to address other types of indices, such as TF-IDF vectors or bag-of-words (BOW) vectors with stemming.
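
To make the index types named above concrete, the following minimal sketch builds a BOW index, a bigram count index, and a TF-IDF index for two toy documents. It is an illustration only and not the implementation used in our experiments: the function names (`tokenize`, `bow_index`, `bigram_index`, `tfidf_index`), the whitespace tokenizer, and the TF-IDF weighting (raw term frequency times log(N / df)) are all assumptions chosen for brevity.

```python
from collections import Counter
import math


def tokenize(text):
    # Assumption: lowercase whitespace tokenization, chosen only for brevity.
    return text.lower().split()


def bow_index(tokens):
    # Bag-of-words index: unigram counts, word order discarded.
    return Counter(tokens)


def bigram_index(tokens):
    # Bigram count index: counts of adjacent token pairs.
    return Counter(zip(tokens, tokens[1:]))


def tfidf_index(tokens, corpus):
    # Assumption: one common TF-IDF variant, tf * log(N / df),
    # where corpus is a list of token lists and df is document frequency.
    n_docs = len(corpus)
    weights = {}
    for term, count in Counter(tokens).items():
        df = sum(1 for doc in corpus if term in doc)
        if df:
            weights[term] = count * math.log(n_docs / df)
    return weights


doc_a = tokenize("the dog bit the man")
doc_b = tokenize("the man bit the dog")
corpus = [doc_a, doc_b, tokenize("a cat sat")]

same_bow = bow_index(doc_a) == bow_index(doc_b)
same_bigrams = bigram_index(doc_a) == bigram_index(doc_b)
same_tfidf = tfidf_index(doc_a, corpus) == tfidf_index(doc_b, corpus)
```

In this toy case the two documents share a BOW index and a TF-IDF index but differ in their bigram counts, which illustrates why a bigram count vector carries more word-order information than a BOW index does.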