Motivated by computer privacy issues, we present the novel problem of document recovery from an index: given only a
document's bag-of-words (BOW) vector or another type of index, reconstruct the original
ordered document.
We investigate a variety of index
types, including count-based BOW vectors, stopwords-removed count BOW vectors,
indicator BOW vectors, and bigram count vectors.
We formulate the problem as hypothesis rescoring
with A* search, using the Google Web 1T 5-gram corpus.
Our experiments on five domains
indicate that if the original documents are short,
they can be recovered with high accuracy.
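
To make the four index types named above concrete, the following minimal sketch builds each of them from a short document. It is an illustration only, not the procedure used in this work: the function name `build_indices`, the whitespace tokenization, the tiny `STOPWORDS` set, and the `<d>` / `</d>` boundary markers are all assumptions introduced here.

```python
from collections import Counter

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"the", "a", "an", "of", "on", "and", "to", "in", "is"}


def build_indices(text):
    """Build four order-free indices from one document (illustrative sketch)."""
    # Assumed tokenization: lowercase, split on whitespace.
    tokens = text.lower().split()

    # Count-based BOW: word -> number of occurrences.
    count_bow = Counter(tokens)

    # Stopwords-removed count BOW: same counts, stopwords dropped.
    stopless_bow = Counter(t for t in tokens if t not in STOPWORDS)

    # Indicator BOW: only which words occur, not how often.
    indicator_bow = set(tokens)

    # Bigram counts: adjacent word pairs, with assumed boundary markers.
    padded = ["<d>"] + tokens + ["</d>"]
    bigram_counts = Counter(zip(padded, padded[1:]))

    return {
        "count": count_bow,
        "stopless": stopless_bow,
        "indicator": indicator_bow,
        "bigram": bigram_counts,
    }


indices = build_indices("the cat sat on the mat")
```

Each index discards progressively different information: the indicator vector loses word frequencies, the stopwords-removed vector loses function words entirely, and the bigram counts retain local ordering that the other three do not.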