Introduction

With the proliferation of text processing software such as desktop search engines, database systems, and spam filters, computer users' personal files are increasingly being indexed to enable fast searches and processing. With great popularity comes greater risk, though: few users know where the index files are maintained and what security measures are used to protect them. In this paper, we seek to answer the following question: if an adversary obtains a document's index, what can it learn about the original document?

We define the problem of document recovery from an index as follows: given only a document's bag-of-words (BOW) vector or other index, reconstruct the original ordered document. Our goal is to understand the extent to which recovery is possible under standard computational means using large-scale language resources (i.e., the Google Web 1T 5-gram corpus).
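As a concrete illustration (a minimal sketch of our own; the representation details are illustrative, not drawn from any particular indexing package), a BOW index records only how many times each word occurs and discards all ordering information, which is exactly what makes recovery nontrivial:

```python
from collections import Counter

def bag_of_words(document):
    """Reduce an ordered document to its bag-of-words (BOW) index."""
    return Counter(document.lower().split())

doc1 = "the dog bit the man"
doc2 = "the man bit the dog"

# Both documents collapse to the same index, so a recovery system
# must choose among all word orderings consistent with the counts.
print(bag_of_words(doc1) == bag_of_words(doc2))  # True
```

Recovery thus amounts to searching over permutations of the indexed words for the most plausible ordering under a language model.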

The risk to privacy is real. An adversary may hack into a machine and gain access to the indices. Software that maintains such indices includes desktop search engines such as Apache Lucene (lucene.apache.org), some e-mail clients such as sup (sup.rubyforge.org) and cone (www.courier-mta.org/cone), and database management systems like MySQL and PostgreSQL. Even more alarmingly, certain types of text classifiers (e.g., nearest-neighbor classifiers and kernel machines) willingly distribute indices with the trained model to end users (and potentially to the adversary). In this case, the model consists of the prototype training examples or the support vectors, which are documents in index form. The concern is that the training documents might be confidential or private in nature, and people should not assume the model is secure. In fact, the model files produced by both SVM^light (svmlight.joachims.org) and LIBSVM (www.csie.ntu.edu.tw/~cjlin/libsvm) contain the support vectors in a readable ASCII format (even in the case of linear SVMs). Thus, anyone provided with a model file has access to a set of index files.
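To make the exposure concrete, support vectors in such model files are stored as sparse feature:value pairs, one document per line. The following sketch is our own simplified illustration (the exact file layout varies by package, and the id-to-word vocabulary shown is hypothetical); it shows how little effort is needed to decode a stored support vector back into a word-count index:

```python
def parse_support_vector(line, vocabulary):
    """Decode one sparse support-vector line, e.g. '0.5 3:2 17:1',
    into a word -> count index (the leading coefficient is ignored)."""
    _, *features = line.split()
    index = {}
    for feat in features:
        word_id, count = feat.split(":")
        index[vocabulary[int(word_id)]] = int(float(count))
    return index

# Hypothetical vocabulary: real deployments ship or imply an id -> word map.
vocab = {3: "patient", 17: "diagnosis", 42: "cancer"}
print(parse_support_vector("0.5 3:2 17:1 42:1", vocab))
# {'patient': 2, 'diagnosis': 1, 'cancer': 1}
```

Once decoded, each support vector is precisely the kind of BOW index our recovery problem takes as input.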

The approach adopted in this paper is similar to hypothesis rescoring, which has a long history in NLP. Nonetheless, this work provides several novel contributions: (1) a formulation of the problem of document recovery, which we believe is an important area for future NLP research; (2) a benchmark dataset that simulates domains where document recovery could be particularly detrimental (e.g., personal e-mail, government documents, medical records); (3) a comprehensive empirical study using the Google Web 1T 5-gram corpus and the resulting "stupid backoff" language model [Brants et al. 2007]; (4) a specialized A* decoder using novel search heuristics for document recovery from several types of indices.
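Since the decoder itself is described later, only the language model's scoring rule is sketched here. This is our own minimal implementation of stupid backoff with toy counts standing in for the Web 1T corpus: the score of a word given its context is the relative frequency of the longest observed n-gram, with a fixed multiplicative penalty (0.4 in Brants et al.'s formulation) applied at each level of backoff. The result is a score, not a normalized probability, which is what makes the scheme cheap at Web scale.

```python
def stupid_backoff(word, context, ngram_counts, unigram_total, alpha=0.4):
    """Score S(word | context) under stupid backoff: the relative
    frequency of the longest observed n-gram, with a multiplicative
    penalty alpha per backoff step.  Returns a score, not a probability."""
    if not context:
        return ngram_counts.get((word,), 0) / unigram_total
    ngram = context + (word,)
    if ngram_counts.get(ngram, 0) > 0:
        return ngram_counts[ngram] / ngram_counts[context]
    # n-gram unseen: drop the leftmost context word and pay the penalty.
    return alpha * stupid_backoff(word, context[1:], ngram_counts,
                                  unigram_total, alpha)

# Toy counts (all n-gram orders included); 6 tokens total.
counts = {("the",): 2, ("dog",): 1, ("cat",): 1, ("barks",): 2,
          ("the", "dog"): 1, ("the", "cat"): 1,
          ("dog", "barks"): 1, ("cat", "barks"): 1}

print(stupid_backoff("barks", ("dog",), counts, 6))  # bigram seen: 1.0
print(stupid_backoff("barks", ("the",), counts, 6))  # backs off: ~0.133
```

A decoder can then score a candidate ordering by multiplying (or summing the logs of) these per-word scores.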


Nathanael Fillmore 2008-07-18