%
% File acl08.tex
%
% Contact: nasmith@cs.cmu.edu

\documentclass[11pt]{article}
\usepackage{acl08}
\usepackage{times}
\usepackage{latexsym}

\usepackage{amsmath,amssymb}
\usepackage{graphicx}
\usepackage{xspace}
\usepackage{url}
\urlstyle{same}

\usepackage{orange}

\setlength\titlebox{6.5cm}    % Expanding the titlebox

\newcommand{\Astar}{\mbox{A$^*$}\xspace}
\newcommand{\Sstar}{\mbox{S$^*$}\xspace}
\newcommand{\argmax}{\mathrm{arg}\,\max}
\newcommand{\argmin}{\mathrm{arg}\,\min}
\newcommand{\rembow}{\mathrm{Rem.~BOW}}
\newcommand{\bow}{\mathrm{BOW}}
\newcommand{\perm}{\mathrm{perm}}
\newcommand{\stopwords}{\mathrm{stopwords}}
\newcommand{\Sbeg}{\mathtt{<}\mathrm{S}\mathtt{>}}
\newcommand{\Send}{\mathtt{<}\mathtt{/}\mathrm{S}\mathtt{>}}
\newcommand{\Dbeg}{\mathtt{<}\mathrm{D}\mathtt{>}}
\newcommand{\Dend}{\mathtt{<}\mathtt{/}\mathrm{D}\mathtt{>}}

\title{Document Recovery from Bag-of-Word Indices}

%\author{Joakim Nivre \\
%  School of Mathematics and Systems Engineering \\
%  V\"{a}xj\"{o} University \\
%  SE-35195, V\"{a}xj\"{o}, Sweden \\
%  {\tt nivre@msi.vxu.se} \And
%  Noah A. Smith \\
%  Language Technologies Institute \\
%  Carnegie Mellon University \\
%  Pittsburgh, PA 15213, USA\\
%  {\tt nasmith@cs.cmu.edu}}

\date{}

\begin{document}
\maketitle
\begin{abstract}
\end{abstract}

\section{Introduction}

With the proliferation of text processing software such as Google Desktop, databases storing full-text indices, and spam filters, 
computer users' personal files are increasingly being indexed and stored in inverted index files or other representations (e.g., term frequency vectors) to enable fast searches and other operations.
With great popularity comes great responsibility, though: few users know where the index files are maintained and what security measures are used to protect them.
In this paper, we seek to answer the following question: If an index file were to end up in a malicious party's hands, what could he or she learn about your original documents?

We are interested in the problem of \emph{document recovery from index files}.
Informally, we define this as follows: given an index or collection of term frequency or bag-of-words (BOW) vectors, the task is to reconstruct the original \emph{ordered} text documents. 
Our goal is to understand this problem in more detail and discover what steps must be taken to protect individuals' privacy in the face of today's text processing needs.

We divide the sources of potential privacy risks into two categories.
In the first category, index files are not actively distributed to end users.
A third party must actively hack into a machine to access the indices,\footnote{Throughout this paper, we assume that encrypted index files could be decrypted using standard techniques.} 
at which point he or she may be able to discover the full contents of the original documents, even if word order was eliminated in the indexing process.
Specific examples of software in this first category are desktop search engines like Google Desktop and Mac OS X's Spotlight tool, as well as the popular open-source search engine Apache Lucene,\footnote{\protect\url{lucene.apache.org}} which provides the option to store only BOW vectors in its index. 
Also, some e-mail clients, such as \emph{sup}\footnote{\protect\url{sup.rubyforge.org}} and \emph{cone},\footnote{\protect\url{www.courier-mta.org/cone}} store local indices of messages stored on a secure server.
Finally, widely used database management systems like MySQL and PostgreSQL include full-text search capabilities that produce index files and term frequency vectors.
Note that in all these cases, users might assume the indices are less sensitive than the original documents, which may be stored elsewhere.
We hope to show that indices alone provide a powerful tool to access private information.

In the second category of document recovery scenarios, index files or document vectors are actually distributed to end users.
That is, they are \emph{given directly to potential hackers}. 
This is the case for classification tools, where someone has trained a model and wishes to distribute it to others to classify new documents. 
The concern is that the training documents might be confidential or private in nature, and people should not assume the model is secure. 
As a simple example, consider a (na\"ive) server administrator that builds a spam classifier using all e-mails marked as spam by all users, and then distributes the model to users' local mail clients.
Other example applications are news readers that categorize documents based on a previously seen training corpus, or sentiment analysis tools trained on potentially private product reviews. 
For a nearest neighbor classifier, the model is simply the set of training vectors.
More realistically, if the classifier is a support vector machine (SVM), the model contains the ``support vectors,'' which are nothing more than actual training examples close to the decision boundary. 
If one reverse engineers the model, it is possible to recover the (important, support vector) documents that were used to train the model.
In fact, the model files produced by both SVM$^{light}$\footnote{\protect\url{svmlight.joachims.org}} and LIBSVM\footnote{\protect\url{www.csie.ntu.edu.tw/~cjlin/libsvm}} contain the support vectors in a readable ASCII format (even in the case of linear SVMs).
Thus, anyone provided with a model file has access to a potentially large set of training documents in some vector space.
We explore the case of BOW vectors and indicator vectors, as well as vectors with stop words removed.

All of the aforementioned software retains some record of which words appear in which documents, though explicit order information is removed.
Note this is similar (if not identical) to a collection of BOW vectors, where each vector represents a single document and contains counts or indicators of the words present. 
It has recently been shown that using only a large set of BOW vectors, a third party can learn a simple bigram model and recover some simple documents~\cite{Zhu2008Learning}.
In this paper, we explore the use of external language resources like a large n-gram model trained on Web text to attempt to recover ordered documents by decoding BOW vectors.
We present a thorough study of this problem of document recovery to gain insight into safeguarding private documents, as well as novel text compression methods.
Our methods may also be useful for restoring files when an index is available, but the original documents have been inadvertently deleted.

Our basic approach to document recovery is to find the most likely document given the index.
This is implemented as hypothesis rescoring, where the hypotheses are documents that are consistent with the index.
We note that hypothesis rescoring itself is not novel and has been an important component in many previous machine translation and automatic speech recognition systems.
However, our work has several novel contributions:
\begin{itemize}
\item A novel formulation of the problem of document recovery, which we believe is an important area for future NLP research.
\item Several benchmark data sets drawn from domains 
where document recovery could be particularly detrimental (e.g., personal e-mail, government documents, medical records).
\item A comprehensive empirical study using the Google 1T 5-gram corpus and the resulting ``stupid backoff'' language model~\cite{Brants07}.
\item A specialized \Astar decoder using novel search heuristics for document recovery under a variety of reduced document conditions (e.g., BOW vectors, indicator vectors, vectors after stop word removal, bag-of-bigrams vectors).
\end{itemize}

We discuss related work in Section~\ref{sec:related}, then present our formal definition of document recovery and our methods in Section~\ref{sec:methods}.
We introduce our benchmark data sets and present experimental results illustrating the validity of our methods in Section~\ref{sec:experiments}.
Finally, we conclude with discussions in Section~\ref{sec:discussions}.


\section{Related work}
\label{sec:related}
In the NLP domain, there has been a great deal of research on hypothesis rescoring in machine translation and automatic speech recognition.
We only have room to mention a few salient examples here.
In statistical machine translation (SMT), various decoding strategies have been explored~\cite{Knight1999Decoding,Germann2003Greedy}. 
The basic task is to produce the best sequence of words given a set of input words, a translation model, and a language model.
This has been addressed using greedy best-first search without any search heuristics, and has been formulated as a traveling salesman problem (i.e., finding the lowest-cost path that visits all the required words)~\cite{Germann2001Fast}.
Beam search is also commonly used, at the word, phrase, or n-gram level~\cite{Hoang08,Crego2005Ngram}.
Finally, \Astar search, which we use in the current work, has been used in SMT. Och et al.~\cite{och2001} develop several admissible or almost admissible heuristics used for decoding in machine translation. 
Our work differs from all of the above in that we already know (at least some of) the words that must appear in the recovered document. While this may appear to simplify the problem, our plausible search space may in fact be much larger. 
In SMT, the possible permutations of target words are constrained based on the \emph{ordered} source language document~\cite{kanthak05novel,bangalore07}.
We lack such constraints in our task. We also consider other reduced document conditions where we must decide how many additional words to introduce without the guide of a source document.

In another NLP application, Athanaselis et al.~\cite{athanaselis06} search for reorderings of words in the context of automatic grammar checking and finding word order errors. The method uses a large language model (trained on the 100-million-word British National Corpus) to guide a filtering-based search process: the search space is pruned of permutations that contain non-existent bigrams. The goal is to maximize the probability of trigrams in the reordered sentence. Our work differs in that we use a higher-order (up to $n=5$) language model trained on several orders of magnitude more text, and we employ \Astar search to find the optimal (highest overall language model probability) ordered document. Our approach is also designed to work on much longer test documents, not just short sentences (4--12 words). 

At a high level, the current research also bears similarity to data mining and security research that attempts to recover hidden knowledge from published sources.
Staddon et al.~\cite{Staddon07} study what inferences can be made when newly published data is combined with existing information on the Web or other public sources.
We also try to measure what can be learned by combining reduced document representations with large language models, though the methods involved are very different. 
Another task related to document recovery is that of uncovering missing words in declassified documents that have been redacted or sanitized. 
In this case, ordered text documents are provided, but some spans of text have been blacked out, and one must try to predict what words are missing.
This task was first popularized in the press when David Naccache and Claire Whelan demonstrated software at Eurocrypt 2004 that predicted blacked-out words in a White House briefing regarding September 11.\footnote{\protect\url{www.nytimes.com/2004/05/10/technology/10crypto.html}}
Again, this is different than our notion of document recovery, but it addresses similar questions of privacy leakage from obscured or modified text documents.


\section{Methods}
\label{sec:methods}

Our problem is as follows: given a BOW, we seek
\[
{w^*}_1^D = \argmax_{w_1^D\in \perm(\bow)} p(w_1^D),
\]
where $D$ is the document length. Ideally, we would instead minimize some
distance (e.g., edit distance) to the original document:
\[
{w^*}_1^D = \argmin_{w_1^D\in \perm(\bow)} \mathrm{distance}(w_1^D, \mathrm{true}_1^D),
\]
where $\mathrm{true}_1^D$ is the unknown original document. Since the original
document is unavailable at recovery time, we use the language model
probability as a proxy objective and compare against the true document only
for evaluation.

\subsection{Baseline}

First we implemented a greedy baseline: 
\[
w_i = \argmin_{v \in \bow \backslash w_1^{i-1}} -\log p(v|w_{i-4}^{i-1})
\]
for $i=1,\dots,D$.
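The greedy baseline can be sketched as follows (a minimal illustrative sketch in Python; \texttt{toy\_logprob} and its bigram table stand in for the actual 5-gram model and are not part of our system):

```python
def greedy_decode(bow, logprob, context_size=4):
    """Greedily order a bag of words: at each step, append the remaining
    word with the highest log-probability given the last words placed."""
    remaining = list(bow)   # multiset of word tokens
    output = ["<S>"]        # sentence-start symbol
    while remaining:
        context = tuple(output[-context_size:])
        # pick v minimizing -log p(v | context), i.e., maximizing logprob
        best = min(remaining, key=lambda v: -logprob(v, context))
        output.append(best)
        remaining.remove(best)
    return output[1:]

# Toy scorer: rewards bigrams seen in a tiny "corpus"; illustrative only.
bigrams = {("<S>", "the"), ("the", "rose"), ("rose", "is"), ("is", "red")}

def toy_logprob(word, context):
    return 0.0 if (context[-1], word) in bigrams else -10.0
```

Because the model is local and ties are broken arbitrarily, the greedy decoder can commit to early mistakes that \Astar search avoids.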

\subsection{\Astar search}

We solve our problem using memory-bounded \Astar search. Each point in the
search space represents a partial sequence $w_1^l$, where $l$ is the length of
the partial sequence. The start state is ${\Sbeg}$. The path cost from ${\Sbeg}$ to
any $w_1^l$ is defined as
\[
g(w_1^l) = -\log p(w_1^l) = -\sum_{i=1}^l \log p(w_i|w_{i-4}^{i-1}).
\]
A stupid-backoff language model~\cite{Brants07} is used to approximate the
probabilities:
%
\begin{align*}
p(w_i|w_{i-k+1}^{i-1}) 
  \approx
  \begin{cases}
    \frac{c(w_{i-k+1}^i)}{c(w_{i-k+1}^{i-1})} & \mathrm{if}\, c(w_{i-k+1}^i) > 0 \\
    \alpha p(w_i|w_{i-k+2}^{i-1})             & \mathrm{otherwise}
  \end{cases},
\end{align*}
%
if $k > 1$, and $p(w_i) = c(w_i)/\sum_{v\in V} c(v)$ if $k = 1$, where
$\alpha$ is the backoff factor ($\alpha = 0.4$, following \cite{Brants07}).
Strictly speaking, stupid backoff produces scores rather than normalized
probabilities, but we refer to them as probabilities for convenience.

\subsection{Admissible heuristic}

In order to carry out the search efficiently, we need a heuristic that is fast
to compute but still close to the true cost to the goal. To be admissible, the
heuristic must never overestimate that cost; that is, we seek a lower bound on
\[
h^{\textrm{true}}(w_1^l) = \min_{w_{l+1}^D \in \,\perm(\bow \backslash w_1^l)} -\log p(w_{l+1}^D).
\]
One possibility,
\[
h_1(w_1^l) = \sum_{v_5 \in \bow \backslash w_1^l} 
  \min_{v_1^4 \in \perm(\bow \backslash w_1^{l-4})} -\log p(v_5|v_1^4),
\]
is admissible but impractical, since (a) we need to compute the $\min$s for
every candidate $w_1^l$, and (b) $\perm(\bow \backslash w_1^{l-4})$ is a huge
set for small $l$. Problem (a) is avoided by
\[
h_2(w_1^l) = \sum_{v_5 \in \bow \backslash w_1^l} 
  \min_{v_1^4 \in \perm(\bow)} -\log p(v_5|v_1^4),
\]
since we can precompute the $\min$ for all $v_5 \in V$. Problem (b) is solved
by
\[
h_3(w_1^l) = \sum_{v_5 \in \bow \backslash w_1^l} \Sstar(v_5),
\]
where
\[
  \Sstar(v_5) = \min_{v_1^4 \in \perm(\bow)} S(v_5|v_1^4),
\]
and
\[
S(v_5|v_i^4) = \min(-\log p(v_5|v_i^4), \alpha S(v_5|v_{i+1}^4)),
\]
for $i = 1,\dots,4$, with base case $S(v_5) = -\log p(v_5)$.  Heuristic $h_3$ is
admissible, since $S(v_5|v_1^4) \geq -\log p(v_5|v_1^4)$ for all $v_1^5$, and
efficient, since $\Sstar$ can be computed for all
$v_5 \in V$ in two passes through a list of $n$-gram counts. On the first pass,
we build a map from $n$-grams to counts. (5-grams can be ignored, since only
context counts are needed as denominators.) On the second pass, for every
$n$-gram $v_1^n$ in the list, we compute the probability-domain score
\[
S' = \alpha^{5-n} \frac{c(v_1^n)}{c(v_1^{n-1})}.
\]
We compare $S'$ to the best score for $v_n$ seen so far; if $S'$ is larger, we
record it as the new best. (All best scores are initialized to 0.) Finally,
$\Sstar(v_n)$ is the negative logarithm of the best score for $v_n$: since
$-\log$ is decreasing, maximizing the score minimizes the cost.
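The two-pass computation of $\Sstar$ can be sketched as follows (illustrative Python over a small in-memory list of ($n$-gram, count) pairs; the real computation streams over the Google count files):

```python
import math

ALPHA = 0.4  # backoff factor

def sstar_from_counts(ngram_counts):
    """Compute S*(v): the minimum backoff cost of v under any context,
    from a list of (ngram_tuple, count) pairs."""
    # Pass 1: map n-grams to counts, to serve as denominators.
    counts = dict(ngram_counts)
    best_score = {}  # word -> largest probability-domain score seen
    # Pass 2: score every n-gram by its last word.
    for ngram, c in ngram_counts:
        n = len(ngram)
        if n == 1:
            continue  # unigrams are handled by the base case of S
        denom = counts.get(ngram[:-1], 0)
        if denom == 0:
            continue
        score = (ALPHA ** (5 - n)) * c / denom
        w = ngram[-1]
        if score > best_score.get(w, 0.0):
            best_score[w] = score
    # -log is decreasing, so maximizing the score minimizes the cost.
    return {w: -math.log(s) for w, s in best_score.items()}
```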

\subsection{Empirical heuristic}

This heuristic is not admissible. We run \Astar search once and collect, for
each $w_5$ in the vocabulary, $\max_{w_1^4} p(w_5|w_1^4)$, considering only
contexts $w_1^4$ encountered during the first search while computing the
incremental path cost $g$. We then run \Astar again, using the negative logs
of these probabilities as the heuristic.
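The collection step can be sketched as follows (illustrative Python; \texttt{visited} is assumed to record the (context, word) pairs scored while computing $g$ during the first search):

```python
def collect_empirical_heuristic(visited, prob):
    """For each word, record the best probability observed in any
    context visited during a first A* pass; -log of these values
    serves as the (non-admissible) heuristic in a second pass."""
    best = {}
    for context, w in visited:
        p = prob(w, context)
        if p > best.get(w, 0.0):
            best[w] = p
    return best
```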

\subsection{Pruning strategy}

Even using \Astar search with our heuristics, the search space that needs to be
explored is huge, and memory is frequently exhausted before a solution is
found. We therefore prune the priority queue of candidates. One strategy is to
prune candidates based on the same score by which we pop candidates, but taking
the maximum instead of the minimum: prune
\[
\argmax_{w_1^l \in \mathrm{candidates}} g(w_1^l) + h(w_1^l)
\]
when the priority queue's length exceeds a threshold. However, since $h$
underestimates the true cost, $g(w_1^l) + h(w_1^l)$ tends to increase toward
the true cost as $l$ increases, so this pruning strategy tends to prune long
candidates before shorter ones. This is not what we want, because in fact
longer candidates are \textit{more} likely to be ``good'' than short ones,
since the \textit{best}-seeming $(l-1)$-length partial sequence is the first
candidate to be expanded to partial sequences of length $l$.

A better strategy is to prune based on 
%
\begin{align}
\label{eqn:pruneplus}
\argmax_{w_1^l \in \mathrm{candidates}} g(w_1^l) + h(w_1^l)/l.
\end{align}
%
This corrects the length bias caused by an imperfect $h$ by discounting the
heuristic term for longer candidates.

Another strategy which works well in practice is 
%
\begin{align}
\label{eqn:prune1000}
\argmax_{w_1^l \in \mathrm{candidates}} -k\,l + g(w_1^l) + h(w_1^l),
\end{align}
%
where $k$ is large enough that all candidates of length $l-1$ are pruned before
any candidate of length $l$. This directly exploits the earlier insight that the
worst candidates are expanded last, so the short partial sequences we prune are
the ones most likely to be ``bad,'' i.e., most likely to be safe to prune. A
potential tradeoff between \eqref{eqn:pruneplus} and \eqref{eqn:prune1000} is
that with the latter, if a partial sequence that looks good turns out to be
really bad, the search may not be able to backtrack very far, while
\eqref{eqn:pruneplus} will likely still have short candidates to fall back on;
on the other hand, \eqref{eqn:prune1000} devotes more of the queue to
candidates near the current search frontier.
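Pruning by \eqref{eqn:prune1000} can be sketched as follows (illustrative Python; the queue representation and the value of $k$ are assumptions):

```python
def prune(queue, max_size, k=1000.0):
    """When the queue exceeds max_size, keep the candidates with the
    smallest -k*l + g + h; for sufficiently large k, every candidate of
    length l-1 is discarded before any candidate of length l.
    Each entry is (sequence, g, h)."""
    if len(queue) <= max_size:
        return queue
    queue.sort(key=lambda e: -k * len(e[0]) + e[1] + e[2])  # best first
    return queue[:max_size]
```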

\section{Stopwords BOW}

In practice, programs sometimes index documents with stop words removed.
Recovery from such a stopwords BOW can be formulated as follows:
\[
{w^*}_1^D = \argmax_{w_1^D \in \perm(\bow \cup \stopwords)} \log p(w_1^D).
\]
In the count-BOW problem, $D = |\bow|$, the number of word tokens in
the BOW. When stop words are removed, this is no longer true, and $D$ becomes a
random variable. To find an estimator for $D$, we calculate
%
\begin{align*}
\beta &= \frac{1}{|U|} \sum_{u_1^l \in U} \frac{l}{l-|\stopwords(u_1^l)|} \\
      &= \frac{1}{|U|} \sum_{u_1^l \in U} \frac{l}{|\bow(u_1^l)|}
\end{align*}
%
for some training set $U$. Now
\[
\hat{D}(\bow) = \beta\,|\bow|.
\]
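The length estimator can be sketched as follows (illustrative Python; the stop word list is a placeholder):

```python
STOPWORDS = {"the", "a", "is", "of"}  # illustrative stop word list

def estimate_beta(training_docs):
    """beta: average ratio of full document length to the number of
    non-stop-word tokens, over a tokenized training set U."""
    ratios = [len(doc) / sum(1 for w in doc if w not in STOPWORDS)
              for doc in training_docs]
    return sum(ratios) / len(ratios)

def estimate_length(content_bow, beta):
    """Estimated full length of a document whose stop words were
    removed; content_bow lists the remaining word tokens."""
    return beta * len(content_bow)
```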

In the \Astar search, successors are generated by drawing words from either the
remaining BOW (without replacement) or from the stopwords list (with
replacement). $h_3$ remains an admissible heuristic if $\Sstar$ is calculated as
\[
  \Sstar(v_5) = \min_{v_1^4 \in \perm(\bow \cup \stopwords)} S(v_5|v_1^4).
\]
However, this $h_3$ will be farther from the true cost than the count-BOW $h_3$
was, since the sum still ranges only over $v_5\in \bow \backslash w_1^l$ and
ignores the stop words yet to be inserted. An empirical
heuristic can be built on top of the new $h_3$ as before.

\section{Indicator BOW}

Often software records just the presence of a word in a document, not the
word's count. Recovery from such an indicator BOW can be formulated as
follows:
\[
{w^*}_1^D = \argmax_{w_1^D \sim \bow \cup \stopwords} \log p(w_1^D),
\]
where the drawing is done with replacement. As with stopwords BOWs, for
indicator BOWs $D$ is a random variable. We learn an estimator as follows. Let
$H(\bow) = [H_1, \dots, H_{11}]^{\top} \in \mathbb{N}^{11}$, where
$H_j = \sum_{u_i \in \bow} \mathbf{1}[\mathrm{digits}(c(u_i))=j]$,
$\mathbf{1}[\cdot]$ is the indicator function, and
$\mathrm{digits}(c)$ is the number of digits in $c$'s base-10 representation. (No
count in the Google corpus that we use has more than 11 digits.) We then train a
regression model $\hat{f}$ on $(H(u_1^l), l)$ pairs, $u_1^l \in U$. Our estimator
is
\[
\hat{D}(\bow) = \max(\mathrm{round}(\hat{f}(H(\bow))), |\bow|).
\]
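The feature vector $H$ can be computed as follows (illustrative Python; the regression model $\hat{f}$ itself, e.g., a linear regressor on these features, is omitted):

```python
def digit_histogram(bow_words, corpus_count, max_digits=11):
    """H for length regression: H[j-1] counts the BOW's word types whose
    corpus frequency has j decimal digits; corpus_count maps a word to
    its count in the reference corpus."""
    H = [0] * max_digits
    for word in bow_words:
        j = len(str(corpus_count(word)))  # digits in base-10 representation
        H[j - 1] += 1
    return H
```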

\section{Bigram BOW}

A bigram BOW is a bag (multiset) of bigram tokens; for example, $\bow($Sometimes a rose is
a rose .$)$ = $\{$ a rose, a rose, is a, rose is, rose ., Sometimes a$\}$.
The document length $D$ can be recovered exactly: $D = |\bow| + 1$.

Some of these bigrams overlap unambiguously. We can create ``sticks,''
unambiguous phrases, as follows: for each bigram $v_1^2 \in \bow$, try to
extend it to the right and to the left until there is more than one choice of
how to extend it. For example, starting from ``rose is'' we can extend rightward to
``rose is a'', then ``rose is a rose'', and leftward to ``a rose is a''. We
cannot extend leftward again to ``is a rose is a'': each bigram token can be
used only once per stick. (In practice, each sentence is wrapped in $\Sbeg$
and $\Send$ tokens, and we add, without loss of generality, ``$\Dbeg$ $\Sbeg$'' and
``$\Send$ $\Dend$'' to every bigram BOW to avoid problems in certain cases.)
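Stick construction can be sketched as follows (illustrative Python; sentence and document delimiters are omitted for brevity):

```python
from collections import Counter

def grow_stick(seed, bigrams):
    """Extend a bigram into an unambiguous phrase (a "stick").
    bigrams is a Counter over bigram tuples; each bigram token may be
    used at most once per stick."""
    remaining = bigrams.copy()
    remaining[seed] -= 1
    stick = list(seed)
    # Extend rightward while exactly one unused bigram type continues the stick.
    while True:
        nexts = [b for b in remaining if remaining[b] > 0 and b[0] == stick[-1]]
        if len(nexts) != 1:
            break
        remaining[nexts[0]] -= 1
        stick.append(nexts[0][1])
    # Extend leftward symmetrically.
    while True:
        prevs = [b for b in remaining if remaining[b] > 0 and b[1] == stick[0]]
        if len(prevs) != 1:
            break
        remaining[prevs[0]] -= 1
        stick.insert(0, prevs[0][0])
    return stick
```

On the running example, growing from ``rose is'' happens to recover the entire sentence, since every extension step is unambiguous.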

Once we have the sticks, we can run \Astar search. Each successor is created by
appending words from sticks drawn from the remaining BOW (overlapping bigrams
are discarded). Path cost is calculated as before. No heuristic is used.

% \subsection{DNA sequencing}

% \begin{itemize}
% \item stick = read = \textbf{fragments}
% \item encoding/decoding
% \item naive baseline
% \end{itemize}

\section{Experiments}
\label{sec:experiments}

\subsection{Datasets}

\begin{itemize}
\item How did we create them
\item why these domains
\item Example sentences
\item table of statistics
\end{itemize}

\subsection{Evaluation metric}

\begin{itemize}
\item BLEU 3 or 4?
\item 
\begin{tabular}{c|ccc}
  & dom1 & dom2 & ... \\
\hline
 1& 0.9  &      & ... \\
 2& 0.9  &      & ... \\
 5& 0.9  &      & ... \\
10& 0.9  &      & ... \\
20& 0.9  &      & ... \\
\end{tabular}
\end{itemize}

\section{Discussions}
\label{sec:discussions}

missing:
\begin{itemize}
\item syntactic (parser)
\item binary bow
\item stopwords missing
\end{itemize}

\bibliographystyle{acl}
\bibliography{docrecovery}

\end{document}
