%
% File acl08.tex
%
% Contact: nasmith@cs.cmu.edu

\documentclass[11pt]{article}
\usepackage{acl08}
\usepackage{times}
\usepackage{latexsym}

\usepackage{amsmath,amssymb}
\usepackage{graphicx}
\usepackage{xspace}
\usepackage{url}
\urlstyle{same}

\usepackage{orange}

\setlength\titlebox{6.5cm}    % Expanding the titlebox

\newcommand{\Astar}{\mbox{A$^*$}\xspace}
\newcommand{\argmax}{\mathrm{arg}\,\max}
\newcommand{\rembow}{\mathrm{Rem.~BOW}}
\newcommand{\bow}{\mathrm{BOW}}

\title{Document Recovery from Bag-of-Word Indices}

%\author{Joakim Nivre \\
%  School of Mathematics and Systems Engineering \\
%  V\"{a}xj\"{o} University \\
%  SE-35195, V\"{a}xj\"{o}, Sweden \\
%  {\tt nivre@msi.vxu.se} \And
%  Noah A. Smith \\
%  Language Technologies Institute \\
%  Carnegie Mellon University \\
%  Pittsburgh, PA 15213, USA\\
%  {\tt nasmith@cs.cmu.edu}}

\date{}

\begin{document}
\maketitle
\begin{abstract}
\end{abstract}

\section{Introduction}

With the proliferation of text processing software such as Google Desktop, databases storing full-text indices, and spam filters,
computer users' personal files are increasingly being indexed and stored in inverted index files or other representations (e.g., term frequency vectors) that enable fast search and other operations.
With this popularity, however, come security concerns: few users know where the index files are maintained and what measures are used to protect them.
In this paper, we seek to answer the following question: If an index file were to end up in a malicious party's hands, what could he or she learn about your original documents?

We are interested in the problem of \emph{document recovery from index files}.
Informally, we define this as follows: given an index or collection of term frequency or bag-of-words (BOW) vectors, the task is to reconstruct the original \emph{ordered} text documents. 
Our goal is to understand this problem in more detail and discover what steps must be taken to protect individuals' privacy in the face of today's text processing needs.

We divide the sources of potential privacy risks into two categories.
In the first category, index files are not actively distributed to end users.
A third party must actively hack into a machine to access the indices,\footnote{Throughout this paper, we assume that encrypted index files could be decrypted using standard techniques.} 
at which point he or she may be able to discover the full contents of the original documents, even if word order was eliminated in the indexing process.
Specific examples of software in this first category are desktop search engines like Google Desktop and Mac OS X's Spotlight tool, as well as the popular open-source search engine Apache Lucene,\footnote{\protect\url{lucene.apache.org}} which provides the option to store only BOW vectors in its index. 
Also, some e-mail clients, such as \emph{sup}\footnote{\protect\url{sup.rubyforge.org}} and \emph{cone},\footnote{\protect\url{www.courier-mta.org/cone}} keep local indices of messages that reside on a secure server.
Finally, widely used database management systems like MySQL and PostgreSQL include full-text search capabilities that produce index files and term frequency vectors.
Note that in all these cases, users might assume the indices are less sensitive than the original documents, which may be stored elsewhere.
We hope to show that indices alone provide a powerful tool to access private information.

In the second category of document recovery scenarios, index files or document vectors are actually distributed to end users.
That is, they are \emph{given directly to potential hackers}. 
This is the case for classification tools, where someone has trained a model and wishes to distribute it to others to classify new documents. 
The concern is that the training documents might be confidential or private in nature, and people should not assume the model is secure. 
As a simple example, consider a na\"ive server administrator who builds a spam classifier using all e-mails marked as spam by all users, and then distributes the model to users' local mail clients.
Other example applications are news readers that categorize documents based on a previously seen training corpus, or sentiment analysis tools trained on potentially private product reviews. 
For a nearest neighbor classifier, the model is simply the set of training vectors.
More realistically, if the classifier is a support vector machine (SVM), the model contains the ``support vectors,'' which are nothing more than actual training examples close to the decision boundary.
By inspecting the model, one can therefore recover the training documents that serve as support vectors.
In fact, the model files produced by both SVM$^{light}$\footnote{\protect\url{svmlight.joachims.org}} and LIBSVM\footnote{\protect\url{www.csie.ntu.edu.tw/~cjlin/libsvm}} contain the support vectors in a readable ASCII format (even in the case of linear SVMs).
Thus, anyone provided with a model file has access to a potentially large set of training documents in some vector space.
We explore the case of BOW vectors and indicator vectors, as well as vectors with stop words removed.

All of the aforementioned software retains some record of which words appear in which documents, though explicit order information is removed.
Note this is similar (if not identical) to a collection of BOW vectors, where each vector represents a single document and contains counts or indicators of the words present. 
It has recently been shown that, using only a large set of BOW vectors, a third party can learn a simple bigram language model and recover some simple documents~\cite{Zhu2008Learning}.
In this paper, we explore the use of external language resources like a large n-gram model trained on Web text to attempt to recover ordered documents by decoding BOW vectors.
We present a thorough study of this document recovery problem, both to gain insight into safeguarding private documents and to suggest novel text compression methods.
Our methods may also be useful for restoring files when an index is available but the original documents have been inadvertently deleted.

Our basic approach to document recovery is to find the most likely document given the index.
This is implemented as hypothesis rescoring, where the hypotheses are documents that are consistent with the index.
We note that hypothesis rescoring itself is not novel and has been an important component in many previous machine translation and automatic speech recognition systems.
However, our work has several novel contributions:
\begin{itemize}
\item A novel formulation of the problem of document recovery, which we believe is an important area for future NLP research.
\item Several benchmark data sets drawn from domains 
where document recovery could be particularly detrimental (e.g., personal e-mail, government documents, medical records).
\item A comprehensive empirical study using the Google 1T 5-gram corpus and the resulting ``stupid backoff'' language model~\cite{Brants07}.
\item A specialized \Astar decoder using novel search heuristics for document recovery under a variety of reduced document conditions (e.g., BOW vectors, indicator vectors, vectors after stop word removal, bag-of-bigrams vectors).
\end{itemize}

We discuss related work in Section~\ref{sec:related}, then present our formal definition of document recovery and our methods in Section~\ref{sec:methods}.
We introduce our benchmark data sets and present experimental results illustrating the validity of our methods in Section~\ref{sec:experiments}.
Finally, we conclude with a discussion in Section~\ref{sec:discussions}.


\section{Related work}
\label{sec:related}
In the NLP domain, there has been a great deal of machine translation and automatic speech recognition research in hypothesis rescoring.
We only have room to mention a few salient examples here.
In statistical machine translation (SMT), various decoding strategies have been explored~\cite{Knight1999Decoding,Germann2003Greedy}. 
The basic task is to produce the best sequence of words given a set of input words, a translation model, and a language model.
This has been addressed using greedy best-first search without any search heuristics, and has been formulated as a traveling salesman problem (i.e., finding the lowest-cost path that visits all the required words)~\cite{Germann2001Fast}.
Beam search is also commonly used, at the word, phrase, or n-gram level~\cite{Hoang08,Crego2005Ngram}.
Finally, \Astar search, which we use in the current work, has also been applied to SMT: Och et al.~\cite{och2001} develop several admissible or almost-admissible decoding heuristics.
Our work differs from all of the above in that we already know (at least some of) the words that must appear in the recovered document. While this may appear to simplify the problem, the search space we must consider can in fact be much larger.
In SMT, the possible permutations of target words are constrained by the \emph{ordered} source language document~\cite{kanthak05novel,bangalore07}.
We lack such constraints in our task, and we also consider other reduced document conditions where we must decide how many additional words to introduce without the guide of a source document.

In another NLP application, Athanaselis et al.~\cite{athanaselis06} search for reorderings of words in the context of automatic grammar checking, to detect word order errors. Their method uses a large language model (trained on the 100-million-word British National Corpus) to guide a filtering-based search: the search space is pruned of permutations that contain non-existent bigrams, and the goal is to maximize the probability of trigrams in the reordered sentence. Our work differs in that we use a higher-order (up to $n=5$) language model trained on several orders of magnitude more text, and we employ \Astar search to find the ordering with the highest overall language model probability. Our approach is also designed to work on much longer test documents, not just short sentences (4--12 words).

At a high level, the current research also bears similarity to data mining and security research that attempts to recover hidden knowledge from published sources.
Staddon et al.~\cite{Staddon07} study what inferences can be made when newly published data is combined with existing information on the Web or other public sources.
We also try to measure what can be learned by combining reduced document representations with large language models, though the methods involved are very different. 
Another task related to document recovery is that of uncovering missing words in declassified documents that have been redacted or sanitized. 
In this case, ordered text documents are provided, but some spans of text have been blacked out, and one must try to predict what words are missing.
This task was first popularized in the press when David Naccache and Claire Whelan demonstrated software at Eurocrypt 2004 that predicted blacked-out words in a White House briefing regarding September 11.\footnote{\protect\url{www.nytimes.com/2004/05/10/technology/10crypto.html}}
Again, this task differs from our notion of document recovery, but it addresses similar questions of privacy leakage from obscured or modified text documents.


\section{Methods}
\label{sec:methods}
Our problem is as follows: we seek $w^* = \argmax_{w_1^D\in \mathrm{perm}(\bow)} p(w_1^D)$,
where $D$ is the document length and $\mathrm{perm}(\bow)$ denotes the set of all orderings of the words in the bag.
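To make the objective concrete, it can be solved exactly by brute force for very small bags; the following sketch (purely illustrative; \texttt{logprob\_seq} is an assumed sequence scorer, not part of our system) enumerates every candidate ordering:

```python
from itertools import permutations

def exact_recover(bow, logprob_seq):
    # Brute-force argmax over perm(BOW): enumerate every ordering of the
    # bag and keep the one with the highest sequence log-probability.
    # Feasible only for very small bags, since there are |BOW|! orderings.
    return max(permutations(bow), key=logprob_seq)
```

The factorial growth of the candidate set motivates the search formulations that follow.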


\subsection{Baseline}

First we implemented a greedy baseline, which does not solve the problem exactly:
\[
w_i = \argmax_{v \in \bow \backslash w_1^{i-1}} p(v \mid w_{i-4}^{i-1}),
\quad i = 1,\dots,D,
\]
i.e., at each position we commit to the remaining word with the highest language model probability given the preceding (up to four) words. This gives poor results.
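The greedy baseline can be sketched as follows (illustrative only; \texttt{logprob} is an assumed function returning $\log p(v \mid \mathrm{context})$, e.g.\ from a stupid-backoff model):

```python
def greedy_decode(bow, logprob, n=5):
    # Greedy baseline: repeatedly emit the remaining word with the
    # highest n-gram log-probability given the last n-1 emitted words.
    # `bow` is a list of words (with repeats); no backtracking is done.
    remaining = list(bow)
    out = []
    while remaining:
        context = tuple(out[-(n - 1):])
        best = max(remaining, key=lambda v: logprob(v, context))
        out.append(best)
        remaining.remove(best)
    return out
```

Because each choice is irrevocable, a single early mistake can corrupt the rest of the sequence, which is why this baseline performs poorly.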

\subsection{Formulation as \Astar search}

To obtain better results, we solve the original problem, $\argmax_{w_1^D \in \mathrm{perm}(\bow)} p(w_1^D)$, using memory-bounded \Astar search.
Each point in the search space represents a partial sequence $w_1^l$, where $l$ is the length of the partial sequence; the start state is the sentence-start symbol $\langle \mathrm{S} \rangle$.
The path cost from the start state to $w_1^l$ is
\[
g(w_1^l) = -\log p(w_1^l) = -\sum_{i=1}^l \log p(w_i \mid w_{i-4}^{i-1}).
\]
We approximate the probabilities with a stupid-backoff language model built on the Google 1T corpus:
\[
p(w_i \mid w_{i-k+1}^{i-1}) \approx
\begin{cases}
\dfrac{c(w_{i-k+1}^i)}{c(w_{i-k+1}^{i-1})} & \text{if } c(w_{i-k+1}^i) > 0,\\[1.5ex]
\alpha\, p(w_i \mid w_{i-k+2}^{i-1}) & \text{otherwise,}
\end{cases}
\]
for $k > 1$, and $p(w_i) = c(w_i)/\sum_{v\in V} c(v)$ for $k = 1$, where $c(\cdot)$ denotes a corpus count and $\alpha < 1$ is the backoff multiplier.
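The stupid-backoff score itself can be sketched as follows (illustrative; the count table layout is ours, and $\alpha = 0.4$ is the value suggested by Brants et al.):

```python
ALPHA = 0.4  # backoff multiplier; 0.4 is the value suggested by Brants et al.

def stupid_backoff(counts, total, ngram):
    # Score S(w_i | w_{i-k+1:i-1}) for ngram = (w_{i-k+1}, ..., w_i).
    # `counts` maps n-gram tuples to corpus counts; `total` is the
    # unigram total.  Scores are not normalized probabilities.
    if len(ngram) == 1:
        return counts.get(ngram, 0) / total
    if counts.get(ngram, 0) > 0 and counts.get(ngram[:-1], 0) > 0:
        return counts[ngram] / counts[ngram[:-1]]
    return ALPHA * stupid_backoff(counts, total, ngram[1:])
```

In the search, $-\log$ of this score plays the role of an edge cost.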

\subsection{Heuristic}

To carry out the search efficiently, we need a heuristic that is fast to compute yet close to the true cost to the goal. The true remaining cost of a partial sequence $w_1^l$ is
\[
h^{\textrm{true}}(w_1^l) = \min_{w_{l+1}^D \in \,\mathrm{perm}(\bow \backslash w_1^l)} -\log p(w_{l+1}^D \mid w_1^l),
\]
and an admissible heuristic must never exceed it. One possibility,
\[
h_1(w_1^l) = \sum_{v_5 \in \bow \backslash w_1^l}
  \min_{v_1^4 \in \mathrm{perm}(\bow \backslash w_1^{l-4})} -\log p(v_5 \mid v_1^4),
\]
is admissible but impractical, since (a) the inner minimizations must be recomputed for every candidate $w_1^l$, and (b) $\mathrm{perm}(\bow \backslash w_1^{l-4})$ is a huge set when $l$ is small. Problem (a) is avoided by
\[
h_2(w_1^l) = \sum_{v_5 \in \bow \backslash w_1^l}
  \min_{v_1^4 \in \mathrm{perm}(\bow)} -\log p(v_5 \mid v_1^4),
\]
since the inner $\min$ can now be precomputed once for every $v_5 \in V$. Problem (b) is solved by
\[
h_3(w_1^l) = \sum_{v_5 \in \bow \backslash w_1^l} S^*(v_5),
\]
where
\[
S^*(v_5) = \min_{v_1^4 \in \mathrm{perm}(\bow)} S(v_5 \mid v_1^4),
\]
\[
S(v_5 \mid v_i^4) = \min\left( -\log p(v_5 \mid v_i^4),\; -\log\alpha + S(v_5 \mid v_{i+1}^4) \right)
\]
for $i = 1,\dots,4$, with $S(v_5) = -\log p(v_5)$. Heuristic $h_3$ is admissible, since $S(v_5 \mid v_1^4)$ never exceeds the true stupid-backoff cost of $v_5$ in any context, and efficient, since $S^*$ can be computed for all $v_5 \in V$ in one pass through a list of n-gram counts, considering only contexts with $c(v_1^4 v_5) > 0$.

A fourth heuristic, $h_4$, is computed directly from the corpus. We are given a BOW and Google's 1T N-gram corpus, where each entry consists of an N-gram $u_{1:N}$ and its count $c(u_{1:N})$.
Recall that our scoring method is stupid backoff: $S(w_i|w_{i-k+1:i-1}) =
\frac{c(w_{i-k+1:i})}{c(w_{i-k+1:i-1})}$ if $c(w_{i-k+1:i}) > 0$, and $\alpha
S(w_i|w_{i-k+2:i-1})$ otherwise.

Observation: for any 5-gram $u_{1:5}$, its true score satisfies
\[
S(u_5|u_{1:4}) \leq S^* \equiv \max\left(
\frac{c(u_{1:5})}{c(u_{1:4})},\;
\alpha   \frac{c(u_{2:5})}{c(u_{2:4})},\;
\alpha^2 \frac{c(u_{3:5})}{c(u_{3:4})},\;
\alpha^3 \frac{c(u_{4:5})}{c(u_4)},\;
\alpha^4 \frac{c(u_5)}{|\mathrm{corpus}|}
\right).
\]
Why? Suppose $c(u_{1:5}) > 0$:
then $S(u_5|u_{1:4}) = \frac{c(u_{1:5})}{c(u_{1:4})}$, which is one of the terms in the max.
On the other hand, suppose $c(u_{1:5}) = 0$ but $c(u_{2:5}) > 0$.
In that case we back off once, so $S(u_5|u_{1:4}) = \alpha S(u_5|u_{2:4}) = \alpha \frac{c(u_{2:5})}{c(u_{2:4})}$, which is also in the max.
Similarly, if $c(u_{1:5}) = c(u_{2:5}) = 0$ but $c(u_{3:5}) > 0$, then we back off twice, so $S(u_5|u_{1:4}) = \alpha^2 S(u_5|u_{3:4}) = \alpha^2 \frac{c(u_{3:5})}{c(u_{3:4})}$, which is again in the max.
By the same reasoning, the max includes $S(u_5|u_{1:4})$ when we back off three or four times; in other words, the max includes the true score no matter how many times we back off.
Note that $S^*$ may strictly exceed the true score: when $c(u_{1:5}) > 0$ the true score is $\frac{c(u_{1:5})}{c(u_{1:4})}$, but another term in the max may be larger.
This is acceptable, since an upper bound on the score still yields an admissible heuristic.

This observation suggests a procedure for efficiently calculating a heuristic.

\begin{enumerate}
  \item Scan Google's 1T N-gram corpus and extract all entries $u_{1:N}$, $N \in \{1,\dots,5\}$, such that $u_i \in \bow$ for all $i \in \{1,\dots,N\}$.
  \item Scan the extracted list and add all N-grams with $N \in \{1,\dots,4\}$ to a
    map from N-grams to counts. (These provide the denominators of $S$.)
  \item Scan the extracted list again. For every N-gram $u_{1:N}$ in the
    extracted list, compute $S' = \alpha^{5-N}
    \frac{c(u_{1:N})}{c(u_{1:N-1})}$, taking $c(u_{1:0})$ to be the corpus size, and compare $S'$ to $S^*(u_N)$, the best
    score for $u_N$ seen so far. If $S' > S^*(u_N)$, set $S^*(u_N) = S'$.
    (Initialize $S^*(u_N) = 0$.)
\end{enumerate}

This yields $S^*(u)$ for every word $u \in \bow$. We can now use $S^*$ to
compute a heuristic for each node in the search efficiently:
$h_4 = \sum_{u \in \rembow} -\log S^*(u)$,
where $\rembow$ denotes the bag of words not yet used in the partial sequence.
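The one-pass computation of $S^*$ can be sketched as follows (illustrative; we assume the extracted entries are available as (n-gram, count) pairs, the lower-order counts as a map, and the corpus size supplies the unigram denominator):

```python
ALPHA = 0.4  # backoff multiplier (assumed value)

def precompute_sstar(entries, counts, total, bow):
    # One pass over extracted n-gram entries (all of whose words occur in
    # the bag), keeping for each final word u the maximum discounted score
    # S' = ALPHA**(5 - N) * c(u_{1:N}) / c(u_{1:N-1}); the result S*(u)
    # upper-bounds the true stupid-backoff score of u in any 5-gram context.
    sstar = {u: 0.0 for u in bow}
    for ngram, count in entries:
        n = len(ngram)
        denom = total if n == 1 else counts.get(ngram[:-1], 0)
        if denom == 0:
            continue
        s = (ALPHA ** (5 - n)) * count / denom
        if s > sstar[ngram[-1]]:
            sstar[ngram[-1]] = s
    return sstar
```

The negated logs of these per-word bounds are then summed over the remaining bag to form the heuristic.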

We also consider an \emph{empirical} heuristic, which is not admissible: run \Astar search once and record, for each word $w_5$ in the vocabulary, $\max_{w_1^4} p(w_5|w_1^4)$ over only the contexts $w_1^4$ encountered during that first search; then run \Astar again, using the costs derived from these probabilities as the heuristic.

\subsection{Pruning strategy}

Even using \Astar search, the search space is huge, and we usually exhaust
memory before finding a solution, so we must prune the priority queue of
candidates. One strategy is to prune candidates by the same score used to pop
them, $g+h$. However, this favors short sentences, since our imperfect
heuristics substantially overestimate the true path cost.

Our second strategy, (b), prunes by $(g+h)/l$.  % (a),(b) are temporary labels

Our third strategy, (a), prunes by $1000\,l+g+h$.

Both give much better results than pruning by $g+h$, but (b) is much slower
than (a) (and than $g+h$). We believe this is because (a) does little or no
backtracking: it expands a few nodes, reaches the heap-size limit, and then
prunes the shortest candidates first; only once it has pruned all nodes of
length $i-1$ does it prune nodes of length $i$. In this respect (a) behaves
much like beam search. Strategy (b) keeps more of these short but potentially
good nodes around, leaving open the possibility of backtracking to them.

For decoding long documents, (a) may remain feasible where (b) (or $h_4$
without our new pruning scheme) would be infeasible. Also, (a) is similar to
beam search, which is used more commonly than \Astar in machine translation
because of its speed.

On the other hand, (b) is more elegant.
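The memory-bounded search with pruning strategy (a) can be sketched as follows (illustrative; \texttt{g\_cost} and \texttt{h\_est} stand in for the path cost $g$ and heuristic $h$, and the heap bound and length weight are assumed parameters):

```python
import heapq

def astar_decode(bow, g_cost, h_est, max_heap=10000, length_weight=1000.0):
    # Memory-bounded A*: candidates are popped by f = g + h.  When the
    # frontier exceeds max_heap, we discard the candidates with the
    # smallest pruning score length_weight * l + g + h (strategy (a)),
    # so shorter hypotheses are pruned first, much like beam search.
    frontier = [(0.0, (), tuple(sorted(bow)))]  # (f, partial seq, remaining bag)
    while frontier:
        f, seq, remaining = heapq.heappop(frontier)
        if not remaining:
            return list(seq)  # all bag words used: a complete ordering
        for i, w in enumerate(remaining):
            nseq = seq + (w,)
            nrem = remaining[:i] + remaining[i + 1:]
            heapq.heappush(frontier, (g_cost(nseq) + h_est(nseq, nrem), nseq, nrem))
        if len(frontier) > max_heap:
            frontier.sort(key=lambda node: length_weight * len(node[1]) + node[0],
                          reverse=True)
            del frontier[max_heap:]
            heapq.heapify(frontier)
    return None
```

With an admissible heuristic and no pruning triggered, this returns a minimum-cost ordering; once pruning fires, optimality is no longer guaranteed.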

% \begin{itemize}
% \item stupid backoff LM
% \item Goal:
% \[
% \argmax_{s\in perm(\bow)} p(s)
% \]
% \item ``greedy baseline''
% \item heuristic: h4, 2-pass h
% \item pruning: 1000*partial\_length+g+h, (g+h)/partial\_length
% \item heuristic
% \end{itemize}

% \subsection{DNA sequencing}

% \begin{itemize}
% \item stick = read = \textbf{fragments}
% \item encoding/decoding
% \item naive baseline
% \end{itemize}

\section{Experiments}
\label{sec:experiments}

\subsection{Datasets}

\begin{itemize}
\item How did we create them
\item why these domains
\item Example sentences
\item table of statistics
\end{itemize}

\subsection{Evaluation metric}

\begin{itemize}
\item BLEU 3 or 4?
\item 
\begin{tabular}{c|ccc}
  & dom1 & dom2 & ... \\
\hline
 1& 0.9  &      & ... \\
 2& 0.9  &      & ... \\
 5& 0.9  &      & ... \\
10& 0.9  &      & ... \\
20& 0.9  &      & ... \\
\end{tabular}
\end{itemize}

\section{Discussions}
\label{sec:discussions}

missing:
\begin{itemize}
\item syntactic (parser)
\item binary bow
\item stopwords missing
\end{itemize}

\bibliographystyle{acl}
\bibliography{docrecovery}

\end{document}
