\documentclass[11pt]{article}
\include{lecture}
\begin{document}
\newcommand{\order}[1]{\left| #1 \right|}
\newcommand{\norm}[2]{\| #1 \|_{#2}}
\lecture{16}{11/1/2011}{Error Reduction}{Aaron Gorenstein}
%\draft
Last lecture we introduced expanders, graphs with few edges that act
like graphs with many edges.
We discussed algebraic and combinatorial definitions, and introduced
how expanders are used to conserve a resource in $\BPP$ computation:
the number of random bits used in error reduction.
This lecture we review the definitions and establish their equivalence.
We construct a class of expander graphs, and apply those to implement
two forms of error reduction.
One form is entirely deterministic---after the first run uses $r$ random
bits, it does not use any more but can still reduce the error probability.
Another uses a few random bits per additional run, and greatly reduces
the error probability.
\section{Expander review: two definitions and equivalence}
\paragraph{Combinatorial Definition} $G = (V, E)$ is a $(k, c)$-expander if
every $B \subseteq V$ with $\order{B} \leq k$ is
$c$-expanding, meaning $\order{\Gamma(B)} \geq c\order{B}$.
Recall that $\Gamma(B)$ is the set of neighbors of the vertices in $B$.
\paragraph{Algebraic Definition} If we represent the $d$-regular graph $G$
with the normalized adjacency matrix $A$, we define the second-largest
eigenvalue (in absolute value) as $\lambda(G) = \max\{|\lambda| \mid \exists \text{ an
eigenvector } v \text{ of } A,\ Av = \lambda v \land v \perp u\}$.
\paragraph{Linking the two definitions}
Recall we define $\order{V} = N$.
Consider the following two theorems, which we state without proof:
\begin{theorem} \label{16:thm:expander1}
If $G$ is an $(\frac{N}{2}, c)$-expander, then $\lambda(G) \leq f(c, d)$,
where $f(c, d) < 1$ for $c > 1$.
\end{theorem}
\begin{theorem} \label{16:thm:expander2}
If $\lambda(G) \leq \lambda$, then $G$ is an $( \frac{N}{2}, c )$-expander
where $c \geq g(\lambda)$ and $g(\lambda) > 1$ if $\lambda < 1$.
\end{theorem}
Combining these theorems, for a family of $d$-regular graphs we have the
following statement:
\[
\exists c_0 > 1 \text{ s.t. } c \geq c_0 \Leftrightarrow
\exists \lambda_0 < 1 \text{ s.t. } \lambda \leq \lambda_0
\]
In other words, we can bound $c$ away from 1 (from above) if and only if
we can bound $\lambda$ away from 1 (from below), where $c$ refers to the
combinatorial definition and $\lambda$ to the algebraic one.
\section{Explicit Construction}
For all their desirable properties, do expanders actually exist?
If we randomly make a graph of fixed degree $d$ it is usually an expander.
But that thoroughly defeats our purpose: \emph{conserving} random bits.
We would like an \emph{explicit, deterministic} construction of expanders.
Particularly, given an index of a vertex, we would like to compute the
neighbors of that vertex in $\polylog(n)$ time.
Here is an example based on pairs of integers modulo $m$.
\begin{example}
$V = \mathbb{Z}_m \times \mathbb{Z}_m$, so a vertex is a pair $v = (x, y)$
with $x, y \in \mathbb{Z}_m$, and all arithmetic below is modulo $m$.
Each vertex has exactly 8 neighbors:
\begin{align*}
(x \pm y, y) &\qquad (x \pm (y+1), y)\\
(x, y \pm x) &\qquad (x, y \pm (x+1))
\end{align*}
\end{example}
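To make the $\polylog$-time neighbor computation concrete, here is a minimal sketch in Python. The function name \texttt{neighbors} is our own; it simply evaluates the eight expressions above with all arithmetic mod $m$.

```python
def neighbors(x, y, m):
    """Return the 8 neighbors of vertex (x, y) in Z_m x Z_m.

    Each neighbor is one of the lecture's expressions, reduced mod m;
    computing them takes time polylogarithmic in m^2 (a few additions).
    Some neighbors may coincide, in which case the graph is a multigraph.
    """
    return [
        ((x + y) % m, y), ((x - y) % m, y),
        ((x + y + 1) % m, y), ((x - y - 1) % m, y),
        (x, (y + x) % m), (x, (y - x) % m),
        (x, (y + x + 1) % m), (x, (y - x - 1) % m),
    ]

print(neighbors(2, 3, 5))
```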
\section{Application to Error Reduction}
We consider two applications of expanders to error reduction.
The first is \textbf{deterministic error reduction}, in which we do not
use any more random bits than the original $r$, but repeat our $\BPP$
computation again with deterministically-computed neighbors of $r$, and we
prove that we can still reduce the error from its initial $\epsilon_0$.
The general strategy is to view all random bit sequences $r$ as vertices
of an expander graph.
Given the vertex $r$, we compute its neighbors (deterministically) and
try those. The majority vote determines if we accept or reject.
The second is \textbf{randomness-efficient error reduction}.
It is quite similar to the deterministic method, but instead of choosing
all neighbors, we take a short \emph{random walk} from our starting point
$r$.
For a walk of length $t$, this requires only $O(t)$ more random bits,
because we have a fixed-degree graph.
This is how we can easily get to error $2^{-k}$ using only $r + O(k)$ random
bits. Again, we decide based on the majority vote of those polled.
Before we move on, let us consider two properties of these expanders.
\begin{enumerate}
\item $\lambda(G)$ (the second-largest eigenvalue) dictates how quickly
it converges to the uniform distribution.
Recall from the previous lecture:
$\norm{A^tp - u}{1} \leq \sqrt{N}\lambda^t$.
This formula expresses that idea.
\item There is also the \emph{expander mixing lemma}.
The term ``mixing'' relates to Markov chains.
\end{enumerate}
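The convergence bound can be checked numerically. The following sketch (our own illustration, not from the lecture) verifies $\norm{A^tp - u}{1} \leq \sqrt{N}\lambda^t$ on the 9-cycle, whose normalized adjacency matrix has eigenvalues $\cos(2\pi k/9)$; the largest in absolute value with eigenvector perpendicular to $u$ is $\lambda = \cos(\pi/9)$.

```python
import math

N = 9
lam = math.cos(math.pi / N)  # second-largest |eigenvalue| of the 9-cycle

def step(p):
    """One step of the walk: each vertex averages its two cycle neighbors."""
    return [(p[(i - 1) % N] + p[(i + 1) % N]) / 2 for i in range(N)]

p = [1.0] + [0.0] * (N - 1)  # start concentrated on vertex 0
for t in range(1, 21):
    p = step(p)
    l1 = sum(abs(q - 1.0 / N) for q in p)
    assert l1 <= math.sqrt(N) * lam ** t  # the bound from the lecture
print("bound verified for t = 1..20")
```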
\begin{lemma}[Expander Mixing Lemma] \label{16:lemma:eml}
For every pair of subsets $S,T \subseteq V$,
\begin{equation}
\left|\frac{|E(S,T)|}{dN} - \mu(S)\mu(T)\right|
\le \lambda \sqrt{\mu(S)(1-\mu(S))\mu(T)(1-\mu(T))}.
\end{equation}
\end{lemma}
The term $E(S,T)$ denotes the set of edges between the vertex subsets $S$
and $T$, so $|E(S,T)|$ is their number.
Recall that $\mu(S) = \frac{\order{S}}{\order{V}}$.
In a sense, this lemma establishes a measure for how ``good'' our expander is.
If the right-hand value is small (so the difference is small), that means
choosing just \emph{one} random vertex and a random neighbor is
``almost as good as'' choosing \emph{two} random vertices.
Recall that with a constant degree, the number of random bits to choose a
random neighbor is much less than to choose an entirely new vertex.
In other words, $2\log\order{V} > \log\order{V} + \log d$.
Now we must establish this bound.
\begin{proof}[Proof of Lemma \ref{16:lemma:eml}]
We must translate $\order{E(S, T)}$ into algebraic terms.
Let $A(G)$ be the normalized adjacency matrix of $G$.
% the rest of this is a modified version of Hittson et al. (see
% acknowledgments)
Recall that $A(G)$ is symmetric and real, implying that it has a full
orthonormal eigenbasis. The number of edges between two sets $S$
and $T$ can be written in terms of their relative characteristic
vectors (i.e. $\chi_{S,i} = 1$ iff $v_i \in S$ and $\chi_{S,i} = 0$
otherwise):
\begin{equation} \label{13.eqn.eml.1}
\order{E(S, T)} = \chi_S^T(dA)\chi_T
\end{equation}
By definition of $A$, $dA$ is the standard adjacency matrix for $G$.
Both $\chi_S$ and $\chi_T$ can be rewritten in terms of their
components parallel and perpendicular to the uniform vector $u$.
Recall $u = (\frac{1}{N},\frac{1}{N},...,\frac{1}{N})$ is an
eigenvector corresponding to the eigenvalue 1.
\begin{eqnarray*}
\chi_T &=& \chi_T^\parallel + \chi_T^\perp\\
(\chi_T,u) &=& (\chi_T^\parallel,u) + (\chi_T^\perp, u)\\
\chi_T^\perp \perp u &\to& (\chi_T^\perp,u) = 0\\
\chi_T^\parallel \parallel u &\to& \chi_T^\parallel = \alpha \cdot u\\
%\mbox{We would like to determine $\alpha$.}\\
(\chi_T^\parallel,u) &=& \alpha (u,u)\\
\frac{\order{T}}{N} &=& \alpha\frac{1}{N}\\
\alpha &=& \order{T}\\
\mbox{So:}\\
\chi_T &=& \chi_T^\parallel + \chi_T^\perp\\
%\mbox{Filling in the equivalent values we've determined}\\
\chi_T &=& \order{T}\cdot u + \chi_T^\perp
\end{eqnarray*}
Now we have $A\chi_T = \order{T}u + A\chi_T^\perp$.
If we repeat this process for $\chi_S$ we can
re-write our formula for $\order{E(S, T)}$:
\begin{equation}
\begin{aligned}
\order{E(S,T)} &= (\chi_S^\parallel + \chi_S^\perp)(dA)(\chi_T^\parallel
+ \chi_T^\perp) \\
%&= \chi_S^\parallel dA + \chi_S^\perp dA + dA\chi_T^\parallel+dA\chi_T^\perp\\
&= \chi_S^\parallel dA \chi_T^\parallel + \chi_S^\perp dA
\chi_T^\perp \\
&= d \chi_S^\parallel \chi_T^\parallel + \chi_S^\perp dA
\chi_T^\perp \\
&= d \frac{\order{S}\order{T}}{N} + \chi_S^\perp dA \chi_T^\perp. \\
\end{aligned}
\end{equation}
The second line follows from the first because the cross terms vanish,
and the third because $\chi_T^\parallel$ is an eigenvector of $A$ with
eigenvalue 1, so $A\chi_T^\parallel = \chi_T^\parallel$.
The cross terms vanish because, for example, $A\chi_T^\perp$ creates
a vector that's orthogonal to $u$, and so the inner product is 0.
Dividing by $dN$, moving
terms around and taking the absolute value gives:
\begin{equation}
\begin{aligned}
\left|\frac{\order{E(S,T)}}{dN} - \mu(S)\mu(T)\right| &=
\left|\frac{\chi_S^\perp(dA)\chi_T^\perp}{dN}\right| \\
&\le \frac{\|\chi_S^\perp\|_2 \cdot \lambda \cdot
\|\chi_T^\perp\|_2}{N} \\
\end{aligned}
\end{equation}
The second line is reached by applying Cauchy--Schwarz to the RHS and
using the fact that $A$ multiplies the norm of $\chi_T^\perp$ by at
most $\lambda$, as there is no component of $\chi_T^\perp$ along
$u$. Applying the Pythagorean theorem and some
simple algebra to $\|\chi_S\|_2$ we can derive the value of
$\|\chi_S^\perp\|_2$:
\begin{equation}
\begin{aligned}
\|\chi_S\|_2^2 &= \|\chi_S^\parallel\|_2^2 +
\|\chi_S^\perp\|_2^2, \text{ therefore}\\
|S| &= |S|^2 \frac{1}{N} + \|\chi_S^\perp\|_2^2, \text{ and} \\
\|\chi_S^\perp\|_2^2 &= |S|(1-\mu(S)), \\
\|\chi_S^\perp\|_2 &= \sqrt{|S|(1-\mu(S))}. \\
\end{aligned}
\end{equation}
Substituting this back in for $\|\chi_S^\perp\|_2$ and
$\|\chi_T^\perp\|_2$ and pulling the factor of $N$ into the
square root completes the proof.
\end{proof}
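As a sanity check (our own addition, not part of the proof), the lemma can be verified exhaustively on a small graph whose $\lambda$ is known in closed form. For the 7-cycle, the eigenvalues of the normalized adjacency matrix are $\cos(2\pi k/7)$, so $\lambda = \cos(\pi/7)$; here $|E(S,T)|$ counts ordered adjacent pairs, so the $dN$ ordered pairs in total match the lemma's normalization.

```python
import math

N, d = 7, 2
lam = math.cos(math.pi / N)  # max_{k != 0} |cos(2 pi k / 7)| = cos(pi / 7)
# ordered adjacent pairs of the 7-cycle: dN = 14 of them in total
edges = {(i, (i + 1) % N) for i in range(N)} | {(i, (i - 1) % N) for i in range(N)}

def members(bits):
    """Decode a bitmask into the corresponding vertex subset."""
    return [i for i in range(N) if bits >> i & 1]

for sb in range(2 ** N):            # every subset S
    S = members(sb)
    for tb in range(2 ** N):        # every subset T
        T = members(tb)
        e = sum(1 for u in S for v in T if (u, v) in edges)
        muS, muT = len(S) / N, len(T) / N
        lhs = abs(e / (d * N) - muS * muT)
        rhs = lam * math.sqrt(muS * (1 - muS) * muT * (1 - muT))
        assert lhs <= rhs + 1e-9    # the expander mixing lemma
print("mixing lemma verified on the 7-cycle")
```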
\section{Deterministic Error Reduction}
\begin{theorem}[Deterministic Error Reduction]
Given a $\BPP$ algorithm that uses $r$ random bits on input $x$ we can
reduce the error to $\leq \epsilon$ using $\poly(\frac{1}{\epsilon})$
runs of the original algorithm and $r$ random bits.
\end{theorem}
Recall our new procedure: run $M$ on all $d$ neighbors of our random
bit string of length $r$.
Our new procedure fails exactly on those random strings for which over
half of the neighbors are ``bad,'' i.e., cause $M$ to err; the majority
vote is then wrong.
The fraction of such strings among all of $\{0,1\}^r$ is our new failure
rate, our new $\epsilon$.
\begin{proof}
Let $B$ denote the set of all ``bad'' random strings. Formally:
\begin{equation}
B = \{\rho \in \{0, 1\}^r \mid M(x, \rho) \text{ errs on input } x\}
\end{equation}
Note that the inner computation is deterministic---$\rho$ represents our
random bits.
The set of strings on which our new method fails is:
\begin{equation}
B' = \{\rho \in \{0,1\}^r \mid \text{over } \tfrac{d}{2} \text{ of } \Gamma(\rho)
\text{ lie in } B\}
\end{equation}
In other words, we will fail only if over half of the neighbors of $\rho$
are ``bad.''
As our algorithm is in $\BPP$, we know that $\mu(B) \leq \epsilon_0$
(the fraction of bad strings is exactly the original error probability).
Our goal of reducing the error to $\epsilon$ amounts to showing that
$\mu(B') \leq \epsilon$.
We will use the expander mixing lemma to bound this quantity.
Every $\rho \in B'$ has over half of its
neighbors in $B$, hence $\order{E(B',B)}\geq\order{B'}\frac{d}{2}$.
We re-express our definition:
\begin{equation}
\frac{\order{E(B',B)}}{dN} \geq \frac{\order{B'}\frac{d}{2}}{dN}
= \frac{\mu(B')}{2}
\end{equation}
This serves to simplify the left-hand-side of the expander mixing lemma
application.
And if instead of considering $\rho$'s deterministically-chosen neighbors,
we chose a new random variable:
\begin{equation} \label{16:eqn:detprfmx2}
\mu(B')\mu(B) \leq \mu(B')\epsilon_0 \leq \mu(B')/2
\end{equation}
Now we can compare these two options with the expander mixing lemma,
noting that we do not need absolute values because of the last relation
in (\ref{16:eqn:detprfmx2}).
\begin{equation}
\frac{\mu(B')}{2} - \mu(B')\epsilon_0 = \mu(B')\left(\frac{1}{2} - \epsilon_0\right)
\leq \lambda\sqrt{\mu(B')\mu(B)}
\end{equation}
The right-hand side is simpler than in the full mixing lemma: dropping
the factors $(1-\mu(B'))$ and $(1-\mu(B))$, each at most 1, can only
increase it, so the inequality still holds.
With some manipulation:
\begin{align}
\mu(B')(\frac{1}{2}-\epsilon_0) \leq& \lambda\sqrt{\mu(B')\mu(B)}\\
\mu(B') \leq& \frac{\lambda^2\mu(B)}{(\frac{1}{2}-\epsilon_0)^2}\\
\leq& \frac{\lambda^2\epsilon_0}{(\frac{1}{2}-\epsilon_0)^2}
\end{align}
So we've bounded our new error.
We achieve our goal if
\begin{equation}
\frac{\lambda^2\epsilon_0}{(\frac{1}{2}-\epsilon_0)^2} \leq \epsilon
\end{equation}
When does that happen?
When
\begin{equation}
\lambda \leq \frac{\sqrt{\epsilon}(\frac{1}{2}-\epsilon_0)}{\sqrt{\epsilon_0}}
\end{equation}
Here $\lambda$ is our second-largest eigenvalue.
What if $\lambda$ is too large, though?
We can decrease it as needed by taking the $t^{\text{th}}$ power of $A$
(connecting vertices by length-$t$ walks), which raises $\lambda$ to the
$t^{\text{th}}$ power and so shrinks it.
Taking $t$ on the order of $\log(\frac{1}{\epsilon})$ makes $\lambda^t$
small enough, and the degree grows to $d^t = \poly(\frac{1}{\epsilon})$,
which matches the number of runs claimed in the theorem.
\end{proof}
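The procedure in this proof can be sketched as follows. The names are hypothetical: \texttt{M} stands in for the machine's deterministic core and \texttt{expander\_neighbors} for any explicit construction, such as the degree-8 graph from the Explicit Construction section. The toy demo uses the complete graph as an idealized expander.

```python
def amplify(M, x, rho, expander_neighbors):
    """Run M(x, .) on every neighbor of the random string rho; majority vote."""
    votes = [M(x, sigma) for sigma in expander_neighbors(rho)]
    return sum(votes) > len(votes) / 2

# Toy demo: 16 "random strings", the complete graph as an idealized
# expander, and an M that errs (answers False) exactly on strings 0 and 1.
def M(x, rho):
    return rho not in (0, 1)

def nbrs(rho):
    return [s for s in range(16) if s != rho]

# After amplification, every starting string yields the correct answer.
assert all(amplify(M, None, rho, nbrs) for rho in range(16))
```

Note that no fresh randomness is consumed: the neighbors of $\rho$ are computed deterministically.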
\section{Randomness-Efficient Error Reduction}
Can we shrink error even more if we allow for some extra random bits?
That is the motivation question behind this next theorem.
\begin{theorem}[Randomness-Efficient Error Reduction]
Given a $\BPP$ algorithm $M$ that uses $r$ random bits on input $x$, we
can reduce the error to $\leq \epsilon$ using $O(\log\frac{1}{\epsilon})$
runs of the original algorithm and $r + O(\log \frac{1}{\epsilon})$
random bits.
\end{theorem}
Compare to the na\"{i}ve algorithm's cost to get those bounds:
still $O(\log\frac{1}{\epsilon})$ runs, but it uses
$r \cdot O(\log\frac{1}{\epsilon})$ random bits.
So we reduce a multiplicative factor to an additive one.
The basic method behind this is choosing a random neighbor.
Here we see why expanders act like complete graphs: if every node is
completely connected, then choosing a random neighbor is
equivalent to choosing a new random string.
Expanders have a much smaller degree, but choosing a random neighbor is
``almost as good.''
Our actual method will be to do a length $\log\frac{1}{\epsilon}$ walk,
and for each node (indicating a new random string) re-run our original
algorithm, and once again take the majority vote.
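A sketch of this procedure (hypothetical helper names, not the lecture's notation): \texttt{neighbors(rho)} lists the $d$ neighbors of vertex \texttt{rho}, so each step consumes only $\log_2 d$ random bits.

```python
import random

def walk_amplify(M, x, r, t, neighbors, d):
    """Majority vote of M over the vertices of a length-t random walk."""
    rho = random.getrandbits(r)                    # r random bits pick the start
    votes = []
    for _ in range(t):
        votes.append(M(x, rho))
        rho = neighbors(rho)[random.randrange(d)]  # log2(d) random bits per step
    return sum(votes) > t / 2

# Toy plumbing check on a 16-vertex cycle (a poor expander, but degree 2):
cycle = lambda rho: [(rho - 1) % 16, (rho + 1) % 16]
assert walk_amplify(lambda x, rho: True, None, 4, 9, cycle, 2)
```

In total this uses $r + t\log_2 d$ random bits, versus $rt$ for fresh strings each run.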
To prove this, we need a key lemma.
\begin{lemma}[Key Lemma] \label{16:lemma:random_amp}
Let $P$ be the projection onto the coordinates $\rho$ for which $M$ errs.
Then for any vector $x$:
\begin{align}
\|PAx\|_2 \le \sqrt{\epsilon_0 + \lambda^2} \|x\|_2.
\end{align}
\end{lemma}
\begin{proof}
Consider the representation of $x = x^\parallel + x^\perp$. Then by the triangle
inequality:
\begin{align}
\|PAx\|_2 \le \|PAx^\parallel\|_2 + \|PAx^\perp\|_2. \label{16.eqn.2_norm_PAx}
\end{align}
Using the facts that the uniform distribution is invariant under $A$ and that
the bad set is small, we get:
\begin{align}
\|PAx^\parallel\|_2 &= \|Px^\parallel\|_2 \le \sqrt{\epsilon_0}
\|x^\parallel\|_2. \label{16.eqn.2_norm_PAx_para}
\end{align}
Using the facts that $P$ is a projection (and so does not increase the
2-norm) and that $A$ multiplies the norm of $x^\perp$ by at most
$\lambda$, we get:
\begin{align}
\|PAx^\perp\|_2 &\le \|Ax^\perp\|_2 \le \lambda \|x^\perp\|_2.
\label{16.eqn.2_norm_PAx_perp}
\end{align}
Substituting (\ref{16.eqn.2_norm_PAx_para}) and (\ref{16.eqn.2_norm_PAx_perp})
back into (\ref{16.eqn.2_norm_PAx}), we have:
\begin{align}
\|PAx\|_2
&\le \sqrt{\epsilon_0} \|x^\parallel\|_2 + \lambda \|x^\perp\|_2 \\
&= (\sqrt{\epsilon_0}, \lambda) \cdot \left(\|x^\parallel\|_2,
\|x^\perp\|_2\right)^\intercal \\
&\le \sqrt{\epsilon_0 + \lambda^2}\|x\|_2 \tag{Follows from Cauchy-Schwarz}
\end{align}
\end{proof}
\subsection{Using the lemma}
This lemma can be used to bound the error probability of $M'$. Consider the
probability that $M'$ errs. This is the same as the probability that at least
half of the steps in the random walk fall in the set where $M$ errs.
\begin{align}
\Pr[M' \text{ errs}]
&= \Pr[\text{at least } \frac{t}{2} \text{ of the walk's steps}
\text{ fall in the set on which } M \text{ errs}] \\
&\le \sum_{B \subseteq [t], |B| \ge \frac{t}{2}} \Pr[(\forall i \in B)
i^{\text{th}} \text{ step lies in the bad set for } M]
\label{16.eqn.pr_upper_bound}\\
&= \sum_{B \subseteq [t], |B| \ge \frac{t}{2}} \|M_tA M_{t-1}A \ldots M_2A M_1A
M_0 u\|_1 \label{16.eqn.pr_as_matrix}\\
&\le \sum_{B \subseteq [t], |B| \ge \frac{t}{2}} \sqrt{2^r} \|M_tA \ldots M_1A
M_0 u\|_2 \\
&\le \sum_{B \subseteq [t], |B| \ge \frac{t}{2}} \sqrt{2^r}
\left(\sqrt{\epsilon_0 + \lambda^2}\right)^{|B|} \|u\|_2
\label{16.eqn.repeat_lemma}\\
&= \sum_{B \subseteq [t], |B| \ge \frac{t}{2}} \left(\sqrt{\epsilon_0 +
\lambda^2}\right)^{|B|} \\
&\le 2^t \cdot \left(\sqrt{\epsilon_0 + \lambda^2}\right)^{\frac{t}{2}} \\
&= (4 \sqrt{\epsilon_0 + \lambda^2})^{\frac{t}{2}} \le \epsilon.
\end{align}
Line (\ref{16.eqn.pr_upper_bound}) upper bounds the actual probability
because it overcounts the bad walks (a union bound over the subsets $B$).
This probability is rewritten as a
product of matrices in (\ref{16.eqn.pr_as_matrix}) where
\begin{align*}
M_i =
\left\{
\begin{array}{ll}
P & i \in B\\
I & \text{otherwise.}
\end{array}
\right.
\end{align*}
Line (\ref{16.eqn.pr_as_matrix}) follows as an equality with this reasoning:
$u$ is the initial state, $M_0$ ``kills'' the bad set, then $A$ means we take
another ``step,'' and so we repeat.
Line (\ref{16.eqn.repeat_lemma}) follows from repeated applications of Lemma
\ref{16:lemma:random_amp}. A constant amount of preprocessing can decrease
$\sqrt{\epsilon_0 + \lambda^2}$ to less than $\frac{1}{4}$.
What if the left-hand-side of our conclusion is too large? What ways do we have
to decrease that value?
If $\epsilon_0$ is too large, we can use our previous, deterministic error
reduction to shrink it further.
If $\lambda^2$ is too large, we can power our graph (as we did in our
deterministic method in the case where $\lambda$ was too large),
and so shrink $\lambda$.
These methods are sufficient to allow $\epsilon$ to be small.
If $4\sqrt{\epsilon_0 + \lambda^2} < 1$, then walking for $t =
O(\log\frac{1}{\epsilon})$ steps will give error less than $\epsilon$. This
procedure uses $r$ random bits to pick the starting vertex, and $\log d$ bits
for each of the $\log\frac{1}{\epsilon}$ steps in the random walk for a total of
$r + O(\log\frac{1}{\epsilon})$ random bits for $M'$.
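For concreteness, a back-of-the-envelope comparison of the two budgets. The constants are our own illustrative choices (exactly $k$ runs for error $2^{-k}$, one walk step per run, a degree-8 expander), not the lecture's.

```python
import math

def naive_bits(r, k):
    """Fresh randomness per run: k independent runs, r bits each."""
    return r * k

def walk_bits(r, k, d=8):
    """Random-walk method: r bits for the start, log2(d) bits per step."""
    return r + k * int(math.log2(d))

# e.g. r = 100 random bits, target error 2^{-20}:
print(naive_bits(100, 20), "vs", walk_bits(100, 20))  # 2000 vs 160
```

The multiplicative overhead $r \cdot O(\log\frac{1}{\epsilon})$ becomes the additive $r + O(\log\frac{1}{\epsilon})$.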
\section{Other Results}
A stronger result, which considers the variance of random walks, is
known as the Expander Chernoff Bound. It states that if you take a
random walk, then the fraction of times the walk lands in the bad set
does not vary much from the expected number. Let $X_i$ be an
indicator variable for the event that the $i^{\rm{th}}$
step lies in some set $B_i \subseteq V$. Then the probability that
the walk varies from the expected number of steps in the bad sets
can be written as:
\begin{equation}
\Pr\left[\sum_{i=1}^t (X_i - \mu(B_i)) \ge at\right] \le e^{-b(1-\lambda)a^2t},
\end{equation}
where $a \ge 0$ and $b$ is some universal constant. The probability
that the walk varies from expected for a constant $a$ decreases
exponentially as $t$ increases. This inequality reduces to the
standard Chernoff Bound when $G$ is the complete graph with self-loops:
there $A$ has rank 1, so $\lambda(G) = 0$, and all the $X_i$ are independent.
\paragraph{Next Lecture}
We will consider space bounded derandomization.
\section{Acknowledgments}
Many thanks to Amanda Hittson, Nathan Collins, and Matthew Anderson, the
authors of Lecture 13 for Spring 2010, and Tyson Williams, author
of lecture 14 for Spring 2010.
\end{document}