---------------------------------------------------------------------
CS 577 (Intro to Algorithms)
Lec 21 (11/16/06) Shuchi Chawla
---------------------------------------------------------------------
Hashing
=======
Basic setting: We want to form a dictionary of "legal words" out of a
large collection of possible words. For example, suppose we want to
form a dictionary of English words of up to 10 characters each. There
are nearly 26^10 ~ 10^14 possible such words. But the number of legal
words out of these is quite small, say 1000.
We want the dictionary to satisfy the following desiderata:
- Small size (ideally proportional to the number of legal words, not
the entire universe.)
- Fast lookup, insertion and deletion (ideally in constant time)
There are a few different ways of solving this. For example, we could
maintain a balanced search tree. But then lookups could take time up
to logarithmic in the size of the dictionary. If we were not concerned
about the size of the dictionary, we could maintain an array of the
size of the entire universe, with a 1 in each position that
corresponds to a legal word, and 0 everywhere else. Then we would get
constant time lookup. Can we get the best of both? Hashing provides an
answer.
FORMAL SETUP
------------
- Elements come from some large universe U. (e.g., all < 10-character strings)
- Some set S in U of keys we actually care about (which may be static
or dynamic). N = |S|.
- Do inserts and lookups by having an array A of size M, and a
HASH FUNCTION h: U -> {0,...,M-1}. Given element x, store in A[h(x)].
[if U was small (like 3-character strings) then you could just store
x in A[x]. But the problem is that U is big. That's why we need the
hash function.]
- Will resolve collisions by having each entry in A be a linked list.
Collision is when h(x) = h(y). There are other methods but this
is cleanest -- called "separate chaining". To insert, just put at top
of list. If h is good, then hopefully lists will be small.
Nice properties: searches are easy: compute index i=h(x) and then
walk down list at A[i] until we find it. Inserts and deletes are easy too.
Properties we want:
- Elements are spread out. Ideally all buckets (lists) of constant size.
- M is O(N)
- h is fast to compute. In analysis today we'll be ignoring time to
compute h, viewing it as constant. But worth remembering in the back
of your head that h shouldn't be crazy.
Given this, time to do a lookup for item x is O(length of list at
h(x)). Same for delete. Insert just takes time O(1). So, important
thing is to be able to analyze how big these lists get.
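The chaining scheme above can be sketched as follows. This is a minimal
illustration only: it treats Python's built-in hash() as the hash
function h (an assumption -- the notes leave h abstract until later).

```python
# A minimal separate-chaining hash table: an array A of M linked lists
# (plain Python lists here), with item x stored in the list at A[h(x)].
class ChainedHashTable:
    def __init__(self, m):
        self.m = m                             # table size M
        self.table = [[] for _ in range(m)]    # one chain per bucket

    def _index(self, x):
        return hash(x) % self.m                # h(x) in {0, ..., M-1}

    def insert(self, x):
        chain = self.table[self._index(x)]
        if x not in chain:                     # keep chains duplicate-free
            chain.append(x)                    # O(1): put at end of list

    def lookup(self, x):
        # walk down the list at A[h(x)] until we find x (or run out)
        return x in self.table[self._index(x)]

    def delete(self, x):
        chain = self.table[self._index(x)]
        if x in chain:
            chain.remove(x)
```

Lookup and delete cost O(length of the chain at h(x)), exactly as in the
analysis above; everything hinges on how long those chains get.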
Basic intuition: one way to spread things out nicely is to spread them
randomly. Unfortunately, we can't just use a random number generator
to decide where the next thing goes because need h to be able to do
lookups: h(x) has to be the same when we lookup as it was when we
inserted. So, we want something "pseudorandom".
UNIVERSAL HASHING
-----------------
A probability distribution H over hash functions is *universal* if for
all x != y in U,
Pr [ h(x) = h(y) ] <= 1/M
h<-H
Theorem: if H is universal, then for any set S in U, for any x in U
(that we might want to insert or lookup), for a random h taken from H, the
*expected* number of collisions x has with (other) elements in S is <= N/M
Proof: Each y in S (y!=x) has <= 1/M chance of colliding with x. So,
- Let C_xy = 1 if x and y collide and 0 otherwise.
- C_x = total # collisions for x = sum_y C_xy.
- E[C_xy] = Pr(x and y collide) <= 1/M
- E[C_x] = sum_y E[C_xy] <= N/M
Corollary: the expected time for any given lookup is O(1 + N/M), since
lookup time is proportional to the length of the list (# of
collisions) plus the time to compute h (which we are viewing as
constant). So, overall, this is constant time if M=N.
Theorem: The expected total number of collisions in S is (N choose 2)/M
Proof: Let Z_xy be an indicator variable indicating that x collides
with y. Then, Pr[Z_xy = 1] = 1/M. Summing over these random variables,
we get that
E[total number of collisions] = E[sum over x&y of Z_xy]
= sum over x and y of E[Z_xy] = (N choose 2)/M
Note that we used the linearity of expectation in the second step.
Now suppose we picked M to be on the order of N^2. Then we would expect
only a constant number of collisions in total, and the expected
lookup/insert time would be constant. Note that this is like the
birthday paradox -- in a group of at least 23 people, it is more likely
than not that two of them share a birthday. Precisely, if you throw N
objects randomly into an array of size N^2/2, then in expectation you
get about one collision.
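As a quick sanity check of that estimate, the sketch below throws N keys
uniformly at random into M = N^2/2 slots and counts colliding pairs; the
particular parameters (N = 20, 2000 trials) are illustrative choices.

```python
import random
from collections import Counter

# Empirical check: with N keys thrown into M random slots, the expected
# number of colliding (unordered) pairs is (N choose 2)/M.
def avg_collisions(n, m, trials=2000, seed=0):
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        counts = Counter(rng.randrange(m) for _ in range(n))
        # a slot holding k keys contributes (k choose 2) colliding pairs
        total += sum(k * (k - 1) // 2 for k in counts.values())
    return total / trials

# For n = 20, m = n^2/2 = 200: predicted (20 choose 2)/200 = 0.95.
```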
Question: can we actually construct a universal H? If not,
this is all pretty vacuous. Answer is yes.
TERMINOLOGY: If H is a uniform distribution over a set of hash
functions {h1, h2, ...}, then that set is called a "UNIVERSAL HASH
FAMILY". Use "H" for both the set and the probability distribution.
Either way, we think of H as a probabilistic way of constructing a
hash function.
HOW TO CONSTRUCT: Matrix method
-------------------------------
Say keys are u-bits long. Say table size is power of 2, so an index is
b-bits long with M = 2^b.
What we'll do is pick h to be a random b-by-u 0/1 matrix, and define
h(x) = hx, where we do addition mod 2. These matrices are short and
fat. For instance:
        h         x     h(x)
   ---------     ---    ---
   [1 0 0 0]     [1]    [1]
   [0 1 1 1]     [0]  = [1]
   [1 1 1 0]     [1]    [0]
                 [0]
Claim: For x!=y, Pr[h(x) = h(y)] = 1/M = 1/2^b.
Proof: First of all, what does it mean to multiply h by x? We can
think of it as adding some of the columns of h (mod 2) where the 1
bits in x tell you which ones to add. (e.g., we added the 1st and 3rd
columns of h above)
Now, take arbitrary x,y, x!=y. Must differ someplace, so say they differ
in ith coordinate and for concreteness say x_i=0 and y_i=1.
Imagine we first choose all of h except the ith column. Since x_i = 0,
the ith column plays no role in h(x), so h(x) is already fixed. But
each of the 2^b different settings of the ith column gives a different
value of h(y) [every time we flip a bit in that column, we flip the
corresponding bit in h(y)]. So over the random choice of the ith
column, h(y) is uniform over all 2^b values, and there is exactly a
1/2^b chance that h(x) = h(y).
There are other methods based on multiplication modulo primes too. See
the textbook (section 13.6) for an example.
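The matrix method can be sketched in code as follows; the column
representation and parameter names are our own choices, and keys are
treated as u-bit integers.

```python
import random

# h is a random b-by-u 0/1 matrix, and h(x) = hx (addition mod 2).
# Multiplying h by x amounts to adding (XOR-ing) the columns of h
# selected by the 1 bits of x, so we store each column as a b-bit int.
def random_matrix_hash(u, b, seed=None):
    rng = random.Random(seed)
    cols = [rng.getrandbits(b) for _ in range(u)]   # u random columns
    def h(x):
        out = 0
        for i in range(u):
            if (x >> i) & 1:     # bit i of x says: add column i (mod 2)
                out ^= cols[i]
        return out               # an index in {0, ..., 2^b - 1}
    return h
```

Note one quirk of this family: h(0) = 0 for every choice of matrix, which
is fine, since universality only constrains pairs x != y.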
=====================================================================
Question: if we fix the set S, can we find hash function h such that
*all* lookups are constant-time? Yes. This leads to...
PERFECT HASHING
---------------
Hash function is "perfect" for S if all lookups involve O(1) work.
METHOD #1
---------
Say we are willing to have a table whose size is quadratic in the size
N of our dictionary S. Then, here is an easy method. Let H be
universal and M = N^2. Pick random h from H and hash everything in S.
Claim: Pr(no collisions) >= 1/2.
[So, we just try it, and if we got any collisions, we just try a new h]
Proof:
- How many pairs (x,y) in S are there? (N choose 2).
- For each pair, the chance they collide is <= 1/M by definition of universal.
- So, Pr(exists a collision) <= (N choose 2)/M < 1/2.
This is like the "other side" to the "birthday paradox". If the
number of days is a lot *more* than (# people)^2, then there is a
reasonable chance *no* pair has the same birthday.
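Method #1 translates directly into code. For concreteness this sketch
draws h from the classic ((a*x + b) mod p) mod M family with p = 2^61 - 1
(a prime), assuming integer keys smaller than p -- an assumption on our
part; any universal family works.

```python
import random

# Method #1: set M = N^2, pick a random h from a universal family, hash
# everything in S, and retry on any collision. Since Pr(success) >= 1/2
# per attempt, the expected number of attempts is at most 2.
def perfect_hash_quadratic(keys, seed=0):
    p = 2**61 - 1                      # prime larger than any key (assumed)
    rng = random.Random(seed)
    n = len(keys)
    m = n * n                          # quadratic table size
    while True:
        a, b = rng.randrange(1, p), rng.randrange(p)
        h = lambda x, a=a, b=b: ((a * x + b) % p) % m
        if len({h(x) for x in keys}) == n:   # no collisions: perfect for S
            return h, m
```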
What if we want to use just O(N) space?
METHOD #2:
---------
History: This was a big open question -- posed as "should tables be
sorted?" -- that is, for a fixed set, can you get constant lookup time
with only linear space? It was followed by more and more complicated
attempts, and finally solved using the nice idea of universal hash
functions in a 2-level scheme.
Proposal: hash into table of size N. Will get some collisions.
Then, for each bin, rehash it using Method #1, squaring the size of
the bin to get zero collisions.
E.g., if 1st hash function maps {a,b,c,d,e,f} to
[--,{a}, {b,c}, {d,e,f}],
then final result would look something like:
[--,h_2, h_3, h_4]
h_2 would just be the function h_2(x) = 0. h_3(x) would be a function
to table of size 4, h_4 would be function to table of size 9. Point
is: using Method #1, we can find h_i with no collisions by picking
from universal H and trying it -- if it doesn't work we try again
(each time, Pr(success)>= 1/2).
By *design* this is O(1) time for lookups, since just compute two hash
functions. Question is: how much space does it use?
Space used is O(N) for the 1st table (assume it takes a constant
amount of space to store each hash function), plus O(sum of |B_i|^2)
for the other tables (B_i is ith bucket). So, all we need to do is prove:
Theorem: if pick initial h from universal set H, then
Pr[sum of |B_i|^2 > 4*N] < 1/2.
Let us first do a back of the envelope computation for this. Recall
that when M=O(N), we expect to see (N choose 2)/M or around O(N)
collisions. Suppose all these collisions were happening in a single
bin of the array. How many elements does that bin have? About the
square root of N, because every pair of them collides. So B_i for this
bin is O(square root of N). So we get (B_i)^2 = O(N). In general the
collisions would be spread out, but this same argument essentially
allows us to bound the sum of the |B_i|^2.
Proof: We'll prove this by showing that E[sum_i |B_i|^2] < 2*N
Now, the neat trick is that one way to count this quantity is to
count the number of ordered pairs that collide, including element
colliding with itself. E.g, if bucket has {d,e,f}, then d collides
with each of d,e,f; e collides with each of d,e,f; f collides with
each of d,e,f; so get 9.
So,
sum_i |B_i|^2 = # of ordered pairs that collide (allowing x=y)
= sum_x sum_y C(x,y) [C(x,y)=1 if in same bucket, else 0]
So,
E[sum_i |B_i|^2] = sum_x sum_y Pr(x and y are in same bucket)
<= N + N(N-1) * (1/M), where the "1/M" comes from
definition of universal.
< 2*N. (since we set M = N)
So we get that the expected value of the sum is small. Let us use this
to show that with a high probability the sum is small.
Recall that the expected value of a random variable X is just the sum
over the possible values it takes times the probabilities that it
takes those values. Suppose that X only takes on nonnegative values.
What is the probability that X takes a value that's twice as large as
its expected value? In other words, suppose that E[X] = y. What is
Pr[X > 2y]? Well, it can't be 1/2 or more, because otherwise, in the
sum computing the expectation, these values alone would contribute
more than y. Therefore, Pr[X > 2E[X]] < 1/2. This is called Markov's
inequality.
Using Markov's inequality and E[sum_i |B_i|^2] < 2*N, we get that
Pr[sum of |B_i|^2 > 4*N] < 1/2.
So, the final "hash function" is:
1. Compute i = h(x).
2. Compute h_i(x) and use it to look up x in table T_i.
Total space used is O(sum of (B_i)^2) = O(N).
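Putting the whole two-level scheme together, here is a sketch. As in the
Method #1 sketch, it uses the ((a*x + b) mod p) mod m family as a
stand-in universal family (an assumption; any universal family works),
and the retry loops rely on the 1/2 success probabilities proved above.

```python
import random

P = 2**61 - 1   # a prime larger than any key (assumption)

def build_two_level(keys, seed=0):
    rng = random.Random(seed)

    def random_h(m):   # random member of a universal family into {0,...,m-1}
        a, b = rng.randrange(1, P), rng.randrange(P)
        return lambda x, a=a, b=b: ((a * x + b) % P) % m

    n = len(keys)
    # Level 1: hash into a table of size N; retry until sum |B_i|^2 <= 4N
    # (each attempt succeeds with probability > 1/2, by the theorem above).
    while True:
        h1 = random_h(n)
        buckets = [[] for _ in range(n)]
        for x in keys:
            buckets[h1(x)].append(x)
        if sum(len(b) ** 2 for b in buckets) <= 4 * n:
            break

    # Level 2: give each bucket B_i a collision-free table of size |B_i|^2
    # via Method #1 (retry until no collisions within the bucket).
    inner = []
    for b in buckets:
        m_i = max(len(b) ** 2, 1)
        while True:
            h_i = random_h(m_i)
            if len({h_i(x) for x in b}) == len(b):
                break
        table = [None] * m_i
        for x in b:
            table[h_i(x)] = x
        inner.append((h_i, table))

    def lookup(x):          # O(1) by design: exactly two hash evaluations
        h_i, table = inner[h1(x)]
        return table[h_i(x)] == x
    return lookup
```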
============================================================================