---------------------------------------------------------------------
CS 577 (Intro to Algorithms)
Lecture notes: string matching Shuchi Chawla
---------------------------------------------------------------------
Pattern matching or string matching
===================================
Problem: We are given two strings---a long string S of length n, and a
short string T of length m. We want to find all the occurrences of T
in S.
E.g., S = banananobanano and T = nano. T occurs in S twice, starting
at S[4] and S[10]. (Note that we think of S as an array with indices
starting at 0.)
We want to solve this problem in linear time.
First, let's think of a naive algorithm for solving this. For every
index value i, we could test to see if T occurs in S starting at
S[i]. The algorithm would look like the following:
for (i=0 to i=n-m) {
    set flag = true
    for (j=0 to j=m-1) {
        if (S[i+j] != T[j]) { set flag = false; break }
    }
    if (flag == true) output i
}
(The outer loop stops at n-m so that S[i+j] never runs past the end
of S, and the inner loop stops at the first mismatch.)
How long does this algorithm take? The outer loop executes O(n) times
and the inner one takes O(m) time in each iteration. Therefore, the
total time is O(mn). This is already polynomial time, but for long
strings (say, S is a document with n = 10,000 characters and T is a
sentence with m = 50 characters) this can be quite slow.
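The pseudocode above translates directly into Python. This is a
sketch (function and variable names are mine, not from the notes):

```python
def naive_match(S, T):
    """Return all indices i such that T occurs in S starting at S[i].

    Direct transcription of the naive pseudocode: the outer loop only
    runs while a full copy of T can still fit, and the inner loop
    stops at the first mismatch.
    """
    n, m = len(S), len(T)
    matches = []
    for i in range(n - m + 1):
        flag = True
        for j in range(m):
            if S[i + j] != T[j]:
                flag = False
                break
        if flag:
            matches.append(i)
    return matches

print(naive_match("banananobanano", "nano"))  # [4, 10]
```

On the running example this reports the two occurrences at indices 4
and 10, matching the grid below.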
Let's see what an execution of this algorithm looks like on the above
example. The grid below shows the comparisons done.
       0 1 2 3 4 5 6 7 8 9 10 11 12 13
S:     b a n a n a n o b a n  a  n  o
i=0:   X
i=1:     X
i=2:       n a n X
i=3:         X
i=4:           n a n o
i=5:             X
i=6:               n X
i=7:                 X
i=8:                   X
i=9:                     X
i=10:                    n  a  n  o
i=11:                       X
i=12:                          n  X
i=13:                             X
We can make two observations. One, the algorithm doesn't always do m
comparisons in each iteration of the outer loop. However, there
certainly exist examples where it does O(m) comparisons in EVERY
iteration (e.g., S = aaa...a and T = aa...a). So we cannot obtain an
improvement in the running time by simply improving our analysis.
Second, note that every time we find an instance of T "nano", we can
simply skip the next three steps in the outer loop, because we know
that for the next three values of i, we are not going to find a
match. Likewise, when we have matched the three letters "nan" and the
next one doesn't match (e.g. for i=2 above), we can skip the next
value of i (i=3 above); furthermore, for the following value of i (i=4
above), we already know that the first letter matches with the first
letter in T, so we can save some work by just resuming checking with
the second letter in T.
Such savings can altogether lead to a considerable improvement in
running time.
More formally, suppose that we have matched T up to the j-th
position, i.e., T[0..j] = S[i..i+j]. Suppose also that
S[i+j+1] != T[j+1]. Then how much work can we save in the next
iteration? Suppose that a prefix of T of length x+1 matches a suffix
of T[0..j], i.e., T[0..x] = T[j-x..j] with x < j. Then we know that
S[i+j-x .. i+j] = T[j-x..j] = T[0..x]. So we can continue matching
S[i+j+1] with T[x+1], i.e., set the next value of i to i+j-x and j to
x+1. Moreover, if T[0..x] is the longest proper prefix of T that is
also a suffix of T[0..j], then we can safely skip all values of i
smaller than i+j-x.
In the example above, suppose that i=2 and j=2. That is, we matched
S[2..4] (nan) with T[0..2] (nan). Then, note that T[0] (n) is a
suffix of T[0..2] (nan). So S[4] matches with T[0], but no occurrence
of T starts at i=3. So we can resume matching S with T at i=4 and
j=1.
Our new and more efficient algorithm assumes that we have
precomputed these overlap values for T---for every j, overlap[j] is
the length of the longest proper prefix of T that is also a suffix
of T[0..j] (equivalently, the maximum value of x+1 with x < j such
that T[0..x] = T[j-x..j], or 0 if there is no such x). We also use
the convention overlap[-1] = -1. With these values in hand, matching
applies the rule above: whenever a mismatch occurs after matching
T[0..j], we resume comparing with i = i+j-x and j = x+1, where
x+1 = overlap[j].
How do we compute the overlap values? Here is a recursive algorithm:
Find-overlap(T, j)
    if (j == -1) return -1
    x = Find-overlap(T, j-1)
    while (x >= 0 and T[j] != T[x])
        x = Find-overlap(T, x-1)
    overlap[j] = x+1
    return overlap[j]
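For concreteness, the overlap-based matching procedure can be
sketched in Python. The names here are mine, and overlap[j] denotes
the length of the longest proper suffix of T[0..j] that is also a
prefix of T:

```python
def compute_overlap(T):
    """overlap[j] = length of the longest proper suffix of T[0..j]
    that is also a prefix of T (0 when there is none)."""
    m = len(T)
    overlap = [0] * m
    x = 0  # length of the current longest prefix-suffix
    for j in range(1, m):
        while x > 0 and T[j] != T[x]:
            x = overlap[x - 1]
        if T[j] == T[x]:
            x += 1
        overlap[j] = x
    return overlap

def kmp_match(S, T):
    """Find all occurrences of a nonempty T in S.

    On a mismatch after matching j characters of T, resume comparing
    at T[overlap[j-1]] instead of restarting from T[0]; i never moves
    backwards over S."""
    overlap = compute_overlap(T)
    matches = []
    j = 0  # number of characters of T matched so far
    for i, c in enumerate(S):
        while j > 0 and c != T[j]:
            j = overlap[j - 1]
        if c == T[j]:
            j += 1
        if j == len(T):
            matches.append(i - len(T) + 1)  # match ends at S[i]
            j = overlap[j - 1]
    return matches

print(kmp_match("banananobanano", "nano"))  # [4, 10]
```

Note that this sketch tracks the number of matched characters j
rather than the pair (i, j) from the prose; the two bookkeeping
schemes are equivalent.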
We can speed up this algorithm via memoization. In particular, we
store the result of each recursive call in the array overlap. Another
way of writing this algorithm is as follows:
Iterative-find-overlap
    overlap[-1] = -1
    For j=0 to m-1 do
        x = overlap[j-1]
        while (x >= 0 and T[j] != T[x])
            x = overlap[x-1]
        overlap[j] = x+1
Note that we simply replaced the recursive calls by the values
computed previously and stored in the array overlap.
As an exercise, try this algorithm on the following string
T = a b a b c a b a b a b c
You should get the result 0 0 1 2 0 1 2 3 4 3 4 5.
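The iterative algorithm transcribes almost line by line into Python
(a sketch, my naming). Conveniently, storing the sentinel value -1 at
the end of the list makes Python's overlap[-1] behave exactly like
the notes' overlap[-1]:

```python
def find_overlap(T):
    """Iterative-find-overlap, transcribed to Python.

    overlap[j] = length of the longest proper prefix of T that is
    also a suffix of T[0..j]."""
    m = len(T)
    overlap = [0] * m + [-1]   # last slot serves as overlap[-1] = -1
    for j in range(m):
        x = overlap[j - 1]
        while x >= 0 and T[j] != T[x]:
            x = overlap[x - 1]
        overlap[j] = x + 1
    return overlap[:m]          # drop the sentinel

print(find_overlap("ababcabababc"))  # [0, 0, 1, 2, 0, 1, 2, 3, 4, 3, 4, 5]
```

Running it on the exercise string reproduces the values listed above.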
What is the running time of this algorithm? It looks like it may be
O(m^2)--there are m iterations and within each, the while loop may
execute up to m times.
A clever analysis shows that this algorithm is in fact linear time!
Note how the value of x evolves over the execution of the
algorithm. We could have written the algorithm slightly differently,
as follows:
Iterative-find-overlap
    overlap[-1] = -1; x = -1
    For j=0 to m-1 do
        while (x >= 0 and T[j] != T[x])
            x = overlap[x-1]
        overlap[j] = x+1; x = x+1
This computes exactly the same values as before; the only difference
is that instead of re-initializing x at the beginning of each
iteration, we carry over the value computed in the previous
iteration.
Now note that x increases exactly m times---once in each
iteration. Every time the while loop body executes, the value of x
strictly decreases (since overlap[x-1] <= x-1 < x). But since x never
goes below its initial value of -1, the number of decreases can be at
most as large as the number of increases, which is m. Therefore, the
while loop body executes at most m times in total, and the running
time of the algorithm is O(m).
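The amortized argument can be checked empirically. This sketch (my
naming) instruments the second version of the algorithm to count how
often the while-loop body runs, and compares the count against m:

```python
def find_overlap_counted(T):
    """Iterative-find-overlap (second version), instrumented to count
    executions of the while-loop body."""
    m = len(T)
    overlap = [0] * m + [-1]     # last slot serves as overlap[-1] = -1
    decreases = 0
    x = -1
    for j in range(m):
        while x >= 0 and T[j] != T[x]:
            x = overlap[x - 1]   # x strictly decreases here
            decreases += 1
        overlap[j] = x + 1
        x = x + 1                # x increases exactly once per iteration
    return overlap[:m], decreases

# The while body never runs more than m times in total:
for T in ["ababcabababc", "a" * 50, "abcde", "nano"]:
    overlap, decreases = find_overlap_counted(T)
    print(f"m = {len(T):2}, while-body executions = {decreases}")
```

For instance, on T = "a"*50 the while body never runs at all, while
on "abcde" it runs once per mismatch; in every case the count stays
at or below m, as the analysis promises.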