---------------------------------------------------------------------
CS 577 (Intro to Algorithms)
Lecture notes: string matching
Shuchi Chawla
---------------------------------------------------------------------

Pattern matching or string matching
===================================

Problem: We are given two strings---a long string S of length n, and
a short string T of length m. We want to find all the occurrences of
T in S.

Eg. S = banananobanano and T = nano. T occurs in S twice, starting at
S[4] and S[10]. (Note that we think of S as an array with index
starting at 0.)

We want to solve this problem in linear time. First, let's think of a
naive algorithm for solving this. For every index value i, we could
test to see if T occurs in S starting at S[i]. The algorithm would
look like the following:

    for (i=0 to i=n-1) {
        set flag = true
        for (j=0 to j=m-1) {
            if (i+j >= n or S[i+j] != T[j]) {
                set flag = false;
                break;
            }
        }
        if (flag == true)
            output i
    }

How long does this algorithm take? The outer loop executes n times
and the inner one takes O(m) time in each iteration. Therefore, the
total time is O(mn). This is already polynomial time, but for long
strings (say, S is a PDF document with 10,000 characters, and T is a
sentence with 50 characters) this can be quite slow.

Let's see what an execution of this algorithm looks like on the above
example. The grid below shows the comparisons done.

           0 1 2 3 4 5 6 7 8 9 10 11 12 13
    S:     b a n a n a n o b a n  a  n  o

    i=0:   X
    i=1:     X
    i=2:       n a n X
    i=3:         X
    i=4:           n a n o
    i=5:             X
    i=6:               n X
    i=7:                 X
    i=8:                   X
    i=9:                     X
    i=10:                      n  a  n  o
    i=11:                         X
    i=12:                            n  X
    i=13:                               X

We can make two observations. One, the algorithm doesn't always do m
comparisons in each iteration of the outer loop. However, there can
certainly exist examples where it does O(m) comparisons in EVERY
iteration (e.g. S = aaaaaaaa and T = aaab). So we cannot obtain an
improvement in the running time by simply improving our analysis.
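Before improving on this, here is the naive scan written out as
runnable Python (the function name naive_match is our own); it stops
comparing at the first mismatch, exactly as the comparison grid above
assumes.

```python
def naive_match(S, T):
    """Return every index i such that T occurs in S starting at S[i].

    Naive O(mn) scan: for each shift i, compare T letter by letter
    against S[i..i+m-1], stopping at the first mismatch.
    """
    n, m = len(S), len(T)
    matches = []
    for i in range(n):
        found = True
        for j in range(m):
            if i + j >= n or S[i + j] != T[j]:
                found = False      # mismatch (or ran off the end of S)
                break
        if found:
            matches.append(i)
    return matches

print(naive_match("banananobanano", "nano"))  # [4, 10]
```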
Second, note that every time we find an instance of T "nano", we can
simply skip the next three steps in the outer loop, because we know
that for the next three values of i, we are not going to find a
match. Likewise, when we have matched the three letters "nan" and the
next one doesn't match (e.g. for i=2 above), we can skip the next
value of i (i=3 above); furthermore, for the following value of i
(i=4 above), we already know that the first letter matches with the
first letter in T, so we can save some work by just resuming checking
with the second letter in T. Such savings can altogether lead to a
considerable improvement in running time.

More formally, suppose that we have matched T up to the j-th position
with S[i]..S[i+j]. Suppose also that S[i+j+1] != T[j+1]. Then how
much work can we save in the next iteration? Suppose that a prefix
T[0..x] of T matches a suffix of T[0..j], i.e. T[0..x] = T[j-x..j].
Then we know that S[i+j-x .. i+j] = T[j-x..j] = T[0..x]. So we can
continue matching S[i+j+1] with T[x+1], i.e. set the next value of i
to i+j-x and j = x+1. Moreover, if T[0..x] is the longest prefix of T
that is also a suffix of T[0..j], then we can skip all values of i
smaller than i+j-x.

In the example above, suppose that i=2, and j=2. That is, we matched
S[2..4] (nan) with T[0..2] (nan). Then, note that T[0] (n) is a
suffix of T[0..2] (nan). So, S[4] matches with T[0], but no
occurrence of T starts at i=3. So we can resume matching S with T at
i=4 and j=1.

Our new and more efficient algorithm is as follows. We assume that we
have computed these overlap values for T---for every j, overlap[j] is
the maximum value of x+1 over all x < j such that T[0..x] =
T[j-x..j]; in other words, overlap[j] is the length of the longest
proper prefix of T that is also a suffix of T[0..j]. (We take
overlap[j] = 0 when no such prefix exists, and overlap[-1] = -1 as a
convenient sentinel.) Given the overlap values, the matching
algorithm is the following:

    i = 0; j = 0
    while (i+j < n)
        if (S[i+j] == T[j])
            j = j+1
            if (j == m)
                output i
                i = i + m - overlap[m-1]; j = overlap[m-1]
        else if (j == 0)
            i = i+1
        else
            i = i + j - overlap[j-1]; j = overlap[j-1]

Every iteration either increases i+j (when S[i+j] matches T[j]) or
increases i while leaving i+j unchanged (on a mismatch, or after
reporting a match). Both i and i+j never decrease and never exceed n,
so the algorithm performs O(n) iterations.

How do we compute the overlap values? Note that the longest prefix of
T that is a suffix of T[0..j] must extend, by the letter T[j], some
prefix of T that is a suffix of T[0..j-1]; and if the longest such
prefix cannot be extended, the next candidate is itself given by the
overlap values. This suggests the following recursive algorithm:

    Find-overlap(T,j)
        if (j == -1)
            return -1
        x = Find-overlap(T,j-1)
        while (x >= 0 and T[j] != T[x])
            x = Find-overlap(T,x-1)
        overlap[j] = x+1
        return overlap[j]

We can speed up this algorithm via memoization. In particular, we
store the result to each recursive call in the array overlap.
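The skip-ahead strategy described above can be sketched in Python as
follows. The names kmp_match and overlap are ours, and the list
overlap is assumed to be precomputed, with overlap[j] equal to the
length of the longest proper prefix of T that is also a suffix of
T[0..j].

```python
def kmp_match(S, T, overlap):
    """Find all occurrences of T in S using precomputed overlap values.

    overlap[j] = length of the longest proper prefix of T that is
    also a suffix of T[0..j].  Runs in O(n) time: every step either
    advances i+j (on a match) or advances i (otherwise).
    """
    n, m = len(S), len(T)
    matches = []
    i, j = 0, 0                    # currently matching T[j] against S[i+j]
    while i + j < n:
        if S[i + j] == T[j]:
            j += 1
            if j == m:             # full match starting at i
                matches.append(i)
                i, j = i + m - overlap[m - 1], overlap[m - 1]
        elif j == 0:
            i += 1                 # no partial match to reuse
        else:
            # Reuse the overlap: the last overlap[j-1] matched letters
            # are also a prefix of T, so resume comparing from there.
            i, j = i + j - overlap[j - 1], overlap[j - 1]
    return matches

print(kmp_match("banananobanano", "nano", [0, 0, 1, 0]))  # [4, 10]
```

For T = nano the overlap values are [0, 0, 1, 0]: only T[0..2] = nan
has a nonempty proper prefix ("n") that is also a suffix.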
Another way of writing this algorithm is as follows:

    Iterative-find-overlap
        overlap[-1] = -1
        For j=0 to m-1 do
            x = overlap[j-1]
            while (x >= 0 and T[j] != T[x])
                x = overlap[x-1]
            overlap[j] = x+1

Note that we simply replaced the recursive calls by the values
computed previously and stored in the array overlap.

As an exercise, try this algorithm on the following string:

    T = a b a b c a b a b a b c

You should get the result 0 0 1 2 0 1 2 3 4 3 4 5.

What is the running time of this algorithm? It looks like it may be
O(m^2)---there are m iterations and within each, the while loop may
execute up to m times. A clever analysis shows that this algorithm is
in fact linear time!

Note how the value of x proceeds over the execution of the
algorithm. We could have rewritten the algorithm slightly differently
as follows:

    Iterative-find-overlap
        overlap[-1] = -1; x = -1;
        For j=0 to m-1 do
            while (x >= 0 and T[j] != T[x])
                x = overlap[x-1]
            overlap[j] = x+1;
            x = x+1;

This computes exactly the same values as before. Only, instead of
initializing x at the beginning of each iteration, we remember the
value we computed in the previous iteration. Now note that x
increases exactly m times---once in each iteration. On the other
hand, every time the while loop executes, the value of x decreases:
overlap[x-1] is at most x-1, since it is the length of a proper
prefix of the length-x string T[0..x-1]. Since x never goes below its
initial value of -1, the number of decreases can be at most as large
as the number of increases, which is m. Therefore, the while loop
executes at most m times over the entire run, and the total running
time of the algorithm is O(m).
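The same amortized argument carries over to this Python sketch of the
overlap computation (find_overlap is our name for it; Python lists
have no slot at index -1, so the -1 sentinel is folded into an
explicit equality test):

```python
def find_overlap(T):
    """Compute, for each j, the length of the longest proper prefix
    of T that is also a suffix of T[0..j].  O(m) amortized: x is
    incremented at most m times, and every while-iteration decreases x.
    """
    m = len(T)
    overlap = [0] * m
    x = 0                          # overlap value carried over from j-1
    for j in range(1, m):
        # Shrink the candidate prefix until T[j] can extend it.
        while x > 0 and T[j] != T[x]:
            x = overlap[x - 1]
        if T[j] == T[x]:
            x += 1
        overlap[j] = x
    return overlap

print(find_overlap("ababcabababc"))  # [0, 0, 1, 2, 0, 1, 2, 3, 4, 3, 4, 5]
```

Running it on the exercise string reproduces the answer given above.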