---------------------------------------------------------------------
CS 577 (Intro to Algorithms)
Lec 11 (10/10/06) Shuchi Chawla
---------------------------------------------------------------------
Today we'll see two other examples of dynamic programming. These
problems arise mainly in computational biology but have a number of
other applications as well.
Example 1: Longest Common Subsequence
=====================================
The problem
-----------
We are given two strings, S of length n and T of length m.
E.g., S = ABAZDC
      T = BACBAD
We want to find the longest sequence of characters that appear
left-to-right (but not necessarily in a contiguous block) in both
strings. This is called the Longest Common Subsequence or LCS.
E.g., here, LCS has length 4: ABAD.
For instance, this is used in genetics: given two DNA fragments, the
LCS gives information about what they have in common and the best way
to line them up.
Let's solve the LCS problem using Dynamic Programming. Subproblems:
look at LCS of prefix of S and prefix of T. For simplicity, we'll
worry first about finding *length* of longest and then modify to
produce the sequence.
So, here's the question: say LCS[i,j] is the length of the LCS of
S[1..i] with T[1..j]. How can we solve for LCS[i,j] in terms of the
LCS for the smaller problems?
Case 1: what if S[i]!=T[j]? Then, the answer LCS has to ignore one
of S[i] or T[j] so LCS[i,j] = max(LCS[i-1,j], LCS[i,j-1]).
Case 2: what if S[i]==T[j]? Then the LCS of S[1..i] and T[1..j]
might as well match them up. If I gave you a common subsequence that
matched S[i] to an earlier location in T, for instance, you could
always match S[i] to T[j] instead. So, the solution is 1 +
LCS[i-1,j-1].
So, we can just do two loops, filling in the LCS using these rules.
Here's what it looks like pictorially:
| B A C B A D
--+------------
A | 0 1 1 1 1 1
B | 1 1 1 2 2 2
A | 1 2 2 2 3 3
Z | 1 2 2 2 3 3
D | 1 2 2 2 3 4
C | 1 2 3 3 3 4
Just fill out row by row, doing constant amount of work per entry, so
O(mn) time overall.
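For concreteness, here is how that double loop might look in C (a
minimal sketch; the flat row-major table and the name lcs_table are
choices made for this sketch, not part of the lecture):

#include <stdlib.h>

/* Entry (i,j) holds the length of the LCS of S[1..i] and T[1..j].
   The (n+1) x (m+1) table is stored row-major, so entry (i,j)
   lives at tab[i*(m+1)+j]; row 0 and column 0 stay 0, since the
   LCS against an empty prefix is empty.  S and T are ordinary
   0-indexed C strings, so S[i-1] plays the role of S[i] above. */
int *lcs_table(const char *S, int n, const char *T, int m)
{
    int *tab = calloc((size_t)(n+1) * (m+1), sizeof(int));
    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= m; j++) {
            if (S[i-1] == T[j-1])   /* Case 2: match them up */
                tab[i*(m+1)+j] = 1 + tab[(i-1)*(m+1)+(j-1)];
            else {                  /* Case 1: ignore S[i] or T[j] */
                int up = tab[(i-1)*(m+1)+j], left = tab[i*(m+1)+(j-1)];
                tab[i*(m+1)+j] = up > left ? up : left;
            }
        }
    return tab;   /* tab[n*(m+1)+m] is the LCS length; caller frees */
}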
To find the sequence itself: walk backwards through the matrix,
moving to the larger of the entries to the left and above. If both
are smaller than the current entry, move diagonally and output the
matched character. (Or, one could modify the algorithm to fill the
matrix with sequences instead of numbers.)
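Here is a sketch of that backward walk in C, assuming the table
returned by lcs_table above; on S = ABAZDC, T = BACBAD it prints
ABAD:

#include <stdio.h>

/* Walk backwards from entry (n,m): on a character match, step
   diagonally and record the character; otherwise step to the
   larger of the entries above and to the left. */
void print_lcs(const char *S, int n, const char *T, int m,
               const int *tab)
{
    char buf[(n < m ? n : m) + 1];   /* LCS length is <= min(n,m) */
    int k = tab[n*(m+1)+m];          /* length of the LCS */
    buf[k] = '\0';
    int i = n, j = m;
    while (i > 0 && j > 0) {
        if (S[i-1] == T[j-1]) {      /* diagonal step: record it */
            buf[--k] = S[i-1];
            i--; j--;
        } else if (tab[(i-1)*(m+1)+j] >= tab[i*(m+1)+(j-1)]) {
            i--;                     /* entry above is at least as large */
        } else {
            j--;                     /* entry to the left is larger */
        }
    }
    puts(buf);
}

(When the characters match, stepping diagonally is always safe, by
the same exchange argument as in Case 2 above.)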
We've been looking at "bottom-up Dynamic Programming". Here's another
way of thinking about DP, that also leads to basically the same
algorithm, but viewed from the other direction. Sometimes this is
called "top-down Dynamic Programming".
----------------------------------------------------------------------
How can we implement the same algorithm recursively using a top-down
approach?
First let's try to solve this problem recursively without worrying
about the running time. Here is an attempt (arrays start at 1):
LCS(S,n,T,m)
{
  if (n==0 || m==0) return 0;
  if (S[n] == T[m]) result = 1 + LCS(S,n-1,T,m-1); // no harm in matching up
  else result = max( LCS(S,n-1,T,m), LCS(S,n,T,m-1) );
  return result;
}
Memoized version: start by initializing arr[i][j]=unknown for all i,j.
LCS(S,n,T,m)
{
  if (n==0 || m==0) return 0;
  if (arr[n][m] != unknown) return arr[n][m]; // <- added this line
  if (S[n] == T[m]) result = 1 + LCS(S,n-1,T,m-1);
  else result = max( LCS(S,n-1,T,m), LCS(S,n,T,m-1) );
  arr[n][m] = result; // <- and this line
  return result;
}
Running time is O(nm). Why?
-> we get to the second-to-last line at most n*m times
=> at most 2nm recursive calls.
=> O(nm) running time.
Comparing bottom-up and top-down: top-down (memoized) pays a penalty
in recursion overhead, but can be faster for problems where a
reasonable fraction of the subproblems never get examined at all, so
we can avoid computing them.
----------------------------------------------------------------------
Discussion and Extensions
=========================
Equivalent problem: "minimum edit distance", where the legal
operations are insert and delete. (E.g., unix "diff", where S and T
are files, and the elements of S and T are lines of text.) The
minimum edit distance to transform S into T is achieved by doing
|S| - LCS(S,T) deletes and |T| - LCS(S,T) inserts. For instance, with
S = ABAZDC and T = BACBAD above, LCS(S,T) = 4, so we can transform S
into T with 2 deletes (remove Z and C to get ABAD) and 2 inserts (add
B and C to get BACBAD).
In computational biology applications, often one has a more general
notion of sequence alignment. Many of these different problems all
allow for basically the same kind of DP solution.
See Sections 6.6 & 6.7 in the book for a discussion of the general
sequence alignment problem.
----------------------------------------------------------------------
Example 2: Pattern Matching
===========================
Problem: We are given two strings---a long string S of length n, and a
short string T of length m. We want to find all the occurrences of T in
S.
E.g., S = banananobanano and T = nano. T occurs in S twice, starting at
S[4] and S[10]. (Note that we think of S as an array with index
starting at 0.)
We want to solve this problem in linear time.
First, let's think of a naive algorithm for solving this. For every
index value i, we could test to see if T occurs in S starting at
S[i]. The algorithm would look like the following:
for (i=0 to n-1) {
  set flag = true
  for (j=0 to m-1) {
    if (i+j > n-1 or S[i+j] != T[j]) { set flag = false; break }
  }
  if (flag == true) output i
}
How long does this algorithm take? The outer loop executes n times
and the inner one takes O(m) time in each iteration. Therefore, the
total time is O(mn). This is already polynomial time, but for long
strings (say, S is a PDF document with 10,000 characters, and T is a
sentence with 50 characters) this can be quite slow.
Let's see what an execution of this algorithm looks like on the above
example. The grid below shows the comparisons done.
       0 1 2 3 4 5 6 7 8 9 10 11 12 13
S:     b a n a n a n o b a n  a  n  o
i=0:   X
i=1:     X
i=2:       n a n X
i=3:         X
i=4:           n a n o
i=5:             X
i=6:               n X
i=7:                 X
i=8:                   X
i=9:                     X
i=10:                      n  a  n  o
i=11:                         X
i=12:                            n  X
i=13:                               X
We can make two observations. One, the algorithm doesn't always do m
comparisons in each iteration of the outer loop. However, there
certainly exist examples where it does O(m) comparisons in EVERY
iteration (e.g., S = aaa...a and T = aa...ab), so we cannot obtain an
improvement in the running time simply by improving our analysis.
Second, note that every time we find an instance of T "nano", we can
simply skip the next three steps in the outer loop, because we know
that for the next three values of i, we are not going to find a
match. Likewise, when we have matched the three letters "nan" and the
next one doesn't match (e.g. for i=2 above), we can skip the next
value of i (i=3 above); furthermore, for the following value of i (i=4
above), we already know that the first letter matches with the first
letter in T, so we can save some work by just resuming checking with
the second letter in T.
Such savings can altogether lead to a considerable improvement in
running time.
More formally, suppose that we have matched T up to the j-th
position, i.e., T[0..j] against S[i]..S[i+j]. Suppose also that
S[i+j+1] != T[j+1]. Then how much work can we save in the next
iteration? Suppose that the prefix T[0..x] of T matches a suffix of
T[0..j], i.e. T[0..x] = T[j-x..j]. Then we know that
S[i+j-x .. i+j] = T[j-x..j] = T[0..x]. So we can continue matching
S[i+j+1] against T[x+1], i.e. set the next value of i to i+j-x and j
to x+1. Moreover, if T[0..x] is the longest prefix of T that is also
a suffix of T[0..j], then we can skip all values of i smaller than
i+j-x.
In the example above, suppose that i=2, and j=2. That is, we matched
S[2..4] (nan) with T[0..2] (nan). Then, note that T[0] (n) is a suffix
of T[0..2] (nan). So, S[4] matches with T[0], but no occurence of T
starts at i=3. So we can resume matching S with T at i=4 and j=1.
Our new and more efficient algorithm is as follows. We assume that we
have computed these overlap values for T---for every j, overlap[j] is
the length of the longest prefix of T, other than T[0..j] itself,
that is a suffix of T[0..j]. In the notation above, overlap[j] = x+1
for the largest such x, and overlap[j] = 0 if no prefix matches.
Then, whenever a mismatch follows a match of T[0..j], we resume
comparing at T[overlap[j]] without ever moving backwards in S.
How do we compute the overlap values? Here is a recursive algorithm
(with the convention that overlap[-1] = -1):
Find-overlap(T,j)
  if (j == -1) return -1
  x = Find-overlap(T,j-1)
  while (x >= 0 and T[j] != T[x])
    x = Find-overlap(T,x-1)
  return x+1
We can speed up this algorithm via memoization. In particular, we
store the result of each recursive call in the array overlap. Another
way of writing this algorithm is as follows:
Iterative-find-overlap
  overlap[-1] = -1
  For j=0 to m-1 do
    x = overlap[j-1]
    while (x>=0 and T[j] != T[x])
      x = overlap[x-1]
    overlap[j] = x+1
Note that we simply replaced the recursive calls by the values
computed previously and stored in the array overlap.
As an exercise, try this algorithm on the following string
T = a b a b c a b a b a b c
You should get the result 0 0 1 2 0 1 2 3 4 3 4 5.
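If you want to check your answer mechanically, here is the iterative
algorithm in C (a sketch; since C arrays cannot be indexed at -1, the
pseudocode's overlap[-1] is replaced by an explicit -1 whenever an
index would drop below 0):

#include <stdio.h>
#include <string.h>

/* overlap[j] = length of the longest prefix of T, other than
   T[0..j] itself, that is a suffix of T[0..j]. */
void find_overlap(const char *T, int m, int *overlap)
{
    for (int j = 0; j < m; j++) {
        int x = (j == 0) ? -1 : overlap[j-1];
        while (x >= 0 && T[j] != T[x])
            x = (x == 0) ? -1 : overlap[x-1];
        overlap[j] = x + 1;
    }
}

int main(void)
{
    const char *T = "ababcabababc";  /* the exercise string */
    int m = (int)strlen(T);
    int overlap[m];                  /* C99 variable-length array */
    find_overlap(T, m, overlap);
    for (int j = 0; j < m; j++)
        printf("%d ", overlap[j]);   /* prints 0 0 1 2 0 1 2 3 4 3 4 5 */
    printf("\n");
    return 0;
}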
What is the running time of this algorithm? It looks like it may be
O(m^2)--there are m iterations and within each, the while loop may
execute up to m times.
A clever analysis shows that this algorithm is in fact linear time!
The key is to watch how the value of x evolves over the execution of
the algorithm. We could have rewritten the algorithm slightly
differently as follows:
Iterative-find-overlap
  overlap[-1] = -1; x = -1;
  For j=0 to m-1 do
    while (x>=0 and T[j] != T[x])
      x = overlap[x-1]
    overlap[j] = x+1; x = x+1;
This computes exactly the same values as before; the only difference
is that instead of initializing x at the beginning of each iteration,
we remember the value we computed in the previous iteration.
Now note that x increases exactly m times -- once in each
iteration. On the other hand, every time the while loop executes, the
value of x strictly decreases, since overlap[x-1] is at most x-1.
Since x never goes below its initial value of -1, the number of
decreases can be at most as large as the number of increases, which
is m. Therefore, the while loop executes at most m times over the
entire run, and the total running time of the algorithm is O(m).
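To tie the pieces together, here is a C sketch of the complete
linear-time matching algorithm; it reuses find_overlap from the
sketch above, and the name find_matches is a choice made here, not
part of the lecture:

#include <stdio.h>

void find_overlap(const char *T, int m, int *overlap); /* sketch above */

/* Report every occurrence of T (length m) in S (length n).  j is
   the number of characters of T matched so far; on a mismatch we
   fall back to overlap[j-1] instead of restarting from scratch,
   which is exactly the skip rule derived earlier, so i never
   moves backwards. */
void find_matches(const char *S, int n, const char *T, int m)
{
    int overlap[m];                  /* C99 variable-length array */
    find_overlap(T, m, overlap);

    int j = 0;
    for (int i = 0; i < n; i++) {
        while (j > 0 && S[i] != T[j])
            j = overlap[j-1];        /* skip to the longest usable prefix */
        if (S[i] == T[j])
            j++;
        if (j == m) {                /* a match ends at position i */
            printf("match at %d\n", i - m + 1);
            j = overlap[m-1];        /* allow overlapping occurrences */
        }
    }
}

On S = banananobanano and T = nano this prints matches at 4 and 10,
agreeing with the trace above. Computing overlap takes O(m), and the
scan of S takes O(n) by the same increase/decrease argument (j
increases at most n times, hence decreases at most n times), for
O(n+m) overall.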