Locating Tandem Repeats

Debby Joseph (joseph@cs.wisc.edu) Amy Hauth (kryder@cs.wisc.edu)

This project involves developing software tools for locating repetitive regions within DNA sequences. Identifying and locating repeats will help biologists learn more about repetitive regions. In addition, computational analysis of DNA sequences becomes increasingly complex when repetitive DNA occurs in the sequences under analysis. Thus, identification of repetitive DNA is a first step towards enabling biologists to understand DNA sequences and computational biologists to solve more complex analysis problems involving DNA sequences.

Tandem repeats are one type of repetitive DNA. A tandem repeat is a string of characters that recurs consecutively within a larger string. In biological terms, it is a head-to-tail concatenation of basic units within a DNA sequence, where both the DNA sequence and the basic unit are composed of the bases A, C, G and T.
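As an illustration of this definition, the following Python sketch checks whether a region is a perfect head-to-tail concatenation of a basic unit. The function name is hypothetical and is not part of the project's software. It lets the region begin at any base of the unit, on the assumption that a repeat need not start on a unit boundary.

```python
def is_perfect_tandem_repeat(region, unit):
    """Return True if `region` reads as consecutive copies of `unit`.

    The region may start at any base of the unit (any "phase"), but each
    following base must be the next base of the unit, wrapping from the
    last base of the unit back to the first.
    """
    m = len(unit)
    if m == 0 or not region:
        return False
    for phase in range(m):
        # Compare every base of the region with the base the unit
        # predicts at that position for this starting phase.
        if all(region[i] == unit[(phase + i) % m] for i in range(len(region))):
            return True
    return False


print(is_perfect_tandem_repeat("GGCAGGCAGGCA", "GGCA"))  # True
print(is_perfect_tandem_repeat("GGCATTCA", "GGCA"))      # False
```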

A tandem repeat can occur at any position in the DNA sequence. It can start with any base in the basic unit but must continue the repeat with the next base in the basic unit. For example, the tandem repeat CAGGCAGGCAGGCAG has a basic unit of GGCA. The region begins with C and continues with A, the next base in the basic unit. (Note: the subsequent base is the first base in the basic unit. This is what is meant by a head-to-tail concatenation of basic units.)

Practical Issues

Tandem repeats are rarely perfect when they occur in DNA sequences. There are two main types of imperfection: substitutions and indels (insertions or deletions). A substitution occurs when one or more bases in the sequence do not match the bases expected at that point in the basic unit and the length of the mismatch is the same in both the sequence and the basic unit. For example, if two bases in the sequence do not match the basic unit, then the third base in the sequence should match the third base in the basic unit, counting from the most recent matching base (e.g. a sequence of GGCATTCA and a basic unit of GGCA). An indel occurs when one or more bases in the sequence do not match the bases expected at that point in the basic unit and the length of the mismatch differs between the sequence and the basic unit. For example, two bases in the sequence do not match the basic unit, and when matching resumes it does so at the point in the basic unit where it left off, so no bases of the basic unit fall within the mismatch (e.g. a sequence of GGCATTGGCA and a basic unit of GGCA). One of the challenges within this project is to identify repeats which may be imperfect but not too imperfect.
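The difference between the two kinds of imperfection can be seen with a simple position-by-position comparison. The sketch below is a hypothetical helper, not the project's method. It lists the positions at which a sequence disagrees with endless copies of the basic unit. A substitution disturbs only the substituted positions, while an indel shifts every later base out of step with the unit, which is why plain position-wise matching is not enough for locating imperfect repeats.

```python
def mismatch_positions(sequence, unit, phase=0):
    """Return the 0-based positions where `sequence` differs from the
    periodic extension of `unit`, starting at base `phase` of the unit."""
    m = len(unit)
    return [i for i, base in enumerate(sequence)
            if base != unit[(phase + i) % m]]


# Substitution example from the text: only the two substituted bases differ.
print(mismatch_positions("GGCATTCA", "GGCA"))    # [4, 5]

# Indel example from the text: the inserted TT shifts everything after it.
print(mismatch_positions("GGCATTGGCA", "GGCA"))  # [4, 5, 6, 7, 8, 9]
```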

TRLA Project Methodology

We have used dynamic programming with wraparound (Wraparound Dynamic Programming) to locate all tandem repeats which have a basic unit as specified in the input file. The output will be used in conjunction with a graphing program (gnuplot) to identify plateaus. These plateaus represent tandem repeat regions. Wraparound Dynamic Programming is a two-pass process which combines traditional dynamic programming with a second pass to wrap scores within a row. It is this second pass which is critical for identifying tandem repeats. This method is described in a paper by Fischetti et al. (reference 1 below). The scoring function of interest is found on page 118.
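A minimal Python sketch of the two-pass idea follows. It is not the project's implementation and does not reproduce the scoring function from the cited paper: the function name, the score values (match 2, mismatch -1, gap -2) and the local-alignment floor of 0 are all assumptions chosen for illustration. Each row corresponds to one base of the sequence and each column to one base of the basic unit.

```python
def wraparound_dp(sequence, unit, match=2, mismatch=-1, gap=-2):
    """Score `sequence` against endless copies of `unit`.

    Returns one value per sequence position: the best score of any
    local alignment ending at that position. Score values are
    illustrative assumptions, not those of the cited paper.
    """
    m = len(unit)
    prev = [0] * m              # row for the previous sequence base
    best_per_position = []
    for base in sequence:
        cur = [0] * m
        # Pass 1: traditional dynamic programming. The diagonal
        # predecessor of column 0 is the LAST column of the previous
        # row, so one copy of the unit continues into the next.
        for j in range(m):
            score = match if base == unit[j] else mismatch
            diag = prev[(j - 1) % m] + score
            up = prev[j] + gap                   # extra base in the sequence
            left = cur[j - 1] + gap if j > 0 else 0  # skipped base of the unit
            cur[j] = max(0, diag, up, left)
        # Pass 2: wrap scores within the row, so a gap can run from the
        # last column back into column 0 and onward.
        cur[0] = max(cur[0], cur[m - 1] + gap)
        for j in range(1, m):
            cur[j] = max(cur[j], cur[j - 1] + gap)
        best_per_position.append(max(cur))
        prev = cur
    return best_per_position


# A perfect repeat gains one match score per base.
print(wraparound_dp("GGCAGGCA", "GGCA"))  # [2, 4, 6, 8, 10, 12, 14, 16]
```

A list of per-position scores like this is the kind of output that could be written to a file and plotted. Because the gap score is negative, a single sweep in the second pass is enough: wrapping around a row more than once can never improve a score.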

1 Fischetti, V.A., et al. "Identifying Periodic Occurrences of a Template with Applications to Protein Structure." Lecture Notes in Computer Science. Proceedings of the Third Annual Symposium on Combinatorial Pattern Matching. Apr-May 1992. Arizona. pp. 111-120 (Wendt Library) or pp. 109-118 (Debby).