Abstract: BLAST (Basic Local Alignment Search Tool) and Automat are two programs that are used in sequence analysis that make use of deterministic finite Automata. These programs search large protein or nucleic acid databases in order to find segments of sequences similar to a sequence being queried. While their function is similar, BLAST and Automat implement their search algorithms and finite automata somewhat differently. While hash tables can also be used to implement the search algorithms, DFAs provide a method that is both faster and more space efficient, and thus has become the implementation of choice. Given that BLAST is implemented in a way most digestable to CS 520 students, it will comprise the rest of our discussion. Introduction: The advent of modern sequencing techniques have enabled molecular biologists to obtain sequences of DNA and proteins from various organisms, and compile this genetic information into huge databases. Tools are needed not only to search these databases for specific "query sequences" (i.e., a sequence that one might obtain from an unknown gene in a lab), but also to find sequences that show similarities to these sequences. Finding and quantifying similarities between sequences enables scientists to predict evolutionary relationships and potential function of these sequences (sequences that are useful to an organism will be highly conserved throughout evolution). In it's search for sequences similar to the query sequence, BLAST finds the most similar sequences called Maximal Segment Pairs (MSPs), which are in laymans terms best matches. These MSP's refer to a subsequence of matches that produced the highest score according to BLAST's scoring algorithm (anything shorter or longer would have produced a lower score). As mentioned earlier, such searches tell scientists what organisms the similar (and exact) matches come from, and how similar they are to the query sequence. This allows the deduction of evolutionary and functional relationships. How BLAST finds the MSP's: In order to search the target sequence, BLAST first constructs "words" from the query sequence (the length of these words is specified by a parameter called w and is usually 3-4 for amino-acid sequences and > 11 for nucleic acid sequences) and then searches the target sequence for matches using these words. The list of words is the set of all substrings of the query sequence of length w. Each match made with the target sequence has a certain score, and those regions in the target sequence with a high enough score to get above a certain threshold are called triggering matches. BLAST then searches the region around these triggering matches in an attempt to maximize the score. Once the score falls below a certain value. BLAST reports the local alignment generated as a Maximal Segment Pair (MSP). There are different approaches of searching for the triggering events. One obvious approach is using a hash table. Unfortunately, this is extremely space-inefficient and takes a long time. A faster approach is using an algorithm based on DFAs (1). In this approach, a DFA is constructed in order to recognize each word of length that is a substring of the query sequence. (i.e., the alphabet is all nucleic acids (A,T,G,C) or all amino acids). These DFAs are then used to search for triggering matches instead of hashtables--this approach saves both time and space. Example: Let's say we've discovered a polypeptide sequence and want to find similar sequences in the database stored at National Center for Biotechnology Information (NCBI). Let the query sequence be: (each letter stands for one amino acid. There are 20 amino acids, and the symbols for all amino acids is treated like an alphabet)
LEDKVEELLS KNYHLENEVA RL
BLAST will search the database and return a user-specified number of similar sequences, in decreasing order of similarity. So, blast returns the exact match:
>gi|2218122 (U94951) hinge-region and leucine-zipper with N-terminal PelB-leader and C-terminal hexahistidine tag [Cloning vector pZIP1] Length = 84Score = 109 (47.7 bits), Expect = 1.1e-09, P = 1.1e-09 Identities = 22/22 (100%), Positives = 22/22 (100%) Query: 1 LEDKVEELLSKNYHLENEVARL 22 LEDKVEELLSKNYHLENEVARL Sbjct: 48 LEDKVEELLSKNYHLENEVARL 69
and some other matches with lower scores:
>gnl|PID|e325261 (Z97204) hypothetical protein [Schizosaccharomyces pombe] Length = 932 Score = 43 (18.8 bits), Expect = 2.6, Sum P(2) = 0.93 Identities = 8/13 (61%), Positives = 11/13 (84%) Query: 2 EDKVEELLSKNYH 14 E+K+EE L KNY+ Sbjct: 558 EEKIEEGLLKNYY 570 Score = 31 (13.6 bits), Expect = 2.6, Sum P(2) = 0.93 Identities = 5/7 (71%), Positives = 7/7 (100%) Query: 14 HLENEVA 20 HLEN++A Sbjct: 842 HLENKIA 848
Note that the MSPs only contain a substring of the original query sequence and a substring of the target sequence. The +'s indicate similar amino acids. Conclusion BLAST is a program that has become ubiquitous in biocomputing due to its speed and usefulness. It does not, however compensate for insertions and deletions that frequently occur in nature (i.e., it does not insert gaps). BLAST is also an example of the usefulness of the language recognition task that can be performed efficiently by deterministic finite automata.References:
1) Altschul, S. Gish, W. Miller W. Myers E. Lipman D. Basic Local Alignment Search Tool. J. Mol. Biol. (1990) 215, 403-410. 2) Alschul, S. Gish, W. Local Alignment Statistics. (1996) Methods in Enzymology (1996) 266, 460-481. 3) Cantalloube, H. et. al., Automat and BLAST: comparison of two protein sequence similarity search programs. CABIOS (1995) 11, 261-272 4) Wisconsin Package Version 9.0, Genetics Computer Group (GCG), Madison, Wisc.