Homework 4 for CS520

Protein Sequence Similarity Search Programs

by Oguz Yetkin, Andrew Paupore

 
Abstract:   

	BLAST (Basic Local Alignment Search Tool) and Automat 
are two programs that are used in sequence analysis that make use 
of deterministic finite Automata.  These programs
search large protein or nucleic acid databases in order to find segments
of sequences similar to a sequence being queried.  While their function is
similar, BLAST and Automat implement their search algorithms and finite
automata somewhat differently. While hash tables can also be used to
implement the search algorithms, DFAs provide a method that is both faster
and more space efficient, and thus has become the implementation of
choice.  Given that BLAST is implemented in a way most digestable to CS
520 students, it will comprise the rest of our discussion. 



Introduction:


	The advent of modern sequencing techniques have enabled molecular
biologists to obtain sequences of DNA and proteins from various organisms,
and compile this genetic information into huge databases.  Tools are
needed not only to search these databases for specific "query sequences"
(i.e., a sequence that one might obtain from an unknown gene in a lab),
but also to find sequences that show similarities to these sequences.
Finding and quantifying similarities between sequences enables scientists
to predict evolutionary relationships and potential function of these
sequences (sequences that are useful to an organism will be highly 
conserved throughout evolution).

	In it's search for sequences similar to the query sequence, BLAST
finds the most similar sequences called Maximal Segment Pairs (MSPs),
which are in laymans terms best matches. These MSP's refer to a
subsequence of matches that produced the highest score according to
BLAST's scoring algorithm (anything shorter or longer would have produced 
a lower score).  

As mentioned earlier, such searches tell scientists what organisms the
similar (and exact) matches come from, and how similar they are to the
query sequence.  This allows the deduction of evolutionary and functional
relationships.


How BLAST finds the MSP's:


	In order to search the target sequence, BLAST first constructs
"words" from the query sequence (the length of these words is specified by
a parameter called w and is usually 3-4 for amino-acid sequences and > 11
for nucleic acid sequences) and then searches the target sequence for
matches using these words.  The list of words is the set of all substrings
of the query sequence of length w.  Each match made with the target
sequence has a certain score, and those regions in the target sequence
with a high enough score to get above a certain threshold are
called triggering matches.  BLAST then searches the region around these
triggering matches in an attempt to maximize the score.  Once the score
falls below a certain value.  BLAST reports the local alignment
generated as a Maximal Segment Pair (MSP).  There are different approaches
of searching for the triggering events.  One obvious approach is using a
hash table. Unfortunately, this is extremely space-inefficient and takes a
long time.  A faster approach is using an algorithm based on DFAs (1).  In
this approach, a DFA is constructed in order to recognize each word of length 
that is a substring of the query sequence.  (i.e., the alphabet is all nucleic 
acids (A,T,G,C) or all amino acids). 
These DFAs are then used to search for triggering matches instead of
hashtables--this approach saves both time and space. 


Example:


Let's say we've discovered a polypeptide sequence and want to find similar
sequences in the database stored at National Center for Biotechnology
Information (NCBI).

Let the query sequence be:
(each letter stands for one amino acid.  There are 20 amino acids, and the
symbols for all amino acids is treated like an alphabet)
LEDKVEELLS KNYHLENEVA RL

BLAST will search the database and return a user-specified number of
similar sequences, in decreasing order of similarity.
So, blast returns the exact match:
>gi|2218122 (U94951) hinge-region and leucine-zipper with N-terminal
            PelB-leader and C-terminal hexahistidine tag [Cloning vector pZIP1]
            Length = 84

Score = 109 (47.7 bits), Expect = 1.1e-09, P = 1.1e-09 Identities = 22/22 (100%), Positives = 22/22 (100%) Query: 1 LEDKVEELLSKNYHLENEVARL 22 LEDKVEELLSKNYHLENEVARL Sbjct: 48 LEDKVEELLSKNYHLENEVARL 69

and some other matches with lower scores:

>gnl|PID|e325261 (Z97204) hypothetical protein [Schizosaccharomyces pombe]
            Length = 932

 Score = 43 (18.8 bits), Expect = 2.6, Sum P(2) = 0.93
 Identities = 8/13 (61%), Positives = 11/13 (84%)

Query:     2 EDKVEELLSKNYH 14
             E+K+EE L KNY+
Sbjct:   558 EEKIEEGLLKNYY 570

Score = 31 (13.6 bits), Expect = 2.6, Sum P(2) = 0.93
 Identities = 5/7 (71%), Positives = 7/7 (100%)

Query:    14 HLENEVA 20
             HLEN++A
Sbjct:   842 HLENKIA 848
Note that the MSPs only contain a substring of the original query 
sequence and a substring of the target sequence.  The +'s indicate similar 
amino acids.  


Conclusion


	BLAST is a program that has become ubiquitous in biocomputing due
to its speed and usefulness.  It does not, however compensate for
insertions and deletions that frequently occur in nature (i.e., it does
not insert gaps).  BLAST is also an example of the usefulness of the
language recognition task that can be performed efficiently by
deterministic finite automata.


References:
1) Altschul, S. Gish, W. Miller W. Myers E. Lipman D. Basic Local Alignment 
	Search Tool. J. Mol. Biol. (1990) 215, 403-410.
2) Alschul, S. Gish, W.  Local Alignment Statistics.  (1996) Methods in
	Enzymology (1996) 266, 460-481.
3) Cantalloube, H. et. al., Automat and BLAST: comparison of two protein
	sequence similarity search programs.  CABIOS (1995) 11, 261-272
4) Wisconsin Package Version 9.0, Genetics Computer Group (GCG), Madison,
   	Wisc.