Running a real science example: BLAST
Objective of this exercise
The objective of this exercise is to run a real scientific application (BLAST) instead of all of these toy examples.
For our final example, we're going to run a real scientific application, BLAST.
BLAST homepage
BLAST Wikipedia entry
The BLAST web site says:
The Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. BLAST can be used to infer functional and evolutionary relationships between sequences as well as help identify members of gene families.
I'll try to interpret that for you. BLAST is a pretty cool tool, particularly now that scientists have found lots of DNA and protein sequences. Let's imagine that you're a biologist, and you're curious about some gene. You've located this gene in a yeast, and you want to know if it's in a fly. Good question. Well, biologists have nifty ways of writing down gene and protein sequences--you've probably seen the 4-letters used to represent nucleotides in DNA: GCTA. You can also have sequences of amino acids that make up proteins. If you have a transcription for a gene in yeast and you have the entire genome for the fly, you could just search. It's like searching for a string in a text file. Of course, it's more complicated than that. For one thing, biologists wonder if there is a similar sequence in the fly, not just an identical sequence. After all, things change. So BLAST is a tool for doing that.
Beyond this basic explanation, I can't tell you a whole lot more about it, because it gets a lot more complex than that. If you're curious, check out the links above. Here's a short quote from the above Wikipedia page:
In bioinformatics, Basic Local Alignment Search Tool, or BLAST, is an algorithm for comparing primary biological sequence information, such as the amino-acid sequences of different proteins or the nucleotides of DNA sequences. A BLAST search enables a researcher to compare a query sequence with a library or database of sequences, and identify library sequences that resemble the query sequence above a certain threshold. For example, following the discovery of a previously unknown gene in the mouse, a scientist will typically perform a BLAST search of the human genome to see if humans carry a similar gene; BLAST will identify sequences in the human genome that resemble the mouse gene based on similarity of sequence.
Note that BLAST is pre-deployed in our cluster.
To run BLAST, if you're comfortable with writing shell scripts, I'd like you try to write a shell script to invoke it yourself. If you're not comfortable, you can look at mine below.
BLAST itself is located in /opt/workshop/zkm_exercises/blast/
We have a few BLAST databases, located in /opt/workshop/zkm_exercises/blastdb
.
To run BLAST, you need to do three things from your script:
- Set up your environment, by sourcing
/opt/workshop/zkm_exercises/blast/blast.sh
- Run blast (executable name is
blastp
) with two parameters
- -db /full/path/to/database (where "database" is
yeast.aa
or human_genomic
or nr
)
- -query /full/path/to/some/query/file (I'll give you some queries, don't worry)
- Redirect standard output into a file, so you conveniently name the output of your program.
So an example invocation of blast (step 2) might be:
blastp -db /opt/workshop/zkm_exercises/blastdb/yeast.aa -query query1
Again, this is what your script will do--you won't be able to directly invoke it.
Here is my script, if you need to see it:
#!/bin/sh
if [ $# -ne 3 ]; then
echo "Usage: run-blast
A query looks like the following, which you can put into a file named query1
>Simple yeast query
MPVSDSGFDNSSKTMKDDTIPTEDYEEITKESEMGDATKITSKIDANVIEKKDTDSENNITIAQDDEKVSWLQRVVEFFE
There are two things there: the name of the query, and the stuff to search for. This particular query is looking for a sequences of amino acids in proteins. (That's why there are more than four letters, like the familiar nucleotides in DNA.) If you're curious what the letters mean, you can read about it in Wikipedia.
OK, your first task is to run a Condor job to execute that query against the yeast.aa
database. The job itself should execute quickly (about a second, once the job starts running).
The output will look something like this, but longer. (I chopped it short.)
BLASTP 2.2.24+
Reference: Stephen F. Altschul, Thomas L. Madden, Alejandro A.
Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J.
Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of
protein database search programs", Nucleic Acids Res. 25:3389-3402.
Reference for composition-based statistics: Alejandro A. Schaffer,
L. Aravind, Thomas L. Madden, Sergei Shavirin, John L. Spouge, Yuri
I. Wolf, Eugene V. Koonin, and Stephen F. Altschul (2001),
"Improving the accuracy of PSI-BLAST protein database searches with
composition-based statistics and other refinements", Nucleic Acids
Res. 29:2994-3005.
Database: yeast.aa
6,298 sequences; 2,974,038 total letters
Query= Simple yeast query
Length=80
Score E
Sequences producing significant alignments: (Bits) Value
ref|NP_009511.1| uridine permease; Fui1p 170 4e-44
ref|NP_012677.1| dolichyl phosphate-D-mannose:protein O-D-mannos... 23.9 4.6
ref|NP_015305.1| Smt3-processing enzyme; Ulp1p 23.9 4.8
ref|NP_011691.1| Squalene monooxygenase; Erg1p 23.9 5.1
ref|NP_009545.1| putative repressor protein homologous to yeast ... 23.1 7.3
>ref|NP_009511.1| uridine permease; Fui1p
Length=639
Score = 170 bits (430), Expect = 4e-44, Method: Composition-based stats.
Identities = 80/80 (100%), Positives = 80/80 (100%), Gaps = 0/80 (0%)
See that 100% positives? You got a perfect match. No surprise--I copied that sequence right out of the database.
A small change
Change the query by swapping the last two letters. (It will end in EF instead of FE). How does it change the results?
A different database
Run the same query, but against the nr
database. You should notice that instead of being a very fast query (less than a second), it should take a few minutes. The nr database is a collection of protein sequence databases and it's quite large. A larger database means a longer run time. You should have similar results to the earlier query because the yeast protein sequences are also in the nr database.
A larger set of queries
We have ten query files for you. You should run them all against the nr database.
Let's make it a bit more interesting though: the BLAST output is rather verbose. I've written a script to summarize the BLAST output. It's probably not scientifically interesting, but it trims the output quite a bit and stills gives you an interesting part of it.
You should run these ten queries as a DAG. There are ten queries that run first, then the last summarization step runs after all the other DAG nodes complete.
Query files: query1, query2, query3, query4, query5, query6, query7, query8, query9, query10
Summarization script: blast-summarize
You can also find these files in /opt/blast-dag
, which is probably easier for you. (They're on this web site to save for the future.)