Exercise 3: DNA Analysis

There are two parts to this week's exercise:

  • dna_analysis.py - in this part you will write a program which analyzes a user-input string of DNA. This part may be completed in pairs.
  • Review questions - this part will not be graded, but we will discuss the questions in class on 17 February and they will help you prepare for the exam, so you may want to work through them beforehand.

Note that while you MAY work in pairs this week, this assignment isn't particularly complex and would make a great review for the midterm. Even if you plan to work in pairs, I encourage you to at least attempt it on your own first, to see what you do and don't understand.

Background

A major area in computer science research is bioinformatics, which in large part deals with the intersection of biology and software. At some point, you may have learned about DNA, composed of four types of "bases":

  • A - adenine
  • C - cytosine
  • G - guanine
  • T - thymine

The human genome, for example, is roughly 3 Gb (gigabases, or billion bases) long. In 2001, the Human Genome Project published its first draft sequence of the human genome; the project took over a decade and cost roughly $3 billion to complete. According to the National Human Genome Research Institute, the sequencing of a full human genome costs a little over $1,000 today. The machines that do the sequencing end up producing files which contain long strings of As, Gs, Cs, and Ts, representing the bases sequenced.

One of the things that biologists are interested in is the GC-content of a particular sequence: the percentage of its bases that are Gs or Cs rather than As or Ts. (A-T pairs split apart more easily than G-C pairs; this splitting is called denaturation, or melting, and annealing is the reverse process.) For example, in the following sequence:

A T C A G A A C T A

the GC-content is considered to be 30% (since there are 3 bases which are either G or C, and 10 total bases).
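
As a quick check of that arithmetic, here is a minimal sketch that counts the Gs and Cs in the example sequence. It uses Python's built-in string count() method on a string that is already complete, which is not how your program will receive its input (your program reads one base at a time), so treat it only as a way to verify the 30% figure. The variable names are just for illustration.

```python
# The example sequence from above, written without spaces.
sequence = "ATCAGAACTA"

# Count the bases that are either G or C.
gc_count = sequence.count("G") + sequence.count("C")

# Convert the count to a percentage of all bases in the sequence.
percent = 100 * gc_count / len(sequence)

print(gc_count, len(sequence), percent)  # prints: 3 10 30.0
```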

Problem statement

For this assignment, your program will take bases as input and generate a summary of the GC-content. If you were to run your program with the sequence in the previous section as input, the following would be the output:

Enter your sequence one base at a time.
A
T
C
A
G
A
A
C
T
A
(enter)
DNA Sequence: ATCAGAACTA
GC Content: 30.0%

Each time the user types a character (A/G/C/T), they should also press Enter, as above. A blank line (i.e. just pressing Enter without even typing a character) is what indicates to your program that it should stop processing input and display a summary.

User input checking

In this program, you'll begin checking the user's input so that simple typos don't cause your code to crash.

If the user enters an empty line as the only input, our program should print "Sequence is 0 bases long!" and exit using exit():

Enter your sequence one base at a time.
(enter)
Sequence is 0 bases long!

As soon as the user enters an invalid character, our program should display an error message and exit using exit(). Invalid characters include lowercase a/g/c/t, and all other characters:

Enter your sequence one base at a time.
A
G
T
C
k
k isn't a valid nucleotide!
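
One possible way to organize these two checks is sketched below. The helper names error_message and stop_if_error are hypothetical (they are not among the required functions), and the sketch assumes you keep a running count of how many bases have been accepted so far. It is one arrangement among many, not the expected solution.

```python
VALID_BASES = ("A", "C", "G", "T")


def error_message(line, bases_so_far):
    """Return the error text for one line of input, or None if it is fine.

    line is what the user typed (without the newline), and bases_so_far
    is the number of valid bases accepted before this line.
    """
    # A blank line as the very first input means an empty sequence.
    if line == "" and bases_so_far == 0:
        return "Sequence is 0 bases long!"
    # Anything non-blank that isn't exactly A, C, G, or T is invalid.
    # This includes lowercase letters and multi-character input.
    if line != "" and line not in VALID_BASES:
        return line + " isn't a valid nucleotide!"
    return None


def stop_if_error(line, bases_so_far):
    """Print the error and exit if this line of input is a problem."""
    message = error_message(line, bases_so_far)
    if message is not None:
        print(message)
        exit()


print(error_message("k", 4))  # prints: k isn't a valid nucleotide!
```

Checking membership in a tuple of one-letter strings, rather than in the string "ACGT", avoids accidentally accepting input such as "AC" or an empty string.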

Writing the code

This week, we require that you implement a couple of key functions in addition to producing the correct output. If you get the right output but don't use the functions, you will lose points!

The required functions are:

  • is_valid(base): expects one string input and returns True if the input value, base, is a valid base (A/C/G/T) and False for anything else.
  • gc_percent(gc, total): expects two integer inputs, the number of bases which are G or C and the total number of bases in the sequence, and returns a float percentage (between 0 and 100) representing the GC content of the sequence.

    For example, gc_percent(3,8) should return 37.5 (not 0.375).

You may implement other functions as you wish, but these two functions must be implemented exactly as specified above!
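
To make the two specifications concrete, here is a minimal sketch of functions with those names and behaviors. The function bodies are one straightforward reading of the descriptions above, not the only acceptable implementation, and the sketch assumes total is never 0 when gc_percent is called (the program is supposed to exit before that point for an empty sequence).

```python
def is_valid(base):
    """Return True if base is exactly one of A, C, G, or T."""
    return base in ("A", "C", "G", "T")


def gc_percent(gc, total):
    """Return the GC content as a float percentage between 0 and 100.

    gc is the number of G or C bases; total is the length of the
    sequence. Assumes total is greater than 0.
    """
    return 100 * gc / total


print(is_valid("G"))     # prints: True
print(is_valid("k"))     # prints: False
print(gc_percent(3, 8))  # prints: 37.5
```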

Step 1: Input

Begin by implementing and testing the is_valid(base) function - greet the user with the prompt, and enter a loop where you're continually getting user input and outputting whether it's valid or not:

Enter your sequence one base at a time.
A
True
G
True
k
False
T
True
C
True

Next, remove your testing output and start storing your input bases. Change your loop condition to detect when the user hits Enter without typing a base, and display the full sequence:

Enter your sequence one base at a time.
A
G
T
C
(enter)
DNA Sequence: AGTC
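
The loop described above might look something like the following sketch. The function name read_sequence and its read_line parameter are hypothetical: read_line defaults to Python's built-in input, and exists only so the loop can be demonstrated here with scripted input instead of a keyboard. This version does no validity checking yet.

```python
def read_sequence(read_line=input):
    """Collect bases one line at a time until a blank line is entered."""
    print("Enter your sequence one base at a time.")
    sequence = ""
    base = read_line()
    # An empty string means the user pressed Enter without typing a base.
    while base != "":
        sequence += base
        base = read_line()
    return sequence


# Scripted stand-in for a user typing A, G, T, C, and then a blank line.
typed = iter(["A", "G", "T", "C", ""])
sequence = read_sequence(lambda: next(typed))
print("DNA Sequence: " + sequence)  # prints: DNA Sequence: AGTC
```

Calling read_sequence() with no argument reads from the keyboard instead.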

This might also be a good time to use your new is_valid(base) function to do some of the user input checking described above.

Step 2: Analysis

Now, implement your gc_percent(gc, total) function. The function's contents are pretty straightforward, but calculating its input values will require a little more implementation on your part.

This testing output (don't include this in your final program) might give you a hint, if you need one:

Enter your sequence one base at a time.
A
GC: 0
G
GC: 1
T
GC: 1
C
GC: 2
(enter)
DNA Sequence: AGTC
GC Content: 50.0%
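
If the hint isn't enough, the running count in that testing output can be reproduced with a sketch like the one below. It works on a list of bases that has already been collected, and the helper name count_gc is hypothetical; in your program the same counting would more likely happen inside your input loop.

```python
def count_gc(bases):
    """Return how many of the given bases are G or C."""
    gc = 0
    for base in bases:
        if base == "G" or base == "C":
            gc += 1
        print("GC:", gc)  # testing output only; remove before handing in
    return gc


bases = ["A", "G", "T", "C"]
gc = count_gc(bases)
print("GC Content: " + str(100 * gc / len(bases)) + "%")  # prints: GC Content: 50.0%
```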

Commenting your code

As with the previous assignment, some of your program points will come from commenting your code.

In addition to our file header from the previous section, add a 1-2 sentence description of the program and its function to the top of your code. For example, a good description comment for the triangles.py program from lab 1 might be something like:

########################################################################
#
# Lab 1: Heronian Triangles
#
# ======================================================================
#
# This program implements a formula for calculating the area of a triangle
# given its side lengths. Additionally, it will check whether the triangle 
# qualifies as "Heronian"; that is, the sides and area are all integers.
#
########################################################################

In addition, you should include comments within your code describing what you're doing. You don't need to comment on every line of code, but you should include comments as though you're walking the TA through your code. You can also use comments as I do in class, to create an outline for your code that you then fill in. (Consider my in-class commenting a lower bound of how frequently you should comment.)

Submitting your files

As usual, you'll be handing in your lab work via the course Learn@UW dropboxes. Navigate to our 301 course page, and click the Dropbox link in the top navigation bar. You should see a dropbox for Program 3 - this is where you should hand in your dna_analysis.py file.

If you worked in a pair, only one person will need to hand in the code.

Note that the dropbox will close at noon on 18 February, so be sure to submit your files before then.

Midterm Review Questions

This part of the assignment is ungraded, but we'll be talking about the questions in class on 17 February. I recommend working through the problems on your own first, but you may collaborate with as many people as you want (since this part is just for you and doesn't count toward your grade).

> REVIEW QUESTIONS <

This set of questions is not meant to match the length of a full exam, but the questions are of comparable difficulty to those you will encounter on the exam.