CS367 Introduction to Data Structures - Summer 2006

A2.2 Genome

Announcements and Clarifications

7/1 - Solution released

6/28 - Due date changed to Fri 6/30 @ 5pm

Brief Description

Prokaryotic bacteria are cells that do not contain membrane-bound nuclei. Instead, the genetic material of this type of bacteria floats freely within the cell. The DNA of prokaryotic cells is contained in a chromosome and in numerous circular rings called plasmids. Your task is to write a Java program that can read the DNA contained on a plasmid from a file and then perform operations on it.

Your design should be object-oriented using natural objects in the problem space: BasePair, DNA, CircularDoublyLinkedList, ... You should not use any Java collection classes in your implementation (if you are unsure whether you have done this or not please ask).

Goals / Requirements

Implement a circular doubly linked list
Represent the DNA structure using your circular doubly linked list
Review program implementation and design in Java
- Using command line arguments
- Reading from files
- Handling IOExceptions
- Menu based programs
- Object oriented design
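
As a starting point for the first goal, here is a minimal sketch of a circular doubly linked list. The names RingSketch, insertBeforeFocus, and walk are illustrative assumptions, not required names, and the sketch stores a plain char where your design would likely store a BasePair.

```java
// Illustrative sketch only: class and method names are assumptions, not part of the handout.
class RingSketch {
    // A node knows both neighbours; a lone node points at itself in both directions.
    private static final class Node {
        final char base;
        Node next, prev;
        Node(char base) { this.base = base; next = this; prev = this; }
    }

    private Node focus;   // null while the ring is empty
    private int size;

    // Splice a new node in just before the focus and make it the new focus.
    void insertBeforeFocus(char base) {
        Node node = new Node(base);
        if (focus != null) {
            node.next = focus;
            node.prev = focus.prev;
            focus.prev.next = node;
            focus.prev = node;
        }
        focus = node;
        size++;
    }

    int size() { return size; }

    // Walk once around the ring from the focus, clockwise (next) or counter-clockwise (prev).
    String walk(boolean clockwise) {
        StringBuilder out = new StringBuilder();
        Node cur = focus;
        for (int i = 0; i < size; i++) {
            out.append(cur.base);
            cur = clockwise ? cur.next : cur.prev;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        RingSketch ring = new RingSketch();
        for (char c : new char[] {'C', 'G', 'T', 'A'}) ring.insertBeforeFocus(c);
        System.out.println(ring.walk(true));   // ATGC
        System.out.println(ring.walk(false));  // ACGT
    }
}
```

Because the last node links back to the first, there is no null to check for while walking; counting up to size is what stops the loop.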

DNA Description

DNA consists of a sequence of nucleotide pairs called base pairs. There are four distinct nucleotide bases: adenine (A), cytosine (C), guanine (G), and thymine (T). Each base matches with exactly one other to form the base pairs adenine with thymine (A-T) and cytosine with guanine (C-G). Here is an example of a small fragment of DNA containing 8 base pairs:

ACCTGGAA
||||||||
TGGACCTT

However, the plasmids in prokaryotic cells are rings, so the sequence loops around and repeats; there is neither a beginning nor an end on a plasmid sequence. Since each base pairs with exactly one other base, we can describe the sequence using only the first line of nucleotides:

ACCTGGAA
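
Since one strand determines the other, the second line can always be recomputed from the first. The sketch below shows one way to do that; the class and method names are hypothetical, and it assumes the input contains only the uppercase letters A, C, G, and T.

```java
// Illustrative sketch: derives the paired strand from the first line of nucleotides.
class ComplementSketch {
    // Map one base to its partner: A<->T and C<->G.
    static char complement(char base) {
        switch (base) {
            case 'A': return 'T';
            case 'T': return 'A';
            case 'C': return 'G';
            case 'G': return 'C';
            default: throw new IllegalArgumentException("not a nucleotide: " + base);
        }
    }

    // Complement every base of a strand, keeping the order unchanged.
    static String complementStrand(String strand) {
        StringBuilder out = new StringBuilder(strand.length());
        for (int i = 0; i < strand.length(); i++) {
            out.append(complement(strand.charAt(i)));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(complementStrand("ACCTGGAA")); // TGGACCTT
    }
}
```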

To visualize the sequence and to fix a frame of reference, one base pair in the sequence will be known as the focus. All operations on the sequence start from the focus. Each command may set a new focus at the end of its operation. The interface examples below omit the menu options for the sake of brevity. User input is marked in bold. The test files we used are test.gen and test2.gen.

Program Description

1. Read in the DNA sequence from the file specified as a single command line argument. If the file does not exist, the program should print an error and exit. If there are characters in the file that do not correspond to a nucleotide in a base pair, it should print a warning and ignore those characters. Do not assume there is any limit on the number of bases we list in an input file.
2. Display the current DNA sequence for the next 20 base pairs (if the total length of the sequence is shorter than 20, display the whole sequence).
3. Display the menu and ask for user input.
4. Perform the appropriate task.
5. Repeat steps 2-4.
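
The file-reading step could be sketched as follows. This is only one possible shape: LoaderSketch and readBases are made-up names, the exact wording of the warning and error messages is not specified by this handout, and the sketch assumes that whitespace such as line breaks is skipped silently. It collects bases in a StringBuilder for brevity; in your program each valid base would go into your linked list instead.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;

// Illustrative sketch of reading and filtering the input file.
class LoaderSketch {
    // Keep only A, C, G, T and warn about anything else.
    // Assumption: whitespace (such as line breaks) is skipped without a warning.
    static String readBases(Reader in) throws IOException {
        StringBuilder bases = new StringBuilder();
        int c;
        while ((c = in.read()) != -1) {   // one character at a time, so no length limit
            char ch = (char) c;
            if ("ACGT".indexOf(ch) >= 0) {
                bases.append(ch);
            } else if (!Character.isWhitespace(ch)) {
                System.err.println("Warning: ignoring invalid character '" + ch + "'");
            }
        }
        return bases.toString();
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("Usage: java LoaderSketch <file>");
            System.exit(1);
        }
        try (Reader in = new BufferedReader(new FileReader(args[0]))) {
            System.out.println(readBases(in));
        } catch (IOException e) {
            // FileNotFoundException is an IOException, so a missing file lands here too.
            System.err.println("Error: cannot read " + args[0] + ": " + e.getMessage());
            System.exit(1);
        }
    }
}
```

Taking a Reader rather than a file name keeps the filtering logic testable without touching the disk.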

Your program will have six menu options:

1. Reverse and complement the DNA sequence

This option will reverse the sequence: operations will change from operating clockwise to operating counter-clockwise and vice versa. This will affect all subsequent functions until the option is chosen again.

    ATGC => ACGT   (reverse)

This option will also flip from the outer nucleotide to the inner nucleotide and vice versa. This will also affect all subsequent functions until the option is chosen again.

    ATGC => TACG   (complement)

    ATGC => TGCA   (reverse and complement)

The focus remains on the same base pair.

wolf(45)% java Genome test.gen 
Sequence:

        ATGC

Genome Options:
... menu options ...
Enter choice (1-6): 1
Sequence:

        TGCA

Genome Options:
... menu options ...
Enter choice (1-6):
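
One way to get this behavior without touching any nodes is to keep two flags, one for direction and one for strand, and consult them whenever the sequence is read. The sketch below illustrates the idea under some assumptions: OrientationSketch is a hypothetical name, and a char array with an index stands in for your linked ring and its focus node.

```java
// Illustrative sketch: option 1 implemented as two flags rather than by rewriting the ring.
class OrientationSketch {
    private final char[] ring;        // stand-in for the linked ring: outer strand, clockwise
    private int focus = 0;            // index of the focus base pair
    private boolean reversed = false; // true: read counter-clockwise
    private boolean inner = false;    // true: read the inner (complementary) strand

    OrientationSketch(String outerStrand) { ring = outerStrand.toCharArray(); }

    // Menu option 1: flip both flags; nothing is moved and the focus stays put.
    void reverseAndComplement() {
        reversed = !reversed;
        inner = !inner;
    }

    private static char complement(char b) {
        switch (b) {
            case 'A': return 'T';
            case 'T': return 'A';
            case 'C': return 'G';
            default:  return 'C'; // 'G'; the input is assumed to be validated already
        }
    }

    // Read up to `limit` bases from the focus in the current direction and strand.
    String view(int limit) {
        int n = Math.min(limit, ring.length);
        StringBuilder out = new StringBuilder(n);
        for (int i = 0; i < n; i++) {
            int step = reversed ? -i : i;
            char b = ring[Math.floorMod(focus + step, ring.length)];
            out.append(inner ? complement(b) : b);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        OrientationSketch plasmid = new OrientationSketch("ATGC");
        plasmid.reverseAndComplement();
        System.out.println(plasmid.view(20)); // TGCA
    }
}
```

With this design the operation takes constant time regardless of plasmid length, and choosing the option twice restores the original view.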

2. Insert base pairs

Prompt the user to enter a sequence of base pairs. If the sequence is invalid for any reason, prompt the user again. Insert this sequence of base pairs immediately before the focus. The new focus should be at the beginning of the inserted sequence. This function depends on both the reversal and complement functionality.

wolf(52)% java Genome test.gen
Sequence:

        ATGC

Genome Options:
... menu options ...
Enter choice (1-6): 2
Please enter a sequence:
4180581590
Invalid input - '4'
Please enter a sequence:
jqotqjfoifawn
Invalid input - 'j'
Please enter a sequence:
CATTAG
Sequence:

        CATTAGATGC

Genome Options:
... menu options ...
Enter choice (1-6):
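
The re-prompting loop in the transcript above could look roughly like this. The names are hypothetical, and the sketch assumes that only uppercase A, C, G, and T are valid and that an empty line simply triggers another prompt.

```java
import java.util.Scanner;

// Illustrative sketch of validating a typed-in sequence before inserting it.
class SequencePromptSketch {
    // Index of the first character that is not A, C, G or T, or -1 if all are valid.
    static int firstInvalid(String s) {
        for (int i = 0; i < s.length(); i++) {
            if ("ACGT".indexOf(s.charAt(i)) < 0) return i;
        }
        return -1;
    }

    // Keep prompting until a non-empty, fully valid sequence is entered.
    // Assumption: the input stream does not end before a valid line arrives.
    static String promptForSequence(Scanner in) {
        while (true) {
            System.out.println("Please enter a sequence:");
            String line = in.nextLine().trim();
            int bad = firstInvalid(line);
            if (bad >= 0) {
                System.out.println("Invalid input - '" + line.charAt(bad) + "'");
            } else if (!line.isEmpty()) {
                return line;
            }
        }
    }

    public static void main(String[] args) {
        String sequence = promptForSequence(new Scanner(System.in));
        System.out.println("Accepted: " + sequence);
    }
}
```

Validating the whole line before inserting anything means a bad character never leaves the plasmid half-modified.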

3. Remove base pairs

Prompt the user to enter a number of base pairs to remove. Remove that number of base pairs starting with the focus. The new focus is the base pair immediately after the last base pair removed. If the number of base pairs to be removed is greater than or equal to the total number of base pairs in the plasmid, remove them all.

wolf(53)% java Genome test.gen
Sequence:

        ATGC

Genome Options:
... menu options ...
Enter choice (1-6): 3
Please enter the number of base pairs to delete:
-1
Please enter the number of base pairs to delete:
0
Please enter the number of base pairs to delete:
2
Sequence:

        GC

Genome Options:
... menu options ...
Enter choice (1-6): 3
Please enter the number of base pairs to delete:
100
Sequence:



Genome Options:
... menu options ...
Enter choice (1-6):
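
On a doubly linked ring, a run of nodes can be removed by reconnecting the two nodes on either side of the run. The following sketch shows that idea; RemoveSketch and its methods are illustrative names, and it only handles the clockwise direction, so the interaction with option 1 is left out.

```java
// Illustrative sketch of removing a run of nodes from a circular doubly linked list.
class RemoveSketch {
    private static final class Node {
        final char base;
        Node next, prev;
        Node(char base) { this.base = base; next = this; prev = this; }
    }

    private Node focus;
    private int size;

    // Build a ring from a string; the first character becomes the focus.
    RemoveSketch(String bases) {
        for (int i = 0; i < bases.length(); i++) {
            Node node = new Node(bases.charAt(i));
            if (focus == null) {
                focus = node;
            } else {                    // link in just before the focus, i.e. at the "end"
                node.next = focus;
                node.prev = focus.prev;
                focus.prev.next = node;
                focus.prev = node;
            }
            size++;
        }
    }

    // Unlink `count` nodes starting at the focus (clockwise only in this sketch).
    // A count at or above the current size empties the ring.
    void remove(int count) {
        if (count <= 0) throw new IllegalArgumentException("count must be positive");
        if (count >= size) { focus = null; size = 0; return; }
        Node first = focus;
        Node afterLast = focus;
        for (int i = 0; i < count; i++) afterLast = afterLast.next;
        // Bridge the gap in one step instead of unlinking node by node.
        first.prev.next = afterLast;
        afterLast.prev = first.prev;
        focus = afterLast;
        size -= count;
    }

    // Read the whole ring clockwise from the focus.
    String walk() {
        StringBuilder out = new StringBuilder();
        Node cur = focus;
        for (int i = 0; i < size; i++) { out.append(cur.base); cur = cur.next; }
        return out.toString();
    }

    public static void main(String[] args) {
        RemoveSketch plasmid = new RemoveSketch("ATGC");
        plasmid.remove(2);
        System.out.println(plasmid.walk()); // GC
    }
}
```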

4. Shift

Prompt the user to enter a positive number. Shift the focus of the plasmid sequence by that number of base pairs. The direction shifted will depend on whether the sequence is reversed or not.

wolf(54)% java Genome test.gen
Sequence:

        ATGC

Genome Options:
... menu options ...
Enter choice (1-6): 4
Please enter the number of base pairs to shift by:
2
Sequence:

        GCAT

Genome Options:
... menu options ...
Enter choice (1-6):
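
Because the plasmid is a ring, shifting by a full lap changes nothing, so a large shift can be reduced modulo the plasmid length first. The sketch below works through that arithmetic on index positions; the names are assumptions, and in a linked list the same reduced number of steps would be taken by following next or prev links.

```java
// Illustrative sketch of the arithmetic behind shifting the focus on a ring.
class ShiftSketch {
    // New focus index after shifting `count` positions from `focus` in a ring of `size` items.
    static int shiftedIndex(int focus, int count, int size, boolean reversed) {
        if (size == 0) return 0;          // nothing to shift on an empty ring
        int steps = count % size;         // full laps change nothing
        return Math.floorMod(reversed ? focus - steps : focus + steps, size);
    }

    // Render the ring (outer strand, clockwise) starting from the given focus index.
    static String fromFocus(String ring, int focus) {
        return ring.substring(focus) + ring.substring(0, focus);
    }

    public static void main(String[] args) {
        int focus = shiftedIndex(0, 2, 4, false);
        System.out.println(fromFocus("ATGC", focus)); // GCAT
    }
}
```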

5. Search for genes

Prompt the user for two valid base pair sequences (called the start and end codons). These codons are sequences of bases that act as markers on the DNA for specific genes. Each codon should be exactly three bases long. Starting at the focus, search for the first sequence of three base pairs that matches the start codon. When the start codon is located, move to the base pair just past the end of the start codon. Begin searching for the end codon, looking at each set of three base pairs. Record all base pairs that you pass until you find the end codon. Display the base pairs occurring between the start codon and the end codon. The new focus is the base pair after the end of the end codon.

For example (the base in focus is marked by the "^" character):

 
    Start sequence:    CATAGATAG
                       ^
    Start Codon: CAT
    End Codon: TAG
    Gene sequence: AGA

    After search:      CATAGATAG
                       ^
    Start Codon: AGA
    End Codon: TAG
    Gene sequence: nothing 

    After search:      CATAGATAG
                       ^
    Start Codon: TAG  
    End Codon: AGA
    Gene sequence: CAT 

    After search:      CATAGATAG
                             ^

Only sets of three bases are considered during the search, so there are only three sets of sequences in the above example:

    CATAGATAG => CAT AGA TAG
    ^

If the focus is different the sequences may also be different:

    CATAGATAG => ATA GAT AGC
     ^

We will not test start and end codons that do not exist in the plasmid we give you; assume it is always possible to find a gene of finite length. Also, the start and end codons will never overlap. However, the gene may contain zero base pairs.

This function depends on both the reversal and complement functionality.

wolf(67)% java Genome test2.gen
Sequence:

        CATAGATAG

Genome Options:
... menu options ...
Enter choice (1-6): 5
Start Codon:
Please enter a sequence:
AG
Codon must be exactly 3 bases.
Please enter a sequence:
AGA
End Codon:
Please enter a sequence:
CAT
Result of search: TAG
Sequence:

        AGATAGCAT

Genome Options:
... menu options ...
Enter choice (1-6):
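
The search can be checked against a small string-based model before wiring it into the linked list. The sketch below is such a model under stated assumptions: GeneSearchSketch and Result are hypothetical names, the ring is given as a clockwise string with a focus index, and reversal and complement are ignored. It also gives up after one lap per phase, which the assignment does not require since the codons are guaranteed to exist.

```java
// Illustrative string-based model of the gene search, reading three bases at a time.
class GeneSearchSketch {
    // Outcome of one search: the gene text and the index of the new focus.
    static final class Result {
        final String gene;
        final int newFocus;
        Result(String gene, int newFocus) { this.gene = gene; this.newFocus = newFocus; }
    }

    // Read three bases starting at `pos`, wrapping around the ring.
    private static String codonAt(String ring, int pos) {
        int n = ring.length();
        return "" + ring.charAt(pos % n) + ring.charAt((pos + 1) % n) + ring.charAt((pos + 2) % n);
    }

    static Result search(String ring, int focus, String start, String end) {
        int n = ring.length();
        if (n == 0) throw new IllegalArgumentException("empty plasmid");
        int pos = focus;
        int tries = 0;
        // Phase 1: step three bases at a time until the start codon is found.
        while (!codonAt(ring, pos).equals(start)) {
            if (++tries > n) throw new IllegalStateException("start codon not found");
            pos += 3;
        }
        pos += 3; // move past the start codon
        // Phase 2: record every codon passed until the end codon is found.
        StringBuilder gene = new StringBuilder();
        tries = 0;
        while (!codonAt(ring, pos).equals(end)) {
            if (++tries > n) throw new IllegalStateException("end codon not found");
            gene.append(codonAt(ring, pos));
            pos += 3;
        }
        return new Result(gene.toString(), (pos + 3) % n);
    }

    public static void main(String[] args) {
        Result r = search("CATAGATAG", 0, "AGA", "CAT");
        System.out.println("Result of search: " + r.gene); // Result of search: TAG
        System.out.println("New focus index: " + r.newFocus); // New focus index: 3
    }
}
```

The new focus index of 3 corresponds to the displayed sequence AGATAGCAT in the transcript above.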

6. Quit

Exit the program.

Commenting and Style

Your program should be written in a style that makes it easy to read and understand.
At the beginning of each .java file you should include a description of the class and how it interacts with the other parts of your program.
You are not required to use javadoc style comments.
If your code is doing something complex or non-standard please comment that portion heavily.

Handin

Please hand all necessary files into your handin directory in a subdirectory named Genome. If your program does anything strange (bugs) or awesome (extra features), or has a non-intuitive interface, please include a file called README.txt which explains them. If there are bugs in your program but you do not describe them in your README, you will lose more credit than if you had described them.

Hints

Develop incrementally:
- Get your circular doubly linked list working
- Use your circular doubly linked list with the basic classes in the program (BasePair)
- Develop the menu options incrementally (from easiest to hardest)
Test with small plasmid sequences first.
