Phylogenetic Trees

We now begin Unit Two, in which we will be discussing phylogenetic trees.

“Phylogenetic” has two Greek roots, phylo- (“tribe” or “type”) and -genesis (“origin”), which together mean:

“the origin of types”

Phylogenetic trees are the “evolutionary trees” that you see in science museums and nature shows.

Here's a sample phylogenetic tree for baboons:

Here's another one for ichthyosaurs:

Phylogenetic trees don't need to be based on species; they can just as well relate genes, proteins, or other sequences.

From these examples, we see that the following holds true:

Some trees do not have a well-defined common ancestor.

Constructing Phylogenetic Trees

The basic task: given data about a set of taxa, construct a tree that describes their evolutionary relationships.

Why construct trees?

There are three different common goals for constructing a tree:

Similarly, trees can be constructed from various types of data.

Rooted & Unrooted Trees

Here are the two kinds of tree:

Unrooted Tree | Rooted Tree

Remember: an unrooted tree shows only relatedness; a rooted tree also shows ancestry, with time flowing away from the root.

Notice that these two trees are made from the same data.

First question: “Given n species/genes/sequences/etc., how many trees is it possible to construct?”

The solution involves a fair amount of graph theory, and it turns out like this:

Number of unrooted (binary) trees on n leaves: (2n - 5)!! = 1 × 3 × 5 × ⋯ × (2n - 5)

Number of rooted (binary) trees on n leaves: (2n - 3)!! = 1 × 3 × 5 × ⋯ × (2n - 3)

Entering numbers into the above formulas, we get the following table:

# Leaves (n)    # Unrooted Trees    # Rooted Trees
     4                     3                    15
     5                    15                   105
     6                   105                   945
     8                10,395               135,135
    10             2,027,025            34,459,425
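If you want to check these numbers yourself, here's a quick Python sketch (the helper name is mine) that evaluates the double factorials from the formulas above:

```python
def double_factorial(m: int) -> int:
    """Compute m!! = m * (m - 2) * (m - 4) * ... * 1."""
    result = 1
    while m > 1:
        result *= m
        m -= 2
    return result

# Reproduce the table: (2n - 5)!! unrooted and (2n - 3)!! rooted trees.
for n in [4, 5, 6, 8, 10]:
    print(n, double_factorial(2 * n - 5), double_factorial(2 * n - 3))
```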

In other words: the number of possible trees grows explosively with n, far too fast to ever examine them all.

Distance-Based Methods

We start by considering a couple of distance-based methods: UPGMA and Neighbor Joining.

We define the problem as follows: given the pairwise distances dij between n taxa, construct a tree with the taxa at its leaves that is consistent with those distances.

The basic idea behind a distance-based approach is this: the (vertical) distance between any two leaves should correspond to their predefined distance.

UPGMA

One of the most basic examples of such a method is UPGMA (Unweighted Pair Group Method with Arithmetic mean).

The algorithm runs as follows:

  1. Assign each sequence to its own cluster.
  2. Define one leaf for each sequence; place it at height 0.
  3. While more than two clusters remain:
    1. Determine the two clusters i, j with the smallest dij.
    2. Define a new cluster Ck by merging Ci and Cj.
    3. Define a node k with children i and j; place it at height (½ dij).
    4. Replace clusters i and j with k.
    5. Compute the distance between k and each other cluster.
  4. Join the last two clusters, i and j, by the root at height (½ dij).

Fortunately, there's a shortcut to calculating new distances. When Ci and Cj merge into Ck, the distance from Ck to any other cluster Cm is just the size-weighted average:

dkm = (|Ci| dim + |Cj| djm) ÷ (|Ci| + |Cj|)
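To make the algorithm concrete, here's a minimal Python sketch of UPGMA (the data layout and function names are my own, not from the text); step 3.5 uses the shortcut formula:

```python
def upgma(names, dist):
    """names: taxon labels; dist: dict mapping (a, b) -> distance (one order
    is enough). Returns the root as nested (label, height, children) tuples."""
    d = {}
    for (a, b), v in dist.items():           # make lookups symmetric
        d[(a, b)] = d[(b, a)] = v

    size = {n: 1 for n in names}             # cluster sizes, for the shortcut
    node = {n: (n, 0.0, []) for n in names}  # leaves sit at height 0
    active = set(names)

    while len(active) > 1:
        # Find the two clusters with the smallest distance.
        i, j = min(((a, b) for a in active for b in active if a < b),
                   key=lambda p: d[p])
        # Merge them under a new node at height d(i, j) / 2.
        k = i + j
        node[k] = (k, d[(i, j)] / 2, [node[i], node[j]])
        size[k] = size[i] + size[j]
        active -= {i, j}
        # Shortcut: distance to each other cluster is the size-weighted average.
        for m in active:
            d[(k, m)] = d[(m, k)] = (size[i] * d[(i, m)]
                                     + size[j] * d[(j, m)]) / size[k]
        active.add(k)

    return node[active.pop()]
```

The last pass through the loop handles step 4 automatically: with two clusters left, the merge node it creates at height ½ dij is the root.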

Example

Let us say you are trying to create a phylogenetic tree for five species, labeled A through E. Their distances from each other are given by the following matrix:

Build the tree using UPGMA.

Solution:
Here's how the problem starts out. First we see which two leaves are closest to each other.

Repeating, we choose to merge B and C. Next, AE is closest to D, so we merge those. Finally, we join the last two clusters, creating the tree's root.

You may have noticed that in this example, the distances were remarkably well behaved.

Actually, this data was constructed using the molecular clock assumption.

How can you tell ultrametric data? There's a three-point test: for any three taxa, the two largest of their three pairwise distances must be equal.
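That test is easy to code. Here's a small Python checker, assuming the distances sit in a symmetric nested dict:

```python
from itertools import combinations

def is_ultrametric(names, d):
    """Three-point test: among any three taxa, the two largest of the
    three pairwise distances must be equal. d[a][b] is symmetric."""
    for a, b, c in combinations(names, 3):
        smallest, middle, largest = sorted([d[a][b], d[a][c], d[b][c]])
        if middle != largest:
            return False
    return True
```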

In general, real data is not ultrametric, so UPGMA may recover the wrong tree.

Neighbor Joining

Neighbor Joining is another distance-based method for reconstructing phylogenetic trees from distance data.

There are also two key differences in the merging process:

So far, in each iteration we've chosen the two clusters that are closest together.

Should A be joined with C or D?

How can we make a method that will choose C over B?

Let us define the penalty factor ui as follows:

ui = (Σm≠i dim) ÷ (N - 2)

where N is the current number of clusters.

Every iteration, we will merge the two clusters that have the lowest biased distance score D:

Dij = dij - ui - uj

When we merge two clusters i and j into a new cluster k, it looks like this:

We can find k's distance to any other cluster m using:

dkm = (dim + djm - dij) ÷ 2

This equation corresponds to a situation like this:

The distance from i or j to its new parent can also be computed from the distance matrix directly. However, it's a little easier to use the u values:

dik = (dij + ui - uj) ÷ 2
djk = (dij + uj - ui) ÷ 2

So the total algorithm is as follows:

  1. Assign each sequence to its own cluster, and define a leaf for it.
  2. While more than two clusters remain:
    1. Compute ui for every cluster.
    2. Determine the two clusters i, j with the smallest biased distance Dij.
    3. Define a new node k with children i and j; compute the branch lengths dik and djk.
    4. Compute the distance between k and each other cluster.
    5. Replace clusters i and j with k.
  3. Join the last two clusters with an edge of length dij.
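Here's a minimal Python sketch of those steps (again, the names and data layout are mine). It returns the edges of the unrooted tree along with their branch lengths:

```python
def neighbor_joining(names, d):
    """d: symmetric nested dict, d[a][b] = distance. Returns a list of
    (cluster, parent, branch length) edges describing the unrooted tree."""
    active = list(names)
    edges = []
    while len(active) > 2:
        n = len(active)
        # Penalty factors: u_i = (sum of distances from i) / (N - 2).
        u = {i: sum(d[i][m] for m in active if m != i) / (n - 2)
             for i in active}
        # Merge the pair with the lowest biased distance D_ij = d_ij - u_i - u_j.
        i, j = min(((a, b) for a in active for b in active if a < b),
                   key=lambda p: d[p[0]][p[1]] - u[p[0]] - u[p[1]])
        k = i + j
        # Branch lengths from i and j up to the new node k.
        edges.append((i, k, (d[i][j] + u[i] - u[j]) / 2))
        edges.append((j, k, (d[i][j] + u[j] - u[i]) / 2))
        # Distances from k to every other cluster.
        d[k] = {}
        for m in active:
            if m not in (i, j):
                d[k][m] = d[m][k] = (d[i][m] + d[j][m] - d[i][j]) / 2
        active = [m for m in active if m not in (i, j)] + [k]
    # Join the last two clusters with a single edge.
    a, b = active
    edges.append((a, b, d[a][b]))
    return edges
```

Note that the ties in the example below are broken arbitrarily; this code just takes the first minimum it finds, which may differ from the walkthrough's choices.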

Example

Here is a matrix of distances. Use Neighbor Joining to solve for the tree.

      A   B   C   D   E   F
A     0   5   4   7   6   8
B         0   7  10   9  11
C             0   7   6   8
D                 0   5   9
E                     0   8
F                         0

Solution:

From the matrix, calculate all the u scores. Then calculate the biased distances, and insert the new cluster. Keep going until done.

      A   B   C   D   E   F
A     0   5   4   7   6   8
B         0   7  10   9  11
C             0   7   6   8
D                 0   5   9
E                     0   8
F                         0
uA = (5 + 4 + 7 + 6 + 8) ÷ 4 = 7.5
uB = (5 + 7 + 10 + 9 + 11) ÷ 4 = 10.5
uC = (4 + 7 + 7 + 6 + 8) ÷ 4 = 8
uD = (7 + 10 + 7 + 5 + 9) ÷ 4 = 9.5
uE = (6 + 9 + 6 + 5 + 8) ÷ 4 = 8.5
uF = (8 + 11 + 8 + 9 + 8) ÷ 4 = 11

Of all the biased distances, we have a tie:

DA,B = 5 - 7.5 - 10.5 = -13
DD,E = 5 - 9.5 - 8.5 = -13
We'll arbitrarily choose to merge A and B. We need to find each one's distance to its new parent:
dA,AB = (5 + 7.5 - 10.5) ÷ 2 = 1
dB,AB = (5 + 10.5 - 7.5) ÷ 2 = 4

We also calculate each other element's distance to AB, and start a new step.

      AB   C   D   E   F
AB     0   3   6   5   7
C          0   7   6   8
D              0   5   9
E                  0   8
F                      0
uAB = (3 + 6 + 5 + 7) ÷ 3 = 7
uC = (3 + 7 + 6 + 8) ÷ 3 = 8
uD = (6 + 7 + 5 + 9) ÷ 3 = 9
uE = (5 + 6 + 5 + 8) ÷ 3 = 8
uF = (7 + 8 + 9 + 8) ÷ 3 = 10.67

Another tie in the biased distances:

DC,AB = 3 - 7 - 8 = -12
DD,E = 5 - 9 - 8 = -12
Arbitrarily choose D and E.

Find each one's distance to its new parent:

dD,DE = (5 + 9 - 8) ÷ 2 = 3
dE,DE = (5 + 8 - 9) ÷ 2 = 2

And do the distances for the next full matrix:

      AB   C  DE   F
AB     0   3   3   7
C          0   4   8
DE             0   6
F                  0
uAB = (3 + 3 + 7) ÷ 2 = 6.5
uC = (3 + 4 + 8) ÷ 2 = 7.5
uDE = (3 + 4 + 6) ÷ 2 = 6.5
uF = (7 + 8 + 6) ÷ 2 = 10.5

The smallest biased distances are:

DC,AB = 3 - 6.5 - 7.5 = -11
DDE,F = 6 - 6.5 - 10.5 = -11

Join C and AB. Find each one's distance to its new parent:

dC,ABC = (3 + 7.5 - 6.5) ÷ 2 = 2
dAB,ABC = (3 + 6.5 - 7.5) ÷ 2 = 1

And do the distances for the next full matrix:

      ABC  DE   F
ABC     0   2   6
DE          0   6
F               0
uABC = (2 + 6) ÷ 1 = 8
uDE = (2 + 6) ÷ 1 = 8
uF = (6 + 6) ÷ 1 = 12

We have a 3-way tie in the biased distances:

DABC,DE = 2 - 8 - 8 = -14
DABC,F = 6 - 8 - 12 = -14
DDE,F = 6 - 8 - 12 = -14
Arbitrarily choose ABC and DE.

Find each one's distance to its new parent:

dABC,ABCDE = (2 + 8 - 8) ÷ 2 = 1
dDE,ABCDE = (2 + 8 - 8) ÷ 2 = 1

And, at long last:

        ABCDE   F
ABCDE       0   5
F               0
The u scores would require dividing by zero here (N - 2 = 0), but they don't matter on the last step anyway.

Just connect ABCDE and F with a distance of 5.

Rooting Unrooted Trees

We just put a lot of effort into joining these taxa into a tree...but it's unrooted.

But how can we decide who's an ancestor of whom?

Answer: use an outgroup, a taxon known to be only distantly related to everything else in the tree. Wherever the outgroup attaches is where the root belongs.

Additivity

So, intuitively, what does additivity mean? A distance matrix is additive if some tree explains it exactly: the distance between any two leaves equals the sum of the branch lengths on the path between them.

In principle, one might expect high-fidelity data to be additive.

Just like with ultrametric data, there's a test for additive data: the four-point condition. For any four taxa i, j, k, l, two of the three sums dij + dkl, dik + djl, dil + djk must be equal, and at least as large as the third.

Here's a graphical representation of why this is so:

Neighbor Joining returns the proper tree if the data is additive.
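To make the four-point test concrete, here's a small Python checker (same assumed data layout as the ultrametric checker):

```python
from itertools import combinations

def is_additive(names, d):
    """Four-point test: for every four taxa, the two largest of the three
    pairwise sums must be equal. d[a][b] is symmetric."""
    for i, j, k, l in combinations(names, 4):
        sums = sorted([d[i][j] + d[k][l],
                       d[i][k] + d[j][l],
                       d[i][l] + d[j][k]])
        if sums[1] != sums[2]:
            return False
    return True
```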

Complexity Analysis

The time and space required to run distance-based tree-building methods depend mainly on the number of taxa in the dataset.

For time complexity: each iteration scans the distance matrix (O(n²) entries) to find the closest pair, and there are O(n) iterations, giving O(n³) overall.

Before, I hinted that O(n³) may be too slow/too big.

If we need to calculate the distances, this is an additional O(n²) step.

For space complexity, we need to store two things:

  1. The distance matrix/matrices.
    • Size O(n²).
  2. The growing tree.
    • Size O(n).

The total space complexity is thus O(n²).

Parsimony

Distance-based tree methods work fairly well, but they have a flaw: they compress all the information about a pair of sequences into a single number.

Parsimony-based methods don't use distance matrices. Instead, they work directly with the characters themselves (features, bases, or amino acids), preferring the tree that requires the fewest changes to explain the data.

Example

Given the four entities at the bottom of each tree, which tree better explains their ancestry?

Solution:
Both trees could potentially indicate the ancestry of Smilius familiaris. However, the tree on the right has fewer changes. Therefore, parsimony says that it is the more likely tree.

Here's a slightly less cute example:

One thing that you may have noticed in these examples is that every feature or base/amino acid is treated as independent of all the others.

Parsimony approaches usually have two different components:

  1. Search through many different tree topologies to find a good one.
  2. Find the minimum number of changes needed to explain the data, for a given tree topology.

We'll start by focusing on just doing the latter.

Fitch's Algorithm

The simplest parsimony problem to solve is an unweighted small parsimony problem.

Fitch's Algorithm is the most common solution.

Fitch's Algorithm has two main steps:

  1. Traverse the tree from leaves to root (“bottom-up”), determining a set of possible states (e.g. nucleotides) for each internal node.
  2. Traverse the tree from root to leaves (“top-down”), picking ancestral states from the sets calculated in Step 1.

The first step runs from the leaves up to the root. A leaf's set is just its observed state. For each internal node, if its children's sets intersect, the node's set is the intersection; otherwise it's the union (and we count one change).

The second step runs from the root to the leaves.

  1. If a node's parent's state is in its own set, pick it.
  2. If not (or the node doesn't have a parent), just pick one arbitrarily.
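Here's a compact Python sketch of both steps (the tree representation, nested pairs with leaf names at the tips, is my own choice):

```python
def fitch(tree, states):
    """tree: nested pairs, e.g. ((("A", "B"), "C"), ("D", ("E", "F")));
    states: leaf name -> observed state. Returns (assignment, #changes)."""
    sets, assignment = {}, {}
    changes = 0

    def up(node):                                   # Step 1: leaves to root
        nonlocal changes
        if isinstance(node, str):                   # a leaf: its observed state
            sets[node] = {states[node]}
            return
        left, right = node
        up(left)
        up(right)
        common = sets[left] & sets[right]
        if common:
            sets[node] = common                     # intersection if possible...
        else:
            sets[node] = sets[left] | sets[right]   # ...else union: one change
            changes += 1

    def down(node, parent_state):                   # Step 2: root to leaves
        if parent_state in sets[node]:
            assignment[node] = parent_state         # keep the parent's state
        else:
            assignment[node] = min(sets[node])      # otherwise pick arbitrarily
        if not isinstance(node, str):
            for child in node:
                down(child, assignment[node])

    up(tree)
    down(tree, None)
    return assignment, changes
```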

Example

Looking at six related proteins, we conjecture that they are related as in the tree shown below. The base at position 521 is (in the six proteins respectively) C, T, G, T, A, and T. Use Fitch's Algorithm to determine which base was present in the ancestor proteins, back to the root.

Solution:
Using Fitch's Algorithm, we go through each node bottom-up, determining a set of possible states. Then we go back through top-down, narrowing each down to a single state.

(If this were automated, we'd do the exact same thing for all of the other bases too.)

Complexity Analysis of Fitch's Algorithm

The running time of Fitch's Algorithm depends on 3 things:

  1. The number of taxa to be related (n).
  2. The number of allowed states for each node (k).
  3. The total length of the sequences being compared (s).

The number of taxa determines the number of ancestor nodes: a rooted binary tree with n leaves has n - 1 internal nodes, so there are O(n) nodes to process.

The number of possible states (e.g. 4 for nucleotide bases) is also important in this algorithm: each node's set can hold up to k states, so the set operations cost O(k).

Finally, we need to do all of these steps once for each character in the sequences.

So the total time complexity is O(nks).

Space complexity is a similar calculation.

Sankoff's Algorithm

The weighted small parsimony problem adds a complication: different state changes can have different costs, given by a weight matrix like this one:

      A   C   G   T
A     0   9   4   3
C         0   4   4
G             0   2
T                 0

Sankoff's Algorithm provides an answer.

Basic idea: working from the leaves up, compute for each node a cost table giving the minimum possible cost of the subtree below it, for every state the node might take.

For example, let us assume that a certain node is in state t.

Formally, the cost table for each leaf is determined by:

Sleaf(t) = 0 if the leaf is observed in state t, and ∞ otherwise

and the cost table for every other node k is given by:

Sk(t) = mini (δi,t + Sleft(i)) + minj (δj,t + Sright(j))

where δi,t represents the cost to get from state i to state t, and “left” and “right” are k's children.

The second step is simple: at the root, pick the state with the smallest total cost; then walk back down the tree, at each node choosing the child states that achieved that minimum.
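And here's a Python sketch of Sankoff's Algorithm under those definitions (same tree representation as the Fitch sketch; INF stands in for ∞):

```python
INF = float("inf")

def sankoff(tree, states, alphabet, delta):
    """tree: nested pairs with leaf names at the tips; states: leaf -> state;
    delta[i][t]: cost of a change between states i and t (symmetric here).
    Returns (minimum total cost, state assignment for every node)."""
    cost = {}

    def up(node):                  # Step 1: fill cost tables, leaves to root
        if isinstance(node, str):  # leaf: 0 for its state, infinity otherwise
            cost[node] = {t: 0 if states[node] == t else INF for t in alphabet}
            return
        left, right = node
        up(left)
        up(right)
        cost[node] = {t: min(delta[i][t] + cost[left][i] for i in alphabet)
                       + min(delta[j][t] + cost[right][j] for j in alphabet)
                      for t in alphabet}

    def down(node, t, assignment):  # Step 2: retrace the minimizing states
        assignment[node] = t
        if isinstance(node, str):
            return
        left, right = node
        down(left, min(alphabet, key=lambda i: delta[i][t] + cost[left][i]),
             assignment)
        down(right, min(alphabet, key=lambda j: delta[j][t] + cost[right][j]),
             assignment)

    up(tree)
    root_state = min(alphabet, key=lambda t: cost[tree][t])
    assignment = {}
    down(tree, root_state, assignment)
    return cost[tree][root_state], assignment
```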

Example

Solve the weighted small parsimony problem for the bases A, C, T, and G related as shown in the tree below. Use the weight matrix shown.

      A   C   G   T
A     0   9   4   3
C         0   4   4
G             0   2
T                 0

Solution:

Start by filling in the cost tables for the leaf nodes.

Continue upwards to the root, filling in the score tables. Then assign all the nodes the states that lead to the lowest cost at the root.

Complexity Analysis of Sankoff's Algorithm

Complexity analysis is much the same as that for Fitch's Algorithm, except that each of the k entries in a node's cost table takes a minimum over the k states of each child, so the total becomes O(nk²s) rather than O(nks).

P and NP

Before we get to the large parsimony problem, we need to make another digression into complexity theory.

Some problems have polynomial complexity: the time needed to solve them grows as a polynomial in the size of the input.

Other problems are “nondeterministic polynomial” (NP): a proposed solution can be verified in polynomial time, even if we don't know how to find one in polynomial time.

Think of a jigsaw puzzle: it can take days to assemble, but anyone can glance at the finished picture and confirm it's right. Finding a solution is hard; checking one is easy.

This is part of the reason scientists are so interested in making a quantum computer.

Many NP problems are logically equivalent to each other: these “NP-complete” problems can all be translated into one another, so a polynomial-time solution to any one of them would solve them all.

As strange as it may seem, no one's actually proven that an NP-complete problem can't be solved in polynomial time.

But for the purposes of bioinformatics, what it means is this: for problems like large parsimony, we can't expect exact solutions on realistically sized inputs, so we settle for heuristics.

The Large Parsimony Problem

If we're given a tree a priori, finding the most parsimonious ancestors is easy.

As we saw above, trying every tree is unreasonable:

# Leaves (n)    # Unrooted Trees    # Rooted Trees
     4                     3                    15
     5                    15                   105
     6                   105                   945
     8                10,395               135,135
    10             2,027,025            34,459,425

We have to search for a good tree.

This is a heuristic approach.

There is a little good news:

Searching

Say you're in a large old house, and you hear a noise.

Will you find the source?

This is a searching problem.

The brute force solution would be: search every room of the house, one at a time, until you find the source.

Searching for Parsimonious Trees

Since we can't look at every tree, we need another approach.

Here's how we set up the problem: the search space is the set of all tree topologies; the cost of a tree is its small parsimony score; and two trees are “neighbors” if a small rearrangement turns one into the other.

Here are the steps to the algorithm:

  1. Start with a tree.
  2. Do small parsimony on it.
  3. Repeat:
    • Do small parsimony on all neighbors of the current tree.
    • If the current tree has a lower cost than all of its neighbors, stop the loop.
    • Otherwise, choose the neighbor with the best cost, and switch to that one.
  4. The current tree is the best one found.
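In code, the loop looks like this; neighbors and parsimony_cost are assumed helpers (generating adjacent topologies and solving small parsimony, respectively), not defined here:

```python
def hill_climb(start_tree, neighbors, parsimony_cost):
    """Greedy local search over tree topologies. `neighbors(tree)` yields
    adjacent trees; `parsimony_cost(tree)` is the small parsimony score."""
    current = start_tree
    current_cost = parsimony_cost(current)
    while True:
        best, best_cost = None, current_cost
        for candidate in neighbors(current):
            c = parsimony_cost(candidate)
            if c < best_cost:
                best, best_cost = candidate, c
        if best is None:                # local optimum: no neighbor is better
            return current, current_cost
        current, current_cost = best, best_cost
```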

For the current tree, how do we determine which trees are adjacent?

One idea: pick two nodes from different areas of the tree, and swap them.

The approach the text uses:

Where should you start the search?

Using a heuristic best estimate: build a starting tree quickly with another method (for example, a distance-based method like Neighbor Joining), and begin the search there.

Rapid Random Restart: run the search many times, each time from a different randomly generated starting tree, and keep the best result.

For the parsimony problem, we can use the former approach: build a decent starting tree with a fast distance-based method, then improve it by searching.

Branch and Bound

There is another way to shortcut the search for the most parsimonious tree.

Notice that as a tree gets bigger, its cost can only increase: adding a taxon can add new changes, but never removes any.

What you need is to enumerate all trees in a way that builds each tree up from its own subtrees. Here's one method: add the taxa one at a time, generating each tree on i taxa by inserting the i-th taxon into every edge of a tree on i - 1 taxa.

The search “branches” out from each partial tree; the “bound” comes from pruning any partial tree whose cost already matches or exceeds the best complete tree found so far.
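As a sketch, here's the pruning logic in Python; attach (insert a taxon into every edge of a partial tree) and cost (small parsimony) are assumed helpers, not defined here:

```python
def branch_and_bound(partial, remaining, attach, cost, best):
    """partial: a tree on a subset of the taxa; remaining: taxa not yet placed;
    best: dict {"tree": ..., "cost": ...} holding the best complete tree so far."""
    if cost(partial) >= best["cost"]:
        return                              # bound: adding taxa never lowers cost
    if not remaining:                       # a complete tree that beats the bound
        best["tree"], best["cost"] = partial, cost(partial)
        return
    taxon, rest = remaining[0], remaining[1:]
    for bigger in attach(partial, taxon):   # branch: try every insertion point
        branch_and_bound(bigger, rest, attach, cost, best)
```

Start the recursion from the unique three-taxon tree, and seed best["cost"] with infinity, or better, with the cost of a tree found by a heuristic; the tighter the initial bound, the more gets pruned.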

Branch and Bound is not usually considered a heuristic: it is still guaranteed to find the optimal tree; it just skips branches that provably can't win. (In the worst case, though, it's still exponential.)

Summary

Phylogenetic trees show the relations between many taxa. There are usually three different approaches to reconstructing a tree:

UPGMA is a distance-based method: it repeatedly merges the closest pair of clusters, and works best when the data is ultrametric (i.e. fits a molecular clock).

Neighbor Joining is a similar distance-based method, but its penalty factors correct for unequal rates of change; it recovers the correct tree whenever the data is additive.

Parsimony methods find the relations that require the least amount of change.

Truly finding the best tree by parsimony is usually an intractable problem, so in practice we rely on heuristic search or Branch and Bound.