Trees
7.0 Non-linear Data Structures
Recall from last week that we talked about stacks, queues, and linked lists. These are all data structures designed to handle linear data. From here onward, we will be interested also in non-linear data structures. Uses for non-linear data structures include improving the performance of linear data (such as with Priority Queues where we may want to add tasks anywhere in the queue) or representing data that is naturally organized in a non-linear fashion (such as a family tree).
7.1 Trees
Our primary non-linear data structure is the Tree. To understand the structure of a tree, begin by thinking about the linked list structure we considered last week. Imagine if you allowed a node to have multiple "next" variables. That is the essence of a tree. We begin with a "root" node. The root node may or may not contain data (depending on the type of tree), and it has some number of "children" that are also nodes. Each of those children may have data, and may have some number of children of its own. At the bottom of any tree are "leaf" nodes, which are nodes that do not have any children. Leaf nodes always have data. Nodes that do have children are sometimes called "internal" nodes.
- A binary tree is a tree where each node has at most two children, generally called its "left" and "right" child.
- The depth of a node is the number of steps from child to parent that are required to go from it to the root node. (So the root has a depth of 0.)
- The height of a tree is the length of the longest path from the root to a leaf. Equivalently, the height of a tree is the maximum depth of any node in the tree.
- A tree is said to be full if every internal node has the same number of children (including the root). A binary tree is said to be complete if every level is completely filled, except possibly the last, which is filled from left to right.
- A subtree is some node, its children, their children, and so on. This first node is said to "root" the subtree, in the same way that the "root" node roots the entire tree.
Note that child order matters.
     root                          root
     /  \    is different from     /  \
    a    b                        b    a
7.2 Tree Traversal
Sometimes we will want to take a tree and linearize it. This is called a "tree traversal". When we traverse a tree, we want to do something (such as print data) for each node in the tree in some reasonable order. We might also conduct a tree traversal for searching purposes. Since a tree is not naturally linear, we have to make some choices.
        A
       / \
      B   C
     / \
    D   E
Consider the simple tree above. Unless we are traversing in some truly strange way, we will certainly want to print D before E. But what about B? Our three basic options are to print the parent before its children ("pre-order traversal"), print the parent after its children ("post-order traversal"), or print the parent between its children ("in-order traversal").
- Pre-order traversal: B D E
- In-order traversal: D B E
- Post-order traversal: D E B
Task:
- Complete the Pre-, In-, and Post-order traversals of the tree above.
7.3 Trees in Python
We can implement a tree in Python in much the same way we implemented a linked list. We begin by defining a Node class. One possible implementation is shown here:

class Node():
    def __init__(self, data, left, right):
        self.data = data
        self.left = left    # Left child
        self.right = right  # Right child
This Node class would be useful for implementing a binary tree, since it defines a left and a right child. We can create a new tree fairly easily, and we can begin the method that adds a new node to a tree like so:
class BinaryTree():
    def __init__(self):
        self.isEmpty = True

    def append(self, data):
        if self.isEmpty:
            self.root = Node(data, None, None)
            self.isEmpty = False
What should we do when appending a node when it is not the only node in the tree? What should we do when removing a node? The answers to these questions depend on the type of tree we have made. Like the linked list, the tree is a tool that we use to implement other data structures, so our choices will be determined by what those data structures need.
One option is to expose the inner workings of the tree, and simply build the tree by manually creating and attaching each node. I might write the following script:
root = Node('A',None,None)
root.left = Node('B',None,None)
root.right = Node('C',None,None)
root.left.left = Node('D',None,None)
root.left.right = Node('E',None,None)
This script, combined with the definition of a Node above, will generate the small tree from the previous section.
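With the tree built, we can also check the definitions from section 7.1 in code. Below is one possible sketch; the helper name height() is my own, and treating an empty tree as having height -1 is a convention chosen so that a lone leaf has height 0.

```python
class Node():
    def __init__(self, data, left, right):
        self.data = data
        self.left = left    # Left child
        self.right = right  # Right child

def height(node):
    """Length of the longest path from this node down to a leaf."""
    if node is None:
        return -1  # convention: empty tree has height -1, so a leaf has height 0
    return 1 + max(height(node.left), height(node.right))

# Build the small example tree by hand
root = Node('A', None, None)
root.left = Node('B', None, None)
root.right = Node('C', None, None)
root.left.left = Node('D', None, None)
root.left.right = Node('E', None, None)

print(height(root))  # 2: the longest root-to-leaf path is A -> B -> D
```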
Tasks:
- How might you implement the Node class for a tree where a node might have any number of children?
- Write a function that, given a binary tree, prints a pre-order traversal of that tree.
Searching
7.4 General Approaches
A lot of what we do involves searching. Consider the following sorted list of numbers:
1 3 4 6 7 12 13 14 18 19 21 22 23 30 33 35
If I ask whether this list contains the number 21, we can pretty easily reply that it does. In fact, the searching process remains fast even if I expand the set of numbers to search in, provided those numbers are sorted.
2 5 7 10 13 16 20 23 24 27 30 34 35 36 37 41 45 46 48 52 56 57 59 62 64 65 66 69 71 75 78 80 84 86 90 93 95 96 99 101 104 105 107 110 114 118 120 121 124 127 129 131 135 139 141 144 146 150 151 155 156 160 161 164 168 172 173 177 180 183 186 189 193 196 197 200 204 207 211 213 216 217 218 219 221 223 227 231 235 236 237 241 245 246 250 253 254 258 261 265
Is the number 119 present in the sorted list above? Think about how you found where 119 would be, and how useful it was to know that the list was already sorted for you. In an earlier lab, I asked you to find the smallest set of features that was missing from the titanic_fatalities training data. That was also a search task, but you did not have conveniently sorted data to work with.
There are three fundamental search patterns that we use:
- Sequential search
- Binary search
- Hashing
7.5 Sequential Search
In the case of the titanic_fatalities data, since we did not have organized data, our only option was to check every piece of our training data individually. Sequential search makes note of the reference (what we're looking for) and then compares the reference to each and every possible element. If we find an element that matches the reference, then we can stop early, but otherwise we continue until we have checked everything exactly once.
Sequential search is O(n) in the size of the data we are searching, because we may have to compare each data element to the reference once.
If we know something about the data, we may be able to skip parts of it when doing sequential search. For example, if we are searching through sorted numbers, we can stop once we see a number that is larger than our reference. We know at that point that if we have not yet seen it, we never will.
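As a sketch (the function names here are my own), the basic pattern and the sorted-data shortcut might look like this:

```python
def sequential_search(items, reference):
    """Check every element against the reference, stopping early on a match."""
    for x in items:
        if x == reference:
            return True
    return False

def sequential_search_sorted(items, reference):
    """Sequential search over an ascending list: stop once we pass the reference."""
    for x in items:
        if x == reference:
            return True
        if x > reference:
            return False  # everything after this point is larger still
    return False

print(sequential_search_sorted([1, 3, 4, 6, 7, 12], 6))  # True
print(sequential_search_sorted([1, 3, 4, 6, 7, 12], 5))  # False, stops at 6
```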
Task:
- Consider the function myIndex() below. (You should have written something similar for Lab 4.) This is an example of a sequential search algorithm. Explain in your own words why it is O(n) and how it correctly determines whether v is a value contained in the given dictionary.
def myIndex(d, v):
    """Find a key in d with value v, or raise a ValueError"""
    for key in d:
        if d[key] == v:
            return key
    raise ValueError(str(v) + " not in dictionary")
7.6 Binary Search
In many settings, we have enough information to compare two elements of our search space in some consistent way, and we have our search space sorted according to those comparisons. This gives us the ability to engage in the binary search pattern. Although more restricted in its use, binary search runs much faster. The idea behind binary search is to repeatedly split a list into two halves, such that you are able to tell that one of those halves definitely does not contain the reference. We do this by comparing the reference to the "middle" element of the list. (We'll assume that the list is sorted in ascending order.) If the reference is greater than the middle element, then the left half of the list cannot contain an element equal to the reference. If, on the other hand, the reference is smaller than the middle element, then the right half cannot contain such an element. If the two are equal, then we have found the element we want and we are finished early. Otherwise, we can always throw away one half or the other as useless to us. If we reduce the size of the sublist under consideration to 0, then we are done and we know that the reference is not in the list.
This process divides our "working list" in half with each pair of comparisons we make. It actually does even a little better than that, because we can always either throw away the middle element or end the search early. But say we always throw away "at least half of the list". Then, starting from a list of length n, it takes us O(log n) comparisons to determine where the reference is located within the list, if it is present at all.
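One possible iterative implementation of this process (the function name and variable names are my own):

```python
def binary_search(items, reference):
    """Return True if reference is in the ascending list items, else False."""
    lo, hi = 0, len(items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if items[mid] == reference:
            return True        # found it: finish early
        elif items[mid] < reference:
            lo = mid + 1       # discard the left half, middle included
        else:
            hi = mid - 1       # discard the right half, middle included
    return False               # working list shrank to size 0: not present

print(binary_search([1, 3, 4, 6, 7, 12, 13, 14], 7))  # True
print(binary_search([1, 3, 4, 6, 7, 12, 13, 14], 5))  # False
```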
7.7 Binary Search Trees
Binary search is often implemented using a tree. If the original data is a list, then the "middle" element of that list becomes the root of the tree. Its children are the middle elements of the two halves of the list. When we have done our first comparison of the root against the reference, the result of that comparison tells us which child to look at for our next comparison. Here is an example binary search tree (BST) built from a sorted list of numbers:
                14
             /      \
           6          22
          / \        /  \
         3   12    19    30
        / \  / \   / \   / \
       1  4 7  13 18 21 23 33
In order to use a BST, we need to be able to create the tree and execute a lookup() method to find out whether a given reference is present in the tree. We likely also want to implement add() and delete() methods so that we can make changes to our BST without having to make a new one. In practice, we want to keep the height of our BST as short as possible so that our lookup method will run efficiently. This requires some added complexity in any add or delete method we implement, beyond what we have time to cover. (Note: If you are interested in these sorts of "self-balancing" trees, "Red-Black Trees" is a possible topic for next week.)
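As a sketch of the lookup() idea on a numeric BST (reusing the Node class from section 7.3, and hard-coding just the top of the tree above):

```python
class Node():
    def __init__(self, data, left, right):
        self.data = data
        self.left = left    # Left child
        self.right = right  # Right child

def lookup(node, reference):
    """Return True if reference is stored somewhere in the BST rooted at node."""
    if node is None:
        return False                          # fell off the tree: not present
    if reference == node.data:
        return True
    elif reference < node.data:
        return lookup(node.left, reference)   # smaller values live on the left
    else:
        return lookup(node.right, reference)  # larger values live on the right

# Top three levels of the example BST
root = Node(14,
            Node(6, Node(3, None, None), Node(12, None, None)),
            Node(22, Node(19, None, None), Node(30, None, None)))
print(lookup(root, 19))  # True
print(lookup(root, 20))  # False
```

Note that each call descends one level, so the running time is proportional to the height of the tree.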
Not all binary search trees come from lists. Consider the titanic_fatalities data. We might want to create a binary search tree that allows us to look up a given set of feature values to find out whether training examples with those feature values survived or died. Each level of our BST considers one feature, with examples following either the left or right branch down depending on which value they have for that feature. In this case, only leaf nodes will actually have data, since every example has a value for every feature.
Task:
- Write a binary search tree for the titanic fatalities data. You should implement one function "lookup()" that takes a data point (complete with a value for each feature). This function should return a guess about whether the data point survived, based on the training data that shares all of its feature values. You will want to hard-code the BST itself, but the lookup function should be designed so that it can be used on other binary search trees you create.
7.8 Hashing
We can sometimes arrange our data in such a way that we can find it even more quickly than by binary search. One such method is using hashing, or "hash tables". A hash is a function that looks like a random number generator. It takes in data, and outputs a random-looking number within a given range. However, it only looks random. If I give the hash the same input again, it gives me the same number as output. I store my data in a hash table, which is a list large enough to contain an index for every number my hash function might spit out. I store each piece of data in the index corresponding to its hash, so my "search" consists of hashing my reference and checking whether the right piece of data is at that index.
Ideally, I select a table size large enough that there is enough space for all of my data, but otherwise as small as I can make it. The limiting factor here is that because the hashes are random-like, there is a possibility that different inputs might have the same hashed output. I want to keep these "collisions" to a minimum, because they force me to do extra searching within an entry in my table. Still, as long as I can keep the maximum number of collisions at a single point down to some constant number (perhaps by expanding my table if I see too many collisions), then I can achieve search in O(1). It is impossible to do better.
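A toy sketch of the idea (the HashTable class and its method names are my own; I use Python's built-in hash() as the hash function, and each table index holds a small "bucket" list so that collisions degrade into a short sequential search within one bucket):

```python
class HashTable():
    """A toy hash table: colliding items share a bucket list at one index."""
    def __init__(self, size):
        self.size = size
        self.table = [[] for _ in range(size)]

    def _index(self, item):
        return hash(item) % self.size  # map the hash into the table's range

    def add(self, item):
        bucket = self.table[self._index(item)]
        if item not in bucket:
            bucket.append(item)

    def contains(self, item):
        # One hash computation, then a (hopefully tiny) search in one bucket
        return item in self.table[self._index(item)]

t = HashTable(64)
for word in ["apple", "pear", "plum"]:
    t.add(word)
print(t.contains("pear"))    # True
print(t.contains("cherry"))  # False
```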
Sorting
7.9 Comparison Sorting
For this class, we are going to focus exclusively on a type of sorting called "comparison" sorting. In this type of sorting, we assume that when we are given two elements from the set we want to sort, we are able to compare them in some way and say which should come before the other. Comparison sorting is a very useful form of sorting, because we don't need to know anything else about the set we are working with or the elements within it. As long as we can compare them to each other, we have enough information to do the job.
Each of the sorting algorithms we consider is measurable in terms of the number of comparisons it requires in order to turn an arbitrary list into a sorted list. Some will do better than others on lists with particular properties. For example, some of the algorithms will run very quickly on lists that are already sorted, while others will require almost their full running time. With one exception, we will be interested in the worst-case running time of these algorithms.
For most Python code, you will simply want to use the built-in sort() method (or the sorted() function). These use a hybrid approach called Timsort, which mainly runs Mergesort but switches to Insertion Sort at the lowest recursion levels.
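For example (list.sort() sorts in place, while sorted() returns a new list; the optional key parameter controls what gets compared):

```python
data = [33, 1, 21, 7, 14]
data.sort()                # in-place sort; Timsort under the hood
print(data)                # [1, 7, 14, 21, 33]

words = ["pear", "fig", "banana"]
print(sorted(words, key=len))  # new list, compared by length: shortest first
```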
7.10 Bubble Sort
Not all sorting algorithms are very good. I include Bubble Sort as a starting point. It takes a very simple approach that does sort a list... eventually. The algorithms in the following sections perform much better in almost all situations.
In bubble sort, we loop through the list, comparing li[0] to li[1], then li[1] to li[2], and so on. If the two elements are out of order, then we swap them. When we reach the end of the list, we repeat. The sorting ends when we have gone through an entire pass without making a single swap. At this point we know, by transitivity, that the list must be in sorted order. Bubble Sort is O(n^2) in the worst case. To see why, notice that the element which belongs at the end of the list is guaranteed to be there after the first pass. The element which belongs in the second-to-last place is then guaranteed to end up there after the second pass. After n passes (each of which uses exactly n-1 comparisons), the sorting is complete.
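A minimal sketch of the pass-until-no-swaps loop described above (the function name is my own):

```python
def bubblesort(A):
    """Sort A in place by repeatedly swapping out-of-order neighbors."""
    swapped = True
    while swapped:
        swapped = False
        for i in range(len(A) - 1):
            if A[i] > A[i + 1]:
                A[i], A[i + 1] = A[i + 1], A[i]
                swapped = True
        # After each full pass, the largest remaining element has
        # "bubbled" into its final position at the end of the list.

nums = [6, 3, 14, 1, 7]
bubblesort(nums)
print(nums)  # [1, 3, 6, 7, 14]
```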
7.11 Selection and Insertion Sorts
These two sorting algorithms are closely related. Selection sort works by finding the minimum value in the list. This value is placed in position li[0], and removed from the unsorted list. Then the minimum of the remaining list is found, removed, and placed into li[1]. This process is repeated until the entire list has been transferred into a sorted list.
Insertion sort places the first element from the original list into a new list. It then places the second element from the original list into its sorted position in the new list (either before or after the first element). This repeats with each successive element of the original list, until each one has been placed into the correct positions and the resulting list is a sorted version of the original list. Both Insertion Sort and Selection Sort are O(n^2) algorithms. In practice, they are much faster than Bubble Sort even though they share the same big-O running time. In the case of Selection Sort, it takes O(n) time to find each of the n minima, and O(1) time to place them appropriately in the new list. In Insertion Sort, it takes only O(1) time to find the next element to add, but it takes O(n) time to place that element in the correct position.
def insertionsort(A):
    #This implementation sorts A in place
    #So we do not need to use extra space
    for x in range(len(A)):
        cur = A[x] #Current element
        done = False
        y = 0
        while not done:
            if x-y == 0:
                done = True
                A[0] = cur
            elif cur < A[x-y-1]:
                A[x-y] = A[x-y-1]
                y += 1
            else:
                done = True
                A[x-y] = cur
Task:
- Implement Selection Sort.
7.12 Mergesort
How hard is it to sort a list, if you have two halves of the list already sorted? Mergesort asks this question. The answer is that the merging process takes O(n) time. By recursively sorting smaller and smaller sublists, then merging them back together, Mergesort manages to run in worst-case time O(n log n). This is the best running time of any comparison sort algorithm. Running Mergesort consists of two "phases". During the first phase, we recursively break our list down in half, then in half again, and again, until we are left with sublists of length 1. During the second phase, we recursively merge our lists back together. To merge two sublists, we maintain a "pointer" into each, starting by pointing at the first element of each sublist. One of these elements must be the smallest element in the merged list. We can compare them to determine which it is, add that element to the first position in the merged list, and advance that pointer one step. The second element in the merged list now must be one of the two elements that we are pointing at, and we repeat the process. In this way, we can sweep through the two sublists once each to generate our merged list.
def mergesort(A):
    if len(A) <= 1:            #Also handles the empty list
        return A
    left = mergesort(A[:len(A)//2])   #Integer division: slice indices must be ints
    right = mergesort(A[len(A)//2:])
    return merge(left, right)

def merge(A, B):
    C = []
    a = 0
    b = 0
    while a < len(A) and b < len(B):
        if A[a] <= B[b]:
            C.append(A[a])
            a += 1
        else:
            C.append(B[b])
            b += 1
    while a < len(A):
        C.append(A[a])
        a += 1
    while b < len(B):
        C.append(B[b])
        b += 1
    return C
7.13 Quicksort
Like Mergesort, Quicksort works by recursively splitting the list to be sorted into two pieces. Unlike Mergesort, which does its work on the return journey (during the merge step), Quicksort does its work upfront. To run Quicksort, we select an element which we call the Pivot. This pivot can, in theory, be any element of the list, although in practice we would like to select one that is not too close to either the maximum or the minimum. Once we have selected a pivot, we compare every other element in the list to it, separating our list into three camps: elements less than the pivot, elements greater than the pivot, and the pivot itself. Once we have sorted the first two sublists, we can trivially put a full sorted list back together, since every element in the "less than" group goes before the pivot, and every element in the "greater than" group goes after it. We can sort the sublists recursively using Quicksort again, with the bottom level occurring where our "less than" and "greater than" groups each contain zero or one element. If we pick our pivots naively (such as by just taking the first element in the unsorted list), then Quicksort is O(n^2) in the worst case. On average, however, it runs in O(n log n) time, similarly to Mergesort.
def quicksort(A):
    #Naively quicksort a list A
    if len(A) <= 1:
        return A
    left = []
    right = []
    pivot = A[0]
    for x in A[1:]:
        if x < pivot:
            left.append(x)
        else:
            right.append(x)
    left = quicksort(left)
    right = quicksort(right)
    return left + right
Task:
- Consider the quicksort implementation above. I have included a small (but critical) bug in my implementation. Find it and fix it.
- Now that you have a working naive implementation, you can improve it by selecting a better pivot. Modify your algorithm so that it considers the first three elements in the list, and selects their median to be the pivot.