Trees

Introduction: Trees and Binary Trees
Binary Search Trees
- Implementing BSTs
Balanced Trees
Answers to Self-Study Questions

Introduction

Sequences, stacks, and queues are all linear structures: in all three data structures, one item follows another. Trees will be our first non-linear structure:

More than one item can follow another.
The number of items that follow can vary from one item to another.

Trees have many uses:

representing family genealogies
as the underlying structure in decision-making algorithms
to represent priority queues (a special kind of tree called a heap)
to provide fast access to information in a database (a special kind of tree called a B-tree)

Here is the conceptual picture of a tree (of letters):
          A
        / | \
       B  C  D
      /|  |
     E F  G
         / \
        H   I
            |
            J

each letter represents one node
the lines connecting them (should be arrows) are called edges
the topmost node (with no incoming edges) is the root
the bottom nodes (with no outgoing edges) are the leaves
So a (computer science) tree is kind of like an upside-down real tree...
A path in a tree is a sequence of (zero or more) connected nodes; for example, here are 3 of the paths in the tree shown above:
A -> B -> F
C -> G
D

The length of a path is the number of nodes in the path, e.g.:

Path           Length
----           ------
A -> B -> F    3
C -> G         2
D              1
The height of a tree is the length of the longest path from the root to a leaf; for the above example, the height is 5 (because the longest path from the root to a leaf is: A->C->G->I->J).
The depth of a node is the length of the path from the root to that node; for the above example:

the depth of I is 4
the depth of J is 5
the depth of A is 1
Given two connected nodes like this:
    A
    |
    B
Node A is called the parent, and node B is called the child.
Given this picture:
      A
     / \
  ...   ...
The left "..." is the left subtree of A, and the right "..." is the right subtree of A.
NOTE: We will treat trees more like linked lists than like lists; i.e., as low-level data structures to be used to implement abstract data types, rather than as abstract data types themselves
An important special kind of tree is the binary tree. In a binary tree:

Each node has 0, 1, or 2 children.
Each child is either a left child or a right child.

A binary tree is balanced if, for every node in the tree, that node's left and right subtrees have the same height (an empty subtree has height 0). For example, here are some binary trees with the height of each node indicated:

      3               3                3               4
     / \             / \             /   \           /   \
    2   1           2   2           2     2         2     3
   / \             /     \         / \   / \       / \   / \
  1   1           1       1       1   1 1   1     1   1 1   2
                                                           /
                                                          1

 not balanced    not balanced      balanced        not balanced

Representing Trees
Since a binary-tree node never has more than two children, a node can be represented using a class with 3 fields: one for the data in the node, plus two child pointers:

class Treenode {
    Object data;
    Treenode leftChild;
    Treenode rightChild;
}

However, since a general-tree node can have an arbitrary number of children, a fixed number of child-pointer fields won't work. There are several other possibilities:

Use an array of child pointers: Treenode[] children;. Start with an array of some reasonable size and expand the array if necessary (if the array is full and a new child is added).
Use a linked list of child pointers. First, define a class to represent the nodes of the list:

class Listnode {
    Treenode data;
    Listnode next;
}

Then define a field of the Treenode class to point to the first node in the list: Listnode children;.
Use a Sequence of child pointers: Sequence children;. (Note that the sequence can be implemented using either an array or a linked list.)
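As a rough illustration of the list-based options, the sketch below keeps each node's children in a java.util.List, which stands in here for the Sequence mentioned above. The names GeneralTreenode, addChild, height, and exampleTree are assumptions made for this example, not part of the notes. It rebuilds the tree of letters from the Introduction and computes its height using the definition given there (the number of nodes on the longest root-to-leaf path).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical general-tree node: children are kept in a list, left to right.
class GeneralTreenode {
    Object data;
    List<GeneralTreenode> children = new ArrayList<>();

    GeneralTreenode(Object data) {
        this.data = data;
    }

    // Adds a new rightmost child and returns it so that calls can be chained.
    GeneralTreenode addChild(Object childData) {
        GeneralTreenode child = new GeneralTreenode(childData);
        children.add(child);
        return child;
    }

    // Height as defined in these notes: the number of nodes on the longest
    // path from the root to a leaf, so a single node has height 1.
    static int height(GeneralTreenode n) {
        if (n == null) {
            return 0;
        }
        int tallest = 0;
        for (GeneralTreenode c : n.children) {
            tallest = Math.max(tallest, height(c));
        }
        return 1 + tallest;
    }

    // Builds the tree of letters pictured in the Introduction.
    static GeneralTreenode exampleTree() {
        GeneralTreenode a = new GeneralTreenode("A");
        GeneralTreenode b = a.addChild("B");
        GeneralTreenode c = a.addChild("C");
        a.addChild("D");
        b.addChild("E");
        b.addChild("F");
        GeneralTreenode g = c.addChild("G");
        g.addChild("H");
        g.addChild("I").addChild("J");
        return a;
    }

    public static void main(String[] args) {
        // The longest root-to-leaf path is A->C->G->I->J, so this prints 5.
        System.out.println(height(exampleTree()));
    }
}
```

A list grows on demand, so it avoids the manual array expansion described for the first option, at the cost of an extra object per node.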
???Do we want to include the representation where each node has a pointer to its first child as well as a pointer to its next sibling??
Here are pictures showing how the tree given at the beginning of these notes would be represented using each of the 3 possibilities. In the case of using a Sequence, it is assumed that the Sequence itself is implemented using a linked list.
MISSING PICTURE

Tree Traversals

It is often useful to iterate through the nodes in a tree:

to print all values
to determine if there is a node with some property
to make a copy
When we iterated through a sequence, we started with the first node and visited each node in turn. For trees, there are many different orders in which we might visit the nodes. There are three common traversal orders for general trees, and one more for binary trees: preorder, postorder, level order, and in-order, all described below. We will use the following tree to illustrate each traversal:
            A
          /   \
         B     C
        /     /  \
       D     E     F
              \   / \
               G H   I

Preorder
A preorder traversal can be defined (recursively) as follows:

visit the root
for each subtree T, from left to right: visit T in preorder
If we use a preorder traversal on the example tree given above, and we print the letter in each node when we visit that node, the following will be printed: A B D C E G F H I.
Postorder
A postorder traversal is similar to a preorder traversal, except that the root of each subtree is visited last rather than first:

for each subtree T, from left to right: visit T in postorder
visit the root
If we use a postorder traversal on the example tree given above, and we print the letter in each node when we visit that node, the following will be printed: D B G E H I F C A.
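The two recursive definitions translate almost line for line into code. The following is a minimal sketch, assuming a general-tree node that keeps its children in a list; OrderedTraversals, Node, and the helper names are invented for this example. It builds the example tree from this section and collects the letters in the order they are visited.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: the class and method names are assumptions of this sketch.
class OrderedTraversals {
    static class Node {
        final String data;
        final List<Node> children = new ArrayList<>();

        Node(String data, Node... kids) {
            this.data = data;
            for (Node k : kids) {
                children.add(k);
            }
        }
    }

    // Preorder: visit the root, then each subtree from left to right.
    static void preorder(Node n, List<String> visited) {
        visited.add(n.data);
        for (Node child : n.children) {
            preorder(child, visited);
        }
    }

    // Postorder: visit each subtree from left to right, then the root.
    static void postorder(Node n, List<String> visited) {
        for (Node child : n.children) {
            postorder(child, visited);
        }
        visited.add(n.data);
    }

    // The example tree used to illustrate the traversals in this section.
    static Node exampleTree() {
        Node b = new Node("B", new Node("D"));
        Node e = new Node("E", new Node("G"));
        Node f = new Node("F", new Node("H"), new Node("I"));
        return new Node("A", b, new Node("C", e, f));
    }

    static String preorderString(Node root) {
        List<String> visited = new ArrayList<>();
        preorder(root, visited);
        return String.join(" ", visited);
    }

    static String postorderString(Node root) {
        List<String> visited = new ArrayList<>();
        postorder(root, visited);
        return String.join(" ", visited);
    }

    public static void main(String[] args) {
        System.out.println(preorderString(exampleTree()));  // A B D C E G F H I
        System.out.println(postorderString(exampleTree())); // D B G E H I F C A
    }
}
```

The only difference between the two methods is whether the node's own data is recorded before or after the loop over its children.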
Level order
The idea of a level-order traversal is to visit the root, then visit all nodes "1 level away" from the root (left to right), then all nodes "2 levels away" from the root, etc. For the example tree, the goal is to visit the nodes in the following order:

A   B C   D E F   G H I
|   |_|   |___|   |___|
|    |      |       |
|    |      |     3 levels away
|    |      |
|    |    2 levels away
|    |
|  1 level away
|
root

A level-order traversal requires using a queue (rather than a recursive algorithm, which implicitly uses a stack). Here's how to print the data in a tree T in level order, using a queue Q:

Q.enqueue(root of T)
while (!Q.isEmpty()) {
    n = Q.dequeue();
    print n.data;
    for each child m of n from left to right: Q.enqueue(m);
}
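The queue-based algorithm can be tried out with a short Java sketch. It assumes java.util.ArrayDeque as the queue and a list-of-children node; LevelOrderDemo, Node, and levelOrder are names chosen for this example only. Instead of printing, it returns the visited letters so the order is easy to check.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Illustrative only: the class and method names are assumptions of this sketch.
class LevelOrderDemo {
    static class Node {
        final String data;
        final List<Node> children = new ArrayList<>();

        Node(String data, Node... kids) {
            this.data = data;
            for (Node k : kids) {
                children.add(k);
            }
        }
    }

    // Returns the data of every node in level order (root first, then each
    // level from left to right). An empty tree yields an empty list.
    static List<String> levelOrder(Node root) {
        List<String> visited = new ArrayList<>();
        if (root == null) {
            return visited;
        }
        Queue<Node> q = new ArrayDeque<>();
        q.add(root);                       // Q.enqueue(root of T)
        while (!q.isEmpty()) {
            Node n = q.remove();           // n = Q.dequeue()
            visited.add(n.data);           // "print" n.data
            for (Node m : n.children) {    // children from left to right
                q.add(m);
            }
        }
        return visited;
    }

    // The example tree used to illustrate the traversals in this section.
    static Node exampleTree() {
        Node b = new Node("B", new Node("D"));
        Node e = new Node("E", new Node("G"));
        Node f = new Node("F", new Node("H"), new Node("I"));
        return new Node("A", b, new Node("C", e, f));
    }

    public static void main(String[] args) {
        System.out.println(String.join(" ", levelOrder(exampleTree())));
    }
}
```

Because a queue is first-in, first-out, all nodes of one level are dequeued before any node of the next level, which is exactly the order the traversal needs.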

In-order
An in-order traversal involves visiting the root "in between" visiting its left and right subtrees. Therefore, an in-order traversal only makes sense for binary trees. The (recursive) definition is:

visit the left subtree in in-order
visit the root
visit the right subtree in in-order
If we print the letters in the nodes of our example tree using an in-order traversal, the following will be printed: D B A E G C H F I.
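For binary trees, the in-order definition can be sketched with the three-field node described under Representing Trees. The constructor and the names InOrderDemo, inOrder, and inOrderString are assumptions added for this example. The code builds the example tree, taking care to make D a left child and G a right child, and collects the letters in visiting order.

```java
// Illustrative only: the class and method names are assumptions of this sketch.
class InOrderDemo {
    static class Treenode {
        Object data;
        Treenode leftChild;
        Treenode rightChild;

        Treenode(Object data, Treenode leftChild, Treenode rightChild) {
            this.data = data;
            this.leftChild = leftChild;
            this.rightChild = rightChild;
        }
    }

    // In-order: left subtree, then the root, then the right subtree.
    static void inOrder(Treenode n, StringBuilder out) {
        if (n == null) {
            return;                 // an empty subtree contributes nothing
        }
        inOrder(n.leftChild, out);
        if (out.length() > 0) {
            out.append(' ');
        }
        out.append(n.data);
        inOrder(n.rightChild, out);
    }

    static String inOrderString(Treenode root) {
        StringBuilder out = new StringBuilder();
        inOrder(root, out);
        return out.toString();
    }

    // The example tree used to illustrate the traversals in this section.
    static Treenode exampleTree() {
        Treenode d = new Treenode("D", null, null);
        Treenode g = new Treenode("G", null, null);
        Treenode h = new Treenode("H", null, null);
        Treenode i = new Treenode("I", null, null);
        Treenode b = new Treenode("B", d, null);   // D is B's left child
        Treenode e = new Treenode("E", null, g);   // G is E's right child
        Treenode f = new Treenode("F", h, i);
        Treenode c = new Treenode("C", e, f);
        return new Treenode("A", b, c);
    }

    public static void main(String[] args) {
        System.out.println(inOrderString(exampleTree())); // D B A E G C H F I
    }
}
```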

TEST YOURSELF #1
What is printed when the following tree is visited using (a) a preorder traversal, (b) a postorder traversal, (c) a level-order traversal, and (d) an in-order traversal?

A / \ B C / \ / \ D E F G \ / / \ H I J K

solution

Binary Search Trees

An important special kind of binary tree is the binary search tree (BST). In a BST, each node stores some information including a unique key value. A tree is a BST iff, for every node n in the tree:

All keys in n's left subtree are less than the key in n, and
all keys in n's right subtree are greater than the key in n.
Note: if duplicate keys are allowed, then nodes with values that are equal to the key in node n can be either in n's left subtree or in its right subtree (but not both).
Here are some examples, assuming that each node stores just an integer key:
      6               6                4               4
     / \             / \             /   \           /   \
    4   9           4   9           2     6         2     7
   / \             / \             / \   / \       / \   / \
  2   5           2   7           1   3 5   9     1   3 5   6

    a BST         Not a BST          a BST          Not a BST
                 (7 not < 6)                       (6 not > 7)
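The BST condition must hold for every node against all keys in its subtrees, not just against its immediate children. One way to check this, sketched below under the assumption of unique integer keys, is to pass down the open interval of keys that each subtree is allowed to contain. BstCheck, Node, and isBst are names made up for this example.

```java
// Illustrative only: the class and method names are assumptions of this sketch.
class BstCheck {
    static class Node {
        int key;
        Node left, right;

        Node(int key, Node left, Node right) {
            this.key = key;
            this.left = left;
            this.right = right;
        }

        Node(int key) {
            this(key, null, null);
        }
    }

    // True iff every key in the tree rooted at n lies strictly between lo and hi.
    // A null bound means "no limit on that side".
    static boolean isBst(Node n, Integer lo, Integer hi) {
        if (n == null) {
            return true;                        // an empty tree is a BST
        }
        if (lo != null && n.key <= lo) {
            return false;                       // too small for this position
        }
        if (hi != null && n.key >= hi) {
            return false;                       // too large for this position
        }
        // Keys to the left must be below n.key; keys to the right must be above it.
        return isBst(n.left, lo, n.key) && isBst(n.right, n.key, hi);
    }

    static boolean isBst(Node root) {
        return isBst(root, null, null);
    }

    public static void main(String[] args) {
        // Second example above: 7 sits in the left subtree of 6.
        Node notBst = new Node(6, new Node(4, new Node(2), new Node(7)), new Node(9));
        System.out.println(isBst(notBst)); // false
    }
}
```

Comparing each node only with its two children would wrongly accept the second example, because 7 is a valid right child of 4 even though it is in the left subtree of 6.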

Q: Why care about BSTs?
A: They provide a good way to implement the dictionary abstract data type. Conceptually, a dictionary is a collection of unique keys possibly with some associated information. The dictionary operations include:

Add an entry.
Remove an entry.
Lookup an entry; if each entry includes both a key and some associated information, then if the key is in the dictionary, return the associated information or a special value to indicate that the key was not found. If each entry is just a key, then return true or false to indicate whether it was found.
Print all entries in sorted order.
BSTs are good for implementing dictionaries because as long as the tree is (close to) balanced, the insert, remove, and lookup operations can be implemented to take O(log n) time, where n is the number of dictionary entries. This is because:

These operations involve (at worst) following one path from the root to a leaf (plus some constant extra work), and
in a (close to) balanced tree, the length of such a path is O(log n).
The printInOrder operation will be O(n) (and it can't be better than that since all n values have to be printed).

Implementing BSTs

To implement a binary search tree, we need two classes: one for the individual tree nodes, and one for the BST itself:

class treeNode {
    Object data;
    treeNode left, right;
}

class BST {
    // fields
    treeNode root;    // ptr to the root of the BST

    // methods
    public void insert(Object ob) {...}        // add ob to this BST
    public void delete(Object ob) {...}        // remove ob from this BST if it is there
    public boolean lookup(Object ob) {...}     // return true iff ob is in this BST
    public void print(PrintWriter p) {...}     // print values in order to p
}
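To make the class outline concrete, here is one possible Java sketch of the insert, delete, and lookup methods. It is not the notes' own implementation: the name SimpleBST, the use of Comparable keys instead of Object, the inOrder helper, and the choice to ignore duplicate inserts and missing deletes are all assumptions of this example. Java has no reference parameters, so each recursive helper returns the (possibly new) root of the subtree it was given, and the caller stores that result back.

```java
// Illustrative only: names and design choices here are assumptions of this sketch.
class SimpleBST<K extends Comparable<K>> {
    private class Node {
        K data;
        Node left, right;

        Node(K data) {
            this.data = data;
        }
    }

    private Node root;

    // Follows one path down from the root; true iff k is found.
    public boolean lookup(K k) {
        Node t = root;
        while (t != null) {
            int c = k.compareTo(t.data);
            if (c == 0) {
                return true;
            }
            t = (c < 0) ? t.left : t.right;
        }
        return false;
    }

    public void insert(K k) {
        root = insert(root, k);
    }

    // Inserts k as a new leaf; a key that is already present is ignored.
    private Node insert(Node t, K k) {
        if (t == null) {
            return new Node(k);
        }
        int c = k.compareTo(t.data);
        if (c < 0) {
            t.left = insert(t.left, k);
        } else if (c > 0) {
            t.right = insert(t.right, k);
        }
        return t;
    }

    public void delete(K k) {
        root = delete(root, k);
    }

    // Removes k if present; deleting a missing key is a no-op.
    private Node delete(Node t, K k) {
        if (t == null) {
            return null;
        }
        int c = k.compareTo(t.data);
        if (c < 0) {
            t.left = delete(t.left, k);
        } else if (c > 0) {
            t.right = delete(t.right, k);
        } else if (t.left == null) {
            return t.right;                 // leaf, or only a right child
        } else if (t.right == null) {
            return t.left;                  // only a left child
        } else {
            K max = largest(t.left);        // two children: use max of left subtree
            t.data = max;
            t.left = delete(t.left, max);
        }
        return t;
    }

    private K largest(Node t) {
        while (t.right != null) {
            t = t.right;
        }
        return t.data;
    }

    // Keys in sorted order, separated by single spaces.
    public String inOrder() {
        StringBuilder sb = new StringBuilder();
        inOrder(root, sb);
        return sb.toString().trim();
    }

    private void inOrder(Node t, StringBuilder sb) {
        if (t == null) {
            return;
        }
        inOrder(t.left, sb);
        sb.append(t.data).append(' ');
        inOrder(t.right, sb);
    }
}
```

For example, inserting 6, 4, 9, 2, 5 and then deleting 4 replaces the value 4 with 2 (the largest key in its left subtree), leaving the keys 2 5 6 9 in order.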

Let's think about the lookup method first. (The code in the rest of these notes is written in C++ style: Tree is a pointer to a treeNode, and the data in each node is an int key.)

1. Lookup

bool Lookup (Tree T, int k) {
    // base cases
    // (1) T is empty - return false
    // (2) k is in the node pointed to by T - return true
    if (NULL == T) return (false);
    if (k == T -> data) return (true);

    // recursive cases: look in left or right subtree depending on relationship
    // of k to value in node pointed to by T
    if (k < T -> data) return (Lookup (T -> left, k));
    else return (Lookup (T -> right, k));
}

time for Lookup
. always follows a path from root down; worst-case, goes all the way to a leaf
. time depends on "shape" of tree:
    worst case:   all nodes have one child (tree really just a linked list)
                  time is O(n), n = # nodes in tree
    best case:    tree is as balanced as possible (leaf depths differ by at most 1,
                  only parents of leaves have just 1 child)
                  time = O(log n)
    average case: considering all possible lookups in all possible trees w/ n nodes:
                  O(log n)

2. Insert

void Insert (Tree & T, int k) {
    // insert k into T (as a new leaf) maintaining BST properties
    // Note: T itself may change, so is passed by reference
    if (NULL == T) {
        // here's where T itself gets changed
        T = new treeNode;
        T -> data = k;
        T -> left = T -> right = NULL;
    }
    else if (k < T -> data) Insert (T -> left, k);
    else if (k > T -> data) Insert (T -> right, k);
}

Q: what to do if k is already in the tree?
A: nothing, or error, or insert it a second time

time for Insert
. like Lookup, in worst case, must follow path from root to leaf
  so: for tree w/ n nodes
. worst-case time is "linear" tree: O(n)
. in a balanced tree, worst-case time is O(log n)
. average time is O(log n)

3. Remove

(the following code just LOCATES the node to be removed; more code coming up)

void Remove (Tree & T, int k) {
    // find the node to be removed - the (first) one that contains k
    // (error if no such node)
    // remove it from the tree; return storage
    assert (T != NULL);
    if (T -> data == k) {
        // this is the node to be removed
        . . .
    }
    else if (k < T -> data) Remove (T -> left, k);
    else Remove (T -> right, k);
}

Note: could also decide that Remove of non-existent value is just a no-op:
. remove assert
. add condition (k > T -> data) to second "else"

Remove continued: What to do once T -> data == k?

case 1: T is a leaf
. free the storage
. set T to NULL

        if ((T -> left == NULL) && (T -> right == NULL)) {
            delete T;
            T = NULL;
        }

case 2: T has just one child
"replace" T w/ its child
. don't lose the child (use a tmp ptr)
. free the storage of the removed node
. set T to point to the child

        if (T -> left == NULL) {
            treeNode * tmp = T -> right;
            delete T;
            T = tmp;
        }
        else if (T -> right == NULL) {
            ... similar code...
        }

case 3: T has two kids
. we can't just remove the node leaving a "hole" in the tree
. we can't replace it with a child, because what would we do with the other child?
* solution: replace the value at the node w/ the value from some other node lower down in the tree, then (recursively) remove that other node

must choose that "other" value so that we retain BST properties; i.e., it still must be true that all values in the left subtree are less than the "other" value, and all values in the right subtree are greater than the "other" value

Q: what value can we use so that these properties are maintained??
A: either the largest value from the left subtree, or the smallest value from the right subtree
. we'll arbitrarily choose the largest value from the left subtree, so:
    (1) find the largest value from T's left subtree
    (2) replace T -> data w/ that value
    (3) remove that value from T's left subtree

        else {
            int tmp = Max (T -> left);
            T -> data = tmp;
            Remove (T -> left, tmp);
        }
} // end function Remove

Summary of Remove operation:
step 1: find node to be removed
step 2: case 1  node is leaf - remove it
        case 2  node has one child - replace node w/ child
        case 3  node has two children
                . replace value in node w/ max of left subtree
                . recursively remove that value from left subtree

time for Remove
. cases 1 and 2: find node to be removed (follow path down from root); do O(1) work at that node
  time = length of path (same as Lookup, Insert)
. case 3
    (a) find node to be removed (follow path down from root)
    (b) get max value in left subtree (finish following path down)
    (c) recursive call on Remove starting w/ root of left subtree
        note: recursive call must be case 1 or case 2 (you should be able to say why!)
        so its time is proportional to height of left subtree
  So all of case 3 is, in the worst case, proportional to height of tree (same as Insert and Lookup).

4. PrintInOrder

recall: if node n holds value k:
    (1) all values in n's left subtree are < k
        i.e., should be printed first (before printing k)
    (2) all values in n's right subtree are > k
        i.e., should be printed after printing k

So, to print all values in tree T in order:
    (1) (recursively) print all values in left subtree
    (2) print value @ root of T
    (3) (recursively) print all values in right subtree

This is called an IN ORDER traversal of T

void PrintInOrder (Tree T) {
    if (T != NULL) {
        PrintInOrder (T -> left);      // time = size of left tree
        cout << T -> data << " ";      // O(1)
        PrintInOrder (T -> right);     // time = size of right tree
    }
}

total time = O(n), n = # nodes in tree, regardless of tree shape

Other traversal orders: PreOrder, PostOrder (code similar to PrintInOrder):

Preorder:
    print the root
    print the left subtree in preorder
    print the right subtree in preorder

Postorder:
    print the left subtree in postorder
    print the right subtree in postorder
    print the root

Balanced Trees: 2-3 Trees

Problem: worst-case time for Lookup, Insert, Remove in BST: O(n) (when tree is unbalanced)
Solution: BALANCED TREES
    height ALWAYS O(log n), n = # of nodes in tree

We will look at 1 kind of Balanced Tree: 2-3 Tree. Others are in book (not on exam).

2-3 Tree:
. Every non-leaf has either 2 or 3 children
. All leaves are at the same depth
. Information (keys) in a 2-3 tree is stored ONLY at leaves (internal nodes are for organization only)
. Info at leaves is ordered left to right
. Each internal node has child ptrs. and
    (1) value of max key in LEFT subtree (leftMax)
    (2) value of max key in MIDDLE subtree (middleMax)
  Note: if only 2 kids, they are Left, Middle (not left, right)

Example

                          ------------
                          |  4 | 12  |
                          ------------
                        /      |       \
                      /        |         \
          ------------    ------------    ------------
          |  2 |  4  |    |  7 | 10  |    | 15 | 20  |
          ------------    ------------    ------------
            /    |          /   |   \       /   |   \
           2     4         7   10   12     15   20   30

Operations on a 2-3 Tree

1. Lookup: look up value k in tree T

Base cases:
    (1) T is empty (NULL): return false
    (2) T is just a leaf node: return true iff value @ node == k
Recursive cases:
    . k <= leftMax: look up k in left subtree
    . leftMax < k <= middleMax: look up k in middle subtree
    . middleMax < k: look up k in right subtree

time for Lookup:
. # calls = height of tree
. height of tree is O(log n) for n = # NODES in tree
. actual values only at leaves but # leaves > n/2 (i.e., more than 1/2 the nodes in the tree are leaves)
  so time is O(log m) for m = # key VALUES in tree

2. Insert: insert value v into tree T, maintaining 2-3 tree properties

Step 1: Find the node n that will be the parent of the new node
    i.e. do not search all the way down to a leaf; stop @ a parent of (2 or 3) leaves
    note: This requires special-case code for empty trees and for trees w/ a single node
    so form of Insert will be:
        if tree is empty ...
        else if tree is just 1 node ...
        else call Insert1 (T, v)
    where Insert1 is the recursive fn that handles all but the 2 special cases

    To find n, parent of new node:
    . base case: T's kids are all leaves - found! (n is T)
    . recursive cases:
        v < leftMax: insert v into left child
        v < middleMax or only 2 kids: insert v into middle child
        v > middleMax and 3 kids: insert v into right child

Once n is found:

Case 1: n has only 2 children
    Insert v as appropriate child of n:
    (1) v < leftMax(n)
        make v n's left child (move others over)
        fix values leftMax(n) and middleMax(n)
        no possibility of change to an ancestor's leftMax or middleMax (because new value not max child)
    (2) v between leftMax(n) and middleMax(n)
        make v n's middle child
        fix middleMax(n)
    (3) v > middleMax(n)
        make v n's right child
        fix middleMax fields of n's ancestors as needed

Case 2: n already has 3 kids
    (1) make v the appropriate new child of n, anyway
        now n has 4 kids
    (2) create new internal node m - give m n's two rightmost kids
        (fix n's, m's leftMax, middleMax)
    (3) add m as appropriate new child of n's parent
        if n's parent had only 2 kids - quit
        else keep creating new nodes recursively up the tree
        if the root is given 4 kids
            create new node m as above
            create new root w/ kids n and m
    (4) fix leftMax and middleMax of ancestors as needed

time for Insert:
    step 1: (find node n) involves following a path from root to parent of leaves: O(height of tree) = O(log n)
    step 2: worst case involves adding new nodes all the way back up from leaf to root, also O(log n)
    So total time is O(log n).

3. Remove: remove value k from tree T

step 1: Find n, parent of node to be removed
    (special case first for T just one node containing k - delete it, make T NULL)
step 2:
    case 1: n has 3 kids
        remove kid w/ value k
        fix leftMax, middleMax at n and n's ancestors
    case 2: n has only 2 kids
        2a: n is the root of the tree
            remove node w/ k and root, leaving other kid as entire tree
        2b: n has a left or right sibling w/ 3 kids
            . remove node w/ k
            . "steal" one of sibling's kids
            . fix leftMax, middleMax of n, sibling, ancestors
        2c: sibling(s) have only 2 kids
            . remove node w/ k
            . make remaining kid a child of n's sibling
            . fix leftMax, middleMax

time for Remove: (similar to Insert)
    worst case involves 1 traversal down to find n + another "traversal" up removing nodes along the way
    (traversal up is really actions that happen after the recursive call has finished)
    So total time is 2 * height = O(log n)

DISCUSSION: How to define a 2-3 tree node?

Leaf and non-leaf nodes store different things:
    leaf:     key value
    non-leaf: leftMax, middleMax, 3 child ptrs
Also, we need to be able to tell when a node is a leaf.

. easiest: use struct w/ all fields:

    struct TreeNode {
        bool isLeaf;
        int key;
        int leftMax, middleMax;
        TreeNode *left, *middle, *right;
    };

. could save some space by
    using one field for both key and leftMax
    using left child == NULL to test for "isLeaf"
  in this case, probably want to define functions as follows
  (good idea anyway so that actual representation can change!):

    bool IsLeaf (TreeNode *T) { return (T -> left == NULL); }
    int Key (TreeNode *T) { assert (IsLeaf (T)); return (T -> leftMax); }
    int LeftMax (TreeNode *T) { assert (! IsLeaf (T)); return (T -> leftMax); }
    etc.

2-3 TREE SUMMARY

o info is stored only at leaves, ordered left-to-right
o non-leaf nodes have 2 or 3 kids (not 1)
o non-leaf nodes also have leftMax, middleMax values (as well as pointers to children)
o all leaves are at same depth
o height of tree is O(log n), n = # nodes in tree
o at least half the nodes are leaves, so height of tree is also O(log n) for n = # values stored in tree

SUMMARY: TREE DICTIONARIES

                          BST             2-3 Tree
                          ---             --------
where are values stored   every node      leaves only

extra info @ nodes        2 child ptrs.   leftMax, middleMax,
                                          3 child ptrs.

worst-case time for       O(n)            O(log n)
Lookup, Insert, Remove
(n = # values stored in tree)

Representing Binary Trees Using Arrays

Method 1: use 3 arrays to hold: values, left child "ptrs", right child "ptrs"
(a pointer is really the INDEX in which information about the child is stored in the array)

Example

        H                  value   left   right
       / \
      B   K          [0]     H       1      2
       \             [1]     B      -1      3       (-1 means no child)
        D            [2]     K      -1     -1
       / \           [3]     D       5      4
      C   F          [4]     F      -1     -1
                     [5]     C      -1     -1

. if nodes can be REMOVED, must maintain free list (linking via "right child" array)

Example

before removing anything; firstFree is the index of the first free space in the array:

        H                  value   left   right
       / \
      B   K          [0]     H       1      2
       \             [1]     B      -1      3
        D            [2]     K      -1     -1
       / \           [3]     D       5      4
      C   F          [4]     F      -1     -1
                     [5]     C      -1     -1
                     [6]     ?       ?      7       <-- next free space is array[7]
                     [7]     ?       ?      8       <-- next free space is array[8]
    firstFree: 6     [8]     ?       ?     -1       <-- no more free spaces

after removing F:

        H                  value   left   right
       / \
      B   K          [0]     H       1      2
       \             [1]     B      -1      3
        D            [2]     K      -1     -1
       /             [3]     D       5     -1
      C              [4]     ?       ?      6       <-- next free space is array[6]
                     [5]     C      -1     -1
                     [6]     ?       ?      7       <-- next free space is array[7]
                     [7]     ?       ?      8       <-- next free space is array[8]
    firstFree: 4     [8]     ?       ?     -1       <-- no more free spaces

. Note: when a node is "removed", that space is added to front of free list

Method 2: single array of values if there is a special "empty" value, else 2 arrays: values & booleans
. root's value is stored in A[1]
. if node's value is in A[n]
    left child is in A[n*2]
    right child is in A[n*2+1]
. if a node has NO left child, A[n*2] contains the special "empty" value (similarly for no right child)
  if there is no special "empty" value, then the 2nd array contains "false" for every "empty" position in the 1st array

Example (use "" as the special "empty" value)

        H                  value
       / \
      B   K          [1]     H
       \             [2]     B
        D            [3]     K
       / \           [4]
      C   F          [5]     D
                     [6]
                     [7]
                     [8]
                     [9]
                     [10]    C
                     [11]    F

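The index arithmetic of Method 2 can be checked with a short Java sketch. It stores the example tree in a String array with the root at index 1 and uses null as the special "empty" value; ArrayTreeDemo and its method names are assumptions made for this example. An in-order traversal then needs no pointers at all, only the rules n*2 and n*2+1.

```java
// Illustrative only: the class and method names are assumptions of this sketch.
class ArrayTreeDemo {
    // Index 0 is unused; null plays the role of the special "empty" value.
    static String[] exampleTree() {
        String[] a = new String[12];
        a[1] = "H";
        a[2] = "B";
        a[3] = "K";
        a[5] = "D";    // right child of B: 2*2+1
        a[10] = "C";   // left child of D: 5*2
        a[11] = "F";   // right child of D: 5*2+1
        return a;
    }

    static int left(int n) {
        return 2 * n;
    }

    static int right(int n) {
        return 2 * n + 1;
    }

    // A position holds a node only if it is inside the array and not "empty".
    static boolean exists(String[] a, int n) {
        return n < a.length && a[n] != null;
    }

    static void inOrder(String[] a, int n, StringBuilder out) {
        if (!exists(a, n)) {
            return;
        }
        inOrder(a, left(n), out);
        out.append(a[n]).append(' ');
        inOrder(a, right(n), out);
    }

    static String inOrderString(String[] a) {
        StringBuilder out = new StringBuilder();
        inOrder(a, 1, out);
        return out.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(inOrderString(exampleTree())); // B C D F H K
    }
}
```

The example also shows the cost of this layout: six values occupy twelve array slots, because every missing node on the way down to C and F still reserves its position.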