Graphs
8.1 Building on Trees
Recall that last week we discussed trees as an extension of linked lists, where trees are allowed to have multiple "next" elements. The resulting structures are defined by a root node, interior nodes, and leaf nodes. The root has children, but no parent node. Leaf nodes have parent nodes, but no children. Interior nodes have both.
In trees, a given node only has one parent node. This gives us the ability to talk about the depth of a node, or the height of the entire tree. If we remove that restriction, the result is a "graph".
From an implementation standpoint, the Node class we defined for trees will also work for graphs. Both trees and graphs have nodes, and the nodes in both have children (or "next" elements, if you prefer). Since there is not necessarily a root node in a graph, and so no notion of the depth of a node, we do not generally use the parent/child terminology for graphs. Instead, we refer to the set of next elements of a node as its "neighborhood". We also say that two nodes are "adjacent" if one of them is a next element of the other.
8.2 Formal Definition
A graph is defined by a set of nodes (sometimes called "vertices" or abbreviated V), and a set of edges (E) connecting them.
Nodes may have data associated with them. Edges may have properties of direction and/or weight.
If there is a sequence of edges leading from a node A to a node C (such as in example 1 below), then we say there is a path from A to C. If there is a path from some node A to itself (such as in example 3 below), then we say that the graph contains a cycle. We may also describe a path which includes some node twice as a cyclic path.
8.3 Directed versus Undirected Graphs
1) A --> B --> C

2) D -- E -- F
        |
        G

3) H -- I
   |    |
   J -- K
Which of the depictions above show a linked list? A tree? A graph? Both a linked list and a tree have a notion of the order of nodes. A graph does not have an inherent starting point, but it can still have a notion of order. A "directed" graph is defined in terms of nodes and "next" relationships, where B in A.nex does not imply A in B.nex. A directed graph is usually drawn with arrows from node to node indicating the direction of connections. Following the edges of a directed graph, it is often possible to reach some vertex from which you cannot leave. All trees have such dead-end vertices (namely, their leaves). The first graph below shows an example of a directed graph with no dead-ends:
L --> M
^     |
|     v
N <-- O

P <--> Q --> R
       |
       v
       S
The second graph above almost meets the requirements to be a tree, but P and Q are each in the other's next set. Often, we will want all edges in the graph to have this bi-directional property. Then we have what we call an "undirected" graph. We no longer need to indicate the direction of edges if all edges can be traversed in either direction, so in undirected graphs edges are usually depicted as line segments.
8.4 Self-loops and Multigraphs
While we are stretching rules and definitions, should we allow a node to be in its own next set? Should we allow a node to appear in the same next set twice? There is no definitive answer to either question; each choice leads to a perfectly valid definition of a type of graph. In general, you should plan to allow self-loops unless you are using a graph for a setting in which a self-loop would be unreasonable. A graph that permits at most one edge from node A to node B is simply called a graph (a graph with neither self-loops nor repeated edges is often called a "simple graph"). If we permit multiple A -> B edges, then it is called a multigraph instead.
Task:
- Brainstorm data which you might describe using a graph. Do you want to allow self-loops in that setting? Is a graph or a multigraph more appropriate?
8.5 Weighted and Unweighted Graphs
For most graphs, the data of interest lies within the nodes, or in the arrangement of nodes and edges. Sometimes, we also want to associate a value with each edge. This value is most often a cost or weight associated with the edge. If, for example, our graph represented a road network, nodes might indicate cities (or intersections, for a smaller scale). Edges would then indicate roads, and we would likely want to know the length of each road. One common activity with a graph is to calculate the shortest path from one point to another, in which case the "weights" of edges become crucial information.
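One simple way to store such a weighted road network is a dictionary mapping each node to a dictionary of neighbor-to-weight entries. A sketch; the city names and mileages below are only for illustration:

```python
# A small, made-up road network: each node maps to a dict of
# neighbor -> distance (the edge weight).
roads = {
    'Madison':   {'Chicago': 148, 'Milwaukee': 79},
    'Milwaukee': {'Madison': 79, 'Chicago': 92},
    'Chicago':   {'Madison': 148, 'Milwaukee': 92},
}

# The weight of the edge from Madison to Milwaukee:
print(roads['Madison']['Milwaukee'])  # 79
```

Since the graph is undirected, each edge appears twice (once in each endpoint's dictionary).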
8.6 Connectedness
An undirected graph is said to be connected if there is a path between any two given nodes. Otherwise, it is called disconnected (you will also see "unconnected"). Connectedness in undirected graphs is generally pretty easy to spot, as in the example below:
A -- B        A -- B
|    |
C -- D        C -- D

Connected     Unconnected
In the case of a directed graph, we introduce two more terms. A directed graph is said to be strongly connected if, for any given pair of nodes, there is a path from the first to the second. (Note that there must also be a path from the second to the first, since we could have selected the nodes in the other order.) A directed graph is weakly connected if it would be connected as an undirected graph: if you allow yourself to traverse each edge in either direction, not just the direction indicated, you can find a path from any node to any other node.
A <-- B        A <-- B        A --> B
|     ^        ^     |
v     |        |     v
C --> D        C --> D        C --> D

Strongly Connected   Weakly Connected   Neither
8.7 Special Graphs
There are a variety of graphs which have their own names. In this section, I mention a few of the most important ones.
- We have already seen the Tree, a special graph in which there is exactly one non-cyclic path between any given pair of nodes.
- A related special graph is the Directed Acyclic Graph (DAG), which as the name suggests is a directed graph containing no cycles. If we depict a tree as a directed graph, then it is an example of a DAG.
- The Complete Graph is a special graph where every node is adjacent to every other node.
8.8 Uses for Graphs
Most uses for graphs use nodes to represent some type of object and edges to represent relationships between those objects.
- Networks. A node might indicate an airport, and an edge might indicate a flight from one airport to another. Edge weight might indicate the cost of the flight, or the amount of time required from start to destination.
- An interdependent task list. If each task requires some set of other tasks to be done first, we can describe this sort of interdependence relationship with an edge.
- Flow charts. Each node represents a state in some process. We draw an edge from one state to another if we can move directly (without going through intermediate states) from the first to the second. If the process is itself a program, we call this a "control flow graph".
With data described as graphs, we can answer questions such as
- What is the cheapest route from Madison to Philadelphia?
- Which classes must I take before taking CS540?
- Is it possible for this program to try to use variable x before it is defined?
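As a sketch of how the prerequisite question might be answered: topologically sorting the prerequisite graph yields an order in which every course comes before the courses that depend on it. The graph below is made up for illustration; only CS540 comes from the question above.

```python
# Directed edges point from a prerequisite to the courses that need it.
# (Hypothetical course numbers, for illustration only.)
prereqs = {'CS200': ['CS300'],
           'CS300': ['CS400', 'CS540'],
           'CS400': ['CS540'],
           'CS540': []}

def topo_order(graph):
    """Return the nodes so that every node comes before its neighbors."""
    order = []
    visited = set()
    def visit(node):
        if node in visited:
            return
        visited.add(node)
        for neighbor in graph[node]:
            visit(neighbor)
        order.append(node)   # Appended only after everything downstream.
    for node in graph:
        visit(node)
    order.reverse()
    return order

print(topo_order(prereqs))  # ['CS200', 'CS300', 'CS400', 'CS540']
```

Note that this only works if the graph has no cycles; a cyclic prerequisite list would have no valid ordering.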
Implementing Graphs in Python
In this section, I offer three different approaches to implementing a graph.
8.9 Node and Next
class GraphNode():
    def __init__(self, data, nex):
        self.data = data
        self.nex = nex  # A list of nodes adjacent to this node.
This approach is the easiest to read of the options. I recommend it for settings where clarity is a larger issue than performance. To maintain the entire graph, I would keep a list of the nodes in the graph (in some arbitrary order).
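A sketch of this approach in use, building the small undirected graph P -- Q -- R. Since adjacent nodes must exist before we can reference them, we create the nodes with empty neighborhoods and fill them in afterward:

```python
# GraphNode repeated from above so the example is self-contained.
class GraphNode():
    def __init__(self, data, nex):
        self.data = data
        self.nex = nex  # A list of nodes adjacent to this node.

# Create the nodes first, then fill in each neighborhood.
p = GraphNode('P', [])
q = GraphNode('Q', [])
r = GraphNode('R', [])
p.nex.append(q)
q.nex.extend([p, r])
r.nex.append(q)

nodes = [p, q, r]  # The whole graph: a list of nodes in arbitrary order.
print([n.data for n in q.nex])  # ['P', 'R']
```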
8.10 Graph as Dictionary
graph = {'P': ['Q'],
         'Q': ['P','R','S']}
Each node in a graph as dictionary is indicated by some short label. A directed edge is indicated by adding the target of that edge to the source's list of neighbors. So the example above describes the following graph:
P <--> Q --> R
       |
       v
       S
You may notice that there are not even entries in the dictionary for nodes that do not have any departing edges. We could add such nodes to the dictionary, with empty lists as their neighborhoods. Whether we want to do so depends on our plans to use this graph. It is best used for tasks such as pathfinding that do not care about the data stored within nodes.
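One way to cope with those missing entries is the dictionary's get method, which lets us treat an absent node as having an empty neighborhood. A sketch:

```python
graph = {'P': ['Q'],
         'Q': ['P', 'R', 'S']}

# R has no entry; graph.get returns an empty neighborhood
# instead of raising a KeyError.
print(graph.get('R', []))  # []

# An edge test: is there an edge from Q to S?
print('S' in graph.get('Q', []))  # True
```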
8.11 Adjacency Matrix
0 1 0 0
1 0 1 1
0 0 0 0
0 0 0 0
An adjacency matrix representation uses the fact that every edge is either present or not to build a matrix describing the complete graph, entering 1s or 0s into the matrix to indicate whether a particular edge is present. This approach allows for very rapid answers to certain types of questions, although the details of those techniques are outside the scope of this class. The example graph from the previous section is shown above in adjacency matrix form. Note that even the short node labels P, Q, R, and S have disappeared.
You can represent an adjacency matrix in python using a list of lists, but you might use a NumPy array instead (the older np.matrix class is now deprecated in favor of np.array):

import numpy as np
a = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 0, 0, 0],
              [0, 0, 0, 0]])
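As a sketch of one of the rapid calculations mentioned above, using plain lists rather than NumPy: if A is the adjacency matrix, then entry (i, j) of the matrix product A*A counts the paths of length two from node i to node j.

```python
# The adjacency matrix from above; row/column 0 is P, 1 is Q, etc.
A = [[0, 1, 0, 0],
     [1, 0, 1, 1],
     [0, 0, 0, 0],
     [0, 0, 0, 0]]

n = len(A)
# Square the matrix: A2[i][j] = sum over k of A[i][k] * A[k][j],
# which counts the two-edge paths i -> k -> j.
A2 = [[sum(A[i][k] * A[k][j] for k in range(n)) for j in range(n)]
      for i in range(n)]

print(A2[0][0])  # 1: the single length-two path P -> Q -> P
print(A2[0][2])  # 1: the single length-two path P -> Q -> R
```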
Graph operations: Searching
8.12 Overview
In order to use graphs effectively, we need an orderly way to traverse the data they hold. In this section, we will examine three approaches to traversing data in a graph. The first two approaches are used for searching; they differ in the order in which they look at the nodes of a graph. The third approach we will consider is used for finding specifically the shortest path between two nodes.
8.13 Depth-First Search
Depth-first search (DFS) can be used to answer a variety of questions. These include
- Is the graph connected?
- Is there a path from A to B?
- Is there a cycle?
- Is there an ordering of the nodes such that every node comes before all nodes in its neighbor set?
We begin DFS at some arbitrary node X. We mark this node as "visited". We then recursively apply DFS to each unvisited neighbor of X. We keep the "visited" marker, plus any additional information about a node we need to save, within the node itself.
For example, let's consider how to determine whether the following graph is connected.
A
|
B -- C
|    |
D -- E -- F
If we begin at A, we mark A visited and maintain a count of the number of successfully visited nodes (currently count = 1). We then call our DFS procedure on each of A's neighbors. B has not been visited yet, so we mark it, update the counter, and call DFS on its neighbors A, C, and D. Notice that we call DFS on A even though we know we have already handled that node. We look at A first, note that it has already been visited, and end that DFS call without further action. Let's assume that we tackle C before D. We mark C, update the counter, and call DFS on E. This is where the "depth-first" name comes into play, because we mark E before we mark D. In fact, we mark E, and then call DFS on its neighbors D and F (and C, which we already visited). We mark D, note that we have already visited all of its neighbors, and end that branch. Similarly, we mark F, another dead end, and close that DFS call. This leaves us with only one call open, namely the investigation of node D that B started. Since we have already visited D by this point, we close that call as well. Lacking any active calls to the DFS procedure, the algorithm is complete. Finally, we compare the counter we have maintained to the known total number of nodes in the graph. Since the two numbers are the same, we conclude that the graph is connected.
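The walkthrough above might be sketched in code using the dictionary representation from section 8.10, with a visited set standing in for markers stored within the nodes:

```python
# The example graph: A above B, B -- C, B -- D, C -- E, D -- E -- F.
graph = {'A': ['B'],
         'B': ['A', 'C', 'D'],
         'C': ['B', 'E'],
         'D': ['B', 'E'],
         'E': ['C', 'D', 'F'],
         'F': ['E']}

def dfs(node, visited):
    """Mark node, then recurse into each unvisited neighbor."""
    visited.add(node)
    for neighbor in graph[node]:
        if neighbor not in visited:
            dfs(neighbor, visited)

visited = set()
dfs('A', visited)                  # Start at an arbitrary node, here A.
print(len(visited) == len(graph))  # True: every node was reached.
```

The size of the visited set plays the role of the counter in the walkthrough.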
Task:
- Add direction to the edges of the graph above so that it is connected, but not strongly connected. Can you force DFS to give you a different answer depending on which node you start at?
- Suppose I asked you to use DFS to tell me whether there is a path from A to E in the graph above. How would you go about it? What information would you record?
8.14 Breadth-First Search
Breadth-first search (BFS) proceeds similarly to DFS, and can be used for most of the same questions. It differs in that, while DFS maintains a call stack, BFS maintains a queue. Consider the graph above, and the example DFS search. We marked A->B->C->E->D->F. After we marked B, we put both C and D on the stack for consideration, but C placed more nodes onto the stack before we could get around to considering D. In BFS, once we have placed C and D into our queue, they will be the next two nodes marked (or at least considered for marking), regardless of what we find in C. Below, see the sequence of states of the search. I have included the set of marked nodes and the "to do" queue according to BFS. (I've included the back edges in the to do list.)
MARKED    TO DO
(empty)   A
A         B
AB        A,C,D
ABC       D,B,E
ABCD      B,E,B,E
ABCDE     B,E,C,D,F
ABCDEF    E
ABCDEF    (empty)
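A sketch of BFS on the same graph, using a deque as the "to do" queue. Back edges are enqueued and then skipped when popped, as in the table above:

```python
from collections import deque

graph = {'A': ['B'],
         'B': ['A', 'C', 'D'],
         'C': ['B', 'E'],
         'D': ['B', 'E'],
         'E': ['C', 'D', 'F'],
         'F': ['E']}

def bfs(start):
    """Return the nodes in the order BFS marks them."""
    marked = []
    todo = deque([start])
    while todo:
        node = todo.popleft()
        if node in marked:
            continue              # Already handled; discard this entry.
        marked.append(node)
        todo.extend(graph[node])  # Enqueue neighbors, back edges included.
    return marked

print(bfs('A'))  # ['A', 'B', 'C', 'D', 'E', 'F']
```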
8.15 Dijkstra's Algorithm
BFS can be used to find the shortest path from one node to another, but this only works if the graph is unweighted. To find a shortest path in a weighted graph, we can use Dijkstra's algorithm. This algorithm runs in time O(E log N), where E is the number of edges in the graph and N is the number of nodes.
Suppose you want to find the shortest path from node S to node T. Dijkstra's algorithm begins at node S and labels each node with the distance of the shortest path to get from S to that node. We begin by labeling S itself with "0". We add each of S's neighbors to our "to do" list, but we make this list a priority queue so that the neighbor with the shortest edge comes first. Call this node A. Since all other edges leaving S are at least as long as S->A, the shortest path from S to A cannot be shorter than the path which just takes the S->A edge. Label A with this distance, and add its neighbors to the priority queue. But when we add them, we give them a priority that is weighted by the distance to get to A. So if there is a node B where S->B has length 10, S->A has length 8, and A->B has length 4, B will have two entries in the priority queue at positions 10 and 12. Having handled node A, we can consider the new first element in the priority queue, and we once again know that no other path can be shorter.
The algorithm completes when we finally add a label to our destination node T.
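A sketch of the algorithm using python's heapq module as the priority queue. The graph below extends the S, A, B example from the paragraph above with a hypothetical destination node T:

```python
import heapq

# Weighted, directed graph: node -> {neighbor: edge weight}.
# S -> A (8), S -> B (10), A -> B (4) are from the example above;
# B -> T (2) is added so there is a destination to reach.
graph = {'S': {'A': 8, 'B': 10},
         'A': {'B': 4},
         'B': {'T': 2},
         'T': {}}

def dijkstra(source, target):
    """Return the length of the shortest path from source to target."""
    dist = {}             # Final labels: node -> shortest distance.
    todo = [(0, source)]  # Priority queue of (distance, node) entries.
    while todo:
        d, node = heapq.heappop(todo)
        if node in dist:
            continue      # A shorter entry already labeled this node.
        dist[node] = d
        if node == target:
            return d      # Done as soon as the target is labeled.
        for neighbor, weight in graph[node].items():
            if neighbor not in dist:
                heapq.heappush(todo, (d + weight, neighbor))
    return None           # No path from source to target.

print(dijkstra('S', 'T'))  # 12: the path S -> B -> T
```

Note that B gets two entries in the queue, at priorities 10 and 12, exactly as described above; the second is discarded when popped.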
Tasks:
- What information would you need to store in order to retrieve the shortest path used by Dijkstra's algorithm?
- Dijkstra's algorithm is used on graphs with entirely positive edge weights. What happens if we introduce an edge with a weight of 0, or a negative weight?
- What happens if we use Dijkstra's algorithm on an unweighted graph? Will it work? What other algorithm does it resemble?
Miscellaneous Topics
8.16 Command Line Arguments
We can run python code from the command line using the shell command "python" followed by the name of the file we want to run. We can add additional arguments to the command following the file name. These additional arguments do not do anything on their own, but they are recorded. We can access them using the argv variable in the sys module. Try placing the following code into a file. You will need to run this file outside of Spyder.
import sys
print "The number of arguments is always at least 1:",len(sys.argv)
print "The arguments are:",sys.argv
print "Notice that the filename is the first argument."
If you want to use command line arguments without enforcing a particular order to those arguments, you can use the argparse module. Documentation for this module can be found in the official python documentation.
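A minimal argparse sketch; the argument names below are made up for illustration, and passing a list to parse_args stands in for real command line arguments:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('filename')                        # Required positional argument.
parser.add_argument('--verbose', action='store_true')  # Optional flag.

# Normally we would call parser.parse_args() to read sys.argv;
# here we pass a list directly so the example is self-contained.
args = parser.parse_args(['data.txt', '--verbose'])
print(args.filename)  # data.txt
print(args.verbose)   # True
```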
8.17 Hashing
A hash function is an example of a "one-way" function: it is fairly easy to compute in one direction, but very difficult to compute in the other direction. (A "trapdoor" function is a one-way function with a secret that makes it easy to invert.) To get the basic idea of such a function, consider the equation x^2 = y. Given x, computing y requires one multiplication step (trivial). Given y, we must compute the square root in order to find x (much harder, although perfectly doable). Most one-way functions are used for information security, and are immensely more difficult to invert. The idea is that an adversary cannot access your data if they do not have enough computing power to compute your function backwards.
A hash function maps some arbitrary data to a number between 0 and some fixed maximum, corresponding to an index in a table. A hash function that is designed correctly will have a roughly equal "probability" of sending a given input to any index in the table. (The function is pseudorandom, not truly random, so we write "probability" in scare quotes.) Computing this function, and thus placing a new input into the table, takes constant time. The hash function's runtime does not depend at all on the number of elements already stored in the table.
A "collision" occurs if we try to put some element into the table, and its index is already full. This can happen if we have already put that particular element into the table. In that case, we probably want to overwrite the contents. However, it can also happen if we, by bad luck, happen to have two different elements whose hash mappings are the same index. In this case, the usual approach is to start a list to hold the items that collided at that particular index. We store a reference to the list at that index in the hash table. As long as the number of collisions does not grow too large, the hash table still functions very efficiently.
To read out of a hash table, compute the hash of the element you are looking for, and look at that index. If the element is present within the table, it must be at that index (possibly within that index's collision list). So both reading and writing are expected constant time operations in a hash table.
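A toy hash table with collision lists, using python's built-in hash function. This is a sketch only; the real dict implementation is far more sophisticated:

```python
TABLE_SIZE = 8
table = [[] for _ in range(TABLE_SIZE)]   # One collision list per index.

def put(key, value):
    chain = table[hash(key) % TABLE_SIZE]
    for i, (k, _) in enumerate(chain):
        if k == key:
            chain[i] = (key, value)       # Overwrite an existing entry.
            return
    chain.append((key, value))            # New entry, possibly a collision.

def get(key):
    chain = table[hash(key) % TABLE_SIZE]
    for k, v in chain:
        if k == key:
            return v
    return None                           # Not present in the table.

put('apple', 3)
put('banana', 5)
put('apple', 4)          # Overwrites the first 'apple' entry.
print(get('apple'))      # 4
print(get('cherry'))     # None
```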
The python dict type is implemented using a hash table. For this reason, dict lookups and insertions are very fast (expected constant time). If you want to use a hashing function directly, take a look at the hashlib module in the python standard library documentation.
8.18 Large File Handling
Do not run the following code!
from random import randint
f = open('bigfile.txt','w')
s = 1000
r = range(s)
for i in r:
    for j in r:
        for k in r:
            f.write(str(randint(1,s)))
f.close()
This program (on my laptop, anyway) slowly builds up a text file roughly 3GB in size over the course of an hour. In order to compare different methods for storing data, we'll need to use some smaller examples.
...
for i in r:
    for j in r:
        f.write(str(randint(1,s)))
f.close()
This version should run about 1000 times faster. On my laptop, it creates a 2.9MB text file in 3.2 seconds.
from random import randint
f = open('bigfile.txt','w')
s = 1000
r = range(s)
arr = [[randint(1,s) for j in r] for i in r]
f.write(str(arr))
f.close()
This version creates an entire list of lists of data before doing any writing out to the file at all. This version creates a 4.9MB text file in 2.6 seconds. The extra space required comes from the need to record the various brackets in the str rendering of the list of lists.
from random import randint
import cPickle as pickle
s = 1000
r = range(s)
arr = [[randint(1,s) for j in r] for i in r]
pickle.dump(arr, open('bigfile','wb')) #Note: It's not a text file anymore.
This version creates a 5.9MB file in 2.8 seconds, quite similar to the results of the previous try. However, pickling is much more flexible. We can use it to store nearly any object, and it retains all of the information about that object. Loading a pickled file back into a python program is a one-line call:
arrcopy = pickle.load(open('bigfile','rb'))
More documentation on the pickle module can be found in the official python documentation.
8.19 Large Dictionary Handling
While we can certainly pickle a large dictionary, we may not want to load the entire dictionary into memory every time we want to execute a lookup from that dictionary. Building a pickled dictionary using the following code results in a 24.6MB file in 5.2 seconds. Loading it and reading a single random entry takes 1.2 seconds.
from random import randint
import cPickle as pickle
s = 1000
r = range(s)
d = {}
for i in r:
    for j in r:
        d[str(i) + " " + str(j)] = randint(1,s)
pickle.dump(d, open('bigfile','wb'))
...
dcopy = pickle.load(open('bigfile','rb'))
a = randint(0,s-1)
b = randint(0,s-1)
print dcopy[str(a) + " " + str(b)]
We can use the shelve module to improve this further. The code below creates the same dictionary, taking 100.0MB of space and running in 15.7 seconds (on my laptop). The process of reading a single entry, however, now only requires 0.005 seconds.
import shelve
from random import randint
s = 1000
r = range(s)
d = shelve.open('bigfile')
for i in r:
    for j in r:
        d[str(i) + " " + str(j)] = randint(1,s)
d.close()
...
d = shelve.open('bigfile')
a = randint(0,s-1)
b = randint(0,s-1)
print d[str(a) + " " + str(b)]
d.close()
More documentation on the shelve module can be found in the official python documentation.