Joe Celko's Trees and Hierarchies in SQL for Smarties

Joe Celko's Trees and Hierarchies in SQL for Smarties

5.0 1
by Joe Celko
     
 

View All Available Formats & Editions

In his previous book, (SQL for Smarties, 2e, MK, 1999) Joe Celko wrote two chapters on programming techniques for representing trees and hierarchies in SQL. Ever since then, he has been bombarded with questions requesting more explanation of these topics. Joe has decided to meet the demand for this information by addressing the topic in depth in his new full-length… See more details below

Overview

In his previous book, (SQL for Smarties, 2e, MK, 1999) Joe Celko wrote two chapters on programming techniques for representing trees and hierarchies in SQL. Ever since then, he has been bombarded with questions requesting more explanation of these topics. Joe has decided to meet the demand for this information by addressing the topic in depth in his new full-length book. Trees and hierarchies are a way to organize information and appear everywhere in computer science, from indexing structures (i.e. B-Tree indexing) to encoding schemes (i.e. Dewey Decimal system for libraries) to hierarchical databases. Even XML and related markup languages-which interact with databases-are based on tree structures. Every SQL programmer is faced with the challenge of creating these structures, which are not easy to master and have far-reaching programmatic effects.

Editorial Reviews

From the Publisher
"I want to say clearly that I think the subject of this proposed book is one for which there will be considerable demand...the topic is poorly understood in general and a good book on the subject will be helpful to the SQL community at large. This book should be of great interest to real-world application programmers...I think that this book would be used on a day-to-day basis (rather than languish on a shelf until some special problem arose)." -Jim Melton, author of SQL:1999.

Product Details

ISBN-13:
9780123877338
Publisher:
Elsevier Science
Publication date:
02/03/2012
Series:
Morgan Kaufmann Series in Data Management Systems Series
Pages:
296
Sales rank:
1,345,123
Product dimensions:
7.40(w) x 9.20(h) x 0.90(d)

Read an Excerpt

JOE CELKO'S TREES AND HIERARCHIES IN SQL FOR SMARTIES


By Joe Celko

MORGAN KAUFMANN

Copyright © 2004 Elsevier Inc.
All right reserved.

ISBN: 978-0-08-049169-1


Chapter One

Graphs, Trees, and Hierarchies

LET'S START WITH a little mathematical background. Graph theory is a branch of mathematics that deals with abstract structures, known as graphs. These are not the presentation charts that you get out of a spreadsheet package.

Very loosely speaking, a graph is a diagram of "dots" (called nodes or vertices) and "lines" (edges) that model some kind of "flow" or relationship. The edges can be undirected or directed. Graphs are very general models. In circuit diagrams the edges are the wires and the nodes are the components. On a road map the nodes are the towns and the edges are the roads. Flowcharts, organizational charts, and a hundred other common abstract models you see every day are all shown as graphs.

A directed graph allows a "flow" along the edges in one direction only, as shown by the arrowheads, whereas an undirected graph allows the flow to travel in both directions. Exactly what is flowing depends on what you are modeling with the graph.

The convention is that an edge must join two (and only two) nodes. This lets us show an edge as an ordered pair of nodes, such as (Atlanta, Boston) if we are dealing with a map, or (a, b) in a more abstract notation. There is an implication in a directed graph that the direction is shown by the ordering. In an undirected graph we know that (a, b) = (b, a), however.

A node can sit alone or have any number of edges associated with it. A node can also be self-referencing, as in (a, a).

The terminology used in graph theory will vary, depending on which book you had in your finite math class. The following list, in informal language, includes the terms I will use in this book.

Order of a graph: number of nodes in the graph

Degree: the number of edges at a node, without regard to whether the graph is directed or undirected

Indegree: the number of edges coming into a node in a directed graph

Outdegree: the number of edges leaving a node in a directed graph

Subgraph: a graph that is a subset of another graph's edges and nodes

Walk: a subgraph of alternating edges and nodes that are connected to each other in such a way that you can trace around it without lifting your finger

Path: a subgraph that does not cross over itself—there is a starting node with degree one, an ending node with degree one, and all other nodes have degree two. It is a special case of a walk. It is a "connect-the-dots" puzzle.

Cycle: a subgraph that "makes a loop," so that all nodes have degree two. In a directed graph all the nodes of a cycle have outdegree one and indegree one (Figure 1.1).

Connected graph: a graph in which all pairs of nodes are connected by a path. Informally, the graph is all in one piece.

Forest: a collection of separate trees. Yes, I am defining this term before we finally get to discussing trees.

There are a lot more terms to describe special kinds of graphs, but frankly, we will not use them in this book. We are supposed to be learning SQL programming, not graph theory.

The strength of graphs as problem-solving tools is that the nodes and edges can be given extra attributes that adapt this general model to a particular problem. Edges can be assigned "weights," such as expected travel time for the roads on a highway map. Nodes can be assigned "colors" that put them into groups, such as men and women. Look around and you will see how they are used.

1.1 Modeling a Graph in a Program

Long before there was SQL, programmers represented graphs in the programming language that they had. People used pointer chains in assembly language or system development languages such as 'C' to build very direct representations of graphs with machine language instructions. However, unlike the low level system development languages, the later, higher-level languages, such as Pascal, LISP, and PL/I, did not expose the hardware to the programmer. Pointers in these higher level languages were abstracted to hide references to the actual physical storage and often required that the pointers point to variables or structures of a particular type. (See PL/I's ADDR() function, pointer datatypes, and based variables as examples of this kind of language construct.)

Traditional application development languages do not have pointers, but they often have arrays. In particular, FORTRAN only had arrays for a data structure; a good FORTRAN programmer could use them for just about anything. Early versions of FORTRAN did not have character-string data types—everything was either an integer or a floating-point number. This meant the model of a graph had to be created by numbering the nodes and using the node numbers as subscripts to index into the arrays.

Once the array techniques for graphs were developed, they became part of the "programmer's folklore" and were implemented in other languages. I will use a pseudocode to explain the techniques.

1.1.1 Adjacency Lists for Graphs

The adjacency list model represents the edges of the graph as pairs of nodes, similar to the following computer code:

DECLARE ARRAY GraphList OF RECORD [edge CHAR(1), in_node INTEGER, out_node INTEGER];

With data:

GraphList edge in_node out_node 'a' 1 2 'b' 2 3 'c' 4 2 'd' 1 4

The algorithms that we used were based on loops that made the connections between two edges, in which the in_node of one row was equal to the out_node of another row.

1.1.2 Adjacency Arrays for Graphs

Many of the computational languages had library functions for matrix operations; therefore it was logical to put the graph into an array where it could be manipulated with these functions.

Given (n) nodes, you could declare an (n) by (n) array with zeros and ones in the cells. A "one" meant that there was an edge between the two nodes represented by the row and column of the cell, and a "zero" meant that there was not.

You can actually represent a two-dimensional array; for example: A[0:5, 0:5], with a table like this:

CREATE TABLE Array_A (edge CHAR(10) NOT NULL, i INTEGER NOT NULL UNIQUE CHECK (i BETWEEN 0 AND 5), j INTEGER NOT NULL UNIQUE CHECK (j BETWEEN 0 AND 5), PRIMARY KEY (i, j));

I have a chapter in SQL For Smarties on how to do basic matrix math with such tables. However, because SQL was not meant to be used this way, the code to implement the old Adjacency Array algorithms is rather baroque. Array was added to SQL-99 as a "collection type" for columns, but it is not widely implemented and has serious limitations—it is a vector, or one-dimensional array, and not a full multidimensional structure.

1.1.3 Finding a Path in General Graphs in SQL

There is a classic problem in graph theory that illustrates how expensive it can be to do general graphs in SQL. What we want is a list of paths from any two nodes in a directed graph in which the edges have a weight. The sum of these weights gives us the cost of each path so that we can pick the cheapest path.

The best way is probably to use the Floyd-Warshall or Johnson algorithm in a procedural language and load a table with the results. However, I want to do this in pure SQL as an exercise. Let's start with a simple graph and represent it as an adjacency list with weights on the edges.

CREATE TABLE Graph (source CHAR(2) NOT NULL, destination CHAR(2) NOT NULL, cost INTEGER NOT NULL, PRIMARY KEY (source, destination));

I obtained data for this table from the book Introduction to Algorithms by Cormen, Leiserson, and Rivest (Cambridge, Mass., MIT Press, 1990, p. 518; ISBN 0-262-03141-8). This book is very popular in college courses in the United States. I made one decision that will be important later—I added self-traversal edges—the node is both the source and the destination so the cost of those paths is zero.

INSERT INTO Graph VALUES ('s', 's', 0), ('s', 'u', 3), ('s', 'x', 5), ('u', 'u', 0), ('u', 'v', 6), ('u', 'x', 2), ('v', 'v', 0), ('v', 'y', 2), ('x', 'u', 1), ('x', 'v', 4), ('x', 'x', 0), ('x', 'y', 6), ('y', 's', 3), ('y', 'v', 7), ('y', 'y', 0);

I am not happy about this approach, because I have to decide the maximum number of edges in a path before I start looking for an answer. However, this will work, and I know that a path will have no more than the total number of nodes in the graph. Let's create a table to hold the paths:

CREATE TABLE Paths (step_1 CHAR(2) NOT NULL, step_2 CHAR(2) NOT NULL, step_3 CHAR(2) NOT NULL, step_4 CHAR(2) NOT NULL, step_5 CHAR(2) NOT NULL, total_cost INTEGER NOT NULL, path_length INTEGER NOT NULL, PRIMARY KEY (step_1, step_2, step_3, step_4, step_5));

The step_1 node is where I begin the path. The other columns are the second step, third step, fourth step, and so forth. The last step column is the end of the journey. The total_cost column is the total cost, based on the sum of the weights of the edges, on this path. The path_length column is harder to explain, but for now let's just say that it is a count of the nodes visited in the path.

To keep things easier let's look at all the paths from "s" to "y" in the graph. The INSERT INTO statement for constructing that set looks like this:

INSERT INTO Paths SELECT G1.source, it is 's' in this example G2.source, G3.source, G4.source, G4.destination, it is 'y' in this example (G1.cost + G2.cost + G3.cost + G4.cost), (CASE WHEN G1.source NOT IN (G2.source, G3.source, G4.source) THEN 1 ELSE 0 END + CASE WHEN G2.source NOT IN (G1.source, G3.source, G4.source) THEN 1 ELSE 0 END + CASE WHEN G3.source NOT IN (G1.source, G2.source, G4.source) THEN 1 ELSE 0 END + CASE WHEN G4.source NOT IN (G1.source, G2.source, G3.source) THEN 1 ELSE 0 END) FROM Graph AS G1, Graph AS G2, Graph AS G3, Graph AS G4 WHERE G1.source = 's' AND G1.destination = G2.source AND G2.destination = G3.source AND G3.destination = G4.source AND G4.destination = 'y';

I put in "s" and "y" as the source and destination of the path and made sure that the destination of one step in the path was the source of the next step in the path. This is a combinatorial explosion, but it is easy to read and understand.

The sum of the weights is the cost of the path, which is easy to understand. The path_length calculation is a bit harder. This sum of CASE expressions looks at each node in the path. If it is unique within the row, it is assigned a value of one; if it is not unique within the row, it is assigned a value of zero.

All paths will have five steps in them, because that is the way the table is declared. However, what if a path shorter than five steps exists between the two nodes? That is where the self-traversal rows are used! Consecutive pairs of steps in the same row can be repetitions of the same node.

Here is what the rows of the paths table look like after this INSERT INTO statement, ordered by descending path_length, and then by ascending cost:

Paths step_1 step_2 step_3 step_4 step_5 total_cost path_length

s s x x y 11 0 s s s x y 11 1 s x x x y 11 1 s x u x y 14 2 s s u v y 11 2 s s u x y 11 2 s s x v y 11 2 s s x y y 11 2 s u u v y 11 2 s u u x y 11 2 s u v v y 11 2 s u x x y 11 2 s x v v y 11 2 s x x v y 11 2 s x x y y 11 2 s x y y y 11 2 s x y v y 20 4 s x u v y 14 4 s u v y y 11 4 s u x v y 11 4 s u x y y 11 4 s x v y y 11 4

(Continues...)



Excerpted from JOE CELKO'S TREES AND HIERARCHIES IN SQL FOR SMARTIES by Joe Celko Copyright © 2004 by Elsevier Inc.. Excerpted by permission of MORGAN KAUFMANN. All rights reserved. No part of this excerpt may be reproduced or reprinted without permission in writing from the publisher.
Excerpts are provided by Dial-A-Book Inc. solely for the personal use of visitors to this web site.

Read More

Customer Reviews

Average Review:

Write a Review

and post it to your social network

     

Most Helpful Customer Reviews

See all customer reviews >