CS 367 - Introduction to Data Structures - Section 3

Programming Assignment Four: Final Project --- Sorting

The web is a good place to find information on sorting. Here are some potentially interesting and useful links:

Note: some of these sites have code for sorting algorithms, perhaps even C++ code. Please write your own code. In any case, whatever sources you use to influence your sorting algorithm, please put a comment to that effect at the header of the algorithm. For example, if you used code that I gave in class:

void selectionSort(...)
// This code is based on the code provided by the instructor

void shellSort(...)
// This code is based on the algorithm described by the instructor

void fooSort(...)
// This code is based on the algorithm described in the book by so and so

void funkySort(...)
// This code is based on code appearing at the URL: ...

You get the idea.

Hash Sort

Can we use hashing for sorting? Not in the general case --- the items we hash on need not have total order semantics.

One of the things that makes hashing interesting is that it gives us efficient searching without requiring underlying sorting.

For example, effective hashes on strings need not respect the usual (lexicographical) order on strings. With HASH_TABLE_SIZE = 1000 and hashViaShift (see notes on hashing), hashViaShift("cat")==596 and hashViaShift("act")==612. So, in our hash table, "cat" comes before "act", even though "act" < "cat" under the normal string ordering:

 ______
|______|   0
|______|   1
|______|
| ...  |
|------|
|"cat" | 596
|------|
| ...  |
|------|
|"act" | 612
|------|
| ...  |
|______|

However, we can use a limited form of hashing to achieve sorting in cases where our hash function is _monotone_: if x < y then hash(x) < hash(y). In other words, a monotone hash function places items with smaller keys earlier in the table and items with larger keys later in the table.

As a very simple case, suppose we want to sort a list of positive integers. Let M be the max of those integers. We can create a boolean array B of M+1 elements (so that indices 0 through M are valid), initially with each entry set to false.

For each i in our list, set B[i] = true
Now we can produce the sorted list of integers by walking through B and outputting only the indices for which B[i] is true.
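
The boolean-array idea above can be sketched as follows. This is only an illustration, not the required implementation: the name presenceSort and the use of std::vector are assumptions, and the sketch assumes the integers are distinct (a duplicate would be output only once).

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: sorts distinct positive integers by marking each
// value in a boolean array, then reading the marked indices back in order.
std::vector<unsigned> presenceSort(const std::vector<unsigned>& items)
{
  if (items.empty()) return {};

  // M is the max of the integers; B needs indices 0 ... M.
  unsigned M = *std::max_element(items.begin(), items.end());
  std::vector<bool> B(static_cast<std::size_t>(M) + 1, false);

  for (unsigned x : items)
    B[x] = true;

  // Walk B from low to high, outputting the indices that were marked.
  std::vector<unsigned> sorted;
  for (std::size_t i = 0; i < B.size(); i++)
    if (B[i]) sorted.push_back(static_cast<unsigned>(i));
  return sorted;
}
```

Note that the running time and memory depend on M, not just on the length of the list, so this only makes sense when M is reasonably small.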

This is the simplest kind of hash sort (the underlying hash function is the trivial identity function), known as a bucket sort or distribution sort. Hash sorting is a generalization of this idea that works as long as the hash function is monotone.

Hash sorting:

input: an array A of n items that have total order semantics.
action: place items of A into a hash_table using a monotone hash function. (can think of the item as being its own key)
result: A is replaced by the items from the occupied slots in the order in which they occur in the hash table.
preconditions:
- assume item uniqueness
- hash table must be at least as large as input array
- hash function must be a valid hash function:
  - it returns array indices in range 0 ... HASH_TABLE_SIZE-1
  - it's monotone: if A[i] < A[j] then hash(A[i]) < hash(A[j])

pseudocode

initialize hash table so all slots marked as vacant

for(i=0; i < n; i++) {
  insert A[i] into table at location hash(A[i])
  (mark slot as occupied)
}

for(i=0,j=0; j < n; i++) {
  if table[i] is occupied {
    A[j] = item stored at table[i]
    j++
  }
}
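
One possible C++ rendering of the pseudocode above is sketched here. The signature is an assumption (the table size and the hash function are passed in as parameters, and Item is assumed to be default-constructible), and none of the preconditions are checked.

```cpp
#include <cstddef>
#include <vector>

// Hash sort sketch. Preconditions (NOT checked here): the items are unique,
// tableSize >= n, and hash is monotone with values in 0 ... tableSize-1.
template <class Item, class Hash>
void hashSort(Item A[], std::size_t n, std::size_t tableSize, Hash hash)
{
  std::vector<Item> table(tableSize);
  std::vector<bool> occupied(tableSize, false);  // all slots start vacant

  // Place each item at the slot chosen by the hash function.
  for (std::size_t i = 0; i < n; i++) {
    std::size_t slot = hash(A[i]);
    table[slot] = A[i];
    occupied[slot] = true;
  }

  // Walk the table in order, copying occupied slots back into A.
  for (std::size_t i = 0, j = 0; i < tableSize && j < n; i++) {
    if (occupied[i]) {
      A[j] = table[i];
      j++;
    }
  }
}
```

For instance, with the items 40, 10, 30, 20, a table of size 5, and the hash x/10 (monotone on these particular values), the array comes back as 10, 20, 30, 40.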

Shell Sort

Recall insertion sort:

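#include <cstddef>  // std::size_t

template <class Item>
void insertionSort(Item A[], std::size_t n)
{
  for (std::size_t i = 1; i < n; i++) {
    Item key = A[i];
    std::size_t j = i;
    // Shift larger items one slot to the right to open a gap for key.
    for (; j > 0 && A[j-1] > key; j--)
      A[j] = A[j-1];
    A[j] = key;  // drop key into the gap
  }
}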

Shell sort is a generalization of insertion sort, sometimes known as diminishing increment sort.

If A is an array of n items, we say the array is k-sorted if for every valid index i such that i+k<n, A[i] <= A[i+k].

For example,

        0   1   2   3   4   5
      ________________________
A[] = | 5 | 1 | 2 | 5 | 3 | 9 |
      ------------------------

is 3-sorted, since:

A[0] <= A[3] (5 <= 5)
A[1] <= A[4] (1 <= 3)
A[2] <= A[5] (2 <= 9)

but it is not 2-sorted since

A[0] > A[2] (5 > 2)

Another way to think of it: A is k-sorted if for each i < k, the sequence A[i], A[i+k], A[i+2k], A[i+3k], ..., A[i+jk] (where i+jk < n, but i+(j+1)k >= n) is sorted. We can call such a sequence the i-k sequence of A. The 0-1 sequence is all of A, so A is sorted if and only if it is 1-sorted.

For the above example:

the 0-3 sequence is: 5,5
the 1-3 sequence is: 1,3
the 2-3 sequence is: 2,9
the 0-2 sequence is: 5,2,3
the 1-2 sequence is: 1,5,9

Note that since the 0-3, 1-3, and 2-3 sequences are all sorted, A is 3-sorted. However, since the 0-2 sequence is not sorted, A is not 2-sorted.
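
The definition of k-sorted translates directly into a small checking function, which can be handy for testing. This is a sketch; the name isKSorted is made up for illustration.

```cpp
#include <cstddef>

// Hypothetical helper: returns true if A[i] <= A[i+k] for every index i
// with i+k < n, i.e. if A is k-sorted.
template <class Item>
bool isKSorted(const Item A[], std::size_t n, std::size_t k)
{
  for (std::size_t i = 0; i + k < n; i++)
    if (A[i + k] < A[i])   // a pair that is out of order
      return false;
  return true;
}
```

On the example array 5, 1, 2, 5, 3, 9 this returns true for k=3 and false for k=2, matching the discussion above.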

We can k-sort an array simply: for each i between 0 and k-1, use insertion sort to sort the i-k sequence.

So, we can 2-sort the above array (meaning sort 5,2,3 and sort 1,5,9) to arrive at:

        0   1   2   3   4   5
      ________________________
A[] = | 2 | 1 | 3 | 5 | 5 | 9 |
      ------------------------

which is 2-sorted.
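
A k-sort along these lines might look like the sketch below: for each starting index, it runs insertion sort over the corresponding i-k sequence. The name kSort is an assumption, and k is assumed to be at least 1.

```cpp
#include <cstddef>

// Hypothetical helper: k-sorts A by insertion sorting each i-k sequence
// A[start], A[start+k], A[start+2k], ... for start = 0 ... k-1.
template <class Item>
void kSort(Item A[], std::size_t n, std::size_t k)
{
  for (std::size_t start = 0; start < k; start++) {
    for (std::size_t i = start + k; i < n; i += k) {
      Item key = A[i];
      std::size_t j = i;
      // Shift larger items of this sequence k slots to the right.
      while (j >= k && key < A[j - k]) {
        A[j] = A[j - k];
        j -= k;
      }
      A[j] = key;
    }
  }
}
```

Calling this with k=2 on 5, 1, 2, 5, 3, 9 gives 2, 1, 3, 5, 5, 9, as in the example above.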

To finish the sort, we can 1-sort (just call insertion sort on the whole thing). This is much more efficient than it sounds because insertion sort performs well on nearly sorted data.

So, the idea of Shell sort is to repeatedly k-sort the array for a decreasing sequence of k's whose last value is k=1, at which point the array is completely sorted.

Obviously if we use each k from n-1 down to 1, then this will work, but that is overkill. It suffices to use a geometrically decreasing sequence (i.e. divide k by two each time). For example: k=n/2, k=n/4, ... , k=1
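
Putting the pieces together, a minimal sketch of Shell sort with the halving sequence k=n/2, k=n/4, ..., k=1 could look like this. It is one possible arrangement, not the only one: here each k-sort is done as a single interleaved pass over the array rather than one i-k sequence at a time, which produces the same k-sorted result.

```cpp
#include <cstddef>

// Shell sort sketch using the gap sequence n/2, n/4, ..., 1
// (integer division, so the last gap tried is always 1 when n >= 2).
template <class Item>
void shellSort(Item A[], std::size_t n)
{
  for (std::size_t k = n / 2; k >= 1; k /= 2) {
    // k-sort the array: insertion sort with a stride of k.
    for (std::size_t i = k; i < n; i++) {
      Item key = A[i];
      std::size_t j = i;
      while (j >= k && key < A[j - k]) {
        A[j] = A[j - k];
        j -= k;
      }
      A[j] = key;
    }
  }
}
```

Trying a different sequence only requires changing how k is initialized and updated in the outer loop, as long as the final value is 1.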

You can use variations on this sequence. For example: k=(n+1)/2, k=(n+1)/4, ...

You can get better performance if the values of k in the sequence have no common factors (think about why).