Can we use hashing for sorting? Not in the general case --- the items we hash on need not have total order semantics.
One of the things that makes hashing interesting is that it gives us efficient searching without requiring underlying sorting.
Effective hashes on strings need not respect the usual (lexicographical) order on strings. For example, with HASH_TABLE_SIZE = 1000 and hashViaShift (see notes on hashing), hashViaShift("cat")==596 and hashViaShift("act")==612. So, in our hash table, "cat" will come before "act" (even though, under the normal string ordering, "act" < "cat"):
     ______
    |______|   0
    |______|   1
    |______|
    | ...  |
    |------|
    |"cat" | 596
    |------|
    | ...  |
    |------|
    |"act" | 612
    |------|
    | ...  |
    |______|
However, we can use hashing to achieve sorting in cases where our hash function is monotone, that is, whenever x < y implies hash(x) <= hash(y).
In other words, a monotone hash function places items with relatively lower keys early in the table (near the top) and items with relatively greater keys later in the table (closer to the bottom).
Quiz: which of the following functions are monotone?
    int hash(int i) { return (i % 100); }

No: 22 < 101, but hash(22) > hash(101).

    int foo(int i) { return (i + 1); }

Yes: if x < y then x+1 < y+1.
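Quiz answers like these can also be checked by brute force over a range of keys. The following is a minimal sketch, assuming the non-strict reading of monotone (x < y implies h(x) <= h(y)); the helper name isMonotoneOn and the name modHash (the quiz's first function, renamed) are ours, not part of the notes.

```cpp
#include <cassert>
#include <functional>

// Hypothetical helper (not from the notes): returns true if h never
// reverses the order of two keys in the range [lo, hi].
// Checking adjacent keys is enough, because <= is transitive.
bool isMonotoneOn(const std::function<int(int)>& h, int lo, int hi) {
    assert(lo <= hi);
    for (int x = lo; x < hi; x++) {
        if (h(x) > h(x + 1)) return false;   // order reversed between x and x+1
    }
    return true;
}

// The two quiz functions; the first is renamed from hash to modHash.
int modHash(int i) { return i % 100; }
int foo(int i) { return i + 1; }
```

On the range 0 to 200, isMonotoneOn reports false for modHash (the value drops from 99 to 0 at key 100) and true for foo. A passing check only covers the range tested; it is evidence, not a proof.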
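As a very simple case, suppose we want to sort a list of positive integers. Let M be the max of those integers. We can create a boolean array B of M+1 elements (indices 0 through M), initially with each entry set to false. For each integer i in the input we set B[i] to true. Scanning B from top to bottom and writing out the index of every true entry then gives the integers in sorted order.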
    input:  4 3 6 1 2 9

      |-------|
    0 | false |
      |-------|
    1 | true  |
      |-------|
    2 | true  |
      |-------|
    3 | true  |
      |-------|
    4 | true  |
      |-------|
    5 | false |
      |-------|
    6 | true  |
      |-------|
    7 | false |
      |-------|
    8 | false |
      |-------|
    9 | true  |
      |-------|

    output: 1 2 3 4 6 9
This is the simplest kind of hash sort (the underlying hash function is the trivial identity function) known as a distribution sort. (It is also a simple form of a counting sort.)
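The boolean-array idea can be sketched in a few lines of C++. This is only an illustration under stated assumptions: the input holds distinct non-negative integers, and the function name booleanArraySort is ours.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sorts distinct non-negative integers with a boolean array B of
// max+1 entries (a sketch; the function name is an assumption).
std::vector<int> booleanArraySort(const std::vector<int>& input) {
    if (input.empty()) return {};
    int max = input[0];
    for (int x : input) {
        assert(x >= 0);                  // x is used as an array index
        if (x > max) max = x;
    }
    std::vector<bool> B(static_cast<std::size_t>(max) + 1, false);
    for (int x : input) B[x] = true;     // identity "hash": x goes to slot x

    std::vector<int> output;
    for (int i = 0; i <= max; i++)       // scan the table top to bottom
        if (B[i]) output.push_back(i);
    return output;
}
```

Called on the input 4 3 6 1 2 9, the function returns 1 2 3 4 6 9. Duplicates would be collapsed into a single entry, since each slot only records whether a value was seen.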
Distribution sort:
    initialize hash table so all slots marked as vacant
    for(i=0; i < n; i++) {
        insert A[i] into table at location hash(A[i])
        (mark slot as occupied)
    }
    for(i=0,j=0; j < n; i++) {
        if table[i] is occupied {
            A[j] = item stored at table[i]
            j++
        }
    }
What is the complexity of this method? O(M+n), where M is the max and n is the number of items in the array: the first loop performs n insertions and the second scans up to M+1 table slots. Of course, with this simple formulation, we have no way of dealing with collisions, so we must also stipulate that the items to be sorted be unique. So M must be at least as big as n, and this sort is O(M). This is reasonable if M is roughly the same size as n.
Is it practical to use distribution sort to sort the following numbers?
65536 1024 16777216 32 32768
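No: the largest number is 16777216, so the table would need 16777217 slots to sort just five items. Counting sort is not efficient if the numbers are large relative to how many of them there are. However, with a little luck (i.e. an effective hash function) we can extend hash sorting to work efficiently on more general input.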
We use chained hashing and think of each entry in the hash table as a bin or bucket (this kind of sorting is often called bucket sorting). The hash table is implemented such that the first bucket contains an ordered collection of the smallest numbers, the next bucket contains an ordered collection of the next smallest numbers and so on so that the last bucket contains an ordered collection of the largest numbers. For example:
We can easily produce a sorted list of the numbers in the following hash table, by linking each chain together in order:
      |---|
    0 |   |
      |---|
    1 | -|-- 13 - 18
      |---|
    2 | -|-- 24 - 27 - 29
      |---|
    3 |   |
      |---|
    4 | -|-- 41
      |---|

    sorted list: 13 18 24 27 29 41
A monotone hash function that would group the data as such might be:
    int hash(int i) { return (i / 10); }

(This will work as long as the number of entries in our table is greater than a tenth of the largest number.)
What do we need to make this into an effective sorting technique?
Maintaining ordered collections at each hash table entry may slow down the insertion process: worst-case O(n) if we use ordered lists, O(lg(n)) if we use ordered trees, making the sort time O(n^2) or O(n*lg(n)). However, if the hash function distributes the data well, then on average there will be very few elements in each bucket, so the effective time for the sort will be O(n).
So bucket sorting is very effective when either the data is evenly distributed over a range, OR we have a hash function that disperses it evenly.
Bucket sorting:
    initialize hash table so all slots have empty collections
    for(i=0; i < n; i++) {
        insert A[i] into collection at table[hash(A[i])]
    }
    for(i=0,j=0; j < n; i++) {
        for each item in the collection at table[i] {
            A[j] = item
            j++
        }
    }

Example:
    input : [7096,6051,553,1969,14205,9651,4194,12180,14721,13458,
             7580,14920,2796,8344,11360,8168,10971,3851,1770,3122,
             165,9557,8109,2844,14652]

    n = 25
    min = 165
    max = 14920
    range = 14920 - 165 + 1 = 14756
    bucketSize = (range + n - 1) / n = 14780 / 25 = 591
    hash(i) = (i - min) / bucketSize

     0 : 165 -- 553
     1 :
     2 : 1770
     3 : 1969
     4 : 2796 -- 2844
     5 : 3122
     6 : 3851 -- 4194
     7 :
     8 :
     9 : 6051
    10 :
    11 : 7096
    12 : 7580
    13 : 8109 -- 8168 -- 8344
    14 :
    15 : 9557
    16 : 9651
    17 :
    18 : 10971 -- 11360
    19 :
    20 : 12180
    21 :
    22 : 13458
    23 : 14205
    24 : 14652 -- 14721 -- 14920

    output : [ 165,553,1770,1969,2796,2844,3122,3851,4194,6051,
               7096,7580,8109,8168,8344,9557,9651,10971,11360,
               12180,13458,14205,14652,14721,14920]
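The bucket sorting pseudocode and the worked example above can be sketched in C++ as follows. This is one possible rendering, not a definitive implementation: it assumes int keys, uses the example's hash (i - min) / bucketSize with integer division, and keeps each bucket as an ordered std::vector (an ordered list or tree would serve equally well). The function name bucketSort is ours.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Bucket sort sketch using the hash from the example:
//   hash(i) = (i - min) / bucketSize   (integer division)
void bucketSort(std::vector<int>& A) {
    if (A.size() < 2) return;
    const long long n = static_cast<long long>(A.size());
    const long long min = *std::min_element(A.begin(), A.end());
    const long long max = *std::max_element(A.begin(), A.end());
    const long long range = max - min + 1;
    const long long bucketSize = (range + n - 1) / n;   // ceiling of range / n

    // One bucket per item, so table[0..n-1] covers every hash value.
    std::vector<std::vector<int>> table(A.size());
    for (int x : A) {
        const std::size_t h = static_cast<std::size_t>((x - min) / bucketSize);
        assert(h < table.size());
        std::vector<int>& bucket = table[h];
        // Ordered insert: place x before the first larger item.
        bucket.insert(std::upper_bound(bucket.begin(), bucket.end(), x), x);
    }

    // Link the buckets together in order, writing back into A.
    std::size_t j = 0;
    for (const std::vector<int>& bucket : table)
        for (int item : bucket)
            A[j++] = item;
}
```

Because the hash is monotone, concatenating the buckets from first to last yields the sorted array. Unlike the simple distribution sort, duplicates are fine here: equal keys simply land in the same bucket.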
We were mistaken if we took O(n*lg(n)) to be the best that any sort can do. That is the best we can do for comparison-based sorting. With well-distributed numerical data (or with a suitable hash function) we can sort directly into locations in an array (or table) and, at least on average, obtain O(n) sorting.
However, we'd like to have effective sorting algorithms in the most general cases. This brings us back to comparison-based sorts --- sorts that depend solely on our ability to determine whether one item is less than another. (No dependence on the kind of data.)
We have seen two quadratic (O(n^2)) algorithms:
    template <class T>
    void selectionSort(T A[ ], size_t n) {
        for (size_t i = 0; i < n; i++) {
            // find the smallest item in A[i..n-1]
            T min = A[i];
            size_t minIndex = i;
            for (size_t j = i+1; j < n; j++)
                if (A[j] < min) {
                    min = A[j];
                    minIndex = j;
                }
            // move it to the front of the unsorted part
            swap(A[i], A[minIndex]);
        }
    }

For example, consider selection sort on the following five-element array of integers:
    [28,92,97, 3, 0]
    ------
    [ 0,92,97, 3,28]
    [ 0, 3,97,92,28]
    [ 0, 3,28,92,97]
    [ 0, 3,28,92,97]
    [ 0, 3,28,92,97]
    template <class T>
    void insertionSort(T A[], size_t n) {
        for (size_t i = 1; i < n; i++) {
            T key = A[i];    // item to insert into the sorted prefix A[0..i-1]
            int j = i-1;
            // shift larger items one slot to the right
            for (; j >= 0 && A[j] > key; j--)
                A[j+1] = A[j];
            A[j+1] = key;
        }
    }

For example, consider insertion sort on the following eight-element array of integers:
    [478,218,491,467,401,252,108,196]
    ------
    [218,478]
    [218,478,491]
    [218,467,478,491]
    [218,401,467,478,491]
    [218,252,401,467,478,491]
    [108,218,252,401,467,478,491]
    [108,196,218,252,401,467,478,491]
    A : [ 2, 198, 740, 769, 79, 318, 583, 48, 553, 30]

    mergeSort(A)
      L : [ 2, 198, 740, 769, 79]    R : [ 318, 583, 48, 553, 30]
      ...
      L : [ 2, 79, 198, 740, 769]    R : [ 30, 48, 318, 553, 583]
      merge: [ 2, 30, 48, 79, 198, 318, 553, 583, 740, 769]
    mergeSort array A of size n {
        split A into two halves L and R
        if L has more than one element
            mergeSort L
        if R has more than one element
            mergeSort R
        merge L and R into A
    }
    A : [ 507, 277, 284, 182, 158, 28, 183, 25]
    ------
    mergeSort(A)
      A : [ 507, 277, 284, 182, 158, 28, 183, 25]
      L : [ 507, 277, 284, 182]   R : [ 158, 28, 183, 25]
      mergeSort(L)
        A : [ 507, 277, 284, 182]
        L : [ 507, 277]   R : [ 284, 182]
        mergeSort(L)
          A : [ 507, 277]
          L : [ 507]   R : [ 277]
          merge(L,R)
          A : [ 277, 507]
        mergeSort(R)
          A : [ 284, 182]
          L : [ 284]   R : [ 182]
          merge(L,R)
          A : [ 182, 284]
        merge(L,R)
          L : [ 277, 507]   R : [ 182, 284]
          A : [ 182, 277, 284, 507]
      mergeSort(R)
        A : [ 158, 28, 183, 25]
        L : [ 158, 28]   R : [ 183, 25]
        mergeSort(L)
          A : [ 158, 28]
          L : [ 158]   R : [ 28]
          merge(L,R)
          A : [ 28, 158]
        mergeSort(R)
          A : [ 183, 25]
          L : [ 183]   R : [ 25]
          merge(L,R)
          A : [ 25, 183]
        merge(L,R)
          L : [ 28, 158]   R : [ 25, 183]
          A : [ 25, 28, 158, 183]
      merge(L,R)
        L : [ 182, 277, 284, 507]   R : [ 25, 28, 158, 183]
        A : [ 25, 28, 158, 182, 183, 277, 284, 507]
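The recursive structure traced above can be sketched in C++ as follows. This is a sketch under assumptions rather than the notes' own code: it works on std::vector<int>, copies the two halves into fresh vectors L and R, and delegates the merge step to the standard library's std::merge as a stand-in.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Merge sort sketch following the pseudocode: split, sort halves, merge.
void mergeSort(std::vector<int>& A) {
    if (A.size() < 2) return;                    // nothing to split
    const auto mid = A.begin() + A.size() / 2;
    std::vector<int> L(A.begin(), mid);          // left half (a copy)
    std::vector<int> R(mid, A.end());            // right half (a copy)
    if (L.size() > 1) mergeSort(L);
    if (R.size() > 1) mergeSort(R);
    // std::merge stands in for the "merge L and R into A" step.
    std::merge(L.begin(), L.end(), R.begin(), R.end(), A.begin());
    assert(std::is_sorted(A.begin(), A.end()));  // postcondition check
}
```

Applied to the eight-element array from the trace, this produces 25, 28, 158, 182, 183, 277, 284, 507. Copying the halves keeps the code close to the pseudocode at the cost of O(n) extra space per level of recursion.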
    merge L and R into A {
        i = 0, j = 0, k = 0
        while i < length(L) and j < length(R) {
            if L[i] < R[j] {
                A[k] = L[i]; i++; k++;
            } else {
                A[k] = R[j]; j++; k++;
            }
        }
        if there's anything left in L
            put it at end of A
        else
            put whatever is left in R at end of A
    }

For example:
    L : [ 2, 79, 198]    R : [ 30, 318]

    merge L and R into A:

    compare L[i] with R[j]     item copied      A
    ----------------------     -----------      -----------------------
    L[0]=2   < R[0]=30         2   (from L)     [ 2 ]
    L[1]=79  > R[0]=30         30  (from R)     [ 2, 30 ]
    L[1]=79  < R[1]=318        79  (from L)     [ 2, 30, 79 ]
    L[2]=198 < R[1]=318        198 (from L)     [ 2, 30, 79, 198 ]
    L is exhausted             318 (from R)     [ 2, 30, 79, 198, 318 ]
merge is O(n) where n is the total number of elements in L and R.
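The merge step itself can be written by hand in a few lines. The sketch below follows the pseudocode, with two assumptions of ours: the function is named mergeSorted (to avoid confusion with std::merge), and it returns A as a new vector rather than filling an existing array.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Merges two already-sorted vectors L and R into a new vector A.
std::vector<int> mergeSorted(const std::vector<int>& L, const std::vector<int>& R) {
    std::vector<int> A(L.size() + R.size());
    std::size_t i = 0, j = 0, k = 0;
    while (i < L.size() && j < R.size()) {
        if (L[i] < R[j]) A[k++] = L[i++];   // smaller item is in L
        else             A[k++] = R[j++];   // smaller (or equal) item is in R
    }
    while (i < L.size()) A[k++] = L[i++];   // whatever is left in L
    while (j < R.size()) A[k++] = R[j++];   // whatever is left in R
    assert(k == A.size());                  // every item copied exactly once
    return A;
}
```

Each pass through any of the loops copies exactly one item, which is why the running time is linear in the total number of items. Note that with the strict < test, equal items are taken from R first; comparing with <= instead would take them from L first and keep the merge stable.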