Problem:
Solution:
int trivialHash(int n) { return n % HASH_TABLE_SIZE;}
Example
-------
. HASH_TABLE_SIZE = 10
. key value = 10 digit ID #
. hash fn = (1st 3 digits + next 3 digits + last 4 digits) % 10

for example: 0703803319 =>    070
                              380
                           + 3319
                            -----
                             3769    and 3769 % 10 = 9

so id # 0703803319 is "hashed" to the value 9
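The digit-folding example above can be sketched in code. This is only an illustration: the name foldHash is hypothetical, and the ID is assumed to arrive as a string of exactly 10 decimal digits so that leading zeros are preserved.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not from the notes): folds a 10-digit ID into
// a table position in the range 0..9.
// Assumes id holds exactly 10 decimal digits.
std::size_t foldHash(const std::string& id)
{
    const std::size_t HASH_TABLE_SIZE = 10;
    std::size_t first = std::stoul(id.substr(0, 3));  // 1st 3 digits
    std::size_t next  = std::stoul(id.substr(3, 3));  // next 3 digits
    std::size_t last  = std::stoul(id.substr(6, 4));  // last 4 digits
    return (first + next + last) % HASH_TABLE_SIZE;
}
```

Calling foldHash("0703803319") adds 70 + 380 + 3319 = 3769 and returns 9, matching the hand calculation.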
What happens if two items get hashed to the same value? For example,
int easyHash(int n) { return n % 10;}

Now easyHash(11) and easyHash(21) both result in the value 1. This is called a collision.
How do we avoid collisions? Choose HASH_TABLE_SIZE and the hash function to spread values as evenly as possible:
Basic idea: map each character of the string to its ASCII value.
character | ASCII |
---|---|
'A' | 65 |
'B' | 66 |
'C' | 67 |
'T' | 84 |
'Z' | 90 |
'a' | 97 |
'b' | 98 |
'c' | 99 |
't' | 116 |
'z' | 122 |
Treat each character as an int, add them all, return sum % table size:
size_t hashViaAscii(String S)
{
    size_t hashVal = 0;
    for (size_t k=0; k < S.length(); k++)
        hashVal += S[k];   // S[k] returns ASCII value
                           // of the kth character of String S
    return (hashVal % HASH_TABLE_SIZE);
}
A few examples, with HASH_TABLE_SIZE = 100

"Cat" hashes to 80:
    'C' ---->  67
    'a' ---->  97
    't' ----> 116
    --------------
              280 % 100 = 80

"cat" hashes to 12:
    'c' ---->  99
    'a' ---->  97
    't' ----> 116
    --------------
              312 % 100 = 12

"cab" hashes to 94:
    'c' ---->  99
    'a' ---->  97
    'b' ---->  98
    --------------
              294 % 100 = 94

"act" hashes to 12 (just like "cat")
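To check these examples by machine, here is a minimal sketch of the same sum-of-characters hash. It uses std::string rather than the course's String class, and it takes the table size as a parameter; both choices, and the name asciiSumHash, are assumptions made for this illustration.

```cpp
#include <cstddef>
#include <string>

// Sketch: add up the character codes, then reduce modulo the table size.
// tableSize is a parameter here (an assumption) and must be non-zero.
std::size_t asciiSumHash(const std::string& s, std::size_t tableSize)
{
    std::size_t hashVal = 0;
    for (unsigned char c : s)   // each character treated as a small integer
        hashVal += c;
    return hashVal % tableSize;
}
```

With a table size of 100, "Cat" gives 80 and "cab" gives 94, while "cat" and "act" both give 12: the sum ignores character order, so anagrams always collide.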
Idea: express an integer as a binary number, shift bits left or right to quickly double or halve the value.
shift left:
    i << n       // shift left n times
                 // (multiply i by 2^n)

    10 << 4      // = 10 * 16 = 160
    (in binary)  1010 ---> 10100000
    (in decimal) 8+2  ---> 128+32 ---> 160

shift right:
    i >> n       // shift right n times
                 // (divide i by 2^n, discard remainder)

    37 >> 2      // = 37 / 4 = 9
    (in binary)  100101 ---> 1001
    (in decimal) 32+4+1 ---> 8+1 ---> 9
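The two shift examples can be confirmed with a small sketch. The helper names timesPow2 and divPow2 are hypothetical, and unsigned values are used so that the shifts are well defined for every input that fits in the type.

```cpp
// Hypothetical helpers showing the arithmetic meaning of the shifts.
// Assumes n is smaller than the number of bits in unsigned.
constexpr unsigned timesPow2(unsigned i, unsigned n) { return i << n; }  // i * 2^n
constexpr unsigned divPow2(unsigned i, unsigned n)   { return i >> n; }  // i / 2^n, remainder discarded

// Checked at compile time against the worked examples.
static_assert(timesPow2(10, 4) == 160, "10 << 4 should equal 10 * 16");
static_assert(divPow2(37, 2) == 9, "37 >> 2 should equal 37 / 4");
```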
Use shift to make values large
size_t hashViaShift(String S)
{
    size_t hashVal = 0;
    for (size_t k=0; k < S.length(); k++)
        hashVal = (hashVal << 5) + S[k];
        //                 ^
        //     left shift 5 places
        //     same as multiplying by 32
        // Note: << has lower precedence than +
        // so parentheses are important
    return (hashVal % HASH_TABLE_SIZE);
}
A few examples, with HASH_TABLE_SIZE = 1000

"cat" hashes to 596:
    'c' ---> (0 << 5) + 99     --->      0 + 99  --->     99
    'a' ---> (99 << 5) + 97    --->   3168 + 97  --->   3265
    't' ---> (3265 << 5) + 116 ---> 104480 + 116 ---> 104596
    ---------------------------------------------------------
                                    104596 % 1000 = 596

"act" hashes to 612:
    'a' ---> (0 << 5) + 97     --->      0 + 97  --->     97
    'c' ---> (97 << 5) + 99    --->   3104 + 99  --->   3203
    't' ---> (3203 << 5) + 116 ---> 102496 + 116 ---> 102612
    ---------------------------------------------------------
                                    102612 % 1000 = 612
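A runnable sketch of the shift-based hash, useful for checking the two results with a table size of 1000. As before, std::string, the tableSize parameter, and the name shiftHash are assumptions for illustration rather than the course's own code.

```cpp
#include <cstddef>
#include <string>

// Sketch: multiply the running value by 32 (shift left 5 places),
// then add the next character code.
// tableSize is a parameter here (an assumption) and must be non-zero.
std::size_t shiftHash(const std::string& s, std::size_t tableSize)
{
    std::size_t hashVal = 0;
    for (unsigned char c : s)
        hashVal = (hashVal << 5) + c;   // parentheses needed: + binds tighter than <<
    return hashVal % tableSize;
}
```

Unlike the plain character sum, this version depends on character order, so "cat" (596) and "act" (612) no longer collide.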
If a key hashes to position k and array [k] is already filled, then find another, unfilled position (an "open address").
In this version of hashing, the maximum number of elements stored in the hash table is the size of the table (array) itself.
To find unfilled position, look in array [k+1], [k+2], etc. (wrapping around at the end of the array)
Example (using HASH_TABLE_SIZE = 10)
-------
name        Hash(name)
----        ----------
David       6
Paul        0
Krista      3

HashTable so far:
    0        1        2        3        4        5        6        7        8        9
-------------------------------------------------------------------------------------------
| Paul   |        |        | Krista |        |        | David  |        |        |        |
-------------------------------------------------------------------------------------------

Zach        0   <--- collision, insert into next free space ([1])

new HashTable:
    0        1        2        3        4        5        6        7        8        9
-------------------------------------------------------------------------------------------
| Paul   | Zach   |        | Krista |        |        | David  |        |        |        |
-------------------------------------------------------------------------------------------

Christine   1   <--- collision, insert into next free space ([2])
Aaron       4
Nick        9
Greg        9   <--- collision, insert into next free space ([5])

final HashTable:
     0           1           2           3           4           5           6           7           8           9
-------------------------------------------------------------------------------------------------------------------------
| Paul      | Zach      | Christine | Krista    | Aaron     | Greg      | David     |           |           | Nick      |
-------------------------------------------------------------------------------------------------------------------------
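The insertions in this example can be replayed with a short sketch. The function names probeInsert and buildExampleTable are hypothetical, an empty string stands for an unfilled slot, and the hash values are supplied by hand to match the example instead of being computed.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Puts name in the first free slot at or after home, wrapping around
// at the end of the array. Returns the slot used.
// Assumes the table still has at least one empty slot.
std::size_t probeInsert(std::vector<std::string>& table,
                        const std::string& name, std::size_t home)
{
    std::size_t i = home % table.size();
    while (!table[i].empty())
        i = (i + 1) % table.size();
    table[i] = name;
    return i;
}

// Replays the example's insertions in order and returns the final table.
std::vector<std::string> buildExampleTable()
{
    std::vector<std::string> table(10);   // HASH_TABLE_SIZE = 10, all slots empty
    probeInsert(table, "David", 6);
    probeInsert(table, "Paul", 0);
    probeInsert(table, "Krista", 3);
    probeInsert(table, "Zach", 0);        // collides with Paul, lands in [1]
    probeInsert(table, "Christine", 1);   // collides with Zach, lands in [2]
    probeInsert(table, "Aaron", 4);
    probeInsert(table, "Nick", 9);
    probeInsert(table, "Greg", 9);        // wraps around past [0]..[4], lands in [5]
    return table;
}
```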
Example Revisited
-----------------
insertions:
name        Hash(name)
----        ----------
Paul        0
Zach        0   <--- collision, insert into next free space ([1])

HashTable:
    0        1        2        3        4        5        6        7        8        9
-------------------------------------------------------------------------------------------
| Paul   | Zach   |        |        |        |        |        |        |        |        |
-------------------------------------------------------------------------------------------

remove Paul:
    0        1        2        3        4        5        6        7        8        9
-------------------------------------------------------------------------------------------
|        | Zach   |        |        |        |        |        |        |        |        |
-------------------------------------------------------------------------------------------

search Zach: NOT FOUND!
How do we handle remove? search would hash key to k then start searching array [k], [k+1], etc. Return true if key found; false if current array slot is empty. But given remove, array slot may become empty, leaving a "hole"; we don't want search to return false just because it found a "hole".
Solution: use special value indicating "formerly occupied." search will keep searching past this value, insert will insert into a position that has this value.
HashTable:
    0        1        2        3        4        5        6        7        8        9
-------------------------------------------------------------------------------------------
| Paul   | Zach   |        |        |        |        |        |        |        |        |
-------------------------------------------------------------------------------------------

remove Paul:
           0                1        2        3        4 ...
------------------------------------------------------------
| <FORMERLY_OCCUPIED> | Zach   |        |        |       ...
------------------------------------------------------------

search Zach: Found at position 1
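A minimal, self-contained sketch of why the marker matters. The types Status and Cell and the functions findSlot and searchZachAfterRemovingPaul are hypothetical names for this illustration; the search skips cells marked as formerly used and stops only at a cell that was never used.

```cpp
#include <cstddef>
#include <string>
#include <vector>

enum class Status { NeverUsed, Occupied, FormerlyUsed };

struct Cell {
    std::string key;
    Status status = Status::NeverUsed;
};

// Returns the index holding key, or table.size() if key is absent.
// Probing continues past FormerlyUsed cells and stops at a NeverUsed cell.
std::size_t findSlot(const std::vector<Cell>& table,
                     const std::string& key, std::size_t home)
{
    std::size_t i = home % table.size();
    for (std::size_t probes = 0; probes < table.size(); ++probes) {
        if (table[i].status == Status::NeverUsed)
            break;                                   // a true gap: key cannot be further on
        if (table[i].status == Status::Occupied && table[i].key == key)
            return i;
        i = (i + 1) % table.size();
    }
    return table.size();
}

// Replays the example: Paul and Zach both hash to 0, Paul is removed,
// then Zach is searched for.
std::size_t searchZachAfterRemovingPaul()
{
    std::vector<Cell> table(10);
    table[0].key = "Paul";  table[0].status = Status::Occupied;
    table[1].key = "Zach";  table[1].status = Status::Occupied;
    table[0].status = Status::FormerlyUsed;          // remove Paul, leave a marker
    return findSlot(table, "Zach", 0);
}
```

Here the search steps over the marker in position 0 and finds Zach in position 1. Had the cell simply been reset to never used, the loop would have stopped at position 0 and reported Zach as missing.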
enum table_cell_status {NEVER_USED, OCCUPIED, FORMERLY_USED};
typedef string Key;

template <class Item>
class HashTable
{
public:
    static const size_t CAPACITY = ...
    HashTable();
    bool search(const Key&, Item&) const;
    void insert(const Item&, const Key&);
    void remove(const Key&);
    ...
private:
    pair<pair<Item,Key>,table_cell_status> table[CAPACITY];
    size_t count;
    size_t hash(Key) const;
    size_t next_index(size_t index) const;
    bool isVacant(size_t index) const;
    bool neverUsed(size_t index) const;
    Key keyOf(size_t index) const;
};
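template<class Item>
void HashTable<Item>::insert(const Item& it, const string& key)
{
    // assumes count < CAPACITY, so the probe loop below terminates
    size_t i = hash(key);
    // probe until we reach a vacant slot or the slot already holding key
    while(!isVacant(i) && keyOf(i) != key)
        i = next_index(i);
    if (isVacant(i))
        count++;          // new entry; overwriting an existing key leaves count unchanged
    table[i].first.first = it;
    table[i].first.second = key;
    table[i].second = OCCUPIED;
}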
template<class Item>
bool HashTable<Item>::search(const string& key, Item& it) const
{
    size_t i = hash(key);
    // keep probing past FORMERLY_USED slots; stop at a NEVER_USED slot
    while(!neverUsed(i) && keyOf(i) != key)
        i = next_index(i);
    bool result = !isVacant(i);
    if (result)
        it = table[i].first.first;
    return result;
}
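template<class Item>
void HashTable<Item>::remove(const string& key)
{
    size_t i = hash(key);
    // keep probing past FORMERLY_USED slots; stop at a NEVER_USED slot
    while(!neverUsed(i) && keyOf(i) != key)
        i = next_index(i);
    // only an OCCUPIED slot holds a live entry; a FORMERLY_USED slot
    // with a matching stale key must not be counted a second time
    if (!isVacant(i))
    {
        table[i].second = FORMERLY_USED;
        count--;
    }
}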
If values are not evenly distributed, a "cluster" may form in the array

(x indicating an array slot is occupied)

        k    k+1   k+2   ...   k+m
     -------------------------------------------
 ... |  x  |  x  |  x  |  x  |     |     |     | ...
     -------------------------------------------

Now if we insert something with a key that hashes to k, we have to perform O(m) operations to do the insert, even though most of the table may be empty.
The bigger the cluster, the more likely it is to grow even larger, so search/insert take longer and longer...
Worst-case time for search/insert = O(# items in the table)
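The cost of a cluster can be made concrete by counting how many slots a linear-probing insert has to examine. This is a sketch only: the function name probesToInsert is hypothetical, and the table is reduced to a vector of occupied/free flags.

```cpp
#include <cstddef>
#include <vector>

// Counts the slots examined by a linear-probing insert that starts at
// home, including the free slot where the probe finally stops.
// Assumes at least one slot is free.
std::size_t probesToInsert(const std::vector<bool>& occupied, std::size_t home)
{
    std::size_t i = home % occupied.size();
    std::size_t probes = 1;
    while (occupied[i]) {
        i = (i + 1) % occupied.size();
        ++probes;
    }
    return probes;
}
```

For instance, in a table of 100 slots where only positions 20 through 24 are filled, an insert that hashes to 20 examines 6 slots, while an insert that hashes to 50 examines just 1, even though the table is 95% empty in both cases.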
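template<class Item>
void HashTable<Item>::insert(const Item& it, const string& key)
{
    // assumes count < CAPACITY, so the probe loop below terminates
    size_t i = hash(key);
    // on collision, step ahead by hash2(key) instead of by 1
    while(!isVacant(i) && keyOf(i) != key)
        i = (i + hash2(key)) % CAPACITY;
    if (isVacant(i))
        count++;          // new entry; overwriting an existing key leaves count unchanged
    table[i].first.first = it;
    table[i].first.second = key;
    table[i].second = OCCUPIED;
}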
It is vital that all values returned by the 2nd hash function be relatively prime to HASH_TABLE_SIZE ("relatively prime" means no common factors other than 1); otherwise, on collision, the probe sequence might fail to find empty spaces!
As an example of this problem, suppose HASH_TABLE_SIZE = 10 (as in the previous example) and we use:

name     Hash(name)   Hash2(name)
----     ----------   -----------
Krista   3
Allen    8
Peter    8            4    <-- put Peter in position 4 past 8,
                               wrapping around at end of array

HashTable so far:
    0        1        2        3        4        5        6        7        8        9
-------------------------------------------------------------------------------------------
|        |        | Peter  | Krista |        |        |        |        | Allen  |        |
-------------------------------------------------------------------------------------------

Susan    8            5    <-- problem! position 8 already filled
                               5 past 8 is 3, also filled
                               5 past 3 is 8, back to where we started!
To avoid this problem: all positive values less than a prime number p are relatively prime to p. So if using double hashing, we choose HASH_TABLE_SIZE to be a prime number.
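The effect of the step size can be checked by counting how many distinct positions a probe sequence ever visits. The function slotsReached below is a hypothetical helper written for this illustration, not part of the HashTable class.

```cpp
#include <cstddef>
#include <set>

// Counts the distinct positions visited by the probe sequence
// start, start + step, start + 2*step, ... (all modulo tableSize).
// Assumes tableSize is non-zero.
std::size_t slotsReached(std::size_t start, std::size_t step, std::size_t tableSize)
{
    std::set<std::size_t> seen;
    std::size_t i = start % tableSize;
    while (seen.insert(i).second)        // stop as soon as a position repeats
        i = (i + step) % tableSize;
    return seen.size();
}
```

With a table of size 10, starting at 8 with step 5 reaches only 2 positions (8 and 3), and step 4 reaches only 5. With a table of size 11, which is prime, either step reaches all 11 positions, so an empty slot is always found if one exists.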
(See Main & Savitch, pp.559-560 for more details.)
Double hashing is likely better than linear probing, but both forms of open-address hashing have limitations:
Idea: instead of storing values directly in hash table, each array element is a list of all values that hash to that position. For example:
name     Hash(name)
----     ----------
Krista   3
Allen    8
Peter    8
Susan    8

  0   1   2   3   4   5   6   7   8   9
-----------------------------------------
|   |   |   |   |   |   |   |   |   |   |
--------------|-------------------|------
              |                   |
            Krista              Allen
                                  |
                                Peter
                                  |
                                Susan

Advantages:
remove, search - O(k), where k is the length of the largest list in the hash table.
worst-case for any hash table situation remains: all values have been hashed to same location, so O(n).
In practice will be better than linear probing or double hashing.
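A minimal sketch of a chained table, assuming each array element is a std::list of names. The class name ChainedTable and its member names are hypothetical, and hash values are passed in by the caller so that the example with Krista, Allen, Peter, and Susan can be replayed; a real table would compute them from the key.

```cpp
#include <algorithm>
#include <cstddef>
#include <list>
#include <string>
#include <vector>

// Hypothetical chained hash table: slot i holds every name that hashes to i.
// Assumes the table is built with a non-zero size.
class ChainedTable
{
public:
    explicit ChainedTable(std::size_t size) : buckets(size) {}

    void insert(const std::string& name, std::size_t hashValue)
    {
        chainFor(hashValue).push_back(name);
    }

    // Walks only the one list selected by hashValue.
    bool search(const std::string& name, std::size_t hashValue) const
    {
        const std::list<std::string>& chain = buckets[hashValue % buckets.size()];
        return std::find(chain.begin(), chain.end(), name) != chain.end();
    }

    // Removes name from its list; no "formerly occupied" marker is needed.
    bool remove(const std::string& name, std::size_t hashValue)
    {
        std::list<std::string>& chain = chainFor(hashValue);
        std::list<std::string>::iterator pos =
            std::find(chain.begin(), chain.end(), name);
        if (pos == chain.end())
            return false;
        chain.erase(pos);
        return true;
    }

    std::size_t chainLength(std::size_t slot) const
    {
        return buckets[slot % buckets.size()].size();
    }

private:
    std::list<std::string>& chainFor(std::size_t hashValue)
    {
        return buckets[hashValue % buckets.size()];
    }

    std::vector<std::list<std::string>> buckets;
};
```

After inserting the four names with the hash values from the example, the list at position 8 has length 3 and the list at position 3 has length 1; removing Peter simply shortens the list at position 8 to 2.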
If the probability that Hash (x) == i is 1/HASH_TABLE_SIZE for all i in range 0 to (HASH_TABLE_SIZE - 1) (i.e. hash function distributes values evenly) then the expected length of the list in table[i] after adding n values to the hash table is n/HASH_TABLE_SIZE.
Important point: if the number of values inserted is less than the size of the hash table, then the expected length of every list is less than 1, so expected runtimes are all O(1)!