Notes on Hashing

Searching

Given a collection of n elements, how can we find a specific one?

Idea

Simple Example

Problem: given an employee #, quickly find that employee's record.

Solution: store the records in an array, using a function of the employee # as the array index.

The array is called the hash table. The function that maps an employee # to an array position is called the hash function. What if the range of possible values is too big for an array? (e.g., student IDs: 10 digits - we don't want an array of size 9,999,999,999!)

Example
-------
. HASH_TABLE_SIZE = 10

. key value = 10 digit ID #

. hash fn = (1st 3 digits + 
             next 3 digits + 
             last 4 digits) % 10

  for example:
  
0703803319 =>    070   =>   3769 % 10 = 9
                 380
              + 3319
               -----
                3769

so id # 0703803319 is "hashed" to the value 9
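
A minimal C++ sketch of this digit-sum hash (the name hashId, and treating the ID as a 10-character digit string, are assumptions for illustration):

  #include <string>
  using namespace std;

  const size_t HASH_TABLE_SIZE = 10;

  // add the first 3 digits, the next 3, and the last 4 as integers,
  // then take the sum mod the table size
  size_t hashId(const string& id)            // e.g. "0703803319"
  {
    size_t part1 = stoul(id.substr(0, 3));   //   70
    size_t part2 = stoul(id.substr(3, 3));   //  380
    size_t part3 = stoul(id.substr(6, 4));   // 3319
    return (part1 + part2 + part3) % HASH_TABLE_SIZE;   // 3769 % 10 = 9
  }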

What happens if two items get hashed to the same value? For example,

int easyHash(int n) { return n % 10;}
Now easyHash(11) and easyHash(21) both result in the value 1. This is called a collision.

How do we avoid collisions? Choose HASH_TABLE_SIZE and the hash function to spread values as evenly as possible:

  • Hash function (this can be the hard part):
  • Hashing strings to array indices

    Basic idea: map each character of the string to its ASCII value.

    Some ASCII values:

    character   ASCII
    ---------   -----
    'A'          65
    'B'          66
    'C'          67
    'T'          84
    'Z'          90
    'a'          97
    'b'          98
    'c'          99
    't'         116
    'z'         122

    One possibility (good for smallish tables)

    Treat each character as an int, add them all, return sum % table size:

    size_t hashViaAscii(const string& S)
    {
      size_t hashVal = 0;
      for (size_t k=0; k < S.length(); k++) 
        hashVal += S[k];  // S[k] is implicitly converted to the ASCII value
                          // of the kth character of string S
      return (hashVal % HASH_TABLE_SIZE);
    }
    
    A few examples, with HASH_TABLE_SIZE = 100
    
    "Cat" hashes to 80:
    
    'C' ---->   67
    'a' ---->   97
    't' ---->  116
    --------------
               280 % 100 = 80
    
    "cat" hashes to 12:
    
    'c' ---->   99
    'a' ---->   97
    't' ---->  116
    --------------
               312 % 100 = 12
    
    "cab" hashes to 94:
    
    'c' ---->   99
    'a' ---->   97
    'b' ---->   98
    --------------
               294 % 100 = 94
    
    
    "act" hashes to 12 (just like "cat")
    
    

    Binary shift

    Idea: express an integer as a binary number, shift bits left or right to quickly double or halve the value.


    Binary shift operators

    i << n      // shift left n times 
                // (multiply i by 2^n)
    
    10 << 4     // = 10 * 16 = 160
    
    (in binary)  1010 --->  10100000 
    (in decimal) 8+2  ---> 128+32    ---> 160 
    
    
    i >> n      // shift right n times
                // (divide i by 2^n, discard remainder)
    
    37 >> 2     // = 37 / 4 = 9
    
    (in binary)    100101 ---> 1001
    (in decimal)   32+4+1 ---> 8 +1  ---> 9
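
    A small demonstration of these operators (example values match the
    arithmetic above):

    #include <iostream>
    using namespace std;

    int main()
    {
      cout << (10 << 4) << endl;   // 160:  1010 becomes 10100000
      cout << (37 >> 2) << endl;   //   9:  100101 becomes 1001 (remainder discarded)
    }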
    

    Another method (good for larger tables)

    Use shift to make values large

    size_t hashViaShift(const string& S)
    {
      size_t hashVal = 0;
      for (size_t k=0; k < S.length(); k++)
        hashVal = (hashVal << 5) + S[k];
    //                      ^
    //            left shift 5 places  
    //            same as multiplying by 32
    //  Note: << has lower precedence than + 
    //  so the parentheses are important
    
      return (hashVal % HASH_TABLE_SIZE);
    }
    
    A few examples, with HASH_TABLE_SIZE = 1000
    
    "cat" hashes to 596:
    
    'c' --->     (0 << 5) +  99 --->      0 +  99 --->     99
    'a' --->    (99 << 5) +  97 --->   3168 +  97 --->   3265
    't' --->  (3265 << 5) + 116 ---> 104480 + 116 ---> 104596
    ---------------------------------------------------------
       104596 % 1000 = 596
    
    "act" hashes to 612:
    
    'a' --->     (0 << 5) +  97 --->      0 +  97 --->     97
    'c' --->    (97 << 5) +  99 --->   3104 +  99 --->   3203
    't' --->  (3203 << 5) + 116 ---> 102496 + 116 ---> 102612
    ---------------------------------------------------------
       102612 % 1000 = 612
    
    

    Resolving collisions

    What should we do when collisions happen? There are several solutions falling into two basic categories:

    Open-address hashing

    If a key hashes to position k and array[k] is already filled, then find another, unfilled position (an "open address").

    In this version of hashing, the maximum number of elements stored in the hash table is the size of the table (array) itself.

    Linear probing

    To find an unfilled position, look in array[k+1], array[k+2], etc. (wrapping around at the end of the array).
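
    For example, the "next slot" step can be written as a one-line helper
    (a sketch; the HashTable implementation further below declares a
    next_index member that plays the same role):

    // advance to the next slot, wrapping around at the end of the array
    size_t next_index(size_t i)
    {
      return (i + 1) % HASH_TABLE_SIZE;
    }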

    Example (using HASH_TABLE_SIZE = 10)
    -------
    
    name		Hash(name)
    ----		----------
    David		6
    Paul		0
    Krista		3
    
    HashTable so far:
         0   1   2     3    4   5   6     7  8   9
     ----------------------------------------------
     | Paul |  |  | Krista |  |  | David |  |  |  |
     ----------------------------------------------
    
    
    Zach		0      <--- collision, insert into next free space ([1])
    
    new HashTable:
         0      1   2     3     4   5   6     7  8   9
     --------------------------------------------------
     | Paul | Zach |  | Krista |  |  | David |  |  |  |
     --------------------------------------------------
    
    Christine	1      <--- collision, insert into next free space ([2])
    Aaron		4
    Nick		9
    Greg		9      <--- collision, insert into next free space ([5])
    
    final Hashtable:
         0      1        2           3      4       5     6     7  8    9
     ------------------------------------------------------------------------
     | Paul | Zach | Christine | Krista | Aaron | Greg | David |  |  | Nick |
     ------------------------------------------------------------------------
    

    Removing items from the hash table

    Example Revisited
    -----------------
    
    insertions:
    
    name		Hash(name)
    ----		----------
    Paul		0
    Zach		0      <--- collision, insert into next free space ([1])
    
    HashTable:
         0      1   2  3  4  5  6  7  8  9
     ---------------------------------------
     | Paul | Zach |  |  |  |  |  |  |  |  |
     ---------------------------------------
    
    
    remove Paul:
    
      0    1    2  3  4  5  6  7  8  9
     ---------------------------------------
     |  | Zach |  |  |  |  |  |  |  |  |
     ---------------------------------------
    
    
    search Zach: NOT FOUND!
    

    How do we handle remove? Without remove, search would hash the key to k, then examine array[k], array[k+1], etc., returning true if the key is found and false as soon as it reaches an empty slot. But remove can empty a slot, leaving a "hole"; we don't want search to return false just because it ran into a "hole".

    Solution: use a special value indicating "formerly occupied." search keeps searching past this value; insert may insert into a position holding this value.

    HashTable:
         0      1   2  3  4  5  6  7  8  9
     ---------------------------------------
     | Paul | Zach |  |  |  |  |  |  |  |  |
     ---------------------------------------
    
    remove Paul:
    
                0              1    2  3  4 ...
     ---------------------------------------
     | <FORMERLY_OCCUPIED>  | Zach |  |  |  ...
     ---------------------------------------
    
    search Zach: Found at position 1
    

    Implementation

    a C++ interface

    #include <string>
    #include <utility>   // for pair
    using namespace std;
    
    enum table_cell_status {NEVER_USED, OCCUPIED, FORMERLY_USED};
    
    typedef string Key;
    
    template <class Item>
    class HashTable {
    public:
      static const size_t CAPACITY = ...
      HashTable();
      bool search(const Key&, Item&) const;
      void insert(const Item&, const Key&);
      void remove(const Key&);
      ...
    private:
      pair<pair<Item,Key>,table_cell_status> table[CAPACITY];
      size_t count;
      size_t hash(Key) const;
      size_t next_index(size_t index) const;
      bool isVacant(size_t index) const;
      bool neverUsed(size_t index) const;
      Key keyOf(size_t index) const;
    };
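
    The private helpers are declared but not defined in these notes; one
    plausible set of definitions, consistent with the interface above
    (a sketch, not the original code), might be:

    template<class Item>
    size_t HashTable<Item>::hash(Key key) const
    {
      size_t hashVal = 0;
      for (size_t k = 0; k < key.length(); k++)
        hashVal = (hashVal << 5) + key[k];    // same idea as hashViaShift above
      return hashVal % CAPACITY;
    }
    
    template<class Item>
    size_t HashTable<Item>::next_index(size_t index) const
    {
      return (index + 1) % CAPACITY;          // linear probing, with wraparound
    }
    
    template<class Item>
    bool HashTable<Item>::isVacant(size_t index) const
    {
      return table[index].second != OCCUPIED; // NEVER_USED or FORMERLY_USED
    }
    
    template<class Item>
    bool HashTable<Item>::neverUsed(size_t index) const
    {
      return table[index].second == NEVER_USED;
    }
    
    template<class Item>
    Key HashTable<Item>::keyOf(size_t index) const
    {
      return table[index].first.second;       // the key stored at this slot
    }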
    

    insert

    template<class Item>
    void HashTable<Item>::insert(const Item& it, const string& key)
    {
      size_t i = hash(key);
      
      // probe until we find a vacant slot or a slot already holding this key
      // (assumes the table is not completely full)
      while(!isVacant(i) && keyOf(i) != key)
        i = next_index(i);
      table[i].first.first = it;
      table[i].first.second = key;
      table[i].second = OCCUPIED;
      count++;
      return;
    }
    

    search

    template<class Item>
    bool HashTable<Item>::search(const string& key, Item& it) const
    {
      size_t i = hash(key);
      while(!neverUsed(i) && keyOf(i) != key)
        i = next_index(i);
      bool result = !isVacant(i);
      if (result) 
        it = table[i].first.first;
      return result;
    }
    

    remove

    template<class Item>
    void HashTable<Item>::remove(const string& key)
    {
      size_t i = hash(key);
      while(!neverUsed(i) && keyOf(i) != key)
        i = next_index(i);
      if (!neverUsed(i)) {
        table[i].second = FORMERLY_USED;
        count--;
      }
    }
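
    A usage sketch (hypothetical: the phoneBook name and the extension
    numbers are made up, and the HashTable above is assumed to be
    completed with helper definitions like those sketched earlier):

    #include <iostream>
    using namespace std;

    int main()
    {
      HashTable<int> phoneBook;          // map names (keys) to extensions (items)

      phoneBook.insert(4321, "Paul");
      phoneBook.insert(1234, "Zach");    // "Zach" may collide; probing finds a free slot

      int ext;
      if (phoneBook.search("Zach", ext))
        cout << "Zach's extension is " << ext << endl;   // prints 1234

      phoneBook.remove("Paul");          // marks Paul's slot FORMERLY_USED
    }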
    

    Clustering: a problem with linear probing

    If values are not evenly distributed, a "cluster" may form in the array:

    (x indicating an array slot is occupied)
           k  k+1 k+2 ... k+m
     -------------------------------------
     ... | x | x | x | x |   |   |   | ...
     -------------------------------------
    
    Now if we insert something with a key that hashes to k, we have to perform O(m) operations to do the insert, even though most of the table may be empty.

    The bigger the cluster, the more likely it is to grow even larger, and search/insert take longer and longer...

    Worst-case time for search/insert = O(# items in the table)

    Double hashing

    To avoid clustering, we can add a second hash function. Now, if a collision occurs, rather than stepping to the next slot, we use the second hash function to determine how far ahead to try next. For example, insert might become:
    template<class Item>
    void HashTable<Item>::insert(const Item& it, const string& key)
    {
      size_t i = hash(key);
      
      while(!isVacant(i) && keyOf(i) != key)
        i = (i + hash2(key)) % CAPACITY;   // step by hash2(key) instead of 1
      table[i].first.first = it;
      table[i].first.second = key;
      table[i].second = OCCUPIED;
      count++;
      return;
    }
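
    hash2 would be a second hash member added to the class; one common
    style of second hash function (a sketch, not from these notes) is:

    template<class Item>
    size_t HashTable<Item>::hash2(Key key) const
    {
      size_t hashVal = 0;
      for (size_t k = 0; k < key.length(); k++)
        hashVal = (hashVal << 5) + key[k];
      // never return 0, and keep the step less than CAPACITY;
      // if CAPACITY is prime, every such step is relatively prime to it
      return 1 + (hashVal % (CAPACITY - 1));
    }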
    

    It is vital that all values returned by the 2nd hash function be relatively prime to HASH_TABLE_SIZE ("relatively prime" means no common factors); otherwise, on collision, might fail to find empty spaces!

    As an example of this problem, suppose:
     HASH_TABLE_SIZE = 10
    
      as in previous example, use 
          name        Hash(name)	Hash2(name)
          ----        ----------    -----------
          Krista	  3
          Allen	  8
          Peter	  8             4         <-- put Peter in position 4 past 8,
    					      wrapping around at end of array
    
    HashTable so far:
       0   1      2       3     4   5   6   7    8      9
     ------------------------------------------------------
     |   |   | Peter | Krista |   |   |   |   | Allen |   |
     ------------------------------------------------------
    
    
          Susan	  8             5  <-- problem!  position 8 already filled
          				       5 past 8 is 3, also filled
    				       5 past 3 is 8, back to where we started!
    

    To avoid this problem, note that all positive values less than a prime number p are relatively prime to p. So if using double hashing, we choose HASH_TABLE_SIZE to be a prime number.

    (See Main & Savitch, pp.559-560 for more details.)

    Double hashing is likely better than linear probing, but both forms of open-address hashing have limitations: the number of items can never exceed the size of the table, and performance degrades as the table fills up.

    Chained Hashing

    Idea: instead of storing values directly in the hash table, each array element is a list of all values that hash to that position. For example:

          name        Hash(name)
          ----        ----------
          Krista	  3
          Allen	  8
          Peter	  8
          Susan	  8
    
    
       0   1   2   3   4   5   6   7   8   9
     -----------------------------------------
     |   |   |   |   |   |   |   |   |   |   |
     --------------|-------------------|------
                   |                   |
    	     Krista              Allen
    	                           |
    				 Peter
    				   |
    				 Susan
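
    A minimal sketch of a chained hash table in C++ (the class name,
    the use of std::list, and the string hash are assumptions for
    illustration):

    #include <list>
    #include <string>
    #include <utility>
    using namespace std;

    const size_t HASH_TABLE_SIZE = 10;
    typedef string Key;

    template <class Item>
    class ChainedHashTable {
    public:
      void insert(const Item& it, const Key& key)
      {
        table[hash(key)].push_back(make_pair(key, it));  // just append to the list
      }
      bool search(const Key& key, Item& it) const
      {
        for (const pair<Key,Item>& entry : table[hash(key)])
          if (entry.first == key) { it = entry.second; return true; }
        return false;
      }
      void remove(const Key& key)
      {
        list< pair<Key,Item> >& chain = table[hash(key)];
        for (typename list< pair<Key,Item> >::iterator p = chain.begin(); p != chain.end(); ++p)
          if (p->first == key) { chain.erase(p); return; }  // no FORMERLY_USED marker needed
      }
    private:
      list< pair<Key,Item> > table[HASH_TABLE_SIZE];
      size_t hash(const Key& key) const
      {
        size_t hashVal = 0;
        for (size_t k = 0; k < key.length(); k++)
          hashVal = (hashVal << 5) + key[k];
        return hashVal % HASH_TABLE_SIZE;
      }
    };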
    
    

    Advantages:

    • the number of items stored is not limited by the size of the table
    • remove simply deletes from a list (no FORMERLY_USED markers are needed)

    Complexity

    remove, search - O(k), where k is the length of the largest list in the hash table.

    The worst case for any hash table remains the situation where all values hash to the same location, so O(n).

    In practice chaining will typically be better than linear probing or double hashing.

    If the probability that Hash(x) == i is 1/HASH_TABLE_SIZE for all i in the range 0 to (HASH_TABLE_SIZE - 1) (i.e., the hash function distributes values evenly), then the expected length of the list in table[i] after adding n values to the hash table is n/HASH_TABLE_SIZE.

    Important point: if the number of values inserted is less than the size of the hash table, then the expected length of every list is less than 1, so expected runtimes are all O(1)!
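
    For example, with HASH_TABLE_SIZE = 1000 and n = 500 inserted values, the expected length of each list is 500/1000 = 0.5, so on average a search examines less than one list node.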