Notes on Hashing

Searching

Given a collection of n elements, how can we find a specific one?

Idea

Simple Example

Problem: given an employee #, quickly find that employee's record.

Solution: store the records in an array, using a function of the employee # as the array index.

The array is called the hash table. The function that maps an employee # to an array position is called the hash function. What if the range of possible values is too big for an array? (e.g., student IDs: 10 digits - we don't want an array of size 9,999,999,999!)

Example
-------
. HASH_TABLE_SIZE = 10

. key value = 10 digit ID #

. hash fn = (1st 3 digits + 
             next 3 digits + 
             last 4 digits) % 10

  for example:
  
0703803319 =>    070   =>   3769 % 10 = 9
                 380
              + 3319
               -----
                3769

so id # 0703803319 is "hashed" to the value 9
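
A minimal C++ sketch of this digit-sum hash (the name hashId, and treating the ID as a 10-character digit string, are assumptions for illustration):

  #include <string>
  using namespace std;

  const size_t HASH_TABLE_SIZE = 10;

  // add the first 3 digits, the next 3, and the last 4 as integers,
  // then take the sum mod the table size
  size_t hashId(const string& id)            // e.g. "0703803319"
  {
    size_t part1 = stoul(id.substr(0, 3));   //   70
    size_t part2 = stoul(id.substr(3, 3));   //  380
    size_t part3 = stoul(id.substr(6, 4));   // 3319
    return (part1 + part2 + part3) % HASH_TABLE_SIZE;   // 3769 % 10 = 9
  }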

What happens if two items get hashed to the same value? For example,

int easyHash(int n) { return n % 10;}
Now easyHash(11) and easyHash(21) both result in the value 1. This is called a collision.

How do we avoid collisions? Choose HASH_TABLE_SIZE and the hash function to spread values as evenly as possible:

  • Hash function (this can be the hard part):
  • Hashing strings to array indices

    Basic idea: map each character of the string to its ASCII value.

    Some ASCII values:

    character   ASCII
    ---------   -----
    'A'          65
    'B'          66
    'C'          67
    'T'          84
    'Z'          90
    'a'          97
    'b'          98
    'c'          99
    't'         116
    'z'         122

    One possibility (good for smallish tables)

    Treat each character as an int, add them all, return sum % table size:

    size_t hashViaAscii(const string& S)
    {
      size_t hashVal = 0;
      for (size_t k=0; k < S.length(); k++) 
        hashVal += S[k];  // S[k] is implicitly converted to the ASCII value
                          // of the kth character of string S
      return (hashVal % HASH_TABLE_SIZE);
    }
    
    A few examples, with HASH_TABLE_SIZE = 100
    
    "Cat" hashes to 80:
    
    'C' ---->   67
    'a' ---->   97
    't' ---->  116
    --------------
               280 % 100 = 80
    
    "cat" hashes to 12:
    
    'c' ---->   99
    'a' ---->   97
    't' ---->  116
    --------------
               312 % 100 = 12
    
    "cab" hashes to 94:
    
    'c' ---->   99
    'a' ---->   97
    'b' ---->   98
    --------------
               294 % 100 = 94
    
    
    "act" hashes to 12 (just like "cat")
    
    

    Binary shift

    Idea: express an integer as a binary number, shift bits left or right to quickly double or halve the value.


    Binary shift operators

    i << n      // shift left n times 
                // (multiply i by 2^n)
    
    10 << 4     // = 10 * 16 = 160
    
    (in binary)  1010 --->  10100000 
    (in decimal) 8+2  ---> 128+32    ---> 160 
    
    
    i >> n      // shift right n times
                // (divide i by 2^n, discard remainder)
    
    37 >> 2     // = 37 / 4 = 9
    
    (in binary)    100101 ---> 1001
    (in decimal)   32+4+1 ---> 8 +1  ---> 9
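
    A small demonstration of these operators (example values match the
    arithmetic above):

    #include <iostream>
    using namespace std;

    int main()
    {
      cout << (10 << 4) << endl;   // 160:  1010 becomes 10100000
      cout << (37 >> 2) << endl;   //   9:  100101 becomes 1001 (remainder discarded)
    }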
    

    Another method (good for larger tables)

    Use shift to make values large

    size_t hashViaShift(const string& S)
    {
      size_t hashVal = 0;
      for (size_t k=0; k < S.length(); k++)
        hashVal = (hashVal << 5) + S[k];
    //                      ^
    //            left shift 5 places  
    //            same as multiplying by 32
    //  Note: << has lower precedence than + 
    //  so the parentheses are important
    
      return (hashVal % HASH_TABLE_SIZE);
    }
    
    A few examples, with HASH_TABLE_SIZE = 1000
    
    "cat" hashes to 596:
    
    'c' --->     (0 << 5) +  99 --->      0 +  99 --->     99
    'a' --->    (99 << 5) +  97 --->   3168 +  97 --->   3265
    't' --->  (3265 << 5) + 116 ---> 104480 + 116 ---> 104596
    ---------------------------------------------------------
       104596 % 1000 = 596
    
    "act" hashes to 612:
    
    'a' --->     (0 << 5) +  97 --->      0 +  97 --->     97
    'c' --->    (97 << 5) +  99 --->   3104 +  99 --->   3203
    't' --->  (3203 << 5) + 116 ---> 102496 + 116 ---> 102612
    ---------------------------------------------------------
       102612 % 1000 = 612
    
    

    Resolving collisions

    What should we do when collisions happen? There are several solutions falling into two basic categories:

    Open-address hashing

    If a key hashes to position k and array[k] is already filled, then find another, unfilled position (an "open address").

    In this version of hashing, the maximum number of elements stored in the hash table is the size of the table (array) itself.

    Linear probing

    To find an unfilled position, look in array[k+1], array[k+2], etc. (wrapping around at the end of the array).
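
    For example, the "next slot" step can be written as a one-line helper
    (a sketch; the HashTable implementation further below declares a
    next_index member that plays the same role):

    // advance to the next slot, wrapping around at the end of the array
    size_t next_index(size_t i)
    {
      return (i + 1) % HASH_TABLE_SIZE;
    }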

    Example (using HASH_TABLE_SIZE = 10)
    -------
    
    name		Hash(name)
    ----		----------
    David		6
    Paul		0
    Krista		3
    
    HashTable so far:
         0   1   2     3    4   5   6     7  8   9
     ----------------------------------------------
     | Paul |  |  | Krista |  |  | David |  |  |  |
     ----------------------------------------------
    
    
    Zach		0      <--- collision, insert into next free space ([1])
    
    new HashTable:
         0      1   2     3     4   5   6     7  8   9
     --------------------------------------------------
     | Paul | Zach |  | Krista |  |  | David |  |  |  |
     --------------------------------------------------
    
    Christine	1      <--- collision, insert into next free space ([2])
    Aaron		4
    Nick		9
    Greg		9      <--- collision, insert into next free space ([5])
    
    final Hashtable:
         0      1        2           3      4       5     6     7  8    9
     ------------------------------------------------------------------------
     | Paul | Zach | Christine | Krista | Aaron | Greg | David |  |  | Nick |
     ------------------------------------------------------------------------
    

    Removing items from the hash table

    Example Revisited
    -----------------
    
    insertions:
    
    name		Hash(name)
    ----		----------
    Paul		0
    Zach		0      <--- collision, insert into next free space ([1])
    
    HashTable:
         0      1   2  3  4  5  6  7  8  9
     ---------------------------------------
     | Paul | Zach |  |  |  |  |  |  |  |  |
     ---------------------------------------
    
    
    remove Paul:
    
      0    1    2  3  4  5  6  7  8  9
     ---------------------------------------
     |  | Zach |  |  |  |  |  |  |  |  |
     ---------------------------------------
    
    
    search Zach: NOT FOUND!
    

    How do we handle remove? Without remove, search would hash the key to k, then examine array[k], array[k+1], etc., returning true if the key is found and false as soon as it reaches an empty slot. But remove can empty a slot, leaving a "hole"; we don't want search to return false just because it ran into a "hole".

    Solution: use a special value indicating "formerly occupied." search keeps searching past this value; insert may insert into a position holding this value.

    HashTable:
         0      1   2  3  4  5  6  7  8  9
     ---------------------------------------
     | Paul | Zach |  |  |  |  |  |  |  |  |
     ---------------------------------------
    
    remove Paul:
    
                0              1    2  3  4 ...
     ---------------------------------------
     | <FORMERLY_OCCUPIED>  | Zach |  |  |  ...
     ---------------------------------------
    
    search Zach: Found at position 1
    

    Implementation

    a C++ interface

    #include <string>
    #include <utility>   // for pair
    using namespace std;
    
    enum table_cell_status {NEVER_USED, OCCUPIED, FORMERLY_USED};
    
    typedef string Key;
    
    template <class Item>
    class HashTable {
    public:
      static const size_t CAPACITY = ...
      HashTable();
      bool search(const Key&, Item&) const;
      void insert(const Item&, const Key&);
      void remove(const Key&);
      ...
    private:
      pair<pair<Item,Key>,table_cell_status> table[CAPACITY];
      size_t count;
      size_t hash(Key) const;
      size_t next_index(size_t index) const;
      bool isVacant(size_t index) const;
      bool neverUsed(size_t index) const;
      Key keyOf(size_t index) const;
    };
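
    The private helpers are declared but not defined in these notes; one
    plausible set of definitions, consistent with the interface above
    (a sketch, not the original code), might be:

    template<class Item>
    size_t HashTable<Item>::hash(Key key) const
    {
      size_t hashVal = 0;
      for (size_t k = 0; k < key.length(); k++)
        hashVal = (hashVal << 5) + key[k];    // same idea as hashViaShift above
      return hashVal % CAPACITY;
    }
    
    template<class Item>
    size_t HashTable<Item>::next_index(size_t index) const
    {
      return (index + 1) % CAPACITY;          // linear probing, with wraparound
    }
    
    template<class Item>
    bool HashTable<Item>::isVacant(size_t index) const
    {
      return table[index].second != OCCUPIED; // NEVER_USED or FORMERLY_USED
    }
    
    template<class Item>
    bool HashTable<Item>::neverUsed(size_t index) const
    {
      return table[index].second == NEVER_USED;
    }
    
    template<class Item>
    Key HashTable<Item>::keyOf(size_t index) const
    {
      return table[index].first.second;       // the key stored at this slot
    }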
    

    insert

    template<class Item>
    void HashTable<Item>::insert(const Item& it, const string& key)
    {
      size_t i = hash(key);
      
      // probe until we find a vacant slot or a slot already holding this key
      // (assumes the table is not completely full)
      while(!isVacant(i) && keyOf(i) != key)
        i = next_index(i);
      table[i].first.first = it;
      table[i].first.second = key;
      table[i].second = OCCUPIED;
      count++;
      return;
    }
    

    search

    template<class Item>
    bool HashTable<Item>::search(const string& key, Item& it) const
    {
      size_t i = hash(key);
      while(!neverUsed(i) && keyOf(i) != key)
        i = next_index(i);
      bool result = !isVacant(i);
      if (result) 
        it = table[i].first.first;
      return result;
    }
    

    remove

    template<class Item>
    void HashTable<Item>::remove(const string& key)
    {
      size_t i = hash(key);
      while(!neverUsed(i) && keyOf(i) != key)
        i = next_index(i);
      if (!neverUsed(i)) {
        table[i].second = FORMERLY_USED;
        count--;
      }
    }
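
    A usage sketch (hypothetical: the phoneBook name and the extension
    numbers are made up, and the HashTable above is assumed to be
    completed with helper definitions like those sketched earlier):

    #include <iostream>
    using namespace std;

    int main()
    {
      HashTable<int> phoneBook;          // map names (keys) to extensions (items)

      phoneBook.insert(4321, "Paul");
      phoneBook.insert(1234, "Zach");    // "Zach" may collide; probing finds a free slot

      int ext;
      if (phoneBook.search("Zach", ext))
        cout << "Zach's extension is " << ext << endl;   // prints 1234

      phoneBook.remove("Paul");          // marks Paul's slot FORMERLY_USED
    }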
    

    Clustering: a problem with linear probing

    If values are not evenly distributed, a "cluster" may form in the array:

    (x indicating an array slot is occupied)
           k  k+1 k+2 ... k+m
     -------------------------------------
     ... | x | x | x | x |   |   |   | ...
     -------------------------------------
    
    Now if we insert something with a key that hashes to k, we have to perform O(m) operations to do the insert, even though most of the table may be empty.

    The bigger the cluster, the more likely it is to grow even larger, and search/insert take longer and longer...

    Worst-case time for search/insert = O(# items in the table)

    Double hashing

    To avoid clustering, we can add a second hash function. Now, if a collision occurs, rather than stepping to the next slot, we use the second hash function to determine how far ahead to try next. For example, insert might become:
    template<class Item>
    void HashTable<Item>::insert(const Item& it, const string& key)
    {
      size_t i = hash(key);
      
      while(!isVacant(i) && keyOf(i) != key)
        i = (i + hash2(key)) % CAPACITY;   // step by hash2(key) instead of 1
      table[i].first.first = it;
      table[i].first.second = key;
      table[i].second = OCCUPIED;
      count++;
      return;
    }
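
    hash2 would be a second hash member added to the class; one common
    style of second hash function (a sketch, not from these notes) is:

    template<class Item>
    size_t HashTable<Item>::hash2(Key key) const
    {
      size_t hashVal = 0;
      for (size_t k = 0; k < key.length(); k++)
        hashVal = (hashVal << 5) + key[k];
      // never return 0, and keep the step less than CAPACITY;
      // if CAPACITY is prime, every such step is relatively prime to it
      return 1 + (hashVal % (CAPACITY - 1));
    }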
    

    It is vital that all values returned by the 2nd hash function be relatively prime to HASH_TABLE_SIZE ("relatively prime" means no common factors); otherwise, on collision, might fail to find empty spaces!

    As an example of this problem, suppose:
     HASH_TABLE_SIZE = 10
    
      as in previous example, use 
          name        Hash(name)	Hash2(name)
          ----        ----------    -----------
          Krista	  3
          Allen	  8
          Peter	  8             4         <-- put Peter in position 4 past 8,
    					      wrapping around at end of array
    
    HashTable so far:
       0   1      2       3     4   5   6   7    8      9
     ------------------------------------------------------
     |   |   | Peter | Krista |   |   |   |   | Allen |   |
     ------------------------------------------------------
    
    
          Susan	  8             5  <-- problem!  position 8 already filled
          				       5 past 8 is 3, also filled
    				       5 past 3 is 8, back to where we started!
    

    To avoid this problem, note that all positive values less than a prime number p are relatively prime to p. So if using double hashing, we choose HASH_TABLE_SIZE to be a prime number.

    (See Main & Savitch, pp.559-560 for more details.)

    Double hashing is likely better than linear probing, but both forms of open-address hashing have limitations: the number of items can never exceed the size of the table, and performance degrades as the table fills up.

    Chained Hashing

    Idea: instead of storing values directly in the hash table, each array element is a list of all values that hash to that position. For example:

          name        Hash(name)
          ----        ----------
          Krista	  3
          Allen	  8
          Peter	  8
          Susan	  8
    
    
       0   1   2   3   4   5   6   7   8   9
     -----------------------------------------
     |   |   |   |   |   |   |   |   |   |   |
     --------------|-------------------|------
                   |                   |
    	     Krista              Allen
    	                           |
    				 Peter
    				   |
    				 Susan
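
    A minimal sketch of a chained hash table in C++ (the class name,
    the use of std::list, and the string hash are assumptions for
    illustration):

    #include <list>
    #include <string>
    #include <utility>
    using namespace std;

    const size_t HASH_TABLE_SIZE = 10;
    typedef string Key;

    template <class Item>
    class ChainedHashTable {
    public:
      void insert(const Item& it, const Key& key)
      {
        table[hash(key)].push_back(make_pair(key, it));  // just append to the list
      }
      bool search(const Key& key, Item& it) const
      {
        for (const pair<Key,Item>& entry : table[hash(key)])
          if (entry.first == key) { it = entry.second; return true; }
        return false;
      }
      void remove(const Key& key)
      {
        list< pair<Key,Item> >& chain = table[hash(key)];
        for (typename list< pair<Key,Item> >::iterator p = chain.begin(); p != chain.end(); ++p)
          if (p->first == key) { chain.erase(p); return; }  // no FORMERLY_USED marker needed
      }
    private:
      list< pair<Key,Item> > table[HASH_TABLE_SIZE];
      size_t hash(const Key& key) const
      {
        size_t hashVal = 0;
        for (size_t k = 0; k < key.length(); k++)
          hashVal = (hashVal << 5) + key[k];
        return hashVal % HASH_TABLE_SIZE;
      }
    };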
    
    

    Advantages:

    • the number of items stored is not limited by the size of the table
    • remove simply deletes from a list (no FORMERLY_USED markers are needed)

    Complexity

    remove, search - O(k), where k is the length of the largest list in the hash table.

    The worst case for any hash table remains the situation where all values hash to the same location, so O(n).

    In practice chaining will typically be better than linear probing or double hashing.

    If the probability that Hash(x) == i is 1/HASH_TABLE_SIZE for all i in the range 0 to (HASH_TABLE_SIZE - 1) (i.e., the hash function distributes values evenly), then the expected length of the list in table[i] after adding n values to the hash table is n/HASH_TABLE_SIZE.

    Important point: if the number of values inserted is less than the size of the hash table, then the expected length of every list is less than 1, so expected runtimes are all O(1)!
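
    For example, with HASH_TABLE_SIZE = 1000 and n = 500 inserted values, the expected length of each list is 500/1000 = 0.5, so on average a search examines less than one list node.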