One possibility would be to use an array. Insert in this case would be fast (O(1), appending at the end), but delete and find would be slow (O(n), since we may have to scan the entire array). A linked list representation would not improve performance: find still requires a linear scan.
To speed up finds, I could use a sorted array, which would make them O(log n) at the cost of slower inserts (O(n)). This also requires me to be able to compare students in such a way that I can say that student a should come before student b. For instance, we could sort by last name. However, it would also be nice to avoid values that are exactly the same: there may be multiple students named Smith, and we would not be able to distinguish between them. We could, of course, incorporate the first name, but we would still have problems if there were a lot of Alice Smiths at the university.
What we really need is a "key" value: a piece of information which is unique to that student. A student id number may work well: one would hope that these are unique among students, and because they are numbers, it is easy to order students by them.
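For instance, if we keep an array of ids sorted, Java's built-in binary search gives us the O(log n) find mentioned above (a minimal sketch; the ids are made up):

```java
import java.util.Arrays;

public class SortedIdDemo {
    public static void main(String[] args) {
        int[] ids = {442, 17, 901, 255};
        Arrays.sort(ids);                        // keep the array sorted (inserting later is the O(n) part)
        int pos = Arrays.binarySearch(ids, 255); // O(log n) find
        System.out.println("found at index " + pos); // prints 1
    }
}
```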
Unfortunately, if there are a lot of students at the university, this scheme may still not be sufficient: lookups may be too slow. Ideally, we want a data structure which allows us to perform all operations in O(1) time. To do this, we need some way to say that student a, and only student a, maps to a specific location in our structure. Furthermore, this location must not change as students are added and deleted.

Suppose we use an array as our base structure. We would then need some sort of function which takes the information about the student and outputs a unique index into the array for that student. If we use the key as our piece of information, we need a function h of the form h(key) = index. This function is known as a hash function. One possibility is to say h(key) = key. This may not work well in our example, though: if the id of a student has many digits (like 901 123 4567), this scheme produces very large indices. We probably don't want to create an array that big!
Instead, we have to modify our function slightly: we want to get numbers in a "reasonable" range: indices which are within the bounds of the array we use. Unfortunately, this may be difficult to do while still maintaining our uniqueness goal. An ideal hash function should:

- be fast to compute;
- distribute the keys evenly across the indices, so that collisions are rare;
- always map the same key to the same index.
Some possible hash functions include:

- h(key) = key mod tablesize (the division method), which forces any integer key into the bounds of the array;
- "folding": chopping the key into pieces (the digits of a number, or the characters of a string), combining the pieces into a single number, and then modding by the table size.
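A rough sketch of both in Java (the method names divisionHash and foldHash are my own, for illustration):

```java
public class HashFunctions {

    // Division method: reduce the key modulo the table size.
    static int divisionHash(long key, int tableSize) {
        return (int) Math.abs(key % tableSize);
    }

    // Folding: combine the characters of a string into one int,
    // then reduce that to a valid index.
    static int foldHash(String key, int tableSize) {
        int h = 0;
        for (int i = 0; i < key.length(); i++) {
            h = 31 * h + key.charAt(i);   // same idea as String.hashCode()
        }
        return Math.abs(h % tableSize);
    }

    public static void main(String[] args) {
        System.out.println(divisionHash(9011234567L, 7)); // 9011234567 mod 7 = 6
        System.out.println(foldHash("Alice Smith", 7));
    }
}
```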
Another nice thing about hash functions is that you can often hash values other than ints to produce an index. This concept is built into Java: the Object class provides a method called hashCode() which converts any object into an int (in a rather naive way by default, though classes like String override it with something better).
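A tiny example of using it (reducing the result to an array index is our job, not Java's):

```java
public class HashCodeDemo {
    public static void main(String[] args) {
        String name = "Alice Smith";
        int capacity = 7;
        int h = name.hashCode();             // any object can produce an int
        int index = Math.abs(h % capacity);  // force it into [0, capacity)
        System.out.println(h + " -> index " + index);
    }
}
```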
Let's try building our structure, based upon the ideas discussed above. Because we are using a hash function, this structure is called a "hash table".
As an example, suppose we had the following items in our hash table, which uses the piece of data itself as the key value and mods by the size of the array (here, 7) to determine the index (a blank spot indicates that there is no element there):
| Index | 0 | 1  | 2 | 3 | 4  | 5 | 6  |
|-------|---|----|---|---|----|---|----|
| Item  |   | 15 | 9 |   | 25 |   | 13 |
In trying to add 22 to the table, we would want to put it at index 1 (22 mod 7 = 1). But that spot is occupied, so we move to index 2. That is also occupied, so we move on to index 3, where we find an open spot:
| Index | 0 | 1  | 2 | 3  | 4  | 5 | 6  |
|-------|---|----|---|----|----|---|----|
| Item  |   | 15 | 9 | 22 | 25 |   | 13 |
Now, to add 23 (which hashes to index 2), we have to move all the way to index 5 before an open spot is found:
| Index | 0 | 1  | 2 | 3  | 4  | 5  | 6  |
|-------|---|----|---|----|----|----|----|
| Item  |   | 15 | 9 | 22 | 25 | 23 | 13 |
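A minimal sketch of that probe sequence in Java, with the table contents hard-coded from the example above:

```java
public class ProbeDemo {
    public static void main(String[] args) {
        Integer[] table = {null, 15, 9, 22, 25, null, 13};
        int key = 23;
        int index = key % table.length;    // 23 % 7 == 2
        while (table[index] != null) {     // linear probing: step forward
            index = (index + 1) % table.length;
        }
        table[index] = key;
        System.out.println(key + " placed at index " + index); // index 5
    }
}
```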
To look for data, we go to the index at which we expect to find the value and look at that spot. If the value is not there, we have to keep probing forward until we either find it or reach an empty spot: because of linear probing, the value may have been pushed past its home index.
This may cause a problem when we delete, however: if we remove an element, we may not be able to find items later because of the hole we create. For instance, consider the following hash table:
| Index | 0 | 1  | 2 | 3  | 4  | 5  | 6  |
|-------|---|----|---|----|----|----|----|
| Item  |   | 15 | 9 | 22 | 25 | 23 | 13 |
Removing 9 yields:
| Index | 0 | 1  | 2 | 3  | 4  | 5  | 6  |
|-------|---|----|---|----|----|----|----|
| Item  |   | 15 |   | 22 | 25 | 23 | 13 |
Then we will not find item 22: we start by looking at index 1 (22 mod 7 = 1). Finding an item there which is not 22, we look at the next spot in case a linear probe happened. But index 2 is now empty, so we stop and report failure, even though 22 is actually sitting at index 3.
To solve this problem, we need to use "lazy deletion": some way to mark that there was once a value here, so that searches know to probe further. We can still reuse the spot for subsequent inserts.
Our table consists of three parallel arrays: A holds the data items, K holds their keys, and v holds a boolean which is true if the spot is occupied or was occupied at some point (this is our lazy-deletion mark). We also keep track of the table's size.
We begin with the get method, which returns an item in the hashtable based upon a key.
get(key)
    Input: key: the key value for the data being sought
    Returns: the item associated with the key; null if the key is not being used

    index = hash(key) % length(A)
    while v[index] == true
        if K[index] == key
            return A[index]
        index = (index + 1) % length(A)
    return null
Note the two uses of the mod operator. They are used so that index is always within the bounds of the array, wrapping back to 0 when a probe runs off the end.
The contains method is similar, but simply returns whether or not the hashtable contains a piece of data associated with the key:
contains(key)
    Input: key: the key value for the data being searched for
    Returns: true if the hashtable contains a piece of data associated with the key; false otherwise

    index = hash(key) % length(A)
    while v[index] == true
        if K[index] == key
            return true
        index = (index + 1) % length(A)
    return false
Our put method performs linear probing:
put(key, item)
    Input: key: the key value associated with the data being added (item)

    index = hash(key) % length(A)
    while A[index] != null
        index = (index + 1) % length(A)
    A[index] = item
    K[index] = key
    v[index] = true
    size = size + 1
Our put method makes an important assumption: that we will eventually find a spot in our array which is null. This requires several things: for instance, we must not allow item to equal null, and we must never let the array fill up completely. We will discuss when to resize the hashtable in the analysis below.
Finally comes our remove method:
remove(key)
    Input: key: the key for the item to be removed
    Returns: the data item associated with the key; null if the item was not found

    index = hash(key) % length(A)
    while v[index] == true
        if K[index] == key
            old = A[index]
            A[index] = null
            K[index] = null
            size = size - 1
            return old
        index = (index + 1) % length(A)
    return null

Notice that v[index] is deliberately left true: this is our lazy deletion at work, so later searches will still probe past the emptied spot.
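Putting the pieces together, here is a minimal Java sketch of the whole structure under the same assumptions (non-null items, a table that never fills; the class name LinearProbingTable is my own):

```java
public class LinearProbingTable<V> {
    private Object[] A;      // the data items
    private int[] K;         // the keys
    private boolean[] v;     // true if the spot is, or ever was, occupied
    private int size = 0;

    public LinearProbingTable(int capacity) {
        A = new Object[capacity];
        K = new int[capacity];
        v = new boolean[capacity];
    }

    private int hash(int key) {
        return Math.abs(key % A.length);
    }

    @SuppressWarnings("unchecked")
    public V get(int key) {
        int index = hash(key);
        while (v[index]) {                  // keep probing past lazily deleted spots
            if (A[index] != null && K[index] == key) {
                return (V) A[index];
            }
            index = (index + 1) % A.length;
        }
        return null;                        // hit a never-used spot: not present
    }

    public boolean contains(int key) {
        return get(key) != null;
    }

    public void put(int key, V item) {
        int index = hash(key);
        while (A[index] != null) {          // a lazily deleted spot (A null) is reusable
            index = (index + 1) % A.length;
        }
        A[index] = item;
        K[index] = key;
        v[index] = true;
        size++;
    }

    @SuppressWarnings("unchecked")
    public V remove(int key) {
        int index = hash(key);
        while (v[index]) {
            if (A[index] != null && K[index] == key) {
                V old = (V) A[index];
                A[index] = null;            // v[index] stays true: lazy deletion
                size--;
                return old;
            }
            index = (index + 1) % A.length;
        }
        return null;
    }

    public static void main(String[] args) {
        LinearProbingTable<String> t = new LinearProbingTable<>(7);
        t.put(15, "a"); t.put(9, "b"); t.put(25, "c"); t.put(13, "d");
        t.put(22, "e");                     // probes 1 -> 2 -> 3, as in the example
        t.remove(9);                        // leaves a lazily deleted hole at index 2
        System.out.println(t.get(22));      // still found at index 3: prints "e"
    }
}
```

One detail differs from the pseudocode: since an int array cannot hold null, the A[index] != null check in get and remove plays the role of the nulled-out key.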
On average, the amount of time it takes to insert an item is 1 / (1 - a), where a is the load factor (the fraction of the array in use: a = size / length(A)), when we make certain assumptions: the table is not full, and each probe is equally likely to land on any index, independently of the previous probes.
Unfortunately, these assumptions do not hold for linear probing: values tend to clump together, and once things start to clump, further additions make the clumping even worse, since any key hashing into a run of occupied spots extends that run. This is known as primary clustering. With primary clustering, an insertion typically takes 0.5 * (1 + 1/(1 - a)^2) probes, while a typical successful search takes 0.5 * (1 + 1/(1 - a)) probes. For example, at a = 0.5, an insertion averages 0.5 * (1 + 4) = 2.5 probes instead of the 2 predicted above.
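To get a feel for these formulas, here is a quick sketch tabulating the expected probe counts at a few load factors (the numbers come straight from the two estimates above):

```java
public class ProbeCost {
    public static void main(String[] args) {
        for (double a : new double[]{0.25, 0.50, 0.75, 0.90}) {
            double insert = 0.5 * (1 + 1 / ((1 - a) * (1 - a)));
            double search = 0.5 * (1 + 1 / (1 - a));
            System.out.printf("a=%.2f  insert=%5.2f  search=%5.2f%n",
                              a, insert, search);
        }
    }
}
```

Note how the insertion cost explodes as a climbs past 0.75 (8.5 probes at 0.75, over 50 at 0.90), which foreshadows the resize threshold discussed below.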
When the load factor becomes "too high", we resize our array to hold more elements. This is typically done by changing the capacity to 2*oldcapacity + 1. We then have to rehash our values into the new, expanded array rather than doing a straight copy: every index was computed mod the old capacity, so most elements belong somewhere else in the new array.
Two questions remain. First, why 2*oldcapacity + 1? We typically want the size of the array to be a prime number (I will not go into why, but this has been found to work well). 2*oldcapacity + 1 may or may not be prime, but at least it will be odd and will therefore be more likely to be prime.
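A minimal sketch of the rehash, assuming it is added to the LinearProbingTable class sketched earlier (resize is my own name; a production version would also search upward from 2*oldcapacity + 1 for an actual prime):

```java
// Grow to 2*oldcapacity + 1 and re-insert every element: each index
// was computed mod the old capacity, so a straight copy would
// misplace nearly everything.
@SuppressWarnings("unchecked")
private void resize() {
    Object[] oldA = A;
    int[] oldK = K;
    A = new Object[2 * oldA.length + 1];
    K = new int[A.length];
    v = new boolean[A.length];
    size = 0;
    for (int i = 0; i < oldA.length; i++) {
        if (oldA[i] != null) {           // skip empty and lazily deleted spots
            put(oldK[i], (V) oldA[i]);   // rehash into the new array
        }
    }
}
```

As a bonus, rehashing clears out all of the lazy-deletion marks, since v starts fresh.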
Finally, how high should the load factor be allowed to climb before a resize takes place? The default value in Java's HashMap is 0.75, so this seems like a reasonable threshold.