  0    1    2    3    4    5    6    7    8
----------------------------------------------
|    |    |    |    |    |    |    |    |    |
----------------------------------------------
We need a way of mapping the numbers to the array indexes, a hash function
that will let us store the numbers and later recompute the index when we
want to retrieve them. There is a natural choice for this. Our hashtable
has 9 fields, and the mod function, which sends every integer to its
remainder modulo 9, will map an integer to a number between 0 and 8.
5 mod 9 = 5
17 mod 9 = 8
37 mod 9 = 1
20 mod 9 = 2
42 mod 9 = 6
3 mod 9 = 3
We store the values:
  0    1    2    3    4    5    6    7    8
----------------------------------------------
|    | 37 | 20 | 3  |    | 5  | 42 |    | 17 |
----------------------------------------------
In this case, computing the hash value of the number n to be stored,
n mod 9, costs a constant amount of time. And so does the actual storage,
because n is stored directly in an array field.
If you want to check whether the value n is in the hash table, you also
compute n mod 9 and then access the appropriate array field to see if n
is stored there. Both operations take a constant amount of time.
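A minimal sketch of this in Java (assuming non-negative integer keys and
ignoring collisions for now) could look like:

    public class SimpleHashTable {
        private static final int TABLE_SIZE = 9;            // 9 array fields, as in the example
        private Integer[] table = new Integer[TABLE_SIZE];

        // store n in the field given by its hash value n mod 9 -- O(1)
        public void insert(int n) {
            table[n % TABLE_SIZE] = n;                       // collisions are ignored for now
        }

        // check whether n is in the table -- O(1)
        public boolean lookup(int n) {
            Integer stored = table[n % TABLE_SIZE];
            return stored != null && stored == n;
        }
    }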
A problem arises if you want to store the number 11 in addition to the
numbers already in the hashtable. 11 hashes to 2 (= 11 mod 9), and thus it
would have to be stored in the array field that is already occupied by the
number 20. This is called a collision. The fuller the table gets, the more
likely collisions become.
The load factor measures how full the table is:

                  number of items stored in the table
    load factor = ------------------------------------
                  number of array fields in the table
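In the example table above, 6 numbers are stored in 9 array fields, so the
load factor is 6/9, roughly 0.67.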
There are several ways of dealing with collisions. In our example, we use
chaining: each array field refers to a linked list of all items that hash
to that index:
  0    1    2    3    4    5    6    7    8
----------------------------------------------
| \  | |  | |  | |  | \  | |  | |  | \  | |  |
-------|----|----|---------|----|---------|---
       v    v    v         v    v         v
       37   11   3         5    42        17
            |
            v
            20
In the worst-case scenario, all items hash to the same value v. Thus we
store them all in the data structure (linked list, balanced BST) referred
to by hashtable[v], and the data structure attached to hashtable[v] will
be nontrivial.
In this worst case we have time complexities:
                          linked list      BST
--------------------------------------------------------
Complexity for storage    O(1)             O(log N)
Complexity for lookup     O(N)             O(log N)

where N = number of items in the hash table.
It is possible to have a load factor > 1 when we are dealing with chained
hashing. That doesn't necessarily mean that the hash table is overly full,
since the chain attached to each table entry may not be very long.
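A minimal sketch of chained hashing in Java (the class and method names are
chosen for illustration; non-negative keys are assumed) could look like:

    import java.util.LinkedList;

    public class ChainedHashTable {
        private static final int TABLE_SIZE = 9;
        private LinkedList<Integer>[] table;        // one chain per array field

        @SuppressWarnings("unchecked")
        public ChainedHashTable() {
            table = new LinkedList[TABLE_SIZE];
            for (int i = 0; i < TABLE_SIZE; i++) {
                table[i] = new LinkedList<>();
            }
        }

        // O(1): add the item at the front of its chain
        public void insert(int n) {
            table[n % TABLE_SIZE].addFirst(n);
        }

        // O(length of the chain); O(N) if everything hashes to the same index
        public boolean lookup(int n) {
            return table[n % TABLE_SIZE].contains(n);
        }
    }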
Another way of dealing with collisions is linear probing: when a collision
occurs, the item is stored in the next free array field.

  0    1    2    3    4    5    6    7    8
----------------------------------------------
|    | 37 | 20 | 3  | 11 | 5  | 42 |    | 17 |
----------------------------------------------
Both 20 and 11 hash to 2. 20 was inserted first, so when 11 came along, we
have to look for the first empty space in the table after index 2. The
index 3 is already occupied, but the field at index 4 is still free, so we
insert 11 there.
This illustrates how clusters happen. They are all the more likely to occur
the higher the load factor is. The load factor's limit value in this kind
of hash table is 1, because you cannot store more items than the table has
entries. A load factor of 0.6 is considered quite high, because it
indicates that storage and retrieval are quite inefficient due to the
presence of clusters, and it thus calls for a resize of the hash table.
Suppose you do a lookup(11). 11 hashes to 2, and so you inspect
hashtable[2]. Since you don't find 11 there, you keep looking at the
successive table entries until you either find 11 or you come to an empty
field.
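A sketch of insertion and lookup with linear probing in Java could look as
follows (illustrative only; it assumes non-negative keys, a table that
never becomes completely full, and it does not yet handle removal):

    public class LinearProbingTable {
        private static final int TABLE_SIZE = 9;
        private Integer[] table = new Integer[TABLE_SIZE];

        // probe successive fields until a free one is found
        // (assumes the table is not completely full)
        public void insert(int n) {
            int i = n % TABLE_SIZE;
            while (table[i] != null) {
                i = (i + 1) % TABLE_SIZE;   // wrap around at the end of the array
            }
            table[i] = n;
        }

        // probe until we either find n or reach an empty field
        public boolean lookup(int n) {
            int i = n % TABLE_SIZE;
            while (table[i] != null) {
                if (table[i] == n) {
                    return true;
                }
                i = (i + 1) % TABLE_SIZE;
            }
            return false;
        }
    }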
Problem:
Assume now that you want to remove(3) from the hash table. 3 is located at
index 3.
  0    1    2    3    4    5    6    7    8
----------------------------------------------
|    | 37 | 20 |    | 11 | 5  | 42 |    | 17 |
----------------------------------------------
Next, do a lookup(11). You will stop immediately at the now-empty index 3,
because the cluster has been broken and you assume that nothing hashing to
2 is stored any further along.
Solution:
Instead of leaving the place where we removed the item empty, we leave a
marker, a tombstone, to indicate that an item has been removed from this
location. This way a lookup won't stop when it comes across a tombstone.
  0    1    2    3    4    5    6    7    8
----------------------------------------------
|    | 37 | 20 | T  | 11 | 5  | 42 |    | 17 |
----------------------------------------------
The time complexity of linear probing depends on the length of the longest
cluster. In the worst case, all keys hash to the same value, making
insertion and deletion O(N) operations. The reason one still uses linear
probing is that it is easy to implement.
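A sketch of how tombstones could be added to such a table (the sentinel
value used for the tombstone and the class name are assumptions made for
illustration):

    public class LinearProbingWithTombstones {
        private static final int TABLE_SIZE = 9;
        private static final Integer TOMBSTONE = Integer.MIN_VALUE;  // assumed sentinel value
        private Integer[] table = new Integer[TABLE_SIZE];

        // a tombstone field may be reused when inserting
        public void insert(int n) {
            int i = n % TABLE_SIZE;
            while (table[i] != null && !TOMBSTONE.equals(table[i])) {
                i = (i + 1) % TABLE_SIZE;
            }
            table[i] = n;
        }

        // replace the removed item with a tombstone instead of emptying the field
        public void remove(int n) {
            int i = n % TABLE_SIZE;
            while (table[i] != null) {
                if (table[i] == n) {
                    table[i] = TOMBSTONE;
                    return;
                }
                i = (i + 1) % TABLE_SIZE;
            }
        }

        // keep probing past tombstones; only a truly empty field stops the search
        public boolean lookup(int n) {
            int i = n % TABLE_SIZE;
            while (table[i] != null) {
                if (!TOMBSTONE.equals(table[i]) && table[i] == n) {
                    return true;
                }
                i = (i + 1) % TABLE_SIZE;
            }
            return false;
        }
    }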
If the keys are integers with well-distributed values (i.e., modding them with TABLE_SIZE is likely to produce results evenly distributed from 0 to TABLE_SIZE-1), then the hash function can just return key mod TABLE_SIZE. However, if the keys are not well distributed, or if they are strings (or some other non-integer type), then we must convert them to well-distributed integer values (before modding them by TABLE_SIZE).
Let's assume that the keys are strings. Most programming languages provide a way to convert individual characters to integers. In Java, you can just cast a char to an int; for example, to convert the first character in String S to an int, use:
    int x = (int)(S.charAt(0));

Once we know how to convert individual characters to integers, we can convert a whole string to an integer in a number of ways. First, though, let's think about which characters in the string we want to use. Remember that we want our hash function to be efficient, and to map different keys to different locations in the hashtable. Unfortunately, there tends to be tension between those two goals; for example, a hash function that only uses the first and last characters in a key will be faster to compute than a hash function that uses all of the characters in the key, but it will also be more likely to map different keys to the same hashtable location.
A reasonable solution is to compromise; e.g., to use N characters of the key for some reasonable value of N ("reasonable" will depend on how long you expect your keys to be and how fast you want your hash function to be). For keys that have more than N characters, there is also a question of which characters to choose. If the keys are likely to differ only at one end or the other (e.g., they are strings like "value1", "value2", etc., that differ only in their last characters), then this decision could make a big difference in how well the hash function "spreads out" the keys in the hashtable.
To simplify our discussion, let's assume that we know the keys won't be too long, so our hash function can simply use all of the characters in the keys.
Here are some ways to combine the integer values for the individual characters to compute a single integer n (remember that the hash function will return n mod TABLE_SIZE):
One way is simply to add up the integer values of all the characters. A better way is to add the characters one at a time, multiplying the running total by a constant (here 13) after each addition; for example, for a key whose characters have the values 8, 5, 12, 12, and 15:

    (8 + 5) * 13       =    169
    (169 + 12) * 13    =   2353
    (2353 + 12) * 13   =  30745
    (30745 + 15) * 13  = 399880

This technique gives you a wider range of values than just adding all of the characters. However, you would have to be prepared to handle overflow. (You would probably want to test the value of the sum so far, and if it is too large, divide it by some constant to make it smaller and prevent overflow.)
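A sketch of such a hash function in Java, using the raw character codes
obtained by casting (rather than the a=1, b=2, ... values of the worked
example); the overflow threshold and the constant used to shrink the sum
are illustrative assumptions:

    // convert the key to an integer by adding character values and
    // multiplying the running total by a constant after each addition
    public static int hash(String key, int tableSize) {
        int sum = 0;
        for (int i = 0; i < key.length(); i++) {
            sum = (sum + (int) key.charAt(i)) * 13;  // 13 is the constant from the example above
            if (sum > 10000000) {                    // crude overflow guard (assumed threshold):
                sum = sum / 100;                     // shrink the sum by a constant (assumed value)
            }
        }
        return sum % tableSize;                      // the hash function returns n mod TABLE_SIZE
    }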
Any object can be a key. If you want to hash an instance of a self-defined
class, then you should include a hashCode() method in the class definition.
Just like the toString() and equals() methods, this is a helpful utility.
You may use a combination of functions to convert your class instance to
an integer. Remember that you do not need to worry about the table size
inside your key class.
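For illustration, a made-up key class with equals() and hashCode() methods
might look like this:

    public class NameKey {
        private String name;
        private int id;

        public NameKey(String name, int id) {
            this.name = name;
            this.id = id;
        }

        @Override
        public boolean equals(Object other) {
            if (!(other instanceof NameKey)) {
                return false;
            }
            NameKey o = (NameKey) other;
            return id == o.id && name.equals(o.name);
        }

        // combine the fields into a single integer; note that the table size
        // is applied later by the hash table itself, not inside this class
        @Override
        public int hashCode() {
            return 31 * name.hashCode() + id;
        }
    }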
Consider storing the names: George, Amy, Alan, and Sandy in a hashtable of size 10, using the hash function:
Consider hashing keys that are strings containing only lower-case letters. The hash function that will be used computes the product of the integer values of the characters in a key, using the scheme: a=0, b=1, c=2, etc. Why is this scheme not as good as using a=1, b=2, etc.?