Hash functions (on Feb. 1, 1999 by Surin Kittitornkun)

Properties: a hash function must be
  1. fast
  2. randomizing, and
  3. uniform.

Possible hash functions for strings

Examples of bad functions:
  a. H(c0 c1 ... cn-1) = c0
  b. H(c0 c1 ... cn-1) = n, where n is the length of the string.
  c. H(c0 c1 ... cn-1) = Sum(c0, c1, ..., cn-1) mod M,
     where M = table size, typically in the hundreds to thousands.
     Disadvantages:
     1. not randomizing, e.g. "abcd" and "dcba" hash to the same location
        (the function is not sensitive to the position of each character)
     2. not uniform
  d. H(c0 c1 ... cn-1) = Product(c0, c1, ..., cn-1) mod M
     1. uniform
     2. still not sensitive to the order of the characters

Better functions:
  H(c0 c1 ... cn-1) = Sum((i+1)*ci) mod M, i = 0, 1, ..., n-1

From java.lang.String.hashCode():
  if L (length) <= 15, H(c0 c1 ... cL-1) = Sum(ci * 37^i), i = 0, 1, ..., L-1
  if L (length) > 15,  H(c0 c1 ... cL-1) = Sum(c(i*k) * 39^i), i = 0, 1, ..., m-1
     where k = floor(L/8) and m = ceil(L/k), so only every k-th character is used.

Why is multiplication not good?
  * Overflow
    Multiplication tends to cause overflow. If no exception is raised, a
    positive number can become a negative number because of two's complement
    arithmetic.
    Overflow can be avoided by reducing mod M before multiplying.
    Let a = i*M + j, so that j = a mod M. Then
      (a*b) mod M = ((a mod M) * b) mod M = (j*b) mod M
    Proof:
      (a*b) mod M = ((i*M + j) * b) mod M
                  = (i*b*M + j*b) mod M
                  = (0 + j*b) mod M
                  = (j*b) mod M
  * The result of a multiplication is more likely to be an even number (one
    even factor is enough to make the whole product even). If the table size
    is also even, the index is then more likely to be even as well.

Table size
  - A prime number is preferred for uniformity.
  - Don't choose a table size that is a multiple of a small prime number,
    e.g. 3, if the hash function tends to produce multiples of that same
    number.
  Ex. table size M = 2*3*5*7 = 210
      After hashing 26,000 words, 57.6% of all words hit index 0.
      If M' = 211, no location gets > 1% of the hits.
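
The string hashes in these notes can be sketched in Java roughly as follows. This is an illustration only, not code from the lecture: the class name StringHashes and all method names are made up here, and the polynomial hash is a simplified version of the L <= 15 formula, not the library's actual source. productHash applies the identity (a*b) mod M = ((a mod M)*b) mod M at every step so the running product never overflows, while polyHash deliberately lets int arithmetic wrap around and then uses Math.floorMod to turn a possibly negative value into a valid index.

```java
// Illustrative sketch; StringHashes and its method names are hypothetical.
final class StringHashes {

    // Bad function (c): Sum(ci) mod M. Ignores the order of the characters.
    static int sumHash(String s, int m) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = (h + s.charAt(i)) % m;
        }
        return h;
    }

    // Bad function (d): Product(ci) mod M, reduced mod M at every step
    // so the intermediate value stays below M * 65536 and cannot overflow.
    static int productHash(String s, int m) {
        long h = 1;
        for (int i = 0; i < s.length(); i++) {
            h = ((h % m) * s.charAt(i)) % m;
        }
        return (int) h;
    }

    // Better function: Sum((i+1)*ci) mod M. Sensitive to character position.
    static int weightedHash(String s, int m) {
        long h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = (h + (long) (i + 1) * s.charAt(i)) % m;
        }
        return (int) h;
    }

    // Sum(ci * 37^i) in plain int arithmetic. The sum may silently wrap to a
    // negative number, so floorMod is used to map it into 0..M-1.
    static int polyHash(String s, int m) {
        int h = 0;
        int power = 1; // 37^i, allowed to wrap around
        for (int i = 0; i < s.length(); i++) {
            h += s.charAt(i) * power;
            power *= 37;
        }
        return Math.floorMod(h, m);
    }

    public static void main(String[] args) {
        int m = 211; // prime table size, as recommended in the notes
        System.out.println(sumHash("abcd", m) + " " + sumHash("dcba", m));           // 183 183
        System.out.println(weightedHash("abcd", m) + " " + weightedHash("dcba", m)); // 146 136
        System.out.println(productHash("Fi", 210));                                  // 0
        System.out.println(polyHash("zzzzzzzzzzzzzzz", m));
    }
}
```

The third line of output shows how a composite table size can go wrong with the product hash: 'F' is 70 = 2*5*7 and 'i' is 105 = 3*5*7, so their product 7350 is a multiple of 210 and lands on index 0. Whether the 57.6% figure in the notes was measured with this exact function is not stated, so treat that connection as an assumption.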