Word Games

Implemenenting a Minimal Perfect Hash Function

Project Due: 11:59 PM Friday, August 9

1.0 Objective

This project is meant to help you develop skills at working with hash tables and writing entire programs from scratch. In this project you will only be provided with a project description and list of program requirements. No code will be provided at all. This means you will need to learn to design and implement an entire program on your own.

2.0 Project Description

2.1 Basic Program
For this project you will design a program that searches a text file looking for key words, keeps statistics on the results of the search, and prints out these statistics. To make this work, you will need to read all the key words from a file, build a minimal perfect hash table, and then start reading the contents of a second text file and look for these key words. The following sections examine each of these steps.

2.2 Reading from a File
To read from a file you must first create a FileReader object and then pass it to a BufferedReader object. At this point you can start reading from the file using the readLine() method. A few things to be careful of, the creation of a FileReader object can throw a FileNotFoundException. You are required to catch this exception, so you must create the FileReader object inside of a try block. Also, the readLine() method can also throw an exception. The exception it throws is IOException and you must catch this, too.

This all sounds a lot more confusing than it really is. Here is a simple example that opens a file, reads each line, and then prints them all out. Notice that readLine() returns null when there is no more data in the file.

import java.io.*;

class Tester {
	 public static void main(String[] args) {
		  // validate command line arguments
		  if(args.length != 1) {
				System.err.println("Error: usage - Tester file");
				System.exit(1);
		  }
		  
		  try {
				// open up the file for reading
				FileReader input = new FileReader(args[0]);
		      BufferedReader in = new BufferedReader(input);
		  
				// now read each line and print it out
				String line;
				while((line = in.readLine()) != null)
					 System.out.println(line);
		  } catch(FileNotFoundException e) {
				System.err.println("File " + args[0] + " not found.");
				System.exit(1);
		  } catch(IOException e) {
				System.err.println(e);
				System.exit(1);
		  }
	 }
}

If you want, you can cut and paste this program and run it exactly as it is. That is about all there is to opening and reading files in Java.

2.3 The Hash Function
For this project you will be using Cichelli's Method to build a minimal perfect hash table. How to build this table is discussed in the next section. Here we describe the hash function itself.

h(word) = (length(word) + g*(firstletter(word)) + g*(lastletter(word))) % size

The following table is a description of each term in this equation:

Term Description

h(word) This will be the index into the hash table for the word.

length(word) The number of characters in the word.

g g is the value associated with a given letter. The value ranges between 0 and some maximum value. The max value is one of the inputs to the Cichelli Algorithm.

firstletter(word) The character that is in the first position of the word.

lastletter(word) The character that is at the end of the word.

size The total number of key words. It is also the number of elements in the hash table.

To compute the hash function for any integer is going to require a number of data structures. Obviously, you must have the array that represents the hash table. This array will be filled with the actual key words. Each key word should appear only once in the hash table. Secondly, you will need an array to store the g value of each letter that appears in the beginning or end of a word.

2.4 Building a Minimal Perfect Hash Table
This is one of the major components of your project. You will need to open the key word file provided by the user on the command line and read each of the words into a vector. Once you have them all read, you can start to build your hash table and define your hash function. You will accomplish this by using the Cichelli Method discussed in class. The vector that all the key words were initially read into will be the initial word list for the algorithm. A review of the algorithm can be found in the third set of notes on Hashing (Hashing - Part III).

2.5 Counting Key Words
Once this minimal perfect hash table is constructed, you are ready to begin reading the text file and counting key words. This should be a fairly simple process. You will read a line from the text file (see Section 2.2 on reading from files), break the line into tokens (one token for each word in the line), and then examine each token. To examine a token, simply pass it through the hash function discussed in Section 2.3 and see if the word is currently in the hash table.

A couple of things to watch for. First of all, you should make all of your comparisons case insensitive. In fact, I suggest that wherever you store all of your strings in either all lower case or all upper case. This will make your life a bit simpler. Secondly, you need to be careful of the fact that punctuation will be included in your comparison tokens. For example, if one of the key words is "me" and you read the following line:

This just isn't for me.

You need to realize that the last token returned from this line is "me." Notice the period is included in the token. "me" and "me." are not equal. To get full credit on this project you do not need to worry about this - you would simply say that there are no key words on the above line. One of the extra credit options is for you to fix this so that it does indicate that both "me" and "me." and "me?" and so on, are all the same word.

2.6 Statistics
Another major part of this project is recording statistics. A summary of all the statistics you must keep are presented in the following table:

Statistic Description

lines This is the total number of lines read by your tester program. Blank lines should not be counted in this total. Only lines that actually have at least one word on them should be counted.

words This is the total number of words checked (key words plus non- key words). A word does not have to be in the dictionary. A word is considered to be any string of consecutive characters with no white space in between them. In other words both "hello" and "sadflk" are considered to be words by this program.

keyWords This is the total number of key words found in the text. Words should only be counted if they are an exact match to the keyword given in the file. In other words, "Alabama" is not the same as "alabama". Hopefully, this will make things easier.

Time You need to record the total time of your program. This time should include the time to read the key words, create the hash table, read the text file, and record all the statistics. It is suggested that you start a timer as one of the first things you do in the program and stop it immediately before printing out the statistics. The following is a little sample code that shows how you can time events. You need to use the Date class to do it so you may want to check out the documentation on this method at the Java API web site. Do not worry about the Thread.sleep() method in the program - you do not (and should not) be using this function. It only appears here to simulate some kind of work being done.

import java.util.*; class Timer { public static void main(String[] args) { try { // start the timer Date timeStart = new Date(); // do the actual program work Thread.sleep(2000); // stop the timer Date timeEnd = new Date(); // calculate the total time and print the result long totalTime = timeEnd.getTime() - timeStart.getTime(); System.out.println("Total Time: " + totalTime); } catch(InterruptedException e) { } } }
Again, you can copy and run this code exactly as it appears to see what the output is.

When the program is finished reading the text file, it should print out a list of statistics. The output should look exactly like the following:

**********************
***** Statistics *****
**********************
Total Lines Read: xxx
Total Words Read: xxx
Break Down by Key Word
     key1: xx
     key2: xx
     key3: xx
       .
       .
       .
Total Key Words:  xxx
Total Time of Program: xxxxx milliseconds

The x's should be replaced with actual numbers and key1,2,3,... should be replaced with the actual key words. The number following the key word is to indicate how many times that key word appeared in the ananlyzed text.

3.0 Extra Credit

There is a total of 10 points available for extra credit. The first 5 points will be for ignoring punctuation in the text. In other words, the words "done", "done,", "done?", "done!", "done.", etc. would all be the same. However, if the character immediately following the word is not some kine of punctuation, it should not be considered a match. So the word "done#" would be counted as a key word (this assumes "done" is a key word).

The second 5 points will be awarded for making sure the key word file is always formatted correctly. The proper format is one word per line. Words must contain alphabetically characters only (A - Z and a - z). You should also allow for comments on a line. The start of a comment is '#' symbol. A comment can begin at the start of a line or after a key word. Here is a valid key word file:

# 3 key words
hello
the     # this should be the most common word

are
# end of the file

Notice that blank lines are allowd and should not be considered errors. If you encounter an invalid line, simply print an error message and terminate the program. Here is an example of an invalid file:

hello the   # too many words on a line
.hello.     # not all alphabetical characters
not$valid	# invalid symbol in the center of the line

To ensure that you work on the brunt of the program before beginning any of the extra credit, extra credit will only be counted if your overall score (before counting the extra credit) is above a 60%. Making sure input is correct is not as simple as it sounds so save this step for last.

4.0 Program Design Tips

The number one rule about writing a program from scratch is not to write a single line of code until you have developed a good design. Developing this good design is probably the hardest thing to learn how to do in programming. Here are a few of my suggestions on how you should go about designing and writing this project.

Read the program through once to get an idea of what is expected. Then, read it again. The second time through you may want to take some notes. At this point, you should have a thorough understanding of what is expected of you. If you need to read it a third time, do so.
Do a very rough design. This basically means figuring out what classes you are going to want to create and what the purpose of each class will be. For this project, I can think of at least three classes you should build. A class that contains your main() method. A class for the Cichelli hash table. And another class to hold, calculate, and print the statistics. You may also want additional classes for parsing files or for some other kind of work. It is completely up to you; however, be sure you know why you need each class and what its purpose will be for.
Once you have a list of your classes, go through each class and determine what functions it will need and what each of these functions will do. This should include information about what parameters each function will take and what type of value it will return.
Then go through each function and write psuedo-code describing how the function will flow. This is a very critical step because if it is done properly, your code writing will be much simpler. A good piece of advice at this stage is when you are examining a function, assume all the other functions already work - even if you have not written the pseudo-code for a function. If you did Step 3 properly, you will know how each function should work, what arguments it will take, and what values it will return.
The very last step is to go ahead and write your project up. Do not write all the code at once. Write functions one or two at a time, test them and make sure they work as you expect and then go on to the next function or two. It is always best to start with the lowest level functions first and work your way up. In other words, if you have some function that does not call any other of your functions, that is the one you want to start with. You can then test it and make sure it works (this often requires a little creativity). Once you have all of these functions written, you can then move up the chain to functions that only call these lowest level functions. Once you finish all of these, move up another level. Continue to do this until all of your functions are written.

One of the things you will notice about this design is that Step 4 calls for a top-down approach (describe all of the higher level functions assuming the lower level ones are done and then move down to the next level). Where as, Step 5 calls for a bottom-up approach (write and test all of the low level functions before moving on to the higher level functions). I think if you follow this approach, you should find your programs (for this or any other class) get written much more quickly and with far fewer bugs.

5.0 Running Your Program

Once your program has been compiled, it should be run in the following manner:

prompt> java WordCount keyWordFile textFile

The key word file should contain the list of words to search for. The text file is the file you will search looking for the key words. A sample of each file can be found below:

Key Word File
Text File
Results

Additional key word and text files will be provided later, but they will all have a similar format.

6.0 Handing in Your Project

The directory for handing in your program can be found at:

/p/course/cs367-mattmcc/public/username/p4

You should submit all *.java files that you created that are needed to compile your program. You should not submit any *.class files.

Also, be sure to comment your code well so that it can be easily browsed to see what is going on. Code that is excessively difficult to read will be penalized.

7.0 Grading

The project is due at 11:59 PM on Friday, August 9. NO LATE ASSIGNMENTS WILL BE ACCEPTED! The project will be graded out of 100 points and be graded on the following criteria:

Non-trivial program compilation
Correct Output
Performance (running time)

All grading will be done on the Novas. This does not mean you have to do your project on a Nova, but if you use some other machine, make sure it works on a Nova before handing it in.

We will grade your code by running it against a series of test files and checking the output of the program against a known, correct implementation. Any differences in your output versus that of the correct implementation will result in a point deduction. Currently, the only test files provided are keys-1.dat and text-1.dat. Links to these files can be found in Section 4.0 above as well as the results of running this test. Without a doubt, much more in depth tests will be run than this initial one provided. This means you should right some of your own test files and check for any errors.

We will also examine parts of your code to make sure that you are implementing the data structures as required. If you are getting the correct results but not implementing the proper data strcucture, your score will suffer severely. You must implement the proper data structures.

Term	Description
h(word)	This will be the index into the hash table for the word.
length(word)	The number of characters in the word.
g	g is the value associated with a given letter. The value ranges between 0 and some maximum value. The max value is one of the inputs to the Cichelli Algorithm.
firstletter(word)	The character that is in the first position of the word.
lastletter(word)	The character that is at the end of the word.
size	The total number of key words. It is also the number of elements in the hash table.

Statistic	Description
lines	This is the total number of lines read by your tester program. Blank lines should not be counted in this total. Only lines that actually have at least one word on them should be counted.
words	This is the total number of words checked (key words plus non- key words). A word does not have to be in the dictionary. A word is considered to be any string of consecutive characters with no white space in between them. In other words both "hello" and "sadflk" are considered to be words by this program.
keyWords	This is the total number of key words found in the text. Words should only be counted if they are an exact match to the keyword given in the file. In other words, "Alabama" is not the same as "alabama". Hopefully, this will make things easier.
Time	You need to record the total time of your program. This time should include the time to read the key words, create the hash table, read the text file, and record all the statistics. It is suggested that you start a timer as one of the first things you do in the program and stop it immediately before printing out the statistics. The following is a little sample code that shows how you can time events. You need to use the Date class to do it so you may want to check out the documentation on this method at the Java API web site. Do not worry about the Thread.sleep() method in the program - you do not (and should not) be using this function. It only appears here to simulate some kind of work being done. import java.util.*; class Timer { public static void main(String[] args) { try { // start the timer Date timeStart = new Date(); // do the actual program work Thread.sleep(2000); // stop the timer Date timeEnd = new Date(); // calculate the total time and print the result long totalTime = timeEnd.getTime() - timeStart.getTime(); System.out.println("Total Time: " + totalTime); } catch(InterruptedException e) { } } } Again, you can copy and run this code exactly as it appears to see what the output is.