2.0 Project Description
2.1 Basic Program
For this project you will design a program that searches a text file
looking for key words, keeps statistics on the results of the search,
and prints out these statistics. To make this work, you will need to
read all the key words from a file, build a minimal perfect hash table,
and then start reading the contents of a second text file and look for
these key words. The following sections examine each of these steps.
2.2 Reading from a File
To read from a file, you must first create a FileReader object and
then pass it to a BufferedReader object. At this point you can start
reading from the file using the readLine() method. There are a few
things to be careful of. Creating a FileReader object can throw a
FileNotFoundException. You are required to catch this exception, so you
must create the FileReader object inside a try block. The readLine()
method can also throw an exception, an IOException, and you must catch
this, too.
This all sounds a lot more confusing than it really is. Here is a simple example that opens a file, reads each line, and then prints them all out. Notice that readLine() returns null when there is no more data in the file.
    import java.io.*;

    class Tester {
        public static void main(String[] args) {
            // validate command line arguments
            if(args.length != 1) {
                System.err.println("Error: usage - Tester file");
                System.exit(1);
            }

            try {
                // open up the file for reading
                FileReader input = new FileReader(args[0]);
                BufferedReader in = new BufferedReader(input);

                // now read each line and print it out
                String line;
                while((line = in.readLine()) != null)
                    System.out.println(line);
            }
            catch(FileNotFoundException e) {
                System.err.println("File " + args[0] + " not found.");
                System.exit(1);
            }
            catch(IOException e) {
                System.err.println(e);
                System.exit(1);
            }
        }
    }
If you want, you can cut and paste this program and run it exactly as it is. That is about all there is to opening and reading files in Java.
2.3 The Hash Function
For this project you will be using Cichelli's Method to build
a minimal perfect hash table. How to build this table is discussed in
the next section. Here we describe the hash function itself.
h(word) = (length(word) + g(firstletter(word)) + g(lastletter(word))) % size
The following table is a description of each term in this equation:
Term | Description |
h(word) | This will be the index into the hash table for the word. |
length(word) | The number of characters in the word. |
g | g is the value associated with a given letter. The value ranges between 0 and some maximum value. The max value is one of the inputs to the Cichelli Algorithm. |
firstletter(word) | The character that is in the first position of the word. |
lastletter(word) | The character that is at the end of the word. |
size | The total number of key words. It is also the number of elements in the hash table. |
Computing the hash function for any word requires a number of data structures. First, you must have the array that represents the hash table. This array will be filled with the actual key words. Each key word should appear only once in the hash table. Second, you will need an array to store the g value of each letter that appears at the beginning or end of a word.
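As a rough illustration of the hash function above, here is a minimal sketch. It assumes every word is lower case and made up only of the letters a-z, and it stores the g values in a 26-element array indexed by letter. The class name, method name, and array layout are hypothetical choices made for this example, not requirements of the project.

```java
class HashSketch {
    // gValues[0] holds g('a'), gValues[25] holds g('z').
    // Assumption: word is non-empty, lower case, and starts and ends with a-z.
    static int hash(String word, int[] gValues, int size) {
        int first = gValues[word.charAt(0) - 'a'];
        int last = gValues[word.charAt(word.length() - 1) - 'a'];
        return (word.length() + first + last) % size;
    }

    public static void main(String[] args) {
        int[] gValues = new int[26];   // all g values start at 0
        gValues['c' - 'a'] = 1;        // made-up g values for the demo
        gValues['t' - 'a'] = 2;

        // "cat": length 3, g('c') = 1, g('t') = 2, so (3 + 1 + 2) % 4 = 2
        System.out.println(hash("cat", gValues, 4));
    }
}
```

The g values in this demo are invented; in the project they come out of the table-building step described in the next section.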
2.4 Building a Minimal Perfect Hash Table
This is one of the major components of your project. You will need to open
the key word file provided by the user on the command line and read each
of the words into a vector. Once you have them all read, you can start to
build your hash table and define your hash function. You will accomplish
this by using the Cichelli Method discussed in class. The vector
that all the key words were initially read into will be the initial
word list for the algorithm. A review of the algorithm can be found in
the third set of notes on Hashing (Hashing - Part III).
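To give a feel for the search involved, here is a simplified backtracking sketch. It is not the full method from the class notes: it skips any reordering of the word list and simply tries g values from 0 up to a maximum for each unassigned first or last letter, backing up whenever two words would land in the same slot. All class, method, and variable names are hypothetical, and words are assumed to be lower case a-z.

```java
import java.util.*;

class CichelliSketch {
    static final int UNSET = -1;   // marks a letter with no g value yet

    static int hash(String w, int[] g, int size) {
        return (w.length() + g[w.charAt(0) - 'a']
                + g[w.charAt(w.length() - 1) - 'a']) % size;
    }

    // Tries to place words[index..] so that every word gets its own slot.
    // On success returns true and leaves the assignment in g.
    static boolean search(List<String> words, int index, int[] g,
                          boolean[] used, int maxG) {
        if (index == words.size()) return true;        // every word placed
        String w = words.get(index);
        int f = w.charAt(0) - 'a';
        int l = w.charAt(w.length() - 1) - 'a';

        boolean setF = g[f] == UNSET;                  // do we choose g for f?
        int loF = setF ? 0 : g[f], hiF = setF ? maxG : g[f];
        for (int gf = loF; gf <= hiF; gf++) {
            g[f] = gf;
            boolean setL = g[l] == UNSET;              // false if l == f
            int loL = setL ? 0 : g[l], hiL = setL ? maxG : g[l];
            for (int gl = loL; gl <= hiL; gl++) {
                g[l] = gl;
                int slot = hash(w, g, words.size());
                if (!used[slot]) {
                    used[slot] = true;
                    if (search(words, index + 1, g, used, maxG)) return true;
                    used[slot] = false;                // undo and keep trying
                }
            }
            if (setL) g[l] = UNSET;
        }
        if (setF) g[f] = UNSET;
        return false;                                  // caller must back up
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("cat", "dog", "bird", "fish");
        int[] g = new int[26];
        Arrays.fill(g, UNSET);
        boolean[] used = new boolean[words.size()];
        if (search(words, 0, g, used, 3)) {
            for (String w : words)
                System.out.println(w + " -> " + hash(w, g, words.size()));
        } else {
            System.out.println("No assignment found; try a larger max value.");
        }
    }
}
```

If the search fails for a given maximum g value, one option is to rerun it with a larger maximum.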
2.5 Counting Key Words
Once this minimal perfect hash table is constructed, you are ready to begin
reading the text file and counting key words. This should be a fairly
simple process. You will read a line from the text file (see Section
2.2 on reading from files), break the line into tokens (one token for
each word in the line), and then examine each token. To examine a token,
simply pass it through the hash function discussed in Section 2.3 and
see if the word is currently in the hash table.
There are a couple of things to watch for. First, you should make all of your comparisons case insensitive. In fact, I suggest that you store all of your strings in either all lower case or all upper case. This will make your life a bit simpler. Second, you need to be careful of the fact that punctuation will be included in your comparison tokens. For example, if one of the key words is "me" and you read the following line:
This just isn't for me.
You need to realize that the last token returned from this line is "me." Notice the period is included in the token. "me" and "me." are not equal. To get full credit on this project you do not need to worry about this; you would simply say that there are no key words on the above line. One of the extra credit options is for you to fix this so that "me", "me.", "me?", and so on are all treated as the same word.
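The lookup described in this section might be sketched as follows. This is only an illustration under several assumptions: key words are stored in lower case, g values live in a 26-element array indexed by letter, and tokens are split on white space with StringTokenizer. The class and method names, the two-word table, and its g values are invented for the example.

```java
import java.util.StringTokenizer;

class LookupSketch {
    // Same formula as Section 2.3; g[0] is the value for 'a' (assumed layout).
    static int hash(String word, int[] g, int size) {
        return (word.length() + g[word.charAt(0) - 'a']
                + g[word.charAt(word.length() - 1) - 'a']) % size;
    }

    // True if the token, lower-cased, is the word stored at its hashed slot.
    static boolean isKeyWord(String token, String[] table, int[] g) {
        String word = token.toLowerCase();
        char first = word.charAt(0);
        char last = word.charAt(word.length() - 1);
        // A token that starts or ends with a non-letter (such as "me.")
        // cannot be hashed by letter, so it is not a key word here.
        if (first < 'a' || first > 'z' || last < 'a' || last > 'z') return false;
        return word.equals(table[hash(word, g, table.length)]);
    }

    // Counts the key words on one line of text.
    static int countKeyWords(String line, String[] table, int[] g) {
        int count = 0;
        StringTokenizer tokens = new StringTokenizer(line);
        while (tokens.hasMoreTokens()) {
            if (isKeyWord(tokens.nextToken(), table, g)) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Hand-built table for the demo: "me" hashes to 0 and "is" to 1.
        String[] table = { "me", "is" };
        int[] g = new int[26];
        g['i' - 'a'] = 1;
        System.out.println(countKeyWords("This just isn't for me.", table, g));
        System.out.println(countKeyWords("Is it me or Me", table, g));
    }
}
```

Because the table is a perfect hash, each token needs only one string comparison: either the word in its slot matches or the token is not a key word.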
2.6 Statistics
Another major part of this project is recording statistics. A summary of
all the statistics you must keep is presented in the following table:
Statistic | Description |
lines | This is the total number of lines read by your tester program. Blank lines should not be counted in this total. Only lines that actually have at least one word on them should be counted. |
words | This is the total number of words checked (key words plus non-key words). A word does not have to be in the dictionary. A word is considered to be any string of consecutive characters with no white space in between them. In other words, both "hello" and "sadflk" are considered to be words by this program. |
keyWords | This is the total number of key words found in the text. Words should only be counted if they are an exact match to the keyword given in the file. In other words, "Alabama" is not the same as "alabama". Hopefully, this will make things easier. |
Time | You need to record the total time of your program. This time should include the time to read the key words, create the hash table, read the text file, and record all the statistics. It is suggested that you start a timer as one of the first things you do in the program and stop it immediately before printing out the statistics. |
The following sample code shows how you can time events. It uses the Date class, so you may want to check out the documentation on this class at the Java API web site. Do not worry about the Thread.sleep() method in the program; you do not need (and should not use) this function. It only appears here to simulate some kind of work being done.
    import java.util.*;

    class Timer {
        public static void main(String[] args) {
            try {
                // start the timer
                Date timeStart = new Date();

                // do the actual program work
                Thread.sleep(2000);

                // stop the timer
                Date timeEnd = new Date();

                // calculate the total time (in milliseconds) and print the result
                long totalTime = timeEnd.getTime() - timeStart.getTime();
                System.out.println("Total Time: " + totalTime);
            }
            catch(InterruptedException e) { }
        }
    }
Again, you can copy and run this code exactly as it appears to see what the output is.
When the program is finished reading the text file, it should print out a list of statistics. The output should look exactly like the following:
    **********************
    ***** Statistics *****
    **********************
    Total Lines Read: xxx
    Total Words Read: xxx
    Break Down by Key Word
    key1: xx
    key2: xx
    key3: xx
    .
    .
    .
    Total Key Words: xxx
    Total Time of Program: xxxxx milliseconds
The x's should be replaced with actual numbers, and key1, key2, key3, ... should be replaced with the actual key words. The number following each key word indicates how many times that key word appeared in the analyzed text.
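The lines and words totals defined in the table above could be gathered along these lines. This is a minimal sketch that assumes tokens are separated by white space and uses StringTokenizer to split each line; the class and method names are hypothetical, and a StringReader stands in for the text file so the example runs on its own.

```java
import java.io.*;
import java.util.StringTokenizer;

class CountSketch {
    // Returns a two-element array: { lines, words }.
    // A line is counted only if it holds at least one word.
    static int[] count(BufferedReader in) throws IOException {
        int lines = 0;
        int words = 0;
        String line;
        while ((line = in.readLine()) != null) {
            StringTokenizer tokens = new StringTokenizer(line);
            int n = tokens.countTokens();
            if (n > 0) lines++;     // blank lines are skipped
            words += n;
        }
        return new int[] { lines, words };
    }

    public static void main(String[] args) throws IOException {
        String text = "hello sadflk\n\n   \nThis just isn't for me.\n";
        int[] totals = count(new BufferedReader(new StringReader(text)));
        System.out.println("Total Lines Read: " + totals[0]);
        System.out.println("Total Words Read: " + totals[1]);
    }
}
```

For the sample text, the two blank lines are ignored, so the sketch reports 2 lines and 7 words.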
3.0 Extra Credit
There is a total of 10 points available for extra credit. The first 5 points
will be for ignoring punctuation in the text. In other words, the words
"done", "done,", "done?", "done!", "done.", etc. would all be the same.
However, if the character immediately following the word is not some kind
of punctuation, it should not be considered a match. So the word "done#"
would not be counted as a key word (this assumes "done" is a key word).
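One possible way to approach this is to strip punctuation from the end of each token before looking it up. The sketch below assumes that "punctuation" means the characters . , ? ! ; and : only. That set is a guess made for illustration, so adjust it to whatever the project actually expects. The class and method names are hypothetical.

```java
class PunctuationSketch {
    // Assumed punctuation set; '#' is deliberately not in it.
    static final String PUNCTUATION = ".,?!;:";

    // Removes any run of punctuation characters from the end of the token.
    static String stripTrailing(String token) {
        int end = token.length();
        while (end > 0 && PUNCTUATION.indexOf(token.charAt(end - 1)) >= 0) {
            end--;
        }
        return token.substring(0, end);
    }

    public static void main(String[] args) {
        System.out.println(stripTrailing("done."));   // done
        System.out.println(stripTrailing("done?!"));  // done
        System.out.println(stripTrailing("done#"));   // done# (unchanged)
    }
}
```

Since "done#" comes back unchanged, it would not equal the key word "done" in a later comparison.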
The second 5 points will be awarded for making sure the key word file is always formatted correctly. The proper format is one word per line. Words must contain alphabetic characters only (A - Z and a - z). You should also allow for comments on a line. The start of a comment is the '#' symbol. A comment can begin at the start of a line or after a key word. Here is a valid key word file:
    # 3 key words
    hello
    the # this should be the most common word
    are
    # end of the file
Blank lines are also allowed and should not be considered errors. If you encounter an invalid line, simply print an error message and terminate the program. Here is an example of an invalid file:
    hello the # too many words on a line
    .hello. # not all alphabetical characters
    not$valid # invalid symbol in the center of the line
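A line-by-line check of this format could look roughly like the sketch below. The method name and its return convention (the key word, an empty string for a blank or comment-only line, or null for an invalid line) are choices made for this example only.

```java
class KeyWordLineSketch {
    // Returns the key word on the line, "" if the line holds no key word
    // (blank or comment only), or null if the line is invalid.
    static String parse(String line) {
        int comment = line.indexOf('#');
        String body = (comment >= 0 ? line.substring(0, comment) : line).trim();
        if (body.isEmpty()) return "";
        for (int i = 0; i < body.length(); i++) {
            char c = body.charAt(i);
            boolean letter = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
            // Any non-letter, including a space between two words, is an error.
            if (!letter) return null;
        }
        return body;
    }

    public static void main(String[] args) {
        String[] lines = {
            "# 3 key words",
            "the # this should be the most common word",
            "hello the # too many words on a line",
            "not$valid # invalid symbol in the center of the line"
        };
        for (String line : lines) {
            String result = parse(line);
            System.out.println(result == null ? "invalid"
                    : result.isEmpty() ? "(no key word)" : result);
        }
    }
}
```

Checking for letters after trimming handles the "too many words" case for free, because the space between two words is itself a non-letter.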
To ensure that you work on the bulk of the program before beginning any of the extra credit, extra credit will only be counted if your overall score (before counting the extra credit) is above 60%. Making sure input is correct is not as simple as it sounds, so save this step for last.
4.0 Program Design Tips
The number one rule about writing a program from scratch is not to write
a single line of code until you have developed a good design. Developing
this good design is probably the hardest thing to learn how to do in
programming. Here are a few of my suggestions on how you should go about
designing and writing this project.
One of the things you will notice about this design is that Step 4 calls for a top-down approach (describe all of the higher level functions assuming the lower level ones are done and then move down to the next level), whereas Step 5 calls for a bottom-up approach (write and test all of the low level functions before moving on to the higher level functions). I think if you follow this approach, you should find your programs (for this or any other class) get written much more quickly and with far fewer bugs.
5.0 Running Your Program
Once your program has been compiled, it should be run in the following
manner:
prompt> java WordCount keyWordFile textFile
The key word file should contain the list of words to search for. The text file is the file you will search looking for the key words. A sample of each file can be found below:
Key Word File
Text File
Results
Additional key word and text files will be provided later, but they will all have a similar format.
6.0 Handing in Your Project
The directory for handing in your program can be found at:
/p/course/cs367-mattmcc/public/username/p4
You should submit all *.java files that you created that are needed to compile your program. You should not submit any *.class files.
Also, be sure to comment your code well so that it can be easily browsed to see what is going on. Code that is excessively difficult to read will be penalized.
7.0 Grading
The project is due at 11:59 PM on Friday, August 9. NO LATE ASSIGNMENTS WILL
BE ACCEPTED! The project will be graded out of 100 points and be graded on
the following criteria:
All grading will be done on the Novas. This does not mean you have to do your project on a Nova, but if you use some other machine, make sure it works on a Nova before handing it in.
We will grade your code by running it against a series of test files and checking the output of the program against a known, correct implementation. Any differences between your output and that of the correct implementation will result in a point deduction. Currently, the only test files provided are keys-1.dat and text-1.dat. Links to these files, as well as the results of running this test, can be found in Section 5.0 above. Without a doubt, much more in-depth tests will be run than this initial one. This means you should write some of your own test files and check for any errors.
We will also examine parts of your code to make sure that you are implementing the data structures as required. If you are getting the correct results but not implementing the proper data structure, your score will suffer severely. You must implement the proper data structures.