CS367 Programming Assignment 3
Why are we doing this program?
In this assignment you will be writing a Java program that creates a word cloud for a text file. A word cloud is a way to visually represent information in a text file; key words from the text file are displayed with the importance of each word indicated by font size and/or color. For the purposes of this assignment, key words will be any words that show up in the text file that are not in a provided list of words to ignore. The importance of a word will be determined by how many times the word appears in the text file. The word cloud created from the text file will be saved to a webpage. Your program will take four command-line arguments: the name of the input text file, the name of the output file (i.e., the webpage), the maximum number of words to include in the word cloud, and the name of the text file containing the words to ignore.
To construct the word cloud, your program will first go through the text file and determine the key words that appear and how many times each key word shows up. Each key word can be thought of as a (word, # of occurrences) pair. Note that, by definition, the words in the pairs are unique. A Dictionary is an ADT that stores unique key values and provides operations to add and remove information as well as to traverse the key values in order. This makes the Dictionary a useful ADT to store the key word information. For this program, you will implement a Dictionary ADT using a binary search tree. After collecting all the key word information, your program will find the N key words with the most occurrences to include in the word cloud (where N is the maximum number of words to include, specified by the user as a command-line argument). To do this, your program will put all the key words into a Priority Queue (prioritized by the number of occurrences) and then remove the required number of key words. For this program, you will implement a Priority Queue using an array-based heap.
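The overall flow described above (count key words, then pull out the N most frequent) can be sketched with Java library classes. This is only an illustration of the data flow: java.util.TreeMap and java.util.PriorityQueue stand in for the BSTDictionary and ArrayHeap you will write yourself, and the class and method names (PipelineSketch, topWords) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.Set;
import java.util.TreeMap;

class PipelineSketch {
    // Returns up to n key words, most frequent first.
    // TreeMap and PriorityQueue are library stand-ins, not the assignment's ADTs.
    static List<String> topWords(List<String> words, Set<String> ignore, int n) {
        // Step 1: count occurrences of every word that is not ignored.
        Map<String, Integer> counts = new TreeMap<>();
        for (String w : words) {
            String lower = w.toLowerCase();
            if (!ignore.contains(lower)) {
                counts.merge(lower, 1, Integer::sum);
            }
        }

        // Step 2: put every (word, count) pair into a max-priority queue.
        PriorityQueue<Map.Entry<String, Integer>> queue =
            new PriorityQueue<>((x, y) -> Integer.compare(y.getValue(), x.getValue()));
        queue.addAll(counts.entrySet());

        // Step 3: remove the n highest-priority pairs (or fewer if the queue runs out).
        List<String> top = new ArrayList<>();
        while (top.size() < n && !queue.isEmpty()) {
            top.add(queue.remove().getKey());
        }
        return top;
    }
}
```

Words with equal counts may come out in any order in this sketch.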
The goals of this assignment are to:
What are the program requirements?
The KeyWord Class
The dictionary created from the input file will store KeyWord objects, each of which contains a word and a non-negative integer representing the number of times the word occurs in the input file. For the purposes of the KeyWord class, a word is a non-empty sequence of characters in which all the letters have been converted to lower-case (we'll add some more restrictions on what we consider to be a word in the main class). The javadoc documentation for KeyWord contains the complete details for each method and constructor. Note that the KeyWord class implements both the Comparable<KeyWord> interface and the Prioritizable interface (more below) and that you will need to override the equals method inherited from the Object class.
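To make the three roles of KeyWord concrete (ordering by word, equality, and priority by count), here is a minimal sketch. The javadoc remains the authority on the real class: the name KeyWordSketch, the accessor and increment method names, and the choice to base equals on the word alone are assumptions made for illustration, and the Prioritizable interface shown is a local stand-in for the provided one.

```java
// Local stand-in for the provided Prioritizable interface.
interface Prioritizable {
    int getPriority();
}

// Illustrative only; method names other than compareTo, equals, and
// getPriority are assumptions, not the documented KeyWord API.
class KeyWordSketch implements Comparable<KeyWordSketch>, Prioritizable {
    private final String word;
    private int occurrences;

    KeyWordSketch(String word) {
        if (word == null || word.isEmpty()) {
            throw new IllegalArgumentException("word must be non-empty");
        }
        this.word = word.toLowerCase(); // words are stored in lower case
        this.occurrences = 0;
    }

    String getWord() { return word; }
    int getOccurrences() { return occurrences; }
    void increment() { occurrences++; }

    // Priority is the number of occurrences: more frequent means higher priority.
    @Override
    public int getPriority() { return occurrences; }

    // Dictionary order is alphabetical order of the words.
    @Override
    public int compareTo(KeyWordSketch other) { return word.compareTo(other.word); }

    // Assumed here: two key words are equal when their words match.
    @Override
    public boolean equals(Object other) {
        return other instanceof KeyWordSketch
            && word.equals(((KeyWordSketch) other).word);
    }

    // Kept consistent with equals.
    @Override
    public int hashCode() { return word.hashCode(); }
}
```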
We have specified how dictionaries are to work in the DictionaryADT interface (see javadoc documentation, DictionaryADT.java source). Note that the insert method throws a DuplicateException (see javadoc documentation, DuplicateException.java).
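The duplicate check in insert can be sketched as follows for a binary search tree. This is a minimal illustration rather than the required BSTDictionary: the node type, the class name, and the unchecked DuplicateKeyException are hypothetical stand-ins for the provided BSTnode and DuplicateException, whose real definitions are in the supplied source files.

```java
// Hypothetical stand-in for the provided DuplicateException.
class DuplicateKeyException extends RuntimeException {
}

// Hypothetical stand-in for the provided BSTnode class.
class InsertNode<K> {
    K key;
    InsertNode<K> left, right;
    InsertNode(K key) { this.key = key; }
}

class BstInsertSketch<K extends Comparable<K>> {
    private InsertNode<K> root;
    private int size;

    // Adds key to the tree, or throws if an equal key is already stored.
    void insert(K key) {
        root = insert(root, key);
        size++; // reached only when no exception was thrown
    }

    private InsertNode<K> insert(InsertNode<K> n, K key) {
        if (n == null) {
            return new InsertNode<>(key); // found the empty spot for the new key
        }
        int cmp = key.compareTo(n.key);
        if (cmp == 0) {
            throw new DuplicateKeyException();
        }
        if (cmp < 0) {
            n.left = insert(n.left, key);
        } else {
            n.right = insert(n.right, key);
        }
        return n;
    }

    int size() { return size; }
}
```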
Total path length
One of the methods in the DictionaryADT is totalPathLength. The total path length is the sum of the lengths of the paths to each key in the dictionary. This can be used to give us a measure of how many keys must be searched, on average, to find a specific key (by taking the total path length and dividing by the number of keys stored in the dictionary).
For example, if we implement a dictionary using a singly-linked chain of nodes kept in sorted order, then the total path length of a dictionary containing seven keys is:
1 + 2 + 3 + 4 + 5 + 6 + 7 = (7 × 8) / 2 = 28
and the average path length is 28 / 7 = 4. If our dictionary containing the seven keys is a full binary tree, then the total path length is:
1 + 2 + 2 + 3 + 3 + 3 + 3 = 1 + (2 × 2) + (4 × 3) = 17
and the average path length is 17 / 7 = 2.42857... In general, a singly-linked chain containing N nodes has a total path length of N × (N + 1) / 2 and an average path length of (N + 1) / 2. For a binary tree, the total path length is the sum of the depths of the nodes (since the depth of each node is the length of the path from the root to that node). This leads us to the following recursive definition for the total path length for a binary tree starting at a node N that is at a depth D:
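One way to express that recursion, consistent with the worked examples above (the root is at depth 1 and an empty tree contributes 0), is sketched below. PathNode is a hypothetical stand-in for the provided BSTnode, and the helper methods fullTree and chain exist only to rebuild the two seven-key examples.

```java
// Hypothetical stand-in for a binary tree node; keys are omitted because
// only the shape of the tree matters for path length.
class PathNode {
    PathNode left, right;
    PathNode(PathNode left, PathNode right) {
        this.left = left;
        this.right = right;
    }
}

class PathLengthSketch {
    // Total path length of the subtree rooted at n, where n is at the given depth:
    //   0                                              if n is null
    //   depth + TPL(left, depth+1) + TPL(right, depth+1)  otherwise
    static int totalPathLength(PathNode n, int depth) {
        if (n == null) {
            return 0;
        }
        return depth
            + totalPathLength(n.left, depth + 1)
            + totalPathLength(n.right, depth + 1);
    }

    // Builds a full binary tree with the given number of levels.
    static PathNode fullTree(int levels) {
        if (levels == 0) {
            return null;
        }
        return new PathNode(fullTree(levels - 1), fullTree(levels - 1));
    }

    // Builds a degenerate tree shaped like a linked chain of the given length.
    static PathNode chain(int length) {
        PathNode n = null;
        for (int i = 0; i < length; i++) {
            n = new PathNode(null, n);
        }
        return n;
    }
}
```

Calling totalPathLength(root, 1) on the seven-node full tree gives 17, and on the seven-node chain gives 28, matching the sums computed above.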
The BSTDictionary class
In a file, named BSTDictionary.java, you will code a class that implements the DictionaryADT interface (see BSTDictionary.java shell) using a binary search tree of BSTnodes (see javadoc documentation, BSTnode.java). Note the following:
The BSTDictionaryIterator class
The BSTDictionary class also has an iterator. In a file, named BSTDictionaryIterator.java, you will code the iterator (see BSTDictionaryIterator.java shell). You need only implement the hasNext and next methods of Java's Iterator interface. Note that the iterator returns the key values in order from smallest to largest. In order to receive full credit:
Implementation hint: a constructor that pushes every node of the binary search tree onto a stack will not get full credit, since that has a complexity of O(N), where N is the number of nodes in the tree. Instead, the constructor should make a stack and push only enough nodes to reach the first item to return when next() is called. When next() is called, it should pop a node off the stack and then push whatever nodes are needed so that the following call to next() returns the next value in order.
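The hint can be sketched as follows. This is an illustration under assumptions, not the BSTDictionaryIterator shell: IterNode is a hypothetical stand-in for BSTnode, java.util.ArrayDeque is used as the stack (the course may expect a different stack class), and the helper name pushLeftSpine is made up.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical stand-in for the provided BSTnode class.
class IterNode<K> {
    K key;
    IterNode<K> left, right;
    IterNode(K key, IterNode<K> left, IterNode<K> right) {
        this.key = key;
        this.left = left;
        this.right = right;
    }
}

class InOrderIteratorSketch<K> implements Iterator<K> {
    private final Deque<IterNode<K>> stack = new ArrayDeque<>();

    // Pushes only the path from the root down to the smallest key,
    // so the work done here is proportional to the tree height.
    InOrderIteratorSketch(IterNode<K> root) {
        pushLeftSpine(root);
    }

    // Pushes n and every node reached by repeatedly going left.
    private void pushLeftSpine(IterNode<K> n) {
        while (n != null) {
            stack.push(n);
            n = n.left;
        }
    }

    @Override
    public boolean hasNext() {
        return !stack.isEmpty();
    }

    @Override
    public K next() {
        if (stack.isEmpty()) {
            throw new NoSuchElementException();
        }
        IterNode<K> n = stack.pop();
        // The next-larger keys live in the right subtree; queue its left spine.
        pushLeftSpine(n.right);
        return n.key;
    }
}
```

At any moment the stack holds at most one node per level of the tree, which is what keeps the constructor cheaper than pushing all N nodes.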
Do not change the contents of the BSTnode.java, DictionaryADT.java, or DuplicateException.java source files!
The Priority Queue
We have specified how priority queues are to work in the PriorityQueueADT interface (see javadoc documentation, PriorityQueueADT.java source). Note that the getMax and removeMax methods throw NoSuchElementException when they are called on an empty priority queue. For this assignment (in order to make the coding simpler), priority queues contain items that implement the Prioritizable interface (see javadoc documentation, Prioritizable.java source). A class (such as KeyWord) that implements Prioritizable must provide a getPriority method that returns an integer value representing the priority of an item (where larger values correspond to higher priorities).
The ArrayHeap class
In a file, named ArrayHeap.java, you will code a class that implements the PriorityQueueADT interface (see ArrayHeap.java shell) using an array-based implementation of a max heap. In addition to the methods specified in the PriorityQueueADT interface, the ArrayHeap class provides two constructors: a default (no argument) constructor and a constructor that takes an initial size (an integer) for the underlying array. Your ArrayHeap class must compare elements in the heap using the values returned by getPriority.
Implementation hint: because the generic type E must be something that is Prioritizable, to create an array of type E, use the following (where size_of_array is the size of the array you are creating):
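A common Java idiom for this is to allocate an array of the bound type and cast it, as in the constructor of the sketch below. The rest of the sketch shows one possible shape for insert and removeMax on a max heap. It is a hedged illustration only: the class name ArrayHeapSketch, the choice to store the root at index 0, the doubling growth policy, and the local Prioritizable stand-in are all assumptions, and the ArrayHeap.java shell and PriorityQueueADT javadoc define the real requirements.

```java
import java.util.Arrays;
import java.util.NoSuchElementException;

// Local stand-in for the provided Prioritizable interface.
interface Prioritizable {
    int getPriority();
}

class ArrayHeapSketch<E extends Prioritizable> {
    private E[] items; // root at index 0 (an assumption of this sketch)
    private int size;

    @SuppressWarnings("unchecked")
    ArrayHeapSketch(int sizeOfArray) {
        // Java cannot create "new E[n]"; allocate the bound type and cast.
        items = (E[]) new Prioritizable[Math.max(1, sizeOfArray)];
    }

    boolean isEmpty() { return size == 0; }

    void insert(E item) {
        if (size == items.length) {
            items = Arrays.copyOf(items, items.length * 2); // grow when full
        }
        items[size] = item;
        int child = size++;
        // Sift up: swap with the parent while the parent has lower priority.
        while (child > 0) {
            int parent = (child - 1) / 2;
            if (items[parent].getPriority() >= items[child].getPriority()) break;
            swap(parent, child);
            child = parent;
        }
    }

    E removeMax() {
        if (size == 0) throw new NoSuchElementException();
        E max = items[0];
        items[0] = items[--size]; // move the last item to the root
        items[size] = null;
        int parent = 0;
        // Sift down: swap with the higher-priority child until order is restored.
        while (true) {
            int left = 2 * parent + 1, right = left + 1, largest = parent;
            if (left < size && items[left].getPriority() > items[largest].getPriority()) largest = left;
            if (right < size && items[right].getPriority() > items[largest].getPriority()) largest = right;
            if (largest == parent) break;
            swap(parent, largest);
            parent = largest;
        }
        return max;
    }

    private void swap(int i, int j) {
        E tmp = items[i];
        items[i] = items[j];
        items[j] = tmp;
    }
}
```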
Do not change the contents of the Prioritizable.java or PriorityQueueADT.java source files!
The WordCloudGenerator Class
The WordCloudGenerator class is the main class of the program. The main method will do the following:
The WordCloudGenerator.java file contains the outline of the WordCloudGenerator class. Download this file and use it as the starting point for your WordCloudGenerator implementation.
Breaking English text into individual words is not as straightforward as it might seem (for example, just using the String.split method to parse text using white-space to identify where words begin and end results in words that contain punctuation, like " or ?). To make things easier, we have provided the code that divides (i.e., parses) Strings into individual words. For our purposes, we will consider a word to be a non-empty sequence of characters that starts and ends with either a letter or a digit, contains no white-space, and contains at least one letter. The WordCloudGenerator class includes a parseLine method that takes a String, breaks it up into individual words, and returns a list of those words in the order they appear in the String. Do not change the parseLine method.
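The definition of a word given above can be written as a small predicate, which may help when checking what parseLine should and should not return. This is not the provided parseLine code, and the names WordCheckSketch and isWord are hypothetical. It simply restates the three conditions using methods from java.lang.Character.

```java
class WordCheckSketch {
    // True when s is non-empty, starts and ends with a letter or digit,
    // contains no white-space, and contains at least one letter.
    static boolean isWord(String s) {
        if (s == null || s.isEmpty()) {
            return false;
        }
        if (!Character.isLetterOrDigit(s.charAt(0))
                || !Character.isLetterOrDigit(s.charAt(s.length() - 1))) {
            return false;
        }
        boolean hasLetter = false;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isWhitespace(c)) {
                return false;
            }
            if (Character.isLetter(c)) {
                hasLetter = true;
            }
        }
        return hasLetter;
    }
}
```

Under this reading, "don't" and "3rd" count as words, while "hello?", "1234", and "two words" do not.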
A method to generate the appropriate html code for your word cloud is provided for you in the WordCloudGenerator class. The generateHtml method takes as its parameters a dictionary of key words and a PrintStream to which to send output. It determines the minimum and maximum number of occurrences in the given dictionary of key words. It then uses linear interpolation to map the number of occurrences for each key word to the appropriate font size and color.
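For readers unfamiliar with linear interpolation, the idea is sketched below. This is not the provided generateHtml code: the class and method names, the output range used in the example, and the handling of the case where every word has the same count are all assumptions made for illustration.

```java
class InterpolationSketch {
    // Maps value from the range [min, max] linearly onto [outLow, outHigh].
    // min maps to outLow, max maps to outHigh, and values in between scale evenly.
    static double interpolate(int value, int min, int max, double outLow, double outHigh) {
        if (max == min) {
            return outHigh; // assumed choice when all counts are identical
        }
        double fraction = (double) (value - min) / (max - min);
        return outLow + fraction * (outHigh - outLow);
    }
}
```

For example, with counts ranging from 1 to 9 and a hypothetical font-size range of 10 to 50, a word that occurs 5 times sits halfway through the count range and so maps to a size of 30.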
WordCloudGenerator operates in two modes. If simpleDisplay is set to true, the generated HTML will place all words in alphabetical order with size and color denoting word frequency (example here). If simpleDisplay is set to false, a more attractive two-dimensional cloud will be generated: popular words are placed in the center with less popular words at the edges. The layout is targeted toward an 8.5 by 11 inch format so that the cloud can be easily printed (example here).
Do not change the generateHtml method.
How to proceed
After you have read this program page and given thought to the problem we suggest the following steps:
What should be handed in?
Electronically submit the following files to the Program 3 Dropbox on Learn@UW (or wherever!):
Please turn in only the files named above. Extra files clutter up the Dropbox directories.
Last Updated: 8/11/2017 © 2014-17 Beck Hasti and Charles Fischer