Due Monday, July 20th, at the start of class.
Write a Perl script that processes an input file and counts how often each line's contents occur.
This program should take 2 command line arguments, the first being an input file name, the second being a threshold (a number). This should exercise your knowledge of Perl's strings, collections, and file handling capabilities.
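As a rough illustration of argument handling only (not of the assignment itself), the sketch below checks a two-element argument list. The helper name parse_args and the rule that the threshold must be a non-negative integer are assumptions made for this example; the assignment only says the threshold is a number.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns (file, threshold) when the argument list
# looks valid, or an empty list otherwise.
sub parse_args {
    my @args = @_;
    return unless @args == 2;              # exactly two arguments expected
    my ($file, $threshold) = @args;
    return unless $threshold =~ /^\d+$/;   # assumption: non-negative integer
    return ($file, $threshold);
}

# Demonstration with a fixed list; a real script would pass @ARGV instead.
my ($file, $threshold) = parse_args('homework.txt', '1000');
print "file=$file threshold=$threshold\n";
```

In a script, a failed check would normally end with a usage message, for example die "Usage: $0 <file> <threshold>\n".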
The user will run the script, passing it the file name and the threshold value as command line parameters. The input file is a list of words, one word per line. The program will read this file, and count the frequency of each of the words. When done, it should then print out all words with a frequency count equal to or greater than the threshold from the command line. The script should be case insensitive -- that is, it should treat "the" and "The" as the same word.
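One way to get the case-insensitive behavior is to fold every word to a single case before doing anything else with it. The sketch below uses Perl's built-in chomp and lc; the subroutine name normalize is hypothetical and the sample strings are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: strip the trailing newline and fold to lowercase,
# so that "The" and "the" compare as equal.
sub normalize {
    my ($line) = @_;
    chomp $line;        # remove the trailing newline, if there is one
    return lc $line;    # lc returns a lowercased copy of the string
}

print normalize("The\n"), "\n";
print normalize("the\n"), "\n";
```

Both print statements produce the same text, which is the property the script needs.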
% ./homework.pl homework.txt 1000
'of': 6568
'with': 1908
'at': 1754
'he': 3128
'a': 6602
'on': 2250
'was': 4220
'and': 6984
'she': 1727
'in': 4747
'from': 1117
'her': 1559
'be': 1294
'had': 2300
'that': 2640
'have': 1094
'they': 1101
'for': 2021
'it': 2551
'i': 3342
'as': 1663
'his': 1983
'said': 1430
'the': 15700
'by': 1213
'you': 1147
'were': 1194
'but': 1495
'is': 1180
'to': 6981
'not': 1286
%
The key to this is the proper use of Perl's collection types. With the proper selection, the problem becomes very straightforward.
Also, the script may be easier to test and debug with a smaller dataset. The dataset provided in 'homework.txt' is over 250,000 lines. Ultimately, however, once your script works on a small dataset, it should work equally well on the larger dataset.
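As a reminder of how one of Perl's collection types behaves, here is a small sketch of a hash used as a tally. The data (a list of color names) is unrelated to the assignment and invented for this example, and the sketch is not meant to suggest that a hash is the only workable choice.

```perl
use strict;
use warnings;

# Made-up sample data, unrelated to homework.txt.
my @votes = qw(red blue red green blue red);

# A hash maps each distinct string to a number. A missing key is treated
# as 0 by ++, so no explicit initialization is needed.
my %count;
$count{$_}++ for @votes;

# Hash order is not defined, so sort the keys for stable output.
for my $color (sort keys %count) {
    print "'$color': $count{$color}\n";
}
```

Because hash order is unspecified, unsorted output can appear in any order, as in the sample run above.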
Do the work yourself, consulting reasonable reference materials as needed; any reference material that gives you a complete or nearly complete solution to this problem or a similar one is not OK to use. Asking the instructors for help is OK; asking other students for help is not.
Hand in a printout of your script on a single sheet of paper. At the top of the printout, please include “CS 368 Summer 2009”, your name, and “Homework 04, July 20, 2009”. Identifying your work is important; without it, you may not receive appropriate credit.