CS 368-3 (2012 Summer) — Day 4 Homework

Due Monday, July 2, at the start of class.

Goal

Write a Perl script that reads data from a file, analyzes the data, and prints a report.

Tasks

Now that we can read files, we can start processing significantly more data than before. Use this new knowledge to read a data file (1.4 MB) containing one English word per line (263536 words), then look for interesting statistics. Specifically, we are looking for words whose mixed-case form (some uppercase letters) occurs more often than their all-lowercase form.

Your script will need to contain a definition for the input file name at the top of the script, like this:

my $filename = '...';

See below for a link to the input file that I provide.

The input file contains a list of words, one word per line. Your script will read this file, and count the frequency of each of the words.
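As a minimal sketch of the counting step, the loop below reads one word per line and tallies each word in a hash. To stay runnable without any input file, it reads from an in-memory string instead of `$filename`; the sample words and the `%count` name are assumptions for illustration, and the sketch does not yet skip all-uppercase words.

```perl
use strict;
use warnings;

# Stand-in for the real input file: one word per line.
my $sample = "yes\nYes\nYes\nthe\n";

# Opening a reference to a string gives an in-memory filehandle.
open(my $fh, '<', \$sample) or die "Cannot open in-memory sample: $!\n";

my %count;                      # word => number of times seen
while (my $word = <$fh>) {
    chomp $word;                # remove the trailing newline
    next if $word eq '';        # ignore blank lines
    $count{$word}++;            # a new key starts from undef, treated as 0
}
close($fh);

foreach my $word (sort keys %count) {
    print "$word: $count{$word}\n";
}
```

In a real script, the `open` call would take `$filename` in place of `\$sample`; the rest of the loop stays the same.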

HOWEVER, do not count any words that are all uppercase; they are not as interesting to deal with. Perl does not contain a simple function to check whether a string is all uppercase, but you can easily convert a string to all uppercase:

my $string = "Mixed_Case";
print "The string '$string', in all uppercase, is '" . uc($string) . "'\n";

Use that function to test whether the original string is equal to the converted string; if so, then the original must have been in all uppercase.
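One way to write that comparison is sketched here. The helper name `is_all_upper` and the test words are hypothetical, chosen only for illustration. Note that a string with no letters at all (such as "123") also equals its uppercase conversion, so this test would treat it as all uppercase.

```perl
use strict;
use warnings;

# Returns true when converting to uppercase changes nothing.
sub is_all_upper {
    my ($word) = @_;
    return $word eq uc($word);
}

foreach my $word ('NASA', 'Yes', 'yes') {
    if (is_all_upper($word)) {
        print "'$word' is all uppercase\n";
    } else {
        print "'$word' is not all uppercase\n";
    }
}
```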

Once you have tallied all of the (not-all-uppercase) words in the file, you need to look for ones that occur both in mixed case (i.e., mixed upper- and lowercase letters) and all lowercase. In all of the instances that I found, the mixed-case version starts with an initial capital letter, so you could look for the word pairs that way, too.

For example: Both the words “yes” and “Yes” occur in the text, but the word “intelligent” occurs only in lowercase.
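The pairing idea can be sketched with a hash lookup: for each word that is not already lowercase, check whether its lowercase form is also a key. The tallies below are illustrative values typed in by hand (not all taken from the data file), and the names `%count` and `@pairs` are assumptions.

```perl
use strict;
use warnings;

# Illustrative tallies, as if already counted from an input file.
my %count = (yes => 17, Yes => 39, intelligent => 4, Sally => 9);

my @pairs;
foreach my $word (sort keys %count) {
    my $lower = lc($word);
    next if $word eq $lower;              # word is already all lowercase
    next unless exists $count{$lower};    # no lowercase partner was seen
    push @pairs, "$word/$lower";
    print "$word pairs with $lower\n";
}
```

Here only "Yes" finds a partner: "intelligent" is already lowercase, and "Sally" has no "sally" key in this small hash.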

Of the pairs of words that are both mixed and lowercase, we would generally expect the lowercase form to be more common. So let’s look for the unusual cases! That is, we are looking for cases where the mixed-case word is actually more frequent than its all-lowercase version. For example, “The” occurs 1619 times and “the” occurs 13410 times, so we are NOT interested in it. However, “Yes” occurs 39 times and “yes” occurs 17 times, so that is interesting.

For every interesting pair of words, as defined above, print one line with: (a) the ratio of mixed-case to lowercase counts, (b) the words, and (c) their actual frequencies in the text.
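A report line with those three elements could be built with `sprintf`, as in this sketch. The helper name `report_line` and the `%6.3f` format are assumptions, chosen to resemble the instructor's sample output; any format that shows the same elements would do.

```perl
use strict;
use warnings;

# Build one report line: ratio, both words, and both counts.
sub report_line {
    my ($mixed, $mixed_count, $lower, $lower_count) = @_;
    my $ratio = $mixed_count / $lower_count;
    return sprintf("%6.3f - %s (%d) : %s (%d)",
                   $ratio, $mixed, $mixed_count, $lower, $lower_count);
}

# Counts for "Yes" and "yes" are the ones quoted in the assignment text.
my ($mixed_count, $lower_count) = (39, 17);
if ($mixed_count > $lower_count) {
    print report_line('Yes', $mixed_count, 'yes', $lower_count), "\n";
}
```

For these counts the ratio is 39 / 17, which prints as 2.294.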

Sample Output

This is what my output looks like. Your output format may be different, but it must contain all the same elements.

11.000 - Hello (11) : hello (1)
 2.412 - Cabinet (41) : cabinet (17)
 1.143 - Eastern (8) : eastern (7)
 9.000 - Sally (9) : sally (1)
 1.750 - Commission (14) : commission (8)
27.000 - Author (243) : author (9)
 2.333 - Museum (7) : museum (3)
 1.500 - Performer (3) : performer (2)
....

Input File

I provide the input file, homework-04.txt, for you to experiment with. Your script should run successfully on this file and on any other file in the same format.

Perl Tips

Similar to the uc() function, use the lc() function on a string to return a copy of the string with all letters in lower case. For example:

my $some_text = 'Mixed Case';
print lc($some_text) . "\n";

A key aspect to this assignment is the proper use of Perl’s collection types. With proper selection, the problem becomes very straightforward.

Also, the script may be easier to test and debug with a smaller dataset. The dataset provided in “homework-04.txt” is over 250,000 lines. Once your script works on a small dataset, it should work equally well on the larger one. In a Linux or Mac OS X (Terminal) shell, you can create a smaller dataset like this:

head -n 1000 homework-04.txt > homework-04-small.txt

Replace the 1000 with the number of lines/words that you want.

Testing

Testing your code is very important! If you do not run your code, you will make mistakes and probably not receive full credit. And even just running your code once may not be enough. Try different cases and make sure things work as you expect.

Here are some specific tests to consider:

Does the script avoid printing (and counting) UPPERCASE words?
Verify by hand the calculation of the ratio for a few pairs.
Use another tool to verify the counts of a few words.
For example, on Linux or Mac OS X, you can do something like this for any given word (put your word in place of Hello; letter case must match exactly):
```
grep '^Hello$' homework-04.txt | wc -l
```

Extra Challenges

Want more challenge in the assignment? Yes, you do! Try these (which are still fairly easy):

Instead of printing your report to the screen, write it to a file. Do not overwrite your input file by accident! (^_^)
Handle errors that come from file operations, where appropriate.
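Both challenges can be sketched together: open an output file for writing, stop with a useful message if that fails, and print to the new filehandle. The output file name below is hypothetical; pick any name that differs from your input file.

```perl
use strict;
use warnings;

my $report_name = 'homework-04-report.txt';   # hypothetical output name

# '>' creates the file or truncates an existing one; $! holds the system error.
open(my $out, '>', $report_name)
    or die "Cannot open '$report_name' for writing: $!\n";

# Printing to a filehandle: note there is no comma after the handle.
print {$out} " 2.294 - Yes (39) : yes (17)\n";

close($out)
    or die "Cannot close '$report_name': $!\n";
```

The same `open ... or die` pattern applies when opening the input file for reading with `'<'`.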

Reminders

Start your script the right way! Here is a suggestion:

#!/usr/bin/perl
# Homework for CS 368-3
# Assigned on Day 04, 2012-06-28
# Written by Your Name

use strict;
use warnings;

Do the work yourself, consulting reasonable reference materials as needed. Any resource that provides a complete solution or offers significant material assistance toward a solution is not OK to use. Asking the instructor for help is OK; asking other students for help is not. All standard UW policies concerning student conduct (esp. UWS 14) and information technology apply to this course and assignment.

Hand In

A printout of your code, ideally on a single sheet of paper. Be sure to put your own name in the “Your Name” part of the code. Identifying your work is important, or you may not receive appropriate credit.