Due Tuesday, March 27, at the start of class.
Write a Python script that reads data from a file, analyzes the data, and prints a report.
Now that we can read files, we can start processing significantly more data than before. Use this new knowledge to read a data file (1.4 MB) containing one English word per line (263536 words), checking for errors if you wish, and then look for interesting statistics. Specifically, we are looking for words that occur with some uppercase letters more often than in all lowercase letters.
Your script will need to contain a definition for the input file name at the top of the script, like this:
filename = '...'
See below for a link to the input file that I provide.
The input file contains a list of words, one word per line. Your script will read this file, and count the frequency of each of the words.
HOWEVER, do not count any words that are all uppercase; they are not as interesting to deal with. Conveniently, Python has a string method for this task; check the Python documentation (remember how?) and try running the following code to see how it works:
print 'lower'.isupper()
print 'Mixed'.isupper()
print 'HELLO!'.isupper()
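The tallying step described above could be sketched roughly as follows. This is only one possible approach, not the required one: the function name count_words and the tiny in-memory sample are made up for illustration, and the code uses the parenthesized form of print so that it runs under both Python 2 and Python 3.

```python
def count_words(lines):
    """Return a dict mapping each not-all-uppercase word to its count."""
    counts = {}
    for line in lines:
        word = line.strip()  # drop the trailing newline
        # Skip blank lines and words written entirely in uppercase.
        if not word or word.isupper():
            continue
        counts[word] = counts.get(word, 0) + 1
    return counts


# Hypothetical sample standing in for the lines of the input file.
sample = ['Yes\n', 'yes\n', 'YES\n', 'Yes\n', 'intelligent\n']
counts = count_words(sample)
print(sorted(counts.items()))
```

Because an open file yields one line at a time when you loop over it, the same function should also accept the result of open(filename) in place of the sample list.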
Once you have tallied all of the (not-all-uppercase) words in the file, you need to look for ones that occur both in mixed case (i.e., mixed upper- and lowercase letters) and all lowercase. In all of the instances that I found, the mixed-case version starts with an initial capital letter, so you could look for the word pairs that way, too.
For example: Both the words “yes” and “Yes” occur in the text, but the word “intelligent” occurs only in lowercase.
Of the pairs of words that are both mixed and lowercase, we would generally expect the lowercase form to be more common. So let’s look for the unusual cases! That is, we are looking for cases where the Mixed-case word is actually more frequent than its all-lowercase version. For example, “The” occurs 1619 times and “the” occurs 13410 times, so we are NOT interested in it. However, “Yes” occurs 39 times and “yes” occurs 17 times, so that is interesting.
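As a rough illustration of the comparison just described, the sketch below assumes the counts are already in a dictionary that contains no all-uppercase words. The function name interesting_pairs is hypothetical, the count for "intelligent" is invented for the example, and the other counts are the ones quoted above. It relies on the string method lower(), which returns an all-lowercase copy of a string.

```python
def interesting_pairs(counts):
    """Return (mixed, lower, mixed_count, lower_count) tuples for words whose
    mixed-case form is more frequent than the all-lowercase form."""
    pairs = []
    for word, mixed_count in counts.items():
        lower = word.lower()
        # Only mixed-case words whose lowercase form also occurs qualify.
        if word == lower or lower not in counts:
            continue
        lower_count = counts[lower]
        if mixed_count > lower_count:
            pairs.append((word, lower, mixed_count, lower_count))
    return pairs


# Assumed input: a tally with all-uppercase words already left out.
counts = {'The': 1619, 'the': 13410, 'Yes': 39, 'yes': 17, 'intelligent': 5}
print(interesting_pairs(counts))
```

With these numbers only the "Yes"/"yes" pair is reported: "The" is less frequent than "the", and "intelligent" has no mixed-case partner.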
For every interesting pair of words, as defined above, print one line with: (a) the ratio of mixed-case to lowercase counts, (b) the words, and (c) their actual frequencies in the text.
This is what my output looks like. Your output format may be different, but it must contain all the same elements.
11.000 - Hello (11) : hello (1)
2.412 - Cabinet (41) : cabinet (17)
1.143 - Eastern (8) : eastern (7)
9.000 - Sally (9) : sally (1)
1.750 - Commission (14) : commission (8)
27.000 - Author (243) : author (9)
2.333 - Museum (7) : museum (3)
1.500 - Performer (3) : performer (2)
....
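One way to produce a line in roughly this shape is with Python's % string formatting, as in the sketch below. The function name format_line is made up for illustration, and any layout that shows the ratio, the two words, and their counts would serve equally well. The float() call keeps the division from truncating under Python 2.

```python
def format_line(mixed, lower, mixed_count, lower_count):
    """Build one report line: ratio, both words, and their counts."""
    # Convert to float first so Python 2 does not do integer division.
    ratio = float(mixed_count) / lower_count
    return '%.3f - %s (%d) : %s (%d)' % (
        ratio, mixed, mixed_count, lower, lower_count)


print(format_line('Cabinet', 'cabinet', 41, 17))
```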
I provide the input file: homework-04.txt for you to experiment with. Your script should run successfully on this file and any other ones like it.
In Python, use the lower() method on a string to return a copy of the string with all letters in lower case. For example:
some_text = 'Mixed Case'
print some_text.lower()
A key aspect to this assignment is the proper use of Python’s collection types. With proper selection, the problem becomes very straightforward.
Also, the script may be easier to test and debug with a smaller dataset. The dataset provided in “homework-04.txt” is over 250,000 lines. Ultimately, however, once your script works on a small dataset, it should work equally well on the larger dataset. In a Linux or Mac OS X (Terminal) shell, you can create a smaller dataset like this:
head -n 1000 homework-04.txt > homework-04-small.txt
Replace the 1000 with the number of lines/words that you want.
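If no such shell is at hand, a few lines of Python can probably do the same job. The sketch below is one possible equivalent of the head command above, not part of the assignment; the function name copy_first_lines is hypothetical, and itertools.islice is used to stop reading after the first n lines.

```python
from itertools import islice


def copy_first_lines(src_path, dst_path, n):
    """Copy the first n lines of the file src_path into dst_path."""
    with open(src_path) as src:
        with open(dst_path, 'w') as dst:
            # islice stops after n lines without reading the rest of the file.
            dst.writelines(islice(src, n))


# Example call, assuming homework-04.txt is in the current directory:
# copy_first_lines('homework-04.txt', 'homework-04-small.txt', 1000)
```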
Want more challenge in the assignment? Yes, you do! Try these (which are still fairly easy):
Start your script the right way! Here is a suggestion:
#!/usr/bin/env python
"""Homework for CS 368-2 (2012 Spring)
Assigned on Day 04, 2012-03-22
Written by <Your Name>
"""
Do the work yourself, consulting reasonable reference materials as needed. Any resource that provides a complete solution or offers significant material assistance toward a solution is not OK to use. Asking the instructor for help is OK; asking other students for help is not. All standard UW policies concerning student conduct (esp. UWS 14) and information technology apply to this course and assignment.
Hand in a printout of your code on a single sheet of paper (if at all possible). Be sure to put your own name in the initial comment block of the code. Identifying your work is important, or you may not receive appropriate credit.