Exercise 10: Ciphers Part I

During this lab, you will create and hand in 1 file:

  • freq.py - write a program to calculate the frequency of letters in the English language. This may be completed in pairs if you wish.

Background

Over the next couple of weeks, we'll be looking at a relatively simple encryption algorithm called a Caesar Cipher.

Decoding a Caesar Cipher without a key is relatively straightforward, but requires that you know the relative frequencies of letters in the alphabet so that you can make the best guess possible. Next week we'll look at the decoding part; this week we'll just focus on figuring out how frequently letters appear in the English language.

Program Requirements

Your program will parse a file to determine the relative frequency of each letter in typical English language usage.

  1. Prompt the user for the path to a file.
    1. Remember, trying to open a file in read mode when the file doesn't exist will cause your program to crash! Make sure you check whether the file exists.
    2. If the file doesn't exist, display an error message and quit the program (see Sample Output).
  2. Read the file, line by line, and count how many times each letter occurs.
    1. Our files may get quite large, so we recommend not reading in the whole file at once.
    2. To determine whether a string contains only letters, you may want to use isdigit()'s cousin, isalpha().
    3. You may use any sort of structure you like to store the letter counts.
    4. We don't care about case - A should be counted in the same place as a.
    5. Helpful hint: if you want to convert between a letter (like A) and its ASCII value (A is 65 in ASCII) to use as an index into a list, check out the function ord(c), where c is a string with a single character.
  3. Calculate and display the relative frequency of each letter.
    1. The relative frequency of a letter is just how many times that letter appeared, divided by the total number of letters counted.
    2. You don't need to control how many decimal places are shown, but letters should be displayed in alphabetical order (A-Z).
    3. For the purposes of this assignment, you should display capital letters (even though the counts are case-insensitive).

Sample Output

File does not exist:

Path to a file to parse: doesnotexist.txt
Error: cannot find doesnotexist.txt

Using The Phoenix and the Carpet by E. Nesbitt in my current working directory:

Path to a file to parse: phoenix.txt
A: 0.0823032440936
B: 0.0167026632589
C: 0.0231893836332
D: 0.0487402415011
E: 0.12588051087
F: 0.0182794240056
G: 0.0206812339803
H: 0.0638771446696
I: 0.0632904429964
J: 0.00215246176355
K: 0.0101719402591
L: 0.0410984522076
M: 0.0201972050999
N: 0.068937446601
O: 0.0762565499741
P: 0.0182610895783
Q: 0.000946056448035
R: 0.0561106812706
S: 0.0581128007304
T: 0.0969891203508
U: 0.0286347085376
V: 0.00781413290993
W: 0.0253748473659
X: 0.00269516081126
Y: 0.0229180341094
Z: 0.000385022973037

Using The Federalist Papers by Alexander Hamilton in the books directory:

Path to a file to parse: books/federalist.txt
A: 0.0717240387036
B: 0.0165059657185
C: 0.0333123861393
D: 0.0332700685756
E: 0.13360712797
F: 0.0270144747227
G: 0.0142853515637
H: 0.051538560822
I: 0.076517560731
J: 0.00221426652025
K: 0.00218993392112
L: 0.0352981378156
M: 0.0234100762351
N: 0.0730242458481
O: 0.0789381253742
P: 0.0231646343657
Q: 0.00142187014009
R: 0.0590922459412
S: 0.0626151831187
T: 0.107156534784
U: 0.0283146818671
V: 0.0114997979336
W: 0.0151211234467
X: 0.00283421882835
Y: 0.0154279257835
Z: 0.000501463129765

Using Are Women People? by Alice Duer Miller in my current working directory:

Path to a file to parse: ./suffrage.txt
A: 0.0755596455078
B: 0.0175245563847
C: 0.028387380719
D: 0.0360493728369
E: 0.125452617681
F: 0.0223258047092
G: 0.0201052273591
H: 0.0488527017024
I: 0.0659371436574
J: 0.00270070218257
K: 0.00798207533959
L: 0.03861003861
M: 0.0277272090744
N: 0.0693380278873
O: 0.0841618820893
P: 0.0211454978294
Q: 0.000840218456799
R: 0.0638966131194
S: 0.0605957548963
T: 0.0924640406506
U: 0.0320083221638
V: 0.00962250185048
W: 0.023446095985
X: 0.00186048372577
Y: 0.0231460179647
Z: 0.000260067617581

Feel free to try other books from Project Gutenberg and see what distributions you get - maybe look at some other languages' character distribution frequencies and compare!

How do we approach this?

We're back to our fundamental concept: the counting program. The only thing different this time is the source of the stuff you're counting.

There are no requirements for your program's internal structure this week, only its behavior, but keep in mind that we'll be using the results from this week in next week's program (so you may want to create some functions you can reuse).

When we've dealt with counting in the past, we've been keeping track of no more than a few different counts. It made sense to store these in individual variables. Now, we're dealing with 26 counts. We don't want to manually create 26 count variables; while it might work, it's a very messy approach and we can do much better.

The better approach would involve using one of the data structures we've learned about to store these counts. Using a dictionary should be relatively straightforward (think about what the keys and what the values should be), but how could we use a list? We know a list with 26 elements would have index 0 through 25, inclusively. If we read the letter 'c', how do we know what index we need to use to access the list? This is where the above suggestion about ord() comes into play - think about how you could use that to accomplish this.

Another important consideration is counter initialization. Just like we've done in the past, we need to initialize any counter before we actually process them. Consider where it is appropriate to place that initialization code, and what value you should initialize each counter to.

Commenting your code

Once again, some of your program points will come from commenting your code.

  • Add a 1-2 sentence description of the program and its function to the top of your code.
  • Include comments within your code describing what you're doing. You don't need to comment on every line of code, but be clear in your explanations.

Submitting your files

As usual, you'll be handing in your lab work via the course Learn@UW dropboxes. Navigate to our 301 course page, and click the Dropbox link in the top navigation bar. You should see a dropbox for Program 10 - this is where you should hand in your freq.py file.

Note that the dropbox will close at noon on 14 April, so be sure to submit your files before then.