Each line in the data files contains the SEVEN (7) words BEFORE and
the SEVEN (7) AFTER the word in question (eg "affect" or "effect").

Each word is annotated as follows:

     WORD [ POS-tag STEM'ed-version ]

where "POS-tag" is the estimated "part of speech," produced by the
Stanford NLP part-of-speech tagger; see 
http://www.computing.dcu.ie/~acahill/tagset.html for an explanation
of the short names.  And, where "STEM'ed-version" is the result of
applying the Porter stemmer to WORD.  (You can find the Stanford and
Porter s/w on the web if you'd like, however it isn't necessary for
this HW since we've preprocessed all the data already.)

You should assume that all the TRAINING data uses the correct "center"
word, though it is of course possible that authors made grammatical
errors.  This is just another source of possible "noise" in our data.

In the TEST set, the job of your algorithms is to PREDICT WHICH OF THE
TWO WORDS IN A GIVEN PAIR belongs in the center position (ie, in order
to detect "spelling" errors).  

It is very important to notice that the TEST-SET files are different
than the training-set files.  In the training set, each phrase
appears once with the (presumably) correct center word.
However, in the TEST examples, we include TWO versions of each
phrase.  The first version uses the CORRECT center word
and the second version of the phrase uses the INCORRECT center word.
We need to do this because the POS tagger can tag the neighboring
words differently depending on what word is in the center.

In other words, imagine that in the testset, the middle word is a
XXXXX (ie, an unknown word), and the job of
your algorithms is to predict which of the two possible center
words (eg, affect vs. effect) is more likely.  Your code can then
count how often the correct guess is made, since, just like in
HW1, the test sets contain the right answer.  (Remember, even
though YOU know that the first of the two paired phrases in the testset
is always the correct one, your algorithms - other than the code that
measures correctness - must not make use of this fact.)

PS - This data was obtained from twenty iconic novels from the free
e-books site http://www.gutenberg.org/. You may even recognize some
of the phrases if you have read some of these novels in an English
literature class. The data has not been culled for offensive language, 
except for a few occurrences I noticed; if anyone notices any more I'll 
be glad to delete those lines.