
(Problems 1-5 are paper-and-pencil ones. Problems 2 and 5 in particular are intended to help get you up to speed on the algorithms underlying the programming portions of this homework.)
A  B  C  D  Prob
F  F  F  F  0.10
F  F  F  T  0.01
F  F  T  F  0.05
F  F  T  T  0.15
F  T  F  F  0.02
F  T  F  T  0.03
F  T  T  F  0.04
F  T  T  T  0.05
T  F  F  F  0.20
T  F  F  T  0.01
T  F  T  F  0.01
T  F  T  T  0.03
T  T  F  F  0.02
T  T  F  T  0.04
T  T  T  F  0.08
T  T  T  T  ?
wh = washes hands regularly
cc = caught cold
Assume we know from past experience that:
P(wh) = 0.75
P(cc) = 0.25
P(cc | wh) = 0.10
(a) What's P(cc | ~wh)? ('~' means 'NOT')
(b) Given you find out someone has a cold, what's the probability they regularly wash their hands?
Be sure to show and explain your calculations for both parts (a) and (b).
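As a way to check hand calculations for parts (a) and (b), the two identities involved can be written as a small Java sketch. Part (a) rearranges the law of total probability, P(cc) = P(cc | wh) P(wh) + P(cc | ~wh) P(~wh), and part (b) is Bayes' rule. The class and method names below are illustrative only and are not part of the assignment.

```java
// Sketch only: HandWashBayes and its method names are hypothetical helpers.
class HandWashBayes {
    /** Solves P(e) = P(e|h)P(h) + P(e|~h)P(~h) for P(e|~h). */
    static double probGivenNot(double pE, double pH, double pEgivenH) {
        return (pE - pEgivenH * pH) / (1.0 - pH);
    }

    /** Bayes' rule: P(h|e) = P(e|h)P(h) / P(e). */
    static double posterior(double pE, double pH, double pEgivenH) {
        return pEgivenH * pH / pE;
    }

    public static void main(String[] args) {
        // Values given in the problem statement.
        double pWh = 0.75, pCc = 0.25, pCcGivenWh = 0.10;
        System.out.println("P(cc | ~wh) = " + probGivenNot(pCc, pWh, pCcGivenWh));
        System.out.println("P(wh | cc)  = " + posterior(pCc, pWh, pCcGivenWh));
    }
}
```

The written answer should still show each substitution step; the code only confirms the arithmetic.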
(c) Margaret has divided her books into two groups, those she likes and those she doesn't.
The five (5) books that Margaret likes contain (only) the following words:
animal (7 times), mineral (8 times), vegetable (5 times), see (3 times)
The ten (10) books that Margaret does not like contain (only) the following words:
animal (4 times), mineral (1 time), vegetable (6 times), spot (2 times), run (9 times)
Using the Naive Bayes assumption, determine whether it is more probable
that Margaret likes the following book than that she dislikes it.
Again, be sure to show and explain your work.
see spot spot animal run // These words are the entire contents of this new book.
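The Naive Bayes comparison for part (c) can be sketched in Java as follows. One caveat: with raw counts, each category is missing at least one word of the new book ("spot" and "run" never occur in the liked books, "see" never occurs in the disliked ones), so both products would be zero. The sketch therefore uses add-one (Laplace) smoothing, which is an assumption made here and is not specified in the problem. Class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch only. ASSUMPTION: add-one smoothing, so unseen words do not zero out a product.
class BookNaiveBayes {
    /** Builds a word-count map from alternating word, count arguments. */
    static Map<String, Integer> counts(Object... wordThenCount) {
        Map<String, Integer> m = new HashMap<>();
        for (int i = 0; i < wordThenCount.length; i += 2) {
            m.put((String) wordThenCount[i], (Integer) wordThenCount[i + 1]);
        }
        return m;
    }

    /** log P(category) + sum over the book's words of log P(word | category). */
    static double logScore(String[] book, Map<String, Integer> wordCounts,
                           int booksInCategory, int totalBooks, int vocabSize) {
        int totalWords = 0;
        for (int c : wordCounts.values()) totalWords += c;
        double score = Math.log((double) booksInCategory / totalBooks);
        for (String w : book) {
            int c = wordCounts.getOrDefault(w, 0);
            score += Math.log((c + 1.0) / (totalWords + vocabSize));
        }
        return score;
    }

    /** Returns {log score for "likes", log score for "dislikes"} using the counts in the text. */
    static double[] scores(String[] book) {
        Map<String, Integer> liked =
            counts("animal", 7, "mineral", 8, "vegetable", 5, "see", 3);
        Map<String, Integer> disliked =
            counts("animal", 4, "mineral", 1, "vegetable", 6, "spot", 2, "run", 9);
        Set<String> vocab = new HashSet<>(liked.keySet());
        vocab.addAll(disliked.keySet());
        return new double[] {
            logScore(book, liked, 5, 15, vocab.size()),
            logScore(book, disliked, 10, 15, vocab.size())
        };
    }

    public static void main(String[] args) {
        double[] s = scores(new String[] {"see", "spot", "spot", "animal", "run"});
        System.out.println(s[0] > s[1] ? "likes is more probable" : "dislikes is more probable");
    }
}
```

Working in log space avoids underflow when many small probabilities are multiplied; comparing the two log scores is equivalent to comparing the two products.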
Hint: when you answer all three parts of this question, remember that if you store the value for, say, Prob(C=true | A=true and B=true) you do not need to also store the value for Prob(C=false | A=true and B=true) since we know that these two probabilities sum to one.
Consider the Bayesian network drawn below.
(You can print out this image directly. It is at /p/course/cs540-shavlik/public/HWs/HW3_Q4_BN.jpg)
F1 = false F2 = false F3 = false category = +
F1 = false F2 = false F3 = true category = -
F1 = false F2 = true F3 = false category = +
F1 = true F2 = true F3 = true category = +
F1 = true F2 = false F3 = false category = -
F1 = true F2 = false F3 = true category = +
Assume we have the following test set example, whose category we wish to estimate based on a case-based approach where we find the three (3) nearest neighbors and use the most common category among them as our estimate:
F1 = false F2 = true F3 = true category = ?
Propose and apply a (simple) similarity function to use on this problem.
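One possible similarity function, offered only as an illustration, is the number of features on which two examples agree. The Java sketch below applies it with K = 3 to the training examples listed above. The class name, the method names, and the tie-breaking rule (equally similar neighbors keep their listed order) are assumptions, not part of the problem.

```java
import java.util.Arrays;

// Sketch only: similarity = number of matching feature values (a hypothetical choice).
class TinyKnn {
    static int similarity(boolean[] a, boolean[] b) {
        int matches = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] == b[i]) matches++;
        }
        return matches;
    }

    /** Majority vote among the k most similar training examples (k <= train.length). */
    static char classify(boolean[][] train, char[] labels, boolean[] query, int k) {
        Integer[] order = new Integer[train.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        // Stable sort, most similar first; ties keep their original order (an assumption).
        Arrays.sort(order, (i, j) -> similarity(train[j], query) - similarity(train[i], query));
        int plus = 0;
        for (int n = 0; n < k; n++) {
            if (labels[order[n]] == '+') plus++;
        }
        return plus * 2 > k ? '+' : '-';
    }

    public static void main(String[] args) {
        boolean[][] train = {
            {false, false, false}, {false, false, true}, {false, true, false},
            {true, true, true}, {true, false, false}, {true, false, true}
        };
        char[] labels = {'+', '-', '+', '+', '-', '+'};
        boolean[] query = {false, true, true};
        System.out.println("Predicted category: " + classify(train, labels, query, 3));
    }
}
```

Other similarity functions (for example, weighting some features more heavily) fit the same structure by changing only `similarity`.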
The collected pairs are available at:
http://www.cs.wisc.edu/~shavlik/cs540/semanticSpellCheck/
(or /p/course/cs540-shavlik/public/semanticSpellCheck/ in AFS)
The specific word pairs you will be using are:
affect vs. effect,
right vs. write, and
too vs. two.
For each word pair there is a TRAIN set and a TEST set. Important: the train and test files are organized differently, as explained here.
Using the training data in the collections of (presumed) proper usage, determine which of X or Y is more likely the proper usage in each of the test sentences. In other words, our goal is to identify likely usage errors, which we'll do as follows: whenever we see word X in a phrase, we'll see if using word Y instead leads to a more plausible phrase and when we see word Y we'll see if using word X instead looks better.
http://www.cs.wisc.edu/~shavlik/cs540/semanticSpellCheck/tag-meanings.txt
(or /p/course/cs540-shavlik/public/semanticSpellCheck/tag-meanings.txt in AFS)
Also, the Porter Stemmer was used to extract the "root" version of each word (e.g., walk, walks, walked, and walking all get "stemmed" to WALK).
A phrase containing a word of interest is defined as follows:
the seven (7) words (tagged and stemmed) BEFORE the word X or Y
the center word, i.e., either word X or Y
(the center word is also tagged and stemmed)
the seven (7) words (tagged and stemmed) AFTER the word X or Y
Punctuation counts as one of the 7 before/after words.
Here is a sample phrase, though it uses only 3 before/after words for simplicity:
might [ MD MIGHT ] take [ VB TAKE ] to [ TO TO ] effect [ VB EFFECT ] it [ PRP IT ] , [* ,] but [ CC BUT ]
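A phrase in this notation can be split into (word, tag, stem) triples with a StringTokenizer. The Java sketch below is based only on the sample line above, so it assumes that every word is followed by a bracketed group holding a tag and a stem, and that brackets may be attached to their contents (as in `[* ,]`). The real files may differ; consult the notation file before relying on it. The class names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Sketch only: format inferred from the single sample phrase, not from the data files.
class PhraseParser {
    /** One annotated token: surface word, part-of-speech tag, and stem. */
    static final class Token {
        final String word, tag, stem;
        Token(String word, String tag, String stem) {
            this.word = word;
            this.tag = tag;
            this.stem = stem;
        }
    }

    static List<Token> parse(String line) {
        StringTokenizer st = new StringTokenizer(line);
        List<Token> out = new ArrayList<>();
        while (st.hasMoreTokens()) {
            String word = st.nextToken();
            // Collect everything up to and including the piece that ends with "]".
            List<String> parts = new ArrayList<>();
            boolean closed = false;
            while (!closed && st.hasMoreTokens()) {
                String t = st.nextToken();
                closed = t.endsWith("]");
                if (t.startsWith("[")) t = t.substring(1);
                if (closed && !t.isEmpty()) t = t.substring(0, t.length() - 1);
                if (!t.isEmpty()) parts.add(t);
            }
            String tag = parts.size() > 0 ? parts.get(0) : "";
            String stem = parts.size() > 1 ? parts.get(1) : "";
            out.add(new Token(word, tag, stem));
        }
        return out;
    }

    public static void main(String[] args) {
        String sample = "might [ MD MIGHT ] take [ VB TAKE ] to [ TO TO ] effect [ VB EFFECT ] "
                      + "it [ PRP IT ] , [* ,] but [ CC BUT ]";
        for (Token t : parse(sample)) {
            System.out.println(t.word + " / " + t.tag + " / " + t.stem);
        }
    }
}
```

With three before/after words, the center word is the token at index 3; with the full seven, it would be at index 7.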
More information on the construction of these annotated data files is available at:
http://www.cs.wisc.edu/~shavlik/cs540/semanticSpellCheck/notationInDataFiles.txt
(or /p/course/cs540-shavlik/public/semanticSpellCheck/notationInDataFiles.txt in AFS)
The phrases were automatically extracted from some on-line novels whose copyright has expired (some of the English might be a bit archaic).
The relative probabilities of the various word pairs were estimated from the following counts obtained by querying a search engine some number of years ago (we are holding back some word-pairs for possible use during grading):
accept: 1,030,205 except: 1,365,015
affect: 562,435 effect: 1,771,070
right: 10,662,780 write: 4,559,945
there: 20,287,830 their: 17,427,090
too: 4,629,770 two: 16,896,825
You may use any (or none) of the annotation, the relative probabilities of the two words, some or all of the seven preceding and subsequent words, information combined over all of the words on the left or right of the center word, etc. in your similarity functions. For example, you might add 5 points to the similarity score if the word immediately to the left of the center position is an exact match in the two phrases; 3 points if they aren't an exact match but have the same root (i.e., stemmed) version; 2 if they are the same part of speech; and 1 if they are not the identical part of speech, but are related (e.g., proper noun vs. regular noun).
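The 5/3/2/1 scoring example above might look like this in Java. It is a sketch for a single position only. The rule used to decide that two parts of speech are "related" (their tags share the same first two characters, such as NN and NNP) is a hypothetical stand-in; the tag-meanings file should guide the real rule. Names are illustrative.

```java
// Sketch only: scores one pair of aligned words from two phrases.
class LeftNeighborScore {
    static int score(String wordA, String stemA, String tagA,
                     String wordB, String stemB, String tagB) {
        if (wordA.equals(wordB)) return 5;   // exact word match
        if (stemA.equals(stemB)) return 3;   // same stemmed root
        if (tagA.equals(tagB)) return 2;     // same part of speech
        if (related(tagA, tagB)) return 1;   // related part of speech
        return 0;
    }

    // ASSUMPTION: tags are "related" when their first two characters agree.
    static boolean related(String tagA, String tagB) {
        return tagA.length() >= 2 && tagB.length() >= 2
            && tagA.substring(0, 2).equals(tagB.substring(0, 2));
    }

    public static void main(String[] args) {
        System.out.println(score("walks", "WALK", "VBZ", "walked", "WALK", "VBD"));
    }
}
```

A full similarity function would sum such scores over several positions, possibly with larger weights for positions nearer the center word.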
You may choose whichever K you wish, though you might wish to pick an odd-valued one; that way you can simply take a majority vote of the nearest neighbors when categorizing a test phrase. (If you wish, you can also use some sort of weighting scheme to combine the answers in the K nearest neighbors, e.g., weight "votes" by each neighbor's distance to the test phrase.)
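A weighted vote can be sketched in Java as below. Here each neighbor's vote is weighted by its similarity score (higher means closer), which is one possible reading of "weight by distance" and is an assumption of this sketch. Ties go to wordX, also an arbitrary choice. Names are hypothetical.

```java
// Sketch only: combines the K nearest neighbors' labels using similarity as vote weight.
class WeightedVote {
    /**
     * labels[i] is the center word of neighbor i; sims[i] is its similarity to the test phrase.
     * Returns whichever of wordX / wordY collects the larger total weight.
     */
    static String vote(String[] labels, double[] sims, String wordX, String wordY) {
        double weightX = 0.0, weightY = 0.0;
        for (int i = 0; i < labels.length; i++) {
            if (labels[i].equals(wordX)) weightX += sims[i];
            else weightY += sims[i];
        }
        return weightX >= weightY ? wordX : wordY;
    }

    public static void main(String[] args) {
        String[] labels = {"too", "two", "two"};
        double[] sims = {9.0, 2.0, 3.0};
        // A plain majority vote would pick "two"; the weighted vote favors the one close match.
        System.out.println(vote(labels, sims, "too", "two"));
    }
}
```

Setting every weight to 1 recovers the simple majority vote.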
The intent is that you be creative in your design of your solution. Points for "creativity" and "effort" will be awarded during grading.
java CBR wordX wordY fractionXoverY fileOfTrainingCases fileOfTestPhrases
where
  wordX                one of the two words in the center of phrases
  wordY                the other one
  fractionXoverY       a floating point number, the ratio of occurrences of wordX to that of wordY
  fileOfTrainingCases  the name of the file containing the "training" phrases
  fileOfTestPhrases    the name of the file containing the "testing" phrases
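A minimal sketch of reading these five arguments follows. It also shows one way the ratio could be used: if r = count(X) / count(Y), then the prior probability of X is r / (1 + r). The class name and the file names in the usage note are hypothetical.

```java
// Sketch only: holds the five command-line arguments described above.
class CbrArgs {
    final String wordX, wordY, trainFile, testFile;
    final double fractionXoverY;

    CbrArgs(String[] args) {
        if (args.length != 5) {
            throw new IllegalArgumentException(
                "usage: java CBR wordX wordY fractionXoverY fileOfTrainingCases fileOfTestPhrases");
        }
        wordX = args[0];
        wordY = args[1];
        fractionXoverY = Double.parseDouble(args[2]);
        trainFile = args[3];
        testFile = args[4];
    }

    /** Prior probability of wordX implied by the ratio: r / (1 + r). */
    double priorX() {
        return fractionXoverY / (1.0 + fractionXoverY);
    }
}
```

For affect vs. effect, the counts listed earlier give a ratio of 562,435 / 1,771,070, or roughly 0.318, so a call might look like `java CBR affect effect 0.318 train.txt test.txt`, which corresponds to a prior of about 0.24 for "affect".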
Your program should output, both for design #1 and design #2, the phrases on which your solutions get the wrong answers, as well as the overall fraction correct on the test phrases. It should also output the number of training phrases read and the number of testing phrases read.
Remember: the test file will NOT be in the same format as the training file. Your code should, for each pair of phrases in the test files, predict the correct center word. Then see if the prediction matches the correct word in the test phrase. This is essentially the same experimental design we used when applying the decision-tree algorithm to the HW1 testsets.
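The reporting step can be sketched as a comparison of predicted and correct center words. The Java code below is illustrative only: it assumes the predictions and the correct words have already been collected into parallel arrays, and it prints an index where a real solution would print the misclassified phrase itself.

```java
// Sketch only: reports errors and returns the overall fraction correct.
class TestSetScore {
    static double fractionCorrect(String[] predicted, String[] actual) {
        int correct = 0;
        for (int i = 0; i < actual.length; i++) {
            if (predicted[i].equals(actual[i])) {
                correct++;
            } else {
                System.out.println("Wrong on test phrase #" + i + ": predicted "
                    + predicted[i] + ", correct word is " + actual[i]);
            }
        }
        return actual.length == 0 ? 0.0 : (double) correct / actual.length;
    }

    public static void main(String[] args) {
        String[] predicted = {"too", "two", "two", "too"};
        String[] actual = {"too", "two", "too", "too"};
        System.out.println("Fraction correct: " + fractionCorrect(predicted, actual));
    }
}
```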
You probably will want to write code that reads the contents of a file into a string (see getFileContents in /p/course/cs540-shavlik/public/HWs/Utils.java), and then uses a StringTokenizer to "walk" through each of the "words" (i.e., words and punctuation symbols) in the string.
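The read-then-tokenize approach might be sketched as follows. The `readAll` method here is a stand-in written with the standard library, not the course's getFileContents (whose signature is not shown in this handout), and the UTF-8 encoding is an assumption. Class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Sketch only: reads a whole file into a string, then walks its whitespace-separated pieces.
class PhraseFileReader {
    /** Stand-in for a file-reading utility. ASSUMPTION: files are UTF-8 text. */
    static String readAll(String fileName) throws IOException {
        return new String(Files.readAllBytes(Paths.get(fileName)), StandardCharsets.UTF_8);
    }

    /** Splits on spaces, tabs, and newlines (StringTokenizer's default delimiters). */
    static List<String> tokenize(String contents) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(contents);
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the name of a training or test file.
        List<String> tokens = tokenize(readAll(args[0]));
        System.out.println("Read " + tokens.size() + " tokens");
    }
}
```

Because the default delimiters include newlines, phrase boundaries have to be recovered from the file's own markers rather than from line breaks.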
The calling convention should be the same, except instead of CBR use naiveBayes.
You might want to create your own train and test sets by pulling out small subsets of the provided training and test sets.
Notice that, to be fair, you really should make THREE data sets: Create a TUNING set, and use it and your TRAIN set to evaluate and debug your designs, then when COMPLETELY DONE, apply your solutions to the TEST set. (I.e., you really should try to not "peak" [oops, make that "peek"] at the test set in advance.) However, we won't be this meticulous and will only use training and testing files.
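For those who do want a tuning set, one simple approach is to shuffle the training phrases with a fixed seed and hold out a fraction of them. The Java sketch below is only an illustration; the class name, the 20% fraction, and the seed in the usage note are arbitrary choices, not requirements.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Sketch only: splits examples into a TRAIN part and a held-out TUNING part.
class TuneSplit {
    /** Returns a two-element list: index 0 is the train set, index 1 is the tuning set. */
    static <T> List<List<T>> split(List<T> examples, double tuneFraction, long seed) {
        List<T> shuffled = new ArrayList<>(examples);
        Collections.shuffle(shuffled, new Random(seed)); // fixed seed keeps the split repeatable
        int nTune = (int) Math.round(shuffled.size() * tuneFraction);
        List<List<T>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(shuffled.subList(nTune, shuffled.size())));
        parts.add(new ArrayList<>(shuffled.subList(0, nTune)));
        return parts;
    }
}
```

For example, `TuneSplit.split(trainingPhrases, 0.2, 540L)` would hold out one fifth of the phrases for tuning, leaving the provided TEST file untouched until the designs are final.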
Electronically turn in your HW3 code in the same way you turned in HW1's code. Also turn in a printout of your commented code and a neatly written lab report that presents and justifies each of your two designs for each of the two approaches, and describes how well they worked on each of the three word pairs provided, affect-effect, right-write, and too-two.