
(Problems 1-5 are paper-and-pencil ones. Problems 2 and 5 in particular are intended to help get you up to speed on the algorithms underlying the programming portions of this homework.)
A  B  C  D  Prob
F  F  F  F  0.05
F  F  F  T  0.10
F  F  T  F  0.15
F  F  T  T  0.25
F  T  F  F  0.01
F  T  F  T  0.02
F  T  T  F  0.03
F  T  T  T  0.04
T  F  F  F  0.15
T  F  F  T  0.02
T  F  T  F  0.01
T  F  T  T  0.02
T  T  F  F  0.01
T  T  F  T  0.02
T  T  T  F  0.03
T  T  T  T  ?
exer = exercises regularly
flu = caught flu
Assume we know from past experience that:
P(exer) = 0.25
P(flu) = 0.33
P(flu | exer) = 0.20
(a) What's P(flu | ~exer)? ('~' means 'NOT')
(b) Given you find out someone has the flu, what's the probability they regularly exercise?
Be sure to show and explain your calculations for both parts (a) and (b).
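The two identities needed for parts (a) and (b) are the law of total probability and Bayes' rule. The following is a minimal Java sketch of both, not a solution: the class name, method names, and input values are made up for illustration, and the numbers are deliberately not the ones given above.

```java
class BayesRuleSketch {
    /**
     * P(B | not A), rearranged from the law of total probability:
     * P(B) = P(B | A) P(A) + P(B | not A) (1 - P(A)).
     */
    static double probBGivenNotA(double pA, double pB, double pBGivenA) {
        return (pB - pBGivenA * pA) / (1.0 - pA);
    }

    /** P(A | B) by Bayes' rule: P(B | A) P(A) / P(B). */
    static double probAGivenB(double pA, double pB, double pBGivenA) {
        return pBGivenA * pA / pB;
    }

    public static void main(String[] args) {
        // Illustrative inputs only; NOT the values from the problem statement.
        double pA = 0.5, pB = 0.4, pBGivenA = 0.6;
        System.out.println("P(B | ~A) = " + probBGivenNotA(pA, pB, pBGivenA));
        System.out.println("P(A | B)  = " + probAGivenB(pA, pB, pBGivenA));
    }
}
```

Your written answer should still show each substitution by hand, since the problem asks you to explain the calculation.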
(c) Liz has divided her books into two groups, those she likes and those she doesn't. For simplicity, assume no book contains a given word more than once.
The five (5) books that Liz likes contain (only) the following words:
animal (4 times), mineral (5 times), see (4 times), run (4 times)
The ten (10) books that Liz does not like contain (only) the following words:
animal (4 times), mineral (1 time), vegetable (9 times), see (2 times), spot (2 times), run (1 time)
Using the Naive Bayes assumption, determine whether it is more probable that Liz likes the following book than that she dislikes it. Again, be sure to show and explain your work. Make sure that none of your probabilities is zero by starting all your counters at 1 instead of 0. (The counts above result from starting at 0; i.e., imagine that there is one more book she likes that contains each of the above words exactly once, and one more book she dislikes that also contains each of the above words exactly once.)
see spot run vegetable // These words are the entire contents of this new book.
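A Naive Bayes comparison of this kind can be sketched in Java as below. This is one possible reading of the "start all counters at 1" hint, not the official solution: it assumes each word probability is (count + 1) / (number of books + 1), it multiplies in only the words that appear in the new book, and it takes the class prior as an explicit argument. The class name, method name, and toy data are hypothetical and are not the counts from the problem.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class NaiveBayesSketch {
    /**
     * Unnormalized score: prior * product over the given words of
     * P(word | class), where every counter starts at 1 (one imaginary
     * extra document that contains each word exactly once).
     */
    static double score(double prior, Map<String, Integer> docCounts,
                        int numDocs, List<String> words) {
        double s = prior;
        for (String w : words) {
            int count = docCounts.getOrDefault(w, 0);
            s *= (count + 1.0) / (numDocs + 1.0);
        }
        return s;
    }

    public static void main(String[] args) {
        // Toy data (hypothetical): class A has 3 documents, class B has 2.
        Map<String, Integer> countsA = new HashMap<>();
        countsA.put("cat", 2);
        countsA.put("dog", 1);
        Map<String, Integer> countsB = new HashMap<>();
        countsB.put("dog", 2);

        List<String> newDoc = Arrays.asList("cat", "dog");
        double scoreA = score(3.0 / 5.0, countsA, 3, newDoc);
        double scoreB = score(2.0 / 5.0, countsB, 2, newDoc);
        System.out.println(scoreA > scoreB ? "class A" : "class B");
    }
}
```

Only the comparison of the two scores matters, so there is no need to normalize them into true probabilities.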
Hint: when you answer all three parts of this question, remember that if you store the value for, say, Prob(C=true | A=true and B=true) you do not need to also store the value for Prob(C=false | A=true and B=true) since we know that these two probabilities sum to one.
Consider the Bayesian network drawn below.
(You can print out this image directly. It is at
/p/course/cs540-shavlik/public/HWs/HW3_Q4_BN.jpg)
F1 = false F2 = false F3 = false category = +
F1 = false F2 = false F3 = true category = +
F1 = false F2 = true F3 = false category = +
F1 = true F2 = true F3 = true category = -
F1 = true F2 = false F3 = false category = -
F1 = true F2 = false F3 = true category = -
Assume we have the following test set example, whose category we wish to estimate based on a case-based approach where we find the three (3) nearest neighbors and use the most common category among them as our estimate:
F1 = false F2 = true F3 = true category = ?
Propose and apply a (simple) similarity function
to use on this problem.
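One simple choice of similarity function is the number of features on which two examples agree. The Java sketch below applies that idea with a majority vote over the k most similar training examples. It is only an illustration: the class and method names are invented, the four-feature toy data are not the examples from this problem, and ties in similarity are broken by training order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class KnnSketch {
    /** Similarity = number of feature positions on which the two examples agree. */
    static int similarity(boolean[] a, boolean[] b) {
        int matches = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] == b[i]) matches++;
        }
        return matches;
    }

    /** Majority category among the k most similar training examples. */
    static String classify(boolean[][] features, String[] labels,
                           boolean[] query, int k) {
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < features.length; i++) order.add(i);
        // Most similar first; the sort is stable, so ties keep training order.
        order.sort(Comparator.comparingInt(
                (Integer i) -> -similarity(features[i], query)));

        Map<String, Integer> votes = new HashMap<>();
        String best = null;
        for (int r = 0; r < k && r < order.size(); r++) {
            String label = labels[order.get(r)];
            int v = votes.merge(label, 1, Integer::sum);
            if (best == null || v > votes.get(best)) best = label;
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical toy training set with four Boolean features.
        boolean[][] train = {
            {true, true, true, false},
            {true, false, false, false},
            {false, true, true, true},
            {false, false, false, false},
            {true, true, false, false},
        };
        String[] labels = {"+", "-", "+", "-", "-"};
        boolean[] query = {true, true, true, true};
        System.out.println(classify(train, labels, query, 3));
    }
}
```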
The collected pairs are available at:
http://www.cs.wisc.edu/~shavlik/cs540/semanticSpellCheck/
(or /p/course/cs540-shavlik/public/semanticSpellCheck/ in AFS)
The specific word pairs you will be using are:
accept vs. except,
among vs. between,
good vs. well, and
their vs. there.
For each word pair there is a TRAIN set and a TEST set. Important: the train and test files have different organization, as explained here.
Using the training data in the collections of (presumed) proper usage, determine which of X or Y is more likely the proper usage in each of the test sentences. In other words, our goal is to identify likely usage errors, which we'll do as follows: whenever we see word X in a phrase, we'll see if using word Y instead leads to a more plausible phrase and when we see word Y we'll see if using word X instead looks better.
Penn Treebank Tagset List
Also, the Porter Stemmer was used to extract the "root" version of each word (e.g., walk, walks, walked, and walking all get "stemmed" to WALK).
A phrase containing a word of interest is defined as follows:
the seven (7) words (tagged and stemmed) BEFORE the word X or Y
the center word, i.e., either word X or Y
(the center word is also tagged and stemmed)
the seven (7) words (tagged and stemmed) AFTER the word X or Y
Punctuation counts as one of the 7 before/after words.
Here is a sample phrase, though it uses only 3 before/after words for simplicity:
might [ MD MIGHT ] take [ VB TAKE ] to [ TO TO ] effect [ VB EFFECT ] it [ PRP IT ] , [* ,] but [ CC BUT ]
More information on the construction of these annotated data files is available at:
http://www.cs.wisc.edu/~shavlik/cs540/semanticSpellCheck/notationInDataFiles.txt
(or /p/course/cs540-shavlik/public/semanticSpellCheck/notationInDataFiles.txt in AFS)
The phrases were automatically extracted from twenty famous novels whose copyright has expired (some of the English might be a bit archaic).
The relative probabilities of the various word pairs were estimated from the following counts obtained by querying a search engine (we are holding back some word-pairs for possible use during grading):
accept: 27 million except: 172 million
among: 1.04 billion between: 626 million
good: 6.84 billion well: 5.33 billion
their: 6.85 billion there: 1.66 billion
You may use any (or none) of the annotation, the relative probabilities of the two words, some or all of the seven preceding and subsequent words, information combined over all of the words on the left or right of the center word, etc. in your similarity functions. For example, you might add 5 points to the similarity score if the word immediately to the left of the center position is an exact match in the two phrases; 3 points if they aren't an exact match but have the same root (i.e., stemmed) version; 2 if they are the same part of speech; and 1 if they are not the identical part of speech, but are related (e.g., proper noun vs. regular noun).
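The 5/3/2/1 example scoring above could be sketched in Java roughly as follows. Everything here is an assumption made for illustration: the `Token` class, the method names, and in particular the rule that two part-of-speech tags count as "related" when their first two characters match (so NNP and NN are related). Summing the per-position scores over an aligned context window is just one design you might try.

```java
class PhraseSimilaritySketch {
    /** One annotated word: surface form, part-of-speech tag, and stem. */
    static final class Token {
        final String word, pos, stem;
        Token(String word, String pos, String stem) {
            this.word = word;
            this.pos = pos;
            this.stem = stem;
        }
    }

    /** Assumption: tags sharing their first two characters are "related". */
    static boolean relatedPos(String p, String q) {
        return p.length() >= 2 && q.length() >= 2
                && p.substring(0, 2).equals(q.substring(0, 2));
    }

    /** 5 = same word, 3 = same stem, 2 = same tag, 1 = related tag, 0 otherwise. */
    static int tokenScore(Token a, Token b) {
        if (a.word.equals(b.word)) return 5;
        if (a.stem.equals(b.stem)) return 3;
        if (a.pos.equals(b.pos)) return 2;
        if (relatedPos(a.pos, b.pos)) return 1;
        return 0;
    }

    /** Sum of tokenScore over aligned positions of two context windows. */
    static int contextScore(Token[] c1, Token[] c2) {
        int total = 0;
        for (int i = 0; i < Math.min(c1.length, c2.length); i++) {
            total += tokenScore(c1[i], c2[i]);
        }
        return total;
    }

    public static void main(String[] args) {
        Token walks = new Token("walks", "VBZ", "WALK");
        Token walked = new Token("walked", "VBD", "WALK");
        System.out.println(tokenScore(walks, walked));
    }
}
```

You might also weight positions closer to the center word more heavily than distant ones; the sketch treats all positions equally.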
You may choose whichever K you wish, though you might wish to pick an odd-valued one; that way you can simply take a majority vote of the nearest neighbors when categorizing a test phrase (if you wish, you can also use some sort of weighting scheme to combine the answers in the K nearest neighbors, e.g., weight "votes" by each neighbor's distance to the test phrase).
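As a rough illustration of a weighted vote, the Java sketch below adds up each neighbor's similarity score per candidate word and returns the word with the largest total. It assumes the weights are non-negative similarities (larger means closer), which is only one of several reasonable weighting schemes; the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class WeightedVoteSketch {
    /**
     * Each neighbor votes for its label with weight equal to its similarity
     * to the test phrase. Assumes all similarities are non-negative.
     */
    static String weightedVote(String[] labels, double[] similarities) {
        Map<String, Double> totals = new HashMap<>();
        String best = null;
        for (int i = 0; i < labels.length; i++) {
            double t = totals.merge(labels[i], similarities[i], Double::sum);
            if (best == null || t > totals.get(best)) best = labels[i];
        }
        return best;
    }

    public static void main(String[] args) {
        // One very close "accept" neighbor outvotes two weaker "except" neighbors.
        String[] labels = {"accept", "except", "except"};
        double[] sims = {9.0, 4.0, 4.0};
        System.out.println(weightedVote(labels, sims));
    }
}
```

With weights like these, the result can differ from a plain majority vote, which is the point of weighting.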
The intent is that you be creative in your design of your solution. Points for "creativity" and "effort" will be awarded during grading.
java CBR wordX wordY fractionXoverY fileOfTrainingCases fileOfTestPhrases
where
    wordX                one of the two words in the center of phrases
    wordY                the other one
    fractionXoverY       a floating point number, the ratio of occurrences of wordX to that of wordY
    fileOfTrainingCases  the name of the file containing the "training" phrases
    fileOfTestPhrases    the name of the file containing the "testing" phrases
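If you want to use fractionXoverY as a prior, one option (an assumption, not a requirement of the assignment) is to convert the ratio r into P(X) = r / (1 + r) and P(Y) = 1 / (1 + r). A small Java sketch with hypothetical class and method names, using the accept/except counts listed earlier as the example ratio:

```java
class PriorSketch {
    /** P(wordX) implied by the ratio count(X) / count(Y). */
    static double priorX(double fractionXoverY) {
        return fractionXoverY / (1.0 + fractionXoverY);
    }

    /** P(wordY) implied by the same ratio; priorX + priorY = 1. */
    static double priorY(double fractionXoverY) {
        return 1.0 / (1.0 + fractionXoverY);
    }

    public static void main(String[] args) {
        // accept: 27 million, except: 172 million
        double ratio = 27.0 / 172.0;
        System.out.println("P(accept) = " + priorX(ratio));
        System.out.println("P(except) = " + priorY(ratio));
    }
}
```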
Your program should output, both for design #1 and design #2, the phrases on which your solutions get the wrong answers, as well as the overall fraction correct on the test phrases. It should also output the number of training phrases read and the number of testing phrases read.
Remember: the test file will NOT be in the same format as the training file. Your code should, for each pair of phrases in the test files, predict the correct center word. Then see if the prediction matches the correct word in the test phrase. This is essentially the same experimental design we used when applying the decision-tree algorithm to the HW1 testsets.
We have provided you with a sample class that you can use to begin building your CBR program. It manages the command-line parameters and reads the training and test files one line at a time. Read the inline comments for more information. Do not feel that you have to use this template; it is provided for your convenience if you would like to use it.
The calling convention should be the same, except instead of CBR use BayesNet. We have provided an optional template for this class here.
You might want to create your own train and test sets by pulling out small subsets of the provided training and test sets.
Notice that, to be fair, you really should make THREE data sets: create a TUNING set, and use it and your TRAIN set to evaluate and debug your designs; then, when COMPLETELY DONE, apply your solutions to the TEST set. (I.e., you really should try to not "peak" [oops, make that "peek"] at the test set in advance.) However, we won't be this meticulous and will only use training and testing files.
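If you do want to hold out a tuning set, a minimal Java sketch is shown below. It assumes each training case has already been read into one list element (how a case maps onto lines of the training file is described in the data-format notes, not here), and the class name, method name, and seed are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class TuningSplitSketch {
    /**
     * Shuffles the cases with a fixed seed and splits off a tuning set.
     * Returns a two-element list: index 0 = train, index 1 = tune.
     */
    static List<List<String>> split(List<String> cases, double tuneFraction, long seed) {
        List<String> shuffled = new ArrayList<>(cases);
        Collections.shuffle(shuffled, new Random(seed));
        int tuneSize = (int) Math.round(shuffled.size() * tuneFraction);

        List<List<String>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(shuffled.subList(tuneSize, shuffled.size())));
        parts.add(new ArrayList<>(shuffled.subList(0, tuneSize)));
        return parts;
    }

    public static void main(String[] args) {
        List<String> cases = new ArrayList<>();
        for (int i = 0; i < 10; i++) cases.add("case" + i);
        List<List<String>> parts = split(cases, 0.2, 540L);
        System.out.println("train size: " + parts.get(0).size());
        System.out.println("tune size:  " + parts.get(1).size());
    }
}
```

Fixing the seed keeps the split reproducible, so results from different designs are compared on the same tuning cases.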
Electronically turn in your HW3 code in the same way you turned in HW1's code. Also turn in a printout of your commented code and a neatly written lab report that presents and justifies each of your two designs for each of the two approaches, and describes how well they worked on each of the four word pairs provided: accept-except, among-between, good-well, and their-there. Please be thorough in your explanations in order to receive full credit. Additionally, include instructions on how to compile and run your code in your lab report.
Our TA, Nick, will post the answers to common questions from the class here as they come in.