CS540 HW3: Probabilistic Reasoning, Case-Based Reasoning, and Natural Language Processing

Assigned: 11/2/11
Due: 11/18/11 (cannot be turned in later than 11/23/11 at 4pm, due to Thanksgiving)
Value: 150 points

(Problems 1-5 are paper-and-pencil ones. Problems 2 and 5 in particular are intended to help get you up to speed on the algorithms underlying the programming portions of this homework.)

Problem 1: Full Joint Probability Distributions (15 points)

Consider this full joint probability distribution involving four Boolean-valued random variables (A-D):

  A   B   C   D     Prob

  F   F   F   F     0.05
  F   F   F   T     0.10
  F   F   T   F     0.15
  F   F   T   T     0.25

  F   T   F   F     0.01
  F   T   F   T     0.02
  F   T   T   F     0.03
  F   T   T   T     0.04

  T   F   F   F     0.15
  T   F   F   T     0.02
  T   F   T   F     0.01
  T   F   T   T     0.02

  T   T   F   F     0.01
  T   T   F   T     0.02
  T   T   T   F     0.03
  T   T   T   T      ?

  1. Compute P(A = true and B = true and C = true and D = true).

  2. Compute P(A = false and D = false).

  3. Compute P(A = true).

  4. Compute P(A = true | B = true and C = true and D = true).

  5. Compute P(A = false and B = false | C = false and D = true).
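
As a reminder of the mechanics (not a solution to the questions above), the following sketch answers marginal and conditional queries by summing rows of a full joint table. The two-variable table, its values, and all class and method names are made up for illustration.

```java
import java.util.function.IntPredicate;

// Illustrative only: a hypothetical joint distribution over two Boolean variables (X, Y).
// Row index bits: bit 1 = X, bit 0 = Y, so index 3 means X = true and Y = true.
class JointTableSketch {
    static final double[] JOINT = {0.40, 0.10, 0.20, 0.30}; // hypothetical values, sum to 1

    // P(event): add up every row in which the event holds.
    static double prob(IntPredicate event) {
        double sum = 0.0;
        for (int row = 0; row < JOINT.length; row++) {
            if (event.test(row)) sum += JOINT[row];
        }
        return sum;
    }

    // P(query | evidence) = P(query and evidence) / P(evidence).
    static double conditional(IntPredicate query, IntPredicate evidence) {
        return prob(query.and(evidence)) / prob(evidence);
    }

    public static void main(String[] args) {
        IntPredicate xTrue = row -> (row & 2) != 0;
        IntPredicate yTrue = row -> (row & 1) != 0;
        System.out.println("P(X = true) = " + prob(xTrue));
        System.out.println("P(X = true | Y = true) = " + conditional(xTrue, yTrue));
    }
}
```

The same two operations (sum the matching rows, then divide by the probability of the evidence) carry over to a table with four variables.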

Problem 2: Bayes' Rule and Text Processing (20 points)

Define the following two variables about people:
      exer       =  exercises regularly
      flu        =  caught flu
Assume we know from past experience that:
      P(exer)        =  0.25
      P(flu)         =  0.33
      P(flu | exer)  =  0.20

(a) What's P(flu | ~exer)? ('~' means 'NOT')

(b) Given you find out someone has the flu, what's the probability they regularly exercise?

Be sure to show and explain your calculations for both parts (a) and (b).
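
Parts (a) and (b) rest on two standard identities: the law of total probability and Bayes' rule. The sketch below states them as generic helpers for two events a and b. The class name, method names, and the numbers in main are hypothetical and deliberately differ from the values above.

```java
// Illustrative only: generic helpers for two events a and b; the numbers in main are made up.
class BayesRuleSketch {
    // Law of total probability, solved for P(b | not a):
    // P(b) = P(b | a) P(a) + P(b | not a) (1 - P(a)).
    static double givenNot(double pA, double pB, double pBgivenA) {
        return (pB - pBgivenA * pA) / (1.0 - pA);
    }

    // Bayes' rule: P(a | b) = P(b | a) P(a) / P(b).
    static double bayes(double pA, double pB, double pBgivenA) {
        return pBgivenA * pA / pB;
    }

    public static void main(String[] args) {
        double pA = 0.4, pB = 0.5, pBgivenA = 0.8; // hypothetical values
        System.out.println("P(b | not a) = " + givenNot(pA, pB, pBgivenA));
        System.out.println("P(a | b)     = " + bayes(pA, pB, pBgivenA));
    }
}
```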

(c) Liz has divided her books into two groups, those she likes and those she doesn't. For simplicity, assume no book contains a given word more than once.

      The five (5) books that Liz likes contain (only) the following words:

	animal (4 times), mineral (5 times), see (4 times), run (4 times)

      The ten (10) books that Liz does not like contain (only) the following words:

	animal (4 times), mineral (1 time), vegetable (9 times), see (2 times), spot (2 times), run (1 time)

Using the Naive Bayes assumption, determine whether it is more probable that Liz likes the following book or that she dislikes it. Again, be sure to show and explain your work. To ensure that none of your probabilities is zero, start all your counters at 1 instead of 0. (The counts above result from starting at 0. In other words, imagine that there is one more book she likes that contains each of the above words exactly once, and one more book she dislikes that also contains each of the above words exactly once.)

	see spot run vegetable	 	// These words are the entire contents of this new book.
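
One possible way to organize this kind of calculation is sketched below, on a tiny hypothetical vocabulary rather than Liz's books. It assumes that only the words appearing in the new book are scored and that the class prior also counts the imagined extra book in each group. Both are assumptions, as are all names and counts in the code.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: Naive Bayes over the words of a new book, with every counter started at 1.
// The vocabulary and counts in main are hypothetical, not the ones from this problem.
class NaiveBayesSketch {
    // Unnormalized score: P(class) times the product of P(word | class) for each word in the book.
    static double score(Map<String, Integer> counts, int booksInClass, int booksTotal,
                        String[] newBook) {
        // Prior that counts one imagined extra book per class (an assumption).
        double s = (booksInClass + 1.0) / (booksTotal + 2.0);
        for (String word : newBook) {
            // Counter starts at 1, and the imagined extra book adds 1 to the class size.
            s *= (counts.getOrDefault(word, 0) + 1.0) / (booksInClass + 1.0);
        }
        return s;
    }

    public static void main(String[] args) {
        Map<String, Integer> liked = new HashMap<>();
        liked.put("cat", 3);
        liked.put("dog", 1);
        Map<String, Integer> disliked = new HashMap<>();
        disliked.put("dog", 2); // "cat" never appears in a disliked book
        String[] newBook = {"cat"};
        double likeScore = score(liked, 3, 5, newBook);
        double dislikeScore = score(disliked, 2, 5, newBook);
        System.out.println(likeScore > dislikeScore ? "like" : "dislike");
    }
}
```

Because only the two scores are compared, they do not need to be normalized to sum to 1.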

 

Problem 3: Representing Probability Distributions (15 points)

Assume the task at hand involves 26 Boolean-valued random variables, which we'll name A through Z.

  1. How big of a table (number of memory cells) would we need to explicitly represent the full, joint probability distribution over every possible combination of our 26 Boolean-valued random variables?

  2. How big of a table would we need if we make the conditional independence assumption that each variable is independent of all other variables conditioned on the value of Z?

  3. This time assume that we have a Bayesian network where the following nodes have the parents listed (if a node is not listed, then it has no parents in the Bayesian network):

Draw this Bayesian network and, next to each node, report how many cells are needed to store that node's conditional probability table (CPT). Explain your answer (you need not explain your answer within your drawing of the Bayesian network; it is fine to place your explanation below your drawing). Finally, report the total number of cells needed to store this Bayesian network (be sure to count the memory needed to store the parent links; assume each link uses the same number of bytes as one cell in a probability table, i.e., each parent link counts as 1).

Hint: when you answer all three parts of this question, remember that if you store the value for, say, Prob(C=true | A=true and B=true) you do not need to also store the value for Prob(C=false | A=true and B=true) since we know that these two probabilities sum to one.
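
The counting rules implied by the hint can be sketched as follows for Boolean variables. The class and method names are hypothetical, and the three-node network in main is a made-up example, not the network for this problem.

```java
// Illustrative only: cell counts for Boolean variables, using the hint that
// P(false | ...) = 1 - P(true | ...) and therefore need not be stored.
class TableSizeSketch {
    // Full joint over n Boolean variables: 2^n rows, minus one because the rows sum to 1.
    static long fullJointCells(int n) {
        return (1L << n) - 1;
    }

    // CPT for one Boolean node: one stored P(node = true | ...) per combination of parent values.
    static long cptCells(int numParents) {
        return 1L << numParents;
    }

    // Total for a network given each node's parent count: CPT cells plus one cell per parent link.
    static long networkCells(int[] parentCounts) {
        long total = 0;
        for (int k : parentCounts) total += cptCells(k) + k;
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical 3-node network: A and B have no parents, C has parents A and B.
        System.out.println(fullJointCells(3));                  // 7
        System.out.println(networkCells(new int[] {0, 0, 2}));  // 1 + 1 + (4 + 2) = 8
    }
}
```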

Problem 4: Bayesian Networks (15 points)

Consider the Bayesian network drawn below.

[Figure: Bayesian Network for Problem 4 of HW3]

(You can print out this image directly. It is at /p/course/cs540-shavlik/public/HWs/HW3_Q4_BN.jpg)

Show your work for the following calculations.

  1. Compute P(A = true and B = true and C = true and D = true).

  2. Compute P(D = true | A = true and B = true and C = true).

  3. Compute P(B = true | A = false and C = false).

  4. Compute P(C = false | B = false).

Problem 5: Case-Based Reasoning (10 points)

Imagine that we have the following examples, represented using three Boolean-valued features:
    F1 = false  F2 = false  F3 = false    category = +
    F1 = false  F2 = false  F3 = true     category = +
    F1 = false  F2 = true   F3 = false    category = +
    F1 = true   F2 = true   F3 = true     category = -
    F1 = true   F2 = false  F3 = false    category = -
    F1 = true   F2 = false  F3 = true     category = -

Assume we have the following test set example, whose category we wish to estimate based on a case-based approach where we find the three (3) nearest neighbors and use the most common category among them as our estimate:

    F1 = false   F2 = true   F3 = true   category = ?

Propose and apply a (simple) similarity function to use on this problem.
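
For orientation, here is a minimal sketch of nearest-neighbor voting over Boolean features, where similarity is simply the number of features on which two examples agree. That similarity function is only one possible choice, and the four two-feature cases in main are hypothetical, not the examples listed above.

```java
import java.util.Arrays;

// Illustrative only: k-nearest-neighbor voting where similarity = number of matching features.
// The four two-feature cases in main are hypothetical, not the examples from this problem.
class BooleanKnnSketch {
    static int similarity(boolean[] a, boolean[] b) {
        int matches = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] == b[i]) matches++;
        }
        return matches;
    }

    // Returns '+' or '-' by majority vote among the k most similar stored cases (k should be odd).
    static char classify(boolean[][] cases, char[] labels, boolean[] query, int k) {
        Integer[] order = new Integer[cases.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        // Most similar first; ties keep their original order because this sort is stable.
        Arrays.sort(order, (i, j) -> similarity(cases[j], query) - similarity(cases[i], query));
        int plusVotes = 0;
        for (int n = 0; n < k; n++) {
            if (labels[order[n]] == '+') plusVotes++;
        }
        return plusVotes * 2 > k ? '+' : '-';
    }

    public static void main(String[] args) {
        boolean[][] cases = {{true, true}, {true, false}, {false, false}, {false, true}};
        char[] labels = {'+', '+', '-', '-'};
        System.out.println(classify(cases, labels, new boolean[] {true, true}, 3));
    }
}
```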

Problem 6: Creating a "Semantic" Spelling Checker (75 points)

In the remainder of this homework, you will design, implement, and test two approaches for addressing the following task.

Given

For various pairs of often confused words, such as accept versus except, we have collected sample phrases that include these words; we assume that these sample uses are correct.

The collected pairs are available at:

     http://www.cs.wisc.edu/~shavlik/cs540/semanticSpellCheck/
     (or /p/course/cs540-shavlik/public/semanticSpellCheck/ in AFS)

The specific word pairs you will be using are: accept vs. except, among vs. between, good vs. well, and their vs. there.

For each word pair there is a TRAIN set and a TEST set. Important: the train and test files have different organization, as explained here.

Do

Call the chosen paired words X and Y and consider "testset" sentences containing either word X or Y.

Using the training data in the collections of (presumed) proper usage, determine which of X or Y is more likely the proper usage in each of the test sentences. In other words, our goal is to identify likely usage errors, which we'll do as follows: whenever we see word X in a phrase, we'll see if using word Y instead leads to a more plausible phrase and when we see word Y we'll see if using word X instead looks better.

Additional Details

The sentences are augmented with additional linguistic knowledge. The Stanford Natural Language Processing Group's part-of-speech (POS) tagger was applied to label each word in its context, as a noun, verb, adjective, etc. The table of POS codes used to label the words in the sample phrases is available at:
     Penn Treebank Tagset List

Also, the Porter Stemmer was used to extract the "root" version of each word (e.g., walk, walks, walked, and walking all get "stemmed" to WALK).

A phrase containing a word of interest is defined as follows:

     the seven (7) words (tagged and stemmed) BEFORE the word X or Y

     the center word, i.e., either word X or Y
     (the center word is also tagged and stemmed)

     the seven (7) words (tagged and stemmed) AFTER the word X or Y

Punctuation counts as one of the 7 before/after words.

Here is a sample phrase, though it uses only 3 before/after words for simplicity:

  might [ MD MIGHT ] take [ VB TAKE ] to [ TO TO ] effect [ VB EFFECT ] it [ PRP IT ] , [* ,] but [ CC BUT ] 
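
A sketch of how such a phrase might be split into (word, POS tag, stem) triples follows. The token layout "word [ POS STEM ]" is inferred from the single sample line above, so treat the regular expression as an assumption and check it against the actual data files. The class and field names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: splits a phrase into (word, POS tag, stem) triples.
// The layout "word [ POS STEM ]" is an assumption based on the one sample line.
class PhraseParseSketch {
    static final class Token {
        final String word, pos, stem;
        Token(String word, String pos, String stem) {
            this.word = word;
            this.pos = pos;
            this.stem = stem;
        }
    }

    // Matches e.g. "take [ VB TAKE ]" and also the punctuation form ", [* ,]".
    static final Pattern TOKEN =
            Pattern.compile("(\\S+)\\s+\\[\\s*(\\S+)\\s+(\\S+?)\\s*\\]");

    static List<Token> parse(String phrase) {
        List<Token> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(phrase);
        while (m.find()) {
            tokens.add(new Token(m.group(1), m.group(2), m.group(3)));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String sample = "might [ MD MIGHT ] take [ VB TAKE ] to [ TO TO ] effect [ VB EFFECT ] "
                      + "it [ PRP IT ] , [* ,] but [ CC BUT ]";
        List<Token> tokens = parse(sample);
        Token center = tokens.get(tokens.size() / 2); // middle token is the word of interest
        System.out.println(tokens.size() + " tokens, center = " + center.word + "/" + center.pos);
    }
}
```

Taking the middle token as the center word only works when the phrase has the same number of words on both sides, which phrases near the start or end of a sentence may not.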

More information on the construction of these annotated data files is available at:

     http://www.cs.wisc.edu/~shavlik/cs540/semanticSpellCheck/notationInDataFiles.txt
     (or /p/course/cs540-shavlik/public/semanticSpellCheck/notationInDataFiles.txt in AFS)

The phrases were automatically extracted from twenty famous novels whose copyright has expired (some of the English might be a bit archaic).

The relative probabilities of the various word pairs were estimated from the following counts obtained by querying a search engine (we are holding back some word-pairs for possible use during grading):

     accept:    27 million    except:  172 million 
     among:   1.04 billion    between: 626 million
     good:    6.84 billion    well:   5.33 billion
     their:   6.85 billion    there:  1.66 billion 

Case-Based Reasoning

Design, justify, implement, and evaluate TWO (2) reasonably different "similarity" functions for finding the K nearest neighbors to a test phrase. I.e., this function should compare the test phrase to all the known phrases (for the word-pair in question), find the K most similar previous cases, then output the most likely word for the center of the test phrase.

You may use any (or none) of the annotation, the relative probabilities of the two words, some or all of the seven preceding and subsequent words, information combined over all of the words on the left or right of the center word, etc. in your similarity functions. For example, you might add 5 points to the similarity score if the word immediately to the left of the center position is an exact match in the two phrases; 3 points if they aren't an exact match but have the same root (i.e., stemmed) version; 2 if they are the same part of speech; and 1 if they are not the identical part of speech, but are related (e.g., proper noun vs. regular noun).

You may choose whichever K you wish, though you might wish to pick an odd-valued one; that way you can simply take a majority vote of the nearest neighbors when categorizing a test phrase (if you wish, you can also use some sort of weighting scheme to combine the answers in the K nearest neighbors, e.g., weight "votes" by each neighbor's distance to the test phrase).
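
To make the example scoring scheme and the weighted-vote idea concrete, here is a rough sketch for a single word position. All names are hypothetical. The "related part of speech" test (tags sharing their first two letters, such as NN and NNP) is a crude stand-in of our own, and votes are weighted by similarity rather than distance.

```java
// Illustrative only: the 5/3/2/1 example scheme for one word position, plus a weighted vote.
// Each token is a String[] of {word, posTag, stem}; all names here are hypothetical.
class SimilaritySketch {
    static int slotScore(String[] a, String[] b) {
        if (a[0].equalsIgnoreCase(b[0])) return 5;              // exact word match
        if (a[2].equals(b[2])) return 3;                        // same stem
        if (a[1].equals(b[1])) return 2;                        // same part of speech
        if (a[1].length() >= 2 && b[1].length() >= 2
                && a[1].regionMatches(0, b[1], 0, 2)) return 1; // related tags (assumption)
        return 0;
    }

    // Weighted vote: each neighbor adds its similarity to the total for the word it used.
    // Returns true if word X wins; ties go to word Y here, which is an arbitrary choice.
    static boolean voteForX(int[] similarities, boolean[] neighborUsedX) {
        int forX = 0, forY = 0;
        for (int i = 0; i < similarities.length; i++) {
            if (neighborUsedX[i]) forX += similarities[i];
            else forY += similarities[i];
        }
        return forX > forY;
    }

    public static void main(String[] args) {
        String[] test = {"walks", "VBZ", "WALK"};
        String[] stored = {"walked", "VBD", "WALK"};
        System.out.println(slotScore(test, stored)); // different word, same stem
        System.out.println(voteForX(new int[] {5, 3, 3}, new boolean[] {true, false, false}));
    }
}
```

A full similarity function would sum such slot scores over several positions, possibly giving positions closer to the center word a larger weight.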

The intent is that you be creative in your design of your solution. Points for "creativity" and "effort" will be awarded during grading.

Calling Conventions

You should write a Java class with the following calling conventions:
      java CBR wordX wordY fractionXoverY fileOfTrainingCases fileOfTestPhrases

where
     wordX               one of the two words in the center of phrases
     wordY               the other one
     fractionXoverY      a floating point number, the ratio of
                         occurrences of wordX to that of wordY
     fileOfTrainingCases the name of the file containing the "training" phrases
     fileOfTestPhrases   the name of the file containing the "testing"  phrases
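
A minimal sketch of reading these five arguments, and of turning fractionXoverY into prior probabilities for the two words, might look like this. The class is named CbrArgsSketch (a hypothetical name) so that it does not clash with your real CBR class, and it does not read any files.

```java
// Illustrative only: reading the five arguments and turning fractionXoverY into priors.
class CbrArgsSketch {
    // If count(X) / count(Y) = r, then P(X) = r / (1 + r) and P(Y) = 1 / (1 + r).
    static double priorX(double fractionXoverY) {
        return fractionXoverY / (1.0 + fractionXoverY);
    }

    public static void main(String[] args) {
        if (args.length != 5) {
            System.err.println("usage: java CBR wordX wordY fractionXoverY "
                             + "fileOfTrainingCases fileOfTestPhrases");
            return;
        }
        String wordX = args[0];
        String wordY = args[1];
        double ratio = Double.parseDouble(args[2]);
        String trainFile = args[3];
        String testFile = args[4];
        System.out.printf("P(%s) = %.3f, P(%s) = %.3f%n",
                wordX, priorX(ratio), wordY, 1.0 - priorX(ratio));
        System.out.println("train = " + trainFile + ", test = " + testFile);
    }
}
```

For accept vs. except, for instance, the search-engine counts listed earlier would give a ratio of 27/172.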

Your program should output, both for design #1 and design #2, the phrases on which your solutions get the wrong answers, as well as the overall fraction correct on the test phrases. It should also output the number of training phrases read and the number of testing phrases read.

Remember: the test file will NOT be in the same format as the training file. Your code should, for each pair of phrases in the test files, predict the correct center word. Then see if the prediction matches the correct word in the test phrase. This is essentially the same experimental design we used when applying the decision-tree algorithm to the HW1 testsets.

We have provided you with a sample class that you can use to begin building your CBR program. It manages the command-line parameters and reads the training and test files one line at a time. Read the inline comments for more information. Do not feel that you have to use this template; it is provided for your convenience if you would like to use it.

Bayesian Reasoning

Repeat the above, though this time instead of creating two similarity functions, devise, justify, and compare two (2) reasonably different designs for applying the Bayes Network algorithm. It is acceptable to use Naive Bayes for both designs (though there should be some interesting differences between your two approaches), but you might also want to create another Bayesian network that goes beyond Naive Bayes.

The calling convention should be the same, except instead of CBR use BayesNet. We have provided an optional template for this class here.
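
One practical detail when multiplying many small probabilities is numerical underflow; summing logarithms avoids it. The sketch below shows that idea only. The names and the feature probabilities in main are hypothetical, and the probabilities are assumed to be already smoothed so that none is zero.

```java
// Illustrative only: combining a prior with per-feature likelihoods in log space.
// The feature probabilities passed in are assumed to be smoothed already (never 0).
class LogNaiveBayesSketch {
    static double logScore(double prior, double[] featureProbs) {
        double total = Math.log(prior);
        for (double p : featureProbs) total += Math.log(p);
        return total;
    }

    // True if word X is the more probable center word under the Naive Bayes assumption.
    static boolean preferX(double priorX, double[] probsGivenX, double[] probsGivenY) {
        return logScore(priorX, probsGivenX) > logScore(1.0 - priorX, probsGivenY);
    }

    public static void main(String[] args) {
        double[] givenX = {0.20, 0.05}; // hypothetical P(feature | X) values
        double[] givenY = {0.10, 0.01}; // hypothetical P(feature | Y) values
        System.out.println(preferX(0.3, givenX, givenY));
    }
}
```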

Experimentation to Do and What to Turn In

Your solution should work for ANY pair of words, including pairs for which sample data has not been provided. In other words, you need to write a general-purpose solution. We'll be testing your solutions on 3-4 different word pairs.

You might want to create your own train and test sets by pulling out small subsets of the provided training and test sets.

Note that, to be fair, you really should make THREE data sets: create a TUNING set, and use it and your TRAIN set to evaluate and debug your designs; then, when COMPLETELY DONE, apply your solutions to the TEST set. (I.e., you really should try not to "peak" [oops, make that "peek"] at the test set in advance.) However, we won't be this meticulous and will only use training and testing files.
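
If you do want a tuning set, one simple way to carve it out of the training phrases is sketched below. The 20% fraction, the fixed seed, and all names are arbitrary choices for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Illustrative only: hold out part of the training phrases as a tuning set.
class TuneSplitSketch {
    // Returns a two-element list {train, tune}; the input list is left unchanged.
    static List<List<String>> split(List<String> phrases, double tuneFraction, long seed) {
        List<String> shuffled = new ArrayList<>(phrases);
        Collections.shuffle(shuffled, new Random(seed)); // fixed seed keeps the split repeatable
        int tuneSize = (int) Math.round(shuffled.size() * tuneFraction);
        List<String> tune = new ArrayList<>(shuffled.subList(0, tuneSize));
        List<String> train = new ArrayList<>(shuffled.subList(tuneSize, shuffled.size()));
        List<List<String>> result = new ArrayList<>();
        result.add(train);
        result.add(tune);
        return result;
    }

    public static void main(String[] args) {
        List<String> phrases = new ArrayList<>();
        for (int i = 0; i < 10; i++) phrases.add("phrase " + i);
        List<List<String>> parts = split(phrases, 0.2, 42L);
        System.out.println(parts.get(0).size() + " train, " + parts.get(1).size() + " tune");
    }
}
```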

Electronically turn in your HW3 code in the same way you turned in HW1's code. Also turn in a printout of your commented code and a neatly written lab report that presents and justifies each of your two designs for each of the two approaches, and describes how well they worked on each of the four word pairs provided: accept-except, among-between, good-well, and their-there. Please be thorough in your explanations in order to receive full credit. Additionally, include instructions on how to compile and run your code in your lab report.

Our TA, Nick, will post the answers to common questions from the class here as they come in.