CS540 HW3: Probabilistic Reasoning, Case-Based Reasoning, and Natural Language Processing

Assigned: 3/31/08
Due: 4/18/08
Value: 150 points

(Problems 1-5 are paper-and-pencil ones. Problems 2 and 5 in particular are intended to help get you up to speed on the algorithms underlying the programming portions of this homework.)

Problem 1: Full Joint Probability Distributions (15 points)

Consider this full joint probability distribution involving four Boolean-valued random variables (A-D):

  A   B   C   D     Prob

  F   F   F   F     0.10
  F   F   F   T     0.01
  F   F   T   F     0.05
  F   F   T   T     0.15

  F   T   F   F     0.02
  F   T   F   T     0.03
  F   T   T   F     0.04
  F   T   T   T     0.05

  T   F   F   F     0.20
  T   F   F   T     0.01
  T   F   T   F     0.01
  T   F   T   T     0.03

  T   T   F   F     0.02
  T   T   F   T     0.04
  T   T   T   F     0.08
  T   T   T   T      ?

  1. Compute P(A = true and B = true and C = true and D = true).

  2. Compute P(A = false | B = true and C = true and D = false).

  3. Compute P(B = false | A = false and D = true).

  4. Compute P(B = false).

  5. Compute P(A = false or B = true | C = true or D = true).
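Every query of this kind reduces to summing rows of the joint table, since P(X | Y) = P(X and Y) / P(Y). The following is a minimal Java sketch of that bookkeeping. The class name JointTable, its methods, and the two-variable table in main are all hypothetical; the table is deliberately NOT the one given above.

```java
import java.util.function.Predicate;

/** Sketch: answer queries against a full joint distribution by summing matching rows. */
final class JointTable {
    private final boolean[][] rows;  // one truth assignment per row
    private final double[] probs;    // probability of each row

    JointTable(boolean[][] rows, double[] probs) {
        this.rows = rows;
        this.probs = probs;
    }

    /** P(event): total probability of the rows in which the event holds. */
    double prob(Predicate<boolean[]> event) {
        double sum = 0.0;
        for (int i = 0; i < rows.length; i++) {
            if (event.test(rows[i])) {
                sum += probs[i];
            }
        }
        return sum;
    }

    /** P(event | given) = P(event and given) / P(given). */
    double cond(Predicate<boolean[]> event, Predicate<boolean[]> given) {
        return prob(event.and(given)) / prob(given);
    }

    public static void main(String[] args) {
        // Hypothetical table over two variables (X, Y); not the homework's table.
        boolean[][] rows = {{false, false}, {false, true}, {true, false}, {true, true}};
        double[] probs = {0.1, 0.2, 0.3, 0.4};
        JointTable t = new JointTable(rows, probs);
        System.out.println(t.prob(r -> r[0]));             // P(X = true)
        System.out.println(t.cond(r -> r[0], r -> r[1]));  // P(X = true | Y = true)
    }
}
```

Disjunctions such as "A = false or B = true" fit the same pattern: the predicate simply tests both conditions with ||.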

Problem 2: Bayes' Rule and Text Processing (20 points)

Define the following two variables about people:
      wh         =  washes hands regularly
      cc         =  caught cold
Assume we know from past experience that:
      P(wh)      =  0.75
      P(cc)      =  0.25
      P(cc | wh) =  0.10

(a) What's P(cc | ~wh)? ('~' means 'NOT')

(b) Given you find out someone has a cold, what's the probability they regularly wash their hands?

Be sure to show and explain your calculations for both parts (a) and (b).
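Parts (a) and (b) rest on two identities: the law of total probability, P(E) = P(E | H) P(H) + P(E | ~H) P(~H), and Bayes' rule, P(H | E) = P(E | H) P(H) / P(E). A small Java sketch of both follows. The class BayesRule, its method names, and the numbers in main are hypothetical and chosen only to exercise the formulas; they are not the values given above.

```java
/** Sketch: Bayes' rule and the law of total probability for two Boolean events. */
final class BayesRule {
    /** P(H | E) = P(E | H) * P(H) / P(E). */
    static double posterior(double pEGivenH, double pH, double pE) {
        return pEGivenH * pH / pE;
    }

    /** Solves P(E) = P(E|H) P(H) + P(E|~H) P(~H) for P(E | ~H). */
    static double likelihoodGivenNot(double pE, double pEGivenH, double pH) {
        return (pE - pEGivenH * pH) / (1.0 - pH);
    }

    public static void main(String[] args) {
        // Hypothetical inputs: P(H) = 0.5, P(E) = 0.3, P(E | H) = 0.4.
        double pH = 0.5;
        double pE = 0.3;
        double pEGivenH = 0.4;
        System.out.println(likelihoodGivenNot(pE, pEGivenH, pH));  // P(E | ~H)
        System.out.println(posterior(pEGivenH, pH, pE));           // P(H | E)
    }
}
```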

(c) Margaret has divided her books into two groups, those she likes and those she doesn't.

      The five (5) books that Margaret likes contain (only) the following words:

	animal (7 times), mineral (8 times), vegetable (5 times), see (3 times)

      The ten (10) books that Margaret does not like contain (only) the following words:

	animal (4 times), mineral (1 time), vegetable (6 times), spot (2 times), run (9 times)

Using the Naive Bayes assumption, determine whether it is more probable that Margaret likes the following book than that she dislikes it. Again, be sure to show and explain your work.

	see spot spot animal run	 	// These words are the entire contents of this new book.
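Under the Naive Bayes assumption, the score of a category is its prior times the product of the per-word probabilities P(word | category). The Java sketch below works in log space and uses add-one (Laplace) smoothing so that a word never seen in a category does not force the whole product to zero. The smoothing scheme is an assumption of this sketch, not something the problem prescribes, and the class name, method name, and word counts are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch: bag-of-words Naive Bayes scoring with add-one smoothing (an assumed choice). */
final class NaiveBayesText {
    /** Returns log P(category) + sum over document words of log P(word | category). */
    static double logScore(double prior, Map<String, Integer> counts, int vocabSize, String[] doc) {
        int total = 0;
        for (int c : counts.values()) {
            total += c;
        }
        double score = Math.log(prior);
        for (String w : doc) {
            int c = counts.getOrDefault(w, 0);
            // Add-one smoothing: unseen words get a small non-zero probability.
            score += Math.log((c + 1.0) / (total + vocabSize));
        }
        return score;
    }

    public static void main(String[] args) {
        // Hypothetical counts over a three-word vocabulary {red, blue, green}.
        Map<String, Integer> liked = new HashMap<>();
        liked.put("red", 3);
        liked.put("blue", 1);
        Map<String, Integer> disliked = new HashMap<>();
        disliked.put("green", 2);
        disliked.put("blue", 2);
        String[] doc = {"red", "blue"};
        System.out.println(logScore(0.5, liked, 3, doc) > logScore(0.5, disliked, 3, doc));
    }
}
```

Only the comparison between the two scores matters, so the shared normalizing constant P(document) is never computed.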

 

Problem 3 - Representing Probability Distributions (15 points)

Assume the task at hand involves 26 Boolean-valued random variables, which we'll name A through Z.

  1. How big of a table (number of memory cells) would we need to explicitly represent the full joint probability distribution over every possible combination of our 26 Boolean-valued random variables?

  2. How big of a table would we need if we make the independence assumption that each variable is independent of all other variables?

  3. This time assume that we have a Bayesian network where the following nodes have the parents listed (if a node is not listed, then it has no parents in the Bayesian network):

Draw this Bayesian network and next to each node report how many cells are needed to store that node's conditional probability table (CPT). Explain your answer (you need not explain your answer within your drawing of the Bayesian network - it is fine to place your explanation below your drawing). Finally, report the total number of cells needed to store this Bayesian network (be sure to count the memory needed to store the parent links - assume each link uses the same number of bytes as one cell in a probability table, i.e., each parent link counts as 1 cell).

Hint: when you answer all three parts of this question, remember that if you store the value for, say, Prob(C=true | A=true and B=true) you do not need to also store the value for Prob(C=false | A=true and B=true) since we know that these two probabilities sum to one.
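The counting rules implied by the hint can be written down compactly. The Java sketch below assumes Boolean variables throughout and stores only the "true" probability in each cell, as the hint suggests. The class and method names are hypothetical, and the three-node network in main is made up for illustration; it is not the network of this problem.

```java
/** Sketch: table sizes for Boolean variables, storing only P(variable = true) per cell. */
final class TableSizes {
    /** Full joint over n Boolean variables: 2^n entries, minus one because they sum to 1. */
    static long fullJointCells(int n) {
        return (1L << n) - 1;
    }

    /** Fully independent variables: one cell, P(variable = true), per variable. */
    static long independentCells(int n) {
        return n;
    }

    /** CPT of a node with k Boolean parents: one cell per combination of parent values. */
    static long cptCells(int numParents) {
        return 1L << numParents;
    }

    /** Whole network: each node's CPT plus one cell per parent link. */
    static long networkCells(int[] parentCounts) {
        long total = 0;
        for (int k : parentCounts) {
            total += cptCells(k) + k;
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical 3-node network: two root nodes and one node with two parents.
        int[] parentCounts = {0, 0, 2};
        System.out.println(fullJointCells(3));           // 7
        System.out.println(independentCells(3));         // 3
        System.out.println(networkCells(parentCounts));  // 8
    }
}
```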

Problem 4 - Bayesian Networks (15 points)

Consider the Bayesian network drawn below.

[Figure: Bayesian Network for Problem 4 of HW3]

(You can print out this image directly. It is at /p/course/cs540-shavlik/public/HWs/HW3_Q4_BN.jpg)

Show your work for the following calculations.

  1. Compute P(A = true and B = false and C = true and D = false).

  2. Compute P(D = true | A = false and B = true and C = false).

  3. Compute P(A = true | B = false and C = true and D = false).

  4. Compute P(B = false | A = true and C = false).

  5. Compute P(B = false).

Problem 5: Case-Based Reasoning (10 points)

Imagine that we have the following examples, represented using three Boolean-valued features:
    F1 = false  F2 = false  F3 = false    category = +
    F1 = false  F2 = false  F3 = true     category = -
    F1 = false  F2 = true   F3 = false    category = +
    F1 = true   F2 = true   F3 = true     category = +
    F1 = true   F2 = false  F3 = false    category = -
    F1 = true   F2 = false  F3 = true     category = +

Assume we have the following test set example, whose category we wish to estimate based on a case-based approach where we find the three (3) nearest neighbors and use the most common category among them as our estimate:

    F1 = false   F2 = true   F3 = true   category = ?

Propose and apply a (simple) similarity function to use on this problem.
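One simple candidate similarity function counts the features on which two examples agree. The Java sketch below pairs that function with a majority vote over the k most similar stored cases. It is only one of many reasonable designs, the class and method names are hypothetical, and the two-feature data in main is made up rather than taken from the examples above.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Sketch: k-nearest-neighbor classification over Boolean features. */
final class BooleanKnn {
    /** Similarity = number of features on which the two examples agree. */
    static int similarity(boolean[] a, boolean[] b) {
        int matches = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] == b[i]) {
                matches++;
            }
        }
        return matches;
    }

    /** Majority vote among the k most similar examples; labels are '+' or '-'. */
    static char classify(boolean[][] examples, char[] labels, boolean[] query, int k) {
        Integer[] order = new Integer[examples.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Stable sort, most similar first; ties keep their original order.
        Arrays.sort(order, Comparator.comparingInt((Integer i) -> -similarity(examples[i], query)));
        int plus = 0;
        for (int j = 0; j < k; j++) {
            if (labels[order[j]] == '+') {
                plus++;
            }
        }
        return 2 * plus > k ? '+' : '-';
    }

    public static void main(String[] args) {
        // Hypothetical two-feature training cases; not the table above.
        boolean[][] examples = {{true, true}, {true, false}, {false, false}, {false, true}};
        char[] labels = {'+', '+', '-', '-'};
        System.out.println(classify(examples, labels, new boolean[] {true, true}, 3));
    }
}
```

How ties in similarity are broken can change the answer, so a write-up should state the tie-breaking rule it uses.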

Problem 6: Creating a "Semantic" Spelling Checker (75 points)

In the remainder of this homework, you will design, implement, and test two approaches for addressing the following task.

Given

For various pairs of often confused words, such as affect versus effect, we have collected sample phrases that include these words; we assume that these sample uses are correct.

The collected pairs are available at:

     http://www.cs.wisc.edu/~shavlik/cs540/semanticSpellCheck/
     (or /p/course/cs540-shavlik/public/semanticSpellCheck/ in AFS)

The specific word pairs you will be using are: affect vs. effect, right vs. write, and too vs. two.

For each word pair there is a TRAIN set and a TEST set. Important: the train and test files have different organization, as explained here.

Do

Call the chosen paired words X and Y and consider "testset" sentences containing either word X or Y.

Using the training data in the collections of (presumed) proper usage, determine which of X or Y is more likely the proper usage in each of the test sentences. In other words, our goal is to identify likely usage errors, which we'll do as follows: whenever we see word X in a phrase, we'll see if using word Y instead leads to a more plausible phrase and when we see word Y we'll see if using word X instead looks better.

Additional Details

The sentences are augmented with additional linguistic knowledge. Eric Brill's part-of-speech (POS) tagger was applied to label each word in its context, as a noun, verb, adjective, etc. The table of POS codes used to label the words in the sample phrases is available at:
     http://www.cs.wisc.edu/~shavlik/cs540/semanticSpellCheck/tag-meanings.txt
     (or /p/course/cs540-shavlik/public/semanticSpellCheck/tag-meanings.txt in AFS)

Also, the Porter Stemmer was used to extract the "root" version of each word (e.g., walk, walks, walked, and walking all get "stemmed" to WALK).

A phrase containing a word of interest is defined as follows:

     the seven (7) words (tagged and stemmed) BEFORE the word X or Y

     the center word, i.e., either word X or Y
     (the center word is also tagged and stemmed)

     the seven (7) words (tagged and stemmed) AFTER the word X or Y

Punctuation counts as one of the 7 before/after words.

Here is a sample phrase, though it uses only 3 before/after words for simplicity:

  might [ MD MIGHT ] take [ VB TAKE ] to [ TO TO ] effect [ VB EFFECT ] it [ PRP IT ] , [* ,] but [ CC BUT ] 

More information on the construction of these annotated data files is available at:

     http://www.cs.wisc.edu/~shavlik/cs540/semanticSpellCheck/notationInDataFiles.txt
     (or /p/course/cs540-shavlik/public/semanticSpellCheck/notationInDataFiles.txt in AFS)
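
As a starting point for reading such phrases, here is a hedged Java sketch that splits a line into (word, tag, stem) triples. It assumes only the layout visible in the sample phrase above, namely a surface word followed by a bracketed tag and stem, with punctuation written as ", [* ,]". The notation file is the authoritative description, the train and test files may be organized differently, and the names PhraseParser and Token are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: split one annotated phrase into tokens, assuming the layout of the sample line. */
final class PhraseParser {
    /** One annotated token: surface word, POS tag, and stem. */
    static final class Token {
        final String word;
        final String tag;
        final String stem;

        Token(String word, String tag, String stem) {
            this.word = word;
            this.tag = tag;
            this.stem = stem;
        }
    }

    // Matches: word [ TAG STEM ]  and also the sample's punctuation form: , [* ,]
    private static final Pattern TOKEN =
            Pattern.compile("(\\S+)\\s+\\[\\s*(\\S+)\\s+(\\S+?)\\s*\\]");

    static List<Token> parse(String line) {
        List<Token> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(line);
        while (m.find()) {
            tokens.add(new Token(m.group(1), m.group(2), m.group(3)));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String sample = "might [ MD MIGHT ] take [ VB TAKE ] to [ TO TO ] "
                + "effect [ VB EFFECT ] it [ PRP IT ] , [* ,] but [ CC BUT ]";
        for (Token t : parse(sample)) {
            System.out.println(t.word + " / " + t.tag + " / " + t.stem);
        }
    }
}
```

On the sample phrase this yields seven tokens, with the punctuation token carrying the tag "*".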

The phrases were automatically extracted from some on-line novels whose copyright has expired (some of the English might be a bit archaic).

The relative probabilities of the various word pairs were estimated from the following counts obtained by querying a search engine some number of years ago (we are holding back some word-pairs for possible use during grading):

     accept:  1,030,205    except:  1,365,015 
     affect:    562,435    effect:  1,771,070 
     right:  10,662,780    write:   4,559,945 
     there:  20,287,830    their:  17,427,090
     too:     4,629,770    two:    16,896,825 

Case-Based Reasoning

Design, justify, implement, and evaluate TWO (2) reasonably different "similarity" functions for finding the K nearest neighbors to a test phrase. I.e., this function should compare the test phrase to all the known phrases (for the word-pair in question), find the K most similar previous cases, then output the most likely word for the center of the test phrase.

You may use any (or none) of the annotation, the relative probabilities of the two words, some or all of the seven preceding and subsequent words, information combined over all of the words on the left or right of the center word, etc. in your similarity functions. For example, you might add 5 points to the similarity score if the word immediately to the left of the center position is an exact match in the two phrases; 3 points if they aren't an exact match but have the same root (i.e., stemmed) version; 2 if they are the same part of speech; and 1 if they are not the identical part of speech, but are related (e.g., proper noun vs. regular noun).
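
The 5/3/2/1 example can be sketched in Java as a per-position scoring function. Each token is represented here as a {word, tag, stem} array. Treating two tags as "related" when they share their first two characters (for instance NN and NNP) is an assumption of this sketch, as is the case-insensitive word comparison; consult the table of POS codes before relying on either. The class and method names are hypothetical.

```java
/** Sketch: score how well two tokens at the same position match, using the 5/3/2/1 example. */
final class SlotSimilarity {
    /** Each token is a String[] of {word, tag, stem}. */
    static int score(String[] a, String[] b) {
        if (a[0].equalsIgnoreCase(b[0])) {
            return 5;  // exact word match (case-insensitive by assumption)
        }
        if (a[2].equals(b[2])) {
            return 3;  // same stem
        }
        if (a[1].equals(b[1])) {
            return 2;  // same part of speech
        }
        if (related(a[1], b[1])) {
            return 1;  // related parts of speech
        }
        return 0;
    }

    /** Assumed notion of "related": the tags share their first two characters. */
    static boolean related(String t1, String t2) {
        return t1.length() >= 2 && t2.length() >= 2 && t1.regionMatches(0, t2, 0, 2);
    }

    public static void main(String[] args) {
        String[] walks = {"walks", "VBZ", "WALK"};
        String[] walked = {"walked", "VBD", "WALK"};
        System.out.println(score(walks, walked));  // 3: different words, same stem
    }
}
```

A full similarity function would sum such scores over several positions, possibly weighting positions near the center word more heavily.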

You may choose whichever K you wish, though you might wish to pick an odd-valued one - that way you can simply take a majority vote of the nearest neighbors when categorizing a test phrase (if you wish, you can also use some sort of weighting scheme to combine the answers in the K nearest neighbors, e.g., weight "votes" by each neighbor's distance to the test phrase).
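
A weighted vote might look like the following Java sketch, in which each of the K neighbors votes for its own center word with a weight equal to its similarity to the test phrase. Breaking ties in favor of wordX is an arbitrary choice made here, and the class and method names are hypothetical.

```java
/** Sketch: combine the K nearest neighbors by a similarity-weighted vote. */
final class WeightedVote {
    /**
     * neighborWords[i] is the center word of the i-th neighbor and weights[i] its
     * similarity to the test phrase. Ties go to wordX (an arbitrary choice).
     */
    static String vote(String[] neighborWords, double[] weights, String wordX, String wordY) {
        double forX = 0.0;
        double forY = 0.0;
        for (int i = 0; i < neighborWords.length; i++) {
            if (neighborWords[i].equals(wordX)) {
                forX += weights[i];
            } else {
                forY += weights[i];
            }
        }
        return forX >= forY ? wordX : wordY;
    }

    public static void main(String[] args) {
        String[] neighbors = {"too", "two", "two"};
        // One very similar "too" neighbor outweighs two weaker "two" neighbors.
        System.out.println(vote(neighbors, new double[] {5.0, 2.0, 2.0}, "too", "two"));
    }
}
```

With equal weights this reduces to the plain majority vote.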

The intent is that you be creative in your design of your solution. Points for "creativity" and "effort" will be awarded during grading.

Calling Conventions

You should write a Java class with the following calling conventions:
      java CBR wordX wordY fractionXoverY fileOfTrainingCases fileOfTestPhrases
where
     wordX               one of the two words in the center of phrases
     wordY               the other one
     fractionXoverY      a floating point number, the ratio of
                         occurrences of wordX to that of wordY
     fileOfTrainingCases the name of the file containing the "training" phrases
     fileOfTestPhrases   the name of the file containing the "testing"  phrases
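
The command line might be handled along the lines of the Java sketch below, which also turns fractionXoverY into prior probabilities. It assumes the center word is always either wordX or wordY, so that P(X) = r / (1 + r) for ratio r. The class name CbrArgs and the method priorOfX are hypothetical; the submitted class must still be named CBR as specified above.

```java
/** Sketch: read the five command-line arguments and derive priors from the X/Y ratio. */
final class CbrArgs {
    /** If r = count(X) / count(Y) and only X or Y can occur, then P(X) = r / (1 + r). */
    static double priorOfX(double fractionXoverY) {
        return fractionXoverY / (1.0 + fractionXoverY);
    }

    public static void main(String[] args) {
        if (args.length != 5) {
            System.err.println(
                    "usage: java CBR wordX wordY fractionXoverY fileOfTrainingCases fileOfTestPhrases");
            return;
        }
        String wordX = args[0];
        String wordY = args[1];
        double ratio = Double.parseDouble(args[2]);
        String trainFile = args[3];
        String testFile = args[4];
        double pX = priorOfX(ratio);
        System.out.printf("P(%s) = %.4f, P(%s) = %.4f%n", wordX, pX, wordY, 1.0 - pX);
        System.out.println("train: " + trainFile + ", test: " + testFile);
    }
}
```

For affect versus effect, the counts listed earlier give a ratio of 562,435 / 1,771,070, or roughly 0.318.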

Your program should output, both for design #1 and design #2, the phrases on which your solutions get the wrong answers, as well as the overall fraction correct on the test phrases. It should also output the number of training phrases read and the number of testing phrases read.

Remember: the test file will NOT be in the same format as the training file. Your code should, for each pair of phrases in the test files, predict the correct center word. Then see if the prediction matches the correct word in the test phrase. This is essentially the same experimental design we used when applying the decision-tree algorithm to the HW1 testsets.

You probably will want to write code that reads the contents of a file into a string (see getFileContents in /p/course/cs540-shavlik/public/HWs/Utils.java), and then uses a StringTokenizer to "walk" through each of the "words" (i.e., words and punctuation symbols) in the string.

Bayesian Reasoning

Repeat the above, though this time instead of creating two similarity functions, devise, justify, and compare two (2) reasonably different designs for applying the Naive Bayes algorithm.

The calling convention should be the same, except instead of CBR use naiveBayes.
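
One possible Naive Bayes design treats each (position, stem) pair around the center word as a feature, counts how often each feature occurs with each center word in the training phrases, and then compares the two smoothed log scores. The Java sketch below illustrates this under several assumptions: features are encoded as strings such as "+1:MUCH", add-one smoothing is used, and the training phrases are made up. The class name is hypothetical, and the sketch expects at least one training call before scoring.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Sketch: Naive Bayes over positional features for choosing between two center words. */
final class CenterWordNaiveBayes {
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();  // word -> feature -> count
    private final Map<String, Integer> totals = new HashMap<>();               // word -> features seen
    private final Set<String> vocab = new HashSet<>();                         // all distinct features

    /** Record one training phrase: its center word and the features extracted from it. */
    void train(String centerWord, String[] features) {
        Map<String, Integer> m = counts.computeIfAbsent(centerWord, k -> new HashMap<>());
        for (String f : features) {
            m.merge(f, 1, Integer::sum);
            totals.merge(centerWord, 1, Integer::sum);
            vocab.add(f);
        }
    }

    /** log P(word) + sum of log P(feature | word), with add-one smoothing (assumed). */
    double logScore(String centerWord, double prior, String[] features) {
        Map<String, Integer> m = counts.getOrDefault(centerWord, new HashMap<>());
        int total = totals.getOrDefault(centerWord, 0);
        double score = Math.log(prior);
        for (String f : features) {
            score += Math.log((m.getOrDefault(f, 0) + 1.0) / (total + vocab.size()));
        }
        return score;
    }

    /** Returns whichever of wordX and wordY has the higher score; ties go to wordX. */
    String predict(String wordX, double priorX, String wordY, String[] features) {
        double sx = logScore(wordX, priorX, features);
        double sy = logScore(wordY, 1.0 - priorX, features);
        return sx >= sy ? wordX : wordY;
    }
}
```

A second design could differ in which features it extracts (tags instead of stems, a narrower window, position-independent features) while reusing the same counting and scoring code.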

Experimentation to Do and What to Turn In

Your solution should work for ANY pair of words, including pairs for which sample data has not been provided. In other words, you need to write a general-purpose solution. We'll be testing your solutions on 1-2 different word pairs.

You might want to create your own train and test sets by pulling out small subsets of the provided training and test sets.

Notice that, to be fair, you really should make THREE data sets: Create a TUNING set, and use it and your TRAIN set to evaluate and debug your designs, then when COMPLETELY DONE, apply your solutions to the TEST set. (I.e., you really should try to not "peak" [oops, make that "peek"] at the test set in advance.) However, we won't be this meticulous and will only use training and testing files.
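
If you do want a tuning set, one way to carve it out of the training cases is sketched below in Java. The fixed random seed (for repeatable experiments), the class name TuneSplit, and the tuneFraction parameter are all choices made for this sketch rather than requirements of the assignment.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Sketch: hold out part of the training cases as a tuning set. */
final class TuneSplit {
    /** Returns two lists: index 0 is the reduced training set, index 1 the tuning set. */
    static <T> List<List<T>> split(List<T> cases, double tuneFraction, long seed) {
        List<T> shuffled = new ArrayList<>(cases);
        Collections.shuffle(shuffled, new Random(seed));  // fixed seed gives repeatable splits
        int nTune = (int) Math.round(shuffled.size() * tuneFraction);
        List<List<T>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(shuffled.subList(nTune, shuffled.size())));  // train
        parts.add(new ArrayList<>(shuffled.subList(0, nTune)));                // tune
        return parts;
    }

    public static void main(String[] args) {
        List<Integer> cases = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            cases.add(i);
        }
        List<List<Integer>> parts = split(cases, 0.2, 42L);
        System.out.println(parts.get(0).size() + " train, " + parts.get(1).size() + " tune");
    }
}
```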

Electronically turn in your HW3 code in the same way you turned in HW1's code. Also turn in a printout of your commented code and a neatly written lab report that presents and justifies each of your two designs for each of the two approaches, and describes how well they worked on each of the three word pairs provided, affect-effect, right-write, and too-two.