Since it can be a little confusing to follow all their variations, here is a summary of the three "documents of English text" models we discussed in class this week. At the end is a simple worked example, to which I applied all three representations.

You might want to think about, for each model, how one could use the sets of estimated probs to stochastically (ie, randomly) generate sample documents. All three models miss many aspects of English, and so will generate nonsense documents, but thinking about how each model could be used to generate a document will help you understand the assumptions made in each probabilistic representation of documents.

(I have proofread this 3-4 times, but there is still a chance of some typos and even think-os. Let me know if you find any. I'll link this message to the "Readings" for cs540 and will correct typos in that file. I'll note any corrections at the top of that file.)

Jude

------------
Changes since original email: none so far
------------

First let's define some variables:

  C = number of categories (we're only using + and -, so for us C = 2)
  V = vocabulary size (eg, 26 for our simple A-Z vocab, but more likely 10,000 to 500,000 for a real text corpus)
  L = max length (in # of words) of the documents in our training corpus, could be 1000s of words

Approach 1
--------------
rand var i = does WORD i appear ANYWHERE in the doc?
  - i ranges from 1 to V, and each random variable has two values (ie, true/false)

  - the size of the full joint prob table is C x 2^V
    (advanced point: technically we need C fewer 'memory cells' than this, since we know the probs for each category must sum to 1, but that is negligible)

  - for Naive Bayes (NB) we only need C x V probabilities, eg,
      prob(word i appears | class = j)
    note: for NB we don't need to store "prob(word i is absent | class = j)" since that equals "1 - prob(word i appears | class = j)"

Here NB is assuming that, given the category, the presence of word i (eg, 'aardvark') is independent of which other words appear in the doc.

Approach 2
--------------
rand var i = which word is at POSITION i?

  - i ranges from 1 to L, and each random variable has V possible values (ie, word1, ..., wordV)

  - the size of the full joint prob table is C x V^L, which most likely is very large
    (again, for each of the C categories, we could save one memory cell since we know they sum to 1)

    Notice here, in the FULL table, the prob of a word at position i is influenced by all the other positions. Eg, here is ONE cell in the table (assuming our vocabulary = A ... Z):

      prob(word@1 = A ^ word@2 = G ^ word@3 = O ^ . . . ^ word@L = G ^ Category = +)

    ['word@N' is shorthand for "the Nth word in the document"]

  - for Naive Bayes we only need C x V x L probabilities, eg,
      prob(word at position i = the kth word in the vocabulary | class = j)

Here NB is assuming that, given the category, the word appearing at position i is independent of the words appearing at all other positions in the document.

Note that, even with the NB assumption, the probability distribution for the words at position i might be very different from the prob dist for the words at position j - eg, prob(word@1 = 'the' | +) might be quite high, while prob(word@2 = 'the' | +) might be much lower.

Approach 3
--------------
this is a simplified version of Approach 2

as in Approach 2: rand var i = which word is at POSITION i?
we SIMPLIFY "prob(word at position i = the kth word in the vocabulary | class = j)" by assuming that the probability distributions for ALL the positions are the same, which allows us to pool all our words together and better estimate probabilities (eg, even if we never saw 'the' as the 17th word in a document, we would still be able to estimate "prob(word@i = 'the' | +)" reasonably accurately).

  - the size of the full joint prob table is C x V
    (again, for each of the C categories, we could save one cell since we know they sum to 1, but it is probably easier to simply fill the full table)

  - the above is also the size for Naive Bayes, since we essentially made the Naive Bayes assumption when we said all the positions had the same probability distributions

Example
-----------
Let's assume we have these '+' docs (the same calcs would apply to whatever set of '-' docs we would be given):

  H E L L O
  G O O D
  B Y E

Ie, three docs with a total of 12 words in them.

In the following, I am letting some probs equal 0, but as mentioned in class, we don't want to do that in practice. I'll discuss in class next Monday one easy way to avoid having any probs equal to zero.

Here are the probs we'd estimate from this corpus of the three docs above.

Approach 1:

  prob(A present | +) = 0
  ...
  prob(E present | +) = 2/3
  ...
  prob(L present | +) = 1/3   // 'L' appears twice, but in the same doc.
  ...
  prob(Y present | +) = 1/3
  prob(Z present | +) = 0
  // Note: these 26 probs DON'T sum to 1. (Why?)

Approach 2:

  prob(word@1 = A | +) = 0
  prob(word@1 = B | +) = 1/3
  ...
  prob(word@1 = G | +) = 1/3
  prob(word@1 = H | +) = 1/3
  ...
  prob(word@1 = Y | +) = 0
  prob(word@1 = Z | +) = 0
  // Note: this group (A-Z), for word 1, would sum to 1.

  prob(word@2 = A | +) = 0
  ...
  prob(word@2 = E | +) = 1/3
  ...
  prob(word@2 = O | +) = 1/3
  ...
  prob(word@2 = Y | +) = 1/3
  prob(word@2 = Z | +) = 0
  // Note: this group (A-Z), for word 2, would sum to 1.

  ...

  prob(word@5 = O | +) = 1
  ...
  prob(word@5 = Y | +) = 0
  prob(word@5 = Z | +) = 0
  // These 26 probs would also sum to 1.

Approach 3:

here we assume prob(word@i = word k | +) = prob(word k at any position | +), so we'd only have to store these 26 probs:

  prob(word = A | +) = 0
  prob(word = B | +) = 1/12
  ...
  prob(word = D | +) = 1/12
  prob(word = E | +) = 2/12
  ...
  prob(word = G | +) = 1/12
  prob(word = H | +) = 1/12
  ...
  prob(word = L | +) = 2/12
  ...
  prob(word = O | +) = 3/12
  ...
  prob(word = Y | +) = 1/12
  prob(word = Z | +) = 0
  // These would all sum to 1 (so the prob for all of the ...'s is 0).

This is a "unigram" model - on Monday, we'll talk about n-gram models (bigrams [pairs of adjacent words] and trigrams [triples of consecutive words] in particular).

Also, Approach 3 is typically called a 'bag of words' model since we simply dumped all the words in a bag and counted each one's frequency. (Aside: technically a 'bag' is a set where duplicates can occur.)

Judging a Sample 'Testset' Document
----------------------------------------------
Lastly, how does each model estimate prob('A A C E' | +)?
(this prob is the key one in the equation: prob(+ | 'A A C E') = p('A A C E' | +) x p(+) / p('A A C E'), ie, Bayes' rule applied to classifying doc 'A A C E').

Approach 1

Recall here the doc "A A C E" is rep'ed as [T, F, T, F, T, F, ..., F], where each element in this 26-element array says whether or not the ith 'word' appears in this document

  prob('A A C E' | +) = prob(A|+) x (1 - prob(B|+)) x prob(C|+) x (1 - prob(D|+)) x prob(E|+) x (1 - prob(F|+)) x ...

Approach 2

  prob('A A C E' | +) = prob(word@1='A'|+) x prob(word@2='A'|+) x prob(word@3='C'|+) x prob(word@4='E'|+)

Approach 3

  prob('A A C E' | +) = prob(word='A'|+) x prob(word='A'|+) x prob(word='C'|+) x prob(word='E'|+)
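
For anyone who wants to check the numbers above by machine, here is a minimal Python sketch of the three estimators applied to the three '+' docs in the example. It is only an illustration, not code from class: the function names are made up for this note, each letter is treated as one 'word', and for Approach 2 I assume that only docs long enough to have an ith word are counted at position i (which is what gives prob(word@5 = O | +) = 1). No smoothing is done, so zero probs are left as zeros.

```python
from collections import Counter
from fractions import Fraction

# The three '+' docs from the example; each letter is one 'word'.
DOCS = [list("HELLO"), list("GOOD"), list("BYE")]
VOCAB = [chr(c) for c in range(ord("A"), ord("Z") + 1)]


def estimate_presence(docs):
    """Approach 1: prob(word appears ANYWHERE in a doc | +)."""
    return {w: Fraction(sum(w in d for d in docs), len(docs)) for w in VOCAB}


def estimate_positional(docs):
    """Approach 2: one distribution per position, prob(word@i = w | +).

    Assumption: position i is estimated only from docs that have an ith word.
    """
    tables = []
    for i in range(max(len(d) for d in docs)):
        words_here = Counter(d[i] for d in docs if len(d) > i)
        total = sum(words_here.values())
        tables.append({w: Fraction(words_here[w], total) for w in VOCAB})
    return tables


def estimate_unigram(docs):
    """Approach 3: pool all positions into one 'bag of words' distribution."""
    counts = Counter(w for d in docs for w in d)
    total = sum(counts.values())
    return {w: Fraction(counts[w], total) for w in VOCAB}


def score_presence(doc, p):
    """prob(doc | +) under Approach 1: every vocab word is present or absent."""
    present = set(doc)
    result = Fraction(1)
    for w in VOCAB:
        result *= p[w] if w in present else 1 - p[w]
    return result


def score_positional(doc, tables):
    """prob(doc | +) under Approach 2; doc must be no longer than len(tables)."""
    result = Fraction(1)
    for i, w in enumerate(doc):
        result *= tables[i][w]
    return result


def score_unigram(doc, p):
    """prob(doc | +) under Approach 3: multiply the pooled prob of each word."""
    result = Fraction(1)
    for w in doc:
        result *= p[w]
    return result


presence = estimate_presence(DOCS)
positional = estimate_positional(DOCS)
unigram = estimate_unigram(DOCS)

if __name__ == "__main__":
    test_doc = list("AACE")
    print(score_presence(test_doc, presence))
    print(score_positional(test_doc, positional))
    print(score_unigram(test_doc, unigram))
```

Since 'A' never appears in the three '+' docs, all three models score 'A A C E' as exactly 0, which is a concrete instance of why we don't want zero probs in practice.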