# Warning: this is a draft and will be updated one day before the lecture.
📗 A sentence is a sequence of words (tokens). Each unique word token is called a word type. The set of word types is called the vocabulary.
📗 A sentence with length \(d\) can be represented by \(\left(w_{1}, w_{2}, ..., w_{d}\right)\).
📗 The probability of observing a word \(w_{t}\) at position \(t\) of the sentence can be written as \(\mathbb{P}\left\{w_{t}\right\}\) (or in statistics \(\mathbb{P}\left\{W_{t} = w_{t}\right\}\)).
📗 An N-gram model is a language model that assumes the probability of observing a word \(w_{t}\) at position \(t\) depends only on the words at positions \(t-1, t-2, ..., t-N+1\). In statistics notation, \(\mathbb{P}\left\{w_{t} | w_{t-1}, w_{t-2}, ..., w_{1}\right\} = \mathbb{P}\left\{w_{t} | w_{t-1}, w_{t-2}, ..., w_{t-N+1}\right\}\) (the \(|\) is pronounced "given", so \(\mathbb{P}\left\{a | b\right\}\) is "the probability of a given b").
📗 The unigram model assumes independence (not a realistic language model): \(\mathbb{P}\left\{w_{1}, w_{2}, ..., w_{d}\right\} = \mathbb{P}\left\{w_{1}\right\} \mathbb{P}\left\{w_{2}\right\} ... \mathbb{P}\left\{w_{d}\right\}\).
📗 Independence means \(\mathbb{P}\left\{w_{t} | w_{t-1}, w_{t-2}, ..., w_{1}\right\} = \mathbb{P}\left\{w_{t}\right\}\), that is, the probability of observing \(w_{t}\) does not depend on any of the previous words.
📗 Given a training set (many sentences or text documents), \(\mathbb{P}\left\{w_{t}\right\}\) is estimated by \(\hat{\mathbb{P}}\left\{w_{t}\right\} = \dfrac{c_{w_{t}}}{c_{1} + c_{2} + ... + c_{m}}\), where \(m\) is the size of the vocabulary (and the vocabulary is \(\left\{1, 2, ..., m\right\}\)), and \(c_{w_{t}}\) is the number of times the word \(w_{t}\) appeared in the training set.
📗 This is called the maximum likelihood estimator because it maximizes the likelihood (probability) of observing the sentences in the training set.
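📗 The counting estimator above can be sketched in a few lines of Python. This is a minimal illustration, not the course's reference code: the function names `unigram_mle` and `sentence_probability` and the toy training set are assumptions made for the example, and sentences are assumed to be already split into tokens.

```python
from collections import Counter
from fractions import Fraction

def unigram_mle(sentences):
    """Estimate P{w} = c_w / (c_1 + ... + c_m) from tokenized sentences."""
    counts = Counter(token for sentence in sentences for token in sentence)
    total = sum(counts.values())
    return {word: Fraction(count, total) for word, count in counts.items()}

def sentence_probability(sentence, probs):
    """Unigram probability of a sentence: the product of its word probabilities."""
    p = Fraction(1)
    for token in sentence:
        # A word that never appeared in training gets probability 0 under the MLE.
        p *= probs.get(token, Fraction(0))
    return p

# Hypothetical toy training set (not from the lecture).
training = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
probs = unigram_mle(training)

# "the" appears 3 times out of 9 tokens, so its estimate is 1/3.
print(probs["the"])
print(sentence_probability(["the", "cat", "sat"], probs))
```

Exact fractions are used instead of floats so that the estimates can be compared against hand calculations without rounding error.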
Math Note
📗 Suppose the vocabulary is \(\left\{a, b\right\}\), and \(p_{1} = \hat{\mathbb{P}}\left\{a\right\}, p_{2} = \hat{\mathbb{P}}\left\{b\right\}\) with \(p_{1} + p_{2} = 1\), based on a training set with \(c_{1}\) occurrences of \(a\) and \(c_{2}\) occurrences of \(b\). Then the probability of observing \(c_{1}\) \(a\)'s and \(c_{2}\) \(b\)'s is \(\dbinom{c_{1} + c_{2}}{c_{1}} p_{1}^{c_{1}} p_{2}^{c_{2}}\), which is maximized at \(p_{1} = \dfrac{c_{1}}{c_{1} + c_{2}}, p_{2} = \dfrac{c_{2}}{c_{1} + c_{2}}\).
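📗 One way to see where the maximizer comes from is to take the logarithm of the likelihood, substitute \(p_{2} = 1 - p_{1}\), and set the derivative to zero. The sketch below assumes \(c_{1} > 0\) and \(c_{2} > 0\); the symbol \(\ell\) for the log-likelihood is introduced here and is not part of the original notes.

```latex
\[
\begin{aligned}
\ell(p_{1}) &= \log \dbinom{c_{1} + c_{2}}{c_{1}} + c_{1} \log p_{1} + c_{2} \log\left(1 - p_{1}\right), \\
\ell'(p_{1}) &= \dfrac{c_{1}}{p_{1}} - \dfrac{c_{2}}{1 - p_{1}} = 0 \\
&\Rightarrow c_{1} \left(1 - p_{1}\right) = c_{2} p_{1} \\
&\Rightarrow p_{1} = \dfrac{c_{1}}{c_{1} + c_{2}}, \quad p_{2} = 1 - p_{1} = \dfrac{c_{2}}{c_{1} + c_{2}}.
\end{aligned}
\]
```

The second derivative \(-\dfrac{c_{1}}{p_{1}^{2}} - \dfrac{c_{2}}{\left(1 - p_{1}\right)^{2}}\) is negative, so this critical point is a maximum.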
In-class Quiz
📗 [1 point] Given a training set (the script of "Guardians of the Galaxy" for Vin Diesel: Wikipedia), "I am Groot, I am Groot, ... (13 times), ..., I am Groot, We are Groot". What are the maximum likelihood estimates of the unigram model based on this training set? What is the probability of observing a new sentence "I am Groot" based on the estimated unigram model?
📗 The Markov property means \(\mathbb{P}\left\{w_{t} | w_{t-1}, w_{t-2}, ..., w_{1}\right\} = \mathbb{P}\left\{w_{t} | w_{t-1}\right\}\), that is, the probability distribution of observing \(w_{t}\) depends only on the previous word in the sentence \(w_{t-1}\). A visualization of Markov chains: Link.
📗 The maximum likelihood estimator of \(\mathbb{P}\left\{w_{t} | w_{t-1}\right\}\) is \(\hat{\mathbb{P}}\left\{w_{t} | w_{t-1}\right\} = \dfrac{c_{w_{t-1} w_{t}}}{c_{w_{t-1}}}\), where \(c_{w_{t-1} w_{t}}\) is the number of times the phrase (sequence of words) \(w_{t-1} w_{t}\) appeared in the training set.
Math Note
📗 Conditional probability is defined as \(\mathbb{P}\left\{a | b\right\} = \dfrac{\mathbb{P}\left\{a, b\right\}}{\mathbb{P}\left\{b\right\}}\), so \(\mathbb{P}\left\{w_{t} | w_{t-1}\right\} = \dfrac{\mathbb{P}\left\{w_{t-1} w_{t}\right\}}{\mathbb{P}\left\{w_{t-1}\right\}}\), where \(\mathbb{P}\left\{w_{t-1} w_{t}\right\}\) is the probability of observing the phrase \(w_{t-1} w_{t}\).
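📗 The bigram estimator \(\dfrac{c_{w_{t-1} w_{t}}}{c_{w_{t-1}}}\) can be sketched the same way. This is an illustration under stated assumptions: the function name `bigram_mle` and the toy training set are hypothetical, bigrams are counted within each sentence only, and the denominator counts every occurrence of \(w_{t-1}\), as in the formula above, including occurrences at the end of a sentence.

```python
from collections import Counter
from fractions import Fraction

def bigram_mle(sentences):
    """Estimate P{w_t | w_{t-1}} = c_{w_{t-1} w_t} / c_{w_{t-1}}."""
    unigram_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        unigram_counts.update(sentence)
        # zip pairs each token with the token that follows it in the same sentence.
        bigram_counts.update(zip(sentence, sentence[1:]))
    return {
        (prev, word): Fraction(count, unigram_counts[prev])
        for (prev, word), count in bigram_counts.items()
    }

# Hypothetical toy training set (not from the lecture).
training = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
bigram_probs = bigram_mle(training)

# "the cat" appears twice and "the" appears three times, so P{cat | the} = 2/3.
print(bigram_probs[("the", "cat")])
```

Bigrams that never appear in the training set are simply absent from the dictionary, which corresponds to an estimated probability of 0.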
In-class Quiz
📗 [1 point] Given a training set (the script of "Guardians of the Galaxy" for Vin Diesel: Wikipedia), "I am Groot, I am Groot, ... (13 times), ..., I am Groot, We are Groot". What are the maximum likelihood estimates of the bigram model based on this training set? What is the probability of observing a new sentence "I am Groot" based on the estimated bigram model?
📗 The bigram probabilities can be stored in a matrix called the transition matrix of a Markov chain. The number in row \(i\) column \(j\) is the probability \(\mathbb{P}\left\{j | i\right\}\) or the estimated probability \(\hat{\mathbb{P}}\left\{j | i\right\}\): Link.
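📗 Given the initial distribution of word types as a row vector, the distribution of the next token can be found by multiplying the initial distribution by the transition matrix (the initial distribution goes on the left, since row \(i\) of the matrix holds the probabilities \(\mathbb{P}\left\{j | i\right\}\)).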
📗 The stationary distribution of a Markov chain is an initial distribution such that all subsequent distributions will be the same as the initial distribution, which means if the transition matrix is \(M\), then the stationary distribution is a distribution \(p\) satisfying \(p M = p\).
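📗 A small numerical sketch of one step of a Markov chain and of the stationarity condition \(p M = p\). The 2-word transition matrix, the helper name `step`, and the candidate distribution `p_star` are all assumptions made for the example; distributions are treated as row vectors so that the next distribution is \(p M\).

```python
def step(p, M):
    """One step of the chain: the row vector p times the matrix M."""
    n = len(M)
    return [sum(p[i] * M[i][j] for i in range(n)) for j in range(n)]

# Hypothetical 2-word vocabulary; row i, column j holds P{j | i}, so each row sums to 1.
M = [[0.9, 0.1],
     [0.5, 0.5]]

p0 = [1.0, 0.0]    # start on word 0 with certainty
p1 = step(p0, M)   # distribution of the next token, which is row 0 of M

# Candidate stationary distribution for this particular M.
p_star = [5 / 6, 1 / 6]
p_next = step(p_star, M)

# Stationary means one more step leaves the distribution unchanged (up to rounding).
print(all(abs(a - b) < 1e-12 for a, b in zip(p_star, p_next)))
```

For this matrix, \(p = \left(\dfrac{5}{6}, \dfrac{1}{6}\right)\) can be found by hand from \(0.9 p_{1} + 0.5 p_{2} = p_{1}\) together with \(p_{1} + p_{2} = 1\).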
Math Note
📗 An alternative way to compute the stationary distribution (if it exists) is to start with any initial distribution \(p_{0}\), written as a row vector, and multiply it by \(M\) repeatedly, that is, compute \(p_{0} M^{\infty} = \displaystyle\lim_{n \to \infty} p_{0} M^{n}\). This limit need not exist for every chain (for example, a periodic one).
📗 It is easier to find powers of diagonal matrices, so if the transition matrix can be written as \(M = P D P^{-1}\), where \(D\) is a diagonal matrix (off-diagonal entries are 0, and the diagonal entries are the eigenvalues of \(M\)) and \(P\) is the matrix whose columns are the corresponding eigenvectors, then \(M^{\infty} = \left(P D P^{-1}\right)\left(P D P^{-1}\right) ... = P D^{\infty} P^{-1}\), because every inner \(P^{-1} P\) cancels.
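📗 The repeated-multiplication idea can be sketched as a loop that stops once the distribution no longer changes. This is an illustrative sketch rather than a general-purpose solver: the function name `stationary_by_iteration`, the tolerance, the step limit, and the 2-word matrix are assumptions. For this hypothetical matrix the eigenvalues are 1 and 0.4, so \(D^{\infty}\) keeps the entry 1 and sends \(0.4^{n}\) to 0, which is why the iteration converges quickly.

```python
def step(p, M):
    """One step of the chain: the row vector p times the matrix M."""
    n = len(M)
    return [sum(p[i] * M[i][j] for i in range(n)) for j in range(n)]

def stationary_by_iteration(M, p0, tol=1e-12, max_steps=10_000):
    """Multiply p0 by M repeatedly until successive distributions agree within tol."""
    p = list(p0)
    for _ in range(max_steps):
        q = step(p, M)
        if max(abs(a - b) for a, b in zip(p, q)) < tol:
            return q
        p = q
    # Repeated multiplication does not converge for every chain (e.g. periodic ones).
    raise RuntimeError("did not converge within max_steps")

# Hypothetical 2-word transition matrix; row i, column j holds P{j | i}.
M = [[0.9, 0.1],
     [0.5, 0.5]]

from_first = stationary_by_iteration(M, [1.0, 0.0])
from_second = stationary_by_iteration(M, [0.0, 1.0])

# Both starting points should approach the same distribution, close to (5/6, 1/6).
print(from_first, from_second)
```

Starting from two different initial distributions is a cheap sanity check that the limit does not depend on \(p_{0}\) for this chain.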
📗 The same formula can be applied to trigram models: \(\hat{\mathbb{P}}\left\{w_{t} | w_{t-1}, w_{t-2}\right\} = \dfrac{c_{w_{t-2} w_{t-1} w_{t}}}{c_{w_{t-2} w_{t-1}}}\).
📗 In a document, some longer sequences of tokens never appear. For example, when \(w_{t-2} w_{t-1}\) never appears, the maximum likelihood estimator \(\hat{\mathbb{P}}\left\{w_{t} | w_{t-1}, w_{t-2}\right\}\) is \(\dfrac{0}{0}\), which is undefined. As a result, Laplace smoothing (add-one smoothing) is often used: \(\hat{\mathbb{P}}\left\{w_{t} | w_{t-1}, w_{t-2}\right\} = \dfrac{c_{w_{t-2} w_{t-1} w_{t}} + 1}{c_{w_{t-2} w_{t-1}} + m}\), where \(m\) is the number of unique words in the document (the size of the vocabulary).
📗 Laplace smoothing can be used for bigram and unigram models too: \(\hat{\mathbb{P}}\left\{w_{t} | w_{t-1}\right\} = \dfrac{c_{w_{t-1} w_{t}} + 1}{c_{w_{t-1}} + m}\) for bigram and \(\hat{\mathbb{P}}\left\{w_{t}\right\} = \dfrac{c_{w_{t}} + 1}{c_{1} + c_{2} + ... + c_{m} + m}\) for unigram.
➩ With add-one Laplace smoothing: \(\hat{\mathbb{P}}\left\{w_{t} | w_{t-1}, w_{t-2}, ..., w_{t-N+1}\right\} = \dfrac{c_{w_{t-N+1}, w_{t-N+2}, ..., w_{t}} + 1}{c_{w_{t-N+1}, w_{t-N+2}, ..., w_{t-1}} + m}\), where \(m\) is the number of unique words in the document.
➩ With general Laplace smoothing with parameter \(\delta\): \(\hat{\mathbb{P}}\left\{w_{t} | w_{t-1}, w_{t-2}, ..., w_{t-N+1}\right\} = \dfrac{c_{w_{t-N+1}, w_{t-N+2}, ..., w_{t}} + \delta}{c_{w_{t-N+1}, w_{t-N+2}, ..., w_{t-1}} + \delta m}\).
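📗 The smoothed estimators can be sketched for the bigram case as follows. The function name `smoothed_bigram`, the toy training set, and the choice to pass the counts in as `Counter` objects are assumptions made for this example; `delta=1` gives add-one smoothing and other values give the general formula with parameter \(\delta\).

```python
from collections import Counter
from fractions import Fraction

def smoothed_bigram(prev, word, unigram_counts, bigram_counts, m, delta=1):
    """Laplace-smoothed estimate (c_{prev word} + delta) / (c_{prev} + delta * m)."""
    delta = Fraction(delta)
    # Counter returns 0 for unseen keys, so unseen bigrams and contexts need no special case.
    numerator = bigram_counts[(prev, word)] + delta
    denominator = unigram_counts[prev] + delta * m
    return numerator / denominator

# Hypothetical toy training set (not from the lecture).
training = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
unigram_counts = Counter(token for s in training for token in s)
bigram_counts = Counter(pair for s in training for pair in zip(s, s[1:]))
m = len(unigram_counts)  # vocabulary size: 5 word types

seen = smoothed_bigram("the", "cat", unigram_counts, bigram_counts, m)        # (2 + 1) / (3 + 5)
unseen = smoothed_bigram("dog", "ran", unigram_counts, bigram_counts, m)      # (0 + 1) / (1 + 5)
new_context = smoothed_bigram("bird", "sat", unigram_counts, bigram_counts, m)  # (0 + 1) / (0 + 5)
half = smoothed_bigram("the", "cat", unigram_counts, bigram_counts, m, delta=Fraction(1, 2))
print(seen, unseen, new_context, half)
```

A context that never appeared in training ("bird" here) falls back to the uniform distribution \(\dfrac{1}{m}\) instead of the undefined \(\dfrac{0}{0}\).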
In-class Discussion
📗 Go to the Google Books Ngram Viewer: Link. Find a phrase with N-gram probability that is decreasing over time.
📗 If you have questions, please use (i) Zoom chat, (ii) Piazza: Link, (iii) Office hours and discussion sessions. Please do NOT use Canvas mail and use email only to the course instructor (not TAs) for grading issues.
📗 To get full points on the in-class quizzes for a lecture:
➩ Submit relevant answers to the questions discussed during the lecture: incorrect answers are okay.
➩ Some questions require [notes] to earn the point.
➩ Some questions require special ID (given during the lecture) to earn the point.
➩ Do not submit answers to questions that are not discussed during the lectures. Each such submission will result in a deduction of one point.
➩ Submissions after the lecture, before the midterm (first 14 lectures) and the final exam (last 14 lectures), are accepted. After the exams, no in-class quiz submissions will be accepted.
➩ The grade on Canvas Assignment Q9 is computed as number of points divided by the number of questions asked (out of 1) and updated on Canvas every weekend.
📗 If there are any issues with submission on the website, please use this Google form: Link.
📗 Bonus point opportunities during a few lectures (added to in-class quiz above 20 points).
📗 Notes and code adapted from the course taught by Professors Jerry Zhu, Blerina Gkotse, Yudong Chen, Yingyu Liang, Charles Dyer. Some content is generated using Copilot.