Course Links: Canvas, Piazza, TopHat (212925)
Zoom Links: MW 4:00, TR 1:00, TR 2:30.




# Welcome to CS540

📗 The main components of the course include:
➩ Lectures (TopHat quizzes): 10%
➩ Assignments (Programming): 60%
➩ Exams: 30%
TopHat Discussion
📗 Why are you taking the course?
➩ Learn how to use AI tools like ChatGPT? This is not covered in the course.
➩ Learn how to program AI tools like ChatGPT? Only simple models.
➩ Learn the math and statistics behind AI algorithms? Yes, this is the focus of the course.




# ChatGPT

📗 GPT stands for Generative Pre-trained Transformer.
➩ Unsupervised learning (convert text to numerical vectors).
➩ Supervised learning: (1) discriminative (predict answers based on questions), (2) generative (predict the next word based on the previous words).
➩ Reinforcement learning (update model based on human feedback).
TopHat Discussion
📗 Have you used ChatGPT (or another Large Language Model)? What did you use LLM for?
➩ Solve homework or exam questions? For CS540, it is possible with some prompt engineering: Link.
➩ Write code for projects? For CS540, you are allowed to use large language models (LLMs) to help with writing code (at the moment, most LLMs cannot write complete projects).
➩ Write stories or create images? In the past, there were CS540 assignments asking students to use earlier versions of GPT to perform these tasks and compare the results with human creations.
➩ Other uses?



# N-gram Model

📗 A sentence is a sequence of words (tokens). Each unique word token is called a word type. The set of word types is called the vocabulary.
📗 A sentence with length \(d\) can be represented by \(\left(w_{1}, w_{2}, ..., w_{d}\right)\).
📗 The probability of observing a word \(w_{t}\) at position \(t\) of the sentence can be written as \(\mathbb{P}\left\{w_{t}\right\}\) (or in statistics \(\mathbb{P}\left\{W_{t} = w_{t}\right\}\)).
📗 An N-gram model is a language model that assumes the probability of observing a word \(w_{t}\) at position \(t\) only depends on the words at positions \(t-1, t-2, ..., t-N+1\). In statistics notation, \(\mathbb{P}\left\{w_{t} | w_{t-1}, w_{t-2}, ..., w_{0}\right\} = \mathbb{P}\left\{w_{t} | w_{t-1}, w_{t-2}, ..., w_{t-N+1}\right\}\) (the \(|\) is pronounced as "given"; \(\mathbb{P}\left\{a | b\right\}\) is "the probability of a given b").
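📗 A minimal Python sketch (not part of the original notes; the toy sentence and variable names are assumptions) showing tokens, word types, and the vocabulary:
```python
# Hypothetical example: split a toy sentence into tokens and collect the word types (the vocabulary).
sentence = "I am Groot I am Groot"         # assumed toy training text
tokens = sentence.lower().split()          # the sequence (w_1, w_2, ..., w_d)
vocabulary = sorted(set(tokens))           # the set of word types
print(tokens)                              # ['i', 'am', 'groot', 'i', 'am', 'groot']
print(vocabulary)                          # ['am', 'groot', 'i']
print(len(tokens), len(vocabulary))        # d = 6, m = 3
```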



# Unigram Model

📗 The unigram model assumes independence (not a realistic language model): \(\mathbb{P}\left\{w_{1}, w_{2}, ..., w_{d}\right\} = \mathbb{P}\left\{w_{1}\right\} \mathbb{P}\left\{w_{2}\right\} ... \mathbb{P}\left\{w_{d}\right\}\).
📗 Independence means \(\mathbb{P}\left\{w_{t} | w_{t-1}, w_{t-2}, ..., w_{0}\right\} = \mathbb{P}\left\{w_{t}\right\}\), or the probability of observing \(w_{t}\) is independent of all previous words.
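📗 A minimal Python sketch of the independence assumption (the unigram probabilities below are made-up numbers for illustration):
```python
# Hypothetical unigram probabilities for a 3-word vocabulary.
p = {"i": 0.3, "am": 0.3, "groot": 0.4}

def unigram_sentence_probability(words, p):
    """P(w_1, ..., w_d) = P(w_1) P(w_2) ... P(w_d) under the independence assumption."""
    prob = 1.0
    for w in words:
        prob *= p[w]
    return prob

print(unigram_sentence_probability(["i", "am", "groot"], p))   # 0.3 * 0.3 * 0.4 = 0.036
```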



# Maximum Likelihood Estimation

📗 Given a training set (many sentences or text documents), \(\mathbb{P}\left\{w_{t}\right\}\) is estimated by \(\hat{\mathbb{P}}\left\{w_{t}\right\} = \dfrac{c_{w_{t}}}{c_{1} + c_{2} + ... + c_{m}}\), where \(m\) is the size of the vocabulary (and the vocabulary is \(\left\{1, 2, ..., m\right\}\)), and \(c_{w_{t}}\) is the number of times the word \(w_{t}\) appeared in the training set.
📗 This is called the maximum likelihood estimator because it maximizes the likelihood (probability) of observing the sentences in the training set.
Math Note
📗 Suppose the vocabulary is \(\left\{a, b\right\}\), and \(p_{1} = \hat{\mathbb{P}}\left\{a\right\}, p_{2} = \hat{\mathbb{P}}\left\{b\right\}\) with \(p_{1} + p_{2} = 1\), based on a training set with \(c_{1}\) occurrences of \(a\) and \(c_{2}\) occurrences of \(b\). Then the probability of observing the training set is \(\dbinom{c_{1} + c_{2}}{c_{1}} p_{1}^{c_{1}} p_{2}^{c_{2}}\), which is maximized at \(p_{1} = \dfrac{c_{1}}{c_{1} + c_{2}}, p_{2} = \dfrac{c_{2}}{c_{1} + c_{2}}\).
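📗 A minimal Python sketch of the unigram maximum likelihood estimator (the toy training set is an assumption, not the quiz data):
```python
from collections import Counter

training_tokens = "the cat sat on the mat".split()     # assumed toy training set
counts = Counter(training_tokens)                      # c_w for each word type w
total = sum(counts.values())                           # c_1 + c_2 + ... + c_m

p_hat = {w: c / total for w, c in counts.items()}      # estimate of P{w} = c_w / (c_1 + ... + c_m)
print(p_hat)                                           # e.g. 'the' -> 2/6, every other word -> 1/6
```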
TopHat Quiz
📗 [1 point] Given a training set (the script of "Guardians of the Galaxy" for Vin Diesel: Wikipedia), "I am Groot, I am Groot, ... (13 times), ..., I am Groot, We are Groot". What are the maximum likelihood estimates of the unigram model based on this training set? What is the probability of observing a new sentence "I am Groot" based on the estimated unigram model?





# Bigram Model

📗 The bigram model assumes the Markov property: \(\mathbb{P}\left\{w_{1}, w_{2}, ..., w_{d}\right\} = \mathbb{P}\left\{w_{1}\right\} \mathbb{P}\left\{w_{2} | w_{1}\right\} \mathbb{P}\left\{w_{3} | w_{2}\right\} ... \mathbb{P}\left\{w_{d} | w_{d-1}\right\}\).
📗 The Markov property means \(\mathbb{P}\left\{w_{t} | w_{t-1}, w_{t-2}, ..., w_{0}\right\} = \mathbb{P}\left\{w_{t} | w_{t-1}\right\}\), or the probability distribution of observing \(w_{t}\) only depends on the previous word in the sentence, \(w_{t-1}\). A visualization of Markov chains: Link.
📗 The maximum likelihood estimator of \(\mathbb{P}\left\{w_{t} | w_{t-1}\right\}\) is \(\hat{\mathbb{P}}\left\{w_{t} | w_{t-1}\right\} = \dfrac{c_{w_{t-1} w_{t}}}{c_{w_{t-1}}}\), where \(c_{w_{t-1} w_{t}}\) is the number of times the phrase (sequence of words) \(w_{t-1} w_{t}\) appeared in the training set.
Math Note
📗 Conditional probability is defined as \(\mathbb{P}\left\{a | b\right\} = \dfrac{\mathbb{P}\left\{a, b\right\}}{\mathbb{P}\left\{b\right\}}\), so \(\mathbb{P}\left\{w_{t} | w_{t-1}\right\} = \dfrac{\mathbb{P}\left\{w_{t-1} w_{t}\right\}}{\mathbb{P}\left\{w_{t-1}\right\}}\), where \(\mathbb{P}\left\{w_{t-1} w_{t}\right\}\) is the probability of observing the phrase \(w_{t-1} w_{t}\).
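📗 A minimal Python sketch of the bigram maximum likelihood estimator (the toy training set and function name are assumptions):
```python
from collections import Counter

tokens = "the cat sat on the mat".split()              # assumed toy training set
prev_counts = Counter(tokens[:-1])                     # c_{w_{t-1}}: counts of words that have a next word
bigram_counts = Counter(zip(tokens, tokens[1:]))       # c_{w_{t-1} w_t}: counts of adjacent pairs

def p_hat(word, prev):
    """Estimate P{w_t | w_{t-1}} = c_{w_{t-1} w_t} / c_{w_{t-1}} (no smoothing)."""
    return bigram_counts[(prev, word)] / prev_counts[prev]

print(p_hat("cat", "the"))    # "the cat" appears once, "the" is followed by some word twice: 1 / 2 = 0.5
```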
TopHat Quiz
📗 [1 point] Given a training set (the script of "Guardians of the Galaxy" for Vin Diesel: Wikipedia), "I am Groot, I am Groot, ... (13 times), ..., I am Groot, We are Groot". What are the maximum likelihood estimates of the bigram model based on this training set? What is the probability of observing a new sentence "I am Groot" based on the estimated bigram model?





# Transition Matrix

📗 The bigram probabilities can be stored in a matrix called the transition matrix of a Markov chain. The number in row \(i\) column \(j\) is the probability \(\mathbb{P}\left\{j | i\right\}\) or the estimated probability \(\hat{\mathbb{P}}\left\{j | i\right\}\): Link.
📗 Given the initial distribution of word types, the distribution of the next token can be found by multiplying the initial distribution (as a row vector) by the transition matrix.
📗 The stationary distribution of a Markov chain is an initial distribution such that all subsequent distributions will be the same as the initial distribution, which means if the transition matrix is \(M\), then the stationary distribution is a distribution \(p\) satisfying \(p M = p\).
Math Note
📗 An alternative way to compute the stationary distribution (if it exists) is to start with any initial distribution \(p_{0}\) and multiply it by \(M\) an infinite number of times (that is, \(p^\top_{0} M^{\infty}\)).
📗 It is easier to find powers of diagonal matrices, so if the transition matrix can be written as \(M = P D P^{-1}\) where \(D\) is a diagonal matrix (off-diagonal entries are 0, diagonal entries are called eigenvalues), and \(P\) is the matrix where the columns are eigenvectors, then \(M^{\infty} = \left(P D P^{-1}\right)\left(P D P^{-1}\right) ... = P D^{\infty} P^{-1}\). 
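📗 A minimal NumPy sketch (the 2-by-2 transition matrix is made up for illustration) of the next-token distribution and the stationary distribution:
```python
import numpy as np

# Row i, column j stores P(j | i) for a 2-word vocabulary {a, b} (made-up numbers).
M = np.array([[0.9, 0.1],     # P(a | a), P(b | a)
              [0.5, 0.5]])    # P(a | b), P(b | b)

p0 = np.array([1.0, 0.0])     # initial distribution as a row vector (start at word a)
print(p0 @ M)                 # distribution of the next token: [0.9 0.1]

# Stationary distribution p with p M = p, found by multiplying by M many times (power iteration).
p = p0
for _ in range(1000):
    p = p @ M
print(p)                      # approximately [0.8333 0.1667], that is [5/6, 1/6]

# Check: p is the eigenvector of M transpose with eigenvalue 1, normalized to sum to 1.
eigenvalues, eigenvectors = np.linalg.eig(M.T)
stationary = np.real(eigenvectors[:, np.argmax(np.real(eigenvalues))])
print(stationary / stationary.sum())    # approximately [0.8333 0.1667]
```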



# Trigram Model

📗 The same formula can be applied to trigram models: \(\hat{\mathbb{P}}\left\{w_{t} | w_{t-1}, w_{t-2}\right\} = \dfrac{c_{w_{t-2} w_{t-1} w_{t}}}{c_{w_{t-2} w_{t-1}}}\).
📗 In a document, some longer sequences of tokens never appear; for example, when \(w_{t-2} w_{t-1}\) never appears, the maximum likelihood estimator \(\hat{\mathbb{P}}\left\{w_{t} | w_{t-1}, w_{t-2}\right\}\) is \(\dfrac{0}{0}\), which is undefined. As a result, Laplace smoothing (add-one smoothing) is often used: \(\hat{\mathbb{P}}\left\{w_{t} | w_{t-1}, w_{t-2}\right\} = \dfrac{c_{w_{t-2} w_{t-1} w_{t}} + 1}{c_{w_{t-2} w_{t-1}} + m}\), where \(m\) is the number of unique words in the document.
📗 Laplace smoothing can be used for bigram and unigram models too: \(\hat{\mathbb{P}}\left\{w_{t} | w_{t-1}\right\} = \dfrac{c_{w_{t-1} w_{t}} + 1}{c_{w_{t-1}} + m}\) for bigram and \(\hat{\mathbb{P}}\left\{w_{t}\right\} = \dfrac{c_{w_{t}} + 1}{c_{1} + c_{2} + ... + c_{m} + m}\) for unigram.
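📗 A minimal Python sketch of trigram estimation with add-one (Laplace) smoothing (the toy training set is an assumption):
```python
from collections import Counter

tokens = "i am groot i am groot we are groot".split()            # assumed toy training set
m = len(set(tokens))                                             # number of unique words, here m = 5

pair_counts = Counter(zip(tokens, tokens[1:]))                   # c_{w_{t-2} w_{t-1}}
trigram_counts = Counter(zip(tokens, tokens[1:], tokens[2:]))    # c_{w_{t-2} w_{t-1} w_t}

def p_hat_smoothed(word, prev2, prev1):
    """Estimate P{w_t | w_{t-1}, w_{t-2}} = (c_{w_{t-2} w_{t-1} w_t} + 1) / (c_{w_{t-2} w_{t-1}} + m)."""
    return (trigram_counts[(prev2, prev1, word)] + 1) / (pair_counts[(prev2, prev1)] + m)

print(p_hat_smoothed("groot", "i", "am"))     # seen context "i am": (2 + 1) / (2 + 5) = 3/7
print(p_hat_smoothed("groot", "we", "am"))    # unseen context "we am": (0 + 1) / (0 + 5) = 0.2
```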



# N-gram Model

📗 In general, N-gram probabilities can be estimated in a similar way: Wikipedia. A code sketch follows the list below.
➩ Without smoothing: \(\hat{\mathbb{P}}\left\{w_{t} | w_{t-1}, w_{t-2}, ..., w_{t-N+1}\right\} = \dfrac{c_{w_{t-N+1}, w_{t-N+2}, ..., w_{t}}}{c_{w_{t-N+1}, w_{t-N+2}, ..., w_{t-1}}}\).
➩ With add-one Laplace smoothing: \(\hat{\mathbb{P}}\left\{w_{t} | w_{t-1}, w_{t-2}, ..., w_{t-N+1}\right\} = \dfrac{c_{w_{t-N+1}, w_{t-N+2}, ..., w_{t}} + 1}{c_{w_{t-N+1}, w_{t-N+2}, ..., w_{t-1}} + m}\), where \(m\) is the number of unique words in the document.
➩ With general Laplace smoothing with parameter \(\delta\): \(\hat{\mathbb{P}}\left\{w_{t} | w_{t-1}, w_{t-2}, ..., w_{t-N+1}\right\} = \dfrac{c_{w_{t-N+1}, w_{t-N+2}, ..., w_{t}} + \delta}{c_{w_{t-N+1}, w_{t-N+2}, ..., w_{t-1}} + \delta m}\).
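📗 A minimal Python sketch of a general N-gram estimator (for \(N \geq 2\)) with smoothing parameter \(\delta\); the function name and the toy training set are assumptions:
```python
from collections import Counter

def ngram_estimator(tokens, n, delta=1.0):
    """Return a function estimating P(w_t | w_{t-N+1}, ..., w_{t-1}) with Laplace smoothing delta (n >= 2)."""
    m = len(set(tokens))                                                 # vocabulary size
    context_counts = Counter(zip(*[tokens[i:] for i in range(n - 1)]))   # counts of (N-1)-word contexts
    ngram_counts = Counter(zip(*[tokens[i:] for i in range(n)]))         # counts of N-word sequences

    def p_hat(word, context):
        # context is the tuple (w_{t-N+1}, ..., w_{t-1})
        return (ngram_counts[context + (word,)] + delta) / (context_counts[context] + delta * m)

    return p_hat

tokens = "i am groot i am groot we are groot".split()    # assumed toy training set (m = 5)
p_hat = ngram_estimator(tokens, n=3, delta=0.5)
print(p_hat("groot", ("i", "am")))    # (2 + 0.5) / (2 + 0.5 * 5) = 2.5 / 4.5, approximately 0.556
```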
TopHat Discussion
📗 Go to the Google Books Ngram Viewer: Link. Find a phrase whose N-gram probability is decreasing over time.



📗 Notes and code adapted from the course taught by Professors Jerry Zhu, Yingyu Liang, and Charles Dyer.
📗 Content from note blocks marked "optional" and content from Wikipedia and other demo links are helpful for understanding the materials, but will not be explicitly tested on the exams.
📗 Please use Ctrl+F5 or Shift+F5 or Shift+Command+R or Incognito mode or Private Browsing to refresh the cached JavaScript.
📗 If there is an issue with TopHat during the lectures, please submit your answers on paper (include your Wisc ID and answers) or this Google form Link at the end of the lecture.
📗 Anonymous feedback can be submitted to: Form.






Last Updated: November 21, 2024 at 3:16 AM