Young Wu's Homepage

Prev: L1, Next: L3
Course Links: Canvas, Piazza, TopHat (212925)
Zoom Links: MW 4:00, TR 1:00, TR 2:30.

Tools

📗 You can expand all TopHat Quizzes and Discussions: , and print the notes: , or download all text areas as text file: .

📗 For visibility, you can resize all diagrams on page to have a maximum height that is percent of the screen height: .

📗 Calculator:

📗 Canvas:

Slide:

# Natural Language Processing

📗 When processing language data, documents need to be first turned into sequences of word tokens.

➩ Split the string by space and punctuation.

➩ Remove stop-words such as "the", "of", "a", "with".

➩ Lower case all characters.

➩ Stemming or lemmatization words: change "looks", "looked", "looking" to "look".

📗 Each document needs to be converted into a numerical vector for supervised learning tasks.

➩ Bag of words feature uses the number of occurrences of each word type: Wikipedia.

➩ Term-Frequency Inverse-Document-Frequency (TF-IDF) feature adjusts for whether each word type appears in multiple documents: Wikipedia.

# Bag of Words Feature

📗 Given a document \(i \in \left\{1, 2, ..., n\right\}\) and vocabulary with size \(m\), let \(c_{ij}\) be the number of times word \(j \in \left\{1, 2, ..., m\right\}\) appears in the document \(i\), the bag of words representation of document \(i\) is \(x_{i} = \left(x_{i 1}, x_{i 2}, ..., x_{i m}\right)\), where \(x_{ij} = \dfrac{c_{ij}}{c_{i 1} + c_{i 2} + ... + c_{i m}}\).

📗 Sometimes, the features are not normalized, meaning \(x_{ij} = c_{ij}\).

# TF IDF Features

📗 Term frequency is defined the same way as in the bag of words features, \(T F_{ij} = \dfrac{c_{ij}}{c_{i 1} + c_{i 2} + ... + c_{i m}}\).

📗 Inverse document frequency is defined as \(I D F_{j} = \log \left(\dfrac{n}{\left| \left\{i : c_{ij} > 0\right\} \right|}\right)\), where \(\left| \left\{i : c_{ij} > 0\right\} \right|\) is the number of documents that contain word \(j\).

📗 TF IDF representation of document \(i\) is \(x_{i} = \left(x_{i 1}, x_{i 2}, ..., x_{i m}\right)\), where \(x_{ij} = T F_{ij} \cdot I D F_{j}\).

TopHat Quiz

📗 [1 points] Given three documents "Guardians of the Galaxy", "Guardians of the Galaxy Vol. 2", "Guardians of the Galaxy Vol. 3", compute the bag of words features and the TF-IDF features of the 3 documents.

Document	Phrase	Number of times
"Guardians of the Galaxy"	"I am Groot"	13
-	"We are Groot"	1
"Guardians of the Galaxy Vol. 2"	"I am Groot"	17
"Guardians of the Galaxy Vol. 3"	"I am Groot"	13
-	"I love you guys"	1

📗 Answer:

# Supervised Learning Tasks

📗 If the documents are labeled, then a supervised learning task is: given a training set of document features (for example, bag of words, TF-IDF) and their labels, estimate a function that predicts the label for new documents.

➩ Given emails, predict whether they are spams or hams.

➩ Given comments, predict whether they are offensive or not.

➩ Given reviews, predict whether they are positive or negative.

➩ Given essays, predict the grade A, B, ... or F.

➩ Given documents, predict which language it is from.

📗 If the training set is \(\left(X, Y\right)\), where \(X = \left(x_{1}, x_{2}, ..., x_{n}\right)\) are features of the documents, and \(Y = \left(y_{1}, y_{2}, ..., y_{n}\right)\) are labels, then the problem is to estimate \(\hat{\mathbb{P}}\left\{y | x\right\}\), and given a new document \(x'\), the predicted label can be the \(y'\) that maximizes \(\hat{\mathbb{P}}\left\{y' | x'\right\}\).

TopHat Discussion

📗 Play the "guess the prompt" game: Link, Link.

# Discriminative vs Generative Models

📗 Discriminative models directly estimate the probabilities \(\hat{\mathbb{P}}\left\{y | x\right\}\).

📗 Generative models estimate the likelihood probabilities \(\hat{\mathbb{P}}\left\{x | y\right\}\) and the prior probabilities \(\hat{\mathbb{P}}\left\{y\right\}\), then computes \(\hat{\mathbb{P}}\left\{y | x\right\} = \dfrac{\hat{\mathbb{P}}\left\{x | y\right\} \cdot \hat{\mathbb{P}}\left\{y\right\}}{\hat{\mathbb{P}}\left\{x | y = 1\right\} \cdot \hat{\mathbb{P}}\left\{y = 1\right\} + \hat{\mathbb{P}}\left\{x | y = 2\right\} \cdot \mathbb{P}\left\{y = 2\right\} + ... + \hat{\mathbb{P}}\left\{x | y = k\right\} \cdot \hat{\mathbb{P}}\left\{y = k\right\}}\) using Bayes rule.

TopHat Quiz

📗 [1 points] Consider the AmongUs example on Wikipedia, Image, what are the probabilities \(\mathbb{P}\left\{y = 0 | x = 1\right\}\) and \(\mathbb{P}\left\{y = 1 | x = 1\right\}\)?

Number of occurrences	Being suspicious \(x = 1\)	Not being suspicious \(x = 0\)
An assassin \(y = 1\)	3	1
Not an assassin \(y = 0\)	2	6

📗 Answer:

# Naive Bayes Classifier

📗 Naive Bayes classifier is a simple Bayesian network that assumes the features are independent: Wikipedia.

📗 The key assumption is the independence assumption: \(\mathbb{P}\left\{x_{i} | y\right\} = \mathbb{P}\left\{x_{i 1}, x_{i 2}, ..., x_{i m} | y\right\} = \mathbb{P}\left\{x_{i 1} | y\right\} \mathbb{P}\left\{x_{i 2} | y\right\} ... \mathbb{P}\left\{x_{i m} | y\right\}\).

TopHat Quiz

(Past Exam Question) ID:

📗 [4 points] Consider the problem of detecting if an email message is a spam. Say we use four random variables to model this problem: a binary class variable \(S\) indicates if the message is a spam, and three binary feature variables: \(C, F, N\) indicating whether the message contains "Cash", "Free", "Now". We use a Naive Bayes classifier with associated CPTs (Conditional Probability Table):

Prior	\(\mathbb{P}\left\{S = 1\right\}\) =	-	-
Hams	\(\mathbb{P}\left\{C = 1 \| S = 0\right\}\) =	\(\mathbb{P}\left\{F = 1 \| S = 0\right\}\) =	\(\mathbb{P}\left\{N = 1 \| S = 0\right\}\) =
Spams	\(\mathbb{P}\left\{C = 1 \| S = 1\right\}\) =	\(\mathbb{P}\left\{F = 1 \| S = 1\right\}\) =	\(\mathbb{P}\left\{N = 1 \| S = 1\right\}\) =

Compute \(\mathbb{P}\){\(S = 1\) | \(C\) = , \(F\) = , \(N\) = }.

📗 Answer: .

# Other Naive Bayes Models

📗 There are other common Naive Bayes models including multinomial naive Bayes (used when the features are bag of words without normalization) and Gaussian naive Bayes (used when the features are continuous).

📗 If the naive Bayes independence assumption is relaxed, the resulting more general model is called Bayesian network (or Bayes network).

Additional Note (Optional)

📗 If the features are bag of words (without normalization), then a common model of \(\mathbb{P}\left\{x_{i} | y\right\}\) is the multinomial model with unigram probabilities for each label: \(\mathbb{P}\left\{x_{i} | y\right\} = \dfrac{\left(x_{i 1} + x_{i 2} + ... + x_{i m}\right)!}{x_{i 1}! x_{i 2}! ... x_{i m}!} p_{y 1}^{x_{i 1}} p_{y 2}^{x_{i 2}} ... p_{y m}^{x_{i m}}\), where \(p_{y j}\) is the unigram probability that word \(j\) appears in a document with label \(y\).

➩ A special case when \(x_{ij}\) is binary or \(x_{ij} = 0, 1\), for example, whether a document contains a word type, is called Bernoulli naive Bayes.

➩ Technically, in the multinomial distribution, \(x_{i 1}, x_{i 2}, ..., x_{i m}\) are not independent due to the \(\left(x_{i 1} + x_{i 2} + ... + x_{i m}\right)!\), but the multinomial Bayes model is still considered "naive".

➩ Multinomial naive Bayes is consider a linear model since the log posterior distribution is linear in the features: \(\log \mathbb{P}\left\{y | x_{i}\right\} = c \left(\log \mathbb{P}\left\{y\right\} + x_{i 1} \log p_{y 1} + x_{i 2} \log p_{y 2} + ... + x_{i m} \log p_{y m}\right)\), where \(c\) is some constant.

📗 If the features are continuous (not binary or integer counts), then a common model is the Gaussian naive Bayes model: \(\mathbb{P}\left\{x_{i} | y\right\} = \mathbb{P}\left\{x_{i 1} | y\right\} \mathbb{P}\left\{x_{i 2} | y\right\} ... \mathbb{P}\left\{x_{i m} | y\right\}\), where \(\mathbb{P}\left\{x_{ij} | y\right\} = \dfrac{1}{\sqrt{2 \pi \sigma_{y j}^{2}}} e^{- \dfrac{\left(x_{ij} - \mu_{y j}\right)^{2}}{2 \sigma_{y j}^{2}}}\), where \(\mu_{y j}\) is the mean of feature \(j\) for documents with label \(y\), and \(\sigma^{2}_{y j}\) is the variance.

➩ The maximum likelihood estimates of \(\mu_{y j}\) is the sample mean of the feature \(j\) for documents with label \(y\), and \(\sigma^{2}_{y j}\) is the sample variance.

📗 If the naive Bayes independence assumption is relaxed, the resulting model is called a Bayesian network (or Bayes network). Some examples of Bayesian networks: Wikipedia, Link, Link, Link.

📗 Notes and code adapted from the course taught by Professors Jerry Zhu, Yingyu Liang, and Charles Dyer.

📗 Content from note blocks marked "optional" and content from Wikipedia and other demo links are helpful for understanding the materials, but will not be explicitly tested on the exams.

📗 Please use Ctrl+F5 or Shift+F5 or Shift+Command+R or Incognito mode or Private Browsing to refresh the cached JavaScript.

📗 You can expand all TopHat Quizzes and Discussions: , and print the notes: , or download all text areas as text file: .

📗 If there is an issue with TopHat during the lectures, please submit your answers on paper (include your Wisc ID and answers) or this Google form Link at the end of the lecture.

📗 Anonymous feedback can be submitted to: Form.

Prev: L1, Next: L3

Last Updated: July 01, 2025 at 1:47 AM

Prior	\(\mathbb{P}\left\{S = 1\right\}\) =	-	-
Hams	\(\mathbb{P}\left\{C = 1 \| S = 0\right\}\) =	\(\mathbb{P}\left\{F = 1 \| S = 0\right\}\) =	\(\mathbb{P}\left\{N = 1 \| S = 0\right\}\) =
Spams	\(\mathbb{P}\left\{C = 1 \| S = 1\right\}\) =	\(\mathbb{P}\left\{F = 1 \| S = 1\right\}\) =	\(\mathbb{P}\left\{N = 1 \| S = 1\right\}\) =