# Lecture Notes
## Binary Classification
➩ For binary classification, the labels are binary, either 0 or 1, but the output of the classifier \(\hat{f}\left(x\right)\) can be a number between 0 and 1.
➩ \(\hat{f}\left(x\right)\) usually represents the probability that the label is 1, and it is sometimes called the activation value.
➩ If a deterministic prediction \(\hat{y}\) is required, it is usually set to \(\hat{y} = 0\) if \(\hat{f}\left(x\right) \leq 0.5\) and \(\hat{y} = 1\) if \(\hat{f}\left(x\right) > 0.5\); a short sketch of this threshold rule follows the table below.
| Item | Input (Features) | Output (Labels) | Notes |
| --- | --- | --- | --- |
| 1 | \(\left(x_{11}, x_{12}, ..., x_{1m}\right)\) | \(y_{1} \in \left\{0, 1\right\}\) | training data |
| 2 | \(\left(x_{21}, x_{22}, ..., x_{2m}\right)\) | \(y_{2} \in \left\{0, 1\right\}\) | |
| 3 | \(\left(x_{31}, x_{32}, ..., x_{3m}\right)\) | \(y_{3} \in \left\{0, 1\right\}\) | |
| ... | ... | ... | ... |
| n | \(\left(x_{n1}, x_{n2}, ..., x_{nm}\right)\) | \(y_{n} \in \left\{0, 1\right\}\) | used to figure out \(y \approx \hat{f}\left(x\right)\) |
| new | \(\left(x'_{1}, x'_{2}, ..., x'_{m}\right)\) | \(y' \in \left[0, 1\right]\) | guess \(y' = 1\) with probability \(\hat{f}\left(x'\right)\) |
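➩ A minimal sketch of the 0.5 threshold rule in NumPy (the array name and the activation values below are made up for illustration):

```python
import numpy as np

# hypothetical activation values f_hat(x) for five items
activations = np.array([0.1, 0.5, 0.51, 0.9, 0.3])

# deterministic predictions: y_hat = 1 exactly when f_hat(x) > 0.5
y_hat = (activations > 0.5).astype(int)
print(y_hat)  # [0 0 1 1 0]
```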
## Confusion Matrix
➩ A confusion matrix contains the count of items for every label-prediction combination; in the case of binary classification, it is a 2 by 2 matrix with the following entries:
(1) number of items labeled 1 and predicted to be 1 (true positive: TP),
(2) number of items labeled 1 and predicted to be 0 (false negative: FN),
(3) number of items labeled 0 and predicted to be 1 (false positive: FP),
(4) number of items labeled 0 and predicted to be 0 (true negative: TN).
| Count | Predict 1 | Predict 0 |
| --- | --- | --- |
| Label 1 | TP | FN |
| Label 0 | FP | TN |
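➩ A small sketch of how these four counts can be tallied in NumPy (the label and prediction arrays are made up for illustration):

```python
import numpy as np

y = np.array([1, 1, 0, 0, 1, 0])      # true labels (made up)
y_hat = np.array([1, 0, 0, 1, 1, 0])  # predictions (made up)

TP = int(np.sum((y == 1) & (y_hat == 1)))  # labeled 1, predicted 1
FN = int(np.sum((y == 1) & (y_hat == 0)))  # labeled 1, predicted 0
FP = int(np.sum((y == 0) & (y_hat == 1)))  # labeled 0, predicted 1
TN = int(np.sum((y == 0) & (y_hat == 0)))  # labeled 0, predicted 0
print(TP, FN, FP, TN)  # 2 1 1 2
```

➩ scikit-learn's `sklearn.metrics.confusion_matrix(y, y_hat)` computes the same counts, but it orders rows and columns by ascending label, so label 0 comes first there.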
## Precision and Recall
➩ Precision is the positive predictive value, or \(\dfrac{TP}{TP + FP}\).
➩ Recall is the true positive rate, or \(\dfrac{TP}{TP + FN}\).
➩ F-measure (or F1 score) is \(2 \cdot \dfrac{p \cdot r}{p + r}\), where \(p\) is the precision and \(r\) is the recall.
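➩ A sketch of these three formulas, reusing the counts from the confusion-matrix sketch above:

```python
# counts from the confusion-matrix sketch above
TP, FN, FP, TN = 2, 1, 1, 2

precision = TP / (TP + FP)  # positive predictive value
recall = TP / (TP + FN)     # true positive rate
f1 = 2 * precision * recall / (precision + recall)
print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.67 0.67 0.67
```

➩ `sklearn.metrics.precision_score`, `recall_score`, and `f1_score` compute the same quantities directly from the labels and predictions.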
## Spam Example
➩ Use 80 percent of the spambase dataset to train a spam detector and test it on the remaining 20 percent of the dataset. Use a linear support vector machine as the spam detector, and compare it with a random spam detector that predicts spam at random with probability 0.5.
➩ Code to make the predictions:
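➩ A minimal sketch of the experiment, assuming the spambase data is in a local spambase.csv whose last column is the 0/1 spam label (the file name, split seed, and lack of feature scaling are assumptions, so the exact counts will not match the tables below):

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# assumed local copy of the spambase dataset; last column is the label
data = np.loadtxt("spambase.csv", delimiter=",")
x, y = data[:, :-1], data[:, -1]

# 80/20 train/test split (seed chosen arbitrarily)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

# linear support vector machine spam detector
svm = LinearSVC().fit(x_train, y_train)
print(confusion_matrix(y_test, svm.predict(x_test)))

# random baseline: predict spam with probability 0.5
random_pred = np.random.binomial(1, 0.5, size=len(y_test))
print(confusion_matrix(y_test, random_pred))
```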
➩ The confusion matrix for the SVM detector is:
| Count | Predict Spam | Predict Ham |
| --- | --- | --- |
| Label Spam | 506 | 25 |
| Label Ham | 50 | 340 |
➩ Precision is \(\dfrac{506}{506 + 50} \approx 0.91\), Recall is \(\dfrac{506}{506 + 25} \approx 0.95\), so the F measure is \(2 \cdot \dfrac{0.91 \cdot 0.95}{0.91 + 0.95} \approx 0.93\).
➩ The confusion matrix for the perfect spam detector is:
| Count | Predict Spam | Predict Ham |
| --- | --- | --- |
| Label Spam | 531 | 0 |
| Label Ham | 0 | 390 |
➩ Precision is \(\dfrac{531}{531 + 0} = 1\), Recall is \(\dfrac{531}{531 + 0} = 1\), so the F measure is \(2 \cdot \dfrac{1 \cdot 1}{1 + 1} = 1\), the highest possible value.
➩ The confusion matrix for the random detector is:
| Count | Predict Spam | Predict Ham |
| --- | --- | --- |
| Label Spam | 278 | 253 |
| Label Ham | 192 | 198 |
➩ Precision is \(\dfrac{278}{278 + 192} \approx 0.59\), Recall is \(\dfrac{278}{278 + 253} \approx 0.52\), so the F measure is \(2 \cdot \dfrac{0.59 \cdot 0.52}{0.59 + 0.52} \approx 0.55\).
## Multi-class Classification
➩ Multi-class classification is different from regression since the classes are not ordered.
➩ Three approaches are used:
(1) Directly predict the probability of each class.
(2) One-vs-one classifiers.
(3) One-vs-all (One-vs-rest) classifiers.
## Class Probabilities
➩ If there are \(k\) classes, the classifier could output \(k\) numbers that are between 0 and 1 and sum to 1, representing the probability that the item belongs to each class; these are sometimes called activation values.
➩ If a deterministic prediction is required, the classifier can predict the label with the largest probability.
## One-vs-one Classifiers
➩ If there are \(k\) classes, then \(\dfrac{1}{2} k \left(k - 1\right)\) binary classifiers will be trained, 1 vs 2, 1 vs 3, 2 vs 3, ...
➩ The prediction can be the label that receives the largest number of votes from the one-vs-one binary classifiers.
## One-vs-all Classifiers
➩ If there are \(k\) classes, then \(k\) binary classifiers will be trained, 1 vs not 1, 2 vs not 2, 3 vs not 3, ...
➩ The prediction can be the label that achieves the largest probability in the one-vs-all binary classifiers.
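➩ A sketch of both reductions using scikit-learn's wrappers with a linear support vector machine as the underlying binary classifier (the iris dataset is used here only because it has \(k = 3\) classes):

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

x, y = load_iris(return_X_y=True)  # 3 classes, so k = 3

# one-vs-one: trains k(k - 1)/2 = 3 binary classifiers and predicts by voting
ovo = OneVsOneClassifier(LinearSVC()).fit(x, y)

# one-vs-all: trains k = 3 binary classifiers and predicts by largest score
ovr = OneVsRestClassifier(LinearSVC()).fit(x, y)

print(ovo.predict(x[:5]), ovr.predict(x[:5]))
```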
## Multi-class Confusion Matrix
➩ A three-class confusion matrix can be written as follows.
| Count | Predict 0 | Predict 1 | Predict 2 |
| --- | --- | --- | --- |
| Class 0 | \(c_{y = 0, \hat{y} = 0}\) | \(c_{y = 0, \hat{y} = 1}\) | \(c_{y = 0, \hat{y} = 2}\) |
| Class 1 | \(c_{y = 1, \hat{y} = 0}\) | \(c_{y = 1, \hat{y} = 1}\) | \(c_{y = 1, \hat{y} = 2}\) |
| Class 2 | \(c_{y = 2, \hat{y} = 0}\) | \(c_{y = 2, \hat{y} = 1}\) | \(c_{y = 2, \hat{y} = 2}\) |
➩ Precision and recall can be defined the same way as before, for example, precision of class \(i\) is \(\dfrac{c_{i i}}{\displaystyle\sum_{j} c_{j i}}\), and recall of class \(i\) is \(\dfrac{c_{i i}}{\displaystyle\sum_{j} c_{i j}}\).
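➩ A small sketch of these two formulas, applied to a made-up three-class confusion matrix:

```python
import numpy as np

# rows are labels y, columns are predictions y_hat (made-up counts)
c = np.array([[50,  2,  3],
              [ 4, 40,  6],
              [ 1,  5, 45]])

for i in range(3):
    precision = c[i, i] / c[:, i].sum()  # column sum: all items predicted i
    recall = c[i, i] / c[i, :].sum()     # row sum: all items labeled i
    print(i, round(precision, 2), round(recall, 2))
```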
## MNIST Example
➩ Use the MNIST dataset (or the csv files) and train a handwritten digit recognition classifier using logistic regression.
➩ Plot the confusion matrix and some examples of incorrectly classified images.
➩ Code to make the predictions with multinomial logistic regression and with one-vs-all (one-vs-rest) logistic regression:
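➩ A minimal sketch of both variants, assuming the data is fetched from OpenML (the download is large and training takes a few minutes; with scikit-learn's default lbfgs solver, LogisticRegression fits a multinomial model on multi-class data):

```python
from sklearn.datasets import fetch_openml
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

# fetch MNIST from OpenML; pixel intensities are scaled to [0, 1]
x, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
x = x / 255.0
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

# multinomial logistic regression: one model outputting 10 class probabilities
multi = LogisticRegression(max_iter=1000).fit(x_train, y_train)

# one-vs-all logistic regression: 10 binary classifiers, one per digit
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(x_train, y_train)

# 10 by 10 confusion matrix for the multinomial model
print(confusion_matrix(y_test, multi.predict(x_test)))
```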
## Conversion from Probabilities to Predictions
➩ Probability predictions are given in a matrix, where row \(i\) is the probability prediction for item \(i\) and entry \(\left(i, j\right)\) is the predicted probability that item \(i\) has label \(j\).
➩ Given an \(n \times m\) probability prediction matrix `p`, `p.max(axis = 1)` computes the \(n \times 1\) vector of maximum probabilities, one for each row, and `p.argmax(axis = 1)` computes the column indices of those maximum probabilities, which correspond to the predicted labels of the items.
➩ For example, if `p` is \(\begin{bmatrix} 0.1 & 0.2 & 0.7 \\ 0.8 & 0.1 & 0.1 \\ 0.4 & 0.5 & 0.1 \end{bmatrix}\), then `p.max(axis = 1)` is \(\begin{bmatrix} 0.7 \\ 0.8 \\ 0.5 \end{bmatrix}\) and `p.argmax(axis = 1)` is \(\begin{bmatrix} 2 \\ 0 \\ 1 \end{bmatrix}\).
➩ Note: `p.max(axis = 0)` and `p.argmax(axis = 0)` would instead compute the max and argmax of each column (across the rows).
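➩ The worked example above can be checked directly in NumPy:

```python
import numpy as np

p = np.array([[0.1, 0.2, 0.7],
              [0.8, 0.1, 0.1],
              [0.4, 0.5, 0.1]])

print(p.max(axis=1))     # [0.7 0.8 0.5], the largest probability in each row
print(p.argmax(axis=1))  # [2 0 1], the predicted label of each item
print(p.max(axis=0))     # [0.8 0.5 0.7], the largest probability in each column
```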
## Note on Pixel Intensity
➩ Pixel intensity usually follows the convention that 0 means black and 1 means white, as in `matplotlib.pyplot.imshow` (with a gray colormap) and `skimage.io.imshow`, unlike the example from the previous lecture.
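➩ A small sketch of this convention with matplotlib (the image is made up; the gray colormap with explicit limits maps 0 to black and 1 to white):

```python
import matplotlib.pyplot as plt
import numpy as np

# made-up 28 by 28 image with intensities ranging from 0 (black) to 1 (white)
img = np.linspace(0, 1, 28 * 28).reshape(28, 28)
plt.imshow(img, cmap="gray", vmin=0, vmax=1)
plt.show()
```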
Notes and code adapted from the course taught by Yiyin Shen and Tyler Caraza-Harter.