
# Lecture

📗 The lecture is in person, but you can join Zoom: 8:50-9:40 or 11:00-11:50. Zoom recordings can be viewed on Canvas -> Zoom -> Cloud Recordings. They will be moved to Kaltura over the weekends.
📗 The in-class (participation) quizzes should be submitted on TopHat (Code:741565), but you can submit your answers through Form at the end of the lectures too.
📗 The Python notebooks used during the lectures can also be found on: GitHub. They will be updated weekly.


# Lecture Notes

TopHat Exam 2 Survey
➩ Is Exam 2 hard?
➩ A: Very easy
➩ B: Easy
➩ C: OK
➩ D: Hard
➩ E: Very hard

 Binary Classification
➩ For binary classification, the labels are binary, either 0 or 1, but the output of the classifier \(\hat{f}\left(x\right)\) can be a number between 0 and 1.
➩ \(\hat{f}\left(x\right)\) usually represents the probability that the label is 1, and it is sometimes called the activation value.
➩ If a deterministic prediction \(\hat{y}\) is required, it is usually set to \(\hat{y} = 0\) if \(\hat{f}\left(x\right) \leq 0.5\) and \(\hat{y} = 1\) if \(\hat{f}\left(x\right) > 0.5\).
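The thresholding rule above can be sketched in a few lines of Python. The function name `predict_labels` and the activation values are made up for illustration; only the 0.5 cutoff comes from the notes.

```python
def predict_labels(activations, threshold=0.5):
    """Return 1 where the activation is strictly above the threshold, else 0."""
    return [1 if a > threshold else 0 for a in activations]


# Hypothetical activation values f_hat(x) for five items.
activations = [0.10, 0.50, 0.51, 0.93, 0.49]
predictions = predict_labels(activations)
print(predictions)  # [0, 0, 1, 1, 0]
```

Note that an activation of exactly 0.5 is mapped to 0, matching the \(\leq\) in the rule.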

| Item | Input (Features) | Output (Labels) | Note |
| --- | --- | --- | --- |
| 1 | \(\left(x_{11}, x_{12}, ..., x_{1m}\right)\) | \(y_{1} \in \left\{0, 1\right\}\) | training data |
| 2 | \(\left(x_{21}, x_{22}, ..., x_{2m}\right)\) | \(y_{2} \in \left\{0, 1\right\}\) | - |
| 3 | \(\left(x_{31}, x_{32}, ..., x_{3m}\right)\) | \(y_{3} \in \left\{0, 1\right\}\) | - |
| ... | ... | ... | ... |
| n | \(\left(x_{n1}, x_{n2}, ..., x_{nm}\right)\) | \(y_{n} \in \left\{0, 1\right\}\) | used to figure out \(y \approx \hat{f}\left(x\right)\) |
| new | \(\left(x'_{1}, x'_{2}, ..., x'_{m}\right)\) | \(y' \in \left[0, 1\right]\) | guess \(y' = 1\) with probability \(\hat{f}\left(x'\right)\) |


 Confusion Matrix
➩ A confusion matrix contains the counts of items for every label-prediction combination. For binary classification, it is a 2 by 2 matrix with the following entries:
(1) number of items labeled 1 and predicted to be 1 (true positive: TP),
(2) number of items labeled 1 and predicted to be 0 (false negative: FN),
(3) number of items labeled 0 and predicted to be 1 (false positive: FP),
(4) number of items labeled 0 and predicted to be 0 (true negative: TN).

| Count | Predict 1 | Predict 0 |
| --- | --- | --- |
| Label 1 | TP | FN |
| Label 0 | FP | TN |
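As a rough illustration, the four entries can be counted directly from a list of labels and a list of predictions. The helper name `binary_confusion_counts` and the toy data are assumptions for this sketch, not part of the lecture notebooks.

```python
def binary_confusion_counts(labels, predictions):
    """Count TP, FN, FP, TN for 0/1 labels and 0/1 predictions."""
    tp = fn = fp = tn = 0
    for y, y_hat in zip(labels, predictions):
        if y == 1 and y_hat == 1:
            tp += 1  # labeled 1, predicted 1
        elif y == 1 and y_hat == 0:
            fn += 1  # labeled 1, predicted 0
        elif y == 0 and y_hat == 1:
            fp += 1  # labeled 0, predicted 1
        else:
            tn += 1  # labeled 0, predicted 0
    return tp, fn, fp, tn


# Hypothetical labels and predictions for eight items.
labels = [1, 1, 1, 0, 0, 0, 1, 0]
predictions = [1, 0, 1, 0, 1, 0, 1, 0]
tp, fn, fp, tn = binary_confusion_counts(labels, predictions)

# Rows are labels (1, then 0); columns are predictions (1, then 0).
matrix = [[tp, fn], [fp, tn]]
print(matrix)  # [[3, 1], [1, 3]]
```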


📗 Precision and Recall
➩ Precision is the positive predictive value, or \(\dfrac{TP}{TP + FP}\).
➩ Recall is the true positive rate, or \(\dfrac{TP}{TP + FN}\).
➩ F-measure (or F1 score) is \(2 \cdot \dfrac{p \cdot r}{p + r}\), where \(p\) is the precision and \(r\) is the recall: Link.
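Substituting the definitions of precision and recall into the F-measure gives an equivalent expression written only in terms of the confusion matrix counts (assuming \(TP > 0\) so that the cancellation is valid). This form is convenient because it avoids rounding \(p\) and \(r\) before combining them.

```latex
\begin{aligned}
F_1 &= 2 \cdot \dfrac{p \cdot r}{p + r},
\qquad p = \dfrac{TP}{TP + FP}, \quad r = \dfrac{TP}{TP + FN} \\
p \cdot r &= \dfrac{TP^2}{(TP + FP)(TP + FN)} \\
p + r &= \dfrac{TP \left(2 \, TP + FP + FN\right)}{(TP + FP)(TP + FN)} \\
F_1 &= \dfrac{2 \, TP^2}{TP \left(2 \, TP + FP + FN\right)}
     = \dfrac{2 \, TP}{2 \, TP + FP + FN}
\end{aligned}
```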

Spam Example
➩ Use 80 percent of the spambase dataset Link to train a spam detector and test it on the remaining 20 percent. Use a linear support vector machine as the spam detector, and compare it with a random spam detector that predicts spam with probability 0.5.
➩ Code to make the predictions: Notebook.
➩ The confusion matrix for the SVM detector is:
| Count | Predict Spam | Predict Ham |
| --- | --- | --- |
| Label Spam | 506 | 25 |
| Label Ham | 50 | 340 |

➩ Precision is \(\dfrac{506}{506 + 50} \approx 0.91\), Recall is \(\dfrac{506}{506 + 25} \approx 0.95\), so the F measure is \(2 \cdot \dfrac{0.91 \cdot 0.95}{0.91 + 0.95} \approx 0.93\).
➩ The confusion matrix for the perfect spam detector is:
| Count | Predict Spam | Predict Ham |
| --- | --- | --- |
| Label Spam | 531 | 0 |
| Label Ham | 0 | 390 |

➩ Precision is \(\dfrac{531}{531 + 0} = 1\), Recall is \(\dfrac{531}{531 + 0} = 1\), so the F measure is \(2 \cdot \dfrac{1 \cdot 1}{1 + 1} = 1\), the highest possible value.
➩ The confusion matrix for the random detector is:
| Count | Predict Spam | Predict Ham |
| --- | --- | --- |
| Label Spam | 278 | 253 |
| Label Ham | 192 | 198 |

➩ Precision is \(\dfrac{278}{278 + 192} \approx 0.59\), Recall is \(\dfrac{278}{278 + 253} \approx 0.52\), so the F measure is \(2 \cdot \dfrac{0.59 \cdot 0.52}{0.59 + 0.52} \approx 0.55\) (or about 0.56 if precision and recall are not rounded first).
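The arithmetic for the three detectors can be checked with a short script. The counts are copied from the confusion matrices above; the function name `precision_recall_f1` and the choice to return 0.0 when a denominator is zero are assumptions made for this sketch.

```python
def precision_recall_f1(tp, fn, fp):
    """Compute precision, recall and F1 from confusion matrix counts.

    Assumption: a ratio with a zero denominator is reported as 0.0.
    """
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


# (TP, FN, FP) with spam as the positive class.
detectors = {
    "SVM": (506, 25, 50),
    "perfect": (531, 0, 0),
    "random": (278, 253, 192),
}

results = {name: precision_recall_f1(*counts) for name, counts in detectors.items()}
for name, (p, r, f) in results.items():
    print(f"{name}: precision={p:.2f} recall={r:.2f} F1={f:.2f}")
# SVM: precision=0.91 recall=0.95 F1=0.93
# perfect: precision=1.00 recall=1.00 F1=1.00
# random: precision=0.59 recall=0.52 F1=0.56
```

The values here are computed without intermediate rounding, so they can differ in the last digit from a calculation that rounds precision and recall first.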

 Multi-class Classification
➩ Multi-class classification is different from regression since the classes are not ordered.
➩ Three approaches are used:
(1) Directly predict the probability of each class.
(2) One-vs-one classifiers.
(3) One-vs-all (One-vs-rest) classifiers.

📗 Class Probabilities
➩ If there are \(k\) classes, the classifier could output \(k\) numbers between 0 and 1 that sum up to 1, representing the probability that the item belongs to each class; these are sometimes called activation values.
➩ If a deterministic prediction is required, the classifier can predict the label with the largest probability.

📗 One-vs-one Classifiers
➩ If there are \(k\) classes, then \(\dfrac{1}{2} k \left(k - 1\right)\) binary classifiers will be trained, 1 vs 2, 1 vs 3, 2 vs 3, ...
➩ The prediction can be the label that receives the largest number of votes from the one-vs-one binary classifiers.
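A minimal sketch of the voting step is shown below. It does not train any classifiers: the dictionary `pair_winners` stands in for the outputs of already-trained binary classifiers on one item and is entirely hypothetical, as is the tie-breaking rule (smallest label wins).

```python
from collections import Counter
from itertools import combinations


def one_vs_one_pairs(classes):
    """List every unordered pair of classes; there are k(k-1)/2 of them."""
    return list(combinations(classes, 2))


def one_vs_one_predict(pair_winners):
    """Return the class with the most votes.

    pair_winners maps each (a, b) pair to the class chosen by the a-vs-b
    classifier. Assumption: ties are broken by taking the smallest label.
    """
    votes = Counter(pair_winners.values())
    best = max(votes.values())
    return min(label for label, count in votes.items() if count == best)


classes = [1, 2, 3, 4]
pairs = one_vs_one_pairs(classes)
print(len(pairs))  # 6, which equals 4 * 3 / 2

# Hypothetical winners of the six binary classifiers for a single item.
pair_winners = {(1, 2): 2, (1, 3): 3, (1, 4): 1, (2, 3): 3, (2, 4): 2, (3, 4): 3}
prediction = one_vs_one_predict(pair_winners)
print(prediction)  # 3 (three votes, versus two for class 2 and one for class 1)
```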

📗 One-vs-all Classifiers
➩ If there are \(k\) classes, then \(k\) binary classifiers will be trained, 1 vs not 1, 2 vs not 2, 3 vs not 3, ...
➩ The prediction can be the label that achieves the largest probability in the one-vs-all binary classifiers.
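The selection step for one-vs-all can be sketched as follows. The probabilities are hypothetical outputs of three already-trained "class vs not class" classifiers for one item, and the function name `one_vs_all_predict` is made up for this example.

```python
def one_vs_all_predict(class_probabilities):
    """Return the class whose one-vs-all classifier gives the largest probability.

    class_probabilities maps each class label to the probability output by
    that class's "class vs not class" binary classifier.
    """
    return max(class_probabilities, key=class_probabilities.get)


# Hypothetical outputs of the three binary classifiers for a single item.
# Each classifier is trained separately, so these need not sum to 1.
probabilities = {1: 0.30, 2: 0.85, 3: 0.40}
prediction = one_vs_all_predict(probabilities)
print(prediction)  # 2
```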

 Multi-class Confusion Matrix
➩ A three-class confusion matrix can be written as follows.
| Count | Predict 0 | Predict 1 | Predict 2 |
| --- | --- | --- | --- |
| Class 0 | \(c_{y = 0, \hat{y} = 0}\) | \(c_{y = 0, \hat{y} = 1}\) | \(c_{y = 0, \hat{y} = 2}\) |
| Class 1 | \(c_{y = 1, \hat{y} = 0}\) | \(c_{y = 1, \hat{y} = 1}\) | \(c_{y = 1, \hat{y} = 2}\) |
| Class 2 | \(c_{y = 2, \hat{y} = 0}\) | \(c_{y = 2, \hat{y} = 1}\) | \(c_{y = 2, \hat{y} = 2}\) |

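➩ Precision and recall can be defined the same way as before. Writing \(c_{i j}\) for \(c_{y = i, \hat{y} = j}\), the precision of class \(i\) is \(\dfrac{c_{i i}}{\displaystyle\sum_{j} c_{j i}}\) (the diagonal entry divided by the sum of column \(i\), that is, all items predicted to be \(i\)), and the recall of class \(i\) is \(\dfrac{c_{i i}}{\displaystyle\sum_{j} c_{i j}}\) (the diagonal entry divided by the sum of row \(i\), that is, all items labeled \(i\)).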
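The per-class formulas can be sketched on a small matrix stored as a list of rows, with rows as labels and columns as predictions. The counts below are invented for illustration, and returning 0.0 for an empty row or column is an assumption of this sketch.

```python
def per_class_precision_recall(c):
    """Return (precision, recall) lists, one entry per class.

    c[i][j] is the number of items labeled i and predicted to be j.
    """
    k = len(c)
    precision, recall = [], []
    for i in range(k):
        predicted_i = sum(c[j][i] for j in range(k))  # column sum: predicted i
        labeled_i = sum(c[i][j] for j in range(k))    # row sum: labeled i
        precision.append(c[i][i] / predicted_i if predicted_i else 0.0)
        recall.append(c[i][i] / labeled_i if labeled_i else 0.0)
    return precision, recall


# Hypothetical three-class confusion matrix.
c = [
    [8, 1, 1],
    [2, 6, 0],
    [0, 3, 9],
]
precision, recall = per_class_precision_recall(c)
print(precision)  # [0.8, 0.6, 0.9]
print(recall)     # [0.8, 0.75, 0.75]
```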

MNIST Example
➩ Use the MNIST dataset Link or the csv files: Link, and train a handwritten digit recognition classifier using logistic regression.
➩ Plot the confusion matrix and some examples of incorrectly classified images.
➩ Code to make the predictions with multinomial logistic regression: Notebook.
➩ Code to make the predictions with one vs all (or one vs rest) logistic regression: Notebook.

📗 Conversion from Probabilities to Predictions
➩ Probability predictions are given in a matrix, where row \(i\) is the probability prediction for item \(i\), and column \(j\) is the predicted probability that item \(i\) has label \(j\).
➩ Given an \(n \times k\) probability prediction matrix p (\(n\) items, \(k\) classes), p.max(axis = 1) computes the length-\(n\) vector of maximum probabilities, one for each row, and p.argmax(axis = 1) computes the column indices of those maximum probabilities, which are also the predicted labels of the items.
➩ For example, if p is \(\begin{bmatrix} 0.1 & 0.2 & 0.7 \\ 0.8 & 0.1 & 0.1 \\ 0.4 & 0.5 & 0.1 \end{bmatrix}\), then p.max(axis = 1) is \(\begin{bmatrix} 0.7 \\ 0.8 \\ 0.5 \end{bmatrix}\) and p.argmax(axis = 1) is \(\begin{bmatrix} 2 \\ 0 \\ 1 \end{bmatrix}\).
➩ Note: p.max(axis = 0) and p.argmax(axis = 0) would compute max and argmax along the columns.
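The example matrix above can be checked directly in NumPy. This is a small sketch that reuses the same numbers; the variable names are chosen only for this example.

```python
import numpy as np

# Each row is one item; each column is the predicted probability of one label.
p = np.array([
    [0.1, 0.2, 0.7],
    [0.8, 0.1, 0.1],
    [0.4, 0.5, 0.1],
])

row_max = p.max(axis=1)       # largest probability in each row
predicted = p.argmax(axis=1)  # column index of that probability = predicted label
print(row_max)    # [0.7 0.8 0.5]
print(predicted)  # [2 0 1]

# axis=0 works down the columns instead of across the rows.
print(p.max(axis=0))     # [0.8 0.5 0.7]
print(p.argmax(axis=0))  # [1 2 0]

# The results are one-dimensional arrays of length n.
print(row_max.shape)  # (3,)
```

If a column vector of shape \(n \times 1\) is needed, p.max(axis = 1, keepdims = True) keeps the reduced axis.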

Note on Pixel Intensity
➩ Pixel intensity usually follows the convention that 0 means black and 1 means white, as done in matplotlib.pyplot.imshow and skimage.io.imshow, unlike the example from the previous lecture.


 Notes and code adapted from the course taught by Yiyin Shen Link and Tyler Caraza-Harter Link






Last Updated: November 30, 2024 at 4:34 AM