# Lecture Notes
## Binary Classification
➩ For binary classification, the labels are binary, either 0 or 1, but the output of the classifier \(\hat{f}\left(x\right)\) can be a number between 0 and 1.
➩ \(\hat{f}\left(x\right)\) usually represents the probability that the label is 1, and it is sometimes called the activation value.
➩ If a deterministic prediction \(\hat{y}\) is required, it is usually set to \(\hat{y} = 0\) if \(\hat{f}\left(x\right) \leq 0.5\) and \(\hat{y} = 1\) if \(\hat{f}\left(x\right) > 0.5\); a short sketch of this threshold rule follows the table below.
| Item | Input (Features) | Output (Labels) | Notes |
| --- | --- | --- | --- |
| 1 | \(\left(x_{11}, x_{12}, ..., x_{1m}\right)\) | \(y_{1} \in \left\{0, 1\right\}\) | training data |
| 2 | \(\left(x_{21}, x_{22}, ..., x_{2m}\right)\) | \(y_{2} \in \left\{0, 1\right\}\) | |
| 3 | \(\left(x_{31}, x_{32}, ..., x_{3m}\right)\) | \(y_{3} \in \left\{0, 1\right\}\) | |
| ... | ... | ... | ... |
| n | \(\left(x_{n1}, x_{n2}, ..., x_{nm}\right)\) | \(y_{n} \in \left\{0, 1\right\}\) | used to figure out \(y \approx \hat{f}\left(x\right)\) |
| new | \(\left(x'_{1}, x'_{2}, ..., x'_{m}\right)\) | \(y' \in \left[0, 1\right]\) | guess \(y' = 1\) with probability \(\hat{f}\left(x'\right)\) |
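➩ A minimal sketch of the 0.5 threshold rule in NumPy (the array name and the activation values below are made up for illustration):

```python
import numpy as np

# hypothetical activation values f_hat(x) for five items
activations = np.array([0.1, 0.5, 0.51, 0.9, 0.3])

# deterministic predictions: y_hat = 1 exactly when f_hat(x) > 0.5
y_hat = (activations > 0.5).astype(int)
print(y_hat)  # [0 0 1 1 0]
```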
## Confusion Matrix
➩ A confusion matrix contains the count of items for every label-prediction combination; in the case of binary classification, it is a 2 by 2 matrix with the following entries:
(1) number of items labeled 1 and predicted to be 1 (true positive: TP),
(2) number of items labeled 1 and predicted to be 0 (false negative: FN),
(3) number of items labeled 0 and predicted to be 1 (false positive: FP),
(4) number of items labeled 0 and predicted to be 0 (true negative: TN).
| Count | Predict 1 | Predict 0 |
| --- | --- | --- |
| Label 1 | TP | FN |
| Label 0 | FP | TN |
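➩ A small sketch of how these four counts can be tallied in NumPy (the label and prediction arrays are made up for illustration):

```python
import numpy as np

y = np.array([1, 1, 0, 0, 1, 0])      # true labels (made up)
y_hat = np.array([1, 0, 0, 1, 1, 0])  # predictions (made up)

TP = int(np.sum((y == 1) & (y_hat == 1)))  # labeled 1, predicted 1
FN = int(np.sum((y == 1) & (y_hat == 0)))  # labeled 1, predicted 0
FP = int(np.sum((y == 0) & (y_hat == 1)))  # labeled 0, predicted 1
TN = int(np.sum((y == 0) & (y_hat == 0)))  # labeled 0, predicted 0
print(TP, FN, FP, TN)  # 2 1 1 2
```

➩ scikit-learn's `sklearn.metrics.confusion_matrix(y, y_hat)` computes the same counts, but it orders rows and columns by ascending label, so label 0 comes first there.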
## Precision and Recall
➩ Precision is the positive predictive value, or \(\dfrac{TP}{TP + FP}\).
➩ Recall is the true positive rate, or \(\dfrac{TP}{TP + FN}\).
➩ F-measure (or F1 score) is \(2 \cdot \dfrac{p \cdot r}{p + r}\), where \(p\) is the precision and \(r\) is the recall.
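➩ A sketch of these three formulas, reusing the counts from the confusion-matrix sketch above:

```python
# counts from the confusion-matrix sketch above
TP, FN, FP, TN = 2, 1, 1, 2

precision = TP / (TP + FP)  # positive predictive value
recall = TP / (TP + FN)     # true positive rate
f1 = 2 * precision * recall / (precision + recall)
print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.67 0.67 0.67
```

➩ `sklearn.metrics.precision_score`, `recall_score`, and `f1_score` compute the same quantities directly from the labels and predictions.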
## Spam Example
➩ Use 80 percent of the spambase dataset to train a spam detector and test it on the remaining 20 percent of the dataset. Use a linear support vector machine as the spam detector, and compare it with a random spam detector that predicts spam at random with probability 0.5.
➩ Code to make the predictions:
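➩ A minimal sketch of the experiment, assuming the spambase data is in a local spambase.csv whose last column is the 0/1 spam label (the file name, split seed, and lack of feature scaling are assumptions, so the exact counts will not match the tables below):

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# assumed local copy of the spambase dataset; last column is the label
data = np.loadtxt("spambase.csv", delimiter=",")
x, y = data[:, :-1], data[:, -1]

# 80/20 train/test split (seed chosen arbitrarily)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

# linear support vector machine spam detector
svm = LinearSVC().fit(x_train, y_train)
print(confusion_matrix(y_test, svm.predict(x_test)))

# random baseline: predict spam with probability 0.5
random_pred = np.random.binomial(1, 0.5, size=len(y_test))
print(confusion_matrix(y_test, random_pred))
```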
➩ The confusion matrix for the SVM detector is:
| Count | Predict Spam | Predict Ham |
| --- | --- | --- |
| Label Spam | 506 | 25 |
| Label Ham | 50 | 340 |
➩ Precision is \(\dfrac{506}{506 + 50} \approx 0.91\), Recall is \(\dfrac{506}{506 + 25} \approx 0.95\), so the F measure is \(2 \cdot \dfrac{0.91 \cdot 0.95}{0.91 + 0.95} \approx 0.93\).
➩ The confusion matrix for the perfect spam detector is:
| Count | Predict Spam | Predict Ham |
| --- | --- | --- |
| Label Spam | 531 | 0 |
| Label Ham | 0 | 390 |
➩ Precision is \(\dfrac{531}{531 + 0} = 1\), Recall is \(\dfrac{531}{531 + 0} = 1\), so the F measure is \(2 \cdot \dfrac{1 \cdot 1}{1 + 1} = 1\), the highest possible value.
➩ The confusion matrix for the random detector is:
| Count | Predict Spam | Predict Ham |
| --- | --- | --- |
| Label Spam | 278 | 253 |
| Label Ham | 192 | 198 |
➩ Precision is \(\dfrac{278}{278 + 192} \approx 0.59\), Recall is \(\dfrac{278}{278 + 253} \approx 0.52\), so the F measure is \(2 \cdot \dfrac{0.59 \cdot 0.52}{0.59 + 0.52} \approx 0.55\).
## Multi-class Classification
➩ Multi-class classification is different from regression since the classes are not ordered.
➩ Three approaches are used:
(1) Directly predict the probability of each class.
(2) One-vs-one classifiers.
(3) One-vs-all (One-vs-rest) classifiers.
## Class Probabilities
➩ If there are \(k\) classes, the classifier could output \(k\) numbers that are between 0 and 1 and sum to 1, representing the probability that the item belongs to each class; these are sometimes called activation values.
➩ If a deterministic prediction is required, the classifier can predict the label with the largest probability.
## One-vs-one Classifiers
➩ If there are \(k\) classes, then \(\dfrac{1}{2} k \left(k - 1\right)\) binary classifiers will be trained, 1 vs 2, 1 vs 3, 2 vs 3, ...
➩ The prediction can be the label that receives the largest number of votes from the one-vs-one binary classifiers.
## One-vs-all Classifiers
➩ If there are \(k\) classes, then \(k\) binary classifiers will be trained, 1 vs not 1, 2 vs not 2, 3 vs not 3, ...
➩ The prediction can be the label that achieves the largest probability in the one-vs-all binary classifiers.
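➩ A sketch of both reductions using scikit-learn's wrappers with a linear support vector machine as the underlying binary classifier (the iris dataset is used here only because it has \(k = 3\) classes):

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

x, y = load_iris(return_X_y=True)  # 3 classes, so k = 3

# one-vs-one: trains k(k - 1)/2 = 3 binary classifiers and predicts by voting
ovo = OneVsOneClassifier(LinearSVC()).fit(x, y)

# one-vs-all: trains k = 3 binary classifiers and predicts by largest score
ovr = OneVsRestClassifier(LinearSVC()).fit(x, y)

print(ovo.predict(x[:5]), ovr.predict(x[:5]))
```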
## Multi-class Confusion Matrix
➩ A three-class confusion matrix can be written as follows.
| Count | Predict 0 | Predict 1 | Predict 2 |
| --- | --- | --- | --- |
| Class 0 | \(c_{y = 0, \hat{y} = 0}\) | \(c_{y = 0, \hat{y} = 1}\) | \(c_{y = 0, \hat{y} = 2}\) |
| Class 1 | \(c_{y = 1, \hat{y} = 0}\) | \(c_{y = 1, \hat{y} = 1}\) | \(c_{y = 1, \hat{y} = 2}\) |
| Class 2 | \(c_{y = 2, \hat{y} = 0}\) | \(c_{y = 2, \hat{y} = 1}\) | \(c_{y = 2, \hat{y} = 2}\) |
➩ Precision and recall can be defined the same way as before, for example, precision of class \(i\) is \(\dfrac{c_{i i}}{\displaystyle\sum_{j} c_{j i}}\), and recall of class \(i\) is \(\dfrac{c_{i i}}{\displaystyle\sum_{j} c_{i j}}\).
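➩ A small sketch of these two formulas, applied to a made-up three-class confusion matrix:

```python
import numpy as np

# rows are labels y, columns are predictions y_hat (made-up counts)
c = np.array([[50,  2,  3],
              [ 4, 40,  6],
              [ 1,  5, 45]])

for i in range(3):
    precision = c[i, i] / c[:, i].sum()  # column sum: all items predicted i
    recall = c[i, i] / c[i, :].sum()     # row sum: all items labeled i
    print(i, round(precision, 2), round(recall, 2))
```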
## MNIST Example
➩ Use the MNIST dataset (or the csv files) and train a handwritten digit recognition classifier using logistic regression.
➩ Plot the confusion matrix and some examples of incorrectly classified images.
➩ Code to make the predictions with multinomial logistic regression and with one-vs-all (one-vs-rest) logistic regression:
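➩ A minimal sketch of both variants, assuming the data is fetched from OpenML (the download is large and training takes a few minutes; with scikit-learn's default lbfgs solver, LogisticRegression fits a multinomial model on multi-class data):

```python
from sklearn.datasets import fetch_openml
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

# fetch MNIST from OpenML; pixel intensities are scaled to [0, 1]
x, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
x = x / 255.0
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

# multinomial logistic regression: one model outputting 10 class probabilities
multi = LogisticRegression(max_iter=1000).fit(x_train, y_train)

# one-vs-all logistic regression: 10 binary classifiers, one per digit
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(x_train, y_train)

# 10 by 10 confusion matrix for the multinomial model
print(confusion_matrix(y_test, multi.predict(x_test)))
```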
## Conversion from Probabilities to Predictions
➩ Probability predictions are given in a matrix, where row \(i\) is the probability prediction for item \(i\) and entry \(\left(i, j\right)\) is the predicted probability that item \(i\) has label \(j\).
➩ Given an \(n \times m\) probability prediction matrix `p`, `p.max(axis = 1)` computes the \(n \times 1\) vector of maximum probabilities, one for each row, and `p.argmax(axis = 1)` computes the column indices of those maximum probabilities, which correspond to the predicted labels of the items.
➩ For example, if `p` is \(\begin{bmatrix} 0.1 & 0.2 & 0.7 \\ 0.8 & 0.1 & 0.1 \\ 0.4 & 0.5 & 0.1 \end{bmatrix}\), then `p.max(axis = 1)` is \(\begin{bmatrix} 0.7 \\ 0.8 \\ 0.5 \end{bmatrix}\) and `p.argmax(axis = 1)` is \(\begin{bmatrix} 2 \\ 0 \\ 1 \end{bmatrix}\).
➩ Note: `p.max(axis = 0)` and `p.argmax(axis = 0)` would instead compute the max and argmax of each column (across the rows).
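➩ The worked example above can be checked directly in NumPy:

```python
import numpy as np

p = np.array([[0.1, 0.2, 0.7],
              [0.8, 0.1, 0.1],
              [0.4, 0.5, 0.1]])

print(p.max(axis=1))     # [0.7 0.8 0.5], the largest probability in each row
print(p.argmax(axis=1))  # [2 0 1], the predicted label of each item
print(p.max(axis=0))     # [0.8 0.5 0.7], the largest probability in each column
```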
## Note on Pixel Intensity
➩ Pixel intensity usually follows the convention that 0 means black and 1 means white, as in `matplotlib.pyplot.imshow` (with a gray colormap) and `skimage.io.imshow`, unlike the example from the previous lecture.
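➩ A small sketch of this convention with matplotlib (the image is made up; the gray colormap with explicit limits maps 0 to black and 1 to white):

```python
import matplotlib.pyplot as plt
import numpy as np

# made-up 28 by 28 image with intensities ranging from 0 (black) to 1 (white)
img = np.linspace(0, 1, 28 * 28).reshape(28, 28)
plt.imshow(img, cmap="gray", vmin=0, vmax=1)
plt.show()
```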
Notes and code adapted from the course taught by Yiyin Shen and Tyler Caraza-Harter.