Prev: W10, Next: W12

Zoom: Link, TopHat: Link (936525), GoogleForm: Link, Piazza: Link, Feedback: Link, GitHub: Link, Sec1&2: Link


Slide:

# Linear Classifiers

📗 A classifier is linear if the decision boundary is linear (line in 2D, plane in 3D, hyperplane in higher dimensions).
➩ Points above the plane (in the direction of the normal of the plane) are predicted as 1, and points below the plane are predicted as 0.
➩ A linear classifier is determined by a set of coefficients called weights, \(w_{1}, w_{2}, ..., w_{m}\), and a bias term \(b\), usually estimated based on the training data, so that the classifier can be written in the form \(\hat{y} = 1\) if \(w_{1} x'_{1} + w_{2} x'_{2} + ... + w_{m} x'_{m} + b \geq 0\) and \(\hat{y} = 0\) if \(w_{1} x'_{1} + w_{2} x'_{2} + ... + w_{m} x'_{m} + b < 0\).
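The decision rule above can be sketched in a few lines of NumPy; the weights and bias here are hypothetical placeholders, not values estimated from any data.

```python
import numpy as np

# Hypothetical weights and bias for a classifier with m = 2 features.
w = np.array([1.0, -2.0])
b = 0.5

def predict(x):
    # y_hat = 1 if w1*x'1 + ... + wm*x'm + b >= 0, and 0 otherwise
    return 1 if x @ w + b >= 0 else 0

print(predict(np.array([2.0, 0.0])))  # 1*2 - 2*0 + 0.5 = 2.5 >= 0, so 1
print(predict(np.array([0.0, 2.0])))  # 1*0 - 2*2 + 0.5 = -3.5 < 0, so 0
```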

# Activation Function

📗 If a probabilistic prediction is needed, \(\hat{f}\left(x'\right) = g\left(w_{1} x'_{1} + w_{2} x'_{2} + ... + w_{m} x'_{m} + b\right)\) outputs a number between 0 and 1, and the classifier can be written in the form \(\hat{y} = 1\) if \(\hat{f}\left(x'\right) \geq 0.5\) and \(\hat{y} = 0\) if \(\hat{f}\left(x'\right) < 0.5\).
➩ The function \(g\) can be non-linear: since a monotone \(g\) is applied to the linear combination \(z = w_{1} x'_{1} + ... + w_{m} x'_{m} + b\), thresholding \(g\left(z\right)\) is equivalent to thresholding \(z\), so the resulting classifier is still linear.

# Examples of Linear Classifiers

📗 Linear threshold unit (LTU) perceptron: \(g\left(z\right) = 1\) if \(z > 0\) and \(g\left(z\right) = 0\) otherwise.
📗 Logistic regression: \(g\left(z\right) = \dfrac{1}{1 + e^{-z}}\), see Link.
📗 Support vector machine (SVM): \(g\left(z\right) = 1\) if \(z > 0\) and \(g\left(z\right) = 0\) otherwise, but with a different method to find the weights (that maximizes the separation between the two classes).
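The two activation functions listed above differ only in shape (the LTU is a hard threshold, the sigmoid a soft one); a minimal sketch comparing them:

```python
import numpy as np

def ltu(z):
    # linear threshold unit: 1 if z > 0 and 0 otherwise
    return np.where(z > 0, 1.0, 0.0)

def logistic(z):
    # sigmoid: maps any real z to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, 0.0, 2.0])
print(ltu(z))       # [0. 0. 1.]
print(logistic(z))  # approximately [0.12 0.5  0.88]
```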

# Logistic Regression

📗 Logistic regression is usually used for classification, not regression.
➩ The function \(g\left(z\right) = \dfrac{1}{1 + e^{-z}}\) is called the logistic activation function (or sigmoid function).
📗 How well the weights fit the training data is measured by a loss (or cost) function, the sum of the losses from the individual training items: \(C\left(w\right) = C_{1}\left(w\right) + C_{2}\left(w\right) + ... + C_{n}\left(w\right)\).
➩ For logistic regression, the cross-entropy loss \(C_{i}\left(w\right) = - y_{i} \log g\left(z_{i}\right) - \left(1 - y_{i}\right) \log \left(1 - g\left(z_{i}\right)\right)\), with \(z_{i} = w_{1} x_{i1} + w_{2} x_{i2} + ... + w_{m} x_{im} + b\), is usually used.
➩ Given the weights and bias \(w = \left(w_{1}, w_{2}, ..., w_{m}\right), b\), the probabilistic prediction that a new item \(x' = \left(x'_{1}, x'_{2}, ..., x'_{m}\right)\) belongs to class 1 can be computed as \(\dfrac{1}{1 + e^{- \left(w_{1} x'_{1} + w_{2} x'_{2} + ... + w_{m} x'_{m} + b\right)}}\), or 1 / (1 + exp(- (x' @ w + b))), where @ is the dot product (or matrix product in general).
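A sketch of both computations (the total cross-entropy loss and the probabilistic prediction for a new item), with hypothetical weights and a tiny made-up training set, not from any real dataset:

```python
import numpy as np

# Hypothetical weights, bias, and training set (n = 3 items, m = 2 features).
w = np.array([0.5, -1.0])
b = 0.0
X = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 1.0]])
y = np.array([1.0, 0.0, 1.0])

z = X @ w + b                 # z_i = w . x_i + b for each training item
g = 1.0 / (1.0 + np.exp(-z))  # sigmoid activations g(z_i)
# cross-entropy loss C(w) = sum over items of C_i(w)
cost = np.sum(-y * np.log(g) - (1 - y) * np.log(1 - g))

# probabilistic prediction that a new item x' belongs to class 1
x_new = np.array([1.0, 1.0])
p_new = 1 / (1 + np.exp(-(x_new @ w + b)))
print(cost, p_new)  # p_new = sigmoid(-0.5), approximately 0.38
```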

TopHat Discussion ID:
➩ Move the sliders below to change the green plane normal so that the largest number of blue points are above the plane and the largest number of red points are below the plane. The current number of mistakes is ???.

\(w_{1}\) = 0
\(w_{2}\) = 0
\(w_{3}\) = 1
\(b\) = 0
➩ Move the sliders below to change the green plane normal so that the total loss from blue points below the plane and the red points above the plane is minimized. The current total cost is ???.

\(w_{1}\) = 0
\(w_{2}\) = 0
\(w_{3}\) = 1
\(b\) = 0

# Interpretation of Coefficients

📗 The weight \(w_{j}\) in front of feature \(j\) can be interpreted as the increase in the log-odds of the label \(y\) being 1, associated with the increase of 1 unit in \(x_{j}\), holding all other variables constant.
➩ This means, if the feature \(x'_{j}\) increases by 1, the odds of \(y\) being 1 are multiplied by \(e^{w_{j}}\).
➩ The bias \(b\) is the log-odds of \(y\) being 1, when \(x'_{1} = x'_{2} = ... = x'_{m} = 0\).
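A quick numeric check of the odds interpretation, using a hypothetical one-feature model (the weight and bias are made up for illustration):

```python
import numpy as np

w1, b = 0.7, -1.0  # hypothetical weight and bias

def odds(x):
    # odds of y = 1: p / (1 - p), which equals e^{w1*x + b} for logistic regression
    p = 1 / (1 + np.exp(-(w1 * x + b)))
    return p / (1 - p)

# increasing x by 1 unit multiplies the odds by e^{w1}
print(odds(2.0) / odds(1.0), np.exp(w1))  # both approximately 2.01
```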

# Binary Classification

📗 For binary classification, the labels are binary, either 0 or 1, but the output of the classifier \(\hat{f}\left(x\right)\) can be a number between 0 and 1.
➩ \(\hat{f}\left(x\right)\) usually represents the probability that the label is 1, and it is sometimes called the activation value.
➩ If a deterministic prediction \(\hat{y}\) is required, it is usually set to \(\hat{y} = 0\) if \(\hat{f}\left(x\right) \leq 0.5\) and \(\hat{y} = 1\) if \(\hat{f}\left(x\right) > 0.5\).

Item | Input (Features) | Output (Labels) | -
--- | --- | --- | ---
1 | \(\left(x_{11}, x_{12}, ..., x_{1m}\right)\) | \(y_{1} \in \left\{0, 1\right\}\) | training data
2 | \(\left(x_{21}, x_{22}, ..., x_{2m}\right)\) | \(y_{2} \in \left\{0, 1\right\}\) | -
3 | \(\left(x_{31}, x_{32}, ..., x_{3m}\right)\) | \(y_{3} \in \left\{0, 1\right\}\) | -
... | ... | ... | ...
n | \(\left(x_{n1}, x_{n2}, ..., x_{nm}\right)\) | \(y_{n} \in \left\{0, 1\right\}\) | used to figure out \(y \approx \hat{f}\left(x\right)\)
new | \(\left(x'_{1}, x'_{2}, ..., x'_{m}\right)\) | \(y' \in \left[0, 1\right]\) | guess \(y' = 1\) with probability \(\hat{f}\left(x'\right)\)


# Confusion Matrix

📗 A confusion matrix contains the counts of items for every label-prediction combination. For binary classification, it is a 2 by 2 matrix with the following entries:
(1) the number of items labeled 1 and predicted to be 1 (true positive: TP),
(2) the number of items labeled 1 and predicted to be 0 (false negative: FN),
(3) the number of items labeled 0 and predicted to be 1 (false positive: FP),
(4) the number of items labeled 0 and predicted to be 0 (true negative: TN).

Count | Predict 1 | Predict 0
--- | --- | ---
Label 1 | TP | FN
Label 0 | FP | TN
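The four counts can be computed directly from label and prediction vectors; a minimal sketch with made-up labels:

```python
import numpy as np

# Hypothetical labels and predictions for 8 items.
y    = np.array([1, 1, 1, 1, 0, 0, 0, 0])
yhat = np.array([1, 1, 1, 0, 1, 0, 0, 0])

tp = int(np.sum((y == 1) & (yhat == 1)))  # true positives
fn = int(np.sum((y == 1) & (yhat == 0)))  # false negatives
fp = int(np.sum((y == 0) & (yhat == 1)))  # false positives
tn = int(np.sum((y == 0) & (yhat == 0)))  # true negatives
print([[tp, fn], [fp, tn]])  # [[3, 1], [1, 3]]
```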


# Precision and Recall

📗 Precision is the positive predictive value, or \(\dfrac{TP}{TP + FP}\).
➩ Recall is the true positive rate, or \(\dfrac{TP}{TP + FN}\).
➩ F-measure (or F1 score) is \(2 \cdot \dfrac{p \cdot r}{p + r}\), where \(p\) is the precision and \(r\) is the recall: Link.
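These formulas translate directly into code; a minimal sketch:

```python
def precision(tp, fp):
    # positive predictive value: TP / (TP + FP)
    return tp / (tp + fp)

def recall(tp, fn):
    # true positive rate: TP / (TP + FN)
    return tp / (tp + fn)

def f1(tp, fp, fn):
    # harmonic mean of precision and recall
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

print(f1(3, 1, 1))  # p = r = 0.75, so F1 = 0.75
```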

Spam Example
➩ Use 80 percent of the spambase dataset Link to train a spam detector and test it on the remaining 20 percent of the dataset. Use a linear support vector machine as the spam detector, and compare it with a random spam detector that predicts spam with probability 0.5.
➩ The confusion matrix for the SVM detector is:
Count | Predict Spam | Predict Ham
--- | --- | ---
Label Spam | 506 | 25
Label Ham | 50 | 340

➩ Precision is \(\dfrac{506}{506 + 50} \approx 0.91\), Recall is \(\dfrac{506}{506 + 25} \approx 0.95\), so the F measure is \(2 \cdot \dfrac{0.91 \cdot 0.95}{0.91 + 0.95} \approx 0.93\).
➩ The confusion matrix for the perfect spam detector is:
Count | Predict Spam | Predict Ham
--- | --- | ---
Label Spam | 531 | 0
Label Ham | 0 | 390

➩ Precision is \(\dfrac{531}{531 + 0} = 1\), Recall is \(\dfrac{531}{531 + 0} = 1\), so the F measure is \(2 \cdot \dfrac{1 \cdot 1}{1 + 1} = 1\), the highest possible value.
➩ The confusion matrix for the random detector is:
Count | Predict Spam | Predict Ham
--- | --- | ---
Label Spam | 278 | 253
Label Ham | 192 | 198

➩ Precision is \(\dfrac{278}{278 + 192} \approx 0.59\), Recall is \(\dfrac{278}{278 + 253} \approx 0.52\), so the F measure is \(2 \cdot \dfrac{p \cdot r}{p + r} \approx 0.56\) (computed from the unrounded precision and recall).
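The scores of all three detectors can be checked directly from their confusion matrices:

```python
# (TP, FN, FP, TN) for each detector in the spam example above.
detectors = {
    "svm":     (506, 25, 50, 340),
    "perfect": (531, 0, 0, 390),
    "random":  (278, 253, 192, 198),
}
scores = {}
for name, (tp, fn, fp, tn) in detectors.items():
    p = tp / (tp + fp)  # precision
    r = tp / (tp + fn)  # recall
    scores[name] = round(2 * p * r / (p + r), 2)  # F measure
print(scores)  # {'svm': 0.93, 'perfect': 1.0, 'random': 0.56}
```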

# Multi-class Classification

📗 Multi-class classification is different from regression since the classes are not ordered.
➩ Three approaches are used:
(1) Directly predict the probability of each class.
(2) One-vs-one classifiers.
(3) One-vs-all (One-vs-rest) classifiers.
📗 If there are \(k\) classes, the classifier could output \(k\) numbers between 0 and 1 that sum up to 1, representing the probability that the item belongs to each class, sometimes called activation values.
➩ If a deterministic prediction is required, the classifier can predict the label with the largest probability.

# Multi-class Confusion Matrix

📗 A three-class confusion matrix can be written as follows.
Count | Predict 0 | Predict 1 | Predict 2
--- | --- | --- | ---
Class 0 | \(c_{y = 0, \hat{y} = 0}\) | \(c_{y = 0, \hat{y} = 1}\) | \(c_{y = 0, \hat{y} = 2}\)
Class 1 | \(c_{y = 1, \hat{y} = 0}\) | \(c_{y = 1, \hat{y} = 1}\) | \(c_{y = 1, \hat{y} = 2}\)
Class 2 | \(c_{y = 2, \hat{y} = 0}\) | \(c_{y = 2, \hat{y} = 1}\) | \(c_{y = 2, \hat{y} = 2}\)

➩ Precision and recall can be defined the same way as before, for example, precision of class \(i\) is \(\dfrac{c_{i i}}{\displaystyle\sum_{j} c_{j i}}\), and recall of class \(i\) is \(\dfrac{c_{i i}}{\displaystyle\sum_{j} c_{i j}}\).
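A sketch of these formulas on a hypothetical 3-class confusion matrix (rows are true classes, columns are predicted classes; the counts are made up):

```python
import numpy as np

# Hypothetical 3-class confusion matrix c[y, y_hat].
c = np.array([[5, 1, 0],
              [2, 6, 2],
              [0, 1, 8]])

i = 1
precision_i = c[i, i] / c[:, i].sum()  # c_ii over column i (all items predicted i)
recall_i    = c[i, i] / c[i, :].sum()  # c_ii over row i (all items labeled i)
print(precision_i, recall_i)  # 6/8 = 0.75 and 6/10 = 0.6
```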

MNIST Example
➩ Use the MNIST dataset Link or the csv files: Link, and train a handwritten digit recognition classifier using logistic regression.
➩ Plot the confusion matrix and some examples of incorrectly classified images.

# Conversion from Probabilities to Predictions

📗 Probability predictions are given in a matrix, where row \(i\) is the probability prediction for item \(i\), and entry \(\left(i, j\right)\) is the predicted probability that item \(i\) has label \(j\).
➩ Given an \(n \times m\) probability prediction matrix p, p.max(axis = 1) computes the \(n \times 1\) vector of maximum probabilities, one for each row, and p.argmax(axis = 1) computes the column indices of those maximum probabilities, which correspond to the predicted labels of the items.
➩ For example, if p is \(\begin{bmatrix} 0.1 & 0.2 & 0.7 \\ 0.8 & 0.1 & 0.1 \\ 0.4 & 0.5 & 0.1 \end{bmatrix}\), then p.max(axis = 1) is \(\begin{bmatrix} 0.7 \\ 0.8 \\ 0.5 \end{bmatrix}\) and p.argmax(axis = 1) is \(\begin{bmatrix} 2 \\ 0 \\ 1 \end{bmatrix}\).
➩ Note: p.max(axis = 0) and p.argmax(axis = 0) would compute max and argmax along the columns.
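The example above runs as-is in NumPy:

```python
import numpy as np

p = np.array([[0.1, 0.2, 0.7],
              [0.8, 0.1, 0.1],
              [0.4, 0.5, 0.1]])

print(p.max(axis=1))     # [0.7 0.8 0.5] -- largest probability in each row
print(p.argmax(axis=1))  # [2 0 1] -- predicted label for each item
print(p.argmax(axis=0))  # argmax along the columns instead
```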


# Questions?



📗 Notes and code adapted from the course taught by Professors Gurmail Singh, Yiyin Shen, Tyler Caraza-Harter.
📗 If there is an issue with TopHat during the lectures, please submit your answers on paper (include your Wisc ID and answers) or this Google form Link at the end of the lecture.
📗 Anonymous feedback can be submitted to: Form. Non-anonymous feedback and questions can be posted on Piazza: Link






Last Updated: March 31, 2026 at 12:33 AM