# Classification vs Regression

📗 Classification is the problem in which the label \(y\) is categorical.
➩ When \(y \in \left\{0, 1\right\}\), the problem is binary classification.
➩ When \(y \in \left\{0, 1, ..., K\right\}\), the problem is multi-class classification.
📗 Regression is the problem in which \(y\) is continuous.
➩ Logistic regression is usually used for classification problems, but since it predicts a continuous value in \(\left[0, 1\right]\), namely the probability that \(y\) is in class \(1\), it is called "regression".
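📗 A minimal sketch of this last point, with hypothetical weights and bias (made-up numbers, not from the notes): the logistic activation maps the linear score into \(\left(0, 1\right)\), and thresholding it at \(0.5\) turns the continuous output into a binary class prediction.
```python
import numpy as np

# Hypothetical logistic regression parameters (illustrative only).
w, b = np.array([1.5, -0.5]), 0.2

def predict_proba(x):
    # Logistic (sigmoid) activation: output is in (0, 1), read as P(y = 1 | x).
    return 1.0 / (1.0 + np.exp(-(w @ x + b)))

x = np.array([1.0, 2.0])
p = predict_proba(x)      # continuous "regression" output, about 0.67 here
label = int(p >= 0.5)     # threshold at 0.5 for a binary classification
print(p, label)
```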



# Linear Regression

📗 The regression coefficients are usually estimated by \(w = \left(X^\top X\right)^{-1} X^\top y\), where \(X\) is the design matrix whose rows are the items and the columns are the features (a column of \(1\)s can be added so that the corresponding weight is the bias): Wikipedia.
📗 Gradient descent can be used with the squared loss, and the weights should converge to the same estimates: \(w = w - \alpha \left(\left(a_{1} - y_{1}\right) x_{1} + \left(a_{2} - y_{2}\right) x_{2} + ... + \left(a_{n} - y_{n}\right) x_{n}\right)\) and \(b = b - \alpha \left(\left(a_{1} - y_{1}\right) + \left(a_{2} - y_{2}\right) + ... + \left(a_{n} - y_{n}\right)\right)\), where \(a_{i}\) is the current prediction for item \(i\).
Math Note
📗 The coefficients can be derived from \(y = X w\) by pre-multiplying both sides by \(X^\top\): \(X^\top y = X^\top X w\), so \(w = \left(X^\top X\right)^{-1} X^\top y\) (the normal equations, which also follow from setting the gradient of the squared loss to zero).
📗 The gradient descent step follows from the squared loss \(C_{i} = \dfrac{1}{2} \left(a_{i} - y_{i}\right)^{2}\) with \(a_{i} = w_{1} x_{i 1} + w_{2} x_{i 2} + ... + w_{m} x_{i m} + b\), so by the chain rule,
 \(\dfrac{\partial C_{i}}{\partial w_{j}} = \dfrac{\partial C_{i}}{\partial a_{i}} \dfrac{\partial a_{i}}{\partial w_{j}} = \left(a_{i} - y_{i}\right) x_{i j}\),
combining the \(w_{j}\) for \(j = 1, 2, ..., m\), \(\nabla_{w} C_{i} = \left(a_{i} - y_{i}\right) x_{i}\),
and combining the \(C_{i}\) for \(i = 1, 2, ..., n\), \(\nabla_{w} C = \left(a_{1} - y_{1}\right) x_{1} + \left(a_{2} - y_{2}\right) x_{2} + ... + \left(a_{n} - y_{n}\right) x_{n}\).
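📗 A minimal NumPy sketch of the two estimates above, on made-up synthetic data (the data, learning rate, and iteration count are assumptions for illustration): the closed-form coefficients \(w = \left(X^\top X\right)^{-1} X^\top y\) and batch gradient descent on the squared loss should agree approximately.
```python
import numpy as np

# Synthetic data: n = 100 items, m = 2 features, y linear in x plus noise (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.0 + 0.1 * rng.normal(size=100)

# Closed form: append a column of 1s so the last coefficient plays the role of the bias.
X1 = np.hstack([X, np.ones((100, 1))])
w_closed = np.linalg.inv(X1.T @ X1) @ X1.T @ y   # w = (X^T X)^{-1} X^T y

# Batch gradient descent on the squared loss with learning rate alpha.
w, b, alpha = np.zeros(2), 0.0, 0.005
for _ in range(5000):
    a = X @ w + b                     # predictions a_i
    w = w - alpha * (X.T @ (a - y))   # w <- w - alpha * sum_i (a_i - y_i) x_i
    b = b - alpha * np.sum(a - y)     # b <- b - alpha * sum_i (a_i - y_i)

print(w_closed)   # closed-form [w_1, w_2, b]
print(w, b)       # gradient descent estimates, approximately the same
```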
TopHat Discussion
📗 [1 point] Change the regression coefficients (weight and bias) to minimize the loss (interactive demo).




# Linear Probability Model

📗 Using linear regression to estimate probabilities is inappropriate, since the squared loss \(\dfrac{1}{2} \left(a_{i} - y_{i}\right)^{2}\) is large when \(y_{i} = 1\) and \(a_{i}\) is much larger than \(1\), or when \(y_{i} = 0\) and \(a_{i}\) is much smaller than \(0\), so linear regression penalizes predictions that are "very correct": Wikipedia.
📗 Using linear regression and rounding the predicted \(y\) to the nearest integer is also inappropriate for multi-class classification, since the classes should not be ordered by their labels.
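📗 A tiny numerical illustration of the first point above (the label and score are made-up values): with \(y_{i} = 1\) and a raw linear output \(a_{i} = 3\), the squared loss is large even though the prediction is far on the correct side, while the logistic (cross-entropy) loss is small.
```python
import numpy as np

# "Very correct" prediction: true label y = 1, raw linear output a = 3.
y, a = 1, 3.0

squared_loss = 0.5 * (a - y) ** 2        # = 2.0, a large penalty
p = 1.0 / (1.0 + np.exp(-a))             # logistic activation maps a into (0, 1)
logistic_loss = -(y * np.log(p) + (1 - y) * np.log(1 - p))   # ~= 0.049, a small penalty

print(squared_loss, logistic_loss)
```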
TopHat Discussion
📗 [1 point] Change the regression coefficients (weight and bias) to minimize the loss (interactive demo).




# Model Interpretation

📗 Given a new item \(x_{i'}\) (indexed by \(i'\)) with features \(\left(x_{i' 1}, x_{i' 2}, ..., x_{i' m}\right)\), the predicted \(y_{i'}\) is given by \(\hat{y}_{i'} = w_{1} x_{i' 1} + w_{2} x_{i' 2} + ... + w_{m} x_{i' m} + b\).
📗 The weight (coefficient) for feature \(j\) is usually interpreted as the expected (average) change in \(y_{i'}\) when \(x_{i' j}\) increases by one unit with the other features held constant.
📗 The bias (intercept) is usually interpreted as the expected (average) value of \(y\) when all features have value \(0\), or \(x_{i' 1} = x_{i' 2} = ... = x_{i' m} = 0\).
➩ This interpretation assumes that \(0\) is a valid value for all features (or \(0\) is in the range of all features).
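📗 A small sketch of this interpretation with hypothetical coefficients (\(w = \left(0.5, 2.0\right)\) and \(b = -1.0\) are made-up numbers):
```python
# Hypothetical fitted model: y-hat = 0.5 x_1 + 2.0 x_2 - 1.0 (illustrative only).
w = [0.5, 2.0]
b = -1.0

def predict(x):
    # y-hat = w_1 x_1 + w_2 x_2 + b
    return sum(wj * xj for wj, xj in zip(w, x)) + b

print(predict([4.0, 3.0]))   # 7.0
print(predict([4.0, 4.0]))   # 9.0: increasing x_2 by one unit changes y-hat by w_2 = 2.0
print(predict([0.0, 0.0]))   # -1.0: with all features at 0, y-hat equals the bias b
```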



# Margin and Support Vectors

📗 The perceptron algorithm finds any one of the (possibly many) linear classifiers that separate the two classes.
📗 Among them, the classifier with the widest margin (the width of the thickest line that can separate the two classes) is called the support vector machine (the items or feature vectors on the edges of the thick line are called support vectors).
TopHat Discussion
📗 [1 point] Move the line (by moving the two points on the line) so that it separates the two classes and the margin is maximized (interactive demo).

TopHat Discussion
📗 [1 point] Move the plus (blue) and minus (red) planes so that they separate the two classes and the margin is maximized (interactive demo).





# Hard Margin Support Vector Machine

📗 Mathematically, the margin can be computed as \(\dfrac{2}{\sqrt{w^\top w}}\) or \(\dfrac{2}{\sqrt{w_{1}^{2} + w_{2}^{2} + ... + w_{m}^{2}}}\), which is the distance between the two edges of the thick line (or the two hyperplanes in higher dimensional space), \(w^\top x_{i} + b = w_{1} x_{i 1} + w_{2} x_{i 2} + ... + w_{m} x_{i m} + b = \pm 1\).
📗 The optimization problem is given by \(\displaystyle\max_{w} \dfrac{2}{\sqrt{w^\top w}}\) subject to \(w^\top x_{i} + b \leq -1\) if \(y_{i} = 0\) and \(w^\top x_{i} + b \geq 1\) if \(y_{i} = 1\) for \(i = 1, 2, ..., n\).
📗 The problem is equivalent to \(\displaystyle\min_{w} \dfrac{1}{2} w^\top w\) subject to \(\left(2 y_{i} - 1\right) \left(w^\top x_{i} + b\right) \geq 1\) for \(i = 1, 2, ..., n\).
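📗 A short sketch that checks the hard margin constraints and computes the margin for a candidate \(\left(w, b\right)\); the toy data and the candidate weights below are made up for illustration:
```python
import numpy as np

# Toy linearly separable data with labels 0 and 1 (illustrative only).
X = np.array([[1.0, 1.0], [2.0, 2.5], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([1, 1, 0, 0])

w = np.array([0.5, 0.5])   # candidate weights (not necessarily the optimum)
b = 0.0

margin = 2.0 / np.sqrt(w @ w)                        # 2 / sqrt(w^T w)
feasible = np.all((2 * y - 1) * (X @ w + b) >= 1)    # (2 y_i - 1)(w^T x_i + b) >= 1 for all i

print(margin, feasible)   # ~2.83, True: the candidate satisfies the hard margin constraints
```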



# Soft Margin Support Vector Machine

📗 To allow a few items to be misclassified (similar to logistic regression), slack variables \(\xi_{i}\) can be introduced.
📗 The problem can be modified to \(\displaystyle\min_{w} \dfrac{1}{2} w^\top w + \dfrac{1}{\lambda} \dfrac{1}{n} \left(\xi_{1} + \xi_{2} + ... + \xi_{n}\right)\) subject to \(\left(2 y_{i} - 1\right) \left(w^\top x_{i} + b\right) \geq 1 - \xi_{i}\) and \(\xi_{i} \geq 0\) for \(i = 1, 2, ..., n\).
📗 The problem is equivalent to \(\displaystyle\min_{w} \dfrac{\lambda}{2} w^\top w + \dfrac{1}{n} \left(C_{1} + C_{2} + ... + C_{n}\right)\) where \(C_{i} = \displaystyle\max\left\{0, 1 - \left(2 y_{i} - 1\right)\left(w^\top x_{i} + b\right)\right\}\).
➩ This is similar to \(L_{2}\) regularized perceptrons with hinge loss (which will be introduced in a future lecture).
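📗 A small sketch that evaluates the hinge loss form of the objective for a candidate \(\left(w, b\right)\); the data, weights, and \(\lambda\) below are assumptions for illustration:
```python
import numpy as np

# Toy data: the last item sits inside the margin of the candidate classifier (illustrative only).
X = np.array([[1.0, 1.0], [2.0, 2.5], [-1.0, -1.0], [0.2, -0.1]])
y = np.array([1, 1, 0, 0])
w = np.array([1.0, 1.0])
b = 0.0
lam = 0.1

scores = X @ w + b                                   # w^T x_i + b
hinge = np.maximum(0.0, 1 - (2 * y - 1) * scores)    # C_i = max{0, 1 - (2 y_i - 1)(w^T x_i + b)}
objective = 0.5 * lam * (w @ w) + hinge.mean()       # (lambda / 2) w^T w + (1 / n) sum_i C_i

print(hinge)       # [0, 0, 0, 1.1]: only the last item has positive slack
print(objective)   # 0.375
```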
TopHat Discussion
📗 [1 point] Move the line (by moving the two points on the line) so that the regularized loss is minimized (the interactive demo displays the margin, the average slack, and the loss for the given \(\lambda\)).

TopHat Discussion
📗 [1 point] Move the plus (blue) and minus (red) planes so that the regularized loss is minimized (the interactive demo displays the margin, the average slack, and the loss for the given \(\lambda\)).





# Subgradient Descent

📗 Gradient descent can be used to choose the weights by minimizing the cost, but the hinge loss function is not differentiable at some points. At those points, a sub-derivative (or sub-gradient) can be used instead.
Math Note
📗 A sub-derivative at a point is the slope of any line that passes through the point and stays below the (convex) function; at a kink there are many such lines, so the sub-derivative is not unique.
➩ Define \(y'_{i} = 2 y_{i} - 1\) (convert \(y_{i} \in \left\{0, 1\right\}\) to \(y'_{i} \in \left\{-1, 1\right\}\)); the subgradient descent step for the soft margin support vector machine is \(w = \left(1 - \alpha \lambda\right) w - \alpha \left(C'_{1} + C'_{2} + ... + C'_{n}\right)\) where \(C'_{i} = -y'_{i} x_{i}\) if \(y'_{i} w^\top x_{i} < 1\) and \(C'_{i} = 0\) otherwise. \(b\) is usually set to \(0\) for support vector machines.
➩ Stochastic subgradient descent, \(w = \left(1 - \alpha \lambda\right) w - \alpha C'_{i}\) applied one item \(i\) at a time, for support vector machines is called Pegasos (Primal Estimated sub-GrAdient SOlver for SVM).
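📗 A minimal Pegasos-style sketch (stochastic subgradient descent with the \(\alpha_{t} = \dfrac{1}{\lambda t}\) step size schedule from the Pegasos paper, no bias term); the synthetic data and \(\lambda\) are made up for illustration:
```python
import numpy as np

# Two well separated Gaussian clusters, labels 0 and 1 (illustrative only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=+2.0, size=(50, 2)),
               rng.normal(loc=-2.0, size=(50, 2))])
y = np.hstack([np.ones(50), np.zeros(50)])
yp = 2 * y - 1                     # convert labels {0, 1} to {-1, +1}

lam = 0.1
w = np.zeros(2)
for t in range(1, 10001):
    i = rng.integers(len(y))       # pick one item at random
    alpha = 1.0 / (lam * t)        # Pegasos step size schedule
    if yp[i] * (X[i] @ w) < 1:     # hinge loss active: subgradient includes -y'_i x_i
        w = (1 - alpha * lam) * w + alpha * yp[i] * X[i]
    else:                          # hinge loss zero: only the regularizer contributes
        w = (1 - alpha * lam) * w

print(w)   # learned weights; the prediction for a new x is based on the sign of w^T x
```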
TopHat Quiz (Past Exam Question)
📗 [1 point] What are the smallest and largest values of the sub-derivatives of the given function at \(x = 0\)? (Interactive demo; the minimum is shown as the green line and the maximum as the blue line.)




# Multi-Class SVM

📗 Multiple SVMs can be trained to perform multi-class classification.
➩ One-vs-one: \(\dfrac{1}{2} K \left(K - 1\right)\) classifiers (if there are \(K = 3\) classes, then the classifiers are 1 vs 2, 1 vs 3, 2 vs 3).
➩ One-vs-all (or one-vs-rest): \(K\) classifiers (if there are \(K = 3\) classes, then the classifiers are 1 vs not-1, 2 vs not-2, 3 vs not-3).
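📗 A small one-vs-rest prediction sketch for \(K = 3\) classes (the per-class weights below are hypothetical): each "k vs not-k" classifier scores the new item, and one common rule is to predict the class with the largest score.
```python
import numpy as np

# One row of (W, b) per "class k vs not-k" classifier (hypothetical values).
W = np.array([[ 1.0,  0.0],    # class 1 vs not-1
              [ 0.0,  1.0],    # class 2 vs not-2
              [-1.0, -1.0]])   # class 3 vs not-3
b = np.array([0.0, 0.0, 0.0])

x = np.array([2.0, 0.5])            # new item to classify
scores = W @ x + b                  # w_k^T x + b_k for each class k
print(1 + int(np.argmax(scores)))   # class with the largest score: class 1 here
```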



📗 Notes and code adapted from the course taught by Professors Jerry Zhu, Yingyu Liang, and Charles Dyer.
📗 Content from note blocks marked "optional" and content from Wikipedia and other demo links are helpful for understanding the materials, but will not be explicitly tested on the exams.






Last Updated: November 21, 2024 at 3:16 AM