Young Wu's Homepage

Prev: L4, Next: L6

Zoom: Link, Piazza: Link, Google Form: Link.

Wisc ID for in-class quiz: (if your wisc email is "test@wisc.edu", please enter "test")
Token: (will be given during the lectures)

Slide:

# Margin and Support Vectors

📗 The perceptron algorithm finds any one of many linear classifier that separates two classes.

📗 Among them, the classifier with the widest margin (width of the thickest line that can separate the two classes) is call the support vector machine (the items or feature vectors on the edges of the thick line are called support vectors).

In-class Discussion

ID:

📗 [1 points] Move the line (by moving the two points on the line) so that it separates the two classes and the margin is maximized.

Margin: 0

[Note] How to find the line with the widest margin?

[Q1] Please check the box to confirm submission (submissions for questions not discussed during the lectures will result in in-class quiz point deduction).

In-class Discussion

ID:

📗 [1 points] Move the plus (blue) and minus (red) planes so that they separate the two classes and the margin is maximized.

Plane: 0
Margin: 0

[Note] How to find the plane with the widest margin?

[Q2] Please check the box to confirm submission (submissions for questions not discussed during the lectures will result in in-class quiz point deduction).

# Hard Margin Support Vector Machine

📗 Mathematically, the margins can be computed by \(\dfrac{2}{\sqrt{w^\top w}}\) or \(\dfrac{2}{\sqrt{w_{1}^{2} + w_{2}^{2} + ... + w_{m}^{2}}}\), which is the distance between the two edges of the thick line (or two hyper-planes in high dimensional space), \(w^\top x_{i} + b = w_{1} x_{i 1} + w_{2} x_{i 2} + ... + w_{m} x_{i m} + b = \pm 1\).

📗 The optimization problem is given by \(\displaystyle\max_{w} \dfrac{2}{\sqrt{w^\top w}}\) subject to \(w^\top x_{i} + b \leq -1\) if \(y_{i} = 0\) and \(w^\top x_{i} + b \geq 1\) if \(y_{i} = 1\) for \(i = 1, 2, ..., n\).

📗 The problem is equivalent to \(\displaystyle\min_{w} \dfrac{1}{2} w^\top w\) subject to \(\left(2 y_{i} - 1\right) \left(w^\top x_{i} + b\right) \geq 1\) for \(i = 1, 2, ..., n\).

# Soft Margin Support Vector Machine

📗 To allow mistakes classifying a few items (similar to logistic regression), slack variables \(\xi_{i}\) can be introduced.

📗 The problem can be modified to \(\displaystyle\min_{w} \dfrac{1}{2} w^\top w + \dfrac{1}{\lambda} \dfrac{1}{n} \left(\xi_{1} + \xi_{2} + ... + \xi_{n}\right)\) subject to \(\left(2 y_{i} - 1\right) \left(w^\top x_{i} + b\right) \geq 1 - \xi_{i}\) and \(\xi_{i} \geq 0\) for \(i = 1, 2, ..., n\).

📗 The problem is equivalent to \(\displaystyle\min_{w} \dfrac{\lambda}{2} w^\top w + \dfrac{1}{n} \left(C_{1} + C_{2} + ... + C_{n}\right)\) where \(C_{i} = \displaystyle\max\left\{0, 1 - \left(2 y_{i} - 1\right)\left(w^\top x_{i} + b\right)\right\}\).

➩ This is similar to \(L_{2}\) regularized perceptrons with hinge loss (which will be introduced in a future lecture).

In-class Discussion

ID:

📗 [1 points] Move the line (by moving the two points on the line) so that the regularized loss is minimized: margin = , average slack = , loss = , where \(\lambda\) = .

Margin: 0

[Note] How did you find the line with the smallest total loss?

[Q3] Please check the box to confirm submission (submissions for questions not discussed during the lectures will result in in-class quiz point deduction).

Other students' answers:

In-class Discussion

ID:

📗 [1 points] Move the plus (blue) and minus (red) planes so that the regularized loss is minimized: margin = , average slack = , loss = , where \(\lambda\) = .

Plane: 0
Margin: 0

[Note] How did you find the line with the smallest total loss?

[Q4] Please check the box to confirm submission (submissions for questions not discussed during the lectures will result in in-class quiz point deduction).

Other students' answers:

# Subgradient Descent

📗 Gradient descent can be used to choose the weights by minimizing the costs, but the hinge loss function is not differentiable at some points. At those points, sub-derivative (or sub-gradient) can be used instead.

Math Note

📗 Sub-derivative at a point is the slope of any of the tangent lines at the point.

➩ Define \(y'_{i} = 2 y_{i} - 1\) (convert \(y_{i} = 0, 1\) to \(y'_{i} = -1, 1\), subgradeint descent for soft margin support vector machine is \(w = \left(1 - \lambda\right) w - \alpha \left(C'_{1} + C'_{2} + ... + C'_{n}\right)\) where \(C'_{i} = y'_{i}\) if \(y'_{i} w^\top x_{i} \geq 1\) and \(C'_{i} = 0\) otherwise. \(b\) is usually set to 0 for support vector machines.

➩ Stochastic gradient descent \(w = \left(1 - \lambda\right) w - \alpha C'_{i}\) for \(i = 1, 2, ..., n\) for support vector machines is called PEGASOS (Primal Estimated sub-GrAdient SOlver for Svm).

In-class Quiz

ID:

📗 [1 points] What are the smallest and largest values of subderivatives of at \(x = 0\).

📗 Answer:

Min (green line): 0
Max (blue line): 0

[Q5] Please check the box to confirm submission (submissions for questions not discussed during the lectures will result in in-class quiz point deduction).

Other students' answers:

# Multi-Class SVM

📗 Multiple SVMs can be trained to perform multi-class classification.

➩ One-vs-one: \(\dfrac{1}{2} K \left(K - 1\right)\) classifiers (if there are \(K = 3\) classes then the classifiers are 1 vs 2, 1 vs 3, 2 vs 3.

➩ One-vs-all (or one-vs-rest): \(K\) classifiers (if there are \(K = 3\) classes then the classifiers are 1 vs not-1, 2 vs not-2, 3 vs not-3.

# Feature Map

📗 If the classes are not linearly separable, more features can be created so that in the higher dimensional space, the items might be linearly separable. This applies to perceptrons and support vector machines.

📗 Given a feature map \(\varphi\), the new items \(\left(\varphi\left(x_{i}\right), y_{i}\right)\) for \(i = 1, 2, ..., n\) can be used to train perceptrons or support vector machines.

📗 When applying the resulting classifier on a new item \(x_{i'}\), \(\varphi\left(x_{i'}\right)\) should be used as the features too.

In-class Discussion

ID:

📗 [1 points] Transform the points (using the feature map) and move the plane such that the plane separates the two classes.

Feature map scale: 0
Plane: 0

[Note] In general how to find the new feature maps? How are the feature maps related to hidden units in neural networks?

[Q6] Please check the box to confirm submission (submissions for questions not discussed during the lectures will result in in-class quiz point deduction).

Other students' answers:

# Kernel Trick

📗 Using non-linear feature maps for support vector machines (which are linear classifiers) is called the kernel trick since any feature map on a data set can be represented by a \(n \times n\) matrix called the kernel matrix (or Gram matrix): \(K_{i i'} = \varphi\left(x_{i}\right)^\top \varphi\left(x_{i'}\right) = \varphi_{1}\left(x_{i}\right) \varphi_{1}\left(x_{i'}\right) + \varphi_{2}\left(x_{i}\right) \varphi_{2}\left(x_{i'}\right) + ... + \varphi_{m}\left(x_{i}\right) \varphi_{m}\left(x_{i'}\right)\), for \(i = 1, 2, ..., n\) and \(i' = 1, 2, ..., n\).

➩ If \(\varphi\left(x_{i}\right) = \begin{bmatrix} x_{i 1}^{2} \\ \sqrt{2} x_{i 1} x_{i 2} \\ x_{i 2}^{2} \end{bmatrix}\), then \(K_{i i'} = \left(x^\top_{i} x_{i'}\right)^{2}\).

➩ If \(\varphi\left(x_{i}\right) = \begin{bmatrix} \sqrt{2} x_{i 1} \\ x_{i 1}^{2} \\ \sqrt{2} x_{i 1} x_{i 2} \\ x_{i 2}^{2} \\ \sqrt{2} x_{i 2} \\ 1 \end{bmatrix}\), then \(K_{i i'} = \left(x^\top_{i} x_{i'} + 1\right)^{2}\).

In-class Quiz

ID:

📗 [4 points] Consider a kernel \(K\left(x_{i_{1}}, x_{i_{2}}\right)\) = + + , where both \(x_{i_{1}}\) and \(x_{i_{2}}\) are 1D positive real numbers. What is the feature vector \(\varphi\left(x_{i}\right)\) induced by this kernel evaluated at \(x_{i}\) = ?

📗 Answer (comma separated vector): .

[Note] Use the space to explain the steps or just take notes:

# Kernel Matrix

📗 A matrix is a kernel for some feature map \(\varphi\) if and only if it is symmetric positive semi-definite (positive semi-definiteness is equivalent to having non-negative eigenvalues).

📗 Some kernel matrices correspond to infinite-dimensional feature maps.

➩ Linear kernel: \(K_{i i'} = x^\top_{i} x_{i'}\).

➩ Polynomial kernel: \(K_{i i'} = \left(x^\top_{i} x_{i'} + 1\right)^{d}\).

➩ Radial basis function (Gaussian) kernel: \(K_{i i'} = e^{- \dfrac{1}{\sigma^{2}} \left(x_{i} - x_{i'}\right)^\top \left(x_{i} - x_{i'}\right)}\). In this case, the new features are infinite dimensional (any finite data set is linearly separable), and dual optimization techniques are used to find the weights (subgradient descent for the primal problem cannot be used).

Math Note

➩ The primal problem is given by \(\displaystyle\min_{w} \dfrac{\lambda}{2} w^\top w + \dfrac{1}{n} \displaystyle\sum_{i=1}^{n} \displaystyle\max\left\{0, 1 - \left(2 y_{i} - 1\right) \left(w^\top x_{i} + b\right)\right\}\),

➩ The dual problem is given by \(\displaystyle\max_{\alpha > 0} \displaystyle\sum_{i=1}^{n} \alpha_{i} - \dfrac{1}{2} \displaystyle\sum_{i,i' = 1}^{n} \alpha_{i} \alpha_{i'} \left(2 y_{i} - 1\right) \left(2 y_{i'} - 1\right) \left(x^\top_{i} x_{i'}\right)\) subject to \(0 \leq \alpha_{i} \leq \dfrac{1}{\lambda n}\) and \(\displaystyle\sum_{i=1}^{n} \alpha_{i} \left(2 y_{i} - 1\right) = 0\).

➩ The dual problem only involves \(x^\top_{i} x_{i'}\), and with the new features, \(\varphi\left(x_{i}\right)^\top \varphi\left(x_{i'}\right)\) which are elements of the kernel matrix.

➩ The primal classifier is \(w^\top x + b\).

➩ The dual classifier is \(\displaystyle\sum_{i=1}^{n} \alpha_{i} y_{i} \left(x^\top_{i} x\right) + b\), where \(\alpha_{i} \neq 0\) only when \(x_{i}\) is a support vector.

# Questions?

📗 If you have questions, please use (i) Zoom chat, (ii) Piazza: Link, (iii) Office hours and discussion sessions. Please do NOT use Canvas mail and use email only to the course instructor (not TAs) for grading issues.

Additional In-class Discussion

📗 Sometimes a question not in the notes will be asked during the lecture, you can submit your answer here:

Notes (not visible to other students):
[Q7] Please check the box to confirm submission (submissions for questions not discussed during the lectures will result in in-class quiz point deduction).

Submit your answer to see other students answers (click the submit button to refresh):

Additional In-class Quiz

📗 Sometimes a question not in the notes will be asked during the lecture, you can submit your answer here:

A.
B.
C.
D.
E.
Notes (not visible to other students):
[Q8] Please check the box to confirm submission (submissions for questions not discussed during the lectures will result in in-class quiz point deduction).

Submit your answer to see other students answers (click the submit button to refresh):

# In-class Quiz Instructions

📗 To get full points on the in-class quizzes for a lecture:

➩ Submit relevant answers to the questions discussed during the lecture: incorrect answers are okay.

➩ Some questions require [notes] to earn the point.

➩ Some questions require special ID (given during the lecture) to earn the point.

➩ Do not submit answers to questions that are not discussed during the lectures. Each such submission will result in a deduction of one point.

➩ Submissions after the lecture, before the midterm (first 14 lectures) and the final exam (last 14 lectures), are accepted. After the exams, no in-class quiz submissions will be accepted.

➩ The grade on Canvas Assignment Q5 is computed as number of points divided by the number of questions asked (out of 1) and updated on Canvas every weekend.

📗 If there are any issues with submission on the website, please use this Google form: Link.

📗 Bonus point opportunities during a few lectures (added to in-class quiz above 20 points).

📗 Notes and code adapted from the course taught by Professors Jerry Zhu, Blerina Gkotse, Yudong Chen, Yingyu Liang, Charles Dyer. Some content are generated using Copilot .

Prev: L4, Next: L6

Last Updated: July 16, 2026 at 12:17 PM