Prev: W6, Next: W8

Zoom: Link, TopHat: Link (341925), GoogleForm: Link, Piazza: Link, Feedback: Link.





# Margin and Support Vectors

📗 The perceptron algorithm finds any one of the many linear classifiers that separate two classes.
📗 Among them, the classifier with the widest margin (the width of the thickest line that can separate the two classes) is called the support vector machine; the items (feature vectors) on the edges of the thick line are called support vectors.
TopHat Discussion ID:
📗 [1 points] Move the line (by moving the two points on the line) so that it separates the two classes and the margin is maximized.



TopHat Discussion ID:
📗 [1 points] Move the plus (blue) and minus (red) planes so that they separate the two classes and the margin is maximized.






# Hard Margin Support Vector Machine

📗 Mathematically, the margin can be computed as \(\dfrac{2}{\sqrt{w^\top w}}\) or \(\dfrac{2}{\sqrt{w_{1}^{2} + w_{2}^{2} + ... + w_{m}^{2}}}\), the distance between the two edges of the thick line (two hyperplanes in higher dimensional space), \(w^\top x_{i} + b = w_{1} x_{i 1} + w_{2} x_{i 2} + ... + w_{m} x_{i m} + b = \pm 1\).
📗 The optimization problem is given by \(\displaystyle\max_{w} \dfrac{2}{\sqrt{w^\top w}}\) subject to \(w^\top x_{i} + b \leq -1\) if \(y_{i} = 0\) and \(w^\top x_{i} + b \geq 1\) if \(y_{i} = 1\) for \(i = 1, 2, ..., n\).
📗 The problem is equivalent to \(\displaystyle\min_{w} \dfrac{1}{2} w^\top w\) subject to \(\left(2 y_{i} - 1\right) \left(w^\top x_{i} + b\right) \geq 1\) for \(i = 1, 2, ..., n\).
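The margin formula and the hard margin constraints above can be sketched numerically. This is a minimal illustration assuming hypothetical 2D data and a hand-picked separating line; it is not a solver.

```python
import math

# Hypothetical 2D items: positives (y = 1) on the left, negatives (y = 0) on the right.
X = [(1.0, 2.0), (2.0, 3.0), (4.0, 1.0), (5.0, 2.0)]
Y = [1, 1, 0, 0]

def margin(w):
    # Width of the thick line: 2 / sqrt(w^T w).
    return 2.0 / math.sqrt(sum(wj * wj for wj in w))

def feasible(w, b, X, Y):
    # Check the hard margin constraints (2 y_i - 1)(w^T x_i + b) >= 1 for every item.
    for x, y in zip(X, Y):
        activation = sum(wj * xj for wj, xj in zip(w, x)) + b
        if (2 * y - 1) * activation < 1:
            return False
    return True

# A hand-picked line w = (-1, 0), b = 3 (roughly x1 = 3) satisfies the constraints,
# so its margin 2 / ||w|| = 2 is a lower bound on the maximum margin.
```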



# Soft Margin Support Vector Machine

📗 To allow mistakes in classifying a few items (similar to logistic regression), slack variables \(\xi_{i}\) can be introduced.
📗 The problem can be modified to \(\displaystyle\min_{w} \dfrac{1}{2} w^\top w + \dfrac{1}{\lambda} \dfrac{1}{n} \left(\xi_{1} + \xi_{2} + ... + \xi_{n}\right)\) subject to \(\left(2 y_{i} - 1\right) \left(w^\top x_{i} + b\right) \geq 1 - \xi_{i}\) and \(\xi_{i} \geq 0\) for \(i = 1, 2, ..., n\).
📗 The problem is equivalent to \(\displaystyle\min_{w} \dfrac{\lambda}{2} w^\top w + \dfrac{1}{n} \left(C_{1} + C_{2} + ... + C_{n}\right)\) where \(C_{i} = \displaystyle\max\left\{0, 1 - \left(2 y_{i} - 1\right)\left(w^\top x_{i} + b\right)\right\}\).
➩ This is similar to \(L_{2}\) regularized perceptrons with hinge loss (which will be introduced in a future lecture).
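The equivalent hinge-loss objective above can be computed directly. A minimal sketch, assuming hypothetical data and a chosen \(\lambda\):

```python
def soft_margin_loss(w, b, X, Y, lam):
    # Regularized objective: (lam / 2) w^T w + (1/n) sum_i max{0, 1 - (2 y_i - 1)(w^T x_i + b)}.
    reg = 0.5 * lam * sum(wj * wj for wj in w)
    hinge = 0.0
    for x, y in zip(X, Y):
        activation = sum(wj * xj for wj, xj in zip(w, x)) + b
        hinge += max(0.0, 1.0 - (2 * y - 1) * activation)
    return reg + hinge / len(X)

# Example: w = (1,), b = 0 classifies x = 2 (y = 1) and x = -2 (y = 0)
# with margin 2, so the hinge terms are 0 and only the regularizer remains.
```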
TopHat Discussion ID:
📗 [1 points] Move the line (by moving the two points on the line) so that the regularized loss is minimized: margin = , average slack = , loss = , where \(\lambda\) = .



TopHat Discussion ID:
📗 [1 points] Move the plus (blue) and minus (red) planes so that the regularized loss is minimized: margin = , average slack = , loss = , where \(\lambda\) = .






# Subgradient Descent

📗 Gradient descent can be used to choose the weights by minimizing the cost, but the hinge loss function is not differentiable at some points. At those points, a sub-derivative (or sub-gradient) can be used instead.
Math Note
📗 A sub-derivative at a point is the slope of any line through that point that stays below the (convex) function.
➩ Define \(y'_{i} = 2 y_{i} - 1\) (convert \(y_{i} = 0, 1\) to \(y'_{i} = -1, 1\)); subgradient descent for the soft margin support vector machine with learning rate \(\alpha\) is \(w = \left(1 - \alpha \lambda\right) w - \dfrac{\alpha}{n} \left(C'_{1} + C'_{2} + ... + C'_{n}\right)\) where \(C'_{i} = -y'_{i} x_{i}\) if \(y'_{i} w^\top x_{i} < 1\) and \(C'_{i} = 0\) otherwise. \(b\) is usually set to 0 for support vector machines.
➩ Stochastic subgradient descent \(w = \left(1 - \alpha \lambda\right) w - \alpha C'_{i}\), applied one item \(i\) at a time, is called PEGASOS (Primal Estimated sub-GrAdient SOlver for SVM).
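The stochastic update above can be sketched as follows. This is a minimal PEGASOS-style sketch assuming hypothetical 2D data, the standard step size \(\alpha_{t} = \dfrac{1}{\lambda t}\), and \(b = 0\):

```python
import random

def pegasos(X, Y, lam=0.1, epochs=100, seed=0):
    # Stochastic subgradient descent for the soft margin SVM (b fixed at 0).
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    t = 0
    for _ in range(epochs):
        for _ in range(len(X)):
            t += 1
            alpha = 1.0 / (lam * t)          # standard PEGASOS step size
            i = rng.randrange(len(X))        # pick a random item
            yi = 2 * Y[i] - 1                # convert label 0, 1 to -1, +1
            act = sum(wj * xj for wj, xj in zip(w, X[i]))
            w = [(1.0 - alpha * lam) * wj for wj in w]   # regularization shrinkage
            if yi * act < 1:
                # hinge subgradient step on a margin-violating item
                w = [wj + alpha * yi * xj for wj, xj in zip(w, X[i])]
    return w

# Hypothetical linearly separable data centered at the origin (since b = 0).
X = [(2.0, 0.0), (3.0, 1.0), (-2.0, 0.0), (-3.0, -1.0)]
Y = [1, 1, 0, 0]
w = pegasos(X, Y)
```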
TopHat Quiz (Past Exam Question) ID:
📗 [1 points] What are the smallest and largest values of the subderivatives of at \(x = 0\)?

📗 Answer: .




# Multi-Class SVM

📗 Multiple SVMs can be trained to perform multi-class classification.
➩ One-vs-one: \(\dfrac{1}{2} K \left(K - 1\right)\) classifiers (if there are \(K = 3\) classes, the classifiers are 1 vs 2, 1 vs 3, 2 vs 3).
➩ One-vs-all (or one-vs-rest): \(K\) classifiers (if there are \(K = 3\) classes, the classifiers are 1 vs not-1, 2 vs not-2, 3 vs not-3).
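The two classifier counts above can be enumerated directly; a small sketch for \(K = 3\):

```python
from itertools import combinations

K = 3
classes = list(range(1, K + 1))

# One-vs-one: one binary classifier per unordered pair of classes, K(K-1)/2 in total.
one_vs_one = list(combinations(classes, 2))   # [(1, 2), (1, 3), (2, 3)]

# One-vs-all (one-vs-rest): one binary classifier per class, K in total.
one_vs_rest = [(c, "rest") for c in classes]
```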



# Feature Map

📗 If the classes are not linearly separable, more features can be created so that in the higher dimensional space, the items might be linearly separable. This applies to perceptrons and support vector machines.
📗 Given a feature map \(\varphi\), the new items \(\left(\varphi\left(x_{i}\right), y_{i}\right)\) for \(i = 1, 2, ..., n\) can be used to train perceptrons or support vector machines.
📗 When applying the resulting classifier on a new item \(x_{i'}\), \(\varphi\left(x_{i'}\right)\) should be used as the features too.
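As a sketch of the idea, using the quadratic feature map from the kernel examples below and made-up points on two circles (which are not linearly separable in 2D):

```python
import math

def phi(x):
    # Quadratic feature map: (x1, x2) -> (x1^2, sqrt(2) x1 x2, x2^2).
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

# Points on a circle of radius 1 vs radius 2: no line separates them in 2D,
# but after phi, the first plus third coordinate equals the squared radius,
# so a plane separates the two classes in 3D.
inner = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]
outer = [(2.0, 0.0), (0.0, 2.0), (-2.0, 0.0)]
```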
TopHat Discussion ID:
📗 [1 points] Transform the points (using the feature map) and move the plane such that the plane separates the two classes.





# Kernel Trick

📗 Using non-linear feature maps for support vector machines (which are linear classifiers) is called the kernel trick, since any feature map on a data set can be represented by an \(n \times n\) matrix called the kernel matrix (or Gram matrix): \(K_{i i'} = \varphi\left(x_{i}\right)^\top \varphi\left(x_{i'}\right) = \varphi_{1}\left(x_{i}\right) \varphi_{1}\left(x_{i'}\right) + \varphi_{2}\left(x_{i}\right) \varphi_{2}\left(x_{i'}\right) + ... + \varphi_{m}\left(x_{i}\right) \varphi_{m}\left(x_{i'}\right)\), for \(i = 1, 2, ..., n\) and \(i' = 1, 2, ..., n\).
➩ If \(\varphi\left(x_{i}\right) = \begin{bmatrix} x_{i 1}^{2} \\ \sqrt{2} x_{i 1} x_{i 2} \\ x_{i 2}^{2} \end{bmatrix}\), then \(K_{i i'} = \left(x^\top_{i} x_{i'}\right)^{2}\).
➩ If \(\varphi\left(x_{i}\right) = \begin{bmatrix} \sqrt{2} x_{i 1} \\ x_{i 1}^{2} \\ \sqrt{2} x_{i 1} x_{i 2} \\ x_{i 2}^{2} \\ \sqrt{2} x_{i 2} \\ 1 \end{bmatrix}\), then \(K_{i i'} = \left(x^\top_{i} x_{i'} + 1\right)^{2}\).
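The first identity can be spot-checked numerically (with arbitrary example points):

```python
import math

def phi(x):
    # Feature map from the first example: (x1, x2) -> (x1^2, sqrt(2) x1 x2, x2^2).
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x, z = (1.0, 2.0), (3.0, -1.0)
lhs = dot(phi(x), phi(z))   # inner product in the feature space
rhs = dot(x, z) ** 2        # kernel value (x^T z)^2
# lhs and rhs agree up to floating point error
```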
TopHat Quiz (Past Exam Question) ID:
📗 [4 points] Consider a kernel \(K\left(x_{i_{1}}, x_{i_{2}}\right)\) = + + , where both \(x_{i_{1}}\) and \(x_{i_{2}}\) are 1D positive real numbers. What is the feature vector \(\varphi\left(x_{i}\right)\) induced by this kernel evaluated at \(x_{i}\) = ?
📗 Answer (comma separated vector): .




# Kernel Matrix

📗 A matrix is a kernel for some feature map \(\varphi\) if and only if it is symmetric positive semi-definite (positive semi-definiteness is equivalent to having non-negative eigenvalues).
📗 Some kernel matrices correspond to infinite-dimensional feature maps.
➩ Linear kernel: \(K_{i i'} = x^\top_{i} x_{i'}\).
➩ Polynomial kernel: \(K_{i i'} = \left(x^\top_{i} x_{i'} + 1\right)^{d}\).
➩ Radial basis function (Gaussian) kernel: \(K_{i i'} = e^{- \dfrac{1}{\sigma^{2}} \left(x_{i} - x_{i'}\right)^\top \left(x_{i} - x_{i'}\right)}\). In this case, the new features are infinite dimensional (any finite data set is linearly separable), and dual optimization techniques are used to find the weights (subgradient descent for the primal problem cannot be used).
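As a numerical sanity check of these properties for the Gaussian kernel (hypothetical points): the Gram matrix is symmetric with ones on the diagonal, and positive semi-definiteness can be spot-checked by verifying \(v^\top K v \geq 0\) for random vectors \(v\).

```python
import math
import random

def rbf(x, z, sigma=1.0):
    # Gaussian (RBF) kernel: exp(-(1 / sigma^2) ||x - z||^2).
    sq = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq / sigma ** 2)

X = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 1.0)]
K = [[rbf(xi, xj) for xj in X] for xi in X]

# Spot-check positive semi-definiteness: v^T K v >= 0 for random v.
rng = random.Random(0)
n = len(X)
for _ in range(100):
    v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    quad = sum(v[i] * K[i][j] * v[j] for i in range(n) for j in range(n))
    assert quad >= -1e-9
```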
Math Note
➩ The primal problem is given by \(\displaystyle\min_{w} \dfrac{\lambda}{2} w^\top w + \dfrac{1}{n} \displaystyle\sum_{i=1}^{n} \displaystyle\max\left\{0, 1 - \left(2 y_{i} - 1\right) \left(w^\top x_{i} + b\right)\right\}\),
➩ The dual problem is given by \(\displaystyle\max_{\alpha} \displaystyle\sum_{i=1}^{n} \alpha_{i} - \dfrac{1}{2} \displaystyle\sum_{i,i' = 1}^{n} \alpha_{i} \alpha_{i'} \left(2 y_{i} - 1\right) \left(2 y_{i'} - 1\right) \left(x^\top_{i} x_{i'}\right)\) subject to \(0 \leq \alpha_{i} \leq \dfrac{1}{\lambda n}\) and \(\displaystyle\sum_{i=1}^{n} \alpha_{i} \left(2 y_{i} - 1\right) = 0\).
➩ The dual problem only involves \(x^\top_{i} x_{i'}\), and with the new features, \(\varphi\left(x_{i}\right)^\top \varphi\left(x_{i'}\right)\) which are elements of the kernel matrix.
➩ The primal classifier is \(w^\top x + b\).
➩ The dual classifier is \(\displaystyle\sum_{i=1}^{n} \alpha_{i} \left(2 y_{i} - 1\right) \left(x^\top_{i} x\right) + b\), where \(\alpha_{i} \neq 0\) only when \(x_{i}\) is a support vector.



# K Nearest Neighbor

📗 The K Nearest Neighbor algorithm (not related to K Means) is a simple supervised learning algorithm that uses the \(K\) items from the training set that are closest to a new item to predict the label of the new item: Link, Wikipedia.
➩ 1 nearest neighbor copies the label of the closest item.
➩ 3 nearest neighbor finds the majority label of the three closest items.
➩ N nearest neighbor uses the majority label of the training set (of size N) to predict the label of every new item.
📗 The distance measure used in K nearest neighbor can be any of the \(L_{p}\) distances.
➩ \(L_{1}\) Manhattan distance.
➩ \(L_{2}\) Euclidean distance.
➩ \(L_{\infty}\) Chebyshev distance (the maximum absolute difference over all features).
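The three distances above can be sketched as:

```python
def l1(x, z):
    # Manhattan distance: sum of absolute feature differences.
    return sum(abs(a - b) for a, b in zip(x, z))

def l2(x, z):
    # Euclidean distance: square root of the sum of squared differences.
    return sum((a - b) ** 2 for a, b in zip(x, z)) ** 0.5

def linf(x, z):
    # Chebyshev distance: largest absolute feature difference.
    return max(abs(a - b) for a, b in zip(x, z))

# For x = (0, 0) and z = (3, 4): L1 = 7, L2 = 5, L-infinity = 4.
```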
TopHat Discussion ID:
📗 [1 points] Find the value of K for K nearest neighbor that is the most appropriate for the dataset. Click on an existing point to perform leave-one-out cross validation, and click on a new point to find the nearest neighbor.




# Training Set Accuracy

📗 For 1NN, the accuracy of the prediction on the training set is always 100 percent (each training item is its own nearest neighbor), assuming no two items share the same features but have different labels.
📗 When comparing the accuracy of KNN for different values of K (called hyperparameter tuning), training set accuracy is not a great measure.
📗 K fold cross validation is often used instead to measure the performance of a supervised learning algorithm on the training set.
➩ The training set is divided into K groups (K can be different from the K in KNN).
➩ Train the model on K - 1 of the groups and compute the accuracy on the remaining group.
➩ Repeat this process K times.
📗 K fold cross validation with \(K = n\) is called Leave One Out Cross Validation (LOOCV).
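LOOCV for 1NN can be sketched as follows, assuming hypothetical 1D data, Manhattan distance, and the tie-breaking rule from the quiz below (prefer the smaller index):

```python
def manhattan(x, z):
    # L1 distance: sum of absolute feature differences.
    return sum(abs(a - b) for a, b in zip(x, z))

def loocv_1nn(X, Y, dist):
    # Leave-one-out cross validation accuracy for 1 nearest neighbor:
    # predict each item from all of the other items; ties in distance
    # are broken in favor of the smaller index.
    correct = 0
    for i in range(len(X)):
        best = min((j for j in range(len(X)) if j != i),
                   key=lambda j: (dist(X[i], X[j]), j))
        correct += (Y[best] == Y[i])
    return correct / len(X)

# Two well-separated clusters: every held-out item's nearest neighbor
# shares its label, so the LOOCV accuracy is 1.0.
X = [(1.0,), (2.0,), (3.0,), (10.0,), (11.0,), (12.0,)]
Y = [0, 0, 0, 1, 1, 1]
```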
TopHat Quiz ID:
📗 [4 points] Given the following training data, what is the fold cross validation accuracy if NN (Nearest Neighbor) classifier with Manhattan distance is used. The first fold is the first instances, the second fold is the next instances, etc. Break the tie (in distance) by using the instance with the smaller index. Enter a number between 0 and 1.
\(x_{i}\)
\(y_{i}\)

📗 Answer: .




📗 Notes and code adapted from the course taught by Professors Blerina Gkotse, Jerry Zhu, Yudong Chen, Yingyu Liang, Charles Dyer.
📗 If there is an issue with TopHat during the lectures, please submit your answers on paper (include your Wisc ID and answers) or this Google form Link at the end of the lecture.
📗 Anonymous feedback can be submitted to: Form. Non-anonymous feedback and questions can be posted on Piazza: Link






Last Updated: March 06, 2026 at 3:28 PM