Prev: W1 Next: W3

# Summary

📗 Tuesday to Friday lectures: 1:00 to 2:15, Zoom Link
📗 Monday to Saturday office hours: Zoom Link
📗 Personal meeting room: always open, Zoom Link

📗 Math Homework:
M4, M5,
📗 Programming Homework:
P2,
📗 Examples and Quizzes:
Q5, Q6, Q7, Q8,
📗 Discussions:
D3, D4,

# Lectures

📗 Slides (will be posted before lecture, usually updated on Monday):
Blank Slides: Part 1: PDF, Part 2: PDF, Part 3: PDF, Part 4: PDF,
📗 The annotated lecture slides will not be posted this year: please copy down the notes during the lecture or from the Zoom recording.

📗 Notes
Train

Image by Vishal Arora via medium


# Other Materials

📗 Pre-recorded videos from 2020
Lecture 5 Part 1 (Support Vector Machines): Link
Lecture 5 Part 2 (Subgradient Descent): Link
Lecture 5 Part 3 (Kernel Trick): Link
Lecture 6 Part 1 (Decision Tree): Link
Lecture 6 Part 2 (Random Forrest): Link
Lecture 6 Part 3 (Nearest Neighbor): Link

Lecture 7 Part 1 (Convolution): Link
Lecture 7 Part 2 (Gradient Filters): Link
Lecture 7 Part 3 (Computer Vision): Link
Lecture 8 Part 1 (Computer Vision): Link
Lecture 8 Part 2 (Viola Jones): Link
Lecture 8 Part 3 (Convolutional Neural Net): Link


📗 Relevant websites
Support Vector Machine: Link
RBF Kernel SVM Demo: Link

Decision Tree: Link
Random Forrest Demo: Link

K Nearest Neighbor: Link
Map of Manhattan: Link
Voronoi Diagram: Link
KD Tree: Link

Image Filter: Link
Canny Edge Detection: Link
SIFT: PDF
HOG: PDF
Conv Net on MNIST: Link
Conv Net Vis: Link
LeNet: PDF, Link
Google Inception Net: PDF
CNN Architectures: Link
Image to Image: Link
Image segmentation: Link
Image colorization: Link, Link 
Image Reconstruction: Link
Style Transfer: Link
Move Mirror: Link
Pose Estimation: Link
YOLO Attack: YouTube


📗 YouTube videos from previous summers
📗 Hard Margin Support Vector Machine:
How to find the margin expression for SVM? Link
Compute SVM classifier Link
How to find the distance from a plane to a point? Link
How to find the formula for SVM given two training points? Link
What is the largest number of points that can be removed to maintain the same SVM? Link (Part 4)
What is minimum number of points that can be removed to improve the SVM margin? Link (Part 5)
How many training items are needed for a one-vs-one SVM? Link (Part 2)
Which items are used in a multi-lcass one-vs-one SVM? Link (Part 7)

📗 Soft Margin Support Vector Machine:
What is the gradient descent step for SVM hinge loss with linear activation? Link (Part 1)
How to compute the subgradient? Link (Part 2)
What happens if the lambda in soft-margin SVM is 0? Link (Part 3)
How to compute the hinge loss gradient? Link (Part 1)

📗 Kernel Trick:
Why does the kernel trick work? Link
How to find feature representation for sum of two kernel (Gram) matrices? Link
What is the kernel SVM for XOR operator? Link
How to convert the kernel matrix to feature vector? Link
How to find the kernel (Gram) matrix given the feature representation? Link (Part 1)
How to find the feature vector based on the kernel (Gram) matrix? Link (Part 4)
How to find the kernal (Gram) matrix based on the feature vectors? Link (Part 10)

📗 Entropy:
How to do entropy computation? Link
How to find the information gain given two distributions (this is the Avatar question)? Link
What distribution maximizes the entropy? Link (Part 1)
How to create a dataset with information gain of 0? Link (Part 2)
How to compute the conditional entropy based on a binary variable dataset? Link (Part 3)
How to find conditional entropy given a dataset? Link (Part 9)
When is the information gain based on a dataset equal to zero? Link (Part 10)
How to compute entropy of a binary variable? Link (Part 1)
How to compute information gain, the Avatar question? Link (Part 2)
How to compute conditional entropy based on a training set? Link (Part 3)

📗 Decision Trees:
What is the decision tree for implication operator? Link
How many conditional entropy calculations are needed for a decision tree with real-valued features? Link (Part 1)
What is the maximum and minimum training set accuracy for a decision tree? Link (Part 2)
How to find the minimum number of conditional entropies that need to be computed for a binary decision tree? Link (Part 9)
What is the maximum number of conditional entropies that need to be computed in a decision tree at a certain depth? Link (Part 4)

📗 Nearest Neighbor:
How to do three nearest neighbor 3NN? Link
How to find a KNN decision boundary? Link
What is the accuracy for KNN when K = n or K = 1? Link (Part 1)
Which K maximizes the accuracy of KNN? Link (Part 3)
How to work with KNN with distance defined on the alphabet? Link (Part 4)
How to find the 1NN accuracy on training set? Link (Part 8)
How to draw the decision boundary of 1NN in 2D? Link (Part 1)
How to find the smallest k such that all items are classified as the same label with kNN? Link (Part 2)
Which value of k maximizes the accuracy of kNN? Link (Part 3)

📗 K-Fold Validation:
How to compute the leave-one-out accuracy for kNN with large k? Link
What is the leave-one-out accuracy for KNN with K = n? Link (Part 2)
How to compute cross validation accuracy for KNN? Link (Part 5)
What is the leave-one-out accuracy for n-1-NN? Link (Part 5)
How to find the 3 fold cross validation accuracy of a 1NN classifier? Link (Part 12)

📗 Convolution and Image Gradient:
How to compute the convolution between two matrices? Link (Part 1)
How to compute the convolution between a matrix an a gradient (Sobel) filter? Link (Part 2)
How to find the 2D convolution between two matrices? Link
How to find a discrete approximate Gausian filter? Link
How to find the HOG features? Link
How to compute the gradient magnitude of pixel? Link (Part 3)
How to compute the convolution of a 2D image with a Sobel filter? Link (Part 2)
How to compute the convolution of a 2D image with a 1D gradient filter? Link (Part 8)
How to compute the convolution of a 2D image with a sparse 2D filter? Link (Part 13)
How to find the gradient magnitude using Sobel filter? Link (Part 3)
How to find the gradient direction bin? Link (Part 4)

📗 Convolutional Neural Network:
How to count the number of weights for training for a convolutional neural network (LeNet)? Link
How to find the number of weights in a CNN? Link
How to compute the activation map after a pooling layer? Link (Part 1)
How to find the number of weights in a CNN? Link (Part 2)
How to compute the activation map after a max-pooling layer? Link (Part 11)
How many weights are there in a CNN? Link (Part 11)
How to find the number of weights and biases in a CNN? Link (Part 1)
How to find the activation map after a pooling layer? Link (Part 2)



# Keywords and Notations

📗 Support Vector Machine
SVM classifier: \(\hat{y}_{i} = 1_{\left\{w^\top x_{i} + b \geq 0\right\}}\).
Hard margin, original max-margin formulation: \(\displaystyle\max_{w} \dfrac{2}{\sqrt{w^\top w}}\) such that \(w^\top x_{i} + b \leq -1\) if \(y_{i} = 0\) and \(w^\top x_{i} + b \geq 1\) if \(y_{i} = 1\).
Hard margin, simplified formulation: \(\displaystyle\min_{w} \dfrac{1}{2} w^\top w\) such that \(\left(2 y_{i} - 1\right)\left(w^\top x_{i} + b\right) \geq 1\).
Soft margin, original max-margin formulation: \(\displaystyle\min_{w} \dfrac{1}{2} w^\top w + \dfrac{1}{\lambda} \dfrac{1}{n} \displaystyle\sum_{i=1}^{n} \xi_{i}\) such that \(\left(2 y_{i} - 1\right)\left(w^\top x_{i} + b\right) \geq 1 - \xi, \xi \geq 0\), where \(\xi_{i}\) is the slack variable for instance \(i\), \(\lambda\) is the regularization parameter.
Soft margin, simplified formulation: \(\displaystyle\min_{w} \dfrac{\lambda}{2} w^\top w + \dfrac{1}{n} \displaystyle\sum_{i=1}^{n} \displaystyle\max\left\{0, 1 - \left(2 y_{i} - 1\right) \left(w^\top x_{i} + b\right)\right\}\)
Subgradient descent formula: \(w = \left(1 - \lambda\right) w - \alpha \left(2 y_{i} - 1\right) 1_{\left\{\left(2 y_{i} - 1\right) \left(w^\top x_{i} + b\right) \geq 1\right\}} x_{i}\).

📗 Kernel Trick
Kernel SVM classifier: \(\hat{y}_{i} = 1_{\left\{w^\top \varphi\left(x_{i}\right) + b \geq 0\right\}}\), where \(\varphi\) is the feature map.
Kernal Gram matrix: \(K_{i i'} = \varphi\left(x_{i}\right)^\top \varphi\left(x_{i'}\right)\).
Quadratic Kernel: \(K_{i i'} = \left(x_{i^\top} x_{i'} + 1\right)^{2}\) has feature representation \(\varphi\left(x_{i}\right) = \left(x_{i1}^{2}, x_{i2}^{2}, \sqrt{2} x_{i1} x_{i2}, \sqrt{2} x_{i1}, \sqrt{2} x_{i2}, 1\right)\).
Gaussian RBF Kernel: \(K_{i i'} = \exp\left(- \dfrac{1}{2 \sigma^{2}} \left(x_{i} - x_{i'}\right)^\top \left(x_{i} - x_{i'}\right)\right)\) has infinite-dimensional feature representation, where \(\sigma^{2}\) is the variance parameter.

📗 Information Theory:
Entropy: \(H\left(Y\right) = -\displaystyle\sum_{y=1}^{K} p_{y} \log_{2} \left(p_{y}\right)\), where \(K\) is the number of classes (number of possible labels), \(p_{y}\) is the fraction of data points with label \(y\).
Conditional entropy: \(H\left(Y | X\right) = -\displaystyle\sum_{x=1}^{K_{X}} p_{x} \displaystyle\sum_{y=1}^{K} p_{y|x} \log_{2} \left(p_{y|x}\right)\), where \(K_{X}\) is the number of possible values of feature, \(p_{x}\) is the fraction of data points with feature \(x\), \(p_{y|x}\) is the fraction of data points with label \(y\) among the ones with feature \(x\).
Information gain, for feature \(j\): \(I\left(Y | X_{j}\right) = H\left(Y\right) - H\left(Y | X_{j}\right)\).

📗 Decision Tree:
Decision stump classifier: \(\hat{y}_{i} = 1_{\left\{x_{ij} \geq t_{j}\right\}}\), where \(t_{j}\) is the threshold for feature \(j\).
Feature selection: \(j^\star = \mathop{\mathrm{argmax}}_{j} I\left(Y | X_{j}\right)\).

📗 Convolution
Convolution (1D): \(a = x \star w\), \(a_{j} = \displaystyle\sum_{t=-k}^{k} w_{t} x_{j-t}\), where \(w\) is the filter, and \(k\) is half of the width of the filter.
Convolution (2D): \(A = X \star W\), \(A_{j j'} = \displaystyle\sum_{s=-k}^{k} \displaystyle\sum_{t=-k}^{k} W_{s,t} X_{j-s,j'-t}\), where \(W\) is the filter, and \(k\) is half of the width of the filter.
Sobel filter: \(W_{x} = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix}\) and \(W_{y} = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix}\).
Image gradient: \(\nabla_{x} X = W_{x} \star X\), \(\nabla_{y} X = W_{y} \star X\), with gradient magnitude \(G = \sqrt{\nabla_{x}^{2} + \nabla_{y}^{2}}\) and gradient direction \(\Theta = arctan\left(\dfrac{\nabla_{y}}{\nabla_{x}}\right)\).

📗 Convolutional Neural Network
Fully connected layer: \(a = g\left(w^\top x + b\right)\), where \(a\) is the activation unit, \(g\) is the activation function.
Convolution layer: \(A = g\left(W \star X + b\right)\), where \(A\) is the activation map.
Pooling layer: (max-pooling) \(a = \displaystyle\max\left\{x_{1}, ..., x_{m}\right\}\), (average-pooling) \(a = \dfrac{1}{m} \displaystyle\sum_{j=1}^{m} x_{j}\).






Last Updated: November 30, 2024 at 4:34 AM