# Other Materials
📗 Pre-recorded videos from 2020 
Lecture 5 Part 1 (Support Vector Machines): 
Link 
Lecture 5 Part 2 (Subgradient Descent): 
Link 
Lecture 5 Part 3 (Kernel Trick): 
Link 
Lecture 6 Part 1 (Decision Tree): 
Link 
Lecture 6 Part 2 (Random Forest): 
Link 
Lecture 6 Part 3 (Nearest Neighbor): 
Link 
Lecture 7 Part 1 (Convolution): 
Link 
Lecture 7 Part 2 (Gradient Filters): 
Link 
Lecture 7 Part 3 (Computer Vision): 
Link 
Lecture 8 Part 1 (Computer Vision): 
Link 
Lecture 8 Part 2 (Viola Jones): 
Link 
Lecture 8 Part 3 (Convolutional Neural Net): 
Link 
📗 Relevant websites 
Support Vector Machine: 
Link 
RBF Kernel SVM Demo: 
Link 
Decision Tree: 
Link 
Random Forest Demo: 
Link 
K Nearest Neighbor: 
Link 
Map of Manhattan: 
Link 
Voronoi Diagram: 
Link 
KD Tree: 
Link 
Image Filter: 
Link 
Canny Edge Detection: 
Link 
SIFT: 
PDF 
HOG: 
PDF 
Conv Net on MNIST: 
Link 
Conv Net Vis: 
Link 
LeNet: 
PDF, 
Link 
Google Inception Net: 
PDF 
CNN Architectures: 
Link 
Image to Image: 
Link 
Image segmentation: 
Link 
Image colorization: 
Link, 
Link  
Image Reconstruction: 
Link 
Style Transfer: 
Link 
Move Mirror: 
Link 
Pose Estimation: 
Link 
YOLO Attack: 
YouTube 
📗 YouTube videos from previous summers 
📗 Hard Margin Support Vector Machine: 
How to find the margin expression for SVM? 
Link 
How to compute the SVM classifier? 
Link 
How to find the distance from a plane to a point? 
Link 
How to find the formula for SVM given two training points? 
Link 
What is the largest number of points that can be removed to maintain the same SVM? 
Link (Part 4) 
What is the minimum number of points that can be removed to improve the SVM margin? 
Link (Part 5) 
How many training items are needed for a one-vs-one SVM? 
Link (Part 2) 
Which items are used in a multi-class one-vs-one SVM? 
Link (Part 7) 
📗 Soft Margin Support Vector Machine: 
What is the gradient descent step for SVM hinge loss with linear activation? 
Link (Part 1) 
How to compute the subgradient? 
Link (Part 2) 
What happens if the lambda in soft-margin SVM is 0? 
Link (Part 3) 
How to compute the hinge loss gradient? 
Link (Part 1) 
📗 Kernel Trick: 
Why does the kernel trick work? 
Link 
How to find feature representation for sum of two kernel (Gram) matrices? 
Link 
What is the kernel SVM for XOR operator? 
Link 
How to convert the kernel matrix to feature vector? 
Link 
How to find the kernel (Gram) matrix given the feature representation? 
Link (Part 1) 
How to find the feature vector based on the kernel (Gram) matrix? 
Link (Part 4) 
How to find the kernel (Gram) matrix based on the feature vectors? 
Link (Part 10) 
📗 Entropy: 
How to do entropy computation? 
Link 
How to find the information gain given two distributions (this is the Avatar question)? 
Link 
What distribution maximizes the entropy? 
Link (Part 1) 
How to create a dataset with information gain of 0? 
Link (Part 2) 
How to compute the conditional entropy based on a binary variable dataset? 
Link (Part 3) 
How to find conditional entropy given a dataset? 
Link (Part 9) 
When is the information gain based on a dataset equal to zero? 
Link (Part 10) 
How to compute entropy of a binary variable? 
Link (Part 1) 
How to compute information gain, the Avatar question? 
Link (Part 2) 
How to compute conditional entropy based on a training set? 
Link (Part 3) 
📗 Decision Trees: 
What is the decision tree for implication operator? 
Link 
How many conditional entropy calculations are needed for a decision tree with real-valued features? 
Link (Part 1) 
What is the maximum and minimum training set accuracy for a decision tree? 
Link (Part 2) 
How to find the minimum number of conditional entropies that need to be computed for a binary decision tree? 
Link (Part 9) 
What is the maximum number of conditional entropies that need to be computed in a decision tree at a certain depth? 
Link (Part 4) 
📗 Nearest Neighbor: 
How to do three nearest neighbor (3NN)? 
Link 
How to find a KNN decision boundary? 
Link 
What is the accuracy for KNN when K = n or K = 1? 
Link (Part 1) 
Which K maximizes the accuracy of KNN? 
Link (Part 3) 
How to work with KNN with distance defined on the alphabet? 
Link (Part 4) 
How to find the 1NN accuracy on training set? 
Link (Part 8) 
How to draw the decision boundary of 1NN in 2D? 
Link (Part 1) 
How to find the smallest k such that all items are classified as the same label with kNN? 
Link (Part 2) 
Which value of k maximizes the accuracy of kNN? 
Link (Part 3) 
📗 K-Fold Validation: 
How to compute the leave-one-out accuracy for kNN with large k? 
Link 
What is the leave-one-out accuracy for KNN with K = n? 
Link (Part 2) 
How to compute cross validation accuracy for KNN? 
Link (Part 5) 
What is the leave-one-out accuracy for (n-1)-NN? 
Link (Part 5) 
How to find the 3-fold cross validation accuracy of a 1NN classifier? 
Link (Part 12) 
📗 Convolution and Image Gradient: 
How to compute the convolution between two matrices? 
Link (Part 1) 
How to compute the convolution between a matrix and a gradient (Sobel) filter? 
Link (Part 2) 
How to find the 2D convolution between two matrices? 
Link 
How to find a discrete approximate Gaussian filter? 
Link 
How to find the HOG features? 
Link 
How to compute the gradient magnitude of a pixel? 
Link (Part 3) 
How to compute the convolution of a 2D image with a Sobel filter? 
Link (Part 2) 
How to compute the convolution of a 2D image with a 1D gradient filter? 
Link (Part 8) 
How to compute the convolution of a 2D image with a sparse 2D filter? 
Link (Part 13) 
How to find the gradient magnitude using Sobel filter? 
Link (Part 3) 
How to find the gradient direction bin? 
Link (Part 4) 
📗 Convolutional Neural Network: 
How to count the number of weights to train in a convolutional neural network (LeNet)? 
Link 
How to find the number of weights in a CNN? 
Link 
How to compute the activation map after a pooling layer? 
Link (Part 1) 
How to find the number of weights in a CNN? 
Link (Part 2) 
How to compute the activation map after a max-pooling layer? 
Link (Part 11) 
How many weights are there in a CNN? 
Link (Part 11) 
How to find the number of weights and biases in a CNN? 
Link (Part 1) 
How to find the activation map after a pooling layer? 
Link (Part 2) 
 
# Keywords and Notations
📗 Support Vector Machine 
SVM classifier: \(\hat{y}_{i} = 1_{\left\{w^\top x_{i} + b \geq 0\right\}}\). 
Hard margin, original max-margin formulation: \(\displaystyle\max_{w} \dfrac{2}{\sqrt{w^\top w}}\) such that \(w^\top x_{i} + b \leq -1\) if \(y_{i} = 0\) and \(w^\top x_{i} + b \geq 1\) if \(y_{i} = 1\). 
Hard margin, simplified formulation: \(\displaystyle\min_{w} \dfrac{1}{2} w^\top w\) such that \(\left(2 y_{i} - 1\right)\left(w^\top x_{i} + b\right) \geq 1\). 
Soft margin, original formulation with slack variables: \(\displaystyle\min_{w} \dfrac{1}{2} w^\top w + \dfrac{1}{\lambda} \dfrac{1}{n} \displaystyle\sum_{i=1}^{n} \xi_{i}\) such that \(\left(2 y_{i} - 1\right)\left(w^\top x_{i} + b\right) \geq 1 - \xi_{i}, \xi_{i} \geq 0\), where \(\xi_{i}\) is the slack variable for instance \(i\) and \(\lambda\) is the regularization parameter. 
Soft margin, simplified formulation: \(\displaystyle\min_{w} \dfrac{\lambda}{2} w^\top w + \dfrac{1}{n} \displaystyle\sum_{i=1}^{n} \displaystyle\max\left\{0, 1 - \left(2 y_{i} - 1\right) \left(w^\top x_{i} + b\right)\right\}\) 
Subgradient descent formula: \(w = \left(1 - \lambda \alpha\right) w + \alpha \left(2 y_{i} - 1\right) 1_{\left\{\left(2 y_{i} - 1\right) \left(w^\top x_{i} + b\right) < 1\right\}} x_{i}\), where \(\alpha\) is the learning rate; the hinge term contributes a subgradient only when the margin constraint is violated. 
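A minimal NumPy sketch of one such subgradient step, assuming labels \(y_{i} \in \left\{0, 1\right\}\) as above; the names `w`, `b`, `x_i`, `y_i`, `lam`, and `alpha` are placeholders, not course code.

```python
import numpy as np

def svm_subgradient_step(w, b, x_i, y_i, lam, alpha):
    """One subgradient step on (lambda/2) w'w + max{0, 1 - (2y-1)(w'x + b)}."""
    s = 2 * y_i - 1                    # map the label from {0, 1} to {-1, +1}
    margin = s * (np.dot(w, x_i) + b)  # signed margin of instance i
    w = (1 - alpha * lam) * w          # shrinkage from the regularizer (lambda/2) w'w
    if margin < 1:                     # hinge loss active: its subgradient is -s * x_i (and -s for b)
        w = w + alpha * s * x_i
        b = b + alpha * s
    return w, b
```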
📗 Kernel Trick 
Kernel SVM classifier: \(\hat{y}_{i} = 1_{\left\{w^\top \varphi\left(x_{i}\right) + b \geq 0\right\}}\), where \(\varphi\) is the feature map. 
Kernel (Gram) matrix: \(K_{i i'} = \varphi\left(x_{i}\right)^\top \varphi\left(x_{i'}\right)\). 
Quadratic Kernel: \(K_{i i'} = \left(x_{i}^\top x_{i'} + 1\right)^{2}\) has feature representation \(\varphi\left(x_{i}\right) = \left(x_{i1}^{2}, x_{i2}^{2}, \sqrt{2} x_{i1} x_{i2}, \sqrt{2} x_{i1}, \sqrt{2} x_{i2}, 1\right)\). 
Gaussian RBF Kernel: \(K_{i i'} = \exp\left(- \dfrac{1}{2 \sigma^{2}} \left(x_{i} - x_{i'}\right)^\top \left(x_{i} - x_{i'}\right)\right)\) has infinite-dimensional feature representation, where \(\sigma^{2}\) is the variance parameter. 
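A short NumPy sketch computing these two Gram matrices for a data matrix `X` whose rows are the \(x_{i}\); the function names are illustrative assumptions.

```python
import numpy as np

def quadratic_gram(X):
    """K[i, i'] = (x_i . x_i' + 1)^2, the quadratic kernel Gram matrix."""
    return (X @ X.T + 1) ** 2

def rbf_gram(X, sigma2):
    """Gaussian RBF kernel Gram matrix with variance parameter sigma2."""
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dist / (2 * sigma2))
```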
📗 Information Theory: 
Entropy: \(H\left(Y\right) = -\displaystyle\sum_{y=1}^{K} p_{y} \log_{2} \left(p_{y}\right)\), where \(K\) is the number of classes (number of possible labels), \(p_{y}\) is the fraction of data points with label \(y\). 
Conditional entropy: \(H\left(Y | X\right) = -\displaystyle\sum_{x=1}^{K_{X}} p_{x} \displaystyle\sum_{y=1}^{K} p_{y|x} \log_{2} \left(p_{y|x}\right)\), where \(K_{X}\) is the number of possible values of feature, \(p_{x}\) is the fraction of data points with feature \(x\), \(p_{y|x}\) is the fraction of data points with label \(y\) among the ones with feature \(x\). 
Information gain, for feature \(j\): \(I\left(Y | X_{j}\right) = H\left(Y\right) - H\left(Y | X_{j}\right)\). 
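A minimal NumPy sketch of these three quantities for label and feature arrays `y` and `x`; the helper names are assumptions.

```python
import numpy as np

def entropy(y):
    """H(Y) = -sum_y p_y log2(p_y), with 0 log 0 treated as 0."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def conditional_entropy(y, x):
    """H(Y | X) = sum_x p_x H(Y | X = x)."""
    return float(sum((x == v).mean() * entropy(y[x == v]) for v in np.unique(x)))

def information_gain(y, x):
    """I(Y | X) = H(Y) - H(Y | X)."""
    return entropy(y) - conditional_entropy(y, x)
```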
📗 Decision Tree: 
Decision stump classifier: \(\hat{y}_{i} = 1_{\left\{x_{ij} \geq t_{j}\right\}}\), where \(t_{j}\) is the threshold for feature \(j\). 
Feature selection: \(j^\star = \mathop{\mathrm{argmax}}_{j} I\left(Y | X_{j}\right)\). 
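Reusing the hypothetical `information_gain` helper from the sketch above, the stump and the feature-selection rule could look like the following, with `X` a feature matrix whose columns are the \(X_{j}\).

```python
import numpy as np

def stump_predict(x_j, t_j):
    """Decision stump: predict 1 when feature j is at least the threshold t_j."""
    return (np.asarray(x_j) >= t_j).astype(int)

def select_feature(y, X):
    """Feature selection: j* = argmax_j I(Y | X_j) over the columns of X."""
    gains = [information_gain(y, X[:, j]) for j in range(X.shape[1])]
    return int(np.argmax(gains))
```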
📗 Convolution 
Convolution (1D): \(a = x \star w\), \(a_{j} = \displaystyle\sum_{t=-k}^{k} w_{t} x_{j-t}\), where \(w\) is the filter, and \(k\) is half of the width of the filter. 
Convolution (2D): \(A = X \star W\), \(A_{j j'} = \displaystyle\sum_{s=-k}^{k} \displaystyle\sum_{t=-k}^{k} W_{s,t} X_{j-s,j'-t}\), where \(W\) is the filter, and \(k\) is half of the width of the filter. 
Sobel filter: \(W_{x} = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix}\) and \(W_{y} = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix}\). 
Image gradient: \(\nabla_{x} X = W_{x} \star X\), \(\nabla_{y} X = W_{y} \star X\), with gradient magnitude \(G = \sqrt{\left(\nabla_{x} X\right)^{2} + \left(\nabla_{y} X\right)^{2}}\) and gradient direction \(\Theta = \arctan\left(\dfrac{\nabla_{y} X}{\nabla_{x} X}\right)\), applied element-wise. 
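A sketch of the Sobel image gradient using SciPy's 2D convolution (which flips the filter, matching the definition above); zero padding at the border is an assumption, and `arctan2` is used in place of \(\arctan\) so the direction covers all four quadrants.

```python
import numpy as np
from scipy.signal import convolve2d

# Sobel filters as defined above
W_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]])
W_y = np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]])

def image_gradient(X):
    """Convolve the image with both Sobel filters, then form magnitude and direction."""
    gx = convolve2d(X, W_x, mode="same", boundary="fill", fillvalue=0)
    gy = convolve2d(X, W_y, mode="same", boundary="fill", fillvalue=0)
    G = np.sqrt(gx ** 2 + gy ** 2)  # gradient magnitude
    Theta = np.arctan2(gy, gx)      # gradient direction in radians
    return G, Theta
```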
📗 Convolutional Neural Network 
Fully connected layer: \(a = g\left(w^\top x + b\right)\), where \(a\) is the activation unit, \(g\) is the activation function. 
Convolution layer: \(A = g\left(W \star X + b\right)\), where \(A\) is the activation map. 
Pooling layer: (max-pooling) \(a = \displaystyle\max\left\{x_{1}, ..., x_{m}\right\}\), (average-pooling) \(a = \dfrac{1}{m} \displaystyle\sum_{j=1}^{m} x_{j}\).
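A minimal sketch of non-overlapping pooling over an activation map, plus the usual weight-and-bias counts behind the CNN questions above (block size, one bias per filter, and the helper names are assumptions).

```python
import numpy as np

def pool2d(A, size=2, mode="max"):
    """Non-overlapping pooling over size-by-size blocks of the activation map A."""
    h, w = A.shape
    A = A[: h - h % size, : w - w % size]                 # drop rows/columns that do not fill a block
    blocks = A.reshape(h // size, size, w // size, size)  # [block row, row in block, block col, col in block]
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))

def conv_layer_params(num_filters, k, in_channels):
    """Weights plus biases in a convolution layer of num_filters k-by-k filters."""
    return num_filters * k * k * in_channels + num_filters

def fc_layer_params(n_in, n_out):
    """Weights plus biases in a fully connected layer."""
    return n_in * n_out + n_out
```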