# Other Materials
📗 Pre-recorded Videos from 2020
Lecture 5 Part 1 (Support Vector Machines): Link
Lecture 5 Part 2 (Subgradient Descent): Link
Lecture 5 Part 3 (Kernel Trick): Link
Lecture 6 Part 1 (Decision Tree): Link
Lecture 6 Part 2 (Random Forest): Link
Lecture 6 Part 3 (Nearest Neighbor): Link
Lecture 7 Part 1 (Convolution): Link
Lecture 7 Part 2 (Gradient Filters): Link
Lecture 7 Part 3 (Computer Vision): Link
Lecture 8 Part 1 (Computer Vision): Link
Lecture 8 Part 2 (Viola Jones): Link
Lecture 8 Part 3 (Convolutional Neural Net): Link
📗 Relevant websites
Support Vector Machine: Link
RBF Kernel SVM Demo: Link
Decision Tree: Link
Random Forest Demo: Link
K Nearest Neighbor: Link
Map of Manhattan: Link
Voronoi Diagram: Link
KD Tree: Link
Image Filter: Link
Canny Edge Detection: Link
SIFT: PDF
HOG: PDF
Conv Net on MNIST: Link
Conv Net Vis: Link
LeNet: PDF, Link
Google Inception Net: PDF
CNN Architectures: Link
Image to Image: Link
Image Segmentation: Link
Image Colorization: Link, Link
Image Reconstruction: Link
Style Transfer: Link
Move Mirror: Link
Pose Estimation: Link
YOLO Attack: YouTube
📗 YouTube videos from 2019 and 2020
How to find the margin expression for SVM? Link
Why does the kernel trick work? Link
Example (Quiz): Compute SVM classifier: Link
Example (Quiz): Kernel SVM for XOR operator: Link
Example (Quiz): Kernel matrix to feature vector: Link
Example (Quiz): Entropy computation: Link
Example (Quiz): Decision tree for implication operator: Link
Example (Quiz): Three nearest neighbors: Link
How to find the HOG features? Link
How to count the number of weights to train in a convolutional neural network (LeNet)? Link
Example (Quiz): How to find the 2D convolution between two matrices? Link
Example (Homework): How to find a discrete approximation of the Gaussian filter? Link
# Keywords and Notations
📗 Support Vector Machine
SVM classifier: \(\hat{y}_{i} = 1_{\left\{w^\top x_{i} + b \geq 0\right\}}\).
Hard margin, original max-margin formulation: \(\displaystyle\max_{w} \dfrac{2}{\sqrt{w^\top w}}\) such that \(w^\top x_{i} + b \leq -1\) if \(y_{i} = 0\) and \(w^\top x_{i} + b \geq 1\) if \(y_{i} = 1\).
Hard margin, simplified formulation: \(\displaystyle\min_{w} \dfrac{1}{2} w^\top w\) such that \(\left(2 y_{i} - 1\right)\left(w^\top x_{i} + b\right) \geq 1\).
Soft margin, original formulation with slack variables: \(\displaystyle\min_{w} \dfrac{1}{2} w^\top w + \dfrac{1}{\lambda} \dfrac{1}{n} \displaystyle\sum_{i=1}^{n} \xi_{i}\) such that \(\left(2 y_{i} - 1\right)\left(w^\top x_{i} + b\right) \geq 1 - \xi_{i}, \xi_{i} \geq 0\), where \(\xi_{i}\) is the slack variable for instance \(i\), \(\lambda\) is the regularization parameter.
Soft margin, simplified formulation: \(\displaystyle\min_{w} \dfrac{\lambda}{2} w^\top w + \dfrac{1}{n} \displaystyle\sum_{i=1}^{n} \displaystyle\max\left\{0, 1 - \left(2 y_{i} - 1\right) \left(w^\top x_{i} + b\right)\right\}\).
Subgradient descent formula (stochastic, one instance \(i\) at a time): \(w = \left(1 - \alpha \lambda\right) w + \alpha \left(2 y_{i} - 1\right) 1_{\left\{\left(2 y_{i} - 1\right) \left(w^\top x_{i} + b\right) < 1\right\}} x_{i}\), where \(\alpha\) is the learning rate.
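A minimal NumPy sketch of the stochastic subgradient update above, on made-up toy data; the step size \(\alpha\), the regularization \(\lambda\), and the update for \(b\) are illustrative assumptions, not part of the notes.

```python
import numpy as np

def svm_subgradient_epoch(X, y, w, b, lam=0.01, alpha=0.1):
    """One pass of stochastic subgradient descent on the soft-margin objective
    (lambda / 2) w'w + mean(max(0, 1 - (2y - 1)(w'x + b)))."""
    for x_i, y_i in zip(X, y):
        s = 2 * y_i - 1                                 # map label {0, 1} to {-1, +1}
        active = 1.0 if s * (w @ x_i + b) < 1 else 0.0  # hinge subgradient indicator
        w = (1 - alpha * lam) * w + alpha * s * active * x_i
        b = b + alpha * s * active                      # b update: an added assumption,
    return w, b                                         # not stated in the notes above

# made-up, linearly separable toy data
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1, 1, 0, 0])
w, b = np.zeros(2), 0.0
for _ in range(100):
    w, b = svm_subgradient_epoch(X, y, w, b)
print(w, b, (X @ w + b >= 0).astype(int))               # predictions should match y
```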
📗 Kernel Trick
Kernel SVM classifier: \(\hat{y}_{i} = 1_{\left\{w^\top \varphi\left(x_{i}\right) + b \geq 0\right\}}\), where \(\varphi\) is the feature map.
Kernel Gram matrix: \(K_{i i'} = \varphi\left(x_{i}\right)^\top \varphi\left(x_{i'}\right)\).
Quadratic Kernel: \(K_{i i'} = \left(x_{i}^\top x_{i'} + 1\right)^{2}\) has the feature representation \(\varphi\left(x_{i}\right) = \left(x_{i1}^{2}, x_{i2}^{2}, \sqrt{2} x_{i1} x_{i2}, \sqrt{2} x_{i1}, \sqrt{2} x_{i2}, 1\right)\) for two-dimensional \(x_{i}\).
Gaussian RBF Kernel: \(K_{i i'} = \exp\left(- \dfrac{1}{2 \sigma^{2}} \left(x_{i} - x_{i'}\right)^\top \left(x_{i} - x_{i'}\right)\right)\) has an infinite-dimensional feature representation, where \(\sigma^{2}\) is the variance parameter.
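A short NumPy sketch (made-up 2D points) that builds the quadratic and RBF Gram matrices and checks the quadratic kernel against its explicit feature map \(\varphi\).

```python
import numpy as np

def quadratic_gram(X):
    """Gram matrix K[i, i'] = (x_i' x_i' + 1)^2 over all pairs of rows of X."""
    return (X @ X.T + 1) ** 2

def rbf_gram(X, sigma2=1.0):
    """Gaussian RBF Gram matrix with variance parameter sigma2."""
    sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2 * sigma2))

def phi(x):
    """Explicit feature map of the quadratic kernel for 2D input."""
    x1, x2 = x
    return np.array([x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2, 1.0])

X = np.array([[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]])   # made-up points
Phi = np.array([phi(x) for x in X])
print(np.allclose(quadratic_gram(X), Phi @ Phi.T))    # True: K equals <phi(x), phi(x')>
print(rbf_gram(X, sigma2=2.0))
```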
📗 Information Theory
Entropy: \(H\left(Y\right) = -\displaystyle\sum_{y=1}^{K} p_{y} \log_{2} \left(p_{y}\right)\), where \(K\) is the number of classes (number of possible labels), \(p_{y}\) is the fraction of data points with label \(y\).
Conditional entropy: \(H\left(Y | X\right) = -\displaystyle\sum_{x=1}^{K_{X}} p_{x} \displaystyle\sum_{y=1}^{K} p_{y|x} \log_{2} \left(p_{y|x}\right)\), where \(K_{X}\) is the number of possible values of the feature, \(p_{x}\) is the fraction of data points with feature value \(x\), and \(p_{y|x}\) is the fraction of data points with label \(y\) among the ones with feature value \(x\).
Information gain, for feature \(j\): \(I\left(Y | X_{j}\right) = H\left(Y\right) - H\left(Y | X_{j}\right)\).
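A small Python sketch of these three quantities for one discrete feature, on made-up data.

```python
import numpy as np

def entropy(y):
    """H(Y) = -sum_y p_y log2(p_y) over the labels in y."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def conditional_entropy(x, y):
    """H(Y | X) = sum_x p_x H(Y | X = x) for one discrete feature x."""
    values, counts = np.unique(x, return_counts=True)
    p_x = counts / counts.sum()
    return sum(p * entropy(y[x == v]) for v, p in zip(values, p_x))

def information_gain(x, y):
    """I(Y | X) = H(Y) - H(Y | X)."""
    return entropy(y) - conditional_entropy(x, y)

# made-up binary feature and binary labels
x = np.array([0, 0, 1, 1, 1, 0])
y = np.array([0, 0, 1, 1, 0, 0])
print(entropy(y), conditional_entropy(x, y), information_gain(x, y))
```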
📗 Decision Tree
Decision stump classifier: \(\hat{y}_{i} = 1_{\left\{x_{ij} \geq t_{j}\right\}}\), where \(t_{j}\) is the threshold for feature \(j\).
Feature selection: \(j^\star = \mathop{\mathrm{argmax}}_{j} I\left(Y | X_{j}\right)\).
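A sketch of decision-stump feature selection, under the added assumption that the threshold \(t_{j}\) is also chosen by information gain over the observed feature values; the data and helper names are illustrative.

```python
import numpy as np

def entropy(y):
    """H(Y); an empty subset contributes zero entropy."""
    if len(y) == 0:
        return 0.0
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def split_gain(y, mask):
    """Information gain of splitting the labels y by a boolean mask."""
    p = mask.mean()
    return entropy(y) - (p * entropy(y[mask]) + (1 - p) * entropy(y[~mask]))

def best_stump(X, y):
    """Return (j*, t_j, gain) maximizing information gain for the stump 1{x_ij >= t_j}."""
    best = (None, None, -np.inf)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):           # candidate thresholds: observed values
            gain = split_gain(y, X[:, j] >= t)
            if gain > best[2]:
                best = (j, t, gain)
    return best

# made-up data: feature 1 separates the labels, feature 0 is noise
X = np.array([[0.2, 1.0], [0.9, 2.0], [0.4, 5.0], [0.7, 6.0]])
y = np.array([0, 0, 1, 1])
j_star, t, gain = best_stump(X, y)
print(j_star, t, gain)                         # expect j_star = 1
print((X[:, j_star] >= t).astype(int))         # decision stump predictions
```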
📗 Convolution
Convolution (1D): \(a = x \star w\), \(a_{j} = \displaystyle\sum_{t=-k}^{k} w_{t} x_{j-t}\), where \(w\) is the filter, and \(k\) is half of the width of the filter.
Convolution (2D): \(A = X \star W\), \(A_{j j'} = \displaystyle\sum_{s=-k}^{k} \displaystyle\sum_{t=-k}^{k} W_{s,t} X_{j-s,j'-t}\), where \(W\) is the filter, and \(k\) is half of the width of the filter.
Sobel filter: \(W_{x} = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix}\) and \(W_{y} = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix}\).
Image gradient: \(\nabla_{x} X = W_{x} \star X\), \(\nabla_{y} X = W_{y} \star X\), with gradient magnitude \(G = \sqrt{\nabla_{x}^{2} + \nabla_{y}^{2}}\) and gradient direction \(\Theta = \arctan\left(\dfrac{\nabla_{y}}{\nabla_{x}}\right)\).
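A NumPy sketch of the 2D convolution formula and the Sobel gradients on a made-up image with a vertical edge; arctan2 is used in place of arctan to avoid dividing by zero where \(\nabla_{x} = 0\).

```python
import numpy as np

def conv2d(X, W):
    """A[j, j'] = sum_{s, t} W[s, t] X[j - s, j' - t] over the valid region,
    for a (2k + 1) x (2k + 1) filter W."""
    k = W.shape[0] // 2
    Wf = W[::-1, ::-1]                         # flipping W turns the convolution
    n, m = X.shape                             # into a sliding dot product
    A = np.zeros((n - 2 * k, m - 2 * k))
    for j in range(A.shape[0]):
        for jp in range(A.shape[1]):
            A[j, jp] = (Wf * X[j:j + 2 * k + 1, jp:jp + 2 * k + 1]).sum()
    return A

# Sobel filters from the notes
Wx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
Wy = np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]], dtype=float)

X = np.array([[0, 0, 0, 1, 1]] * 5, dtype=float)   # made-up image, vertical edge
Gx, Gy = conv2d(X, Wx), conv2d(X, Wy)
G = np.sqrt(Gx ** 2 + Gy ** 2)                 # gradient magnitude
Theta = np.arctan2(Gy, Gx)                     # gradient direction
print(G)
print(Theta)
```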
📗 Convolutional Neural Network
Fully connected layer: \(a = g\left(w^\top x + b\right)\), where \(a\) is the activation unit, \(g\) is the activation function.
Convolution layer: \(A = g\left(W \star X + b\right)\), where \(A\) is the activation map.
Pooling layer: (max-pooling) \(a = \displaystyle\max\left\{x_{1}, ..., x_{m}\right\}\), (average-pooling) \(a = \dfrac{1}{m} \displaystyle\sum_{j=1}^{m} x_{j}\).
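A NumPy sketch of a tiny forward pass (convolution layer with ReLU, max-pooling, fully connected layer) on made-up weights; the convolution layer uses the cross-correlation convention common in CNN implementations, which flips the filter relative to the convolution formula above.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def conv_layer(X, W, b):
    """Activation map g(W * X + b) with ReLU, cross-correlation convention."""
    k = W.shape[0]
    A = np.zeros((X.shape[0] - k + 1, X.shape[1] - k + 1))
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            A[i, j] = (W * X[i:i + k, j:j + k]).sum() + b
    return relu(A)

def max_pool(A, size=2):
    """Non-overlapping max-pooling over size x size blocks."""
    n, m = A.shape[0] // size, A.shape[1] // size
    return A[:n * size, :m * size].reshape(n, size, m, size).max(axis=(1, 3))

def fully_connected(x, w, b):
    """Fully connected layer a = g(w'x + b) with ReLU."""
    return relu(w @ x + b)

rng = np.random.default_rng(0)                 # made-up input and weights
X = rng.random((6, 6))
W, b_conv = rng.standard_normal((3, 3)), 0.1
A = conv_layer(X, W, b_conv)                   # 4 x 4 activation map
P = max_pool(A, size=2)                        # 2 x 2 after pooling
w_fc, b_fc = rng.standard_normal(P.size), 0.0
print(fully_connected(P.ravel(), w_fc, b_fc))  # single output unit
```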