Young Wu's Homepage

# Videos on Past Exam Questions

📗 Perceptron:

Why does the (batch) perceptron algorithm work? Link
Why cannot use linear regression for binary classification? Link
How to use Perceptron update formula? Link
How to find the size of the hypothesis space for linear classifiers? Link (Part 1)

📗 Gradient Descent:

Why does gradient descent work? Link
Computation of Hessian of quadratic form Link
Computation of eigenvalues Link
Gradient descent for linear regression Link
What is the gradient descent step for cross-entropy loss with linear activation? Link (Part 1)
What is the sign of the next gradient descent step? Link (Part 2)
Which loss functions are equivalent to squared loss? Link (Part 3)
How to compute the gradient of the cross-entropy loss with linear activation? Link (Part 4)
How to find the location that minimizes the distance to multiple points? Link (Part 3)

📗 Logistic Regression:

How to derive logistic regression gradient descent step formula? Link
Gradient descent for logistic activation with squared error Link
How to compute logistic gradient descent? Link

📗 Neural Network:

How to construct XOR network? Link
How derive 2-layer neural network gradient descent step? Link
How derive multi-layer neural network gradient descent induction step? Link
How to find the missing weights in neural network given data set? Link
How many weights are used in one step of backpropogation? Link (Part 2)

📗 Regularization:

Comparison between L1 and L2 regularization. Link
How to compute cross validation accuracy? Link

📗 Hard Margin Support Vector Machine:

How to find the margin expression for SVM? Link
Compute SVM classifier Link
How to find the distance from a plane to a point? Link
How to find the formula for SVM given two training points? Link
What is the largest number of points that can be removed to maintain the same SVM? Link (Part 4)
What is minimum number of points that can be removed to improve the SVM margin? Link (Part 5)
How many training items are needed for a one-vs-one SVM? Link (Part 2)
Which items are used in a multi-lcass one-vs-one SVM? Link (Part 7)

📗 Soft Margin Support Vector Machine:

What is the gradient descent step for SVM hinge loss with linear activation? Link (Part 1)
How to compute the subgradient? Link (Part 2)
What happens if the lambda in soft-margin SVM is 0? Link (Part 3)
How to compute the hinge loss gradient? Link (Part 1)

📗 Kernel Trick:

Why does the kernel trick work? Link
How to find feature representation for sum of two kernel (Gram) matrices? Link
What is the kernel SVM for XOR operator? Link
How to convert the kernel matrix to feature vector? Link
How to find the kernel (Gram) matrix given the feature representation? Link (Part 1)
How to find the feature vector based on the kernel (Gram) matrix? Link (Part 4)
How to find the kernal (Gram) matrix based on the feature vectors? Link (Part 10)

📗 Entropy:

How to do entropy computation? Link
How to find the information gain given two distributions (this is the Avatar question)? Link
What distribution maximizes the entropy? Link (Part 1)
How to create a dataset with information gain of 0? Link (Part 2)
How to compute the conditional entropy based on a binary variable dataset? Link (Part 3)
How to find conditional entropy given a dataset? Link (Part 9)
When is the information gain based on a dataset equal to zero? Link (Part 10)
How to compute entropy of a binary variable? Link (Part 1)
How to compute information gain, the Avatar question? Link (Part 2)
How to compute conditional entropy based on a training set? Link (Part 3)

📗 Decision Trees:

What is the decision tree for implication operator? Link
How many conditional entropy calculations are needed for a decision tree with real-valued features? Link (Part 1)
What is the maximum and minimum training set accuracy for a decision tree? Link (Part 2)
How to find the minimum number of conditional entropies that need to be computed for a binary decision tree? Link (Part 9)
What is the maximum number of conditional entropies that need to be computed in a decision tree at a certain depth? Link (Part 4)

📗 Nearest Neighbor:

How to do three nearest neighbor 3NN? Link
How to find a KNN decision boundary? Link
What is the accuracy for KNN when K = n or K = 1? Link (Part 1)
Which K maximizes the accuracy of KNN? Link (Part 3)
How to work with KNN with distance defined on the alphabet? Link (Part 4)
How to find the 1NN accuracy on training set? Link (Part 8)
How to draw the decision boundary of 1NN in 2D? Link (Part 1)
How to find the smallest k such that all items are classified as the same label with kNN? Link (Part 2)
Which value of k maximizes the accuracy of kNN? Link (Part 3)

📗 K-Fold Validation:

How to compute the leave-one-out accuracy for kNN with large k? Link
What is the leave-one-out accuracy for KNN with K = n? Link (Part 2)
How to compute cross validation accuracy for KNN? Link (Part 5)
What is the leave-one-out accuracy for n-1-NN? Link (Part 5)
How to find the 3 fold cross validation accuracy of a 1NN classifier? Link (Part 12)

📗 Convolution and Image Gradient:

How to compute the convolution between two matrices? Link (Part 1)
How to compute the convolution between a matrix an a gradient (Sobel) filter? Link (Part 2)
How to find the 2D convolution between two matrices? Link
How to find a discrete approximate Gausian filter? Link
How to find the HOG features? Link
How to compute the gradient magnitude of pixel? Link (Part 3)
How to compute the convolution of a 2D image with a Sobel filter? Link (Part 2)
How to compute the convolution of a 2D image with a 1D gradient filter? Link (Part 8)
How to compute the convolution of a 2D image with a sparse 2D filter? Link (Part 13)
How to find the gradient magnitude using Sobel filter? Link (Part 3)
How to find the gradient direction bin? Link (Part 4)

📗 Convolutional Neural Network:

How to count the number of weights for training for a convolutional neural network (LeNet)? Link
How to find the number of weights in a CNN? Link
How to compute the activation map after a pooling layer? Link (Part 1)
How to find the number of weights in a CNN? Link (Part 2)
How to compute the activation map after a max-pooling layer? Link (Part 11)
How many weights are there in a CNN? Link (Part 11)
How to find the number of weights and biases in a CNN? Link (Part 1)
How to find the activation map after a pooling layer? Link (Part 2)

📗 Probability and Bayes Rule:

How to compute the probability of A given B knowing the probability of A given not B? Link (Part 4)
How to compute the marginal probabilities given the ratio between the conditionals? Link (Part 1)
How to compute the conditional probabilities given the same variable? Link (Part 1)
What is the probability of switch between elements in a cycle? Link (Part 2)
Which marginal probabilities are valid given the joint probabilities? Link (Part 3)
How to use the Bayes rule to find which biased coin leads to a sequence of coin flips? Link
Please do NOT forget to submit your homework on Canvas! Link
How to use Bayes rule to find the probability of truth telling? Link (Part 6)
How to estimate fraction given randomized survey data? Link (Part 12)
How to write down the joint probability table given the variables are independent? Link (Part 13)
Given the ratio between two conditional probabilities, how to compute the marginal probabilities? Link (Part 1)
What is the Boy or Girl paradox? Link (Part 3)
How to compute the maximum likelihood estimate of a conditional probability given a count table? Link (Part 1)
How to compare the probabilities in the Boy or Girl Paradox? Link (Part 3)

📗 N-Gram Model and Markov Chains:

How to compute the MLE probability of a sentence given a training document? Link (Part 1)
How to find maximum likelihood estimates for Bernoulli distribution? Link
How to generate realizations of discrete random variables using CDF inversion? Link
How to find the sentence generated given the random number from CDF inversion? Link (Part 3)
How to find the probability of observing a sentence given the first and last word using the transition matrix? Link (Part 14)
How many conditional probabilities need to be stored for a n-gram model? Link (Part 2)

📗 Bayesian Network:

How to compute the joint probability given the conditional probability table? Link
How to compute conditional probability table given training data? Link
How to do inference (find joint and conditional probability) given conditional probability table? Link
How to find the conditional probabilities for a common cause configuration? Link
What is the size of the conditional probability table? Link
How to compute a condition probability given a Bayesian network with three variables? Link
What is the size of a conditional probability table of two discrete variables? Link (Part 2)
How many joint probabilities are needed to compute a marginal probability? Link (Part 3)
How to compute the MLE conditional probability with Laplace smoothing given a data set? Link (Part 2)
What is the number of conditional probabilities stored in a CPT given a Bayesian network? Link (Part 3)
How to compute the number of probabilities in a CPT for variables with more than two possible values? Link (Part 14)
How to find the MLE of the conditional probability given the sum of two variables? Link (Part 5)
How many joint probabilities are used in the computation of a marginal probability? Link (Part 4)
How to find the size of an arbitrary Bayesian network with binary variables? Link (Part 3)

📗 Navie Bayes and Hidden Markov Model:

How to use naive Bayes classifier to do multi-class classification? Link
How to find the size of the conditional probability table for a Naive Bayes model? Link
How to compute the probability of observing a sequence under an HMM? Link
What is the number of conditional probabilities stored in a CPT given a Naive Bayes model? Link (Part 4)
How to find th obervation probabilities given an HMM? Link (Part 2)
What is the size of the CPT for a Naive Bayes network? Link (Part 1)
How to detect virus in email messages using Naive Bayes? Link (Part 2)
What is the relationship between Naive Bayes and Logistic Regression? Link

📗 Hierarchical Clustering

How to update distance table for hierarchical clustering? Link
How to do hierarchical clustering for 1D points? Link
How to do hierarchical clustering given pairwise distance table? Link

📗 K-Means Clustering

What is the relationship between K Means and Gradient Descent? Link
How to update cluster centers for K-means clustering? Link
How to find the cluster center so that a fixed number of items are assigned to each K-means cluster? Link
How to find the cluster center so that one of the clusters is empty? Link (Part 9)

📗 PCA

Why is PCA solving eigenvalues and eigenvectors? Part 1, Part 2, Part 3
How to compute projection? Link
How to compute new features based on PCA? Link
How to compute the projected variance? Link (Part 8)

📗 Reinforcement Learning

How to compute value function given policy? Link
How to compute optimal value function? Link

📗 Uninformed Search

How to get expansion path for BFS? Link
How to get expansion path for DFS? Link
How to get expansion path for IDS? Link
What is the shape of tree for IDS to search the quickest? Link
How to do backtracking for search problems? Link
How to compute time complexity for multi-branch trees? Link
How to find the best case time complexity? Link (Part 4)
What is the shape of the tree that minimizes the time complexity of IDS? Link (Part 8)
What is the minimum number of nodes searched given the goal depth? Link (Part 4)
How to find the number of states expanded during search for a large tree? Link (Part 12)
How to find all possible configurations of the 3-puzzle? Link (Part 1)
How to find the time complexity on binary search tree with large number of nodes? Link (Part 2, Part 3)
How to find the shape of a search tree such that IDS is the quickest? Link (Part 1)

📗 Informed Search

How to get expansion path for UCS? Link
How to get expansion path for BFGS? Link
How to get expansion path for A? Link
How to get expansion path for A*? Link
How to check if a heuristic is admissible? Link
How to find the expansion sequence for uniform cost search? Link
Which functions of two admissible heuristic are still admissible? Link
How to do A search on a maze? Link (Part 2)

📗 Hill Climbing

How to do hill climbing on 2D state spaces? Link
How to do hill climbing for SAT problems? Link
What is the number of flips needed to move from one binary sequence to another? Link (Part 7)
What is the local minimum of a linear function with three variables? Link (Part 14)
How to use hill climbing to solve the graph coloring problem? Link (Part 7)
How to do hill climbing on 3D state spaces? Link (Part 1)
How to find the shortest sequence of flipping consecutive entries to reach a specific configuration? Link

📗 Simulated Annealing

How to find the probability of moving in simulated annealing? Link
Which temperature would minimize the probability of moving in simulated annealing? Link (Part 2)

📗 Genetic Algorithm

How to find reproduction probabilities? Link
How to find the state with the highest reproduction probability given the argmax-argmin fitness functions? Link (Part 1, Part 2)
How to compute reproduction probabilities? Link

📗 Extensive Form Game

How to solve the lions game? Link
How to solve the pirate game? Link
How to solve the wage competition game (sequential version)? Link
How to solve a simple game with Chance? Link
How to figure out which branches can be pruned using Alpha Beta algorithm? Simple Link, Complicated Link
How to solve the Rubinstein Bargaining problem? Link
How to figure out which nodes are alpha-beta pruned? Link
How to find the solution of the II-nim game? Link (Part 2)
How to find the solution of a game with Chance? Link (Part 11)
How to compute the value of a game with Chance? Link (Part 11)
How to reorder the branches so that alpha-beta pruning will prune the largest number of nodes? Link (Part 13)
What is the order of the branches that maximizes the number of alpha-beta pruned nodes? Link (Part 13)
How to reorder the subtrees so that alpha-beta would prune the largest number of nodes? Link (Part 1)
How to find the value of the game for II-nim games? Link (Part 2)
How to solve for the SPE for a game with Chance? Link (Part 3)

📗 Normal Form Game

How to find the Nash equilibrium of a zero sum game? Link
How to do iterated elimination of strictly dominated strategies (IESDS)? Link
How to find the mixed strategy Nash equilibrium of a simple 2 by 2 game? Link
What is the median voter theorem? Link
How to guess and check a mixed strategy Nash equilibrium of a simple 3 by 3 game? Link
How to solve the mixing probabilities of the volunteer's dilemma game? Link
What is the Nash equilibrium of the vaccination game? Link
How to find the mixed strategy best responses? Link
How to compute the Nash equilibrium for zero-sum matrix games? Link
How to draw the best responses functions with mixed strategies? Link
How to compute the pure Nash equilibrium of the high way game? Link (Part 5)
What is the value of a mixed strategy Nash equilibrium? Link (Part 6)
How to compute the pure Nash equilibrium of the vaccination game? Link (Part 5)
How to find the value of the battle of the sexes game? Link (Part 6)
How to redesign the game to implement a Nash equilibrium? Link (Part 10)
How to find all Nash equilibria using best response functions? Link (Part 1)
How to compute the Nash equilibrium of the pollution game? Link (Part 3)
How to compute a symmetric mixed strategy Nash equilibrium for the volunteer's dilemma game? Link (Part 10)
How to perform iterated elimination of strictly dominated strategies? Link (Part 14)
How to compute the Nash equilibrium where only one player mixes? Link (Part 1)
How to compute the mixed Nash for the battle of sexes game? Link (Part 1)
How to compute the game with indifferences where only one player mixes? Link (Part 2)
How to modify the game so that a specific entry is the Nash? Link (Part 1)
What is the Nash equilibrium of a the highway game? Link (Part 2)
What is the Nash equilibrium of the pollution game? Link (Part 3)
How to find the Nash equilibrium of the vaccination game? Link (Part 4)

# Lecture Videos (Old)

Lecture 1 Part 1 (Admin, 2021): Link and Link
Lecture 1 Part 2 (Supervised learning): Link
Lecture 1 Part 3 (Perceptron learning): Link
Lecture 2 Part 1 (Loss functions): Link
Lecture 2 Part 2 (Logistic regression): Link
Lecture 2 Part 3 (Convexity): Link

Lecture 3 Part 1 (Neural Network): Link
Lecture 3 Part 2 (Backpropogation): Link
Lecture 3 Part 3 (Multi-Layer Network): Link
Lecture 4 Part 1 (Stochastic Gradient): Link
Lecture 4 Part 2 (Multi-Class Classification): Link
Lecture 4 Part 3 (Regularization): Link

Lecture 5 Part 1 (Support Vector Machines): Link
Lecture 5 Part 2 (Subgradient Descent): Link
Lecture 5 Part 3 (Kernel Trick): Link
Lecture 6 Part 1 (Decision Tree): Link
Lecture 6 Part 2 (Random Forrest): Link
Lecture 6 Part 3 (Nearest Neighbor): Link

Lecture 7 Part 1 (Convolution): Link
Lecture 7 Part 2 (Gradient Filters): Link
Lecture 7 Part 3 (Computer Vision): Link
Lecture 8 Part 1 (Computer Vision): Link
Lecture 8 Part 2 (Viola Jones): Link
Lecture 8 Part 3 (Convolutional Neural Net): Link

Lecture 10 Part 1 (Generative Models): Link
Lecture 10 Part 2 (Natural Language): Link
Lecture 10 Part 3 (Sampling): Link
Lecture 11 Part 1 (Probability Distribution): Link
Lecture 11 Part 2 (Bayesian Network): Link
Lecture 11 Part 3 (Network Structure): Link
Lecture 11 Part 4 (Naive Bayes): Link

Lecture 12 Part 1 (Hidden Markov Model): Link
Lecture 12 Part 2 (HMM Evaluation): Link
Lecture 12 Part 3 (HMM Training): Link
Lecture 12 Part 4 (Recurrent Neural Network): Link
Lecture 12 Part 5 (Backprop Through Time): Link
Lecture 12 Part 6 (RNN Variants): Link

Lecture 13 (Reinforcement Learning): Guest Lecture (see Canvas Zoom recording)
Lecture 14 (Optimization): Guest Lecture (see Canvas Zoom recording)

Lecture 15 Part 1 (Unsupervised Learning): Link
Lecture 15 Part 2 (Hierarchical Clustering): Link
Lecture 15 Part 3 (K Means Clustering): Link
Lecture 16 Part 1 (Dimensionality Reduction): Link
Lecture 16 Part 2 (Principal Component): Link
Lecture 16 Part 3 (Non-linear PCA): Link

Lecture 17 Part 1 (Uninformed Search): Link
Lecture 17 Part 2 (Breadth First Search): Link
Lecture 17 Part 3 (Depth First Search): Link
Lecture 18 Part 1 (Informed Search): Link
Lecture 18 Part 2 (Uniform Cost and Greedy): Link
Lecture 18 Part 3 (A Search): Link

Lecture 20 Part 1 (Hill Climbing): Link
Lecture 20 Part 2 (Simulated Annealing): Link
Lecture 20 Part 3 (Genetic Algorithm): Link

Lecture 21 Part 1 (Adversarial Search): Link
Lecture 21 Part 2 (Alpha Beta Pruning): Link
Lecture 21 Part 3 (Heuristic): Link
Lecture 22 Part 1 (Rationalizability): Link
Lecture 22 Part 2 (Nash Equilibrium): Link
Lecture 22 Part 3 (Mixed Strategies): Link

Lecture 23 (Repeated Games): Interactive Lecture (see Canvas Zoom recording)
Lecture 24 (Mechanism Design): Interactive Lecture (see Canvas Zoom recording)

# Formula Sheets (Old)

📗 Supervised Learning:

Training item: \(\left(x_{i}, y_{i}\right)\), where \(i \in \left\{1, 2, ..., n\right\}\) is the instance index, \(x_{ij}\) is the feature \(j\) of instance \(i\), \(j \in \left\{1, 2, ..., m\right\}\) is the feature index, \(x_{i} = \left(x_{i1}, x_{i2}, ...., x_{im}\right)\) is the feature vector of instance \(i\), and \(y_{i}\) is the true label of instance \(i\).
Test item: \(\left(x', y'\right)\), where \(j \in \left\{1, 2, ..., m\right\}\) is the feature index.

📗 Linear Threshold Unit, Linear Perceptron:

LTU Classifier: \(\hat{y}_{i} = 1_{\left\{w^\top x_{i} + b \geq 0\right\}}\), where \(w = \left(w_{1}, w_{2}, ..., w_{m}\right)\) is the weights, \(b\) is the bias, \(x_{i} = \left(x_{i1}, x_{i2}, ..., x_{im}\right)\) is the feature vector of instance \(i\), and \(\hat{y}_{i}\) is the predicted label of instance \(i\).
Perceptron algorithm update step: \(w = w - \alpha \left(a_{i} - y_{i}\right) x_{i}\), \(b = b - \alpha \left(a_{i} - y_{i}\right)\), \(a_{i} = 1_{\left\{w^\top x_{i} + b \geq 0\right\}}\), where \(a_{i}\) is the activation value of instance \(i\).

📗 Loss Function:

Zero-one loss minimization: \(\hat{f} = \mathop{\mathrm{argmin}}_{f \in \mathcal{H}} \displaystyle\sum_{i=1}^{n} 1_{\left\{f\left(x_{i}\right) \neq y_{i}\right\}}\), where \(\hat{f}\) is the optimal classifier, \(\mathcal{H}\) is the hypothesis space (set of functions to choose from).
Squared loss minimization of perceptrons: \(\left(\hat{w}, \hat{b}\right) = \mathop{\mathrm{argmin}}_{w, b} \dfrac{1}{2} \displaystyle\sum_{i=1}^{n} \left(a_{i} - y_{i}\right)^{2}\), \(a_{i} = g\left(w^\top x_{i} + b\right)\), where \(\hat{w}\) is the optimal weights, \(\hat{b}\) is the optimal bias, \(g\) is the activation function.

📗 Logistic Regression:

Logistic regression classifier: \(\hat{y}_{i} = 1_{\left\{a_{i} \geq 0.5\right\}}\), \(a_{i} = \dfrac{1}{1 + \exp\left(- \left(w^\top x_{i} + b\right)\right)}\).
Loss minimization problem: \(\left(\hat{w}, \hat{b}\right) = \mathop{\mathrm{argmin}}_{w, b} -\displaystyle\sum_{i=1}^{n} \left(y_{i} \log\left(a_{i}\right) + \left(1 - y_{i}\right) \log\left(1 - a_{i}\right)\right)\), \(a_{i} = \dfrac{1}{1 + \exp\left(- \left(w^\top x_{i} + b\right)\right)}\).
Batch gradient descrent step: \(w = w - \alpha \displaystyle\sum_{i=1}^{n} \left(a_{i} - y_{i}\right) x_{i}\), \(b = b - \alpha \displaystyle\sum_{i=1}^{n} \left(a_{i} - y_{i}\right)\), \(a_{i} = \dfrac{1}{1 + \exp\left(- \left(w^\top x_{i} + b\right)\right)}\), where \(\alpha\) is the learning rate.

📗 Neural Network:

Neural network classifier for two layer network with logistic activation: \(\hat{y}_{i} = 1_{\left\{a^{\left(2\right)}_{i} \geq 0.5\right\}}\)
\(a^{\left(1\right)}_{ij} = \dfrac{1}{1 + \exp\left(- \left(\left(\displaystyle\sum_{j'=1}^{m} x_{ij'} w^{\left(1\right)}_{j'j}\right) + b^{\left(1\right)}_{j}\right)\right)}\), where \(m\) is the number of features (or input units), \(w^{\left(1\right)}_{j' j}\) is the layer \(1\) weight from input unit \(j'\) to hidden layer unit \(j\), \(b^{\left(1\right)}_{j}\) is the bias for hidden layer unit \(j\), \(a_{ij}^{\left(1\right)}\) is the layer \(1\) activation of instance \(i\) hidden unit \(j\).
\(a^{\left(2\right)}_{i} = \dfrac{1}{1 + \exp\left(- \left(\left(\displaystyle\sum_{j=1}^{h} a^{\left(1\right)}_{ij} w^{\left(2\right)}_{j}\right) + b^{\left(2\right)}\right)\right)}\), where \(h\) is the number of hidden units, \(w^{\left(2\right)}_{j}\) is the layer \(2\) weight from hidden layer unit \(j\), \(b^{\left(2\right)}\) is the bias for the output unit, \(a^{\left(2\right)}_{i}\) is the layer \(2\) activation of instance \(i\).
Stochastic gradient descent step for two layer network with squared loss and logistic activation:
\(w^{\left(1\right)}_{j' j} = w^{\left(1\right)}_{j' j} - \alpha \left(a^{\left(2\right)}_{i} - y_{i}\right) a^{\left(2\right)}_{i} \left(1 - a^{\left(2\right)}_{i}\right) w_{j}^{\left(2\right)} a_{ij}^{\left(1\right)} \left(1 - a_{ij}^{\left(1\right)}\right) x_{ij'}\).
\(b^{\left(1\right)}_{j} \leftarrow b^{\left(1\right)}_{j} - \alpha \left(a^{\left(2\right)}_{i} - y_{i}\right) a^{\left(2\right)}_{i} \left(1 - a^{\left(2\right)}_{i}\right) w_{j}^{\left(2\right)} a_{ij}^{\left(1\right)} \left(1 - a_{ij}^{\left(1\right)}\right)\).
\(w^{\left(2\right)}_{j} \leftarrow w^{\left(2\right)}_{j} - \alpha \left(a^{\left(2\right)}_{i} - y_{i}\right) a^{\left(2\right)}_{i} \left(1 - a^{\left(2\right)}_{i}\right) a_{ij}^{\left(1\right)}\).
\(b^{\left(2\right)} \leftarrow b^{\left(2\right)} - \alpha \left(a^{\left(2\right)}_{i} - y_{i}\right) a^{\left(2\right)}_{i} \left(1 - a^{\left(2\right)}_{i}\right)\).

📗 Multiple Classes:

Softmax activation for one layer networks: \(a_{ij} = \dfrac{\exp\left(- \left(w_{k^\top} x_{i} + b_{k}\right)\right)}{\displaystyle\sum_{k' = 1}^{K} \exp\left(- \left(w_{k'}^\top x_{i} + b_{k'}\right)\right)}\), where \(K\) is the number of classes (number of possible labels), \(a_{i k}\) is the activation of the output unit \(k\) for instance \(i\), \(y_{i k}\) is component \(k\) of the one-hot encoding of the label for instance \(i\).

📗 Regularization:

L1 regularization (squared loss): \(\displaystyle\sum_{i=1}^{n} \left(a_{i} - y_{i}\right)^{2} + \lambda \left(\displaystyle\sum_{j=1}^{m} \left| w_{j} \right| + \left| b \right|\right)\), where \(\lambda\) is the regularization parameter.
L2 regularization (sqaured loss): \(\displaystyle\sum_{i=1}^{n} \left(a_{i} - y_{i}\right)^{2} + \lambda \left(\displaystyle\sum_{j=1}^{m} \left(w_{j}\right)^{2} + b^{2}\right)\).

📗 Support Vector Machine

SVM classifier: \(\hat{y}_{i} = 1_{\left\{w^\top x_{i} + b \geq 0\right\}}\).
Hard margin, original max-margin formulation: \(\displaystyle\max_{w} \dfrac{2}{\sqrt{w^\top w}}\) such that \(w^\top x_{i} + b \leq -1\) if \(y_{i} = 0\) and \(w^\top x_{i} + b \geq 1\) if \(y_{i} = 1\).
Hard margin, simplified formulation: \(\displaystyle\min_{w} \dfrac{1}{2} w^\top w\) such that \(\left(2 y_{i} - 1\right)\left(w^\top x_{i} + b\right) \geq 1\).
Soft margin, original max-margin formulation: \(\displaystyle\min_{w} \dfrac{1}{2} w^\top w + \dfrac{1}{\lambda} \dfrac{1}{n} \displaystyle\sum_{i=1}^{n} \xi_{i}\) such that \(\left(2 y_{i} - 1\right)\left(w^\top x_{i} + b\right) \geq 1 - \xi, \xi \geq 0\), where \(\xi_{i}\) is the slack variable for instance \(i\), \(\lambda\) is the regularization parameter.
Soft margin, simplified formulation: \(\displaystyle\min_{w} \dfrac{\lambda}{2} w^\top w + \dfrac{1}{n} \displaystyle\sum_{i=1}^{n} \displaystyle\max\left\{0, 1 - \left(2 y_{i} - 1\right) \left(w^\top x_{i} + b\right)\right\}\)
Subgradient descent formula: \(w = \left(1 - \lambda\right) w - \alpha \left(2 y_{i} - 1\right) 1_{\left\{\left(2 y_{i} - 1\right) \left(w^\top x_{i} + b\right) \geq 1\right\}} x_{i}\).

📗 Kernel Trick

Kernel SVM classifier: \(\hat{y}_{i} = 1_{\left\{w^\top \varphi\left(x_{i}\right) + b \geq 0\right\}}\), where \(\varphi\) is the feature map.
Kernal Gram matrix: \(K_{i i'} = \varphi\left(x_{i}\right)^\top \varphi\left(x_{i'}\right)\).
Quadratic Kernel: \(K_{i i'} = \left(x_{i^\top} x_{i'} + 1\right)^{2}\) has feature representation \(\varphi\left(x_{i}\right) = \left(x_{i1}^{2}, x_{i2}^{2}, \sqrt{2} x_{i1} x_{i2}, \sqrt{2} x_{i1}, \sqrt{2} x_{i2}, 1\right)\).
Gaussian RBF Kernel: \(K_{i i'} = \exp\left(- \dfrac{1}{2 \sigma^{2}} \left(x_{i} - x_{i'}\right)^\top \left(x_{i} - x_{i'}\right)\right)\) has infinite-dimensional feature representation, where \(\sigma^{2}\) is the variance parameter.

📗 Information Theory:

Entropy: \(H\left(Y\right) = -\displaystyle\sum_{y=1}^{K} p_{y} \log_{2} \left(p_{y}\right)\), where \(K\) is the number of classes (number of possible labels), \(p_{y}\) is the fraction of data points with label \(y\).
Conditional entropy: \(H\left(Y | X\right) = -\displaystyle\sum_{x=1}^{K_{X}} p_{x} \displaystyle\sum_{y=1}^{K} p_{y|x} \log_{2} \left(p_{y|x}\right)\), where \(K_{X}\) is the number of possible values of feature, \(p_{x}\) is the fraction of data points with feature \(x\), \(p_{y|x}\) is the fraction of data points with label \(y\) among the ones with feature \(x\).
Information gain, for feature \(j\): \(I\left(Y | X_{j}\right) = H\left(Y\right) - H\left(Y | X_{j}\right)\).

📗 Decision Tree:

Decision stump classifier: \(\hat{y}_{i} = 1_{\left\{x_{ij} \geq t_{j}\right\}}\), where \(t_{j}\) is the threshold for feature \(j\).
Feature selection: \(j^\star = \mathop{\mathrm{argmax}}_{j} I\left(Y | X_{j}\right)\).

📗 K-Nearest Neighbor:

Distance: (Euclidean) \(\rho\left(x, x'\right) = \left\|x - x'\right\|_{2} = \sqrt{\displaystyle\sum_{j=1}^{m} \left(x_{j} - x'_{j}\right)^{2}}\), (Manhattan) \(\rho\left(x, x'\right) = \left\|x - x'\right\|_{1} = \displaystyle\sum_{j=1}^{m} \left| x_{j} - x'_{j} \right|\), where \(x, x'\) are two instances.
K-Nearest Neighbor classifier: \(\hat{y}_{i}\) = mode \(\left\{y_{\left(1\right)}, y_{\left(2\right)}, ..., y_{\left(k\right)}\right\}\), where mode is the majority label and \(y_{\left(t\right)}\) is the label of the \(t\)-th closest instance to instance \(i\) from the training set.

📗 Natural Language Processing:

Unigram model: \(\mathbb{P}\left\{z_{1}, z_{2}, ..., z_{d}\right\} = \displaystyle\prod_{t=1}^{d} \mathbb{P}\left\{z_{t}\right\}\) where \(z_{t}\) is the \(t\)-th token in a training item, and \(d\) is the total number of tokens in the item.
Maximum likelihood estimator (unigram): \(\hat{\mathbb{P}}\left\{z_{t}\right\} = \dfrac{c_{z_{t}}}{\displaystyle\sum_{z=1}^{m} c_{z}}\), where \(c_{z}\) is the number of time the token \(z\) appears in the training set and \(m\) is the vocabulary size (number of unique tokens).
Maximum likelihood estimator (unigram, with Laplace smoothing): \(\hat{\mathbb{P}}\left\{z_{t}\right\} = \dfrac{c_{z_{t}} + 1}{\left(\displaystyle\sum_{z=1}^{m} c_{z}\right) + m}\).
Bigram model: \(\mathbb{P}\left\{z_{1}, z_{2}, ..., z_{d}\right\} = \mathbb{P}\left\{z_{1}\right\} \displaystyle\prod_{t=2}^{d} \mathbb{P}\left\{z_{t} | z_{t-1}\right\}\).
Maximum likelihood estimator (bigram): \(\hat{\mathbb{P}}\left\{z_{t} | z_{t-1}\right\} = \dfrac{c_{z_{t-1}, z_{t}}}{c_{z_{t-1}}}\).
Maximum likelihood estimator (bigram, with Laplace smoothing): \(\hat{\mathbb{P}}\left\{z_{t} | z_{t-1}\right\} = \dfrac{c_{z_{t-1}, z_{t}} + 1}{c_{z_{t-1}} + m}\).

📗 Probability Review:

Conditional probability: \(\mathbb{P}\left\{Y = y | X = x\right\} = \dfrac{\mathbb{P}\left\{Y = y, X = x\right\}}{\mathbb{P}\left\{X = x\right\}}\).
Joint probability: \(\mathbb{P}\left\{X = x\right\} = \displaystyle\sum_{y \in Y} \mathbb{P}\left\{X = x, Y = y\right\}\).
Bayes rule: \(\mathbb{P}\left\{Y = y | X = x\right\} = \dfrac{\mathbb{P}\left\{X = x | Y = y\right\} \mathbb{P}\left\{Y = y\right\}}{\displaystyle\sum_{y' \in Y} \mathbb{P}\left\{X = x | Y = y'\right\} \mathbb{P}\left\{Y = y'\right\}}\).
Law of total probability: \(\mathbb{P}\left\{X = x\right\} = \displaystyle\sum_{y' \in Y} \mathbb{P}\left\{X = x | Y = y'\right\} \mathbb{P}\left\{Y = y'\right\}\).
Independence: \(X, Y\) are independent if \(\mathbb{P}\left\{X = x, Y = y\right\} = \mathbb{P}\left\{X = x\right\} \mathbb{P}\left\{Y = y\right\}\) for every \(x, y\).
Conditional independence: \(X, Y\) are conditionally independent conditioned on \(Z\) if \(\mathbb{P}\left\{X = x, Y = y | Z = z\right\} = \mathbb{P}\left\{X = x | Z = z\right\} \mathbb{P}\left\{Y = y | Z = z\right\}\) for every \(x, y, z\).

📗 Bayesian Network

Conditional Probability Table estimation: \(\hat{\mathbb{P}}\left\{x_{j} | p\left(X_{j}\right)\right\} = \dfrac{c_{x_{j}, p\left(X_{j}\right)}}{c_{p\left(X_{j}\right)}}\), where \(p\left(X_{j}\right)\) is the list of parents of \(X_{j}\) in the network.
Conditional Probability Table estimation (with Laplace smoothing): \(\hat{\mathbb{P}}\left\{x_{j} | p\left(X_{j}\right)\right\} = \dfrac{c_{x_{j}, p\left(X_{j}\right)} + 1}{c_{p\left(X_{j}\right)} + \left| X_{j} \right|}\), where \(\left| X_{j} \right|\) is the number of possible values of \(X_{j}\).
Bayesian network inference: \(\mathbb{P}\left\{x_{1}, x_{2}, ..., x_{m}\right\} = \displaystyle\prod_{j=1}^{m} \mathbb{P}\left\{x_{j} | p\left(X_{j}\right)\right\}\).
Naive Bayes estimation: .
Naive Bayes classifier: \(\hat{y}_{i} = \mathop{\mathrm{argmax}}_{y} \mathbb{P}\left\{Y = y | X = X_{i}\right\}\).

📗 Convolution

Convolution (1D): \(a = x \star w\), \(a_{j} = \displaystyle\sum_{t=-k}^{k} w_{t} x_{j-t}\), where \(w\) is the filter, and \(k\) is half of the width of the filter.
Convolution (2D): \(A = X \star W\), \(A_{j j'} = \displaystyle\sum_{s=-k}^{k} \displaystyle\sum_{t=-k}^{k} W_{s,t} X_{j-s,j'-t}\), where \(W\) is the filter, and \(k\) is half of the width of the filter.
Sobel filter: \(W_{x} = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix}\) and \(W_{y} = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix}\).
Image gradient: \(\nabla_{x} X = W_{x} \star X\), \(\nabla_{y} X = W_{y} \star X\), with gradient magnitude \(G = \sqrt{\nabla_{x}^{2} + \nabla_{y}^{2}}\) and gradient direction \(\Theta = arctan\left(\dfrac{\nabla_{y}}{\nabla_{x}}\right)\).

📗 Convolutional Neural Network

Fully connected layer: \(a = g\left(w^\top x + b\right)\), where \(a\) is the activation unit, \(g\) is the activation function.
Convolution layer: \(A = g\left(W \star X + b\right)\), where \(A\) is the activation map.
Pooling layer: (max-pooling) \(a = \displaystyle\max\left\{x_{1}, ..., x_{m}\right\}\), (average-pooling) \(a = \dfrac{1}{m} \displaystyle\sum_{j=1}^{m} x_{j}\).

📗 Clustering

📗 Single Linkage: \(d\left(C_{k}, C_{k'}\right) = \displaystyle\min\left\{d\left(x_{i}, x_{i'}\right) : x_{i} \in C_{k}, x_{i'} \in C_{k'}\right\}\), where \(C_{k}, C_{k'}\) are two clusters (set of points), \(d\) is the distance function.

📗 Complete Linkage: \(d\left(C_{k}, C_{k'}\right) = \displaystyle\max\left\{d\left(x_{i}, x_{i'}\right) : x_{i} \in C_{k}, x_{i'} \in C_{k'}\right\}\).

📗 Average Linkage: \(d\left(C_{k}, C_{k'}\right) = \dfrac{1}{\left| C_{k} \right| \left| C_{k'} \right|} \displaystyle\sum_{x_{i} \in C_{k}, x_{i'} \in C_{k'}} d\left(x_{i}, x_{i'}\right)\), where \(\left| C_{k} \right|, \left| C_{k'} \right|\) are the number of the points in the clusters.

📗 Distortion (Euclidean distance): \(D_{K} = \displaystyle\sum_{i=1}^{n} d\left(x_{i}, c_{k^\star\left(x_{i}\right)}\left(x_{i}\right)\right)^{2}\), \(k^\star\left(x\right) = \mathop{\mathrm{argmin}}_{k = 1, 2, ..., K} d\left(x, c_{k}\right)\), where \(k^\star\left(x\right)\) is the cluster \(x\) belongs to.

📗 K-Means Gradient Descent Step: \(c_{k} = \dfrac{1}{\left| C_{k} \right|} \displaystyle\sum_{x \in C_{k}} x\).

📗 Projection: \(\text{proj} _{u_{k}} x_{i} = \left(\dfrac{u_{k^\top} x_{i}}{u_{k^\top} u_{k}}\right) u_{k}\) with length \(\left\|\text{proj} _{u_{k}} x_{i}\right\|_{2} = \left(\dfrac{u_{k^\top} x_{i}}{u_{k^\top} u_{k}}\right)\), where \(u_{k}\) is a principal direction.

📗 Projected Variance (Scalar form, MLE): \(V = \dfrac{1}{n} \displaystyle\sum_{i=1}^{n} \left(u_{k^\top} x_{i} - \mu_{k}\right)^{2}\) such that \(u_{k^\top} u_{k} = 1\), where \(\mu_{k} = \dfrac{1}{n} \displaystyle\sum_{i=1}^{n} u_{k^\top} x_{i}\).

📗 Projected Variance (Matrix form, MLE): \(V = u_{k^\top} \hat{\Sigma} u_{k}\) such that \(u_{k^\top} u_{k} = 1\), where \(\hat{\Sigma}\) is the convariance matrix of the data: \(\hat{\Sigma} = \dfrac{1}{n} \displaystyle\sum_{i=1}^{n} \left(x_{i} - \hat{\mu}\right)\left(x_{i} - \hat{\mu}\right)^\top\), \(\hat{\mu} = \dfrac{1}{n} \displaystyle\sum_{i=1}^{n} x_{i}\).

📗 New Feature: \(\left(u_{1^\top} x_{i}, u_{2^\top} x_{i}, ..., u_{K^\top} x_{i}\right)^\top\).

📗 Reconstruction: \(x_{i} = \displaystyle\sum_{i=1}^{m} \left(u_{k^\top} x_{i}\right) u_{k} \approx \displaystyle\sum_{i=1}^{K} \left(u_{k^\top} x_{i}\right) u_{k}\) with \(u_{k^\top} u_{k} = 1\).

📗 Uninformed Search

📗 Breadth First Search (Time Complexity): \(T = 1 + b + b^{2} + ... + b^{d}\), where \(b\) is the branching factor (number of children per node) and \(d\) is the depth of the goal state.

📗 Breadth First Search (Space Complexity): \(S = b^{d}\).

📗 Depth First Search (Time Complexity): \(T = b^{D-d+1} + ... + b^{D-1} + b^{D}\), where \(D\) is the depth of the leafs.

📗 Depth First Search (Space Complexity): \(S = \left(b - 1\right) D + 1\).

📗 Iterative Deepening Search (Time Complexity): \(T = d + d b + \left(d - 1\right) b^{2} + ... + 3 b^{d-2} + 2 b^{d-1} + b^{d}\).

📗 Iterative Deepening Search (Space Complexity): \(S = \left(b - 1\right) d + 1\).

📗 Informed Search

📗 Admissible Heuristic: \(h : 0 \leq h\left(s\right) \leq h^\star\left(s\right)\), where \(h^\star\left(s\right)\) is the actual cost from state \(s\) to the goal state, and \(g\left(s\right)\) is the actual cost of the initial state to \(s\).

📗 Local Search

📗 Hill Climbing (Valley Finding), probability of moving from \(s\) to a state \(s'\) \(p = 0\) if \(f\left(s'\right) \geq f\left(s\right)\) and \(p = 1\) if \(f\left(s'\right) < f\left(s\right)\), where \(f\left(s\right)\) is the cost of the state \(s\).

📗 Simulated Annealing, probability of moving from \(s\) to a worse state \(s'\) = \(p = e^{- \dfrac{\left| f\left(s'\right) - f\left(s\right) \right|}{T\left(t\right)}}\) if \(f\left(s'\right) \geq f\left(s\right)\) and \(p = 1\) if \(f\left(s'\right) < f\left(s\right)\), where \(T\left(t\right)\) is the temperature as time \(t\).

📗 Genetic Algorithm, probability of get selected as a parent in cross-over: \(p_{i} = \dfrac{F\left(s_{i}\right)}{\displaystyle\sum_{j=1}^{n} F\left(s_{j}\right)}\), \(i = 1, 2, ..., N\), where \(F\left(s\right)\) is the fitness of state \(s\).

📗 Adversarial Search

📗 Sequential Game (Alpha Beta Pruning): prune the tree if \(\alpha \geq \beta\), where \(\alpha\) is the current value of the MAX player and \(\beta\) is the current value of the MIN player.

📗 Simultaneous Move Game (rationalizable): remove an action \(s_{i}\) of player \(i\) if it is strictly dominated \(F\left(s_{i}, s_{-i}\right) < F\left(s'_{i}, s_{-i}\right)\), for some \(s'_{i}\) of player \(i\) and for all \(s_{-i}\) of the other players.

📗 Simultaneous Move Game (Nash equilibrium): \(\left(s_{i}, s_{-i}\right)\) is a (pure strategy) Nash equilibrium if \(F\left(s_{i}, s_{-i}\right) \geq F\left(s'_{i}, s_{-i}\right)\) and \(F\left(s_{i}, s_{-i}\right) \geq F\left(s_{i}, s'_{-i}\right)\), for all \(s'_{i}, s'_{-i}\).

Last Updated: June 26, 2024 at 8:57 PM