# Details
📗 Slides:
(1) The slides subtitled "Definition" and "Quiz" contain the mathematics and statistics that you are required to know for the exams.
(2) The slides subtitled "Motivation" and "Discussion" contain concepts you should be familiar with, but the specific mathematics will not be tested on the exam.
(3) The slides subtitled "Description" and "Algorithm" are mostly useful for programming homework, not exams.
(4) The slides subtitled "Admin" are not relevant to the course materials.
📗 Questions:
(1) Around a third of the questions will be exactly the same as the homework questions (with a different randomization of the parameters); you can practice by solving the homework again with someone else's ID (auto-grading will not work if you do not enter an ID).
(2) Around a third of the questions will be similar to past exam or quiz questions (ones that are covered during the lectures); going over the quiz questions and solving the past exam questions will help.
(3) Around a third of the questions will be new, mostly from topics not covered in the homework; reading the slides will be helpful.
📗 Question types:
All questions will ask you to enter a number, a vector (or list of options), or a matrix. There will be no drawing or selecting objects on a canvas, and no text entry or essay questions. You will not get hints like the ones in the homework. You can type your answers in a text file directly and submit it on Canvas. If you use the website, you can use the "calculate" button to make sure the expression you entered can be evaluated correctly when graded. You will receive 0 for incorrect answers and for expressions that cannot be evaluated; there is no partial credit and no additional penalty for incorrect answers.
# Keywords and Notations
📗 Supervised Learning:
Training item: $(x_i, y_i)$, where $i$ is the instance index, $x_{ij}$ is feature $j$ of instance $i$, $j$ is the feature index, $x_i$ is the feature vector of instance $i$, and $y_i$ is the true label of instance $i$.
Test item: $x'$, where $j$ is the feature index of $x'_j$.
📗 Linear Threshold Unit, Linear Perceptron:
LTU Classifier: $\hat{y}_i = \mathbb{1}_{\{w^\top x_i + b \geq 0\}}$, where $w$ is the weights, $b$ is the bias, $x_i$ is the feature vector of instance $i$, and $\hat{y}_i$ is the predicted label of instance $i$.
Perceptron algorithm update step: $a_i = \mathbb{1}_{\{w^\top x_i + b \geq 0\}}$, $w = w - \alpha (a_i - y_i) x_i$, $b = b - \alpha (a_i - y_i)$, where $a_i$ is the activation value of instance $i$ and $\alpha$ is the learning rate.
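A minimal sketch of the perceptron update above, assuming NumPy and 0/1 labels; the names `perceptron_epoch` and `lr` are illustrative, not course notation:

```python
import numpy as np

def perceptron_epoch(X, y, w, b, lr=0.1):
    """One pass of the perceptron update with 0/1 labels and an LTU activation."""
    for x_i, y_i in zip(X, y):
        a_i = 1.0 if w @ x_i + b >= 0 else 0.0    # LTU activation
        w = w - lr * (a_i - y_i) * x_i            # weight update
        b = b - lr * (a_i - y_i)                  # bias update
    return w, b

# Toy AND data set (linearly separable)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1], dtype=float)
w, b = np.zeros(2), 0.0
for _ in range(20):
    w, b = perceptron_epoch(X, y, w, b)
print(w, b, [(1 if w @ x + b >= 0 else 0) for x in X])
```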
📗 Loss Function:
Zero-one loss minimization: $\hat{f} = \mathop{\mathrm{argmin}}_{f \in \mathcal{H}} \sum_{i=1}^{n} \mathbb{1}_{\{f(x_i) \neq y_i\}}$, where $\hat{f}$ is the optimal classifier and $\mathcal{H}$ is the hypothesis space (set of functions to choose from).
Squared loss minimization of perceptrons: $(\hat{w}, \hat{b}) = \mathop{\mathrm{argmin}}_{w, b} \sum_{i=1}^{n} \frac{1}{2} (a_i - y_i)^2$, $a_i = g(w^\top x_i + b)$, where $\hat{w}$ is the optimal weights, $\hat{b}$ is the optimal bias, and $g$ is the activation function.
📗 Logistic Regression:
Logistic regression classifier: $\hat{y}_i = \mathbb{1}_{\{a_i \geq 0.5\}}$, $a_i = \dfrac{1}{1 + e^{-(w^\top x_i + b)}}$.
Loss minimization problem: $(\hat{w}, \hat{b}) = \mathop{\mathrm{argmin}}_{w, b} \sum_{i=1}^{n} C(a_i, y_i)$, $C(a_i, y_i) = -\left(y_i \log a_i + (1 - y_i) \log(1 - a_i)\right)$.
Batch gradient descent step: $a_i = \dfrac{1}{1 + e^{-(w^\top x_i + b)}}$, $w = w - \alpha \sum_{i=1}^{n} (a_i - y_i) x_i$, $b = b - \alpha \sum_{i=1}^{n} (a_i - y_i)$, where $\alpha$ is the learning rate.
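A minimal sketch of batch gradient descent for logistic regression following the formulas above, assuming NumPy; the learning rate and epoch count are arbitrary illustrative choices:

```python
import numpy as np

def batch_gradient_descent(X, y, lr=0.1, epochs=1000):
    """Batch gradient descent for logistic regression with cross-entropy loss."""
    n, m = X.shape
    w, b = np.zeros(m), 0.0
    for _ in range(epochs):
        a = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # logistic activations
        w = w - lr * X.T @ (a - y)               # gradient with respect to the weights
        b = b - lr * np.sum(a - y)               # gradient with respect to the bias
    return w, b

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w, b = batch_gradient_descent(X, y)
print(w, b)  # decision boundary near x = 1.5
```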
📗 Neural Network:
Neural network classifier for a two-layer network with logistic activation:
$a^{(1)}_{ij} = \dfrac{1}{1 + e^{-\left(\sum_{k=1}^{m} w^{(1)}_{kj} x_{ik} + b^{(1)}_{j}\right)}}$, where $m$ is the number of features (or input units), $w^{(1)}_{kj}$ is the layer-1 weight from input unit $k$ to hidden layer unit $j$, $b^{(1)}_{j}$ is the bias for hidden layer unit $j$, $a^{(1)}_{ij}$ is the layer-1 activation of instance $i$ hidden unit $j$.
$a^{(2)}_{i} = \dfrac{1}{1 + e^{-\left(\sum_{j=1}^{h} w^{(2)}_{j} a^{(1)}_{ij} + b^{(2)}\right)}}$, where $h$ is the number of hidden units, $w^{(2)}_{j}$ is the layer-2 weight from hidden layer unit $j$, $b^{(2)}$ is the bias for the output unit, $a^{(2)}_{i}$ is the layer-2 activation of instance $i$.
Stochastic gradient descent step for a two-layer network with squared loss and logistic activation:
$w^{(2)}_{j} = w^{(2)}_{j} - \alpha \left(a^{(2)}_{i} - y_i\right) a^{(2)}_{i} \left(1 - a^{(2)}_{i}\right) a^{(1)}_{ij}$.
$b^{(2)} = b^{(2)} - \alpha \left(a^{(2)}_{i} - y_i\right) a^{(2)}_{i} \left(1 - a^{(2)}_{i}\right)$.
$w^{(1)}_{kj} = w^{(1)}_{kj} - \alpha \left(a^{(2)}_{i} - y_i\right) a^{(2)}_{i} \left(1 - a^{(2)}_{i}\right) w^{(2)}_{j} a^{(1)}_{ij} \left(1 - a^{(1)}_{ij}\right) x_{ik}$.
$b^{(1)}_{j} = b^{(1)}_{j} - \alpha \left(a^{(2)}_{i} - y_i\right) a^{(2)}_{i} \left(1 - a^{(2)}_{i}\right) w^{(2)}_{j} a^{(1)}_{ij} \left(1 - a^{(1)}_{ij}\right)$.
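A minimal sketch of one stochastic gradient descent step for the two-layer network above (logistic activations, squared loss), assuming NumPy; the layer sizes and the toy XOR data are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_step(x_i, y_i, W1, b1, w2, b2, lr=0.5):
    """One SGD step on a single training item, following the updates above."""
    a1 = sigmoid(W1.T @ x_i + b1)          # hidden-layer activations, shape (h,)
    a2 = sigmoid(w2 @ a1 + b2)             # output activation, scalar
    d2 = (a2 - y_i) * a2 * (1.0 - a2)      # output-layer error term
    d1 = d2 * w2 * a1 * (1.0 - a1)         # hidden-layer error terms, shape (h,)
    w2 = w2 - lr * d2 * a1
    b2 = b2 - lr * d2
    W1 = W1 - lr * np.outer(x_i, d1)       # W1[k, j] -= lr * d1[j] * x_i[k]
    b1 = b1 - lr * d1
    return W1, b1, w2, b2

rng = np.random.default_rng(0)
m, h = 2, 3                                 # number of features, number of hidden units
W1, b1 = rng.normal(size=(m, h)), np.zeros(h)
w2, b2 = rng.normal(size=h), 0.0
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)     # toy XOR data (not linearly separable)
for _ in range(5000):
    i = rng.integers(len(X))
    W1, b1, w2, b2 = sgd_step(X[i], y[i], W1, b1, w2, b2)
print([round(float(sigmoid(w2 @ sigmoid(W1.T @ x + b1) + b2)), 2) for x in X])
```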
📗 Multiple Classes:
Softmax activation for one-layer networks: $a_{ik} = \dfrac{e^{w_{k}^\top x_i + b_{k}}}{\sum_{k'=1}^{K} e^{w_{k'}^\top x_i + b_{k'}}}$, where $K$ is the number of classes (number of possible labels), $a_{ik}$ is the activation of the output unit $k$ for instance $i$, $y_{ik}$ is component $k$ of the one-hot encoding of the label for instance $i$.
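A minimal sketch of computing softmax activations for a one-layer network, assuming NumPy; the weight values are illustrative:

```python
import numpy as np

def softmax_activations(W, b, x_i):
    """Softmax activations of a one-layer network: one output unit per class."""
    z = W @ x_i + b                      # one score per class, shape (K,)
    z = z - np.max(z)                    # shift for numerical stability
    e = np.exp(z)
    return e / np.sum(e)                 # activations sum to 1

W = np.array([[1.0, -1.0], [0.0, 0.5], [-1.0, 1.0]])   # K = 3 classes, m = 2 features
b = np.zeros(3)
x = np.array([2.0, 1.0])
a = softmax_activations(W, b, x)
print(a, a.sum())                        # predicted label: np.argmax(a)
```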
📗 Regularization:
L1 regularization (squared loss): $(\hat{w}, \hat{b}) = \mathop{\mathrm{argmin}}_{w, b} \sum_{i=1}^{n} \frac{1}{2} (a_i - y_i)^2 + \lambda \left\|w\right\|_1$, where $\lambda$ is the regularization parameter.
L2 regularization (squared loss): $(\hat{w}, \hat{b}) = \mathop{\mathrm{argmin}}_{w, b} \sum_{i=1}^{n} \frac{1}{2} (a_i - y_i)^2 + \lambda \left\|w\right\|_2^2$.
📗 Support Vector Machine:
SVM classifier: $\hat{y}_i = \mathbb{1}_{\{w^\top x_i + b \geq 0\}}$.
Hard margin, original max-margin formulation: $\max_{w, b} \dfrac{2}{\left\|w\right\|_2}$ such that $w^\top x_i + b \geq 1$ if $y_i = 1$ and $w^\top x_i + b \leq -1$ if $y_i = 0$.
Hard margin, simplified formulation: $\min_{w, b} \dfrac{1}{2} w^\top w$ such that $\left(2 y_i - 1\right)\left(w^\top x_i + b\right) \geq 1$.
Soft margin, original max-margin formulation: $\min_{w, b} \dfrac{1}{2} w^\top w + \dfrac{1}{\lambda} \sum_{i=1}^{n} \xi_i$ such that $\left(2 y_i - 1\right)\left(w^\top x_i + b\right) \geq 1 - \xi_i$ and $\xi_i \geq 0$, where $\xi_i$ is the slack variable for instance $i$, $\lambda$ is the regularization parameter.
Soft margin, simplified formulation: $\min_{w, b} \dfrac{1}{2} w^\top w + \dfrac{1}{\lambda} \sum_{i=1}^{n} \max\left\{0, 1 - \left(2 y_i - 1\right)\left(w^\top x_i + b\right)\right\}$.
Subgradient descent formula: $w = w - \alpha \left(w - \dfrac{1}{\lambda} \sum_{i=1}^{n} \mathbb{1}_{\left\{\left(2 y_i - 1\right)\left(w^\top x_i + b\right) < 1\right\}} \left(2 y_i - 1\right) x_i\right)$.
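A minimal sketch of a subgradient step on the simplified soft-margin objective above, assuming NumPy and 0/1 labels mapped to ±1 via $2 y_i - 1$; `lam` and `lr` are illustrative parameter names, and the scaling of the regularization constant follows the formulation written here:

```python
import numpy as np

def svm_subgradient_step(X, y, w, b, lam=1.0, lr=0.01):
    """One subgradient step on (1/2) w'w + (1/lam) * sum of hinge losses."""
    s = 2 * y - 1                                    # map 0/1 labels to -1/+1
    margins = s * (X @ w + b)
    active = margins < 1                             # instances with a nonzero hinge subgradient
    grad_w = w - (1.0 / lam) * (s[active].reshape(-1, 1) * X[active]).sum(axis=0)
    grad_b = -(1.0 / lam) * s[active].sum()
    return w - lr * grad_w, b - lr * grad_b

X = np.array([[1.0, 2.0], [2.0, 3.0], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([1, 1, 0, 0])
w, b = np.zeros(2), 0.0
for _ in range(500):
    w, b = svm_subgradient_step(X, y, w, b)
print(w, b, (X @ w + b >= 0).astype(int))
```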
📗 Kernel Trick:
Kernel SVM classifier: $\hat{y}_i = \mathbb{1}_{\{w^\top \varphi(x_i) + b \geq 0\}}$, where $\varphi$ is the feature map.
Kernel Gram matrix: $K_{i i'} = \varphi(x_i)^\top \varphi(x_{i'})$.
Quadratic Kernel: $K_{i i'} = \left(x_i^\top x_{i'} + 1\right)^2$ has feature representation (for two features) $\varphi(x_i) = \left(x_{i1}^2, x_{i2}^2, \sqrt{2} x_{i1} x_{i2}, \sqrt{2} x_{i1}, \sqrt{2} x_{i2}, 1\right)$.
Gaussian RBF Kernel: $K_{i i'} = e^{-\frac{\left\|x_i - x_{i'}\right\|_2^2}{2 \sigma^2}}$ has an infinite-dimensional feature representation, where $\sigma^2$ is the variance parameter.
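A minimal sketch of building a kernel Gram matrix and checking the two-feature quadratic-kernel feature map, assuming NumPy; function names are illustrative:

```python
import numpy as np

def gram_matrix(X, kernel):
    """Kernel Gram matrix: K[i, i'] = kernel(x_i, x_i')."""
    n = len(X)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kernel(X[i], X[j])
    return K

quadratic = lambda u, v: (u @ v + 1) ** 2                                    # quadratic kernel
rbf = lambda u, v, sigma2=1.0: np.exp(-((u - v) @ (u - v)) / (2 * sigma2))   # Gaussian RBF kernel

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
print(gram_matrix(X, quadratic))
print(gram_matrix(X, rbf))

# Check the quadratic-kernel feature map for two features
phi = lambda x: np.array([x[0]**2, x[1]**2, np.sqrt(2)*x[0]*x[1],
                          np.sqrt(2)*x[0], np.sqrt(2)*x[1], 1.0])
u, v = X[1], X[2]
print(phi(u) @ phi(v), quadratic(u, v))   # the two values agree
```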
📗 Information Theory:
Entropy: $H(Y) = -\sum_{k=1}^{K} p_k \log_2 p_k$, where $K$ is the number of classes (number of possible labels), $p_k$ is the fraction of data points with label $k$.
Conditional entropy: $H(Y | X) = -\sum_{j=1}^{m} p_j \sum_{k=1}^{K} p_{k|j} \log_2 p_{k|j}$, where $m$ is the number of possible values of the feature, $p_j$ is the fraction of data points with feature value $j$, $p_{k|j}$ is the fraction of data points with label $k$ among the ones with feature value $j$.
Information gain, for feature $X$: $I(Y; X) = H(Y) - H(Y | X)$.
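A minimal sketch of entropy, conditional entropy, and information gain on a toy data set, assuming NumPy for the logarithm; function names are illustrative:

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """H(Y) = -sum_k p_k log2 p_k over the label fractions p_k."""
    n = len(labels)
    return -sum((c / n) * np.log2(c / n) for c in Counter(labels).values())

def conditional_entropy(feature, labels):
    """H(Y|X): entropy of the labels within each feature value, weighted by p_j."""
    n = len(labels)
    h = 0.0
    for v, c in Counter(feature).items():
        subset = [y for x, y in zip(feature, labels) if x == v]
        h += (c / n) * entropy(subset)
    return h

def information_gain(feature, labels):
    return entropy(labels) - conditional_entropy(feature, labels)

# Toy data: binary feature x and binary label y
x = [0, 0, 1, 1, 1, 1]
y = [0, 0, 0, 1, 1, 1]
print(entropy(y), conditional_entropy(x, y), information_gain(x, y))
```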
📗 Decision Tree:
Decision stump classifier: $\hat{y}_i = \mathbb{1}_{\{x_{ij} \geq t_j\}}$, where $t_j$ is the threshold for feature $j$.
Feature selection: $j^\star = \mathop{\mathrm{argmax}}_{j} I(Y; X_j)$.
📗 K-Nearest Neighbor:
Distance: (Euclidean) $d(x_i, x_{i'}) = \sqrt{\sum_{j=1}^{m} \left(x_{ij} - x_{i'j}\right)^2}$, (Manhattan) $d(x_i, x_{i'}) = \sum_{j=1}^{m} \left|x_{ij} - x_{i'j}\right|$, where $x_i, x_{i'}$ are two instances.
K-Nearest Neighbor classifier: $\hat{y}_i$ = mode $\left\{y_{(1)}, y_{(2)}, ..., y_{(K)}\right\}$, where mode is the majority label and $y_{(k)}$ is the label of the $k$-th closest instance to instance $i$ from the training set.
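A minimal sketch of a K-Nearest Neighbor classifier with Euclidean or Manhattan distance, assuming NumPy; the toy training set is illustrative:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=3, metric="euclidean"):
    """Predict the majority label among the k closest training instances."""
    if metric == "euclidean":
        d = np.sqrt(((X_train - x_test) ** 2).sum(axis=1))
    else:  # Manhattan distance
        d = np.abs(X_train - x_test).sum(axis=1)
    nearest = np.argsort(d)[:k]                       # indices of the k closest instances
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [6.0, 5.0]])
y_train = np.array([0, 0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.5, 0.5]), k=3))   # predicts 0
print(knn_predict(X_train, y_train, np.array([5.5, 5.0]), k=3))   # predicts 1
```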
📗 Natural Language Processing:
Unigram model: $P(z_1, z_2, ..., z_d) = \prod_{t=1}^{d} P(z_t)$, where $z_t$ is the $t$-th token in a training item, and $d$ is the total number of tokens in the item.
Maximum likelihood estimator (unigram): $\hat{P}(z) = \dfrac{c_z}{\sum_{z'} c_{z'}}$, where $c_z$ is the number of times the token $z$ appears in the training set and $m$ is the vocabulary size (number of unique tokens).
Maximum likelihood estimator (unigram, with Laplace smoothing): $\hat{P}(z) = \dfrac{c_z + 1}{\sum_{z'} c_{z'} + m}$.
Bigram model: $P(z_1, z_2, ..., z_d) = P(z_1) \prod_{t=2}^{d} P(z_t | z_{t-1})$.
Maximum likelihood estimator (bigram): $\hat{P}(z_t | z_{t-1}) = \dfrac{c_{z_{t-1} z_t}}{c_{z_{t-1}}}$, where $c_{z_{t-1} z_t}$ is the number of times the token pair $(z_{t-1}, z_t)$ appears consecutively in the training set.
Maximum likelihood estimator (bigram, with Laplace smoothing): $\hat{P}(z_t | z_{t-1}) = \dfrac{c_{z_{t-1} z_t} + 1}{c_{z_{t-1}} + m}$.
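A minimal sketch of unigram and bigram maximum likelihood estimates with optional Laplace smoothing on a toy token sequence; names like `p_unigram` are illustrative:

```python
from collections import Counter

tokens = "the cat sat on the mat the cat".split()
vocab = sorted(set(tokens))
m = len(vocab)                                   # vocabulary size

unigram = Counter(tokens)                        # token counts
bigram = Counter(zip(tokens, tokens[1:]))        # consecutive token-pair counts

def p_unigram(z, laplace=False):
    """Maximum likelihood estimate of P(z), optionally with Laplace smoothing."""
    if laplace:
        return (unigram[z] + 1) / (len(tokens) + m)
    return unigram[z] / len(tokens)

def p_bigram(z, prev, laplace=False):
    """Maximum likelihood estimate of P(z | prev), optionally with Laplace smoothing."""
    if laplace:
        return (bigram[(prev, z)] + 1) / (unigram[prev] + m)
    return bigram[(prev, z)] / unigram[prev]

print(p_unigram("the"), p_unigram("dog", laplace=True))
print(p_bigram("cat", "the"), p_bigram("dog", "the", laplace=True))
```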
📗 Probability Review:
Conditional probability: $P(A | B) = \dfrac{P(A \cap B)}{P(B)}$.
Joint probability: $P(A \cap B) = P(A | B) P(B)$.
Bayes rule: $P(A | B) = \dfrac{P(B | A) P(A)}{P(B)}$.
Law of total probability: $P(B) = \sum_{k} P(B | A_k) P(A_k)$, where the events $A_k$ partition the sample space.
Independence: $X, Y$ are independent if $P(X = x, Y = y) = P(X = x) P(Y = y)$ for every $x, y$.
Conditional independence: $X, Y$ are conditionally independent conditioned on $Z$ if $P(X = x, Y = y | Z = z) = P(X = x | Z = z) P(Y = y | Z = z)$ for every $x, y, z$.
📗 Bayesian Network:
Conditional Probability Table estimation: $\hat{P}(x_j | \text{par}(x_j)) = \dfrac{c(x_j, \text{par}(x_j))}{c(\text{par}(x_j))}$, where $\text{par}(X_j)$ is the list of parents of $X_j$ in the network and $c(\cdot)$ counts the training items matching the given values.
Conditional Probability Table estimation (with Laplace smoothing): $\hat{P}(x_j | \text{par}(x_j)) = \dfrac{c(x_j, \text{par}(x_j)) + 1}{c(\text{par}(x_j)) + m_j}$, where $m_j$ is the number of possible values of $X_j$.
Bayesian network inference: $P(x_1, x_2, ..., x_m) = \prod_{j=1}^{m} P(x_j | \text{par}(x_j))$.
Naive Bayes estimation:
$\hat{P}(x_j | y) = \dfrac{c(x_j, y)}{c(y)}$, or $\dfrac{c(x_j, y) + 1}{c(y) + m_j}$ with Laplace smoothing.
Naive Bayes classifier: $\hat{y} = \mathop{\mathrm{argmax}}_{y} P(y) \prod_{j=1}^{m} P(x_j | y)$.
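A minimal sketch of Naive Bayes estimation (with Laplace smoothing) and the Naive Bayes classifier on a toy discrete data set, assuming NumPy; function names are illustrative:

```python
import numpy as np
from collections import Counter

def naive_bayes_fit(X, y, laplace=True):
    """Estimate P(y) and P(x_j | y) from counts, with optional Laplace smoothing."""
    n, m = X.shape
    classes = sorted(set(y))
    prior = {c: np.mean(y == c) for c in classes}
    cond = {}                                  # cond[(j, value, c)] = P(x_j = value | y = c)
    for c in classes:
        Xc = X[y == c]
        for j in range(m):
            values = sorted(set(X[:, j]))      # possible values of feature j
            counts = Counter(Xc[:, j])
            for v in values:
                if laplace:
                    cond[(j, v, c)] = (counts[v] + 1) / (len(Xc) + len(values))
                else:
                    cond[(j, v, c)] = counts[v] / len(Xc)
    return prior, cond

def naive_bayes_predict(x, prior, cond):
    """Pick the label maximizing P(y) * prod_j P(x_j | y)."""
    scores = {c: prior[c] * np.prod([cond[(j, x[j], c)] for j in range(len(x))])
              for c in prior}
    return max(scores, key=scores.get)

X = np.array([[1, 0], [1, 1], [0, 0], [0, 1], [0, 0]])
y = np.array([1, 1, 0, 0, 0])
prior, cond = naive_bayes_fit(X, y)
print(naive_bayes_predict(np.array([1, 0]), prior, cond))   # predicts 1 on this toy data
```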
📗 Convolution:
Convolution (1D): $a = x \ast w$, $a_j = \sum_{t=-k}^{k} w_t x_{j - t}$, where $w$ is the filter, and $k$ is half of the width of the filter.
Convolution (2D): $A = X \ast W$, $A_{j j'} = \sum_{s=-k}^{k} \sum_{t=-k}^{k} W_{s t} X_{j - s, j' - t}$, where $W$ is the filter, and $k$ is half of the width of the filter.
Sobel filter: $W_x = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix}$ and $W_y = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix}$.
Image gradient: $\nabla_x = W_x \ast X$, $\nabla_y = W_y \ast X$, with gradient magnitude $G = \sqrt{\nabla_x^2 + \nabla_y^2}$ and gradient direction $\Theta = \arctan\left(\dfrac{\nabla_y}{\nabla_x}\right)$.
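A minimal sketch of 2D convolution with the Sobel filters and the resulting gradient magnitude and direction, assuming NumPy; the loop-based convolution is written for clarity, not speed:

```python
import numpy as np

def convolve2d(X, W):
    """2D convolution (no padding): the filter W is flipped, then slid over X."""
    k = W.shape[0] // 2
    Wf = W[::-1, ::-1]                                  # flip the filter in both directions
    H, L = X.shape
    A = np.zeros((H - 2 * k, L - 2 * k))
    for r in range(A.shape[0]):
        for c in range(A.shape[1]):
            A[r, c] = np.sum(Wf * X[r:r + 2 * k + 1, c:c + 2 * k + 1])
    return A

Wx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)   # Sobel filter W_x
Wy = Wx.T                                                           # Sobel filter W_y

X = np.zeros((6, 6)); X[:, 3:] = 1.0        # toy image with a vertical edge
gx, gy = convolve2d(X, Wx), convolve2d(X, Wy)
G = np.sqrt(gx ** 2 + gy ** 2)              # gradient magnitude
Theta = np.arctan2(gy, gx)                  # gradient direction
print(G)
```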
📗 Convolutional Neural Network:
Fully connected layer: $a^{(l)}_{ij} = g\left(\sum_{k} w^{(l)}_{kj} a^{(l-1)}_{ik} + b^{(l)}_{j}\right)$, where $a^{(l)}_{ij}$ is the activation unit $j$ in layer $l$ for instance $i$, $g$ is the activation function.
Convolution layer: $A^{(l)}_{i} = g\left(W^{(l)} \ast A^{(l-1)}_{i} + b^{(l)}\right)$, where $A^{(l)}_{i}$ is the activation map of instance $i$ in layer $l$.
Pooling layer: (max-pooling) each unit is the maximum of the units in its pooling window, (average-pooling) each unit is the average of the units in its pooling window.
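A minimal sketch of non-overlapping max- and average-pooling on a small activation map, assuming NumPy; the window size and input are illustrative:

```python
import numpy as np

def pool2d(A, size=2, mode="max"):
    """Non-overlapping pooling: each size-by-size window is reduced to one unit."""
    H, L = A.shape
    out = np.zeros((H // size, L // size))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            window = A[r * size:(r + 1) * size, c * size:(c + 1) * size]
            out[r, c] = window.max() if mode == "max" else window.mean()
    return out

A = np.arange(16, dtype=float).reshape(4, 4)
print(pool2d(A, mode="max"))       # [[ 5.  7.] [13. 15.]]
print(pool2d(A, mode="average"))   # [[ 2.5  4.5] [10.5 12.5]]
```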