Young Wu's Homepage

📗 [4 points] Consider a linear model \(a_{i} = w^\top x_{i} + b\), with the cross entropy cost function \(C\) = . The initial weight is \(\begin{bmatrix} w \\ b \end{bmatrix}\) = . What is the updated weight and bias after one (stochastic) gradient descent step if the chosen training data is \(x\) = , \(y\) = ? The learning rate is .

📗 Answer (comma separated vector): .

📗 [3 points] Let \(f\) be a continuously differentiable function in \(\mathbb{R}\). If the derivative \(f'\left(x\right)\) 0 at \(x\) = . Which values of \(x'\) are possible in the next step of gradient descent if we start at \(x\) = ? You can assume the learning rate is 1.

📗 Choices:

None of the above

📗 [3 points] Which ones of the following functions are equal to the squared error for deterministic binary classification? \(C = \displaystyle\sum_{i=1}^{n} \left(f\left(x_{i}\right) - y_{i}\right)^{2}, f\left(x_{i}\right) \in \left\{0, 1\right\}, y_{i} \in \left\{0, 1\right\}\). Note: \(I_{S}\) is the indicator notation on \(S\).

📗 Note: the question is asking for the functions that are identical in values.

📗 Choices:

\(\displaystyle\sum_{i=1}^{n}\)
\(\displaystyle\sum_{i=1}^{n}\)
\(\displaystyle\sum_{i=1}^{n}\)
\(\displaystyle\sum_{i=1}^{n}\)
\(\displaystyle\sum_{i=1}^{n}\)
None of the above

📗 [4 points] Consider a linear model \(a_{i} = w^\top x_{i} + b\), with the hinge cost function . The initial weight is \(\begin{bmatrix} w \\ b \end{bmatrix}\) = . What is the updated weight and bias after one stochastic (sub)gradient descent step if the chosen training data is \(x\) = , \(y\) = ? The learning rate is .

📗 Answer (comma separated vector): .

📗 [3 points] Recall a linear SVM (Support Vector Machine) with slack variables has the objective function \(\dfrac{1}{2} w^\top w + C \displaystyle\sum_{i=1}^{n} \varepsilon_{i}\). What is the optimal \(w\) when the trade-off parameter \(C\) is 0? The training data contains only points with label 0 and with label 1. Only enter the weights, no bias.

📗 Answer (comma separated vector): .

📗 [4 points] Given a linear SVM (Support Vector Machine) that perfectly classifies a set of training data containing positive examples and negative examples. What is the maximum possible number of training examples that could be removed and still produce the exact same SVM as derived for the original training set?

📗 Answer: .

📗 [4 points] Given a linear SVM (Support Vector Machine) that perfectly classifies a set of training data containing positive examples and negative examples. What is the minimum possible number of training examples that need be removed to cause the margin of a linear SVM to increase? If the answer is impossible, enter "-1".

📗 Answer: .

📗 [4 points] Given two items \(x_{1}\) = and \(x_{2}\) = , suppose the feature map for a kernel SVM (Support Vector Machine) is \(\varphi\left(x\right)\) = , what is the kernel (Gram) matrix?

📗 Answer (matrix with multiple lines, each line is a comma separated vector): .

📗 [4 points] Given the number of instances in each class summarized in the following table, how many instances are used to train an one-vs-one SVM (Support Vector Machine) for class vs ?

\(y_{i}\)	0	1	2	3	4
Count

📗 Answer: .

📗 [4 points] Say we have a training set consisting of positive examples and negative examples where each example is a point in a two-dimensional, real-valued feature space. What will the classification accuracy be on the training set with NN (Nearest Neighbor).

📗 Answer: .

📗 [4 points] You have a data set with positive items and negative items. You perform a "leave-one-out" procedure: for each item i, learn a separate kNN (k Nearest Neighbor) classifier on all items except item i, and compute that kNN's accuracy in predicting item i. The leave-one-out accuracy is defined to be the average of the accuracy for each item. What is the leave-one-out accuracy when k = ?

📗 Answer: .

📗 [2 points] You have a dataset with unique data points (half of which are labeled 0 and the other half labeled 1) which you want to use to train a kNN (k Nearest Neighbor) classifier. You setup the experiment as follows: you train kNN classifiers: \(k\) = using all the data points. Then you randomly select data points from the training set, and classify them using each of the classifiers. Which classifier (enter the \(k\) value) will have the highest accuracy? Your answer should not depend on which random subset is selected.

📗 Answer: .

📗 [4 points] List English letters from A to Z: ABCDEFGHIJKLMNOPQRSTUVWXYZ. Define the distance between two letters in the natural way, that is \(d\left(A, A\right) = 0\), \(d\left(A, B\right) = 1\), \(d\left(A, C\right) = 2\) and so on. Each letter has a label, are labeled 0, and the others are labeled 1. This is your training data. Now classify each letter using kNN (k Nearest Neighbor) for odd \(k = 1, 3, 5, 7, ...\). What is the smallest \(k\) where all letters are classified the same (same label, i.e. either all labels are 0s or all labels are 1s). Break ties by preferring the earlier letters in the alphabet. Hint: the nearest neighbor of a letter is the letter itself.

📗 Answer: .

📗 [4 points] Given the following training data, what is the fold cross validation accuracy if NN (Nearest Neighbor) classifier with Manhattan distance is used. The first fold is the first instances, the second fold is the next instances, etc. Break the tie (in distance) by using the instance with the smaller index. Enter a number between 0 and 1.

\(x_{i}\)
\(y_{i}\)

📗 Answer: .

# X1 Practice Exam Problems

# Warning: please enter your ID before you start!

# Question 1

# Question 2

# Question 3

# Question 4

# Question 5

# Question 6

# Question 7

# Question 8

# Question 9

# Question 10

# Question 11

# Question 12

# Question 13

# Question 14

# Question 15

# Grade