📗 Enter your ID (the wisc email ID without @wisc.edu) here: and click (or hit enter key).
📗 You can also load from your saved file and click .
📗 If the questions are not generated correctly, try refreshing the page using the button at the top left corner.
📗 The same ID should generate the same set of questions. Your answers are not saved when you close the browser. You could print the page: , solve the problems, then enter all your answers at the end.
📗 Please do not refresh the page: your answers will not be saved.
📗 [4 points] (NEW) Suppose stochastic sub-gradient descent (SGD) is used for linear regression with loss \(C_{i} = \left| a_{i} - y_{i} \right|\), where \(a_{i} = w x_{i} + b\). The current weight is \(w\) = and the bias is \(b\) = . The learning rate is \(\alpha\) = . What are the updated weight and bias after one step of SGD on the item \(x_{i}\) = , \(y_{i}\) = ? In case there are multiple sub-derivative values, use the one with the smallest absolute value.
📗 Note: this is called LAD regression, and SGD should not be used in practice since the problem is a linear program and can be efficiently solved with linear programming algorithms.
📗 Answer (comma separated vector): .
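📗 Note: a minimal Python sketch of one such sub-gradient step, using made-up values (not the generated question values); the sub-derivative of \(\left| a - y \right|\) with respect to \(a\) is \(\pm 1\), or \(0\) when \(a = y\).
```python
# A minimal sketch of one sub-gradient step for the absolute loss, using
# made-up values (not the generated question values).
def lad_sgd_step(w, b, x, y, alpha):
    a = w * x + b
    # sub-derivative of |a - y| with respect to a: sign(a - y);
    # when a == y, the sub-derivative with smallest absolute value is 0
    g = 0.0 if a == y else (1.0 if a > y else -1.0)
    return w - alpha * g * x, b - alpha * g

print(lad_sgd_step(w=1.0, b=0.0, x=2.0, y=5.0, alpha=0.1))  # (1.2, 0.1)
```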
📗 [2 points] (NEW) Suppose a dataset contains images of cats and images of dogs, and KNN (K Nearest Neighbors) is trained on this dataset and always predicts any new image as cat. What is the smallest value of \(K\)? Your answer should be an odd number and should not depend on specific properties of the dataset.
📗 Answer: .
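📗 Note: a hedged sketch of the reasoning with placeholder counts: in the worst case every dog is among the \(K\) neighbors, so cats only keep the majority when \(K\) is at least twice the number of dogs plus one.
```python
# Placeholder counts (not the generated ones): the prediction is always "cat"
# only if cats outnumber dogs among the K neighbors even when every dog is a
# neighbor, which needs K >= 2 * n_dog + 1 (odd by construction).
def smallest_k(n_cat, n_dog):
    k = 2 * n_dog + 1
    assert k <= n_cat + n_dog   # K cannot exceed the size of the dataset
    return k

print(smallest_k(n_cat=10, n_dog=3))   # 7
```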
📗 [4 points] (NEW) Given the following conditional entropy values, what distribution of \(X\) would maximize the information gain \(I\left(Y | X\right)\)? Assume \(H\left(Y\right)\) = . Enter a vector of probabilities between 0 and 1 that sum up to 1: \(\mathbb{P}\left\{X = 1\right\}, \mathbb{P}\left\{X = 2\right\}, \mathbb{P}\left\{X = 3\right\}, \mathbb{P}\left\{X = 4\right\}\).
Conditional Entropy:
\(H\left(Y | X = 1\right)\) = 
\(H\left(Y | X = 2\right)\) = 
\(H\left(Y | X = 3\right)\) = 
\(H\left(Y | X = 4\right)\) = 
📗 Answer (comma separated vector):
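📗 Note: a short sketch with made-up conditional entropies (not the generated ones); since \(I = H\left(Y\right) - \displaystyle\sum_{x} \mathbb{P}\left\{X = x\right\} H\left(Y | X = x\right)\), the gain is maximized by putting all probability on the \(x\) with the smallest conditional entropy.
```python
import numpy as np

# Made-up conditional entropies (the generated ones replace these): the
# information gain H(Y) - sum_x P(X = x) H(Y | X = x) is largest when all of
# the probability mass is on the value of X with the smallest H(Y | X = x).
cond_entropy = np.array([0.8, 0.3, 0.5, 1.0])   # H(Y|X=1), ..., H(Y|X=4)
p = np.zeros_like(cond_entropy)
p[np.argmin(cond_entropy)] = 1.0
print(p)   # [0. 1. 0. 0.]
```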
📗 [2 points] (NEW) If a linear probability model is used for a classification problem (that is, linear regression is performed to estimate the probability that an item belongs to class 1), and the trained model is given by where \(w\) = and \(b\) = , which of the following training items has the largest cost (square loss is used)? Enter an index \(1, 2, 3, 4, 5\). In case of ties, enter the one with the smallest index.
Index \(i\): 1, 2, 3, 4, 5
Feature \(x_{i}\): 
Label \(y_{i}\): 
📗 Answer: .
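📗 Note: a sketch with made-up values (not the generated ones) showing the cost computation and the tie-breaking rule.
```python
import numpy as np

# Made-up w, b, features and labels (not the generated ones): the cost of item
# i is the squared loss (w * x_i + b - y_i)^2, and np.argmax breaks ties by
# returning the smallest index.
w, b = 0.5, -0.2
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.0, 1.0, 0.0, 1.0, 1.0])
loss = (w * x + b - y) ** 2
print(int(np.argmax(loss)) + 1)   # 1-based index of the largest cost
```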
📗 [4 points] Fill in the missing values in the following joint probability table so that A and B are independent.
- | A = 0 | A = 1
B = 0 |  | 
B = 1 | ?? | ??
📗 Answer (comma separated vector): .
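📗 Note: a sketch with made-up known entries (not the generated ones); independence means every cell factors as \(\mathbb{P}\left\{A = a\right\} \mathbb{P}\left\{B = b\right\}\), so one complete row determines the rest of the table.
```python
# Made-up known entries (the generated table replaces these): independence
# means every cell factors as P(A = a) * P(B = b), so one complete row fixes
# the column marginals and the other row follows.
p00, p01 = 0.12, 0.28            # assumed known row for B = 0
pB0 = p00 + p01                  # P(B = 0) = 0.4
pA0, pA1 = p00 / pB0, p01 / pB0  # P(A = 0) = 0.3, P(A = 1) = 0.7
p10, p11 = pA0 * (1 - pB0), pA1 * (1 - pB0)
print(p10, p11)                  # about 0.18 and 0.42; the four cells sum to 1
```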
📗 [4 points] "It" has a house with many doors. A random door is about to be opened with equal probability. Doors to have monsters that eat people. Doors to are safe. With sufficient bribe, Pennywise will answer your question "Will door 1 be opened?" What's the information gain (also called mutual information) between Pennywise's answer and your encounter with a monster?
📗 Answer: .
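📗 Note: a hedged sketch with placeholder door counts (not the generated ones), assuming door 1 is one of the monster doors; the mutual information is \(H\left(\text{monster}\right) - H\left(\text{monster} | \text{answer}\right)\).
```python
import numpy as np

def H(p):   # binary entropy in bits
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

# Placeholder counts (not the generated ones): n doors opened uniformly at
# random, doors 1..m have monsters, and the question is "Will door 1 be opened?"
n, m = 10, 4
p_yes = 1 / n                      # P(answer is "yes")
# given "yes", the outcome (monster behind door 1) is certain: entropy 0
# given "no", a monster door opens with probability (m - 1) / (n - 1)
info_gain = H(m / n) - (1 - p_yes) * H((m - 1) / (n - 1))
print(info_gain)                   # about 0.145 bits
```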
📗 [3 points] Given the following training set, what is the maximum accuracy of a decision tree with depth 1 trained on this set? Enter a number between 0 and 1.
index | \(x_{1}\) | \(y\)
1 |  | 
2 |  | 
3 |  | 
4 |  | 
5 |  | 
6 |  | 
📗 Answer: .
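📗 Note: a sketch on a made-up training set (not the generated one); a depth-1 tree thresholds \(x_{1}\) once and predicts one label on each side, so the best accuracy can be found by trying every threshold and both label assignments.
```python
import numpy as np

# Made-up training set (not the generated one): try every threshold of x1 and
# both label assignments; a constant prediction is the depth-0 fallback.
x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([0, 0, 1, 1, 0, 1])
best = max(np.mean(y == 0), np.mean(y == 1))        # constant tree
for t in x:
    for lab in (0, 1):
        pred = np.where(x <= t, lab, 1 - lab)
        best = max(best, np.mean(pred == y))
print(best)   # 5/6 for this toy set
```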
📗 [4 points] Given the following training set, add one item \(\begin{bmatrix} x_{1} \\ x_{2} \end{bmatrix}\) with \(y\) = so that all 7 items are support vectors for the Hard Margin SVM (Support Vector Machine) trained on the new training set.
\(x_{1}\) | \(x_{2}\) | \(y\)
 |  | 0
 |  | 0
 |  | 0
 |  | 1
 |  | 1
 |  | 1
📗 Answer (comma separated vector): .
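📗 Note: a hedged sketch with made-up points (not the generated ones) for checking which items end up as support vectors; a hard-margin linear SVM is approximated here by scikit-learn's soft-margin SVC with a very large C, and an item is a support vector exactly when it lies on a margin boundary.
```python
import numpy as np
from sklearn.svm import SVC

# Made-up points (not the generated ones): an item is a support vector exactly
# when y' * (w . x + b) = 1, with labels recoded to y' in {-1, +1}.
X = np.array([[0, 0], [0, 1], [1, 0], [2, 2], [2, 3], [3, 2], [1.5, 1.5]])
y = np.array([0, 0, 0, 1, 1, 1, 1])
clf = SVC(kernel="linear", C=1e10).fit(X, y)     # large C approximates hard margin
w, b = clf.coef_[0], clf.intercept_[0]
margins = (2 * y - 1) * (X @ w + b)
print(np.isclose(margins, 1.0, atol=1e-2))       # True marks the support vectors
```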
📗 [2 points] Suppose an SVM (Support Vector Machine) has \(w\) = and \(b\) = . What is the actual distance between the two planes defined by \(w^\top x + b\) = and \(w^\top x + b\) = . Note that these are not the SVM plus and minus planes.
📗 Answer: .
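📗 Note: a sketch with placeholder values (not the generated ones); the two planes \(w^\top x + b = c_{1}\) and \(w^\top x + b = c_{2}\) are parallel, and their distance is \(\dfrac{\left| c_{1} - c_{2} \right|}{\left\|w\right\|_{2}}\).
```python
import numpy as np

# Placeholder w and plane constants (not the generated ones): the distance
# between the parallel planes w . x + b = c1 and w . x + b = c2 is
# |c1 - c2| / ||w||_2.
w = np.array([3.0, 4.0])
c1, c2 = 1.0, -2.0
print(abs(c1 - c2) / np.linalg.norm(w))   # 3 / 5 = 0.6
```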
📗 [4 points] Given the two training points and and their labels \(0\) and \(1\), what is the kernel (Gram) matrix if the Sigmoid kernel \(K_{i i'} = \tanh\left(\sigma \cdot \left(x_{i}\right)^\top \left(x_{i'}\right) + 1\right)\) is used with \(\sigma\) = ? Note: you can either leave "tanh" in your answer or evaluate it as \(\tanh\left(x\right) = \dfrac{e^{x} - e^{-x}}{e^{x} + e^{-x}} = \dfrac{e^{2 x} - 1}{e^{2 x} + 1}\).
📗 Answer (matrix with multiple lines, each line is a comma separated vector): .
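📗 Note: a sketch with made-up points and \(\sigma\) (not the generated ones) that evaluates the 2 by 2 Gram matrix numerically.
```python
import numpy as np

# Made-up points and sigma (not the generated ones): entry (i, i') of the Gram
# matrix is tanh(sigma * x_i . x_i' + 1).
x = np.array([[1.0, 2.0], [0.0, -1.0]])   # the two training points, one per row
sigma = 0.5
K = np.tanh(sigma * (x @ x.T) + 1)
print(K)   # 2 x 2 symmetric matrix
```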
📗 [3 points] In one iteration of the Perceptron Algorithm, \(x\) = , \(y\) = , and the predicted label \(\hat{y} = a\) = . The learning rate \(\alpha = 1\). After the iteration, how many of the weights (including the bias \(b\)) are increased (the change is strictly larger than 0)? If it is impossible to figure out given the information, enter -1.
📗 Answer: .
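📗 Note: a sketch with placeholder values (not the generated ones), assuming the update form \(w \leftarrow w - \alpha \left(a - y\right) x\), \(b \leftarrow b - \alpha \left(a - y\right)\) with 0/1 labels.
```python
import numpy as np

# Placeholder values; the update form w <- w - alpha * (a - y) * x,
# b <- b - alpha * (a - y) with 0/1 labels is an assumption.
x = np.array([2.0, -1.0, 0.0])
y, a, alpha = 1, 0, 1.0
w, b = np.array([0.5, 0.5, 0.5]), 0.5
w_new = w - alpha * (a - y) * x
b_new = b - alpha * (a - y)
print(int(np.sum(w_new > w)) + int(b_new > b))   # number of strictly increased weights
```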
📗 [3 points] In one step of gradient descent for a \(L_{2}\) regularized logistic regression, suppose \(w\) = , \(b\) = , and \(\dfrac{\partial C}{\partial w}\) = , \(\dfrac{\partial C}{\partial b}\) = . If the learning rate is \(\alpha\) = and the regularization parameter is \(\lambda\) = , what is \(w\) after one iteration? Use the loss \(C\left(w, b\right)\) and the regularization \(\dfrac{\lambda}{2} \left\|\begin{bmatrix} w \\ b \end{bmatrix}\right\|_{2}^{2}\) = \(\dfrac{\lambda}{2} \left(w^{2} + b^{2}\right)\).
📗 Answer: .
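📗 Note: a sketch with placeholder values (not the generated ones); with the penalty \(\dfrac{\lambda}{2}\left(w^{2} + b^{2}\right)\), the full gradient in \(w\) is \(\dfrac{\partial C}{\partial w} + \lambda w\).
```python
# Placeholder values (not the generated ones): one gradient descent step gives
# w <- w - alpha * (dC/dw + lambda * w).
w, b = 2.0, -1.0
dC_dw, dC_db = 0.5, 0.3
alpha, lam = 0.1, 0.01
w_new = w - alpha * (dC_dw + lam * w)
print(w_new)   # 2.0 - 0.1 * (0.5 + 0.02) = 1.948
```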
📗 [4 points] Given the following neural network that classifies all the training instances correctly, what are the labels (0 or 1) of the training data? The activation functions are LTU for all units: \(1_{\left\{z \geq 0\right\}}\). The first layer weight matrix is , with bias vector , and the second layer weight vector is , with bias .
\(x_{i1}\) | \(x_{i2}\) | \(y_{i}\) or \(a^{\left(2\right)}_{1}\)
0 | 0 | ?
0 | 1 | ?
1 | 0 | ?
1 | 1 | ?
Note: if the weights are not shown clearly, you could move the nodes around with mouse or touch.
📗 Answer (comma separated vector): .
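📗 Note: a sketch of the forward pass with made-up weights (the actual weights are in the network diagram); the labels are read off as the output of the second layer for each input row.
```python
import numpy as np

# Made-up weights (the actual weights are shown in the network diagram):
# forward pass of a 2-2-1 network with LTU activations 1{z >= 0}; this
# particular choice of weights happens to compute XOR.
W1 = np.array([[1.0, 1.0], [1.0, 1.0]])   # first layer weights (assumed)
b1 = np.array([-0.5, -1.5])               # first layer biases (assumed)
w2 = np.array([1.0, -1.0])                # second layer weights (assumed)
b2 = -0.5                                 # second layer bias (assumed)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
a1 = (X @ W1 + b1 >= 0).astype(int)       # hidden layer activations
a2 = (a1 @ w2 + b2 >= 0).astype(int)      # output layer = predicted labels
print(a2)   # [0 1 1 0]
```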
📗 [3 points] Suppose you are given a neural network with hidden layers, input units, output units, and hidden units. In one backpropagation step when computing the gradient of the cost (for example, squared loss) with respect to \(w^{\left(1\right)}_{11}\), the weight in layer \(1\) connecting input \(1\) and hidden unit \(1\), how many weights (including \(w^{\left(1\right)}_{11}\) itself, and including biases) are used in the backpropagation step of \(\dfrac{\partial C}{\partial w^{\left(1\right)}_{11}}\)?
📗 Note: the backpropagation step assumes the activations in all layers are already known, so do not count the weights and biases used in the forward step computing the activations.
📗 Answer: .
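📗 Note: as a hedged sketch for the special case of a single hidden layer with element-wise activation \(g\), the chain rule gives \(\dfrac{\partial C}{\partial w^{\left(1\right)}_{11}} = \displaystyle\sum_{k} \dfrac{\partial C}{\partial a^{\left(2\right)}_{k}} g'\left(z^{\left(2\right)}_{k}\right) w^{\left(2\right)}_{1 k} g'\left(z^{\left(1\right)}_{1}\right) x_{1}\), so apart from \(w^{\left(1\right)}_{11}\) itself (which the question counts), only the second layer weights \(w^{\left(2\right)}_{1 k}\) leaving hidden unit \(1\) appear once the activations are known; the count for the generated architecture follows the same pattern layer by layer.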
📗 [1 points] Please enter any comments including possible mistakes and bugs with the questions or your answers. If you have no comments, please enter "None": do not leave it blank.
📗 Please do not modify the content in the above text field: use the "Grade" button to update.
📗 Please wait for the message "Successful submission." to appear after the "Submit" button. Please also save the text in the above text box to a file using the button or copy and paste it into a file yourself and submit it to Canvas Assignment X1. You could submit multiple times (but please do not submit too often): only the latest submission will be counted.
📗 You could load your answers from the text (or txt file) in the text box below using the button . The first two lines should be "##x: 2" and "##id: your id", and the format of the remaining lines should be "##1: your answer to question 1" newline "##2: your answer to question 2", etc. Please make sure that your answers are loaded correctly before submitting them.