📗 Enter your ID (the wisc email ID without @wisc.edu) here: and click (or hit the enter key).
📗 If the questions are not generated correctly, try refreshing the page using the button at the top left corner.
📗 The same ID should generate the same set of questions. Your answers are not saved when you close the browser. You could print the page, solve the problems, then enter all your answers at the end.
📗 Please do not refresh the page: your answers will not be saved.
📗 [4 points] Consider a Linear Threshold Unit (LTU) perceptron with initial weights \(w\) = and bias \(b\) = , trained using the Perceptron Algorithm. Given a new input \(x\) = with label \(y\) = and learning rate \(\alpha\) = , compute the updated weights and bias \(w', b'\) = :
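📗 Note: a minimal sketch of one LTU update step, assuming the rule \(w' = w + \alpha \left(y - \hat{y}\right) x\) and \(b' = b + \alpha \left(y - \hat{y}\right)\); the numbers below are placeholders, substitute the values generated for your ID.
def ltu_update(w, b, x, y, alpha):
    # predict with the current weights: 1 if w x + b >= 0, else 0
    y_hat = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else 0
    # no change when the prediction is already correct (y - y_hat == 0)
    w_new = [wi + alpha * (y - y_hat) * xi for wi, xi in zip(w, x)]
    b_new = b + alpha * (y - y_hat)
    return w_new, b_new

print(ltu_update(w=[0.5, -0.2], b=0.1, x=[1, 1], y=0, alpha=0.1))  # placeholder numbers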
📗 Answer (comma separated vector): .
📗 [3 points] Let \(g\left(z\right) = \dfrac{1}{1 + \exp\left(-z\right)}, z = w^\top x = w_{1} x_{1} + w_{2} x_{2} + ... + w_{d} x_{d}\), \(d\) = be a sigmoid perceptron with inputs \(x_{1} = ... = x_{d}\) = and weights \(w_{1} = ... = w_{d}\) = . There is no bias term. If the desired output is \(y\) = , and the sigmoid perceptron update rule has a learning rate of \(\alpha\) = , what will happen after one step of update? Each \(w_{i}\) will change by (enter a number, positive for increase and negative for decrease).
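📗 Note: assuming the squared-error sigmoid update \(C = \dfrac{1}{2} \left(a - y\right)^{2}\) (an assumption here, consistent with the squared-loss logistic regression question later on this page), the chain rule gives \(\Delta w_{i} = - \alpha \left(a - y\right) a \left(1 - a\right) x_{i}\) with \(a = g\left(w^\top x\right)\), using \(g'\left(z\right) = g\left(z\right) \left(1 - g\left(z\right)\right)\). Since all the \(x_{i}\) and \(w_{i}\) are equal, every weight changes by the same amount.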
📗 Answer: .
📗 [6 points] With a linear threshold unit perceptron, implement the following function. That is, you should write down the weights \(w_{0}, w_{A}, w_{B}\). Enter the bias first, then the weights on A and B.
A | B | function
0 | 0 |
0 | 1 |
1 | 0 |
1 | 1 |
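📗 Note: one way to check a candidate answer is to verify \(1_{\left\{w_{0} + w_{A} A + w_{B} B \geq 0\right\}}\) against all four rows of the table. A minimal Python sketch (the target table and weights below implement AND as an example; use the function from your question):
def ltu(w0, wA, wB, A, B):
    # linear threshold unit: 1 if the weighted sum is nonnegative
    return 1 if w0 + wA * A + wB * B >= 0 else 0

target = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}  # example: AND
w0, wA, wB = -1.5, 1, 1  # example weights implementing AND
print(all(ltu(w0, wA, wB, A, B) == v for (A, B), v in target.items()))  # True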
📗 Answer (comma separated vector): .
📗 [3 points] What is the minimum number of training items that need to be removed so that a Perceptron can learn the remaining training set (with accuracy 100 percent)?
\(x_{1}\) | \(x_{2}\) | \(x_{3}\) | \(y\)
0 | 0 | 0 | 0
0 | 1 | 0 | 1
0 | 0 | 1 | 1
1 | 0 | 0 | 1
0 | 1 | 1 | 1
0 | 1 | 1 | 1
📗 Answer: .
📗 [3 points] In one iteration of the Perceptron Algorithm, \(x\) = , \(y\) = , and the predicted label is \(\hat{y} = a\) = . The learning rate is \(\alpha = 1\). After the iteration, how many of the weights (including the bias \(b\)) are increased (i.e., the change is strictly larger than 0)? If it is impossible to figure out given the information, enter -1.
📗 Answer: .
📗 [3 points] In one iteration of the Perceptron Algorithm, the initial weights are \(w\) = and \(b\) = , with \(x\) = , \(y \in \left\{0, 1\right\}\), and learning rate \(\alpha = 1\). After the iteration, the weights remain unchanged. What is the correct label \(y\)? The LTU perceptron classifier is \(1_{\left\{w x + b \geq 0\right\}}\).
📗 Answer: .
📗 [2 points] Consider a rectified linear unit (ReLU) with input \(x\) and a bias term. The output can be written as \(y\) = . Here, the weight is and the bias is . Write down the input value \(x\) that produces a specific output \(y\) = .
📗 The red curve is a plot of the activation function; given the y-value of the green point, the question asks for its x-value.
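📗 Note: assuming the ReLU form \(y = \max\left\{0, w x + b\right\}\), a target output \(y > 0\) satisfies \(w x + b = y\), so \(x = \dfrac{y - b}{w}\).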
📗 Answer: .
📗 [2 points] Consider a single sigmoid perceptron with bias weight \(w_{0}\) = , a single input \(x_{1}\) with weight \(w_{1}\) = , and the sigmoid activation function \(g\left(z\right) = \dfrac{1}{1 + \exp\left(-z\right)}\). For what input \(x_{1}\) does the perceptron output the value \(a\) = ?
📗 The red curve is a plot of the activation function; given the y-value of the green point, the question asks for its x-value.
📗 Note: Math.js does not accept "ln(...)", please use "log(...)" instead.
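📗 Note: solving \(a = \dfrac{1}{1 + \exp\left(- \left(w_{0} + w_{1} x_{1}\right)\right)}\) for the input gives \(w_{0} + w_{1} x_{1} = \log\left(\dfrac{a}{1 - a}\right)\), hence \(x_{1} = \dfrac{1}{w_{1}} \left(\log\left(\dfrac{a}{1 - a}\right) - w_{0}\right)\).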
📗 Answer: .
📗 [4 points] Suppose the squared loss is used to do stochastic gradient descent for logistic regression, i.e. \(C = \dfrac{1}{2} \displaystyle\sum_{i=1}^{n} \left(a_{i} - y_{i}\right)^{2}\) where \(a_{i} = \dfrac{1}{1 + e^{- w x_{i} - b}}\). Given the current weight \(w\) = and bias \(b\) = , with \(x_{i}\) = , \(y_{i}\) = , \(a_{i}\) = (no need to recompute this value), with learning rate \(\alpha\) = . What is the updated after the iteration? Enter a single number.
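📗 Note: by the chain rule, \(\dfrac{\partial C}{\partial w} = \left(a_{i} - y_{i}\right) a_{i} \left(1 - a_{i}\right) x_{i}\) and \(\dfrac{\partial C}{\partial b} = \left(a_{i} - y_{i}\right) a_{i} \left(1 - a_{i}\right)\) for the chosen item, so the updated values are \(w - \alpha \left(a_{i} - y_{i}\right) a_{i} \left(1 - a_{i}\right) x_{i}\) and \(b - \alpha \left(a_{i} - y_{i}\right) a_{i} \left(1 - a_{i}\right)\).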
📗 Answer: .
📗 [3 points] We use gradient descent to find the minimum of the function \(f\left(x\right)\) = with step size \(\eta > 0\). If we start from the point \(x_{0}\) = , how small should \(\eta\) be so that we make progress in the first iteration? Enter the largest value of \(\eta\) below which we make progress. For example, if we make progress whenever \(\eta < 0.01\), enter \(0.01\).
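📗 Note: a worked example with assumed values (your \(f\) and \(x_{0}\) will differ): for \(f\left(x\right) = x^{2}\) starting at \(x_{0} = 1\), the update gives \(x_{1} = x_{0} - \eta f'\left(x_{0}\right) = 1 - 2 \eta\), and we make progress when \(f\left(x_{1}\right) < f\left(x_{0}\right)\), i.e. \(\left|1 - 2 \eta\right| < 1\), which holds exactly when \(0 < \eta < 1\); the answer would then be \(1\).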
📗 Answer: .
📗 [3 points] Let \(x = \left(x_{1}, x_{2}, x_{3}\right)\). We want to minimize the objective function \(f\left(x\right)\) = using gradient descent. Let the stepsize \(\eta\) = . If we start at the vector \(x^{\left(0\right)}\) = , what is the next vector \(x^{\left(1\right)}\) produced by gradient descent?
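📗 Note: the gradient descent update is \(x^{\left(1\right)} = x^{\left(0\right)} - \eta \nabla f\left(x^{\left(0\right)}\right)\), where \(\nabla f = \left(\dfrac{\partial f}{\partial x_{1}}, \dfrac{\partial f}{\partial x_{2}}, \dfrac{\partial f}{\partial x_{3}}\right)^\top\) is evaluated at \(x^{\left(0\right)}\).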
📗 Answer (comma separated vector): .
📗 [2 points] Alice, Bob and Cindy go to the same school and live on a straight street lined with evenly spaced telephone poles. Alice's house is at the pole , Bob's is at the pole , Cindy's is at the pole . Where should the school set up a school bus stop so that the sum of distances (from house to bus stop) walked by the three students is minimized?
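📗 Note: the sum of absolute distances \(\displaystyle\sum_{i} \left| p_{i} - s \right|\) over a stop position \(s\) is minimized at the median of the positions \(p_{i}\). For example (made-up pole numbers), with houses at poles 1, 5, and 6, the best stop is pole 5, with total walking distance \(4 + 0 + 1 = 5\).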
📗 Answer: .
📗 [3 points] Which functions are (weakly) convex on \(\mathbb{R}\)?
📗 Choices:
None of the above
📗 [1 point] A binary classifier is trained on a training set, and the resulting classifier is: \(\hat{y} = 1\) if \(a x_{1} + b x_{2} + c \geq 0\) and \(\hat{y} = 0\) otherwise; its performance is then tested on a separate test set. The accuracy of the classifier is . What is the accuracy if the flipped classifier (\(\hat{y} = 1\) if \(a x_{1} + b x_{2} + c < 0\) and \(\hat{y} = 0\) otherwise) is used?
📗 Enter a fraction to represent the accuracy, for example, enter 0.5 if the accuracy is 50 percent and enter 1 if the accuracy is 100 percent.
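📗 Note: on the same test set, the flipped classifier turns every correct prediction into an incorrect one and vice versa, so its accuracy is \(1\) minus the original accuracy.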
📗 Answer: .
📗 [2 points] A test set \(\left(x_{1}, y_{1}\right), ..., \left(x_{100}, y_{100}\right)\) contains labels \(y_{i}\) = for \(i = 1, ..., 100\). A classifier simply predicts all the time (the labels are +1 and -1). What is this classifier's test accuracy?
📗 Enter a fraction to represent the accuracy, for example, enter 0.5 if the accuracy is 50 percent and enter 1 if the accuracy is 100 percent.
📗 Answer: .
📗 [3 points] What is the minimum zero-one cost of a binary (y is either 0 or 1) linear (threshold) classifier (for example, LTU perceptron) on the following data set?
\(x_{i}\) | 1 | 2 | 3 | 4 | 5 | 6
\(y_{i}\) | | | | | |
📗 A linear classifier here is a vertical line that separates the two classes: you want to draw the line so that the fewest mistakes (i.e. the zero-one cost) are made.
📗 Answer: .
📗 [3 points] Which of the following functions are equal to the squared error for deterministic binary classification? \(C = \displaystyle\sum_{i=1}^{n} \left(f\left(x_{i}\right) - y_{i}\right)^{2}, f\left(x_{i}\right) \in \left\{0, 1\right\}, y_{i} \in \left\{0, 1\right\}\). Note: \(I_{S}\) is the indicator function of \(S\).
📗 Note: the question is asking for the functions that are identical in values.
📗 Choices:
\(\displaystyle\sum_{i=1}^{n}\)
\(\displaystyle\sum_{i=1}^{n}\)
\(\displaystyle\sum_{i=1}^{n}\)
\(\displaystyle\sum_{i=1}^{n}\)
\(\displaystyle\sum_{i=1}^{n}\)
None of the above
📗 [3 points] Let \(f\) be a continuously differentiable function on \(\mathbb{R}\), and suppose the derivative satisfies \(f'\left(x\right)\) 0 at \(x\) = . Which values of \(x'\) are possible in the next step of gradient descent if we start at \(x\) = ? You can assume the learning rate is 1.
📗 Choices:
None of the above
📗 [3 points] Suppose there is a single integer input \(x \in\) {\(0\), \(1\), ..., }, and the label is binary, \(y \in\) {\(0\), \(1\)}. Let \(\mathcal{H}\) be a hypothesis space containing all possible linear classifiers. How many unique classifiers are there in \(\mathcal{H}\)? For example, the three linear classifiers \(1_{\left\{x < 0.4\right\}}\), \(1_{\left\{x \leq 0.4\right\}}\) and \(1_{\left\{x < 0.6\right\}}\) are considered the same classifier since they classify all possible data sets the same way.
📗 Answer: .
📗 [0 points] To be added.
📗 [2 points] Let the input \(x \in \mathbb{R}\). Thus the input layer has a single \(x\) input. The network has 5 hidden layers. Each hidden layer has 10 units. The output layer has a single unit and outputs \(y \in \mathbb{R}\). Between layers, the network is fully connected. All units in the network have a bias input. All units are linear units, namely the activation function is the identity function \(a = g\left(z\right) = z\), while \(z = w^\top x + b\) is a linear combination of all inputs to that unit (including the bias). Which functions can this network compute?
📗 Choices:
None of the above
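📗 Note: a composition of affine functions is affine: every layer computes \(a^{\left(l\right)} = W^{\left(l\right)} a^{\left(l - 1\right)} + b^{\left(l\right)}\), so the end-to-end function collapses to a single affine function \(y = \tilde{w} x + \tilde{b}\), no matter how many linear layers or hidden units the network has.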
📗 [2 points] In a three-layer (fully connected) neural network, the first hidden layer contains sigmoid units, the second hidden layer contains units, and the output layer contains units. The input is dimensional. How many weights plus biases does this neural network have? Enter one number.
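📗 Note: for fully connected layers of sizes \(d \to h_{1} \to h_{2} \to o\) with one bias per unit, the count is \(\left(d + 1\right) h_{1} + \left(h_{1} + 1\right) h_{2} + \left(h_{2} + 1\right) o\), since each unit has one weight per input plus a bias.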
📗 Answer: .
📗 [4 points] Fill in the missing weight below so that it computes the following function. All inputs take values 0 or 1, and the perceptrons are linear threshold units. The first layer weight matrix is , with bias vector , and the second layer weight vector is , with bias .
\(x_{1}\) | \(x_{2}\) | \(y\) or \(a^{\left(2\right)}_{1}\)
0 | 0 |
0 | 1 |
1 | 0 |
1 | 1 |
📗 Note: if the weights are not shown clearly, you could move the nodes around with mouse or touch.
📗 Answer: .
📗 [4 points] Fill in the missing weight below so that it computes the following function. All inputs take values 0 or 1, and the perceptrons are linear threshold units. The first layer weight matrix is , with bias vector , and the second layer weight vector is , with bias .
\(x_{1}\) | \(x_{2}\) | \(y\) or \(a^{\left(2\right)}_{1}\)
0 | 0 |
0 | 1 |
1 | 0 |
1 | 1 |
📗 Note: if the weights are not shown clearly, you could move the nodes around with mouse or touch.
📗 Answer: .
📗 [2 points] We have a biased coin with probability of producing Heads. We create a predictor as follows: generate a random number uniformly distributed in (0, 1). If the random number is less than we predict Heads, otherwise, we predict Tails. What is this predictor's (expected) accuracy in predicting the coin's outcome?
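📗 Note: writing \(p\) for the probability of Heads and \(q\) for the prediction threshold, the coin flip and the prediction are independent, so the expected accuracy is \(p q + \left(1 - p\right) \left(1 - q\right)\).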
📗 Answer: .
📗 [1 point] You want to design a neural network with sigmoid units to predict a person's academic role from their webpage. Possible roles are "professor" (label 0), "student" (label 1), and "staff" (label 2). Suppose each person can take on only one of these roles at a time. The neural network uses one-hot encoding: label 0 is encoded by \(\begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}\), label 1 is encoded by \(\begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}\), and label 2 is encoded by \(\begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}\). What is the role (enter a label, not a string) if the output is ?
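📗 Note: with one-hot encoding, the usual decoding rule is to predict the label whose component of the output vector is largest, i.e. \(\hat{y} = \mathop{\mathrm{argmax}}_{k} a_{k}\).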
📗 Answer: .
📗 [4 points] Given the following neural network, which classifies all the training instances correctly, what are the labels (0 or 1) of the training data? The activation functions are LTU for all units: \(1_{\left\{z \geq 0\right\}}\). The first layer weight matrix is , with bias vector , and the second layer weight vector is , with bias .
\(x_{i1}\) | \(x_{i2}\) | \(y_{i}\) or \(a^{\left(2\right)}_{1}\)
0 | 0 | ?
0 | 1 | ?
1 | 0 | ?
1 | 1 | ?
📗 Note: if the weights are not shown clearly, you could move the nodes around with mouse or touch.
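📗 Note: a minimal sketch of the forward pass used to read off the labels; the weights below are placeholders for an XOR-like network, substitute the values shown in your figure.
def ltu(z):
    return 1 if z >= 0 else 0

def forward(x, W1, b1, W2, b2):
    # hidden layer: one LTU per row of W1
    a1 = [ltu(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
    # output LTU on the hidden activations
    return ltu(sum(w * a for w, a in zip(W2, a1)) + b2)

W1, b1 = [[1, 1], [-1, -1]], [-0.5, 1.5]  # placeholder: OR and NAND hidden units
W2, b2 = [1, 1], -1.5                     # placeholder: AND of the two
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, forward(x, W1, b1, W2, b2))  # prints the XOR labels 0, 1, 1, 0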
📗 Answer (comma separated vector): .
📗 [3 points] Suppose you are given a neural network with hidden layers, input units, output units, and hidden units. In one backpropagation step when computing the gradient of the cost (for example, squared loss) with respect to \(w^{\left(1\right)}_{11}\), the weight in layer \(1\) connecting input \(1\) and hidden unit \(1\), how many weights (including \(w^{\left(1\right)}_{11}\) itself, and including biases) are used in the backpropagation step of \(\dfrac{\partial C}{\partial w^{\left(1\right)}_{11}}\)?
📗 Note: the backpropagation step assumes the activations in all layers are already known, so do not count the weights and biases used in the forward step computing the activations.
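📗 Note: a sketch of the chain rule behind this count: \(\dfrac{\partial C}{\partial w^{\left(1\right)}_{11}} = \dfrac{\partial C}{\partial a^{\left(1\right)}_{1}} \cdot \dfrac{\partial a^{\left(1\right)}_{1}}{\partial z^{\left(1\right)}_{1}} \cdot \dfrac{\partial z^{\left(1\right)}_{1}}{\partial w^{\left(1\right)}_{11}}\), where \(\dfrac{\partial C}{\partial a^{\left(1\right)}_{1}}\) sums the contributions of all paths from hidden unit \(1\) to the output units, each path using the weights along it; counting those weights plus \(w^{\left(1\right)}_{11}\) itself gives the answer.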
📗 Answer: .
📗 [4 points] Consider a linear model \(a_{i} = w^\top x_{i} + b\), with the cross entropy cost function \(C\) = . The initial weight is \(\begin{bmatrix} w \\ b \end{bmatrix}\) = . What is the updated weight and bias after one (stochastic) gradient descent step if the chosen training data is \(x\) = , \(y\) = ? The learning rate is .
📗 Answer (comma separated vector): .
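📗 Note: whatever the generated cost \(C\) is, the step has the generic form \(\begin{bmatrix} w' \\ b' \end{bmatrix} = \begin{bmatrix} w \\ b \end{bmatrix} - \alpha \begin{bmatrix} \partial C / \partial w \\ \partial C / \partial b \end{bmatrix}\), with the gradient evaluated at the chosen \(\left(x, y\right)\); for the standard cross entropy with a sigmoid activation (an assumption, your \(C\) is shown above) this simplifies to \(w' = w - \alpha \left(a - y\right) x\) and \(b' = b - \alpha \left(a - y\right)\).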
📗 You could save the text in the above text box to a file using the button, or copy and paste it into a file yourself.
📗 You could load your answers from the text (or txt file) in the text box below using the button. The first two lines should be "##x: 1" and "##id: your id", and the format of the remaining lines should be "##1: your answer to question 1" newline "##2: your answer to question 2", etc. Please make sure that your answers are loaded correctly before submitting them.