📗 Enter your ID (the wisc email ID without @wisc.edu) here: and click (or hit the "Enter" key)
📗 The official deadline is Jul 4. Late submissions within a week will be accepted without penalty, but please submit a regrade request form: Link.
📗 The same ID should generate the same set of questions. Your answers are not saved when you close the browser. You could print the page, solve the problems, then enter all your answers at the end.
📗 Please do not refresh the page: your answers will not be saved.
📗 [4 points] Consider a Linear Threshold Unit (LTU) perceptron with initial weights \(w\) = and bias \(b\) = trained using the Perceptron Algorithm. Given a new input \(x\) = and \(y\) = . Let the learning rate be \(\alpha\) = , compute the updated weights, \(w', b'\) = :
Hint
See Spring 2018 Final Q7, Spring 2017 Final Q3. The perceptron learning formula using the notations in this question is: \(w' = w - \alpha \left(a - y\right) x\) and \(b' = b - \alpha \left(a - y\right)\) where \(a = 1_{\left\{w^\top x + b \geq 0\right\}}\). Note that this is not a gradient descent procedure: it just happens to use a similar formula.
📗 Answer (comma separated vector): .
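📗 The update rule in the hint above can be sketched in a few lines of Python. The function name `perceptron_update` and all numeric values below are hypothetical illustrations, not the values generated for any particular ID.

```python
def perceptron_update(w, b, x, y, alpha):
    """One step of the LTU perceptron rule from the hint."""
    # Activation: a = 1 if w^T x + b >= 0, else 0.
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    a = 1 if z >= 0 else 0
    # w' = w - alpha * (a - y) * x and b' = b - alpha * (a - y).
    w_new = [wi - alpha * (a - y) * xi for wi, xi in zip(w, x)]
    b_new = b - alpha * (a - y)
    return w_new, b_new


# Hypothetical example values (assumptions, not from the question).
w_new, b_new = perceptron_update(w=[1, -1], b=0, x=[2, 1], y=0, alpha=0.5)
print(w_new, b_new)
```

Note that when the prediction is already correct (\(a = y\)), the rule leaves \(w\) and \(b\) unchanged.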
📗 [6 points] With a linear threshold unit perceptron, implement the following function. That is, you should write down the weights \(w_{0}, w_{A}, w_{B}\). Enter the bias first, then the weights on A and B.
A | B | function
0 | 0 |
0 | 1 |
1 | 0 |
1 | 1 |
📗 You can plot your line given by \(w_{0}, w_{A}, w_{B}\) to see if it separates the dataset correctly: . If no green line shows up, it means the entire line is outside of the range [0, 1].
Hint
See Fall 2011 Final Q10, Spring 2018 Final Q4, Fall 2006 Final Q16, Fall 2005 Final Q16. There can be many possible answers. A possible answer should satisfy the following conditions: \(1_{\left\{w_{0} + 0 w_{A} + 0 w_{B} \geq 0\right\}} = f\left(A = 0, B = 0\right)\), \(1_{\left\{w_{0} + 0 w_{A} + 1 w_{B} \geq 0\right\}} = f\left(A = 0, B = 1\right)\), \(1_{\left\{w_{0} + 1 w_{A} + 0 w_{B} \geq 0\right\}} = f\left(A = 1, B = 0\right)\), \(1_{\left\{w_{0} + 1 w_{A} + 1 w_{B} \geq 0\right\}} = f\left(A = 1, B = 1\right)\).
📗 Answer (comma separated vector): .
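📗 A candidate answer can be checked against the four conditions in the hint with a short script. This is a minimal sketch: the helper names `ltu` and `implements` are hypothetical, and the AND truth table is used only as an example target function, not the function generated for your ID.

```python
def ltu(w0, wA, wB, A, B):
    """Linear threshold unit: outputs 1 when w0 + wA*A + wB*B >= 0."""
    return 1 if w0 + wA * A + wB * B >= 0 else 0


def implements(weights, table):
    """Check that the weights (bias first) reproduce every row of the table."""
    w0, wA, wB = weights
    return all(ltu(w0, wA, wB, A, B) == out for (A, B), out in table.items())


# Example target (an assumption for illustration): logical AND.
and_table = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}

print(implements((-1.5, 1, 1), and_table))  # these weights realize AND
print(implements((-0.5, 1, 1), and_table))  # these realize OR, so the check fails
```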
📗 [3 points] What is the minimum number of training items that needs to be removed so that a Perceptron can learn the remaining training set (with accuracy 100 percent)?
\(x_{1}\) | \(x_{2}\) | \(x_{3}\) | \(y\)
0 | 0 | 0 |
0 | 0 | 1 |
0 | 1 | 0 |
0 | 1 | 1 |
1 | 0 | 0 |
1 | 0 | 1 |
1 | 1 | 0 |
1 | 1 | 1 |
Hint
See Fall 2019 Final Q9 Q10, Spring 2018 Fall Q17, Fall 2009 Final Q17 Q19, Fall 2008 Final Q3. A perceptron can learn only if the training set is linearly separable (i.e. separated by a plane).
📗 Answer: .
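📗 For a data set this small, the number of items to remove can be estimated by brute force: try many linear threshold units and keep the one that classifies the most items correctly. The sketch below makes a loud assumption: it only searches small integer weights and biases (`weight_range` and `bias_range` are hypothetical defaults), so in general the result is an upper bound on the true minimum. The labels are also hypothetical examples.

```python
from itertools import product


def min_removals(points, labels, weight_range=range(-2, 3), bias_range=range(-6, 7)):
    """Fewest items to drop so that some LTU in the search grid fits the rest."""
    best = 0
    for w in product(weight_range, repeat=len(points[0])):
        for b in bias_range:
            # Count the items this LTU (w^T x + b >= 0) classifies correctly.
            correct = sum(
                (1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else 0) == y
                for x, y in zip(points, labels)
            )
            best = max(best, correct)
    return len(points) - best


# All eight binary inputs (x1, x2, x3), in the same order as the table.
points = list(product([0, 1], repeat=3))

# Hypothetical label columns (assumptions, not the generated question).
or_labels = [int(any(x)) for x in points]    # linearly separable
xor_labels = [x[0] ^ x[1] for x in points]   # x1 XOR x2, not separable

print(min_removals(points, or_labels))
print(min_removals(points, xor_labels))
```

For the XOR labels, each slice with fixed \(x_{3}\) is a two-input XOR, on which no LTU gets more than 3 of 4 items right, so at least 2 items must be removed.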
📗 [3 points] What is the minimum zero-one cost of a binary (y is either 0 or 1) linear (threshold) classifier (for example, LTU perceptron) on the following data set?
\(x_{i}\) | 1 | 2 | 3 | 4 | 5 | 6
\(y_{i}\) | | | | | |
📗 A linear classifier here is a vertical line (a threshold on \(x\)) that separates the two classes: you want to draw the line such that the fewest mistakes are made (i.e. the zero-one cost is the smallest).
Hint
The minimum zero-one cost is \(C = \displaystyle\min_{t} \displaystyle\sum_{i=1}^{n} 1_{\left\{\hat{y}_{i} \neq y_{i}\right\}}\), where \(\hat{y}_{i}\) is the prediction of the classifier with threshold \(t\), and the minimum is also taken over the two forms below. A linear classifier with threshold \(t\) is either in form (1) \(\hat{y}_{i} = 0\) when \(x_{i} \leq t\) and \(\hat{y}_{i} = 1\) when \(x_{i} > t\) or (2) \(\hat{y}_{i} = 1\) when \(x_{i} \leq t\) and \(\hat{y}_{i} = 0\) when \(x_{i} > t\). For this question, you can try \(t = 0, 1, 2, 3, 4, 5, 6, 7\) and check which one leads to the smallest zero-one cost.
📗 Answer: .
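📗 The threshold search described in the hint can be sketched as follows. The function name `min_zero_one_cost` and the label vectors are hypothetical; the \(y_{i}\) values for your ID will differ.

```python
def min_zero_one_cost(xs, ys):
    """Smallest number of mistakes over all thresholds t and both forms."""
    best = len(xs)
    # Integer thresholds from just below the smallest x to just above the largest.
    for t in range(min(xs) - 1, max(xs) + 2):
        # Form (1): predict 0 when x <= t and 1 when x > t.
        mistakes = sum((1 if x > t else 0) != y for x, y in zip(xs, ys))
        # Form (2) flips every prediction, so it makes len(xs) - mistakes errors.
        best = min(best, mistakes, len(xs) - mistakes)
    return best


xs = [1, 2, 3, 4, 5, 6]
ys = [0, 1, 0, 0, 1, 1]  # hypothetical labels, for illustration only
print(min_zero_one_cost(xs, ys))
```

With these example labels, the threshold \(t = 4\) in form (1) misclassifies only the item at \(x = 2\).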
📗 [3 points] Let \(x = \left(x_{1}, x_{2}, x_{3}\right)\). We want to minimize the objective function \(f\left(x\right)\) = using gradient descent. Let the stepsize \(\eta\) = . If we start at the vector \(x^{\left(0\right)}\) = , what is the next vector \(x^{\left(1\right)}\) produced by gradient descent?
Hint
See Fall 2017 Final Q15, Fall 2010 Final Q5, Fall 2006 Midterm Q11, Fall 2005 Midterm Q5. The gradient descent formula using the notations in this question is: \(x^{\left(1\right)} = x^{\left(0\right)} - \eta \nabla f\left(x^{\left(0\right)}\right)\) where \(\nabla f\left(x^{\left(0\right)}\right) = \begin{bmatrix} \dfrac{\partial f}{\partial x^{\left(0\right)}_{1}} \\ \dfrac{\partial f}{\partial x^{\left(0\right)}_{2}} \\ \dfrac{\partial f}{\partial x^{\left(0\right)}_{3}} \end{bmatrix}\).
📗 Answer (comma separated vector): .
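📗 A single gradient descent step from the hint can be sketched as below. The objective \(f\left(x\right) = x_{1}^{2} + 2 x_{2}^{2} + 3 x_{3}^{2}\), the starting point and the step size are all assumptions chosen for illustration; `gradient_step` and `grad_f` are hypothetical names.

```python
def gradient_step(x, grad, eta):
    """Return x - eta * grad(x), computed component by component."""
    g = grad(x)
    return [xi - eta * gi for xi, gi in zip(x, g)]


def grad_f(x):
    """Gradient of the example objective f(x) = x1^2 + 2*x2^2 + 3*x3^2."""
    return [2 * x[0], 4 * x[1], 6 * x[2]]


# Hypothetical starting point and step size.
x_next = gradient_step([1.0, 1.0, 1.0], grad_f, eta=0.25)
print(x_next)
```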
📗 [3 points] We use gradient descent to find the minimum of the function \(f\left(x\right)\) = with step size \(\eta > 0\). If we start from the point \(x_{0}\) = , how small should \(\eta\) be so we make progress in the first iteration? Check all values of \(\eta\) that make progress.
Hint
See Fall 2017 Final Q7, Fall 2014 Midterm Q17, Fall 2013 Final Q10. The minimum is 0, so "making progress" means getting closer to 0 in the first iteration. The gradient descent formula using the notations in this question is: \(x_{1} = x_{0} - \eta f'\left(x_{0}\right)\). The learning rate \(\eta\) that makes progress should satisfy \(\left| x_{0} - \eta f'\left(x_{0}\right) \right| < \left| x_{0} \right|\).
📗 The green point is the current \(x_{0}\). You can change the \(\eta\) values using the slider and see the x value in the next iteration as the red point and check whether it gets closer to the minimum.
📗 Choices:
None of the above
📗 Calculator: .
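📗 The progress condition \(\left| x_{0} - \eta f'\left(x_{0}\right) \right| < \left| x_{0} \right|\) from the hint can be tested for each candidate step size. In this sketch the objective \(f\left(x\right) = x^{4}\) (minimized at 0), the starting point and the candidate \(\eta\) values are assumptions, and `makes_progress` is a hypothetical helper name.

```python
def makes_progress(x0, f_prime, eta):
    """True when one gradient step moves x strictly closer to the minimizer 0."""
    x1 = x0 - eta * f_prime(x0)
    return abs(x1) < abs(x0)


def f_prime(x):
    """Derivative of the example objective f(x) = x^4."""
    return 4 * x ** 3


x0 = 2.0  # hypothetical starting point
results = {eta: makes_progress(x0, f_prime, eta) for eta in [0.01, 0.1, 0.125, 0.5]}
print(results)
```

Here \(f'\left(2\right) = 32\), so the condition holds exactly when \(0 < \eta < 0.125\): a step that overshoots 0 still counts as progress as long as it lands closer to 0 than where it started.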
📗 [3 points] Let \(g\left(z\right) = \dfrac{1}{1 + \exp\left(-z\right)}, z = w^\top x = w_{1} x_{1} + w_{2} x_{2} + ... + w_{d} x_{d}\), \(d\) = be a sigmoid perceptron with inputs \(x_{1} = ... = x_{d}\) = and weights \(w_{1} = ... = w_{d}\) = . There is no bias term. If the desired output is \(y\) = , and the sigmoid perceptron update rule has a learning rate of \(\alpha\) = , what will happen after one step of update? Each \(w_{i}\) will change by (enter a number, positive for increase and negative for decrease).
Hint
See Fall 2016 Final Q15, Fall 2011 Midterm Q11. The change for each \(w_{i}\) is \(-\alpha \left(a - y\right) x_{i}\) where \(a = g\left(z\right), z = w^\top x\). There is no bias added to the \(z\) term here.
📗 Answer: .
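📗 The weight change \(-\alpha \left(a - y\right) x_{i}\) from the hint can be computed directly when all inputs share one value and all weights share one value. The function name `sigmoid_weight_change` and the numbers below are hypothetical.

```python
import math


def sigmoid_weight_change(d, x_val, w_val, y, alpha):
    """Change applied to each w_i when all d inputs equal x_val and all weights equal w_val."""
    # z = w^T x with no bias term: d identical products w_val * x_val.
    z = d * w_val * x_val
    a = 1 / (1 + math.exp(-z))
    return -alpha * (a - y) * x_val


# Hypothetical values: with w = 0, z = 0 and the activation is exactly 0.5.
change = sigmoid_weight_change(d=4, x_val=1, w_val=0, y=1, alpha=0.1)
print(change)
```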
📗 [2 points] Consider a rectified linear unit (ReLU) with input \(x\) and a bias term. The output can be written as \(y\) = . Here, the weight is and the bias is . Write down the input value \(x\) that produces a specific output \(y\) = .
📗 The red curve is a plot of the activation function. Given the y-value of the green point, the question asks for its x-value.
Hint
See Fall 2017 Final Q23. If \(y > 0\), there is a unique \(x\) that solves \(y = \displaystyle\max\left(0, w_{0} + w_{1} x\right) = w_{0} + w_{1} x\). If \(y < 0\), there is no \(x\) that solves the expression. If \(y = 0\), the set of \(x\) that solves the expression is given by \(0 \geq w_{0} + w_{1} x\); you can find the largest and smallest value of this set.
📗 Answer: .
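📗 The three cases in the hint can be sketched as a small inverse function. This is an illustration under assumptions: `relu_inverse` is a hypothetical name, the weight \(w_{1}\) is assumed to be non-zero, and for \(y = 0\) the sketch returns only the boundary point \(x = -w_{0} / w_{1}\) of the solution set.

```python
def relu_inverse(y, w0, w1):
    """Return an input x with max(0, w0 + w1 * x) == y, assuming w1 != 0."""
    if y < 0:
        # A ReLU never outputs a negative value.
        raise ValueError("no input produces a negative ReLU output")
    if y > 0:
        # Unique solution of y = w0 + w1 * x.
        return (y - w0) / w1
    # y == 0: every x with w0 + w1 * x <= 0 works; return the boundary point.
    return -w0 / w1


# Hypothetical bias and weight.
print(relu_inverse(5, w0=1, w1=2))
print(relu_inverse(0, w0=1, w1=2))
```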
📗 [2 points] Consider a single sigmoid perceptron with bias weight \(w_{0}\) = , a single input \(x_{1}\) with weight \(w_{1}\) = , and the sigmoid activation function \(g\left(z\right) = \dfrac{1}{1 + \exp\left(-z\right)}\). For what input \(x_{1}\) does the perceptron output value \(a\) = .
📗 The red curve is a plot of the activation function. Given the y-value of the green point, the question asks for its x-value.
Hint
See Fall 2012 Final Q8, Fall 2014 Midterm Q16. Using the notations in this question: \(z = w_{0} + w_{1} x_{1}\) is the linear part, and \(a\) is the output or activation in the lectures. There should be a unique \(x_{1}\) that satisfies the expression \(\dfrac{1}{1 + \exp\left(-\left(w_{0} + w_{1} x_{1}\right)\right)} = a\).
📗 Note: Math.js does not accept "ln(...)", please use "log(...)" instead.
📗 Answer: .
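📗 Solving \(\dfrac{1}{1 + \exp\left(-\left(w_{0} + w_{1} x_{1}\right)\right)} = a\) for \(x_{1}\) gives \(x_{1} = \dfrac{\log\left(a / \left(1 - a\right)\right) - w_{0}}{w_{1}}\), where \(\log\) is the natural logarithm. A minimal Python sketch follows, assuming \(0 < a < 1\) and \(w_{1} \neq 0\); the function names and numeric values are hypothetical.

```python
import math


def sigmoid(z):
    return 1 / (1 + math.exp(-z))


def sigmoid_inverse(a, w0, w1):
    """Input x1 for which sigmoid(w0 + w1 * x1) == a (needs 0 < a < 1, w1 != 0)."""
    z = math.log(a / (1 - a))  # math.log is the natural logarithm
    return (z - w0) / w1


# Hypothetical values; plug the result back in to confirm the round trip.
x1 = sigmoid_inverse(0.8, w0=-1, w1=3)
print(x1, sigmoid(-1 + 3 * x1))
```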
📗 [3 points] Which functions are (weakly) convex on \(\mathbb{R}\)?
Hint
See Fall 2014 Final Q4, Fall 2005 Midterm Q6. Either plot the functions or find the ones with non-negative second derivative (i.e. positive semi-definite Hessian matrix in higher dimensions). For twice-differentiable functions, weakly convex functions are exactly the ones with weakly positive (i.e. non-negative) second derivative everywhere. A strictly positive second derivative everywhere is sufficient for strict convexity, but not necessary: \(x^{4}\) is strictly convex even though its second derivative is 0 at \(x = 0\). You can see the formal definitions on Wikipedia.
📗 You can plot an expression of x: using from to .
📗 Choices:
None of the above
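📗 The second-derivative test in the hint can be approximated numerically with a central second difference. This is only a heuristic sketch: it samples a finite interval (the bounds, step count, `h` and `tol` are hypothetical defaults), so it can rule a function out but cannot prove convexity on all of \(\mathbb{R}\). The candidate functions are examples, not the choices generated for your ID.

```python
import math


def looks_convex(f, lo=-5.0, hi=5.0, steps=1000, h=1e-3, tol=1e-6):
    """Heuristic: the second difference of f is non-negative at every sampled point."""
    for i in range(steps + 1):
        x = lo + (hi - lo) * i / steps
        # Central second difference approximates f''(x).
        second = (f(x + h) - 2 * f(x) + f(x - h)) / (h * h)
        if second < -tol:
            return False
    return True


# Example candidates (assumptions for illustration).
print(looks_convex(lambda x: x ** 2))   # f'' = 2 everywhere
print(looks_convex(lambda x: x ** 3))   # f'' = 6x is negative for x < 0
print(looks_convex(math.exp))           # f'' = exp(x) > 0
```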
📗 [1 point] Please enter any comments and suggestions, including possible mistakes and bugs with the questions and the auto-grading, and materials relevant to solving the questions that you think are not covered well during the lectures. If you have no comments, please enter "None": do not leave it blank.
📗 Please do not modify the content in the above text field: use the "Grade" button to update.
📗 Please wait for the message "Successful submission." to appear after the "Submit" button. If there is an error message or no message appears after 10 seconds, please save the text in the above text box to a file using the button or copy and paste it into a file yourself and submit it to Canvas Assignment M2. You could submit multiple times (but please do not submit too often): only the latest submission will be counted.
📗 You could load your answers from the text (or txt file) in the text box below using the button . The first two lines should be "##m: 2" and "##id: your id", and the format of the remaining lines should be "##1: your answer to question 1" newline "##2: your answer to question 2", etc. Please make sure that your answers are loaded correctly before submitting them.
📗 Some of the past exams referenced in the Hints can be found on Professor Zhu, Professor Liang and Professor Dyer's websites: Link, and Link.
📗 Some of the questions are from last year, and I recorded videos going through them. The links are at the bottom of the Week 1 to Week 8 pages, for example: W4 and W8.
📗 The links to the solutions the students volunteered to share on Piazza will be collected in this post around the official due date: Link.