📗 Enter your ID (the wisc email ID without @wisc.edu) here: and click 1,2,3,4,5,6,7,8,9,101
📗 The same ID should generate the same set of questions. Your answers are not saved when you close the browser. You could print the page, solve the problems, then enter all your answers at the end.
📗 Please do not refresh the page: your answers will not be saved. You can save and load your answers (only fill-in-the-blank questions) using the buttons at the bottom of the page.
📗 [3 points] Let \(g\left(z\right) = \dfrac{1}{1 + \exp\left(-z\right)}, z = w^\top x = w_{1} x_{1} + w_{2} x_{2} + ... + w_{d} x_{d}\), \(d\) = be a sigmoid perceptron with inputs \(x_{1} = ... = x_{d}\) = and weights \(w_{1} = ... = w_{d}\) = . There is no bias term. If the desired output is \(y\) = , and the sigmoid perceptron update rule has a learning rate of \(\alpha\) = , what will happen after one step of update? Each \(w_{i}\) will change by (enter a number, positive for increase and negative for decrease).
Hint
See Fall 2016 Final Q15, Fall 2011 Midterm Q11. The change for each \(w_{i}\) is \(-\alpha \left(a - y\right) x_{i}\) where \(a = g\left(z\right), z = w^\top x\). There is no bias added to the \(z\) term here.
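The update in the hint can be sketched in a few lines of Python. The numbers below (\(d = 3\), all inputs 1, all weights 0, \(y = 1\), \(\alpha = 0.5\)) are made-up placeholders, not the values generated for your ID.

```python
import math

def sigmoid(z):
    """Logistic activation g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def weight_change(w, x, y, alpha):
    """Change applied to each w_i after one update: -alpha * (a - y) * x_i."""
    z = sum(wi * xi for wi, xi in zip(w, x))  # no bias term
    a = sigmoid(z)
    return [-alpha * (a - y) * xi for xi in x]

# Hypothetical values for illustration only.
d = 3
delta = weight_change(w=[0.0] * d, x=[1.0] * d, y=1.0, alpha=0.5)
print(delta)
```

With these placeholder values, \(z = 0\) and \(a = 0.5\), so each weight changes by \(-0.5 \cdot (0.5 - 1) \cdot 1 = 0.25\).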
📗 Answer: .
📗 [2 points] In a three-layer (fully connected) neural network, the first hidden layer contains sigmoid units, the second hidden layer contains units, and the output layer contains units. The input is dimensional. How many weights plus biases does this neural network have? Enter one number.
📗 The above is a diagram of the network, the nodes labelled "1" are the bias units.
Hint
See Fall 2019 Final Q14, Fall 2013 Final Q8, Fall 2006 Final Q17, Fall 2005 Final Q17. Three-layer neural networks have one input layer (same number of units as the input dimension), two hidden layers, and one output layer (usually the same number of units as the number of classes (labels), but in case there are only two classes, the number of units can be 1). We are using the convention of calling neural networks with four layers "three-layer neural networks" because there are only three layers with weights and biases (so we don't count the input layer). The number of weights between two consecutive layers (\(m_{1}\) units in the previous layer, \(m_{2}\) units in the next layer) is \(m_{1} \cdot m_{2}\), and the number of biases is \(m_{2}\).
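The counting rule in the hint can be checked with a short Python sketch. The layer sizes used here (4 inputs, hidden layers of 3 and 2 units, 1 output) are hypothetical and only illustrate the arithmetic.

```python
def count_parameters(layer_sizes):
    """Total weights plus biases of a fully connected network.

    layer_sizes lists the number of units per layer, input layer first.
    Each pair of consecutive layers (m1, m2) contributes m1 * m2 weights
    and m2 biases.
    """
    return sum(m1 * m2 + m2 for m1, m2 in zip(layer_sizes, layer_sizes[1:]))

# Hypothetical sizes: input 4, hidden 3, hidden 2, output 1.
total = count_parameters([4, 3, 2, 1])
print(total)
```

For these sizes the total is \((4 \cdot 3 + 3) + (3 \cdot 2 + 2) + (2 \cdot 1 + 1) = 26\).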
📗 Answer: .
📗 [4 points] If \(K\left(x, x'\right)\) is a kernel with induced feature representation \(\varphi\left(x_{0}\right)\) = , and \(G\left(x, x'\right)\) is another kernel with induced feature representation \(\theta\left(x_{0}\right)\) = , then it is known that \(H\left(x, x'\right) = a K\left(x, x'\right) + b G\left(x, x'\right)\), \(a\) = , \(b\) = is also a kernel. What is the induced feature representation of \(H\) for this \(x_{0}\)?
Hint
See Fall 2014 Midterm Q15, Fall 2013 Final Q7, Fall 2011 Midterm Q9. This requires guess and check: suppose the feature representation is \(\begin{bmatrix} \sqrt{a} \varphi\left(x\right) \\ \sqrt{b} \theta\left(x\right) \end{bmatrix}\), then \(H\left(x, x'\right) = \begin{bmatrix} \sqrt{a} \varphi\left(x\right) & \sqrt{b} \theta\left(x\right) \end{bmatrix} \begin{bmatrix} \sqrt{a} \varphi\left(x'\right) \\ \sqrt{b} \theta\left(x'\right) \end{bmatrix} = \sqrt{a} \sqrt{a} \varphi^\top\left(x\right) \varphi\left(x'\right) + \sqrt{b} \sqrt{b} \theta^\top\left(x\right) \theta\left(x'\right) = a K\left(x, x'\right) + b G\left(x, x'\right)\).
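The guess can be verified numerically. This is a minimal Python sketch that assumes \(a, b \geq 0\) (so the square roots are real) and uses made-up feature vectors and coefficients rather than the values generated for your ID.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))

def combined_features(phi, theta, a, b):
    """Stack sqrt(a) * phi on top of sqrt(b) * theta."""
    return [math.sqrt(a) * p for p in phi] + [math.sqrt(b) * t for t in theta]

# Hypothetical feature vectors and coefficients.
phi_x, phi_xp = [1.0, 2.0], [3.0, 4.0]
theta_x, theta_xp = [1.0], [2.0]
a, b = 4.0, 9.0

lhs = dot(combined_features(phi_x, theta_x, a, b),
          combined_features(phi_xp, theta_xp, a, b))
rhs = a * dot(phi_x, phi_xp) + b * dot(theta_x, theta_xp)
print(lhs, rhs)
```

Both sides evaluate to 62 for these placeholder values, matching \(a K + b G = 4 \cdot 11 + 9 \cdot 2\).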
📗 Answer (comma separated vector): .
📗 [4 points] "It" has a house with many doors. A random door is about to be opened with equal probability. Doors to have monsters that eat people. Doors to are safe. With sufficient bribe, Pennywise will answer your question "Will door 1 be opened?" What's the information gain (also called mutual information) between Pennywise's answer and your encounter with a monster?
Hint
See Fall 2012 Final Q5, Fall 2011 Midterm Q5. Compute the information gain from the entropy of encountering a monster (call it \(Y\), where \(Y = 1\) is the event that the opened door has a monster) and the conditional entropy of \(Y\) given Pennywise's answer (call it \(Y | X\), where \(X = 1\) is the event that door 1 is opened). Then the information gain is \(I = H\left(Y\right) - H\left(Y | X\right)\), where \(H\left(Y\right) = -\mathbb{P}\left\{Y = 0\right\} \log_{2}\left(\mathbb{P}\left\{Y = 0\right\}\right) - \mathbb{P}\left\{Y = 1\right\} \log_{2}\left(\mathbb{P}\left\{Y = 1\right\}\right)\) and \(H\left(Y | X\right) = \mathbb{P}\left\{X = 0\right\} H\left(Y | X = 0\right) + \mathbb{P}\left\{X = 1\right\} H\left(Y | X = 1\right)\), where \(H\left(Y | X = 0\right) = -\mathbb{P}\left\{Y = 0 | X = 0\right\} \log_{2}\left(\mathbb{P}\left\{Y = 0 | X = 0\right\}\right) - \mathbb{P}\left\{Y = 1 | X = 0\right\} \log_{2}\left(\mathbb{P}\left\{Y = 1 | X = 0\right\}\right)\) and \(H\left(Y | X = 1\right) = -\mathbb{P}\left\{Y = 0 | X = 1\right\} \log_{2}\left(\mathbb{P}\left\{Y = 0 | X = 1\right\}\right) - \mathbb{P}\left\{Y = 1 | X = 1\right\} \log_{2}\left(\mathbb{P}\left\{Y = 1 | X = 1\right\}\right)\), with the convention \(0 \log_{2}\left(0\right) = 0\). Here, \(\mathbb{P}\left\{Y = 1\right\}\) is the probability that the opened door has a monster (i.e. the number of monster doors divided by the number of doors), \(\mathbb{P}\left\{X = 1\right\}\) is the probability that door 1 is opened (i.e. 1 divided by the number of doors), \(\mathbb{P}\left\{Y = 1 | X = 1\right\}\) is the probability of a monster given that door 1 is opened (which is 1 when door 1 is a monster door), and \(\mathbb{P}\left\{Y = 1 | X = 0\right\}\) is the probability of a monster given that door 1 is not opened (i.e. the number of monster doors other than door 1 divided by the number of doors other than door 1).
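The computation \(I = H\left(Y\right) - H\left(Y | X\right)\) can be sketched in Python as follows. The door counts are hypothetical, and the `door1_has_monster` flag is an assumption added for illustration. Set it according to the ranges generated for your ID. Entropy is written with its usual leading minus sign, so it is non-negative.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) variable, with 0 * log2(0) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def information_gain(n_doors, n_monsters, door1_has_monster=True):
    """I(X; Y) where X = 'door 1 is opened' and Y = 'monster behind opened door'.

    Assumes n_doors >= 2 and a uniformly random door.
    """
    p_y = n_monsters / n_doors
    p_x1 = 1 / n_doors
    p_y_given_x1 = 1.0 if door1_has_monster else 0.0
    monsters_elsewhere = n_monsters - (1 if door1_has_monster else 0)
    p_y_given_x0 = monsters_elsewhere / (n_doors - 1)
    h_cond = (p_x1 * binary_entropy(p_y_given_x1)
              + (1 - p_x1) * binary_entropy(p_y_given_x0))
    return binary_entropy(p_y) - h_cond

# Hypothetical setup: 4 doors, 2 of them (including door 1) hide monsters.
ig = information_gain(n_doors=4, n_monsters=2)
print(round(ig, 4))
```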
📗 Answer: .
📗 [4 points] You have a data set with positive items and negative items. You perform a "leave-one-out" procedure: for each item i, learn a separate kNN (k Nearest Neighbor) classifier on all items except item i, and compute that kNN's accuracy in predicting item i. The leave-one-out accuracy is defined to be the average of the accuracy for each item. What is the leave-one-out accuracy when k = ?
Hint
See Fall 2011 Final Q20.
📗 Answer: .
📗 [4 points] Given the following neural network that classifies all the training instances correctly. What are the labels (0 or 1) of the training data? The activation functions are LTU for all units: \(1_{\left\{z \geq 0\right\}}\). The first layer weight matrix is , with bias vector , and the second layer weight vector is , with bias
| \(x_{i1}\) | \(x_{i2}\) | \(y_{i}\) or \(a^{\left(2\right)}_{1}\) |
| --- | --- | --- |
| 0 | 0 | ? |
| 0 | 1 | ? |
| 1 | 0 | ? |
| 1 | 1 | ? |
Note: if the weights are not shown clearly, you could move the nodes around with mouse or touch.
Hint
See Fall 2010 Final Q17. First compute the hidden layer units: \(h_{j} = 1_{\left\{w^{\left(1\right)}_{1j} x_{1} + w^{\left(1\right)}_{2j} x_{2} + b_{j} \geq 0\right\}}\), then compute the outputs (which are equal to the training data labels): \(y = o_{1} = 1_{\left\{w^{\left(2\right)}_{1} h_{1} + w^{\left(2\right)}_{2} h_{2} + b \geq 0\right\}}\). Repeat the computations for \(\left(x_{1}, x_{2}\right) = \left(0, 0\right), \left(0, 1\right), \left(1, 0\right), \left(1, 1\right)\).
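The forward pass described in the hint can be sketched in Python. The weights below are hypothetical (they happen to implement XOR) and stand in for the values shown in the diagram. `W1[i][j]` is assumed to be the weight from input \(i\) to hidden unit \(j\), matching \(w^{\left(1\right)}_{ij}\) in the hint.

```python
def ltu(z):
    """Linear threshold unit: 1 if z >= 0, else 0."""
    return 1 if z >= 0 else 0

def predict(x, W1, b1, w2, b2):
    """Forward pass through one hidden LTU layer and one LTU output unit."""
    hidden = [
        ltu(sum(W1[i][j] * x[i] for i in range(len(x))) + b1[j])
        for j in range(len(b1))
    ]
    return ltu(sum(w2[j] * hidden[j] for j in range(len(hidden))) + b2)

# Hypothetical weights, not the ones generated for your ID.
W1 = [[1.0, -1.0],   # weights from x1 to h1, h2
      [1.0, -1.0]]   # weights from x2 to h1, h2
b1 = [-0.5, 1.5]
w2 = [1.0, 1.0]
b2 = -1.5

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [predict(x, W1, b1, w2, b2) for x in inputs]
print(labels)
```

For these placeholder weights the labels come out as 0, 1, 1, 0.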
📗 Answer (comma separated vector): .
📗 [4 points] John tells his professor that he forgot to submit his homework assignment. From experience, the professor knows that students who finish their homework on time forget to turn it in with probability . She also knows that of the students who have not finished their homework will tell her they forgot to turn it in. She thinks that of the students in this class completed their homework on time. What is the probability that John is telling the truth (i.e. he finished it given that he forgot to submit it)?
Hint
See Fall 2019 Final Q18 Q19, Fall 2017 Final Q6. Let \(C\) represent finishing (completing) the homework and \(F\) represent forgetting to turn it in. Then the question is asking for \(\mathbb{P}\left\{C | F\right\} = \dfrac{\mathbb{P}\left\{C, F\right\}}{\mathbb{P}\left\{F\right\}} = \dfrac{\mathbb{P}\left\{F | C\right\} \mathbb{P}\left\{C\right\}}{\mathbb{P}\left\{F | C\right\} \mathbb{P}\left\{C\right\} + \mathbb{P}\left\{F | \neg C\right\} \left(1 - \mathbb{P}\left\{C\right\}\right)}\), where the denominator follows from the law of total probability.
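The Bayes rule expression in the hint translates directly into Python. The three probabilities below are made-up placeholders, not the values generated for your ID.

```python
def posterior_completed(p_f_given_c, p_f_given_not_c, p_c):
    """P{C | F} by Bayes rule with the law of total probability."""
    numerator = p_f_given_c * p_c
    denominator = numerator + p_f_given_not_c * (1 - p_c)
    return numerator / denominator

# Hypothetical inputs: P{F | C} = 0.01, P{F | not C} = 0.5, P{C} = 0.9.
p = posterior_completed(p_f_given_c=0.01, p_f_given_not_c=0.5, p_c=0.9)
print(round(p, 4))
```

With these placeholders the result is \(0.009 / 0.059 \approx 0.1525\).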
📗 Answer: .
📗 [4 points] What is the conditional entropy \(H\left(B|A\right)\) for the following set of training examples?
| item | A | B |
| --- | --- | --- |
| 1 | | |
| 2 | | |
| 3 | | |
| 4 | | |
| 5 | | |
| 6 | | |
| 7 | | |
| 8 | | |
Hint
See Fall 2019 Midterm Q28 Q29, Spring 2018 Midterm Q8, Spring 2017 Midterm Q7, Fall 2006 Final Q13, Fall 2006 Midterm Q10.
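As a reminder, the standard definition is \(H\left(B | A\right) = -\displaystyle\sum_{a, b} \mathbb{P}\left\{A = a, B = b\right\} \log_{2}\left(\mathbb{P}\left\{B = b | A = a\right\}\right)\), with probabilities estimated from counts in the table. A minimal Python sketch follows, using eight made-up (A, B) pairs in place of the table generated for your ID.

```python
import math
from collections import Counter

def conditional_entropy(pairs):
    """H(B | A) in bits, estimated from a list of (a, b) observations."""
    n = len(pairs)
    count_a = Counter(a for a, _ in pairs)
    count_ab = Counter(pairs)
    h = 0.0
    for (a, _b), c in count_ab.items():
        # P{A=a, B=b} = c / n and P{B=b | A=a} = c / count_a[a]
        h -= (c / n) * math.log2(c / count_a[a])
    return h

# Hypothetical data: B is a fair coin when A = 0 and constant when A = 1.
examples = [(0, 0), (0, 0), (0, 1), (0, 1), (1, 1), (1, 1), (1, 1), (1, 1)]
h_b_given_a = conditional_entropy(examples)
print(h_b_given_a)
```

For this placeholder data, \(H\left(B | A = 0\right) = 1\) and \(H\left(B | A = 1\right) = 0\), each weighted by \(1/2\), so \(H\left(B | A\right) = 0.5\).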
📗 Answer: .
📗 [4 points] Consider the problem of detecting if an email message contains a virus. Say we use four random variables to model this problem: Boolean (binary) class variable \(V\) indicates if the message contains a virus or not, and three Boolean feature variables: \(A, B, C\). We decide to use a Naive Bayes Classifier to solve this problem so we create a Bayesian network with arcs from \(V\) to each of \(A, B, C\). Their associated CPTs (Conditional Probability Table) are created from the following data: \(\mathbb{P}\left\{V = 1\right\}\) = , \(\mathbb{P}\left\{A = 1 | V = 1\right\}\) = , \(\mathbb{P}\left\{A = 1 | V = 0\right\}\) = , \(\mathbb{P}\left\{B = 1 | V = 1\right\}\) = , \(\mathbb{P}\left\{B = 1 | V = 0\right\}\) = , \(\mathbb{P}\left\{C = 1 | V = 1\right\}\) = , \(\mathbb{P}\left\{C = 1 | V = 0\right\}\) = . Compute \(\mathbb{P}\){ \(A\) = , \(B\) = , \(C\) = }.
Hint
See Spring 2017 Final Q7. Naive Bayes is a special simple Bayesian Network, so the way to compute the joint probabilities is the same (product of conditional probabilities given the parents): \(\mathbb{P}\left\{A = a, B = b, C = c\right\} = \mathbb{P}\left\{A = a, B = b, C = c, V = 0\right\} + \mathbb{P}\left\{A = a, B = b, C = c, V = 1\right\}\) and \(\mathbb{P}\left\{A = a, B = b, C = c, V = v\right\} = \mathbb{P}\left\{A = a | V = v\right\} \mathbb{P}\left\{B = b | V = v\right\} \mathbb{P}\left\{C = c | V = v\right\} \mathbb{P}\left\{V = v\right\}\).
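The marginalization over \(V\) in the hint can be sketched in Python. All CPT entries below are hypothetical. Each dictionary is assumed to map a value \(v\) of \(V\) to the probability that the feature equals 1 given \(V = v\).

```python
def bernoulli(p_one, value):
    """Probability of `value` (0 or 1) when P{value = 1} = p_one."""
    return p_one if value == 1 else 1 - p_one

def marginal(a, b, c, p_v1, p_a1, p_b1, p_c1):
    """P{A=a, B=b, C=c} obtained by summing the joint over V = 0 and V = 1."""
    total = 0.0
    for v in (0, 1):
        total += (bernoulli(p_v1, v)
                  * bernoulli(p_a1[v], a)
                  * bernoulli(p_b1[v], b)
                  * bernoulli(p_c1[v], c))
    return total

# Hypothetical CPTs, not the values generated for your ID.
p_v1 = 0.5
p_a1 = {1: 0.8, 0: 0.2}
p_b1 = {1: 0.5, 0: 0.5}
p_c1 = {1: 0.9, 0: 0.1}

p_abc = marginal(1, 0, 1, p_v1, p_a1, p_b1, p_c1)
print(round(p_abc, 4))
```

For these placeholders, \(\mathbb{P}\left\{A = 1, B = 0, C = 1\right\} = 0.5 \cdot 0.36 + 0.5 \cdot 0.01 = 0.185\). Summing `marginal` over all eight assignments of \(\left(a, b, c\right)\) should give 1, which is a quick sanity check on the CPT entries.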
📗 Answer: .
📗 [1 point] Please enter any comments, including possible mistakes and bugs with the questions or your answers. If you have no comments, please enter "None": do not leave it blank.