📗 Enter your ID (the wisc email ID without @wisc.edu) here: and click the button to generate your questions.
📗 The same ID should generate the same set of questions. Your answers are not saved when you close the browser. You could print the page, solve the problems, then enter all your answers at the end.
📗 Please do not refresh the page: your answers will not be saved. You can save and load your answers (only fill-in-the-blank questions) using the buttons at the bottom of the page.
📗 [4 points] Consider a linear model \(a_{i} = w^\top x_{i} + b\), with the hinge cost function . The initial weight is \(\begin{bmatrix} w \\ b \end{bmatrix}\) = . What are the updated weight and bias after one stochastic (sub)gradient descent step if the chosen training data point is \(x\) = , \(y\) = ? The learning rate is .
Hint
The derivative of the cost function with respect to the weights given one training data point \(i\) can be computed as \(\dfrac{\partial C}{\partial w_{j}} = \dfrac{\partial C}{\partial a_{i}} \dfrac{\partial a_{i}}{\partial w_{j}}\), where \(\dfrac{\partial C}{\partial a_{i}}\) depends on the function given in the question and \(\dfrac{\partial a_{i}}{\partial w_{j}}\) is \(x_{i j}\) since the activation function is linear. The updated weight \(j\) can be found using the gradient descent formula \(w_{j} - \alpha \dfrac{\partial C}{\partial w_{j}}\). The derivative and update for \(b\) can be computed similarly.
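For example, here is a minimal Python sketch of one such update, assuming the per-example hinge cost \(C_{i} = \max\left\{0, 1 - y_{i} a_{i}\right\}\) with labels in \(\{-1, +1\}\); all numbers below are made-up placeholders, since the actual values are generated per student.

```python
import numpy as np

# Hypothetical placeholder values; the quiz fills in student-specific numbers.
w = np.array([1.0, -1.0])   # initial weights
b = 0.5                     # initial bias
x = np.array([2.0, 1.0])    # chosen training point
y = -1.0                    # label in {-1, +1}
alpha = 0.1                 # learning rate

# Hinge cost for one example: C = max(0, 1 - y * a), where a = w.x + b.
a = w @ x + b
# Subgradient of C with respect to a: -y if the margin y * a is below 1, else 0.
dC_da = -y if y * a < 1 else 0.0

# Chain rule: dC/dw_j = dC/da * x_j, and dC/db = dC/da * 1.
grad_w = dC_da * x
grad_b = dC_da

# One stochastic (sub)gradient descent step.
w_new = w - alpha * grad_w
b_new = b - alpha * grad_b
print(w_new, b_new)
```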
📗 Answer (comma separated vector): .
📗 [2 points] In a three-layer (fully connected) neural network, the first hidden layer contains sigmoid units, the second hidden layer contains units, and the output layer contains units. The input is dimensional. How many weights plus biases does this neural network have? Enter one number.
📗 The above is a diagram of the network; the nodes labelled "1" are the bias units.
Hint
See Fall 2019 Final Q14, Fall 2013 Final Q8, Fall 2006 Final Q17, Fall 2005 Final Q17. Three-layer neural networks have one input layer (same number of units as the input dimension), two hidden layers, and one output layer (usually the same number of units as the number of classes (labels), but in case there are only two classes, the number of units can be 1). We are using the convention of calling neural networks with four layers "three-layer neural networks" because there are only three layers with weights and biases (so we don't count the input layer). The number of weights between two consecutive layers (\(m_{1}\) units in the previous layer, \(m_{2}\) units in the next layer) is \(m_{1} \cdot m_{2}\), and the number of biases is \(m_{2}\).
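For example, a short Python sketch of this count, with made-up layer sizes in place of the generated ones:

```python
# Hypothetical layer sizes; the quiz fills in its own numbers.
input_dim = 4
layer_sizes = [5, 3, 2]   # first hidden, second hidden, output units

total = 0
prev = input_dim
for m in layer_sizes:
    total += prev * m   # weights between consecutive layers: m1 * m2
    total += m          # one bias per unit in the next layer
    prev = m
print(total)
```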
📗 Answer: .
📗 [2 points] Suppose an SVM (Support Vector Machine) has \(w\) = and \(b\) = . What is the actual distance between the two planes defined by \(w^\top x + b = -1\) and \(w^\top x + b = 1\)?
📗 Note: the distance between the two planes is the length of the red line in the diagram; the blue line does not represent the distance between the planes. You may have to rotate the diagram to see it.
Hint
See Fall 2014 Midterm Q14. The distance between the two planes is \(\dfrac{2}{\sqrt{w^\top w}}\).
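For example, a short Python sketch of this computation, with a made-up \(w\) in place of the one given in the question:

```python
import numpy as np

# Hypothetical w; the quiz supplies its own vector (b does not affect the distance).
w = np.array([3.0, 4.0])

# Distance between the planes w.x + b = -1 and w.x + b = 1 is 2 / ||w||.
distance = 2.0 / np.sqrt(w @ w)
print(distance)   # 2 / 5 = 0.4 for this w
```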
📗 Answer: .
📗 [3 points] Statistically, cats are often hungry around 6:00 am (I am making this up). At that time, a cat is hungry of the time (C = 1), and not hungry of the time (C = 0). What is the entropy of the binary random variable C? Reminder: the log base 2 of x can be computed as log(x) / log(2) or log2(x).
Hint
See Fall 2014 Midterm Q10, Fall 2006 Final Q11, Fall 2005 Final Q11. The entropy formula is \(H = -p_{1} \log_{2}\left(p_{1}\right) - p_{2} \log_{2}\left(p_{2}\right)\).
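For example, a short Python sketch of the entropy computation, with a made-up probability in place of the one given in the question:

```python
import math

# Hypothetical probability that the cat is hungry; the quiz supplies its own value.
p = 0.8
probs = [p, 1 - p]

# H = -sum_i p_i * log2(p_i), with the convention 0 * log2(0) = 0.
H = -sum(q * math.log2(q) for q in probs if q > 0)
print(H)   # about 0.7219 for p = 0.8
```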
📗 Answer: .
📗 [4 points] Say we have a training set consisting of positive examples and negative examples, where each example is a point in a two-dimensional, real-valued feature space. What will the classification accuracy be on the training set with NN (Nearest Neighbor)?
Hint
See Spring 2017 Midterm Q6, Fall 2014 Final Q19.
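For example, a short Python sketch that checks training-set accuracy, assuming 1-NN (the question may specify a different number of neighbors) and made-up 2D points; with 1-NN, each training point is its own nearest neighbor at distance zero.

```python
import numpy as np

# Hypothetical 2D training points and labels; the quiz generates its own counts.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
y = np.array([1, 1, 0, 0])

correct = 0
for i in range(len(X)):
    d = np.linalg.norm(X - X[i], axis=1)   # distances to every training point
    nearest = np.argmin(d)                 # the point itself, at distance 0
    correct += (y[nearest] == y[i])
print(correct / len(X))   # 1.0 unless identical points carry different labels
```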
📗 Answer: .
📗 [4 points] Given the following neural network that classifies all the training instances correctly, what are the labels (0 or 1) of the training data? The activation functions are LTU for all units: \(1_{\left\{z \geq 0\right\}}\). The first layer weight matrix is , with bias vector , and the second layer weight vector is , with bias .
| \(x_{i1}\) | \(x_{i2}\) | \(y_{i}\) or \(a^{\left(2\right)}_{1}\) |
| --- | --- | --- |
| 0 | 0 | ? |
| 0 | 1 | ? |
| 1 | 0 | ? |
| 1 | 1 | ? |
Note: if the weights are not shown clearly, you can move the nodes around with a mouse or by touch.
Hint
See Fall 2010 Final Q17. First compute the hidden layer units: \(h_{j} = 1_{\left\{w^{\left(1\right)}_{1j} x_{1} + w^{\left(1\right)}_{2j} x_{2} + b_{j} \geq 0\right\}}\), then compute the outputs (which are equal to the training data labels): \(y = o_{1} = 1_{\left\{w^{\left(2\right)}_{1} h_{1} + w^{\left(2\right)}_{2} h_{2} + b \geq 0\right\}}\). Repeat the computations for \(\left(x_{1}, x_{2}\right) = \left(0, 0\right), \left(0, 1\right), \left(1, 0\right), \left(1, 1\right)\).
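For example, a short Python sketch of this forward pass, with made-up weights and biases in place of the ones shown in the diagram:

```python
import numpy as np

# Hypothetical weights and biases; the quiz displays its own values in the diagram.
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])          # first-layer weights, W1[j, k] from input j to hidden unit k
b1 = np.array([-0.5, -1.5])          # first-layer biases
w2 = np.array([1.0, -2.0])           # second-layer weights
b2 = -0.5                            # second-layer bias

def ltu(z):
    # Linear threshold unit: 1 if z >= 0, else 0 (elementwise).
    return (np.asarray(z) >= 0).astype(float)

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    x = np.array(x, dtype=float)
    h = ltu(x @ W1 + b1)             # hidden layer activations
    y = ltu(h @ w2 + b2)             # output = predicted (and true) label
    print(x, int(y))
```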
📗 Answer (comma separated vector): .
📗 [4 points] Some Na'vi don't wear underwear, but they are too embarrassed to admit it. A surveyor wants to estimate that fraction and comes up with the following less embarrassing scheme: upon being asked "do you wear your underwear", a Na'vi flips a fair coin out of sight of the surveyor. If the coin lands heads, the Na'vi agrees to say "Yes"; otherwise the Na'vi agrees to answer the question truthfully. On a very large population, the surveyor hears the answer "Yes" for fraction of the population. What is the estimated fraction of Na'vi that don't wear underwear? Enter a fraction like 0.01 instead of a percentage like 1%.
Hint
See Fall 2011 Midterm Q13. Let the fraction be \(f\); then the fraction of Yes answers would be \(0.5 + 0.5 \cdot \left(1 - f\right)\).
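For example, solving that equation for \(f\) in Python, with a made-up observed fraction of Yes answers in place of the one given in the question:

```python
# Hypothetical observed fraction of "Yes" answers; the quiz supplies its own value.
p_yes = 0.85

# From the hint: p_yes = 0.5 + 0.5 * (1 - f), so solve for f.
f = 2 * (1 - p_yes)
print(f)   # 0.3 for p_yes = 0.85
```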
📗 Answer: .
📗 [2 points] You have a vocabulary with word types. You want to estimate the unigram probability \(p_{w}\) for each word type \(w\) in the vocabulary. In your corpus the total word token count \(\displaystyle\sum_{w} c_{w}\) is , and \(c_{\text{tenet}}\) = . Using add-one smoothing \(\delta\) = (Laplace smoothing), compute \(p_{\text{tenet}}\).
Hint
See Fall 2018 Midterm Q12. The add-\(\delta\) (Laplace) smoothed estimate of \(p_{w}\) is \(\dfrac{c_{w} + \delta}{\displaystyle\sum_{w'} c_{w'} + n \delta}\), where \(n\) is the number of word types in the vocabulary. Setting \(\delta = 0\) recovers the maximum likelihood estimate.
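For example, a short Python sketch of the smoothed estimate, with made-up counts in place of the ones given in the question:

```python
# Hypothetical counts; the quiz supplies its own numbers.
n = 1000               # number of word types in the vocabulary
total_tokens = 50000   # total word token count, sum of c_w over all word types
c_tenet = 3            # count of the word "tenet"
delta = 1              # add-one (Laplace) smoothing

p_tenet = (c_tenet + delta) / (total_tokens + n * delta)
print(p_tenet)
```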
📗 Answer: .
📗 [3 points] Consider the following directed graphical model over binary variables: \(A \to B \to C\). Given the CPTs (Conditional Probability Tables):
| Variable | Probability |
| --- | --- |
| \(\mathbb{P}\left\{A = 1\right\}\) | |
| \(\mathbb{P}\left\{B = 1 \vert A = 1\right\}\) | |
| \(\mathbb{P}\left\{B = 1 \vert A = 0\right\}\) | |
| \(\mathbb{P}\left\{C = 1 \vert B = 1\right\}\) | |
| \(\mathbb{P}\left\{C = 1 \vert B = 0\right\}\) | |
What is \(\mathbb{P}\){ \(A\) = , \(B\) = , \(C\) = }?
Hint
See Fall 2019 Final Q22 Q23 Q24 Q25, Spring 2018 Final Q24 Q25, Fall 2014 Final Q9, Fall 2006 Final Q20, Fall 2005 Final Q20. For any Bayes net, the joint probability can always be computed as the product of the conditional probabilities (conditioned on the parent node variable). For a causal chain \(A \to B \to C\), \(\mathbb{P}\left\{A = a, B = b, C = c\right\} = \mathbb{P}\left\{A = a\right\} \mathbb{P}\left\{B = b | A = a\right\} \mathbb{P}\left\{C = c | B = b\right\}\). For a common cause \(A \leftarrow B \to C\), \(\mathbb{P}\left\{A = a, B = b, C = c\right\} = \mathbb{P}\left\{A = a | B = b\right\} \mathbb{P}\left\{B = b\right\} \mathbb{P}\left\{C = c | B = b\right\}\). For a common effect \(A \to B \leftarrow C\), \(\mathbb{P}\left\{A = a, B = b, C = c\right\} = \mathbb{P}\left\{A = a\right\} \mathbb{P}\left\{B = b | A = a, C = c\right\} \mathbb{P}\left\{C = c\right\}\).
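For example, a short Python sketch of the causal-chain computation, with made-up CPT values in place of the ones given in the question:

```python
# Hypothetical CPT entries for the chain A -> B -> C; the quiz supplies its own values.
p_A1 = 0.3                       # P(A = 1)
p_B1_given_A = {1: 0.8, 0: 0.2}  # P(B = 1 | A = a)
p_C1_given_B = {1: 0.9, 0: 0.4}  # P(C = 1 | B = b)

def p_joint(a, b, c):
    # P(A = a, B = b, C = c) = P(A = a) * P(B = b | A = a) * P(C = c | B = b)
    pa = p_A1 if a == 1 else 1 - p_A1
    pb = p_B1_given_A[a] if b == 1 else 1 - p_B1_given_A[a]
    pc = p_C1_given_B[b] if c == 1 else 1 - p_C1_given_B[b]
    return pa * pb * pc

print(p_joint(1, 0, 1))   # example query; plug in the values of a, b, c from the question
```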