Young Wu's Homepage

# Midterm M2A

📗 Enter your ID (the wisc email ID without @wisc.edu) here: and click

📗 The same ID should generate the same set of questions. Your answers are not saved when you close the browser. You could print the page, solve the problems, then enter all your answers at the end.

📗 Please do not refresh the page: your answers will not be saved. You can save and load your answers (only fill-in-the-blank questions) using the buttons at the bottom of the page.

# Warning: please enter your ID before you start!

# Question 1

# Question 2

# Question 3

# Question 4

# Question 5

# Question 6

# Question 7

# Question 8

# Question 9

# Question 10

📗 [4 points] Given a linear SVM (Support Vector Machine) that perfectly classifies a set of training data containing positive examples and negative examples, what is the maximum possible number of training examples that could be removed and still produce the exact same SVM as derived for the original training set?

Hint

See Fall 2019 Final Q7 Q8.
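One way to reason about this question: a hard-margin linear SVM is determined only by its support vectors, and in the best case there can be as few as two of them (one from each class). The sketch below encodes that counting argument; the function name and the example counts are hypothetical, not values from the question.

```python
def max_removable(num_positive: int, num_negative: int) -> int:
    """Largest number of points that could be removed without changing the SVM,
    assuming the best case of exactly two support vectors (one per class)."""
    if num_positive < 1 or num_negative < 1:
        raise ValueError("need at least one example of each class")
    return num_positive + num_negative - 2


# Hypothetical counts, for illustration only.
print(max_removable(10, 15))
```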

📗 Answer: .

📗 [4 points] You are given a training set of five points and their 2-class classifications (+ or -): (, +), (, +), (, -), (, -), (, -). What is the decision boundary associated with this training set using 3NN (3 Nearest Neighbor)?

Hint

See Spring 2017 Midterm Q6. The decision boundary is the threshold such that all points to its left are classified as positive and all points to its right are classified as negative. The threshold should be equidistant from the first and fourth points (i.e. the midpoint between the first and fourth points).
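The midpoint rule in the hint can be checked with a brute-force 3NN classifier. This is a minimal sketch that assumes the five points are sorted in increasing order with labels +, +, -, -, -; the coordinates are made up, and `knn_label` is a hypothetical helper.

```python
from collections import Counter


def knn_label(query, points, labels, k=3):
    """Majority label among the k training points closest to `query` (1D)."""
    order = sorted(range(len(points)), key=lambda i: abs(points[i] - query))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Hypothetical sorted training set, for illustration only.
points = [0.0, 1.0, 2.0, 5.0, 7.0]
labels = ["+", "+", "-", "-", "-"]

# Midpoint between the first and fourth points, as described in the hint.
threshold = (points[0] + points[3]) / 2
print(threshold)
print(knn_label(threshold - 0.1, points, labels))
print(knn_label(threshold + 0.1, points, labels))
```

Just left of the threshold the three nearest neighbors are the first three points (two positives); just right of it the first point is replaced by the fourth, so the vote flips to negative.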

📗 Answer: .

📗 [2 points] In a corpus with word tokens, the phrase "Fort Night" appeared times (not Fortnite). In particular, "Fort" appeared times and "Night" appeared . If we estimate probability by frequency (the maximum likelihood estimate) without smoothing, what is the estimated probability of P(Night | Fort)?

Hint

See Fall 2017 Midterm Q7, Fall 2016 Final Q4. The maximum likelihood estimate of \(\mathbb{P}\left\{B | A\right\} = \dfrac{\mathbb{P}\left\{A B\right\}}{\mathbb{P}\left\{A\right\}}\) is \(\dfrac{n_{A B}}{n_{A}}\).
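The ratio \(\dfrac{n_{A B}}{n_{A}}\) can be sketched as below. The counts are hypothetical; note that neither the total number of word tokens nor the count of "Night" enters the formula.

```python
def bigram_mle(count_ab: int, count_a: int) -> float:
    """Unsmoothed maximum likelihood estimate of P(B | A) = n_AB / n_A."""
    if count_a <= 0:
        raise ValueError("the first word must appear at least once")
    return count_ab / count_a


# Hypothetical counts: "Fort Night" appears 20 times, "Fort" appears 50 times.
print(bigram_mle(20, 50))
```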

📗 Answer: .

📗 [4 points] Consider a classification problem with \(n\) = classes \(y \in \left\{1, 2, ..., n\right\}\), and two binary features \(x_{1}, x_{2} \in \left\{0, 1\right\}\). Suppose \(\mathbb{P}\left\{Y = y\right\}\) = , \(\mathbb{P}\left\{X_{1} = 1 | Y = y\right\}\) = , \(\mathbb{P}\left\{X_{2} = 1 | Y = y\right\}\) = . Which class will the naive Bayes classifier predict for a test item with \(X_{1}\) = and \(X_{2}\) = ?

Hint

See Fall 2016 Final Q18, Fall 2011 Midterm Q20. Use the Bayes rule: \(\mathbb{P}\left\{Y = y | X_{1} = x_{1}, X_{2} = x_{2}\right\} = \dfrac{\mathbb{P}\left\{X_{1} = x_{1}, X_{2} = x_{2} | Y = y\right\} \mathbb{P}\left\{Y = y\right\}}{\displaystyle\sum_{y'=1}^{n} \mathbb{P}\left\{X_{1} = x_{1}, X_{2} = x_{2} | Y = y'\right\} \mathbb{P}\left\{Y = y'\right\}}\), which is equal to \(\dfrac{\mathbb{P}\left\{X_{1} = x_{1} | Y = y\right\} \mathbb{P}\left\{X_{2} = x_{2} | Y = y\right\} \mathbb{P}\left\{Y = y\right\}}{\displaystyle\sum_{y'=1}^{n} \mathbb{P}\left\{X_{1} = x_{1} | Y = y'\right\} \mathbb{P}\left\{X_{2} = x_{2} | Y = y'\right\} \mathbb{P}\left\{Y = y'\right\}}\), due to the conditional independence assumption of naive Bayes. For Bayesian networks that are not naive, the second equality does not hold. The naive Bayes classifier selects the \(y\) that maximizes \(\mathbb{P}\left\{Y = y | X_{1} = x_{1}, X_{2} = x_{2}\right\}\). Since the denominators of these probabilities are the same for every \(y\), and the prior probability is constant, the classifier is effectively selecting the \(y\) that maximizes \(\mathbb{P}\left\{X_{1} = x_{1} | Y = y\right\} \mathbb{P}\left\{X_{2} = x_{2} | Y = y\right\}\), which is a function of \(y\). You can try different values of \(y\) to find the maximizer, or use the first derivative condition if the number of classes is large (i.e. compare the integers near the places where the first derivative is zero and the end points).
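The "try different values of \(y\)" approach can be sketched as follows. All probability tables below are hypothetical stand-ins for the values in the question, and `naive_bayes_predict` is an illustrative helper; the prior is kept in the score so the sketch also works when it is not constant.

```python
def naive_bayes_predict(priors, p_x1, p_x2, x1, x2):
    """Return (best class, scores) for two binary features.

    priors[y] = P(Y = y), p_x1[y] = P(X1 = 1 | Y = y), p_x2[y] = P(X2 = 1 | Y = y).
    The shared denominator of the Bayes rule is skipped, so the scores are
    unnormalized posteriors.
    """
    def likelihood(p_one, x):
        return p_one if x == 1 else 1.0 - p_one

    scores = {
        y: priors[y] * likelihood(p_x1[y], x1) * likelihood(p_x2[y], x2)
        for y in priors
    }
    return max(scores, key=scores.get), scores


# Hypothetical tables for n = 3 classes with a uniform prior.
priors = {1: 1 / 3, 2: 1 / 3, 3: 1 / 3}
p_x1 = {1: 0.2, 2: 0.5, 3: 0.9}
p_x2 = {1: 0.7, 2: 0.6, 3: 0.1}

best, scores = naive_bayes_predict(priors, p_x1, p_x2, x1=1, x2=0)
print(best, scores)
```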

📗 Answer: .

📗 [4 points] What is the gradient magnitude of the center element (pixel) of the image . Use the x gradient filter: \(\begin{bmatrix} -1 & 0 & 1 \end{bmatrix}\), and the y gradient filter: \(\begin{bmatrix} -1 \\ 0 \\ 1 \end{bmatrix}\). Remember to flip the filters.

Hint
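A minimal sketch of the computation, assuming a small image given as a list of rows with an odd number of rows and columns. Flipping \(\begin{bmatrix} -1 & 0 & 1 \end{bmatrix}\) gives \(\begin{bmatrix} 1 & 0 & -1 \end{bmatrix}\), so at the center the x gradient is (left neighbor minus right neighbor) and the y gradient is (top neighbor minus bottom neighbor). The image values are hypothetical.

```python
import math


def center_gradient_magnitude(image):
    """Gradient magnitude at the center pixel using the flipped filters."""
    r = len(image) // 2
    c = len(image[0]) // 2
    gx = image[r][c - 1] - image[r][c + 1]  # flipped x filter: [1, 0, -1]
    gy = image[r - 1][c] - image[r + 1][c]  # flipped y filter: [1, 0, -1]^T
    return math.hypot(gx, gy)


# Hypothetical 3 x 3 image, for illustration only.
image = [
    [1, 9, 2],
    [7, 0, 4],
    [3, 5, 6],
]
print(center_gradient_magnitude(image))
```

Flipping only changes the sign of each gradient component, so the magnitude is the same with or without the flip.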

📗 Answer: .

📗 [4 points] If \(\mathbb{P}\left\{A | B\right\}\) is times the value of \(\mathbb{P}\left\{B | A\right\}\), and \(\mathbb{P}\left\{A\right\}\) = , what is \(\mathbb{P}\left\{B\right\}\)?

Hint

See Fall 2013 Final Q11, Fall 2012 Midterm Q9, Fall 2011 Midterm Q14.
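One possible route: write the given ratio as \(k\), so \(\mathbb{P}\left\{A | B\right\} = k \mathbb{P}\left\{B | A\right\}\). The Bayes rule gives \(\mathbb{P}\left\{A | B\right\} \mathbb{P}\left\{B\right\} = \mathbb{P}\left\{B | A\right\} \mathbb{P}\left\{A\right\}\); substituting and cancelling \(\mathbb{P}\left\{B | A\right\}\) (assumed nonzero) leaves \(\mathbb{P}\left\{B\right\} = \dfrac{\mathbb{P}\left\{A\right\}}{k}\). The symbol \(k\) and the numbers below are hypothetical.

```python
def prob_b(ratio: float, prob_a: float) -> float:
    """P(B) when P(A | B) = ratio * P(B | A), assuming P(B | A) > 0."""
    result = prob_a / ratio
    if not 0.0 <= result <= 1.0:
        raise ValueError("inputs do not give a valid probability")
    return result


# Hypothetical values: P(A | B) is 2 times P(B | A), and P(A) = 0.4.
print(prob_b(2, 0.4))
```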

📗 Answer: .

📗 [4 points] Given two items \(x_{1}\) = and \(x_{2}\) = , suppose the feature map for a kernel SVM (Support Vector Machine) is \(\varphi\left(x\right)\) = , what is the kernel (Gram) matrix?

Hint
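Each entry of the kernel (Gram) matrix is a dot product of mapped items, \(K_{i j} = \varphi\left(x_{i}\right)^{T} \varphi\left(x_{j}\right)\). The sketch below uses a made-up feature map and made-up scalar items; `phi` and `gram_matrix` are hypothetical names, and the rows are printed as comma separated lines to match the answer format.

```python
def gram_matrix(items, feature_map):
    """K[i][j] = dot product of feature_map(items[i]) and feature_map(items[j])."""
    features = [feature_map(x) for x in items]
    return [
        [sum(a * b for a, b in zip(fi, fj)) for fj in features]
        for fi in features
    ]


def phi(x):
    """Hypothetical feature map, for illustration only."""
    return (1.0, x, x * x)


K = gram_matrix([1.0, 2.0], phi)
for row in K:
    print(", ".join(str(value) for value in row))
```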

📗 Answer (matrix with multiple lines, each line is a comma separated vector): .

📗 [4 points] Say we use Naive Bayes in an application where there are features represented by variables, each having possible values, and there are classes. How many probabilities must be stored in the CPTs (Conditional Probability Tables) in the Bayesian network for this problem? Do not include probabilities that can be computed from other probabilities.

Hint

See Fall 2019 Final Q27.
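A sketch of the count, assuming the usual naive Bayes structure (the class node is the only parent of every feature node). The class prior needs one fewer entry than the number of classes, and each feature needs one fewer entry than its number of values for every class. The argument names and example numbers are hypothetical.

```python
def naive_bayes_parameter_count(num_features, num_values, num_classes):
    """Number of free probabilities in a naive Bayes network's CPTs."""
    prior_entries = num_classes - 1                  # last P(Y = y) follows from the others
    per_feature = (num_values - 1) * num_classes     # last value follows, for each class
    return prior_entries + num_features * per_feature


# Hypothetical setup: 4 features, 3 values each, 2 classes.
print(naive_bayes_parameter_count(4, 3, 2))
```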

📗 Answer: .

📗 [4 points] A convolutional neural network has an input image of size x that is connected to a convolutional layer that uses a x filter, zero padding of the image, and a stride of 1. There are activation maps. (Here, zero-padding implies that these activation maps have the same size as the input images.) The convolutional layer is then connected to a pooling layer that uses x max pooling with a stride of (non-overlapping, no padding). The pooling layer is then fully connected to an output layer that contains output units. There are no hidden layers between the pooling layer and the output layer. How many different weights must be learned in this whole network, not including any bias?

Hint

See Fall 2019 Final Q15, Spring 2018 Midterm Q8 Q9 Q10 Q11, Fall 2017 Final Q5, Spring 2017 Final Q5, Fall 2017 Midterm Q9, Fall 2017 Midterm Q11. Each k by k filter in the first layer has \(k \times k\) weights, and the number of such filters depends on the number of activation maps in the next layer. The pooling layers do not have weights, but the number of units in the next layer depends on the pooling filter size (non-overlapping pooling reduces the number of units by a factor equal to the number of entries in the pooling filter). The last layer is fully connected, so the number of weights is the product of the number of units in the previous layer and the number of output units.
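The counting in the hint can be sketched as below, under these assumptions: the input has a single channel, the activation maps keep the input size (zero padding, stride 1), the pooling stride equals the pooling size, and the pooling size divides the image size. The function name and all numbers are hypothetical.

```python
def cnn_weight_count(image_size, filter_size, num_maps, pool_size, num_outputs):
    """Weights (no biases) in a conv -> max pool -> fully connected network."""
    if image_size % pool_size != 0:
        raise ValueError("pool size must divide the image size")
    # One k x k filter per activation map (single input channel assumed).
    conv_weights = filter_size * filter_size * num_maps
    # Pooling has no weights; it shrinks each map's side by pool_size.
    pooled_side = image_size // pool_size
    pooled_units = pooled_side * pooled_side * num_maps
    # Every pooled unit connects to every output unit.
    fc_weights = pooled_units * num_outputs
    return conv_weights + fc_weights


# Hypothetical network: 8 x 8 image, 3 x 3 filter, 4 maps, 2 x 2 pooling, 10 outputs.
print(cnn_weight_count(8, 3, 4, 2, 10))
```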

📗 Answer: .

📗 [1 point] Please enter any comments, including possible mistakes and bugs with the questions or your answers. If you have no comments, please enter "None": do not leave it blank.

📗 Answer: .

# Grade

* * * * *

* * * * *

# Warning: remember to submit this on Canvas!

📗 Please copy and paste the text between the *s (not including the *s) and submit it on Canvas, M2A.

📗 Please save a copy as text file using the button or just copy and paste it into a text file.

📗 You could load your answers using the button from the text field:

Last Updated: July 14, 2024 at 9:38 PM