# XM1 Exam Part 1 Version B

📗 Enter your ID (your wisc email ID, without @wisc.edu) here, then click the button (or hit the "Enter" key).

📗 You can also load your answers from your saved file and click the button.
📗 If the questions are not generated correctly, try refreshing the page using the button at the top left corner.
📗 The same ID should generate the same set of questions. Your answers are not saved when you close the browser. You could print the page, solve the problems, then enter all your answers at the end. 
📗 Please do not refresh the page: your answers will not be saved.
📗 Please join Zoom for announcements: Link.

# Warning: please enter your ID before you start!


# Question 1

📗 [2 points] The Perceptron algorithm does not terminate (cannot converge) for any learning rate on the following training set. Give an example of \(y_{1} \in \left\{0, 1\right\}\). If there are multiple possible answers, enter only one, and if no such \(y_{1}\) exists, enter \(-1, -1\).

| \(i\) | \(x_{i}\) | \(y_{i}\) |
| --- | --- | --- |
| 1 |  | \(y_{1}\) |
| 2 |  |  |
| 3 |  |  |
| 4 |  |  |

📗 Answer: .
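📗 Note: the Perceptron fails to converge exactly when the training set is not linearly separable, for example when the same \(x\) appears with conflicting labels. A minimal sketch with hypothetical feature values (the actual \(x_{i}\) are generated from your ID), using one common form of the 0/1-label update \(w \leftarrow w - \alpha\left(a - y\right)x\):

```python
# Minimal sketch: Perceptron on a non-separable set (hypothetical values).
# Two identical points with opposite labels force the weights to oscillate.
import numpy as np

X = np.array([[1.0], [1.0]])   # same feature vector twice (assumed values)
y = np.array([1, 0])           # conflicting labels -> not linearly separable

w, b, alpha = 0.0, 0.0, 1.0
for epoch in range(5):
    for xi, yi in zip(X, y):
        a = 1 if w * xi[0] + b >= 0 else 0   # LTU activation
        w -= alpha * (a - yi) * xi[0]        # perceptron update
        b -= alpha * (a - yi)
    print(epoch, w, b)  # the weights keep cycling, never converging
```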
# Question 2

📗 [4 points] Suppose the partial derivative of a cost function \(C\) with respect to some weight \(w\) is given by \(\dfrac{\partial C}{\partial w} = \displaystyle\sum_{i=1}^{n} \dfrac{\partial C_{i}}{\partial w} = \displaystyle\sum_{i=1}^{n} w x_{i}\). Given a data set \(x\) = {} and initial weight \(w\) = , compute and compare the updated weight after 1 step of batch gradient descent and steps of stochastic gradient descent (start with the same initial weight, then use data point 1 in step 1, data point 2 in step 2, ...). Enter two numbers, comma separated, batch gradient descent first. Use the learning rate \(\alpha\) = .
📗 Answer (comma separated vector): .
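📗 Note: batch gradient descent applies one update using the summed gradient, while stochastic gradient descent applies one update per data point with \(w\) changing in between. A sketch with hypothetical values for \(x\), \(w\), and \(\alpha\) (substitute the ones generated for your ID):

```python
# Hypothetical values; substitute the ones generated for your ID.
x = [1.0, 2.0, 3.0]
w0, alpha = 0.5, 0.1

# Batch: one step using the full summed gradient sum_i w * x_i.
w_batch = w0 - alpha * sum(w0 * xi for xi in x)

# Stochastic: one step per data point, with w updated in between.
w_sgd = w0
for xi in x:
    w_sgd -= alpha * (w_sgd * xi)

print(w_batch, w_sgd)
```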
# Question 3

📗 [3 points] Suppose there are three classifiers \(f_{1}, f_{2}, f_{3}\) to choose from (i.e. the hypothesis space has three elements), and the activation values from these classifiers based on a training set of three items are listed below. Which classifier is the best one if loss is used for comparison? (Enter a number 1 or 2 or 3.)
📗 Reminder: zero-one loss means \(\displaystyle\sum_{i=1}^{n} 1_{\left\{a_{i} \neq y_{i}\right\}}\), square loss means \(\displaystyle\sum_{i=1}^{n} \left(a_{i} - y_{i}\right)^{2}\), cross entropy loss means \(-\displaystyle\sum_{i=1}^{n} \left(y_{i} \log\left(a_{i}\right) + \left(1 - y_{i}\right) \log\left(1 - a_{i}\right)\right)\).

| Items | 1 | 2 | 3 |
| --- | --- | --- | --- |
| \(y\) |  |  |  |
| \(f_{1}\) |  |  |  |
| \(f_{2}\) |  |  |  |
| \(f_{3}\) |  |  |  |

📗 Answer: .
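📗 Note: whichever loss is generated for you, the comparison is the same: evaluate it on each classifier's activations and pick the smallest. A sketch with hypothetical labels and activations (for zero-one loss, thresholding the activation at \(0.5\) is one common convention):

```python
# Sketch: compare classifiers under each loss (hypothetical activations).
import math

y = [1, 0, 1]                      # assumed labels
f = {1: [0.9, 0.4, 0.6],           # assumed activations per classifier
     2: [0.7, 0.1, 0.8],
     3: [0.6, 0.5, 0.9]}

def zero_one(a):   return sum(round(ai) != yi for ai, yi in zip(a, y))
def square(a):     return sum((ai - yi) ** 2 for ai, yi in zip(a, y))
def cross_ent(a):  return -sum(yi * math.log(ai) + (1 - yi) * math.log(1 - ai)
                               for ai, yi in zip(a, y))

for k, a in f.items():
    print(k, zero_one(a), square(a), cross_ent(a))  # pick the argmin row
```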
# Question 4

📗 [3 points] Given the following bigram (Markov) transition matrix , with the rows (and columns) representing the word tokens : what is the probability that, given we start with , we get , where the sequence is repeated times and there are a total of words including the initial word?
📗 Answer: .
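📗 Note: the probability of a specific word sequence is the product of one transition probability per adjacent pair of words. A sketch with a hypothetical matrix and sequence:

```python
# Sketch: probability of a specific word sequence under a bigram model.
# Transition matrix and sequence are hypothetical; use the ones shown above.
import numpy as np

T = np.array([[0.6, 0.4],    # assumed P(w_t = j | w_{t-1} = i)
              [0.3, 0.7]])
seq = [0, 1, 0, 1, 0]        # assumed token indices, initial word first

p = 1.0
for prev, cur in zip(seq, seq[1:]):
    p *= T[prev, cur]        # multiply one transition per adjacent pair
print(p)
```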
# Question 5

📗 [3 points] Suppose the only two support vectors in a data set are with label and with label . What is the margin of a hard-margin SVM (support vector machine) trained on this data set?
📗 Answer: .
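📗 Note: with exactly two support vectors on opposite sides, the decision boundary is the perpendicular bisector of the segment between them, so under the convention that the margin is the width between the two margin hyperplanes it equals the distance between the two points. A sketch with hypothetical support vectors:

```python
# Sketch: margin from two opposite-label support vectors (hypothetical points).
import numpy as np

x_plus  = np.array([1.0, 2.0])   # assumed support vector with label +1
x_minus = np.array([3.0, 4.0])   # assumed support vector with label -1

margin = np.linalg.norm(x_plus - x_minus)  # width between margin hyperplanes
print(margin)
```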
# Question 6

📗 [3 points] Suppose there are \(2\) discrete features \(x_{1}, x_{2}\) that can take on values and , and a binary decision tree is trained based on these features. What is the maximum number of leaves the decision tree can have?
📗 Answer: .
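📗 Note: each leaf can be reached by at most one combination of feature values, so the maximum number of leaves is the product of the numbers of values the two features can take. A sketch with hypothetical counts:

```python
# Sketch: max leaves = product of the number of values per feature
# (hypothetical counts; the actual ones are blanked out above).
m1, m2 = 3, 4          # assumed number of values for x1 and x2
print(m1 * m2)         # each leaf = one distinct (x1, x2) combination
```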
# Question 7

📗 [3 points] Suppose you are given a neural network with hidden layers, input units, output units, and hidden units. In one backpropagation step when computing the gradient of the cost (for example, squared loss) with respect to \(w^{\left(1\right)}_{11}\), the weight in layer \(1\) connecting input \(1\) and hidden unit \(1\), how many weights (including \(w^{\left(1\right)}_{11}\) itself, and including biases) are used in the backpropagation step of \(\dfrac{\partial C}{\partial w^{\left(1\right)}_{11}}\)?
📗 Note: the backpropagation step assumes the activations in all layers are already known, so do not count the weights and biases in the forward step computing the activations.
📗 Answer: .
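📗 Note: for intuition, the chain rule for a network with a single hidden layer of sigmoid units and \(k\) outputs (an assumption; the blanked-out architecture above may differ) uses only the weights on paths from hidden unit \(1\) forward, plus \(w^{\left(1\right)}_{11}\) itself by the question's counting convention:

```latex
% Assumed: one hidden layer, sigmoid activations, k output units.
% All a^{(l)}_j are known from the forward pass, so the only weights that
% appear are w^{(2)}_{1j} (out of hidden unit 1), plus w^{(1)}_{11} itself
% if counted by convention; no biases appear in this expression.
\dfrac{\partial C}{\partial w^{(1)}_{11}}
  = x_{1} \, a^{(1)}_{1} \left(1 - a^{(1)}_{1}\right)
    \displaystyle\sum_{j=1}^{k} w^{(2)}_{1 j} \, a^{(2)}_{j} \left(1 - a^{(2)}_{j}\right)
    \dfrac{\partial C}{\partial a^{(2)}_{j}}
```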
# Question 8

📗 [3 points] Recall an SVM (Support Vector Machine) with slack variables has the objective function \(\dfrac{\lambda}{2} w^\top w + \dfrac{1}{n} \displaystyle\sum_{i=1}^{n} \xi_{i}\), which is equivalent to \(\dfrac{1}{2} w^\top w + C \displaystyle\sum_{i=1}^{n} \xi_{i}\). What is the optimal \(w\) when the trade-off parameter \(C\) is 0? The training data contains only points with label 0 and with label 1. Only enter the weights, no bias.
📗 Answer (comma separated vector): .
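📗 Note: setting \(C = 0\) removes the slack term, so the objective (as given above) no longer depends on the training data:

```latex
% With C = 0 the slack term vanishes and the data drops out entirely:
\min_{w, b, \xi} \; \dfrac{1}{2} w^\top w + 0 \cdot \displaystyle\sum_{i=1}^{n} \xi_{i}
  \;=\; \min_{w} \; \dfrac{1}{2} w^\top w
```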
# Question 9

📗 [4 points] Consider a linear model \(a_{i} = w^\top x_{i} + b\), with the hinge cost function . The initial weight is \(\begin{bmatrix} w \\ b \end{bmatrix}\) = . What is the updated weight and bias after one stochastic (sub)gradient descent step if the chosen training data is \(x\) = , \(y\) = ? The learning rate is .
📗 Answer (comma separated vector): .
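📗 Note: a sketch of one subgradient step, assuming the common hinge form \(\max\left\{0, 1 - y\left(w^\top x + b\right)\right\}\) with \(y \in \left\{-1, +1\right\}\) (the blanked-out cost function above may use a different convention) and hypothetical numbers:

```python
# One stochastic subgradient step on the hinge cost (assumed form:
# max(0, 1 - y * (w.x + b)) with y in {-1, +1}); values are hypothetical.
import numpy as np

w = np.array([1.0, -1.0])   # assumed initial weights
b = 0.5                     # assumed initial bias
x = np.array([2.0, 1.0])    # assumed training point
y = -1.0                    # assumed label
alpha = 0.1                 # assumed learning rate

if y * (w @ x + b) < 1:     # inside the margin: subgradient is (-y*x, -y)
    w = w + alpha * y * x
    b = b + alpha * y
# else: the hinge is flat here, the subgradient is 0, nothing changes
print(w, b)
```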
# Question 10

📗 [4 points] Consider a classification problem with \(n\) = classes \(y \in \left\{1, 2, ..., n\right\}\), and two binary features \(x_{1}, x_{2} \in \left\{0, 1\right\}\). Suppose \(\mathbb{P}\left\{Y = y\right\}\) = , \(\mathbb{P}\left\{X_{1} = 1 | Y = y\right\}\) = , \(\mathbb{P}\left\{X_{2} = 1 | Y = y\right\}\) = . Which class will the naive Bayes classifier produce on a test item with \(X_{1}\) = and \(X_{2}\) = ?
📗 Answer: .
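📗 Note: naive Bayes scores each class by \(\mathbb{P}\left\{Y = y\right\} \mathbb{P}\left\{X_{1} = x_{1} | Y = y\right\} \mathbb{P}\left\{X_{2} = x_{2} | Y = y\right\}\) and predicts the argmax. A sketch with hypothetical probabilities:

```python
# Sketch: naive Bayes prediction (all probabilities below are hypothetical).
prior = {1: 0.5, 2: 0.3, 3: 0.2}          # assumed P(Y = y)
p_x1  = {1: 0.2, 2: 0.6, 3: 0.5}          # assumed P(X1 = 1 | Y = y)
p_x2  = {1: 0.7, 2: 0.1, 3: 0.4}          # assumed P(X2 = 1 | Y = y)
x1, x2 = 1, 0                             # assumed test item

def cond(p, x):                           # P(X = x | Y = y) from P(X = 1 | Y = y)
    return p if x == 1 else 1 - p

scores = {y: prior[y] * cond(p_x1[y], x1) * cond(p_x2[y], x2) for y in prior}
print(max(scores, key=scores.get))        # the predicted class
```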
# Question 11

📗 [4 points] You are given a training set of six points and their 2-class classifications (+ or -): (, +), (, +), (, +), (, -), (, -), (, -). What is the decision boundary associated with this training set using 3NN (3 Nearest Neighbor)? Note: there is one more point compared to the question from the homework.
📗 Answer: .
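📗 Note: in one dimension the 3NN boundary sits where the majority vote of the three nearest neighbors flips. A grid-scan sketch with hypothetical points:

```python
# Sketch: locate the 1D decision boundary of 3NN (hypothetical points).
import numpy as np

pts    = np.array([1.0, 2.0, 3.0, 6.0, 7.0, 8.0])  # assumed feature values
labels = np.array([1, 1, 1, 0, 0, 0])              # +, +, +, -, -, -

def knn3(q):
    nearest = np.argsort(np.abs(pts - q))[:3]      # indices of 3 closest points
    return int(labels[nearest].sum() >= 2)         # majority vote

grid = np.linspace(pts.min(), pts.max(), 10001)
preds = np.array([knn3(q) for q in grid])
flips = grid[np.where(np.diff(preds) != 0)[0]]     # where the vote changes
print(flips)  # ~4.5 for these assumed points
```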
# Question 12

📗 [3 points] Given there are data points, each data point has features, and the feature map creates new features (to replace the original features). What is the size of the kernel matrix when training a kernel SVM (Support Vector Machine)? For example, if the matrix is \(2 \times 2\), enter the number \(4\).
📗 Answer: .
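📗 Note: the kernel (Gram) matrix has one row and one column per training point, so its size does not depend on the feature map. A sketch with hypothetical dimensions and a hypothetical feature map `phi`:

```python
# Sketch: the kernel (Gram) matrix is n x n for n data points, regardless
# of how many features the feature map produces (all sizes hypothetical).
import numpy as np

n, d, d_new = 5, 2, 7                 # assumed counts
X = np.random.rand(n, d)              # assumed data

def phi(x):                           # assumed feature map to d_new features
    return np.concatenate([x, x ** 2, [x[0] * x[1], x.sum(), 1.0]])

Phi = np.array([phi(x) for x in X])   # n x d_new
K = Phi @ Phi.T                       # kernel matrix: K_ij = phi(x_i).phi(x_j)
print(K.shape, K.size)                # (n, n), so n * n entries
```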
# Question 13

📗 [4 points] Given the following transition matrix for a bigram model with words "" and "": . Row \(i\) column \(j\) is \(\mathbb{P}\left\{w_{t} = j | w_{t-1} = i\right\}\). What is the probability that the third word is "" given the first word is ""?
📗 Answer: .
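📗 Note: unlike Question 4, the second word is unobserved here, so sum over it; equivalently, read entry \(\left(i, j\right)\) of the squared transition matrix. A sketch with a hypothetical matrix:

```python
# Sketch: P(w3 = j | w1 = i) marginalizes over the unknown second word,
# which is entry (i, j) of T @ T (hypothetical matrix and indices).
import numpy as np

T = np.array([[0.8, 0.2],    # assumed P(w_t = j | w_{t-1} = i)
              [0.5, 0.5]])
i, j = 0, 1                  # assumed first and third word indices

print((T @ T)[i, j])         # = sum_k T[i, k] * T[k, j]
```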
# Question 14

📗 [3 points] A tweet is ratioed if a reply gets more likes than the tweet. Suppose a tweet has replies, and each one of these replies gets more likes than the tweet with probability if the tweet is bad, and with probability if the tweet is good. Given that a tweet is ratioed, what is the probability that it is a bad tweet? The prior probability of a bad tweet is .
📗 Answer: .
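📗 Note: assuming "ratioed" means at least one of the replies out-likes the tweet and the replies are independent, each likelihood is \(1 - \left(1 - p\right)^{m}\) and the rest is Bayes rule. A sketch with hypothetical numbers:

```python
# Sketch: Bayes rule for P(bad | ratioed), assuming "ratioed" means at least
# one of m independent replies out-likes the tweet (hypothetical numbers).
m = 10                     # assumed number of replies
p_bad, p_good = 0.2, 0.05  # assumed per-reply probabilities
prior_bad = 0.3            # assumed prior probability of a bad tweet

lik_bad  = 1 - (1 - p_bad)  ** m   # P(ratioed | bad)
lik_good = 1 - (1 - p_good) ** m   # P(ratioed | good)

posterior = (lik_bad * prior_bad) / (
    lik_bad * prior_bad + lik_good * (1 - prior_bad))
print(posterior)
```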
# Question 15

📗 [1 point] Please enter any comments, including possible mistakes and bugs with the questions or your answers. If you have no comments, please enter "None": do not leave it blank.
📗 Answer: .

# Grade



# Submission


📗 Please do not modify the content in the above text field: use the "Grade" button to update.


📗 Please wait for the message "Successful submission." to appear after the "Submit" button. If there is an error message or no message appears after 10 seconds, please save the text in the above text box to a file using the button or copy and paste it into a file yourself and submit it to Canvas Assignment MX1. You could submit multiple times (but please do not submit too often): only the latest submission will be counted.
📗 You could load your answers from the text (or txt file) in the text box below using the button. The first two lines should be "##x: 3" and "##id: your id", and the format of the remaining lines should be "##1: your answer to question 1" newline "##2: your answer to question 2", etc. Please make sure that your answers are loaded correctly before submitting them.








Last Updated: April 29, 2024 at 1:11 AM