Young Wu's Homepage

Prev: X1 Next: X3
Back to midtern page: Link, final page: Link

# X2 Past Exam Problems

📗 Enter your ID (the wisc email ID without @wisc.edu) here: and click (or hit enter key) 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50 x2

📗 If the questions are not generated correctly, try refresh the page using the button at the top left corner.

📗 The same ID should generate the same set of questions. Your answers are not saved when you close the browser. You could print the page: , solve the problems, then enter all your answers at the end.

📗 Please do not refresh the page: your answers will not be saved.

# Warning: please enter your ID before you start!

# Question 1

📗

# Question 2

📗

# Question 3

📗

# Question 4

📗

# Question 5

📗

# Question 6

📗

# Question 7

📗

# Question 8

📗

# Question 9

📗

# Question 10

📗

# Question 11

📗

# Question 12

📗

# Question 13

📗

# Question 14

📗

# Question 15

📗

# Question 16

📗

# Question 17

📗

# Question 18

📗

# Question 19

📗

# Question 20

📗

# Question 21

📗

# Question 22

📗

# Question 23

📗

# Question 24

📗

# Question 25

📗

# Question 26

📗

# Question 27

📗

# Question 28

📗

# Question 29

📗

# Question 30

📗

# Question 31

📗

# Question 32

📗

# Question 33

📗

# Question 34

📗

# Question 35

📗

# Question 36

📗

# Question 37

📗

# Question 38

📗

# Question 39

📗

# Question 40

📗

# Question 41

📗

# Question 42

📗

# Question 43

📗

# Question 44

📗

# Question 45

📗

# Question 46

📗

# Question 47

📗

# Question 48

📗

# Question 49

📗

# Question 50

📗

📗 [3 points] A linear SVM (Support Vector Machine) has \(w\) = and \(b\) = . Which of the following points is predicted positive (label 1)?

📗 Choices:

None of the above

📗 Calculator: .

📗 [2 points] Suppose an SVM (Support Vector Machine) has \(w\) = and \(b\) = . What is the actual distance between the two planes defined by \(w^\top x + b = -1\) and \(w^\top x + b = 1\).

📗 Note: the distance between the two planes is the length of the red line in the diagram, the blue line does not represent the distance between the planes. You may have to rotate the diagram to see.

📗 Answer: .

📗 [4 points] If \(K\left(x, x'\right)\) is a kernel with induced feature representation \(\varphi\left(x_{0}\right)\) = , and \(G\left(x, x'\right)\) is another kernel with induced feature representation \(\theta\left(x_{0}\right)\) = , then it is known that \(H\left(x, x'\right) = a K\left(x, x'\right) + b G\left(x, x'\right)\), \(a\) = , \(b\) = is also a kernel. What is the induced feature representation of \(H\) for this \(x_{0}\)?

📗 Answer (comma separated vector): .

📗 [3 points] Recall a linear SVM (Support Vector Machine) with slack variables has the objective function \(\dfrac{1}{2} w^\top w + C \displaystyle\sum_{i=1}^{n} \varepsilon_{i}\). What is the optimal \(w\) when the trade-off parameter \(C\) is 0? The training data contains only points with label 0 and with label 1. Only enter the weights, no bias.

📗 Answer (comma separated vector): .

📗 [2 points] Consider a small dataset with \(n\) points, where each point is in a dimensional space. For which values of \(n\), there exists a dataset such that, no matter what binary label we give to each point, a linear SVM (Support Vector Machine) can perfectly classify the resulting dataset.

📗 Choices:

None of the above

📗 [2 points] Given a weight vector \(w\) = , consider the line (plane) defined by \(w^\top x = c\) = . Along this line (on the plane), there is a point that is the closest to the origin. How far is that point to the origin in Euclidean distance?

📗 Note: the distance between the point and plane is the length of the red line in the diagram, the length of the blue line is \(\dfrac{c}{w_{z}}\), not the distance between the point and plane.

📗 Answer: .

📗 [2 points] Let \(w\) = and \(b\) = . For the point \(x\) = , \(y\) = , what is the smallest slack value \(\xi\) for it to satisfy the margin constraint?

📗 Answer: .

📗 [6 points] A linear SVM (Support Vector Machine) with with weights \(w_{1}, w_{2}, b\) is trained on the following data set: \(x_{1}\) = , \(y_{1}\) = and \(x_{2}\) = , \(y_{2}\) = . The attributes (i.e. features) are two dimensional \(\left(x_{i1}, x_{i2}\right)\) and the label \(y_{i}\) is binary. The classification rule is \(\hat{y}_{i} = 1_{\left\{w_{1} x_{i1} + w_{2} x_{i2} + b \geq 0\right\}}\). Assuming \(b\) = , what is \(\left(w_{1}, w_{2}\right)\) ?

📗 Answer (comma separated vector): .

📗 [4 points] Given a linear SVM (Support Vector Machine) that perfectly classifies a set of training data containing positive examples and negative examples with 2 support vectors. After adding one more positively labeled training example and retraining the SVM, what is the maximum possible number of support vectors possible in the new SVM.

📗 Answer: .

📗 [4 points] Given a linear SVM (Support Vector Machine) that perfectly classifies a set of training data containing positive examples and negative examples. What is the maximum possible number of training examples that could be removed and still produce the exact same SVM as derived for the original training set?

📗 Answer: .

📗 [4 points] Given a linear SVM (Support Vector Machine) that perfectly classifies a set of training data containing positive examples and negative examples. What is the minimum possible number of training examples that need be removed to cause the margin of a linear SVM to increase? If the answer is impossible, enter "-1".

📗 Answer: .

📗 [4 points] Consider a linear model \(a_{i} = w^\top x_{i} + b\), with the hinge cost function . The initial weight is \(\begin{bmatrix} w \\ b \end{bmatrix}\) = . What is the updated weight and bias after one stochastic (sub)gradient descent step if the chosen training data is \(x\) = , \(y\) = ? The learning rate is .

📗 Answer (comma separated vector): .

📗 [4 points] Given two items \(x_{1}\) = and \(x_{2}\) = , suppose the feature map for a kernel SVM (Support Vector Machine) is \(\varphi\left(x\right)\) = , what is the kernel (Gram) matrix?

📗 Answer (matrix with multiple lines, each line is a comma separated vector): .

📗 [4 points] Given the number of instances in each class summarized in the following table, how many instances are used to train an one-vs-one SVM (Support Vector Machine) for class vs ?

\(y_{i}\)	0	1	2	3	4
Count

📗 Answer: .

📗 [2 points] What are the smallest and largest values of subderivatives of at \(x = 0\).

📗 Answer (comma separated vector): .

📗 [4 points] Given the following training set, add one instance \(\begin{bmatrix} x_{1} \\ x_{2} \end{bmatrix}\) with \(y\) = so that all instances are support vectors for the Hard Margin SVM (Support Vector Machine) trained on the new training set.

\(x_{1}\)	\(x_{2}\)	\(y\)
		0
		0
		0
		1
		1
		1

📗 Note: in the diagram, currently, the two support vectors are connected by the grey line and the black line represents the SVM classification boundary. After adding one point, you should be able to make all seven points support vectors with the classification boundary given by the green line.

📗 Answer (comma separated vector): .

📗 [3 points] A hard margin SVM (Support Vector Machine) is trained on the following dataset. Suppose we restrict \(b\) = , what is the value of \(w\)? Enter a single number, i.e. do not include \(b\). Assume the SVM classifier is \(1_{\left\{w x + b \geq 0\right\}}\) (this means it predict 1 if \(w x + b \geq 0\) and 0 otherwise.

\(x_{i}\)
\(y_{i}\)

📗 Answer: .

📗 [3 points] Given there are data points, each data point has features, the feature map creates new features (to replace the original features). What is the size of the kernel matrix when training a kernel SVM (Support Vector Machine)? For example, if the matrix is \(2 \times 2\), enter the number \(4\).

📗 Answer: .

📗 [3 points] What is the city-block distance (also known as L1 distance or Manhattan distance) between two points and ?

📗 Answer: .

📗 [3 points] Consider binary classification in 2D where the intended label of a point \(x = \begin{bmatrix} x_{1} \\ x_{2} \end{bmatrix}\) is positive (1) if \(x_{1} > x_{2}\) and negative (0) otherwise. Let the training set be all points of the form \(x\) = where \(a, b\) are integers. Each training item has the correct label that follows the rule above. With a 1NN (Nearest Neighbor) classifier (Euclidean distance), which ones of the following points are labeled positive? The drawing is not graded.

📗 Choices:

None of the above

📗 Calculator: .

📗 [3 points] Consider points in 2D and binary labels. Given the training data in the table, and use Manhattan distance with 1NN (Nearest Neighbor), which of the following points in 2D are classified as 1? Answer the question by first drawing the decision boundaries. The drawing is not graded.

index	\(x_{1}\)	\(x_{2}\)	label
1	-1	-1
2	-1	1
3	1	-1
4	1	1

📗 Choices:

None of the above

📗 [4 points] You have a data set with positive items and negative items. You perform a "leave-one-out" procedure: for each item i, learn a separate kNN (k Nearest Neighbor) classifier on all items except item i, and compute that kNN's accuracy in predicting item i. The leave-one-out accuracy is defined to be the average of the accuracy for each item. What is the leave-one-out accuracy when k = ?

📗 Answer: .

📗 [4 points] You are given a training set of five points and their 2-class classifications (+ or -): (, +), (, +), (, -), (, -), (, -). What is the decision boundary associated with this training set using 3NN (3 Nearest Neighbor)?

📗 Answer: .

📗 [4 points] Say we have a training set consisting of positive examples and negative examples where each example is a point in a two-dimensional, real-valued feature space. What will the classification accuracy be on the training set with NN (Nearest Neighbor).

📗 Answer: .

📗 [0 points] To be added.

📗 [3 points] Let a dataset consist of \(n\) = points in \(\mathbb{R}\), specifically, the first \(n - 1\) points are and the last point \(x_{n}\) is unknown. What is the smallest value of \(x_{n}\) above which \(x_{n-1}\) is among \(x_{n}\)'s 3-nearest neighbors, but \(x_{n}\) is NOT among \(x_{n-1}\)'s 3-nearest neighbor? Note that the 3-nearest neighbors of a point in the training set include the point itself.

📗 Answer: .

📗 [4 points] List English letters from A to Z: ABCDEFGHIJKLMNOPQRSTUVWXYZ. Define the distance between two letters in the natural way, that is \(d\left(A, A\right) = 0\), \(d\left(A, B\right) = 1\), \(d\left(A, C\right) = 2\) and so on. Each letter has a label, are labeled 0, and the others are labeled 1. This is your training data. Now classify each letter using kNN (k Nearest Neighbor) for odd \(k = 1, 3, 5, 7, ...\). What is the smallest \(k\) where all letters are classified the same (same label, i.e. either all labels are 0s or all labels are 1s). Break ties by preferring the earlier letters in the alphabet. Hint: the nearest neighbor of a letter is the letter itself.

📗 Answer: .

📗 [3 points] Consider a -dimensional feature space where each feature takes integer value from 0 to (including 0 and ). What is the smallest and largest distance between the two distinct (non-overlapping) points in the feature space?

📗 Answer (comma separated vector): .

📗 [2 points] You have a dataset with unique data points (half of which are labeled 0 and the other half labeled 1) which you want to use to train a kNN (k Nearest Neighbor) classifier. You setup the experiment as follows: you train kNN classifiers: \(k\) = using all the data points. Then you randomly select data points from the training set, and classify them using each of the classifiers. Which classifier (enter the \(k\) value) will have the highest accuracy? Your answer should not depend on which random subset is selected.

📗 Answer: .

📗 [3 points] Consider a training set with 8 items. The first dimension of their feature vectors are: . However, this dimension is continuous (i.e. it is a real number). To build a decision tree, one may ask questions in the form "Is \(x_{1} \geq \theta\)"? where \(\theta\) is a threshold value. Ideally, what is the maximum number of different \(\theta\) values we should consider for the first dimension \(x_{1}\)? Count the values of \(\theta\) such that all instances belong to one class.

📗 Answer: .

📗 [3 points] A decision tree has depth \(d\) = (a decision tree where the root is a leaf node has \(d\) = 0). All its internal node have \(b\) = children. The tree is also complete, meaning all leaf nodes are at depth \(d\). If we require each leaf node to contain at least training examples, what is the minimum size of the training set?

📗 Answer: .

📗 [3 points] A bag contains \(n\) = different colored balls. Randomly draw a ball from the bag with equal probability. What is the entropy of the outcome? Reminder that log based 2 of x can be found by log(x) / log(2) or log2(x).

📗 Answer: .

📗 [3 points] Statistically, December 18 is the cloudiest day of the year in Madison, Wisconsin. Your professor (not me, this is Professor Jerry Zhu's question) is not making this up. On that day, the sky is overcast, mostly cloudy, or partly cloudy of the time (C = 0), and clear or mostly clear of the time (C = 1). What is the entropy of the binary random variable C? Reminder that log based 2 of x can be found by log(x) / log(2).

📗 Answer: .

📗 [3 points] The RDA Corporation has a prison with many cells. Without justification, you're about to be randomly thrown into a cell with equal probability. Cells to have Toruks that eat prisoners. Cells to are safe. With sufficient bribe, the warden will answer your question "Will I be in cell 1?" What's the mutual information (we call it information gain) between the warden's answer and your encounter with the Toruks? (I didn't write the stories in these questions, so I don't know the reference too.)

📗 Answer: .

📗 [4 points] What is the conditional entropy \(H\left(B|A\right)\) for the following set of training examples.

item	A	B
1
2
3
4
5
6
7
8

📗 Answer: .

📗 [4 points] In a problem where each example has real-valued attributes (i.e. features), where each attribute can be split at possible thresholds (i.e. binary splits), to select the best attribute for a decision tree node at depth , where the root is at depth 0, how many conditional entropies must be calculated (at most)?

📗 Answer: .

📗 [4 points] There are parrots. They have either a red beak or a black beak. They can either talk or not. Complete the two cells in the following table so that the mutual information (i.e. information gain) between "Beak" and "Talk" is :

Number of parrots	Beak	Talk
	Red	Yes
?	Red	No
??	Black	Yes
	Black	No

📗 Answer (comma separated vector): .

📗 [3 points] A hospital trains a decision tree to predict if any given patient has technophobia or not. The training set consists of patients. There are features. The labels are binary. The decision tree is not pruned. What are the smallest and largest possible training set accuracy of the decision tree? Enter two numbers between 0 and 1. Hint: patients with the same features may have different labels.

📗 Answer (comma separated vector): .

📗 [2 points] There is a total of red or green balls in a bag. How many red balls and how many green balls are there so that the entropy of the color of a randomly selected ball is imized?

📗 Answer (comma separated vector): .

📗 [3 points] Suppose there are \(2\) discrete features \(x_{1}, x_{2}\) that can take on values and , and a binary decision tree is trained based on these features. What is the maximum number of leafs the decision tree can have?

📗 Answer: .

📗 [3 points] Given three decision stumps in a random forest in the following table, what is the predicted label for a new data point \(x\) = \(\begin{bmatrix} x_{1} \\ x_{2} \\ ... \end{bmatrix}\) = ? Enter a single number (-1 or 1; and 0 in case of a tie).

Index	Decision stump	-
1	Label 1 if	Label -1 otherwise
2	Label 1 if	Label -1 otherwise
3	Label 1 if	Label -1 otherwise

📗 Answer: .

📗 [4 points] Consider a kernel \(K\left(x_{i_{1}}, x_{i_{2}}\right)\) = + + , where both \(x_{i_{1}}\) and \(x_{i_{2}}\) are 1D positive real numbers. What is the feature vector \(\varphi\left(x_{i}\right)\) induced by this kernel evaluated at \(x_{i}\) = ?

📗 Answer (comma separated vector): .

📗 [4 points] A convolutional neural network has input image of size x that is connected to a convolutional layer that uses a x filter, zero padding of the image, and a stride of 1. There are activation maps. (Here, zero-padding implies that these activation maps have the same size as the input images.) The convolutional layer is then connected to a pooling layer that uses x max pooling, a stride of (non-overlapping, no padding) of the convolutional layer. The pooling layer is then fully connected to an output layer that contains output units. There are no hidden layers between the pooling layer and the output layer. How many different weights must be learned in this whole network, not including any bias.

📗 Answer: .

📗 [4 points] What is the convolution between the image and the filter using zero padding? Remember to flip the filter first.

📗 Answer (matrix with multiple lines, each line is a comma separated vector): .

📗 [4 points] In a convolutional neural network, suppose the activation map of a convolution layer is . What is the activation map after a non-overlapping (stride 2) 2 by 2 max-pooling layer?

📗 Answer (matrix with multiple lines, each line is a comma separated vector): .

📗 [4 points] What is the gradient magnitude of the center element (pixel) of the image . Use the x gradient filter: \(\begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix}\), and the y gradient filter: \(\begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix}\). Remember to flip the filters.

📗 Answer: .

📗 [2 points] Given the following image gradient, suppose gradient vectors are put into one of the four bins according to the gradient direction: bin 1: \(\left(0, \dfrac{\pi}{2}\right]\), bin 2: \(\left(\dfrac{\pi}{2}, \pi\right]\), bin 3: \([- \dfrac{\pi}{2}, -\pi)\), bin 4: \(\left[0, - \dfrac{\pi}{2}\right)\), which bin does the gradient of the center element (pixel) fall into?

\(\nabla_{x}\) = , \(\nabla_{y}\) = .
Enter the bin number (1, 2, 3, or 4), not the direction.

📗 Calculator (you can use the function atan2(y, x)): .

📗 Answer: .

📗 [1 points] Blank.

📗 Answer: .

📗 [1 points] Blank.

📗 Answer: .

📗 [1 points] Blank.

📗 Answer: .

# Grade

* * * * *

* * * * *

📗 You could save the text in the above text box to a file using the button or copy and paste it into a file yourself .

📗 You could load your answers from the text (or txt file) in the text box below using the button . The first two lines should be "##x: 2" and "##id: your id", and the format of the remaining lines should be "##1: your answer to question 1" newline "##2: your answer to question 2", etc. Please make sure that your answers are loaded correctly before submitting them.

Last Updated: July 01, 2025 at 1:48 AM