# Epic Section Final - Online

📗 Enter your ID (your wisc email ID without @wisc.edu) here: and click (or hit the enter key).
📗 You can print the page and solve the problems on paper, or annotate the PDF file. You can also write your answers on blank paper or in separate files. To get full points, you must state your final answers clearly and provide explanations of how you obtained them.
📗 Please submit the file (scanned or annotated) on Canvas to Assignment X1 before the end of the exam.

# Warning: please enter your ID before you start!




# Epic Section Final - In Person


📗 Name: ____________________

📗 Wisc ID: ____________________

📗 Please state your final answers clearly. You do not have to evaluate mathematical expressions. You do not have to fit your answers into the answer text boxes.






# Question 1

📗 [4 points] Given the following transition matrix for a bigram model with words "" and "": . Row \(i\), column \(j\) is \(\mathbb{P}\left\{w_{t} = j | w_{t-1} = i\right\}\). What is the probability that the third word is "" given that the first word is ""?
📗 Answer: .
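
📗 Note: the matrix and words are blank above, so the following is a minimal sketch with a hypothetical 2-by-2 transition matrix. It shows how the two-step probability is obtained by summing over the possible second words, i.e., entry \((i, j)\) of the squared transition matrix.

```python
# Hypothetical transition matrix (the exam's values are blank above).
# Row i, column j = P(w_t = j | w_{t-1} = i).
T = [[0.8, 0.2],
     [0.4, 0.6]]

def two_step(T, i, j):
    """P(w_3 = j | w_1 = i) = sum over the second word k of
    P(w_2 = k | w_1 = i) * P(w_3 = j | w_2 = k)."""
    return sum(T[i][k] * T[k][j] for k in range(len(T)))

print(two_step(T, 0, 1))  # 0.8 * 0.2 + 0.2 * 0.6 = 0.28
```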

# Question 2

📗 [3 points] Given an infinite state sequence in which the pattern "" is repeated infinitely many times, what is the (maximum likelihood) estimated transition probability from state to (without smoothing)?
📗 Answer: .
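
📗 Note: the pattern and states are blank above, so here is a minimal sketch with a hypothetical pattern "AAB". Because the pattern repeats forever, the transition from its last symbol back to its first symbol also counts, and the MLE is the fraction of transitions out of the source state that go to the target state.

```python
# Hypothetical pattern (the exam's pattern is blank above): "AAB" repeated forever.
pattern = "AAB"
# One period of transitions, including the wrap-around from the last symbol
# back to the first: [('A','A'), ('A','B'), ('B','A')].
pairs = [(pattern[t], pattern[(t + 1) % len(pattern)]) for t in range(len(pattern))]

def mle_transition(pairs, s, t):
    """MLE estimate: count(s -> t) / count(s -> anything), no smoothing."""
    from_s = [p for p in pairs if p[0] == s]
    return sum(1 for p in from_s if p[1] == t) / len(from_s)

print(mle_transition(pairs, "A", "B"))  # 1 of the 2 transitions out of A -> 0.5
```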

# Question 3

📗 [2 points] In a corpus (set of documents) with word types (unique word tokens), the phrase "" appeared times. In particular, "" appeared times and "" appeared times. If we estimate probability by frequency (the maximum likelihood estimate) with Laplace smoothing (add-1 smoothing), what is the estimated probability of \(\mathbb{P}\){ | }?
📗 Answer: .
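
📗 Note: all counts are blank above, so this is a minimal sketch with hypothetical numbers of the add-1 smoothed bigram estimate \(\mathbb{P}\left\{w_{2} | w_{1}\right\} = \dfrac{c\left(w_{1} w_{2}\right) + 1}{c\left(w_{1}\right) + |V|}\), where \(|V|\) is the number of word types.

```python
# Hypothetical counts (the exam's values are blank above).
vocab_size = 1000    # number of word types |V|
count_phrase = 20    # times the phrase "w1 w2" appeared
count_w1 = 100       # times the word "w1" appeared

# Add-1 (Laplace) smoothed estimate of P(w2 | w1):
p = (count_phrase + 1) / (count_w1 + vocab_size)
print(p)  # 21 / 1100 ≈ 0.0191
```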

# Question 4

📗 [3 points] There are biased coins in my pocket: coin A has \(\mathbb{P}\left\{H | A\right\}\) = , coin B has \(\mathbb{P}\left\{H | B\right\}\) = , and coin C has \(\mathbb{P}\left\{H | C\right\}\) = (H for Heads and T for Tails). I took a coin out of the pocket at random; the probability of picking A is and of picking B is . I flipped it times and the outcome was . What is the probability that the coin was ?
📗 Answer: .
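
📗 Note: the priors, head probabilities, and flip outcome are blank above; the sketch below uses hypothetical values and applies Bayes rule, \(\mathbb{P}\left\{coin | outcome\right\} \propto \mathbb{P}\left\{outcome | coin\right\} \mathbb{P}\left\{coin\right\}\).

```python
# Hypothetical parameters (the exam's values are blank above).
prior = {"A": 0.3, "B": 0.5, "C": 0.2}    # P(C) = 1 - P(A) - P(B)
p_heads = {"A": 0.9, "B": 0.5, "C": 0.1}  # P(H | coin)
outcome = "HHT"                           # observed flip sequence

def likelihood(coin):
    """P(outcome | coin): flips are independent given the coin."""
    p = 1.0
    for flip in outcome:
        p *= p_heads[coin] if flip == "H" else 1 - p_heads[coin]
    return p

# Bayes rule: normalize likelihood * prior over the three coins.
joint = {c: likelihood(c) * prior[c] for c in prior}
total = sum(joint.values())
print({c: joint[c] / total for c in joint})
```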

# Question 5

📗 [4 points] Given the counts below, find the maximum likelihood estimate of \(\mathbb{P}\left\{A = 1 | B + C = s\right\}\), for \(s\) = .
A B C counts
0 0 0
0 0 1
0 1 0
0 1 1
1 0 0
1 0 1
1 1 0
1 1 1

📗 Answer: .
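
📗 Note: the counts column and \(s\) are blank above; this sketch fills in hypothetical counts and computes the MLE as the ratio of matching counts, \(\mathbb{P}\left\{A = 1 | B + C = s\right\} = \dfrac{c\left(A = 1, B + C = s\right)}{c\left(B + C = s\right)}\).

```python
# Hypothetical counts (blank in the table above), keyed by (A, B, C).
counts = {(0, 0, 0): 4, (0, 0, 1): 2, (0, 1, 0): 3, (0, 1, 1): 1,
          (1, 0, 0): 2, (1, 0, 1): 5, (1, 1, 0): 1, (1, 1, 1): 2}
s = 1  # hypothetical value of s

num = sum(n for (a, b, c), n in counts.items() if a == 1 and b + c == s)
den = sum(n for (a, b, c), n in counts.items() if b + c == s)
print(num / den)  # (5 + 1) / (2 + 3 + 5 + 1) = 6/11 ≈ 0.545
```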

# Question 6

📗 [3 points] Given two binary features, \(X_{1}\) and \(X_{2}\), where \(\mathbb{P}\left\{X_{1} = 1\right\}\) = , \(\mathbb{P}\left\{X_{2} = 1\right\}\) = , and \(\mathbb{P}\left\{X_{1} = 1 | X_{2} = 0\right\}\) = , what is \(\mathbb{P}\left\{X_{1} = 1 | X_{2} = 1\right\}\)?
📗 Answer: .
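
📗 Note: the three given probabilities are blank above; the sketch below uses hypothetical values and solves the law of total probability, \(\mathbb{P}\left\{X_{1} = 1\right\} = \mathbb{P}\left\{X_{1} = 1 | X_{2} = 0\right\} \mathbb{P}\left\{X_{2} = 0\right\} + \mathbb{P}\left\{X_{1} = 1 | X_{2} = 1\right\} \mathbb{P}\left\{X_{2} = 1\right\}\), for the unknown term.

```python
# Hypothetical values (the exam's values are blank above).
p_x1 = 0.5               # P(X1 = 1)
p_x2 = 0.4               # P(X2 = 1)
p_x1_given_not_x2 = 0.3  # P(X1 = 1 | X2 = 0)

# Solve the total-probability identity for P(X1 = 1 | X2 = 1):
p_x1_given_x2 = (p_x1 - p_x1_given_not_x2 * (1 - p_x2)) / p_x2
print(p_x1_given_x2)  # (0.5 - 0.3 * 0.6) / 0.4 = 0.8
```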

# Question 7

📗 [4 points] Consider the problem of detecting whether an email message is spam. Say we use three variables to model this problem: a binary label \(S\) indicating whether the message is spam, and two binary features \(C, F\) indicating whether the message contains "Cash" and "Free". We use a Naive Bayes classifier with the following probabilities estimated from the training set:
Prior \(\mathbb{P}\left\{S = 1\right\}\) =
Hams \(\mathbb{P}\left\{C = 1 | S = 0\right\}\) = , \(\mathbb{P}\left\{F = 1 | S = 0\right\}\) =
Spams \(\mathbb{P}\left\{C = 1 | S = 1\right\}\) = , \(\mathbb{P}\left\{F = 1 | S = 1\right\}\) =

Compute the posterior probability that the email is spam given the following features: \(\mathbb{P}\){\(S = 1\) | \(C\) = , \(F\) = }.
📗 Answer: .
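
📗 Note: the estimated probabilities and observed features are blank above; this is a minimal sketch with hypothetical values of the Naive Bayes posterior, which multiplies the prior by the class-conditional feature probabilities and normalizes over \(S = 0\) and \(S = 1\).

```python
# Hypothetical estimates (the exam's values are blank above).
p_s = 0.2                     # prior P(S = 1)
p_c_given = {0: 0.1, 1: 0.7}  # P(C = 1 | S = s)
p_f_given = {0: 0.2, 1: 0.6}  # P(F = 1 | S = s)
c_obs, f_obs = 1, 0           # observed feature values

def joint(s):
    """P(S = s, C = c_obs, F = f_obs) under the Naive Bayes factorization."""
    prior = p_s if s == 1 else 1 - p_s
    pc = p_c_given[s] if c_obs == 1 else 1 - p_c_given[s]
    pf = p_f_given[s] if f_obs == 1 else 1 - p_f_given[s]
    return prior * pc * pf

# Posterior by Bayes rule:
print(joint(1) / (joint(0) + joint(1)))
```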

# Question 8

📗 [3 points] Consider a vector \(x\) = . If the principal component is = , what is the reconstruction of \(x\) using only the first principal components? If more information is needed, please enter a vector of all 0's.
📗 Answer (comma separated vector): .
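
📗 Note: the vector and principal component are blank above; the sketch below assumes one unit-length principal component \(u\) and already-centered data, in which case the rank-1 reconstruction is \(\hat{x} = \left(x^\top u\right) u\).

```python
# Hypothetical vector and unit-length principal component (blank above).
x = [2.0, 1.0]
u = [0.6, 0.8]

# Project x onto u, then rebuild the vector along u:
proj = sum(xi * ui for xi, ui in zip(x, u))  # x . u = 2.0
x_hat = [proj * ui for ui in u]
print(x_hat)  # [1.2, 1.6]
```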

# Question 9

📗 [3 points] What is the distance between clusters \(C_{1}\) = {} and \(C_{2}\) = {} using linkage?
📗 Answer: .
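
📗 Note: the clusters and the linkage type are blank above; the sketch below computes all three common linkage distances for two hypothetical 1D clusters.

```python
# Hypothetical 1D clusters (the exam's values are blank above).
c1 = [1, 3]
c2 = [6, 10]

pairwise = [abs(a - b) for a in c1 for b in c2]  # |1-6|, |1-10|, |3-6|, |3-10|
print(min(pairwise))                  # single linkage (closest pair) -> 3
print(max(pairwise))                  # complete linkage (farthest pair) -> 9
print(sum(pairwise) / len(pairwise))  # average linkage -> (5+9+3+7)/4 = 6.0
```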

# Question 10

📗 [3 points] Given the pairwise distance matrix \(d\) below, what is the linkage distance between the clusters {} and {}? The columns and rows are indexed \(1, 2, 3, \ldots\), i.e., row \(i\), column \(j\) is the distance between point \(i\) and point \(j\).
d =
📗 Answer: .

# Question 11

📗 [4 points] Given the dataset , the cluster centers are computed by the k-means clustering algorithm with \(k = 2\). The first cluster center is \(x\) and the second cluster center is . What is the imum value of \(x\) such that the second cluster is empty (contains 0 instances)? In case of a tie in distance, the point belongs to cluster 1.
📗 Answer: .

# Question 12

📗 [4 points] Given the following training data, what is the fold cross validation accuracy (i.e., LOOCV, Leave One Out Cross Validation) if a NN (Nearest Neighbor) classifier with Manhattan distance is used? Break ties (in distance) by using the instance with the smaller index. Enter a number between 0 and 1.
Index 1 2 3 4 5
\(x_{i}\)
\(y_{i}\)

📗 Answer: .
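
📗 Note: the \(x_{i}\) and \(y_{i}\) rows are blank above; the sketch below runs leave-one-out cross validation for 1-NN on hypothetical 1D data, breaking distance ties in favor of the smaller index.

```python
# Hypothetical data (the exam's x and y rows are blank above).
xs = [1, 2, 3, 7, 8]
ys = [0, 0, 1, 1, 1]

def loocv_1nn(xs, ys):
    """LOOCV accuracy of 1-NN with Manhattan (absolute) distance on 1D data."""
    correct = 0
    for i in range(len(xs)):
        others = [j for j in range(len(xs)) if j != i]
        # Compare by (distance, index) so ties go to the smaller index.
        nearest = min(others, key=lambda j: (abs(xs[j] - xs[i]), j))
        correct += ys[nearest] == ys[i]
    return correct / len(xs)

print(loocv_1nn(xs, ys))  # 4 of 5 held-out points classified correctly: 0.8
```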

# Question 13

📗 [4 points] Say we have a training set consisting of items with label \(0\) and items with label \(1\), where each item has two features and all items have distinct features. What is the classification accuracy of NN (Nearest Neighbor) on the training set? (Note: this is not k-fold cross validation, meaning all items are used in training.)
📗 Answer: .

# Question 14

📗 [4 points] What is the conditional entropy \(H\left(B|A\right)\) for the following set of training examples?
item A B
1
2
3
4
5
6
7
8

📗 Answer: .
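
📗 Note: the A and B columns are blank above; the sketch below computes \(H\left(B|A\right) = \displaystyle\sum_{a} \mathbb{P}\left\{A = a\right\} H\left(B | A = a\right)\) on hypothetical binary columns.

```python
from collections import Counter
from math import log2

# Hypothetical binary columns (blank in the table above).
A = [0, 0, 0, 0, 1, 1, 1, 1]
B = [0, 0, 1, 1, 1, 1, 1, 0]

def conditional_entropy(A, B):
    """H(B | A) = sum over a of P(A = a) * H(B | A = a)."""
    n = len(A)
    h = 0.0
    for a, n_a in Counter(A).items():
        b_counts = Counter(b for x, b in zip(A, B) if x == a)
        h_a = -sum((c / n_a) * log2(c / n_a) for c in b_counts.values())
        h += (n_a / n) * h_a
    return h

print(conditional_entropy(A, B))  # 0.5 * 1 + 0.5 * 0.811 ≈ 0.906 bits
```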

# Question 15

📗 [4 points] "It" has a house with many doors. A random door is about to be opened with equal probability. Doors to have monsters that eat people. Doors to are safe. With a sufficient bribe, Pennywise will answer your question "Will door 1 be opened?" What is the information gain (also called mutual information) between Pennywise's answer and your encounter with a monster?
📗 Answer: .
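
📗 Note: the door counts are blank above; the sketch below works a hypothetical version with 4 doors where doors 1 to 2 have monsters, using \(I\left(A; M\right) = H\left(M\right) - H\left(M|A\right)\), where \(A\) is Pennywise's answer and \(M\) indicates a monster.

```python
from math import log2

def entropy(ps):
    """Entropy in bits of a distribution given as a list of probabilities."""
    return -sum(p * log2(p) for p in ps if p > 0)

# Hypothetical: 4 equally likely doors; doors 1-2 have monsters, 3-4 are safe.
# H(M): the monster indicator is 1 with probability 2/4.
h_m = entropy([0.5, 0.5])
# H(M | A): answer "yes" (prob 1/4) makes M = 1 certain;
# answer "no" (prob 3/4) leaves M = 1 with probability 1/3.
h_m_given_a = 0.25 * entropy([1.0]) + 0.75 * entropy([1/3, 2/3])
print(h_m - h_m_given_a)  # I(A; M) ≈ 0.311 bits
```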

# Question 16

📗 [3 points] A hospital trains a decision tree to predict whether any given patient has technophobia. The training set consists of patients. There are features. The labels are binary. The decision tree is not pruned. What are the smallest and largest possible training set accuracies of the decision tree? Enter two numbers between 0 and 1. Hint: patients with the same features may have different labels.
📗 Answer (comma separated vector): .

# Question 17

📗 [3 points] Given five decision stumps (decision trees with depth 1) in a random forest in the following table, what is the predicted label for a new data point \(x\) = \(\begin{bmatrix} x_{1} & x_{2} & ... \end{bmatrix}\) = ? Enter a single number (-1 or 1; 0 in case of a tie).
Index Decision stump
1 Label 1 if , label -1 otherwise
2 Label 1 if , label -1 otherwise
3 Label 1 if , label -1 otherwise
4 Label 1 if , label -1 otherwise
5 Label 1 if , label -1 otherwise

📗 Answer: .

# Question 18

📗 [3 points] A hard margin SVM (Support Vector Machine) is trained on the following dataset. Suppose we restrict \(b\) = ; what is the value of \(w\)? Enter a single number, i.e., do not include \(b\). Assume the SVM classifier is \(1_{\left\{w x + b \geq 0\right\}}\) (this means it predicts 1 if \(w x + b \geq 0\) and 0 otherwise).
\(x_{i}\)
\(y_{i}\)

📗 Answer: .

# Question 19

📗 [4 points] Given a linear SVM (Support Vector Machine) that perfectly classifies a set of training data containing positive examples and negative examples, what is the maximum possible number of training examples that could be removed while still producing the exact same SVM as the one derived from the original training set?
📗 Answer: .

# Question 20

📗 [4 points] Given the following training set, add one item \(\begin{bmatrix} x_{1} \\ x_{2} \end{bmatrix}\) with \(y\) = so that all 7 items are support vectors for the Hard Margin SVM (Support Vector Machine) trained on the new training set.
\(x_{1}\) \(x_{2}\) \(y\)
0
0
0
1
1
1

📗 Answer (comma separated vector): .

# Question 21

📗 [2 points] What are the smallest and largest values of the subderivatives of at \(x = 0\)?
📗 Answer (comma separated vector): .

# Question 22

📗 [4 points] Given two items \(x_{1}\) = and \(x_{2}\) = , suppose the feature map for a kernel SVM (Support Vector Machine) is \(\varphi\left(x\right)\) = . What is the kernel (Gram) matrix?
📗 Answer (matrix with multiple lines, each line is a comma separated vector): .
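
📗 Note: the items and feature map are blank above; the sketch below uses a hypothetical 1D pair and feature map \(\varphi\left(x\right) = \left(x, x^{2}\right)\), and fills in the Gram matrix entries \(K_{ij} = \varphi\left(x_{i}\right)^\top \varphi\left(x_{j}\right)\).

```python
# Hypothetical items and feature map (the exam's values are blank above).
xs = [2.0, 3.0]

def phi(x):
    return [x, x * x]  # hypothetical feature map phi(x) = (x, x^2)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Gram matrix: K[i][j] = phi(x_i) . phi(x_j)
K = [[dot(phi(a), phi(b)) for b in xs] for a in xs]
print(K)  # [[20.0, 42.0], [42.0, 90.0]]
```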

# Question 23

📗 [4 points] Consider a kernel \(K\left(x_{i_{1}}, x_{i_{2}}\right)\) = + , where both \(x_{i_{1}}\) and \(x_{i_{2}}\) are 1D positive real numbers. What is the feature vector \(\varphi\left(x_{i}\right)\) induced by this kernel evaluated at \(x_{i}\) = ?
📗 Answer (comma separated vector): .

# Question 24

📗 [3 points] In one iteration of the Perceptron Algorithm, \(x\) = , \(y\) = , and the predicted label \(\hat{y} = a\) = . The learning rate is \(\alpha = 1\). After the iteration, how many of the weights (including the bias \(b\)) are increased (the change is strictly larger than 0)? If it is impossible to figure out given the information, enter -1.
📗 Answer: .
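
📗 Note: \(x\), \(y\), and \(a\) are blank above; the sketch below uses hypothetical values and one common perceptron update convention, \(w \leftarrow w + \alpha \left(y - a\right) x\) and \(b \leftarrow b + \alpha \left(y - a\right)\), then counts the strictly positive changes.

```python
# Hypothetical iteration (the exam's x, y, a are blank above).
alpha = 1.0
x = [2.0, -1.0, 0.0]
y, a = 1, 0  # true label and predicted label

dw = [alpha * (y - a) * xi for xi in x]  # change in each weight
db = alpha * (y - a)                     # change in the bias

increased = sum(d > 0 for d in dw) + (db > 0)
print(increased)  # dw = [2.0, -1.0, 0.0], db = 1.0 -> 2 strict increases
```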

# Question 25

📗 [2 points] The Perceptron algorithm does not terminate (cannot converge) for any learning rate on the following training set. Give an example of \(y_{1}\) (either 0 or 1) that makes this the case. If there are multiple possible answers, enter only one, and if no such \(y_{1}\) exists, enter \(-1\).
\(i\) \(x_{i}\) \(y_{i}\)
1 \(y_{1}\)
2
3
4

📗 Answer: .

# Question 26

📗 [3 points] What is the minimum zero-one cost of a binary (\(y\) is either 0 or 1) linear (threshold) classifier (for example, an LTU (Linear Threshold Unit) perceptron) on the following data set?
\(x_{i}\) 1 2 3 4 5 6
\(y_{i}\)

📗 Answer: .

# Question 27

📗 [4 points] Suppose the partial derivative of a cost function \(C\) with respect to some weight \(w\) is given by \(\dfrac{\partial C}{\partial w} = \dfrac{\partial C_{1}}{\partial w} + \dfrac{\partial C_{2}}{\partial w} + \dfrac{\partial C_{3}}{\partial w} + ...\), where \(\dfrac{\partial C_{i}}{\partial w} = w x_{i}\). Given a data set \(x\) = {} and initial weight \(w\) = , compute and compare the updated weight after 1 step of batch gradient descent and steps of stochastic gradient descent (start with the same initial weight, then use data point 1 in step 1, data point 2 in step 2, ...). Enter two numbers, comma separated, batch gradient descent first. Use the learning rate \(\alpha\) = .
📗 Answer (comma separated vector): .
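
📗 Note: the data set, initial weight, and learning rate are blank above; the sketch below contrasts the two updates on hypothetical values: batch gradient descent sums the per-item gradients once at the initial weight, while stochastic gradient descent applies them one at a time, re-evaluating the gradient at the current weight.

```python
# Hypothetical data, initial weight, and learning rate (blank above).
xs = [1.0, 2.0, 3.0]
w0 = 1.0
alpha = 0.1

# Batch gradient descent, 1 step: gradient = sum_i w0 * x_i at the initial w0.
w_batch = w0 - alpha * sum(w0 * x for x in xs)

# Stochastic gradient descent, one step per data point in order:
w_sgd = w0
for x in xs:
    w_sgd -= alpha * w_sgd * x  # gradient w * x_i uses the current weight

print(w_batch, w_sgd)  # 1 - 0.1*6 = 0.4 vs 1 * 0.9 * 0.8 * 0.7 = 0.504
```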

# Question 28

📗 [4 points] Given the following neural network that classifies all the training items correctly, what are the labels (0 or 1) of the training data? The activation functions are LTU for all units: \(1_{\left\{z \geq 0\right\}}\). The first layer weight matrix is , with bias vector , and the second layer weight vector is , with bias .
\(x_{i1}\) \(x_{i2}\) \(y_{i}\) or \(a^{\left(2\right)}_{1}\)
0 0 ?
0 1 ?
1 0 ?
1 1 ?

📗 Answer (comma separated vector): .
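
📗 Note: the weights and biases are blank above; the sketch below runs the forward pass on all four inputs with hypothetical weights (chosen here so the network happens to compute XOR), which is how the labels would be read off.

```python
# Hypothetical weights and biases (the exam's values are blank above).
W1 = [[1.0, 1.0],
      [1.0, 1.0]]   # row k holds the weights of hidden unit k
b1 = [-0.5, -1.5]
W2 = [1.0, -2.0]    # second layer weight vector
b2 = -0.5

def ltu(z):
    return 1 if z >= 0 else 0

def forward(x):
    hidden = [ltu(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    return ltu(sum(w * h for w, h in zip(W2, hidden)) + b2)

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, forward(x))  # these weights yield 0, 1, 1, 0 (XOR)
```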

# Question 29

📗 [2 points] In a three-layer (fully connected) neural network, the first hidden layer contains sigmoid units, the second hidden layer contains units, and the output layer contains units. The input is dimensional. How many weights plus biases does this neural network have? Enter one number.
📗 Answer: .
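
📗 Note: the layer sizes are blank above; the sketch below counts parameters for hypothetical sizes using the fact that a fully connected layer with \(n_{in}\) inputs and \(n_{out}\) units has \(n_{in} n_{out}\) weights plus \(n_{out}\) biases.

```python
# Hypothetical layer sizes (the exam's values are blank above).
d_in, h1, h2, d_out = 4, 5, 3, 1

def layer_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

total = layer_params(d_in, h1) + layer_params(h1, h2) + layer_params(h2, d_out)
print(total)  # (4*5 + 5) + (5*3 + 3) + (3*1 + 1) = 47
```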

# Question 30

📗 [1 point] Please enter any comments, including possible mistakes and bugs in the questions or your answers. If you have no comments, please enter "None": do not leave it blank.
📗 Answer: .

# Grade



# Submission


📗 Please do not modify the content in the above text field: use the "Grade" button to update.


📗 You can save the text in the above text box to a file using the button, or copy and paste it into a file yourself.
📗 You can load your answers from the text (or a txt file) in the text box below using the button. The first two lines should be "##m: 1" and "##id: your id", and the format of the remaining lines should be "##1: your answer to question 1", newline, "##2: your answer to question 2", etc. Please make sure that your answers are loaded correctly before submitting them.







Last Updated: November 18, 2024 at 11:43 PM