📗 Enter your ID (the wisc email ID without @wisc.edu) here: and click (or hit the Enter key)
📗 The same ID should generate the same set of questions. Your answers are not saved when you close the browser. You could print the page, solve the problems, and then enter all your answers at the end.
📗 In case the questions are not generated correctly, try (1) refresh the page, (2) clear the browser cache, Ctrl+F5 or Ctrl+Shift+R or Shift+Command+R, (3) switch to incognito/private browsing mode, (4) switch to another browser, (5) use a different ID. If none of these work, message me on Zoom.
📗 [3 points] Suppose there is a single integer input \(x\) = {\(0\), \(1\), ..., }, and the label is binary \(y\) = {\(0\), \(1\)}. Let \(\mathcal{H}\) be a hypothesis space containing all possible linear classifiers. How many unique classifiers are there in \(\mathcal{H}\)? For example, the three linear classifiers \(1_{\left\{x < 0.4\right\}}\), \(1_{\left\{x \leq 0.4\right\}}\) and \(1_{\left\{x < 0.6\right\}}\) are considered the same classifier since they classify all possible data sets the same way.
📗 Answer: .
📗 [3 points] Suppose there are \(2\) discrete features \(x_{1}, x_{2}\) that can take on values and , and a binary decision tree is trained based on these features. What is the maximum number of leaves the decision tree can have?
📗 Answer: .
📗 [3 points] Suppose ( + + ) entries are stored in the conditional probability tables of three binary variables \(X_{1}, X_{2}, X_{3}\). What is the configuration of the Bayesian network? Enter 1 for causal chain (e.g. \(X_{1} \to X_{2} \to X_{3}\)), enter 2 for common cause (e.g. \(X_{1} \leftarrow X_{2} \to X_{3}\)), enter 3 for common effect (e.g. \(X_{1} \to X_{2} \leftarrow X_{3}\)), and enter -1 if more information is needed or more than one of the previous configurations is possible.
📗 Answer: .
📗 [3 points] If the joint probabilities of the Bayesian network \(X_{1} \to X_{2} \to X_{3} \to ... \to X_{n}\) with \(n\) = binary variables are stored in a table (instead of the conditional probability tables (CPT)), what is the size of the table?
📗 For example, if the network is \(X_{1} \to X_{2}\), then the size of the joint probability table is 3, containing entries \(\mathbb{P}\left\{X_{1}, X_{2}\right\}, \mathbb{P}\left\{X_{1}, \neg X_{2}\right\}, \mathbb{P}\left\{\neg X_{1}, X_{2}\right\}\), because the joint probability \(\mathbb{P}\left\{\neg X_{1}, \neg X_{2}\right\} = 1 - \mathbb{P}\left\{X_{1}, X_{2}\right\} - \mathbb{P}\left\{X_{1}, \neg X_{2}\right\} - \mathbb{P}\left\{\neg X_{1}, X_{2}\right\}\) can be computed based on the other entries in the table.
📗 Answer: .
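The counting argument in the two-variable example can be sketched in a few lines. This is only an illustration: the function name `joint_table_size` is hypothetical, and the value of \(n\) generated for the question is not assumed here.

```python
def joint_table_size(n: int) -> int:
    """Free entries in the joint probability table of n binary variables."""
    # There are 2**n joint outcomes; one entry is redundant because
    # all joint probabilities sum to 1.
    return 2 ** n - 1


# The two-variable network X1 -> X2 from the example above.
print(joint_table_size(2))
```

The network structure does not enter the count, since a full joint table stores one probability per outcome regardless of the edges.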
📗 [3 points] Given an infinite state sequence in which the pattern "" is repeated an infinite number of times, what is the (maximum likelihood) estimated transition probability from state to (without smoothing)?
📗 Answer: .
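One way to reason about this kind of question is to count transitions within a single copy of the pattern, treating it as cyclic because the last state is always followed by the first state of the next copy. The sketch below assumes a hypothetical pattern `"AAB"` and a hypothetical helper name `transition_mle`; neither comes from the question.

```python
from collections import Counter


def transition_mle(pattern: str, src: str, dst: str) -> float:
    """MLE of P(next = dst | current = src) for an infinitely repeated pattern."""
    # Pair each state with its successor; the last state wraps to the first.
    pairs = Counter(zip(pattern, pattern[1:] + pattern[0]))
    # Total number of transitions that leave src (assumed to be nonzero).
    from_src = sum(count for (a, _), count in pairs.items() if a == src)
    return pairs[(src, dst)] / from_src


# Hypothetical pattern: transitions are A->A, A->B, B->A.
print(transition_mle("AAB", "A", "B"))
```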
📗 [3 points] Suppose you are given a neural network with hidden layers, input units, output units, and hidden units. In one backpropagation step when computing the gradient of the cost (for example, squared loss) with respect to \(w^{\left(1\right)}_{11}\), the weight in layer \(1\) connecting input \(1\) and hidden unit \(1\), how many weights (including \(w^{\left(1\right)}_{11}\) itself, and including biases) are used in the backpropagation step of \(\dfrac{\partial C}{\partial w^{\left(1\right)}_{11}}\)?
📗 The above is a diagram of the network; the nodes labelled "1" are the bias units. You can highlight the edges representing the weights in the diagram, but they are not graded. Note: the backpropagation step assumes the activations in all layers are already known, so do not count the weights and biases in the forward step computing the activations.
📗 Answer: .
📗 [3 points] A tweet is ratioed if at least one reply gets more likes than the tweet. Suppose a tweet has replies, and each one of these replies gets more likes than the tweet with probability if the tweet is bad, and probability if the tweet is good. Given a tweet is ratioed, what is the probability that it is a bad tweet? The prior probability of a bad tweet is .
📗 Answer: .
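A hedged sketch of the Bayes rule computation follows. It assumes the replies are independent of each other given the tweet type (the question does not state this explicitly), and all numeric values and the function name are hypothetical.

```python
def prob_bad_given_ratioed(n_replies, p_bad, p_good, prior_bad):
    """P(bad | ratioed), assuming replies are independent given the tweet type."""
    # A tweet is ratioed unless every reply fails to out-like it.
    ratioed_if_bad = 1 - (1 - p_bad) ** n_replies
    ratioed_if_good = 1 - (1 - p_good) ** n_replies
    numerator = prior_bad * ratioed_if_bad
    return numerator / (numerator + (1 - prior_bad) * ratioed_if_good)


# Hypothetical values, not the ones generated for the question.
print(prob_bad_given_ratioed(n_replies=3, p_bad=0.5, p_good=0.1, prior_bad=0.2))
```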
📗 [3 points] A hard margin SVM (Support Vector Machine) is trained on the following dataset. Suppose we restrict \(b\) = , what is the value of \(w\)? Enter a single number, i.e. do not include \(b\). Assume the SVM classifier is \(1_{\left\{w x + b \geq 0\right\}}\) (this means it predicts 1 if \(w x + b \geq 0\) and 0 otherwise).
\(x_{i}\)
\(y_{i}\)
📗 Answer: .
📗 [4 points] Consider the following Markov Decision Process. It has two states \(s\), A and B. It has two actions \(a\): move and stay. The state transition is deterministic: "move" moves to the other state, while "stay" stays at the current state. The reward \(r\) is for move, for stay. Suppose the discount rate is \(\beta\) = .
Find the Q table \(Q_{i}\) after \(i\) = updates of every entry using Q value iteration (\(i = 0\) initializes all values to \(0\)) in the format described by the following table. Enter a two by two matrix.
State \ Action | stay | move
A | ? | ?
B | ? | ?
📗 Answer (matrix with multiple lines, each line is a comma separated vector): .
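The update \(Q_{i+1}(s, a) = r(a) + \beta \max_{a'} Q_{i}(s', a')\), where \(s'\) is the state reached from \(s\) by action \(a\), can be sketched as below. The rewards, discount rate, and number of updates are hypothetical, and the sketch assumes every entry is updated synchronously from the previous table.

```python
def q_value_iteration(r_stay, r_move, beta, n_updates):
    """Q table after n_updates synchronous updates, rows A, B; columns stay, move."""
    states = ["A", "B"]
    other = {"A": "B", "B": "A"}
    q = {s: {"stay": 0.0, "move": 0.0} for s in states}  # Q_0 is all zeros
    for _ in range(n_updates):
        new_q = {}
        for s in states:
            new_q[s] = {
                # "stay" keeps the agent in s; "move" sends it to the other state.
                "stay": r_stay + beta * max(q[s].values()),
                "move": r_move + beta * max(q[other[s]].values()),
            }
        q = new_q
    return [[q[s]["stay"], q[s]["move"]] for s in states]


# Hypothetical rewards and discount rate.
for row in q_value_iteration(r_stay=1.0, r_move=0.0, beta=0.5, n_updates=2):
    print(", ".join(str(v) for v in row))
```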
📗 [3 points] Given five decision stumps (decision trees with depth 1) in a random forest in the following table, what is the predicted label for a new data point \(x\) = \(\begin{bmatrix} x_{1} & x_{2} & ... \end{bmatrix}\) = ? Enter a single number (-1 or 1; and 0 in case of a tie).
Index | Decision stump | -
1 | Label 1 if | Label -1 otherwise
2 | Label 1 if | Label -1 otherwise
3 | Label 1 if | Label -1 otherwise
4 | Label 1 if | Label -1 otherwise
5 | Label 1 if | Label -1 otherwise
📗 Answer: .
📗 [3 points] In one step of gradient descent for an \(L_{2}\)-regularized logistic regression, suppose \(w\) = , \(b\) = , and \(\dfrac{\partial C}{\partial w}\) = , \(\dfrac{\partial C}{\partial b}\) = . If the learning rate is \(\alpha\) = and the regularization parameter is \(\lambda\) = , what is \(w\) after one iteration? Use the loss \(C\left(w, b\right)\) and the regularization \(\dfrac{\lambda}{2} \left\|\begin{bmatrix} w \\ b \end{bmatrix}\right\|_{2}^{2}\) = \(\dfrac{\lambda}{2} \left(w^{2} + b^{2}\right)\).
📗 Answer: .
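Since the derivative of \(\dfrac{\lambda}{2} \left(w^{2} + b^{2}\right)\) with respect to \(w\) is \(\lambda w\), one step for \(w\) can be sketched as follows. The numbers are hypothetical, and the sketch assumes the given \(\dfrac{\partial C}{\partial w}\) is the gradient of the unregularized loss.

```python
def regularized_step(w, grad_w, alpha, lam):
    """One gradient descent step on C(w, b) + (lam / 2) * (w**2 + b**2) for w."""
    # Gradient of the regularized objective: dC/dw + lam * w.
    return w - alpha * (grad_w + lam * w)


# Hypothetical values, not the ones generated for the question.
print(regularized_step(w=1.0, grad_w=0.5, alpha=0.1, lam=2.0))
```

The values of \(b\) and \(\dfrac{\partial C}{\partial b}\) do not appear, because the update for \(w\) depends only on the partial derivative with respect to \(w\).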
📗 [3 points] Given there are data points, each data point has features, the feature map creates new features (to replace the original features). What is the size of the kernel matrix when training a kernel SVM (Support Vector Machine)? For example, if the matrix is \(2 \times 2\), enter the number \(4\).
📗 Answer: .
📗 [3 points] Suppose the likelihood probabilities of observing "a", "o", "c" in a real movie script is , and the likelihood probabilities of observing "a", "o", "c" in a fake movie script is . Given the prior probabilities, of the scripts are real. How would a Naive Bayes classifier classify a script ""? Enter \(1\) if it is classified as real, enter \(-1\) if it is classified as fake, and enter \(0\) if it's a tie (equally likely to be real and fake).
📗 Answer: .
📗 [3 points] In one iteration of the Perceptron Algorithm, \(x\) = , \(y\) = , and predicted label \(\hat{y} = a\) = . The learning rate \(\alpha = 1\). After the iteration, how many of the weights (including the bias \(b\)) are increased (that is, the change is strictly larger than 0)? If it is impossible to figure out given the information, enter -1.
📗 Answer: .
📗 [2 points] What are the smallest and largest values of the subderivatives of at \(x = 0\)?
📗 Answer (comma separated vector): .
📗 [4 points] Given the following training data, what is the fold cross validation accuracy (i.e. LOOCV, Leave One Out Cross Validation) if NN (Nearest Neighbor) classifier with Manhattan distance is used? Break the tie (in distance) by using the instance with the smaller index. Enter a number between 0 and 1.
Index | 1 | 2 | 3 | 4 | 5
\(x_{i}\) | | | | |
\(y_{i}\) | | | | |
📗 Answer: .
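The leave-one-out procedure can be sketched as below for a single nearest neighbor. The data are hypothetical, the inputs are assumed to be one-dimensional (so the Manhattan distance is an absolute difference), and the function name is illustrative only.

```python
def loocv_1nn_accuracy(xs, ys):
    """LOOCV accuracy of a 1-nearest-neighbor classifier on 1D inputs."""
    correct = 0
    for i in range(len(xs)):
        # Hold out point i and search the remaining points in index order.
        candidates = [j for j in range(len(xs)) if j != i]
        # min() returns the first minimizer, so ties go to the smaller index.
        nearest = min(candidates, key=lambda j: abs(xs[j] - xs[i]))
        correct += ys[nearest] == ys[i]
    return correct / len(xs)


# Hypothetical training data, not the values generated for the question.
print(loocv_1nn_accuracy([0, 1, 2, 5, 6], [0, 0, 1, 1, 1]))
```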
📗 [2 points] There is a total of red or green balls in a bag. How many red balls and how many green balls are there so that the entropy of the color of a randomly selected ball is imized?
📗 Answer (comma separated vector): .
📗 [4 points] A convolutional neural network has an input image of size x that is connected to a convolutional layer that uses a x filter, zero padding of the image, and a stride of 1. There are activation maps. (Here, zero-padding implies that these activation maps have the same size as the input images.) The convolutional layer is then connected to a pooling layer that uses x max pooling, a stride of (non-overlapping, no padding) of the convolutional layer. The pooling layer is then fully connected to an output layer that contains output units. There are no hidden layers between the pooling layer and the output layer. How many different weights must be learned in this whole network, not including any bias?
📗 Answer: .
📗 [4 points] Say we use Naive Bayes in an application where there are features represented by variables, each having possible values, and there are classes. How many probabilities must be stored in the CPTs (Conditional Probability Table) in the Bayesian network for this problem? Do not include probabilities that can be computed from other probabilities.
📗 Answer: .
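In a Naive Bayes network the class node is the only parent of every feature node, so the count splits into the class prior and one conditional table per feature. The sketch below uses hypothetical sizes and a hypothetical function name.

```python
def naive_bayes_cpt_size(n_features, n_values, n_classes):
    """Free probabilities stored in the CPTs of a Naive Bayes network."""
    # Class prior: the last class probability follows from the others.
    prior = n_classes - 1
    # Each feature: for every class, the last value's probability is implied.
    per_feature = (n_values - 1) * n_classes
    return prior + n_features * per_feature


# Hypothetical sizes, not the ones generated for the question.
print(naive_bayes_cpt_size(n_features=2, n_values=2, n_classes=2))
```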
📗 [4 points] Given the following transition matrix for a bigram model with words "Eat" (label 0), "My" (label 1) and "Hammer" (label 2): . Row \(i\) column \(j\) is \(\mathbb{P}\left\{w_{t} = j | w_{t-1} = i\right\}\). Two uniform random numbers between 0 and 1 are generated to simulate the words after "Eat", say \(u_{1}\) = and \(u_{2}\) = . Using the CDF (Cumulative Distribution Function) inversion method (inverse transform method), which two words are generated? Enter two integer labels (0, 1, or 2), not strings.
📗 Answer (comma separated vector): .
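A minimal sketch of CDF inversion on a bigram transition matrix is given below. The matrix and the uniform draws are hypothetical, and the boundary convention (a word is chosen when \(u\) is less than or equal to its cumulative probability) is an assumption.

```python
def sample_next(row, u):
    """Return the smallest label whose cumulative probability is at least u."""
    cumulative = 0.0
    for label, p in enumerate(row):
        cumulative += p
        if u <= cumulative:
            return label
    return len(row) - 1  # guard against floating-point round-off


def generate(transition, start, uniforms):
    """Generate one word per uniform draw, each conditioned on the previous word."""
    words, current = [], start
    for u in uniforms:
        current = sample_next(transition[current], u)
        words.append(current)
    return words


# Hypothetical transition matrix and draws; the start word is "Eat" (label 0).
transition = [[0.2, 0.5, 0.3], [0.1, 0.1, 0.8], [0.6, 0.3, 0.1]]
print(generate(transition, start=0, uniforms=[0.6, 0.15]))
```

Note that the second draw uses the row of the word generated by the first draw, not the row of "Eat".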
📗 [2 points] We have a biased coin with probability of producing Heads. We create a predictor as follows: generate a random number uniformly distributed in (0, 1). If the random number is less than we predict Heads, otherwise, we predict Tails. What is this predictor's (expected) accuracy in predicting the coin's outcome?
📗 Answer: .
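Because the prediction is generated independently of the coin flip, the predictor is correct when both are Heads or both are Tails. The sketch below uses hypothetical probabilities and a hypothetical function name.

```python
def expected_accuracy(p_heads, threshold):
    """Accuracy of predicting Heads with probability `threshold`, independent of the coin."""
    # Correct when both are Heads or both are Tails.
    return p_heads * threshold + (1 - p_heads) * (1 - threshold)


# Hypothetical values, not the ones generated for the question.
print(expected_accuracy(p_heads=0.7, threshold=0.7))
```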
📗 [3 points] We use gradient descent to find the minimum of the function \(f\left(x\right)\) = with step size \(\eta > 0\). If we start from the point \(x_{0}\) = , how small should \(\eta\) be so we make progress in the first iteration? Enter the largest value of \(\eta\) below which we make progress. For example, if we make progress when \(\eta < 0.01\), enter \(0.01\).
📗 Answer: .
📗 [2 points] Suppose an SVM (Support Vector Machine) has \(w\) = and \(b\) = . What is the actual distance between the two planes defined by \(w^\top x + b\) = and \(w^\top x + b\) = ? Note that these are not the SVM plus and minus planes.
📗 Answer: .
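The two planes \(w^\top x + b = c_{1}\) and \(w^\top x + b = c_{2}\) are parallel, so their distance is \(\left|c_{1} - c_{2}\right| / \left\|w\right\|_{2}\). A short sketch with hypothetical values (the names `c1` and `c2` are introduced here for the two right-hand sides):

```python
import math


def plane_distance(w, c1, c2):
    """Distance between the parallel planes w.x + b = c1 and w.x + b = c2."""
    # The shared offset b cancels, so only the gap c1 - c2 and the norm of w matter.
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return abs(c1 - c2) / norm_w


# Hypothetical weight vector and right-hand sides.
print(plane_distance([3.0, 4.0], c1=1.0, c2=-4.0))
```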
📗 [4 points] Consider a kernel \(K\left(x_{i_{1}}, x_{i_{2}}\right)\) = + , where both \(x_{i_{1}}\) and \(x_{i_{2}}\) are 1D positive real numbers. What is the feature vector \(\varphi\left(x_{i}\right)\) induced by this kernel evaluated at \(x_{i}\) = ?
📗 Answer (comma separated vector): .
📗 [3 points] A UFO is hiding in a cloud near Haywood Ranch. On a given day, the UFO hides in the cloud of the time (C = 0), and comes out of the cloud of the time (C = 1). What is the entropy of the binary random variable C? As a reminder, the base-2 logarithm of x can be computed as log(x) / log(2).
📗 Answer: .
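The entropy formula \(H(C) = -\sum_{c} \mathbb{P}\left\{C = c\right\} \log_{2} \mathbb{P}\left\{C = c\right\}\) can be sketched with the log(x) / log(2) identity from the reminder. The probability used in the example call is hypothetical.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a binary variable with P(C = 1) = p."""
    # Terms with probability 0 contribute nothing (0 * log 0 is taken as 0).
    return -sum(q * math.log(q) / math.log(2) for q in (p, 1 - p) if q > 0)


# Hypothetical probability, not the one generated for the question.
print(binary_entropy(0.25))
```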
📗 [3 points] Let a dataset consist of \(n\) = points in \(\mathbb{R}\), specifically, the first \(n - 1\) points are and the last point \(x_{n}\) is unknown. What is the smallest value of \(x_{n}\) above which \(x_{n-1}\) is among \(x_{n}\)'s 3-nearest neighbors, but \(x_{n}\) is NOT among \(x_{n-1}\)'s 3-nearest neighbors? Note that the 3-nearest neighbors of a point in the training set include the point itself.
📗 Answer: .
📗 [4 points] Fill in the missing values in the following joint probability table so that A and B are independent.
- | A = 0 | A = 1
B = 0 | |
B = 1 | ?? | ??
📗 Answer (comma separated vector): .
📗 [3 points] Consider the following directed graphical model over binary variables: \(A \leftarrow B \to C\). Given the CPTs (Conditional Probability Table):
Variable
Probability
Variable
Probability
\(\mathbb{P}\left\{B = 1\right\}\)
\(\mathbb{P}\left\{C = 1 | B = 1\right\}\)
\(\mathbb{P}\left\{C = 1 | B = 0\right\}\)
\(\mathbb{P}\left\{A = 1 | B = 1\right\}\)
\(\mathbb{P}\left\{A = 1 | B = 0\right\}\)
What is the probability \(\mathbb{P}\){ \(A\) = , \(B\) = , \(C\) = }?
📗 Answer: .
📗 [2 points] In a corpus (set of documents) with word types (unique word tokens), the phrase "" appeared times. In particular, "" appeared times and "" appeared . If we estimate probability by frequency (the maximum likelihood estimate) with Laplace smoothing (add-1 smoothing), what is the estimated probability of \(\mathbb{P}\){ | }?
📗 Answer: .
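With add-1 smoothing, the bigram estimate is \(\dfrac{c\left(w_{1} w_{2}\right) + 1}{c\left(w_{1}\right) + V}\), where \(V\) is the number of word types. The sketch below uses hypothetical counts and a hypothetical function name.

```python
def laplace_bigram(count_pair, count_prev, vocab_size):
    """Add-1 smoothed estimate of P(w2 | w1)."""
    # count_pair: occurrences of the phrase "w1 w2"
    # count_prev: occurrences of the conditioning word w1
    # vocab_size: number of word types in the corpus
    return (count_pair + 1) / (count_prev + vocab_size)


# Hypothetical counts, not the ones generated for the question.
print(laplace_bigram(count_pair=2, count_prev=8, vocab_size=12))
```

The count of the second word alone is not needed for this conditional estimate.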
📗 [1 point] Please enter any comments including possible mistakes and bugs with the questions or your answers. If you have no comments, please enter "None": do not leave it blank.
📗 Please do not modify the content in the above text field: use the "Grade" button to update.
📗 Please wait for the message "Successful submission." to appear after the "Submit" button. Please also save the text in the above text box to a file using the button or copy and paste it into a file yourself and submit it to Canvas Assignment MB. You could submit multiple times (but please do not submit too often): only the latest submission will be counted.
📗 You could load your answers from the text (or txt file) in the text box below using the button . The first two lines should be "##x: B" and "##id: your id", and the format of the remaining lines should be "##1: your answer to question 1" newline "##2: your answer to question 2", etc. Please make sure that your answers are loaded correctly before submitting them.