📗 Enter your ID (the wisc email ID without @wisc.edu) here: and click (or hit enter key)
📗 The same ID should generate the same set of questions. Your answers are not saved when you close the browser. You could print the page, solve the problems, then enter all your answers at the end.
📗 In case the questions are not generated correctly, try (1) refresh the page, (2) clear the browser cache, Ctrl+F5 or Ctrl+Shift+R or Shift+Command+R, (3) switch to incognito/private browsing mode, (4) switch to another browser, (5) use a different ID. If none of these work, message me on Zoom.
📗 [3 points] In one iteration of the Perceptron Algorithm, the initial weights are \(w\) = and \(b\) = , with \(x\) = , \(y \in \left\{0, 1\right\}\), and learning rate \(\alpha = 1\). After the iteration, the weights remain unchanged. What is the correct label \(y\)? The LTU perceptron classifier is \(1_{\left\{w x + b \geq 0\right\}}\).
📗 Answer: .
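📗 As a study aid, here is a minimal Python sketch of one perceptron iteration for scalar inputs. It assumes the common update rule \(w \leftarrow w - \alpha \left(a - y\right) x\), \(b \leftarrow b - \alpha \left(a - y\right)\) with \(a = 1_{\left\{w x + b \geq 0\right\}}\); the function names and the numbers are hypothetical and are not the values generated for any ID.

```python
def ltu_predict(w, b, x):
    """LTU activation for scalar inputs: 1 if w*x + b >= 0, else 0."""
    return 1 if w * x + b >= 0 else 0


def perceptron_step(w, b, x, y, alpha=1.0):
    """One perceptron update (assumed rule): move weights by -alpha*(a - y)."""
    a = ltu_predict(w, b, x)
    return w - alpha * (a - y) * x, b - alpha * (a - y)


# Hypothetical values chosen only for illustration.
w, b, x = 2.0, -1.0, 3.0
print(perceptron_step(w, b, x, y=1))  # (2.0, -1.0): prediction matches the label
print(perceptron_step(w, b, x, y=0))  # (-1.0, -2.0): prediction is wrong
```

Under this rule, the weights stay unchanged exactly when the label equals the current prediction.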
📗 [3 points] Suppose you are given a neural network with hidden layers, input units, output units, and hidden units. In one backpropagation step when computing the gradient of the cost (for example, squared loss) with respect to \(w^{\left(1\right)}_{11}\), the weight in layer \(1\) connecting input \(1\) and hidden unit \(1\), how many weights (including \(w^{\left(1\right)}_{11}\) itself, and including biases) are used in the backpropagation step of \(\dfrac{\partial C}{\partial w^{\left(1\right)}_{11}}\)?
📗 The above is a diagram of the network, the nodes labelled "1" are the bias units. You can highlight the edges representing the weights in the diagram, but they are not graded. Note: the backpropagation step assumes the activations in all layers are already known so do not count the weights and biases in the forward step computing the activations.
📗 Answer: .
📗 [3 points] In one step of gradient descent for an \(L_{2}\) regularized logistic regression, suppose \(w\) = , \(b\) = , and \(\dfrac{\partial C}{\partial w}\) = , \(\dfrac{\partial C}{\partial b}\) = . If the learning rate is \(\alpha\) = and the regularization parameter is \(\lambda\) = , what is \(w\) after one iteration? Use the loss \(C\left(w, b\right)\) and the regularization \(\dfrac{\lambda}{2} \left\|\begin{bmatrix} w \\ b \end{bmatrix}\right\|_{2}^{2}\) = \(\dfrac{\lambda}{2} \left(w^{2} + b^{2}\right)\).
📗 Answer: .
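📗 The regularized update can be sketched as follows. Since the gradient of \(\dfrac{\lambda}{2} \left(w^{2} + b^{2}\right)\) is \(\left(\lambda w, \lambda b\right)\), the step subtracts \(\alpha\) times the loss gradient plus that term. The function name and all numbers below are hypothetical, chosen only to illustrate the arithmetic.

```python
def regularized_step(w, b, dC_dw, dC_db, alpha, lam):
    """One gradient descent step on C(w, b) + (lam / 2) * (w**2 + b**2)."""
    w_new = w - alpha * (dC_dw + lam * w)
    b_new = b - alpha * (dC_db + lam * b)
    return w_new, b_new


# Hypothetical inputs, not the generated exam values.
w_new, b_new = regularized_step(w=1.0, b=2.0, dC_dw=0.5, dC_db=-0.5,
                                alpha=0.5, lam=1.0)
print(w_new, b_new)  # 0.25 1.25
```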
📗 [3 points] A hard margin SVM (Support Vector Machine) is trained on the following dataset. Suppose we restrict \(b\) = , what is the value of \(w\)? Enter a single number, i.e. do not include \(b\). Assume the SVM classifier is \(1_{\left\{w x + b \geq 0\right\}}\) (this means it predicts 1 if \(w x + b \geq 0\) and 0 otherwise).
\(x_{i}\)
\(y_{i}\)
📗 Answer: .
📗 [3 points] Suppose there are data points, each with features, and the feature map creates new features (to replace the original features). What is the size of the kernel matrix when training a kernel SVM (Support Vector Machine)? For example, if the matrix is \(2 \times 2\), enter the number \(4\).
📗 Answer: .
📗 [3 points] Given three decision stumps in a random forest in the following table, what is the predicted label for a new data point \(x\) = \(\begin{bmatrix} x_{1} \\ x_{2} \\ ... \end{bmatrix}\) = ? Enter a single number (-1 or 1; and 0 in case of a tie).
Index | Decision stump | -
1 | Label 1 if | Label -1 otherwise
2 | Label 1 if | Label -1 otherwise
3 | Label 1 if | Label -1 otherwise
📗 Answer: .
📗 [3 points] A tweet is ratioed if at least one reply gets more likes than the tweet. Suppose a tweet has replies, and each one of these replies gets more likes than the tweet with probability if the tweet is bad, and probability if the tweet is good. Given a tweet is ratioed, what is the probability that it is a bad tweet? The prior probability of a bad tweet is .
📗 Answer: .
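📗 A sketch of the Bayes rule computation behind this question, assuming the replies are independent given the tweet type, so that the probability of being ratioed with \(n\) replies is \(1 - \left(1 - p\right)^{n}\). The function name and the numbers are hypothetical.

```python
def posterior_bad_given_ratioed(n, p_bad, p_good, prior_bad):
    """P(bad | ratioed), assuming replies are independent given the tweet type."""
    ratioed_given_bad = 1 - (1 - p_bad) ** n    # at least one reply beats the tweet
    ratioed_given_good = 1 - (1 - p_good) ** n
    numerator = ratioed_given_bad * prior_bad
    denominator = numerator + ratioed_given_good * (1 - prior_bad)
    return numerator / denominator


# Hypothetical inputs, not the generated exam values.
print(posterior_bad_given_ratioed(n=2, p_bad=0.5, p_good=0.1, prior_bad=0.5))
```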
📗 [3 points] Given an infinite state sequence in which the pattern "" is repeated an infinite number of times, what is the (maximum likelihood) estimated transition probability from state to (without smoothing)?
📗 Answer: .
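📗 One way to reason about an infinitely repeated pattern is to count transitions over a single period, treating the pattern as cyclic (the last state is followed by the first). The sketch below does that; the pattern "AABAC" and the function name are hypothetical.

```python
def transition_mle(pattern, src, dst):
    """MLE of P(next = dst | current = src) over one cyclic period of the pattern."""
    n = len(pattern)
    # Wrap around so the last state transitions to the first one.
    pairs = [(pattern[k], pattern[(k + 1) % n]) for k in range(n)]
    from_src = [pair for pair in pairs if pair[0] == src]
    return sum(1 for pair in from_src if pair[1] == dst) / len(from_src)


# Hypothetical pattern: A is followed by A, B, and C once each per period.
print(transition_mle("AABAC", "A", "B"))
```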
📗 [3 points] Given a Bayesian network \(A \to B \to C \to D \to E\) of 5 binary event variables with the following conditional probability table (CPT), what is the probability that none of the events happen, \(\mathbb{P}\left\{\neg A, \neg B, \neg C, \neg D, \neg E\right\}\)?
\(\mathbb{P}\left\{A\right\}\) =
\(\mathbb{P}\left\{B | A\right\}\) =
\(\mathbb{P}\left\{C | B\right\}\) =
\(\mathbb{P}\left\{D | C\right\}\) =
\(\mathbb{P}\left\{E | D\right\}\) =
\(\mathbb{P}\left\{\neg A\right\}\) =
\(\mathbb{P}\left\{B | \neg A\right\}\) =
\(\mathbb{P}\left\{C | \neg B\right\}\) =
\(\mathbb{P}\left\{D | \neg C\right\}\) =
\(\mathbb{P}\left\{E | \neg D\right\}\) =
📗 Answer: .
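📗 For a chain \(A \to B \to C \to D \to E\), the joint probability factorizes as \(\mathbb{P}\left\{\neg A\right\} \mathbb{P}\left\{\neg B | \neg A\right\} \mathbb{P}\left\{\neg C | \neg B\right\} \mathbb{P}\left\{\neg D | \neg C\right\} \mathbb{P}\left\{\neg E | \neg D\right\}\), where each factor is one minus the corresponding CPT entry. A minimal sketch with hypothetical probabilities:

```python
def prob_none(p_a, p_b_not_a, p_c_not_b, p_d_not_c, p_e_not_d):
    """P(no event happens) in the chain A -> B -> C -> D -> E.

    Arguments are P(A), P(B | not A), P(C | not B), P(D | not C), P(E | not D).
    """
    result = 1 - p_a
    for p in (p_b_not_a, p_c_not_b, p_d_not_c, p_e_not_d):
        result *= 1 - p  # each negated event is conditioned on its negated parent
    return result


# Hypothetical CPT entries, not the generated exam values.
print(prob_none(0.5, 0.5, 0.5, 0.5, 0.5))  # 0.03125
```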
📗 [2 points] What are the smallest and largest values of subderivatives of at \(x = 0\)?
📗 Answer (comma separated vector): .
📗 [4 points] Say we have a training set consisting of positive examples and negative examples where each example is a point in a two-dimensional, real-valued feature space. What will the classification accuracy be on the training set with NN (Nearest Neighbor)?
📗 Answer: .
📗 [4 points] Given the following training data, what is the fold cross validation accuracy (i.e. LOOCV, Leave One Out Cross Validation) if NN (Nearest Neighbor) classifier with Manhattan distance is used? Break the tie (in distance) by using the instance with the smaller index. Enter a number between 0 and 1.
Index
1
2
3
4
5
6
\(x_{i}\)
\(y_{i}\)
📗 Answer: .
📗 [2 points] There is a total of red or green balls in a bag. How many red balls and how many green balls are there so that the entropy of the color of a randomly selected ball is imized?
📗 Answer (comma separated vector): .
📗 [4 points] In a convolutional neural network, suppose the activation map of a convolution layer is . What is the activation map after a non-overlapping (stride 2) 2 by 2 max-pooling layer?
📗 Answer (matrix with multiple lines, each line is a comma separated vector): .
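📗 Non-overlapping 2 by 2 max pooling keeps the largest entry of each 2 by 2 block. The sketch below works on a list-of-lists matrix with even dimensions; the function name and the example matrix are hypothetical.

```python
def max_pool_2x2(a):
    """Non-overlapping (stride 2) 2x2 max pooling; assumes even row and column counts."""
    return [
        [max(a[i][j], a[i][j + 1], a[i + 1][j], a[i + 1][j + 1])
         for j in range(0, len(a[0]), 2)]
        for i in range(0, len(a), 2)
    ]


# Hypothetical 4x4 activation map.
example = [
    [1, 2, 0, 1],
    [3, 4, 1, 0],
    [0, 0, 5, 6],
    [1, 0, 7, 8],
]
print(max_pool_2x2(example))  # [[4, 1], [1, 8]]
```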
📗 [4 points] A convolutional neural network has an input image of size x that is connected to a convolutional layer that uses a x filter, zero padding of the image, and a stride of 1. There are activation maps. (Here, zero-padding implies that these activation maps have the same size as the input images.) The convolutional layer is then connected to a pooling layer that uses x max pooling, a stride of (non-overlapping, no padding) of the convolutional layer. The pooling layer is then fully connected to an output layer that contains output units. There are no hidden layers between the pooling layer and the output layer. How many different weights must be learned in this whole network, not including any bias?
📗 Answer: .
📗 [3 points] Given two Boolean random variables, \(A\) and \(B\), where \(\mathbb{P}\left\{A\right\}\) = , \(\mathbb{P}\left\{B\right\}\) = , and \(\mathbb{P}\left\{A| \neg B\right\}\) = , what is \(\mathbb{P}\left\{A|B\right\}\)?
📗 Answer: .
📗 [2 points] Let \(X \in\) and \(Y \in\) . What is the least number of probabilities needed to fully specify the CPT (Conditional Probability Table) of \(Y\) given \(X\) (i.e. \(\mathbb{P}\left\{Y | X\right\}\))? Note that this is not a part of the CPTs in the Naive Bayes model. Do not include probabilities that can be computed from other probabilities.
📗 Answer: .
📗 [4 points] Say we use Naive Bayes in an application where there are features represented by variables, each having possible values, and there are classes. How many probabilities must be stored in the CPTs (Conditional Probability Table) in the Bayesian network for this problem? Do not include probabilities that can be computed from other probabilities.
📗 Answer: .
📗 [4 points] Given the following transition matrix for a bigram model with words "I" (label 0), "am" (label 1) and "Groot" (label 2): . Row \(i\) column \(j\) is \(\mathbb{P}\left\{w_{t} = j | w_{t-1} = i\right\}\). Two uniform random numbers between 0 and 1 are generated to simulate the words after "I", say \(u_{1}\) = and \(u_{2}\) = . Using the CDF (Cumulative Distribution Function) inversion method (inverse transform method), which two words are generated? Enter two integer labels (0, 1, or 2), not strings.
📗 Answer (comma separated vector): .
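📗 CDF inversion picks the first label whose cumulative probability reaches the uniform draw, and the sampled word then selects the row for the next draw. The sketch below assumes the convention "smallest \(j\) with \(u \leq\) cumulative sum"; the transition matrix, the draws, and the function names are hypothetical.

```python
def sample_next(row, u):
    """Return the smallest index j such that u <= row[0] + ... + row[j]."""
    cumulative = 0.0
    for j, p in enumerate(row):
        cumulative += p
        if u <= cumulative:
            return j
    return len(row) - 1  # guard against floating-point round-off


def generate(matrix, start, us):
    """Generate one word label per uniform draw, starting after the word `start`."""
    words, current = [], start
    for u in us:
        current = sample_next(matrix[current], u)
        words.append(current)
    return words


# Hypothetical transition matrix and uniform draws.
matrix = [
    [0.1, 0.8, 0.1],
    [0.2, 0.2, 0.6],
    [0.5, 0.25, 0.25],
]
print(generate(matrix, start=0, us=[0.5, 0.9]))  # [1, 2]
```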
📗 [3 points] What is the minimum zero-one cost of a binary (y is either 0 or 1) linear (threshold) classifier (for example, an LTU (Linear Threshold Unit) perceptron) on the following data set?
\(x_{i}\)
1
2
3
4
5
6
\(y_{i}\)
📗 Answer: .
📗 [2 points] In a three-layer (fully connected) neural network, the first hidden layer contains sigmoid units, the second hidden layer contains units, and the output layer contains units. The input is dimensional. How many weights plus biases does this neural network have? Enter one number.
📗 Answer: .
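📗 In a fully connected network where every non-input unit has one bias, a layer with \(n_{in}\) inputs and \(n_{out}\) units contributes \(\left(n_{in} + 1\right) n_{out}\) parameters. A minimal sketch with hypothetical layer sizes:

```python
def count_parameters(layer_sizes):
    """Total weights plus biases; layer_sizes lists input, hidden..., output sizes."""
    return sum((n_in + 1) * n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


# Hypothetical sizes: 4 inputs, hidden layers of 3 and 2 units, 1 output unit.
print(count_parameters([4, 3, 2, 1]))  # 26
```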
📗 [2 points] Suppose an SVM (Support Vector Machine) has \(w\) = and \(b\) = . What is the actual distance between the two planes defined by \(w^\top x + b = -1\) and \(w^\top x + b = 1\)?
📗 Answer: .
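📗 The distance between the two planes is \(\dfrac{2}{\left\|w\right\|_{2}}\) and does not depend on \(b\). A minimal sketch with a hypothetical weight vector:

```python
import math


def margin_width(w):
    """Distance between the planes w.x + b = -1 and w.x + b = 1."""
    return 2 / math.sqrt(sum(wi * wi for wi in w))


# Hypothetical weights with Euclidean norm 5.
print(margin_width([3.0, 4.0]))  # 0.4
```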
📗 [4 points] Consider a kernel \(K\left(x_{i_{1}}, x_{i_{2}}\right)\) = + + , where both \(x_{i_{1}}\) and \(x_{i_{2}}\) are 1D positive real numbers. What is the feature vector \(\varphi\left(x_{i}\right)\) induced by this kernel evaluated at \(x_{i}\) = ?
📗 Answer (comma separated vector): .
📗 [3 points] Statistically, cats are often hungry around 6:00 am (I am making this up). At that time, a cat is hungry of the time (C = 1), and not hungry of the time (C = 0). What is the entropy of the binary random variable C? Recall that the base-2 logarithm of x can be computed as log(x) / log(2) or log2(x).
📗 Answer: .
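📗 The entropy of a binary variable with \(\mathbb{P}\left\{C = 1\right\} = p\) is \(-p \log_{2} p - \left(1 - p\right) \log_{2} \left(1 - p\right)\). A minimal sketch, with hypothetical probabilities and the usual convention that \(0 \log 0 = 0\):

```python
import math


def binary_entropy(p):
    """Entropy in bits of a binary variable that equals 1 with probability p."""
    if p in (0.0, 1.0):
        return 0.0  # convention: 0 * log(0) = 0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


# Hypothetical probabilities.
print(binary_entropy(0.5))   # 1.0
print(binary_entropy(0.25))
```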
📗 [3 points] Let a dataset consist of \(n\) = points in \(\mathbb{R}\), specifically, the first \(n - 1\) points are and the last point \(x_{n}\) is unknown. What is the smallest value of \(x_{n}\) above which \(x_{n-1}\) is among \(x_{n}\)'s 3-nearest neighbors, but \(x_{n}\) is NOT among \(x_{n-1}\)'s 3-nearest neighbors? Note that the 3-nearest neighbors of a point in the training set include the point itself.
📗 Answer: .
📗 [4 points] Fill in the missing values in the following joint probability table so that A and B are independent.
-
A = 0
A = 1
B = 0
B = 1
??
??
📗 Answer (comma separated vector): .
📗 [2 points] In a corpus with word tokens, the phrase "Home Lander" appeared times (not Homelander). In particular, "Home" appeared times and "Lander" appeared times. If we estimate probability by frequency (the maximum likelihood estimate) without smoothing, what is the estimated probability \(\mathbb{P}\left\{\text{Lander} | \text{Home}\right\}\)?
📗 Answer: .
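📗 Without smoothing, the maximum likelihood estimate of a bigram probability is the count of the pair divided by the count of the first word. The sketch below counts both from a tiny hypothetical token list; the function name and the corpus are made up for illustration.

```python
def bigram_mle(tokens, first, second):
    """Estimate P(second | first) as count(first second) / count(first)."""
    first_count = sum(1 for t in tokens if t == first)
    pair_count = sum(1 for a, b in zip(tokens, tokens[1:])
                     if a == first and b == second)
    return pair_count / first_count


# Hypothetical corpus: "Home" appears 3 times, "Home Lander" appears twice.
tokens = "Home Lander went Home and Home Lander slept".split()
print(bigram_mle(tokens, "Home", "Lander"))
```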
📗 [4 points] Given the following transition matrix for a bigram model with words "", "" and "": . Row \(i\) column \(j\) is \(\mathbb{P}\left\{w_{t} = j | w_{t-1} = i\right\}\). What is the probability that the third word is "" given the first word is ""?
📗 Answer: .
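📗 The probability of the third word given the first sums over every possible second word: \(\mathbb{P}\left\{w_{3} = j | w_{1} = i\right\} = \displaystyle\sum_{k} \mathbb{P}\left\{w_{2} = k | w_{1} = i\right\} \mathbb{P}\left\{w_{3} = j | w_{2} = k\right\}\), which is entry \(\left(i, j\right)\) of the squared transition matrix. A minimal sketch with a hypothetical matrix:

```python
def two_step(matrix, i, j):
    """P(w_3 = j | w_1 = i): sum over the intermediate word k."""
    return sum(matrix[i][k] * matrix[k][j] for k in range(len(matrix)))


# Hypothetical transition matrix; each row sums to 1.
transition = [
    [0.5, 0.25, 0.25],
    [0.25, 0.5, 0.25],
    [0.25, 0.25, 0.5],
]
print(two_step(transition, 0, 2))  # 0.3125
```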
📗 [1 point] Give an estimate of the number of previous questions on this exam you think you answered correctly. Please enter an integer between 0 and the total number of questions on the exam (minus 2): do not leave it blank.
📗 Answer: .
📗 [1 point] Please enter any comments including possible mistakes and bugs with the questions or your answers. If you have no comments, please enter "None": do not leave it blank.
📗 Please do not modify the content in the above text field: use the "Grade" button to update.
📗 Please wait for the message "Successful submission." to appear after the "Submit" button. Please also save the text in the above text box to a file using the button or copy and paste it into a file yourself and submit it to Canvas Assignment MA. You could submit multiple times (but please do not submit too often): only the latest submission will be counted.
📗 You could load your answers from the text (or txt file) in the text box below using the button . The first two lines should be "##x: A" and "##id: your id", and the format of the remaining lines should be "##1: your answer to question 1" newline "##2: your answer to question 2", etc. Please make sure that your answers are loaded correctly before submitting them.