📗 [4 points] Given the training set below, find the leaf labels of the decision tree that achieve 100 percent accuracy. Enter \(\hat{y}_{1}, \hat{y}_{2}, \hat{y}_{3}, \hat{y}_{4}\) as a vector.
📗 The training set:
\(x_{1}\) \(x_{2}\) \(y\)
\(0\) \(0\)
\(0\) \(1\)
\(1\) \(0\)
\(1\) \(1\)

📗 The decision tree:
if \(x_{1} \leq 0.5\) if \(x_{2} \leq 0.5\) label \(\hat{y}_{1}\)
- else \(x_{2} > 0.5\) label \(\hat{y}_{2}\)
else \(x_{1} > 0.5\) if \(x_{2} \leq 0.5\) label \(\hat{y}_{3}\)
- else \(x_{2} > 0.5\) label \(\hat{y}_{4}\)

📗 Answer (comma separated vector): .
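The tree above sends each of the four training points to its own leaf, so its accuracy can be checked directly. The following is a minimal Python sketch of that check. The labels in `training` are hypothetical placeholders (an XOR pattern), since the \(y\) column is not shown here, and the function name `tree_predict` is an illustrative choice.

```python
def tree_predict(x1, x2, leaf_labels):
    """Evaluate the depth-2 tree: split on x1 at 0.5, then on x2 at 0.5."""
    y1, y2, y3, y4 = leaf_labels
    if x1 <= 0.5:
        return y1 if x2 <= 0.5 else y2
    return y3 if x2 <= 0.5 else y4

# Hypothetical labels (not from the question), listed in the table's row order.
training = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

# Each row reaches a different leaf, in order, so copying the labels
# row by row gives a tree that fits the training set exactly.
leaf_labels = [y for _, y in training]

correct = sum(tree_predict(x1, x2, leaf_labels) == y for (x1, x2), y in training)
accuracy = correct / len(training)
print(leaf_labels, accuracy)
```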
📗 [4 points] Given a neural network with 1 hidden layer with hidden units, suppose the current hidden layer weights are \(w^{\left(1\right)}\) = = , and the output layer weights are \(w^{\left(2\right)}\) = = . Given an instance (item) \(x\) = and \(y\) = , the activation values are \(a^{\left(1\right)}\) = = and \(a^{\left(2\right)}\) = . What is the updated weight \(w^{\left(1\right)}_{21}\) after one step of stochastic gradient descent based on \(x\) with learning rate \(\alpha\) = ? The activation functions are all and the cost is square loss.

📗 Reminder: logistic activation has gradient \(\dfrac{\partial a_{i}}{\partial z_{i}} = a_{i} \left(1 - a_{i}\right)\), tanh activation has gradient \(\dfrac{\partial a_{i}}{\partial z_{i}} = 1 - a_{i}^{2}\), ReLU activation has gradient \(\dfrac{\partial a_{i}}{\partial z_{i}} = 1_{\left\{a_{i} \geq 0\right\}}\), and square cost has gradient \(\dfrac{\partial C_{i}}{\partial a_{i}} = a_{i} - y_{i}\).
📗 Answer: .
📗 [3 points] Given the following training set, what is the maximum accuracy of a decision tree with depth 1 trained on this set? Enter a number between 0 and 1.
index \(x_{1}\) \(y\)

📗 Answer: .
📗 [4 points] Given the two training points and and their labels \(0\) and \(1\), what is the kernel (Gram) matrix if the RBF (radial basis function) Gaussian kernel with \(\sigma\) = is used? Use the formula \(K_{i i'} = e^{- \dfrac{1}{2 \sigma^{2}} \left(x_{i} - x_{i'}\right)^\top \left(x_{i} - x_{i'}\right)}\).
📗 Answer (matrix with multiple lines, each line is a comma separated vector): .
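The Gram matrix formula above can be evaluated numerically. The following is a minimal sketch in Python with NumPy. The two points and the value of `sigma` are hypothetical placeholders, not the values from the question, and `rbf_gram` is an illustrative name.

```python
import numpy as np

def rbf_gram(X, sigma):
    """Return K with K[i, j] = exp(-(x_i - x_j)^T (x_i - x_j) / (2 sigma^2))."""
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]    # pairwise differences x_i - x_j
    sq_dist = np.sum(diff ** 2, axis=-1)     # squared Euclidean distances
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

# Hypothetical inputs (not from the question).
K = rbf_gram([[0.0, 0.0], [1.0, 1.0]], sigma=1.0)
print(K)
```

The diagonal entries are always 1 because each point has distance 0 to itself, and the matrix is symmetric. The labels play no role in the kernel matrix.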
📗 [3 points] Suppose a soft margin support vector machine is trained on two points, \(x_{1}\) = , \(y_{1}\) = and \(x_{2}\) = , \(y_{2}\) = . Given the regularization parameter \(\lambda\) = , what is the soft margin loss at \(w\) = and \(b\) = ? Use \(C = \dfrac{\lambda}{2} w^\top w + \dfrac{1}{n} \displaystyle\sum_{i=1}^{n} \displaystyle\max\left\{0, 1 - \left(2 y_{i} - 1\right)\left(w^\top x_{i} + b\right)\right\}\).
📗 Answer: .
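The soft margin cost above is a direct sum, so it can be sketched in a few lines. This is a minimal Python example that follows the stated formula with labels in \(\left\{0, 1\right\}\) mapped to \(\pm 1\) through \(2 y_{i} - 1\). All numeric inputs are hypothetical placeholders, and `soft_margin_loss` is an illustrative name.

```python
import numpy as np

def soft_margin_loss(X, y, w, b, lam):
    """C = (lam / 2) w^T w + (1 / n) sum_i max(0, 1 - (2 y_i - 1)(w^T x_i + b))."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    signs = 2.0 * y - 1.0                    # map labels {0, 1} to {-1, +1}
    margins = signs * (X @ w + b)
    hinge = np.maximum(0.0, 1.0 - margins)   # per-point hinge loss
    return 0.5 * lam * (w @ w) + hinge.mean()

# Hypothetical inputs (not from the question).
loss = soft_margin_loss(X=[[1.0, 0.0], [0.0, 1.0]], y=[1, 0],
                        w=[0.5, -0.5], b=0.0, lam=1.0)
print(loss)
```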
📗 [4 points] A convolutional neural network has an input image of size x that is connected to a convolutional layer that uses a x filter, zero padding of the image, and a stride of 1. There are activation maps. (Here, zero-padding implies that these activation maps have the same size as the input images.) The convolutional layer is then connected to a pooling layer that uses x max pooling, a stride of (non-overlapping, no padding) of the convolutional layer. The pooling layer is then fully connected to an output layer that contains output units. There are no hidden layers between the pooling layer and the output layer. How many different weights must be learned in this whole network, not including any bias?
📗 Answer: .
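The count above splits into convolution weights and fully connected weights, since max pooling has nothing to learn. The following Python sketch shows one way to tally them. It assumes a square single-channel input, square filters shared across positions within each activation map, and an image size divisible by the pooling size. None of these assumptions, nor the example sizes, come from the question.

```python
def cnn_weight_count(image_size, filter_size, num_maps, pool_size, num_outputs,
                     in_channels=1):
    """Count learnable weights (no biases) for conv -> max pool -> dense output."""
    # One shared filter per activation map.
    conv_weights = filter_size * filter_size * in_channels * num_maps
    # "Same" padding keeps each map at image_size x image_size before pooling.
    pooled_size = image_size // pool_size
    pooled_units = pooled_size * pooled_size * num_maps
    # Every pooled unit connects to every output unit.
    dense_weights = pooled_units * num_outputs
    return conv_weights + dense_weights

# Hypothetical sizes (not from the question): 8 x 8 image, 3 x 3 filter,
# 4 activation maps, 2 x 2 pooling, 10 output units.
total = cnn_weight_count(image_size=8, filter_size=3, num_maps=4,
                         pool_size=2, num_outputs=10)
print(total)
```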
📗 [3 points] A hospital trains a decision tree to predict if any given patient has technophobia or not. The training set consists of patients. There are features. The labels are binary. The decision tree is not pruned. What are the smallest and largest possible training set accuracies of the decision tree? Enter two numbers between 0 and 1. Hint: patients with the same features may have different labels.
📗 Answer (comma separated vector): .
📗 [4 points] Consider a linear SVM (Support Vector Machine) that perfectly classifies a set of training data containing positive examples and negative examples. What is the minimum possible number of training examples that need to be removed to cause the margin of the linear SVM to increase? If this is impossible, enter "-1".
📗 Answer: .
📗 [4 points] Suppose the squared loss is used to do stochastic gradient descent for logistic regression, i.e. \(C = \dfrac{1}{2} \displaystyle\sum_{i=1}^{n} \left(a_{i} - y_{i}\right)^{2}\) where \(a_{i} = \dfrac{1}{1 + e^{- w x_{i} - b}}\). Given the current weight \(w\) = and bias \(b\) = , with \(x_{i}\) = , \(y_{i}\) = , \(a_{i}\) = (no need to recompute this value), with learning rate \(\alpha\) = . What is the updated after the iteration? Enter a single number.
📗 Answer: .
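For the squared loss with a logistic activation, the chain rule gives \(\dfrac{\partial C}{\partial w} = \left(a_{i} - y_{i}\right) a_{i} \left(1 - a_{i}\right) x_{i}\) and \(\dfrac{\partial C}{\partial b} = \left(a_{i} - y_{i}\right) a_{i} \left(1 - a_{i}\right)\) for a single item. A minimal Python sketch of one stochastic gradient step follows. The numeric values are hypothetical placeholders rather than the ones in the question, and `sgd_step` is an illustrative name.

```python
def sgd_step(w, b, x, y, a, alpha):
    """One SGD step for squared loss with logistic activation, on one item."""
    delta = (a - y) * a * (1.0 - a)    # dC/dz via the chain rule
    w_new = w - alpha * delta * x       # dC/dw = delta * x
    b_new = b - alpha * delta           # dC/db = delta
    return w_new, b_new

# Hypothetical inputs (not from the question).
w_new, b_new = sgd_step(w=0.0, b=0.0, x=1.0, y=1.0, a=0.5, alpha=1.0)
print(w_new, b_new)
```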
📗 [4 points] "It" has a house with many doors. A random door is about to be opened with equal probability. Doors to have monsters that eat people. Doors to are safe. With a sufficient bribe, Pennywise will answer your question "Will door 1 be opened?" What's the information gain (also called mutual information) between Pennywise's answer and your encounter with a monster?
📗 Answer: .
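The information gain above equals the entropy of the monster encounter minus its conditional entropy given the answer. The Python sketch below computes it in bits (base-2 logarithm, an assumption). The door counts are hypothetical placeholders, and `info_gain` is an illustrative name.

```python
import math

def entropy(p):
    """Binary entropy in bits of an event with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def info_gain(num_doors, monster_doors):
    """I(answer; monster) when one of num_doors doors opens uniformly at random."""
    if num_doors < 2:
        raise ValueError("need at least two doors")
    monster_doors = set(monster_doors)
    p_monster = len(monster_doors) / num_doors
    p_yes = 1.0 / num_doors               # probability that door 1 is opened
    # If the answer is yes, the outcome is known, so that branch has zero entropy.
    # If the answer is no, the opened door is uniform over the other doors.
    p_monster_given_no = len(monster_doors - {1}) / (num_doors - 1)
    conditional = (1.0 - p_yes) * entropy(p_monster_given_no)
    return entropy(p_monster) - conditional

# Hypothetical setup (not from the question): 4 doors, monsters behind doors 1 and 2.
gain = info_gain(num_doors=4, monster_doors=[1, 2])
print(gain)
```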
📗 [4 points] List English letters from A to Z: ABCDEFGHIJKLMNOPQRSTUVWXYZ. Define the distance between two letters in the natural way, that is \(d\left(A, A\right) = 0\), \(d\left(A, B\right) = 1\), \(d\left(A, C\right) = 2\) and so on. Each letter has a label, are labeled 0, and the others are labeled 1. This is your training data. Now classify each letter using kNN (k Nearest Neighbor) for odd \(k = 1, 3, 5, 7, ...\). What is the smallest \(k\) where all letters are classified the same (same label, i.e. either all labels are 0s or all labels are 1s)? Break ties by preferring the earlier letters in the alphabet. Hint: the nearest neighbor of a letter is the letter itself.
📗 Answer: .
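The procedure above can be simulated by brute force. The Python sketch below sorts candidate neighbors by distance and breaks distance ties in favor of the earlier letter, which is how the tie rule is interpreted here. The labeling (A, B, C labeled 0, the rest labeled 1) is a hypothetical placeholder, since the labeled letters are not shown in the question.

```python
import string

LETTERS = string.ascii_uppercase

def knn_predict(labels, query, k):
    """Majority label among the k nearest letters to position `query`."""
    # Sort by distance first, then by index so earlier letters win ties.
    order = sorted(range(len(labels)), key=lambda j: (abs(j - query), j))
    votes = sum(labels[j] for j in order[:k])
    return 1 if 2 * votes > k else 0     # k is odd, so the vote cannot tie

def smallest_uniform_k(labels):
    """Smallest odd k for which every letter receives the same prediction."""
    for k in range(1, len(labels) + 1, 2):
        predictions = {knn_predict(labels, i, k) for i in range(len(labels))}
        if len(predictions) == 1:
            return k
    return None

# Hypothetical labeling (not from the question): A, B, C are 0, the rest are 1.
labels = [0 if letter in "ABC" else 1 for letter in LETTERS]
k_star = smallest_uniform_k(labels)
print(k_star)
```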
📗 [4 points] In a convolutional neural network, suppose the activation map of a convolution layer is . What is the activation map after a non-overlapping (stride 2) 2 by 2 max-pooling layer?
📗 Answer (matrix with multiple lines, each line is a comma separated vector): .
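Non-overlapping 2 by 2 max pooling keeps the largest entry of each 2 by 2 block. The following is a minimal NumPy sketch. The 4 by 4 activation map is a hypothetical placeholder, not the one from the question, and `max_pool_2x2` is an illustrative name.

```python
import numpy as np

def max_pool_2x2(activation):
    """Stride-2, 2 by 2 max pooling of a 2D array with even height and width."""
    a = np.asarray(activation)
    h, w = a.shape
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")
    # Split rows and columns into blocks of 2, then take the max of each block.
    return a.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# Hypothetical activation map (not from the question).
pooled = max_pool_2x2([[1, 2, 3, 4],
                       [5, 6, 7, 8],
                       [9, 10, 11, 12],
                       [13, 14, 15, 16]])
print(pooled)
```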
📗 [4 points] Consider the problem of detecting whether an email message is spam. Say we use four random variables to model this problem: a binary class variable \(S\) indicating whether the message is spam, and three binary feature variables \(C, F, N\) indicating whether the message contains "Cash", "Free", "Now". We use a Naive Bayes classifier with the associated CPTs (conditional probability tables):
Prior \(\mathbb{P}\left\{S = 1\right\}\) = - -
Hams \(\mathbb{P}\left\{C = 1 | S = 0\right\}\) = \(\mathbb{P}\left\{F = 1 | S = 0\right\}\) = \(\mathbb{P}\left\{N = 1 | S = 0\right\}\) =
Spams \(\mathbb{P}\left\{C = 1 | S = 1\right\}\) = \(\mathbb{P}\left\{F = 1 | S = 1\right\}\) = \(\mathbb{P}\left\{N = 1 | S = 1\right\}\) =

Compute \(\mathbb{P}\){\(C\) = , \(F\) = , \(N\) = }.
📗 Answer: .
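Under the Naive Bayes assumption the features are independent given \(S\), so the probability of a feature combination is obtained by summing over both classes: \(\mathbb{P}\left\{C, F, N\right\} = \displaystyle\sum_{s} \mathbb{P}\left\{S = s\right\} \mathbb{P}\left\{C | S = s\right\} \mathbb{P}\left\{F | S = s\right\} \mathbb{P}\left\{N | S = s\right\}\). The Python sketch below evaluates this sum. All probabilities and the observed values are hypothetical placeholders, not the CPT entries from the question.

```python
def bernoulli(p_one, value):
    """Probability of a binary value given P{value = 1} = p_one."""
    return p_one if value == 1 else 1.0 - p_one

def feature_probability(prior_spam, p_given_ham, p_given_spam, observed):
    """P{features = observed}, marginalizing over the class S."""
    total = 0.0
    for p_class, cpt in ((1.0 - prior_spam, p_given_ham),
                         (prior_spam, p_given_spam)):
        likelihood = 1.0
        for name, value in observed.items():
            likelihood *= bernoulli(cpt[name], value)   # conditional independence
        total += p_class * likelihood
    return total

# Hypothetical CPT values (not from the question).
ham = {"C": 0.1, "F": 0.2, "N": 0.3}
spam = {"C": 0.8, "F": 0.7, "N": 0.6}
prob = feature_probability(0.5, ham, spam, {"C": 1, "F": 0, "N": 1})
print(prob)
```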
📗 [3 points] Consider the following directed graphical model over binary variables: \(A \to  B \leftarrow C\). Given the CPTs (conditional probability tables):
Variable Probability Variable Probability
\(\mathbb{P}\left\{A = 1\right\}\)
\(\mathbb{P}\left\{C = 1\right\}\)
\(\mathbb{P}\left\{B = 1 | A = C = 1\right\}\) \(\mathbb{P}\left\{B = 1 | A = 0, C = 1\right\}\)
\(\mathbb{P}\left\{B = 1 | A = 1, C = 0\right\}\) \(\mathbb{P}\left\{B = 1 | A = C = 0\right\}\)

What is the probability \(\mathbb{P}\){ \(A\) = , \(B\) = , \(C\) = }?
📗 Answer: .
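For the network \(A \to B \leftarrow C\), the joint distribution factors as \(\mathbb{P}\left\{A, B, C\right\} = \mathbb{P}\left\{A\right\} \mathbb{P}\left\{C\right\} \mathbb{P}\left\{B | A, C\right\}\). The following Python sketch evaluates that product. The CPT numbers and the queried assignment are hypothetical placeholders, not the values from the question.

```python
def joint_probability(p_a, p_c, p_b_given, a, b, c):
    """P{A = a, B = b, C = c} for the network A -> B <- C.

    p_a and p_c are P{A = 1} and P{C = 1}; p_b_given maps (a, c) to P{B = 1 | a, c}.
    """
    prob_a = p_a if a == 1 else 1.0 - p_a
    prob_c = p_c if c == 1 else 1.0 - p_c
    p_b_one = p_b_given[(a, c)]
    prob_b = p_b_one if b == 1 else 1.0 - p_b_one
    return prob_a * prob_c * prob_b

# Hypothetical CPT values (not from the question).
p_b_given = {(1, 1): 0.9, (0, 1): 0.5, (1, 0): 0.4, (0, 0): 0.1}
joint = joint_probability(0.3, 0.6, p_b_given, a=1, b=0, c=1)
print(joint)
```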
📗 [1 point] Please enter any comments, including possible mistakes and bugs with the questions or your answers. If you have no comments, please enter "None"; do not leave it blank.
📗 Answer: .

