
# Warning: this is a replica of the homework page for testing purposes; please use M3 for homework submission.


# M3 Written (Math) Problems

📗 Enter your ID (your wisc email ID without @wisc.edu) here: and click (or hit the enter key).
📗 You can also load from your saved file and click .
📗 If the questions are not generated correctly, try refreshing the page using the button at the top left corner.
📗 The same ID should generate the same set of questions. Your answers are not saved when you close the browser. You could print the page: , solve the problems, then enter all your answers at the end.
📗 Please do not refresh the page: your answers will not be saved.

# Warning: please enter your ID before you start!


# Question 1

📗 [4 points] Consider the following neural network, which classifies all the training instances correctly. What are the labels (0 or 1) of the training data? The activation function for all units is the LTU (linear threshold unit): \(1_{\left\{z \geq 0\right\}}\). The first layer weight matrix is , with bias vector , and the second layer weight vector is , with bias .
| \(x_{i1}\) | \(x_{i2}\) | \(y_{i}\) or \(o_{1}\) |
| --- | --- | --- |
| 0 | 0 | ? |
| 0 | 1 | ? |
| 1 | 0 | ? |
| 1 | 1 | ? |


📗 Note: if the weights are not shown clearly, you can move the nodes around with the mouse or by touch.
Hint: See Fall 2010 Final Q17. First compute the hidden layer units: \(h_{j} = 1_{\left\{w^{\left(1\right)}_{1j} x_{1} + w^{\left(1\right)}_{2j} x_{2} + b_{j} \geq 0\right\}}\), then compute the output (which is equal to the training data label): \(y = o_{1} = 1_{\left\{w^{\left(2\right)}_{1} h_{1} + w^{\left(2\right)}_{2} h_{2} + b \geq 0\right\}}\). Repeat the computation for \(\left(x_{1}, x_{2}\right) = \left(0, 0\right), \left(0, 1\right), \left(1, 0\right), \left(1, 1\right)\).
📗 Answer (comma separated vector): .
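
📗 As an illustration of the hint's forward-pass computation, the sketch below evaluates a hypothetical 2-2-1 LTU network; the weights here implement XOR and are not the ones generated for your ID, so substitute the values shown in your own diagram.

```python
import numpy as np

# Linear threshold unit: outputs 1 if z >= 0 and 0 otherwise.
def ltu(z):
    return (np.asarray(z) >= 0).astype(int)

# Hypothetical example weights (NOT the ones generated for your ID):
# a 2-2-1 LTU network; replace these with the values from your diagram.
W1 = np.array([[1.0, 1.0],    # [w^(1)_11, w^(1)_12]
               [1.0, 1.0]])   # [w^(1)_21, w^(1)_22]
b1 = np.array([-0.5, -1.5])   # hidden layer biases [b_1, b_2]
w2 = np.array([1.0, -2.0])    # second layer weights [w^(2)_1, w^(2)_2]
b2 = -0.5                     # output bias b

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    h = ltu(np.array([x1, x2]) @ W1 + b1)   # hidden units h_1, h_2
    o = 1 if w2 @ h + b2 >= 0 else 0        # output o_1 = label y
    print(x1, x2, "->", o)
```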

# Question 2

📗 [4 points] Consider the following neural network, which classifies all the training instances correctly. What are the labels (0 or 1) of the training data? The activation function for all units is the LTU (linear threshold unit): \(1_{\left\{z \geq 0\right\}}\). The first layer weight matrix is , with bias vector , and the second layer weight vector is , with bias .
| \(x_{i1}\) | \(x_{i2}\) | \(y_{i}\) or \(o_{1}\) |
| --- | --- | --- |
| 0 | 0 | ? |
| 0 | 1 | ? |
| 1 | 0 | ? |
| 1 | 1 | ? |


📗 Note: if the weights are not shown clearly, you can move the nodes around with the mouse or by touch.
Hint: See Fall 2010 Final Q17. First compute the hidden layer units: \(h_{j} = 1_{\left\{w^{\left(1\right)}_{1j} x_{1} + w^{\left(1\right)}_{2j} x_{2} + b_{j} \geq 0\right\}}\), then compute the output (which is equal to the training data label): \(y = o_{1} = 1_{\left\{w^{\left(2\right)}_{1} h_{1} + w^{\left(2\right)}_{2} h_{2} + b \geq 0\right\}}\). Repeat the computation for \(\left(x_{1}, x_{2}\right) = \left(0, 0\right), \left(0, 1\right), \left(1, 0\right), \left(1, 1\right)\).
📗 Answer (comma separated vector): .

# Question 3

📗 [4 points] Fill in the missing weight below so that the network computes the following function. All inputs take values 0 or 1, and the perceptrons are linear threshold units. The first layer weight matrix is , with bias vector , and the second layer weight vector is , with bias .
| \(x_{1}\) | \(x_{2}\) | \(y\) or \(o_{1}\) |
| --- | --- | --- |
| 0 | 0 |  |
| 0 | 1 |  |
| 1 | 0 |  |
| 1 | 1 |  |


📗 Note: if the weights are not shown clearly, you can move the nodes around with the mouse or by touch.
Hint: See Fall 2010 Final Q17. There are many possible answers: the weights should not be computed using gradient descent because all the other weights are fixed. One approach is to first figure out the hidden unit values (either 0 or 1 in this case) using the given weights, then solve the inequality: if hidden unit \(j\) is 0, \(w^{\left(1\right)}_{1j} x_{1} + w^{\left(1\right)}_{2j} x_{2} + b_{j} < 0\), and if the hidden unit is 1, \(w^{\left(1\right)}_{1j} x_{1} + w^{\left(1\right)}_{2j} x_{2} + b_{j} \geq 0\); or if the output is 0, \(w^{\left(2\right)}_{1} h_{1} + w^{\left(2\right)}_{2} h_{2} + b < 0\), and if the output is 1, \(w^{\left(2\right)}_{1} h_{1} + w^{\left(2\right)}_{2} h_{2} + b \geq 0\).
📗 Answer: .
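
📗 One way to apply the hint is to scan candidate values for the missing weight and keep the ones that reproduce the target truth table. The sketch below uses a hypothetical setup (missing weight \(w^{\left(1\right)}_{11}\), target function XOR, and fixed weights that are not from your generated instance) purely to show the procedure.

```python
import numpy as np

ltu = lambda z: (np.asarray(z) >= 0).astype(int)

def network(x1, x2, W1, b1, w2, b2):
    h = ltu(np.array([x1, x2]) @ W1 + b1)   # hidden layer
    return 1 if w2 @ h + b2 >= 0 else 0     # output o_1

# Hypothetical setup (NOT your generated instance): suppose the missing
# weight is w^(1)_11, the target function is XOR, and the remaining
# weights and biases are fixed as below.
target = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
b1 = np.array([-0.5, -1.5])
w2 = np.array([1.0, -2.0])
b2 = -0.5

for candidate in np.arange(-3.0, 3.5, 0.5):          # scan a grid of candidate values
    W1 = np.array([[candidate, 1.0], [1.0, 1.0]])    # fill in the missing entry
    if all(network(x1, x2, W1, b1, w2, b2) == y for (x1, x2), y in target.items()):
        print("w^(1)_11 =", candidate, "reproduces the table")
```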

# Question 4

📗 [4 points] Fill in the missing weight below so that the network computes the following function. All inputs take values 0 or 1, and the perceptrons are linear threshold units. The first layer weight matrix is , with bias vector , and the second layer weight vector is , with bias .
| \(x_{1}\) | \(x_{2}\) | \(y\) or \(o_{1}\) |
| --- | --- | --- |
| 0 | 0 |  |
| 0 | 1 |  |
| 1 | 0 |  |
| 1 | 1 |  |


📗 Note: if the weights are not shown clearly, you can move the nodes around with the mouse or by touch.
Hint: See Fall 2010 Final Q17. There are many possible answers: the weights should not be computed using gradient descent because all the other weights are fixed. One approach is to first figure out the hidden unit values (either 0 or 1 in this case) using the given weights, then solve the inequality: if hidden unit \(j\) is 0, \(w^{\left(1\right)}_{1j} x_{1} + w^{\left(1\right)}_{2j} x_{2} + b_{j} < 0\), and if the hidden unit is 1, \(w^{\left(1\right)}_{1j} x_{1} + w^{\left(1\right)}_{2j} x_{2} + b_{j} \geq 0\); or if the output is 0, \(w^{\left(2\right)}_{1} h_{1} + w^{\left(2\right)}_{2} h_{2} + b < 0\), and if the output is 1, \(w^{\left(2\right)}_{1} h_{1} + w^{\left(2\right)}_{2} h_{2} + b \geq 0\).
📗 Answer: .

# Question 5

📗 [2 points] Let the input \(x \in \mathbb{R}\), so the input layer has a single input \(x\). The network has 5 hidden layers, and each hidden layer has 10 units. The output layer has a single unit and outputs \(y \in \mathbb{R}\). Between layers, the network is fully connected. All units in the network have a bias input. All units are linear units, that is, the activation function is the identity function \(a = g\left(z\right) = z\), where \(z = w^\top x + b\) is a linear combination of all inputs to that unit (including the bias). Which functions can this network compute?
Hint: See Fall 2017 Final Q19, Spring 2017 Final Q4. A combination of linear units can still only compute linear functions; we need non-linear activation functions for neural networks to approximate arbitrary continuous functions.
📗 Choices:





None of the above
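
📗 The hint's observation can be checked numerically: the sketch below builds a 1-10-10-10-10-10-1 network with identity activations and random weights (hypothetical values, for illustration only) and collapses it into a single affine map, showing that the stacked linear layers compute nothing more than a linear (affine) function of \(x\).

```python
import numpy as np

rng = np.random.default_rng(0)

# A 1 -> 10 -> 10 -> 10 -> 10 -> 10 -> 1 network with identity activations.
sizes = [1, 10, 10, 10, 10, 10, 1]
Ws = [rng.standard_normal((m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
bs = [rng.standard_normal(n) for n in sizes[1:]]

def forward(x):
    a = np.array([x], dtype=float)
    for W, b in zip(Ws, bs):
        a = a @ W + b          # identity activation: a = g(z) = z
    return a[0]

# Collapse all layers into one affine map y = m*x + c.
M = np.eye(1)
c = np.zeros(1)
for W, b in zip(Ws, bs):
    c = c @ W + b
    M = M @ W

for x in [-2.0, 0.0, 3.5]:
    print(forward(x), float(x * M[0, 0] + c[0]))   # identical: the network is affine in x
```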

# Question 6

📗 [2 points] In a three-layer (fully connected) neural network, the first layer contains sigmoid units, the second layer contains units, and the output layer contains units. The input is dimensional. How many weights plus biases does this neural network have? Enter one number.

📗 The above is a diagram of the network; the nodes labelled "1" are the bias units.
Hint: See Fall 2019 Final Q14, Fall 2013 Final Q8, Fall 2006 Final Q17, Fall 2005 Final Q17. Three-layer neural networks have one input layer (with the same number of units as the input dimension), two hidden layers, and one output layer (usually with the same number of units as the number of classes (labels), but if there are only two classes, the number of units can be 1). We use the convention of calling neural networks with four layers "three-layer neural networks" because only three layers have weights and biases (so we do not count the input layer). The number of weights between two consecutive layers (\(m_{1}\) units in the previous layer, \(m_{2}\) units in the next layer) is \(m_{1} \cdot m_{2}\), and the number of biases is \(m_{2}\).
📗 Answer: .
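
📗 The counting rule in the hint is easy to automate; the sketch below uses hypothetical layer sizes (the input dimension and unit counts in your instance are generated per ID, so substitute your own numbers).

```python
# Count weights + biases in a fully connected network, following the hint:
# between consecutive layers with m1 and m2 units there are m1*m2 weights
# and m2 biases.  The layer sizes below are hypothetical placeholders.
def count_parameters(layer_sizes):
    total = 0
    for m1, m2 in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += m1 * m2 + m2
    return total

# e.g. input dimension 4, hidden layers of 5 and 3 units, 1 output unit
print(count_parameters([4, 5, 3, 1]))  # (4*5+5) + (5*3+3) + (3*1+1) = 47
```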

# Question 7

📗 [3 points] The sigmoid function in a neural network is defined as \(g\left(x\right) = \dfrac{1}{1 + e^{-x}}\). There is another activation function defined as \(h\left(x\right)\) = . If \(h\left(x\right) = a \cdot g\left(b \cdot x\right) + c\), write down the values of \(a, b, c\) (constants; they cannot be functions of \(x\)). In the diagram, the green line is \(h\left(x\right)\) and the red line is \(a \cdot g\left(b \cdot x\right) + c\) with the \(a, b, c\) you selected.
Hint: See Fall 2017 Final Q23. Some relations that may be useful: \(1 - \dfrac{1}{1 + e^{-x}} = \dfrac{e^{-x}}{1 + e^{-x}}\) and \(\dfrac{e^{x}}{e^{x} + e^{-x}} = \dfrac{1}{1 + e^{-2 x}}\).

📗 Answers:
\(a\) = 0
\(b\) = 0
\(c\) = 0
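
📗 The particular \(h\left(x\right)\) is generated per ID, but as a purely hypothetical example, if \(h\left(x\right)\) were \(\tanh\left(x\right)\), then \(a = 2, b = 2, c = -1\), since \(\tanh\left(x\right) = 2 g\left(2 x\right) - 1\); the sketch below checks that identity numerically.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical case: if h(x) were tanh(x), then a = 2, b = 2, c = -1,
# because tanh(x) = 2 * sigmoid(2x) - 1.  Your generated h(x) may differ.
a, b, c = 2.0, 2.0, -1.0
x = np.linspace(-5, 5, 11)
print(np.allclose(np.tanh(x), a * sigmoid(b * x) + c))  # True
```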


# Question 8

📗 [3 points] Suppose you are given a neural network with hidden layers, input units, output units, and hidden units. In one backpropagation step when computing the gradient of the cost (for example, squared loss) with respect to \(w^{\left(1\right)}_{11}\), the weight in layer \(1\) connecting input \(1\) and hidden unit \(1\), how many weights (including \(w^{\left(1\right)}_{11}\) itself, and including biases) are used in the backpropagation step of \(\dfrac{\partial C}{\partial w^{\left(1\right)}_{11}}\)?

📗 The above is a diagram of the network; the nodes labelled "1" are the bias units. You can highlight the edges representing the weights in the diagram, but they are not graded. Note: the backpropagation step assumes the activations in all layers are already known, so do not count the weights and biases used in the forward step that computes the activations.
Hint: Write down the chain rule to compute the gradient \(\dfrac{\partial C}{\partial w^{\left(1\right)}_{11}}\) and note that each element in the sum is a chain (product of derivatives) from the first unit of the first hidden layer to one of the output nodes. Therefore, the problem is equivalent to finding all paths from the first unit of the first hidden layer to all the output nodes. Draw a few paths to see the pattern and figure out a formula to count the total number of (distinct) edges in all those paths; that is the number of weights used in backpropagation.
📗 Answer: .
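
📗 Following the hint, one reading of the count is the number of distinct edges on all paths from the first unit of the first hidden layer to the output units, plus one for \(w^{\left(1\right)}_{11}\) itself as stated in the question; the sketch below computes that count for hypothetical layer sizes (replace them with the counts from your question).

```python
# Count the distinct weights used when backpropagating dC/dw^(1)_11,
# following the hint: all distinct edges on paths from unit 1 of the first
# hidden layer to every output unit, plus w^(1)_11 itself.
# The layer sizes below are hypothetical placeholders.
def weights_used_in_backprop(hidden_sizes, n_outputs):
    sizes = hidden_sizes + [n_outputs]
    # Paths start at unit 1 of the first hidden layer: its outgoing edges go to
    # every unit of the next layer (sizes[1] edges); from the second hidden layer
    # onward every unit is reachable, so all sizes[k]*sizes[k+1] edges appear.
    edges = sizes[1]
    for k in range(1, len(sizes) - 1):
        edges += sizes[k] * sizes[k + 1]
    return edges + 1  # + w^(1)_11 itself

# e.g. two hidden layers of 3 units each and 2 output units
print(weights_used_in_backprop([3, 3], 2))  # 3 + 3*2 + 1 = 10
```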

# Question 9

📗 [4 points] Consider a linear model \(a_{i} = w^\top x_{i} + b\), with the cross entropy cost function \(C\) = . The initial weight is \(\begin{bmatrix} w \\ b \end{bmatrix}\) = . What are the updated weight and bias after one (stochastic) gradient descent step if the chosen training data point is \(x\) = , \(y\) = ? The learning rate is .
Hint: The derivative of the cost function with respect to the weights given one training data point \(i\) can be computed as \(\dfrac{\partial C}{\partial w_{j}} = \dfrac{\partial C}{\partial a_{i}} \dfrac{\partial a_{i}}{\partial w_{j}}\), where \(\dfrac{\partial C}{\partial a_{i}}\) depends on the cost function given in the question and \(\dfrac{\partial a_{i}}{\partial w_{j}}\) is \(x_{i j}\) since the activation function is linear. The updated weight \(j\) can be found using the gradient descent formula \(w_{j} - \alpha \dfrac{\partial C}{\partial w_{j}}\). The derivative and update for \(b\) can be computed similarly.
📗 Answer (comma separated vector): .
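
📗 A sketch of one stochastic gradient descent step is given below; the cost, initial weights, data point, and learning rate in your instance are generated per ID, so the numbers here, and the logistic cross-entropy used for \(\dfrac{\partial C}{\partial a_{i}}\), are hypothetical placeholders.

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Hypothetical placeholders -- replace with the values from your question.
w = np.array([0.5, -0.5])   # initial weight vector w
b = 0.0                     # initial bias b
x = np.array([1.0, 2.0])    # chosen training point x
y = 1.0                     # its label y
alpha = 0.1                 # learning rate

a = w @ x + b               # linear activation a_i = w'x_i + b
# dC/da for the *example* cost C = -y log(g(a)) - (1-y) log(1 - g(a));
# substitute the derivative of the cost function given in your question.
dC_da = sigmoid(a) - y

w = w - alpha * dC_da * x   # dC/dw_j = (dC/da) * x_j
b = b - alpha * dC_da       # dC/db   = (dC/da) * 1
print(np.append(w, b))      # updated [w, b]
```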

# Question 10

📗 [2 points] A test set \(\left(x_{1}, y_{1}\right), ..., \left(x_{100}, y_{100}\right)\) contains labels \(y_{i}\) = for \(i = 1, ..., 100\). A classifier simply predicts all the time (the labels are +1 and -1). What is this classifier's test accuracy?
Hint: See Fall 2014 Final Q4, Fall 2010 Final Q1. Write down the first few labels, say for \(i = 1, 2, 3, 4\), to see the pattern.
📗 Enter a fraction to represent the accuracy: for example, enter 0.5 if the accuracy is 50 percent and enter 1 if the accuracy is 100 percent.
📗 Answer: .
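
📗 As a hypothetical illustration only (your label rule and the constant prediction are generated per ID): if the labels alternated as \(y_{i} = \left(-1\right)^{i}\) and the classifier always predicted +1, the accuracy could be computed as follows.

```python
import numpy as np

# Hypothetical example (your generated rule will differ): suppose the test
# labels alternate, y_i = (-1)^i for i = 1..100, and the classifier always
# predicts +1.  Accuracy = fraction of labels matched by the constant prediction.
i = np.arange(1, 101)
y = (-1.0) ** i            # labels -1, +1, -1, +1, ...
prediction = +1
accuracy = np.mean(y == prediction)
print(accuracy)            # 0.5 in this hypothetical case
```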

# Question 11

📗 [1 point] Please enter any comments and suggestions, including possible mistakes and bugs with the questions and the auto-grading, and materials relevant to solving the questions that you think are not covered well during the lectures. If you have no comments, please enter "None": do not leave it blank.
📗 Answer: .

# Grade




📗 You could save the text in the above text box to a file using the button , or copy and paste it into a file yourself.
📗 You could load your answers from the text (or a txt file) in the text box below using the button . The first two lines should be "##m: 3" and "##id: your id", and the format of the remaining lines should be "##1: your answer to question 1", then on a new line "##2: your answer to question 2", etc. Please make sure that your answers are loaded correctly before submitting them.
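
📗 For reference, a saved answer file in the format described above might look like the following sketch; the ID and answer values are placeholders, not actual answers.

```
##m: 3
##id: your_wisc_id
##1: 0,1,1,0
##2: 1,0,0,1
##3: 2.5
...
##11: None
```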








Last Updated: April 29, 2024 at 1:11 AM