
# Warning: this is a replica of the homework page for testing purposes; please use the M2 page for homework submission.


# M2 Written (Math) Problems

📗 Enter your ID (your wisc email ID without @wisc.edu) here: and click (or hit the enter key).
📗 You can also load from your saved file and click .
📗 If the questions are not generated correctly, try refreshing the page using the button at the top left corner.
📗 The same ID should generate the same set of questions. Your answers are not saved when you close the browser. You can print the page, solve the problems, then enter all your answers at the end.
📗 Please do not refresh the page: your answers will not be saved.

# Warning: please enter your ID before you start!



# Question 1

📗 [4 points] Consider a Linear Threshold Unit (LTU) perceptron with initial weights \(w\) = and bias \(b\) = trained using the Perceptron Algorithm. Given a new input \(x\) = with label \(y\) = and learning rate \(\alpha\) = , compute the updated weights and bias \(w', b'\) = :
Hint See Spring 2018 Final Q7, Spring 2017 Final Q3. The perceptron learning formula using the notations in this question is: \(w' = w - \alpha \left(a - y\right) x\) and \(b' = b - \alpha \left(a - y\right)\) where \(a = 1_{\left\{w^\top x + b \geq 0\right\}}\). Note that this is not a gradient descent procedure: it just happens to use a similar formula.
📗 Answer (comma separated vector): .
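📗 As a sanity check, the hint's update rule can be computed in a few lines of code. Below is a minimal sketch in Python; the values of \(w\), \(b\), \(x\), \(y\), and \(\alpha\) are made up, so substitute the ones generated for your ID.

```python
import numpy as np

# Hypothetical values; substitute the ones generated for your ID.
w = np.array([0.5, -1.0])   # initial weights
b = 0.5                     # initial bias
x = np.array([1.0, 2.0])    # new input
y = 1                       # desired label
alpha = 0.1                 # learning rate

a = 1 if w @ x + b >= 0 else 0    # LTU activation
w_new = w - alpha * (a - y) * x   # perceptron weight update
b_new = b - alpha * (a - y)       # perceptron bias update
print(w_new, b_new)
```
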
# Question 2

📗 [6 points] With a linear threshold unit perceptron, implement the following function. That is, you should write down the weights \(w_{0}, w_{A}, w_{B}\). Enter the bias first, then the weights on A and B.
| A | B | function |
|---|---|----------|
| 0 | 0 | |
| 0 | 1 | |
| 1 | 0 | |
| 1 | 1 | |


📗 You can plot your line given by \(w_{0}, w_{A}, w_{B}\) to see if it separates the dataset correctly: . If no green line shows up, it means the entire line is outside of the range [0, 1].
Hint See Fall 2011 Final Q10, Spring 2018 Final Q4, Fall 2006 Final Q16, Fall 2005 Final Q16. There can be many possible answers. A possible answer should satisfy the following conditions: \(1_{\left\{w_{0} + 0 w_{A} + 0 w_{B} \geq 0\right\}} = f\left(A = 0, B = 0\right)\), \(1_{\left\{w_{0} + 0 w_{A} + 1 w_{B} \geq 0\right\}} = f\left(A = 0, B = 1\right)\), \(1_{\left\{w_{0} + 1 w_{A} + 0 w_{B} \geq 0\right\}} = f\left(A = 1, B = 0\right)\), \(1_{\left\{w_{0} + 1 w_{A} + 1 w_{B} \geq 0\right\}} = f\left(A = 1, B = 1\right)\).
📗 Answer (comma separated vector): .
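📗 One way to verify a candidate answer is to check the indicator conditions in the hint against every row of the truth table. The sketch below assumes the target function is logical AND (your generated function will differ) and uses one possible set of weights for it.

```python
import itertools

# Hypothetical target: logical AND; your generated function may differ.
def f(A, B):
    return int(A and B)

# Candidate weights: bias w0, then w_A, w_B (one possible choice for AND).
w0, wA, wB = -1.5, 1.0, 1.0

for A, B in itertools.product([0, 1], repeat=2):
    pred = 1 if w0 + wA * A + wB * B >= 0 else 0
    assert pred == f(A, B), (A, B)  # every row of the truth table must match
print("weights implement the function")
```
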
# Question 3

📗 [3 points] What is the minimum number of training items that need to be removed so that a Perceptron can learn the remaining training set (with 100 percent accuracy)?
| \(x_{1}\) | \(x_{2}\) | \(x_{3}\) | \(y\) |
|---|---|---|---|
| 0 | 0 | 0 | |
| 0 | 0 | 1 | |
| 0 | 1 | 0 | |
| 0 | 1 | 1 | |
| 1 | 0 | 0 | |
| 1 | 0 | 1 | |
| 1 | 1 | 0 | |
| 1 | 1 | 1 | |


Hint See Fall 2019 Final Q9 Q10, Spring 2018 Final Q17, Fall 2009 Final Q17 Q19, Fall 2008 Final Q3. A perceptron can learn the training set only if it is linearly separable (i.e. the two classes can be separated by a plane).
📗 Answer: .
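📗 Since the training set has only 8 items, a brute-force search works: try removing subsets of increasing size and test whether the remaining points are linearly separable. The sketch below uses a linear-programming feasibility check via scipy (an assumption, not part of the course materials) and hypothetical labels; substitute the \(y\) column generated for your ID.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

# All 8 binary inputs; the y column is hypothetical (substitute the generated one).
X = np.array(list(itertools.product([0, 1], repeat=3)), dtype=float)
y = np.array([0, 1, 1, 0, 1, 0, 0, 1])

def separable(X, y):
    # Feasibility LP: is there (w, b) with w.x + b >= 1 for y = 1 and <= -1 for y = 0?
    signs = np.where(y == 1, 1.0, -1.0)
    A_ub = -signs[:, None] * np.hstack([X, np.ones((len(X), 1))])
    b_ub = -np.ones(len(X))
    res = linprog(np.zeros(X.shape[1] + 1), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (X.shape[1] + 1))
    return res.success

for k in range(len(X) + 1):
    if any(separable(np.delete(X, list(idx), axis=0), np.delete(y, list(idx), axis=0))
           for idx in itertools.combinations(range(len(X)), k)):
        print("minimum number of removals:", k)
        break
```
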
# Question 4

📗 [3 points] What is the minimum zero-one cost of a binary (y is either 0 or 1) linear (threshold) classifier (for example, an LTU perceptron) on the following data set?
| \(x_{i}\) | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| \(y_{i}\) | | | | | | |


📗 A linear classifier is a vertical line that separates the two classes: you want to draw the line such that the fewest mistakes (i.e. the lowest zero-one cost) are made.
Hint The zero-one cost is \(C = \displaystyle\min_{b} \displaystyle\sum_{i=1}^{n} 1_{\left\{\hat{y}_{i} \neq y_{i}\right\}}\), where \(\hat{y}_{i}\) is the prediction of the classifier. A linear classifier with threshold \(t\) is either in form (1) \(\hat{y}_{i} = 0\) when \(x_{i} \leq t\) and \(\hat{y}_{i} = 1\) when \(x_{i} > t\) or (2) \(\hat{y}_{i} = 1\) when \(x_{i} \leq t\) and \(\hat{y}_{i} = 0\) when \(x_{i} > t\). For this question, you can try \(t = 0, 1, 2, 3, 4, 5, 6, 7\) and check which one leads to the smallest zero-one cost.
📗 Answer: .
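📗 The hint's search over thresholds is easy to automate. The labels below are hypothetical; replace them with the generated \(y_{i}\) row.

```python
# Hypothetical labels for x = 1..6; substitute the generated y values.
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 1, 0, 1, 1]

best = len(xs)
for t in range(0, 8):            # candidate thresholds from the hint
    for flip in (False, True):   # orientation (1) or (2) from the hint
        preds = [(x > t) != flip for x in xs]
        cost = sum(int(p) != y for p, y in zip(preds, ys))
        best = min(best, cost)
print("minimum zero-one cost:", best)
```
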
# Question 5

📗 [3 points] Let \(x = \left(x_{1}, x_{2}, x_{3}\right)\). We want to minimize the objective function \(f\left(x\right)\) = using gradient descent. Let the step size be \(\eta\) = . If we start at the vector \(x^{\left(0\right)}\) = , what is the next vector \(x^{\left(1\right)}\) produced by gradient descent?
Hint See Fall 2017 Final Q15, Fall 2010 Final Q5, Fall 2006 Midterm Q11, Fall 2005 Midterm Q5. The gradient descent formula using the notations in this question is: \(x^{\left(1\right)} = x^{\left(0\right)} - \eta \nabla f\left(x^{\left(0\right)}\right)\) where \(\nabla f\left(x^{\left(0\right)}\right) = \begin{bmatrix} \dfrac{\partial f}{\partial x^{\left(0\right)}_{1}} \\ \dfrac{\partial f}{\partial x^{\left(0\right)}_{2}} \\ \dfrac{\partial f}{\partial x^{\left(0\right)}_{3}} \end{bmatrix}\).
📗 Answer (comma separated vector): .
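📗 A minimal sketch of one gradient descent step, assuming a hypothetical objective \(f\left(x\right) = x_{1}^{2} + 2 x_{2}^{2} + 3 x_{3}^{2}\) and made-up values for \(\eta\) and \(x^{\left(0\right)}\); your generated function and numbers will differ.

```python
import numpy as np

# Hypothetical objective; your generated f(x) will differ.
def grad(x):
    # gradient of f(x) = x1^2 + 2*x2^2 + 3*x3^2
    return np.array([2 * x[0], 4 * x[1], 6 * x[2]])

eta = 0.1                        # step size
x0 = np.array([1.0, -1.0, 2.0])  # starting point
x1 = x0 - eta * grad(x0)         # one gradient descent step
print(x1)
```
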
# Question 6

📗 [3 points] We use gradient descent to find the minimum of the function \(f\left(x\right)\) = with step size \(\eta > 0\). If we start from the point \(x_{0}\) = , how small should \(\eta\) be so that we make progress in the first iteration? Check all values of \(\eta\) that make progress.
Hint See Fall 2017 Final Q7, Fall 2014 Midterm Q17, Fall 2013 Final Q10. The minimum is 0, so "making progress" means getting closer to 0 in the first iteration. The gradient descent formula using the notations in this question is: \(x_{1} = x_{0} - \eta f'\left(x_{0}\right)\). The learning rate \(\eta\) that makes progress should satisfy \(\left| x_{0} - \eta f'\left(x_{0}\right) \right| < \left| x_{0} \right|\).

\(\eta\) = 0
📗 The green point is the current \(x_{0}\). You can change the \(\eta\) value using the slider, see the \(x\) value at the next iteration as the red point, and check whether it gets closer to the minimum.
📗 Choices:





None of the above
📗 Calculator: .
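📗 The progress condition in the hint can be checked for each candidate step size. The sketch below assumes the hypothetical objective \(f\left(x\right) = x^{2}\) and starting point \(x_{0} = 3\); use your generated function, starting point, and choices.

```python
# Hypothetical objective f(x) = x^2 (minimum 0); your generated f will differ.
def fprime(x):
    return 2 * x

x0 = 3.0                          # hypothetical starting point
for eta in [0.1, 0.5, 1.0, 1.5]:  # hypothetical candidate step sizes
    x1 = x0 - eta * fprime(x0)
    makes_progress = abs(x1) < abs(x0)
    print(eta, x1, makes_progress)
```
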
# Question 7

📗 [3 points] Let \(g\left(z\right) = \dfrac{1}{1 + \exp\left(-z\right)}, z = w^\top x = w_{1} x_{1} + w_{2} x_{2} + ... + w_{d} x_{d}\), \(d\) = be a sigmoid perceptron with inputs \(x_{1} = ... = x_{d}\) = and weights \(w_{1} = ... = w_{d}\) = . There is no bias term. If the desired output is \(y\) = , and the sigmoid perceptron update rule has a learning rate of \(\alpha\) = , what will happen after one step of update? Each \(w_{i}\) will change by (enter a number, positive for increase and negative for decrease).
Hint See Fall 2016 Final Q15, Fall 2011 Midterm Q11. The change for each \(w_{i}\) is \(-\alpha \left(a - y\right) x_{i}\) where \(a = g\left(z\right), z = w^\top x\). There is no bias added to the \(z\) term here.
📗 Answer: .
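📗 A minimal sketch of the sigmoid perceptron update from the hint, with made-up values for \(d\), \(x_{i}\), \(w_{i}\), \(y\), and \(\alpha\); substitute the generated ones.

```python
import math

# Hypothetical values; substitute the ones generated for your ID.
d = 4            # number of inputs
x_i = 0.5        # common value of x_1 = ... = x_d
w_i = 1.0        # common value of w_1 = ... = w_d
y = 1            # desired output
alpha = 0.1      # learning rate

z = d * w_i * x_i                 # w'x with identical entries, no bias
a = 1 / (1 + math.exp(-z))        # sigmoid activation
delta_w = -alpha * (a - y) * x_i  # change applied to each w_i
print(delta_w)
```
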
# Question 8

📗 [2 points] Consider a rectified linear unit (ReLU) with input \(x\) and a bias term. The output can be written as \(y\) = . Here, the weight is and the bias is . Write down the input value \(x\) that produces a specific output \(y\) = .

📗 The red curve is a plot of the activation function, given the y-value of the green point, the question is asking for its x-value.
Hint See Fall 2017 Final Q23. If \(y > 0\), there is a unique \(x\) that solves \(y = \displaystyle\max\left(0, w_{0} + w_{1} x\right) = w_{0} + w_{1} x\). If \(y < 0\), there is no \(x\) that solves the expression. If \(y = 0\), the set of \(x\) that solves the expression is given by \(w_{0} + w_{1} x \leq 0\); you can find the largest and smallest values of this set.
📗 Answer: .
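📗 The case analysis in the hint translates directly into code. The weight, bias, and target output below are hypothetical placeholders for the generated values.

```python
# Hypothetical weight and bias; substitute the generated values.
w1, w0 = 2.0, -1.0   # y = max(0, w0 + w1 * x)
y = 3.0              # target output

if y > 0:
    x = (y - w0) / w1   # unique solution on the linear part
    print("x =", x)
elif y == 0:
    # any x with w0 + w1 * x <= 0 works; report the boundary of that set
    print("boundary x =", -w0 / w1)
else:
    print("no solution: ReLU output is never negative")
```
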
# Question 9

📗 [2 points] Consider a single sigmoid perceptron with bias weight \(w_{0}\) = , a single input \(x_{1}\) with weight \(w_{1}\) = , and the sigmoid activation function \(g\left(z\right) = \dfrac{1}{1 + \exp\left(-z\right)}\). For what input \(x_{1}\) does the perceptron output the value \(a\) = ?

📗 The red curve is a plot of the activation function, given the y-value of the green point, the question is asking for its x-value.
Hint See Fall 2012 Final Q8, Fall 2014 Midterm Q16. Using the notations in this question: \(z = w_{0} + w_{1} x_{1}\) is the linear part, and \(a\) is the output or activation in the lectures. There should be a unique \(x_{1}\) that satisfies the expression \(\dfrac{1}{1 + \exp\left(-\left(w_{0} + w_{1} x_{1}\right)\right)} = a\).
📗 Note: Math.js does not accept "ln(...)"; please use "log(...)" instead.
📗 Answer: .
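📗 Inverting the sigmoid gives \(x_{1} = \dfrac{\log\left(a / \left(1 - a\right)\right) - w_{0}}{w_{1}}\). A minimal sketch with made-up values for \(w_{0}\), \(w_{1}\), and \(a\); substitute the generated ones.

```python
import math

# Hypothetical weights and target activation; substitute the generated values.
w0, w1 = 0.5, 2.0
a = 0.9

# Invert a = 1 / (1 + exp(-(w0 + w1 * x1))):  w0 + w1 * x1 = log(a / (1 - a))
x1 = (math.log(a / (1 - a)) - w0) / w1
print(x1)

# sanity check: should print a value very close to 0.9
print(1 / (1 + math.exp(-(w0 + w1 * x1))))
```
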
# Question 10

📗 [3 points] Which of the following functions are equal to the squared error for deterministic binary classification? \(C = \displaystyle\sum_{i=1}^{n} \left(f\left(x_{i}\right) - y_{i}\right)^{2}, f\left(x_{i}\right) \in \left\{0, 1\right\}, y_{i} \in \left\{0, 1\right\}\). Note: \(I_{S}\) is the indicator notation on \(S\).
📗 Note: the question is asking for the functions that are identical in values.
Hint For deterministic binary classification, both the predicted and actual labels \(f\left(x_{i}\right)\) and \(y_{i}\) are either \(0\) or \(1\). Therefore, just compare the values of the two functions for all four cases: \(f\left(x_{i}\right) = 0, y_{i} = 0\); \(f\left(x_{i}\right) = 1, y_{i} = 0\); \(f\left(x_{i}\right) = 0, y_{i} = 1\); and \(f\left(x_{i}\right) = 1, y_{i} = 1\).
📗 Choices:
\(\displaystyle\sum_{i=1}^{n}\)
\(\displaystyle\sum_{i=1}^{n}\)
\(\displaystyle\sum_{i=1}^{n}\)
\(\displaystyle\sum_{i=1}^{n}\)
\(\displaystyle\sum_{i=1}^{n}\)
None of the above
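📗 Each choice can be compared to the squared error by enumerating the four \(\left(f\left(x_{i}\right), y_{i}\right)\) cases, as in the hint. The sketch below uses \(\left|f\left(x_{i}\right) - y_{i}\right|\) as a hypothetical candidate summand; the actual choices are generated per ID.

```python
import itertools

# Hypothetical candidate summand; the actual choices are generated per ID.
def candidate(f, y):
    return abs(f - y)

def squared_error(f, y):
    return (f - y) ** 2

# Compare the two summands on all four deterministic 0/1 cases.
equal = all(candidate(f, y) == squared_error(f, y)
            for f, y in itertools.product([0, 1], repeat=2))
print("identical on all cases:", equal)
```
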
# Question 11

📗 [1 point] Please enter any comments and suggestions, including possible mistakes and bugs with the questions and the auto-grading, and materials relevant to solving the questions that you think are not covered well during the lectures. If you have no comments, please enter "None": do not leave it blank.
📗 Answer: .

# Grade




📗 You can save the text in the above text box to a file using the button, or copy and paste it into a file yourself.
📗 You can load your answers from the text (or txt file) in the text box below using the button. The first two lines should be "##m: 2" and "##id: your id", and the format of the remaining lines should be "##1: your answer to question 1" newline "##2: your answer to question 2", etc. Please make sure that your answers are loaded correctly before submitting them.








Last Updated: November 30, 2024 at 4:34 AM