📗 Enter your ID (the wisc email ID without @wisc.edu) here: and click (or hit the enter key).
📗 The official deadline is June 27, but you can submit or resubmit without penalty until July 18.
📗 The same ID should generate the same set of questions. Your answers are not saved when you close the browser. You could print the page, solve the problems, and then enter all your answers at the end.
📗 Please do not refresh the page: your answers will not be saved.
📗 [3 points] We use gradient descent to find the minimum of the function \(f\left(x\right)\) = with step size \(\eta > 0\). If we start from the point \(x_{0}\) = , how small should \(\eta\) be so that we make progress in the first iteration? Check all values of \(\eta\) that make progress.
Hint
See Fall 2017 Final Q7, Fall 2014 Midterm Q17, Fall 2013 Final Q10. The minimum is 0, so "making progress" means getting closer to 0 in the first iteration. The gradient descent formula using the notations in this question is: \(x_{1} = x_{0} - \eta f'\left(x_{0}\right)\). The learning rate \(\eta\) that makes progress should satisfy \(\left| x_{0} - \eta f'\left(x_{0}\right) \right| < \left| x_{0} \right|\).
\(\eta\) =
📗 The green point is the current \(x_{0}\). You can change the value of \(\eta\) using the slider, see the \(x\) value in the next iteration as the red point, and check whether it gets closer to the minimum.
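📗 As a sanity check, the progress condition in the hint can be tested numerically. The sketch below uses a made-up objective \(f(x) = x^{2}\) and starting point \(x_{0} = 4\) (your own \(f\) and \(x_{0}\) will differ); it tries a few candidate step sizes and reports which ones move \(x\) closer to the minimizer.
```python
# Minimal sketch with assumed values: f(x) = x^2, x0 = 4.
# Replace fprime and x0 with the derivative and starting point from your question.

def fprime(x):
    return 2 * x                          # derivative of the assumed f(x) = x^2

x0 = 4.0
for eta in [0.1, 0.5, 1.0, 1.5]:          # candidate step sizes to test
    x1 = x0 - eta * fprime(x0)            # one gradient descent step
    print(eta, x1, abs(x1) < abs(x0))     # True means progress toward the minimum at 0
```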
📗 Choices:
None of the above
📗 Calculator: .
📗 [3 points] Let \(x = \left(x_{1}, x_{2}, x_{3}\right)\). We want to minimize the objective function \(f\left(x\right)\) = using gradient descent. Let the stepsize \(\eta\) = . If we start at the vector \(x^{\left(0\right)}\) = , what is the next vector \(x^{\left(1\right)}\) produced by gradient descent?
Hint
See Fall 2017 Final Q15, Fall 2010 Final Q5, Fall 2006 Midterm Q11, Fall 2005 Midterm Q5. The gradient descent formula using the notations in this question is: \(x^{\left(1\right)} = x^{\left(0\right)} - \eta \nabla f\left(x^{\left(0\right)}\right)\) where \(\nabla f\left(x^{\left(0\right)}\right) = \begin{bmatrix} \dfrac{\partial f}{\partial x^{\left(0\right)}_{1}} \\ \dfrac{\partial f}{\partial x^{\left(0\right)}_{2}} \\ \dfrac{\partial f}{\partial x^{\left(0\right)}_{3}} \end{bmatrix}\).
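📗 A minimal sketch of this computation, assuming a made-up objective \(f(x) = x_{1}^{2} + x_{2}^{2} + x_{3}^{2}\), step size \(\eta = 0.1\), and starting vector \(x^{(0)} = (1, 2, 3)\); substitute the gradient and numbers from your question.
```python
# Minimal sketch with assumed values: f(x) = x1^2 + x2^2 + x3^2, eta = 0.1, x0 = (1, 2, 3).

def grad_f(x):
    return [2 * xi for xi in x]               # gradient of the assumed objective

eta = 0.1
x0 = [1.0, 2.0, 3.0]
g = grad_f(x0)
x1 = [x0[i] - eta * g[i] for i in range(3)]   # x^(1) = x^(0) - eta * grad f(x^(0))
print(x1)                                     # [0.8, 1.6, 2.4] for these assumed values
```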
📗 Answer (comma separated vector): .
📗 [2 points] Consider a rectified linear unit (ReLU) with input \(x\) and a bias term. The output can be written as \(y\) = . Here, the weight is and the bias is . Write down the input value \(x\) that produces a specific output \(y\) = .
📗 The red curve is a plot of the activation function; given the y-value of the green point, the question asks for its x-value.
Hint
See Fall 2017 Final Q23. If \(y > 0\), there is a unique \(x\) that solves \(y = \displaystyle\max\left(0, w_{0} + w_{1} x\right) = w_{0} + w_{1} x\). If \(y < 0\), there is no \(x\) that solves the expression. If \(y = 0\), the set of \(x\) that solves the expression is given by \(0 \geq w_{0} + w_{1} x\); you can find the largest and smallest values of this set.
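📗 A minimal sketch of the \(y > 0\) case, with made-up weight and bias \(w_{1} = 2\), \(w_{0} = -1\) and target output \(y = 3\); use the values from your question instead.
```python
# Minimal sketch with assumed values: y = max(0, w0 + w1 * x), w0 = -1, w1 = 2, y = 3.

w0, w1 = -1.0, 2.0
y = 3.0

if y > 0:
    x = (y - w0) / w1                          # unique solution of y = w0 + w1 * x
    print(x)                                   # 2.0 for these assumed values
elif y == 0:
    print("any x with w0 + w1 * x <= 0")       # a set of solutions, see the hint
else:
    print("no solution")                       # ReLU outputs are never negative
```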
📗 Answer: .
📗 [4 points] Consider a Linear Threshold Unit (LTU) perceptron with initial weights \(w\) = and bias \(b\) = , trained using the Perceptron Algorithm. Given a new input \(x\) = with label \(y\) = and learning rate \(\alpha\) = , compute the updated weights \(w', b'\) = :
Hint
See Spring 2018 Final Q7, Spring 2017 Final Q3. The perceptron learning formula using the notations in this question is: \(w' = w - \alpha \left(a - y\right) x\) and \(b' = b - \alpha \left(a - y\right)\) where \(a = 1_{\left\{w^\top x + b \geq 0\right\}}\). Note that this is not a gradient descent procedure: it just happens to use a similar formula.
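📗 The update formula in the hint can be computed step by step; below is a minimal sketch with made-up values \(w = (1, -1)\), \(b = 0\), \(x = (1, 1)\), \(y = 0\), \(\alpha = 0.5\); plug in your own numbers.
```python
# Minimal sketch of one LTU perceptron update with assumed values.

w = [1.0, -1.0]
b = 0.0
x = [1.0, 1.0]
y = 0.0
alpha = 0.5

z = sum(wi * xi for wi, xi in zip(w, x)) + b    # linear part w^T x + b
a = 1.0 if z >= 0 else 0.0                      # LTU activation (a = 1 here since z = 0)
w_new = [wi - alpha * (a - y) * xi for wi, xi in zip(w, x)]
b_new = b - alpha * (a - y)
print(w_new, b_new)                             # [0.5, -1.5] and -0.5 for these assumed values
```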
📗 Answer (comma separated vector): .
📗 [3 points] Let \(g\left(z\right) = \dfrac{1}{1 + \exp\left(-z\right)}, z = w^\top x = w_{1} x_{1} + w_{2} x_{2} + ... + w_{d} x_{d}\), \(d\) = be a sigmoid perceptron with inputs \(x_{1} = ... = x_{d}\) = and weights \(w_{1} = ... = w_{d}\) = . There is no bias term. If the desired output is \(y\) = , and the sigmoid perceptron update rule has a learning rate of \(\alpha\) = , what will happen after one step of update? Each \(w_{i}\) will change by (enter a number, positive for increase and negative for decrease).
Hint
See Fall 2016 Final Q15, Fall 2011 Midterm Q11. The change for each \(w_{i}\) is \(-\alpha \left(a - y\right) x_{i}\) where \(a = g\left(z\right), z = w^\top x\). There is no bias added to the \(z\) term here.
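📗 A minimal sketch of the change formula in the hint, assuming made-up values \(d = 2\), \(x_{i} = 1\), \(w_{i} = 0\), \(y = 1\), \(\alpha = 0.1\); substitute the numbers from your question.
```python
# Minimal sketch of the sigmoid perceptron weight change with assumed values.
import math

d = 2
x = [1.0] * d
w = [0.0] * d
y = 1.0
alpha = 0.1

z = sum(wi * xi for wi, xi in zip(w, x))     # w^T x, no bias term
a = 1.0 / (1.0 + math.exp(-z))               # sigmoid activation g(z) = 0.5 here
delta = -alpha * (a - y) * x[0]              # change for each w_i (all x_i are equal)
print(delta)                                 # +0.05 for these assumed values
```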
📗 Answer: .
📗 [3 points] Which functions are (weakly) convex on \(\mathbb{R}\)?
Hint
See Fall 2014 Final Q4, Fall 2005 Midterm Q6. Either plot the functions or find the ones with non-negative second derivative (i.e. positive semi-definite Hessian matrix in higher dimensions). Strictly convex functions are the functions with strictly positive second derivative. Weakly convex functions are the functions with weakly positive (i.e. non-negative) second derivative. You can see the formal definitions on Wikipedia.
📗 You can plot an expression of x: using from to .
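📗 If you prefer a quick numerical check to plotting, the sketch below approximates the second derivative on a sampled grid; the candidate functions \(x^{2}\) and \(x^{3}\) are examples only, not the ones from your question, and the check only samples a finite range.
```python
# Minimal sketch: finite-difference check that the second derivative is non-negative
# on a sampled grid (a heuristic, not a proof of convexity).

def second_derivative(f, x, h=1e-4):
    return (f(x + h) - 2 * f(x) + f(x - h)) / h ** 2

candidates = {"x^2": lambda x: x ** 2, "x^3": lambda x: x ** 3}   # example functions
xs = [i / 100.0 for i in range(-500, 501)]                        # sample points in [-5, 5]
for name, f in candidates.items():
    convex = all(second_derivative(f, x) >= -1e-6 for x in xs)
    print(name, "looks weakly convex on [-5, 5]:", convex)        # True for x^2, False for x^3
```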
📗 Choices:
None of the above
📗 [2 points] Consider a single sigmoid perceptron with bias weight \(w_{0}\) = , a single input \(x_{1}\) with weight \(w_{1}\) = , and the sigmoid activation function \(g\left(z\right) = \dfrac{1}{1 + \exp\left(-z\right)}\). For what input \(x_{1}\) does the perceptron output the value \(a\) = ?
📗 The red curve is a plot of the activation function; given the y-value of the green point, the question asks for its x-value.
Hint
See Fall 2012 Final Q8, Fall 2014 Midterm Q16. Using the notations in this question: \(z = w_{0} + w_{1} x_{1}\) is the linear part, and \(a\) is the output or activation in the lectures. There should be a unique \(x_{1}\) that satisfies the expression \(\dfrac{1}{1 + \exp\left(-\left(w_{0} + w_{1} x_{1}\right)\right)} = a\).
📗 Note: Math.js does not accept "ln(...)", please use "log(...)" instead.
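📗 A minimal sketch of the inversion, with made-up weights \(w_{0} = 0\), \(w_{1} = 1\) and target activation \(a = 0.75\); replace them with the values from your question (in Python, math.log is the natural log).
```python
# Minimal sketch with assumed values: solve 1 / (1 + exp(-(w0 + w1 * x1))) = a.
import math

w0, w1 = 0.0, 1.0
a = 0.75

z = math.log(a / (1 - a))    # invert the sigmoid: z = ln(a / (1 - a))
x1 = (z - w0) / w1           # then invert the linear part z = w0 + w1 * x1
print(x1)                    # about 1.0986 for these assumed values
```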
📗 Answer: .
📗 [6 points] With a linear threshold unit perceptron, implement the following function. That is, you should write down the weights \(w_{0}, w_{A}, w_{B}\). Enter the bias first, then the weights on A and B.
A   B   function
0   0
0   1
1   0
1   1
📗 You can plot your line given by \(w_{0}, w_{A}, w_{B}\) to see if it separates the dataset correctly: . If no green line shows up, it means the entire line is outside of the range [0, 1].
Hint
See Fall 2011 Final Q10, Spring 2018 Final Q4, Fall 2006 Final Q16, Fall 2005 Final Q16. There can be many possible answers. A possible answer should satisfy the following conditions: \(1_{\left\{w_{0} + 0 w_{A} + 0 w_{B} \geq 0\right\}} = f\left(A = 0, B = 0\right)\), \(1_{\left\{w_{0} + 0 w_{A} + 1 w_{B} \geq 0\right\}} = f\left(A = 0, B = 1\right)\), \(1_{\left\{w_{0} + 1 w_{A} + 0 w_{B} \geq 0\right\}} = f\left(A = 1, B = 0\right)\), \(1_{\left\{w_{0} + 1 w_{A} + 1 w_{B} \geq 0\right\}} = f\left(A = 1, B = 1\right)\).
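📗 One way to verify a candidate answer is to check all four rows of the truth table, as in the sketch below; the table here encodes AND as an example only, so replace it with the function from your question and adjust the weights until every row matches.
```python
# Minimal sketch: check a candidate (w0, wA, wB) against a truth table.
# The table and weights below are for AND as an example (bias first).

truth = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}   # example function: AND
w0, wA, wB = -1.5, 1.0, 1.0                            # candidate weights

for (A, B), target in truth.items():
    out = 1 if w0 + wA * A + wB * B >= 0 else 0        # LTU output
    print(A, B, out, out == target)                    # all True means the weights work
```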
📗 Answer (comma separated vector): .
📗 [3 points] What is the minimum zero-one cost of a binary (y is either 0 or 1) linear (threshold) classifier (for example, LTU perceptron) on the following data set?
\(x_{i}\): 1, 2, 3, 4, 5, 6
\(y_{i}\):
📗 A linear classifier is a vertical line that separates the two classes: you want to draw the line so that the smallest number of mistakes (i.e. the zero-one cost) is made.
Hint
The zero-one cost is \(C = \displaystyle\min_{b} \displaystyle\sum_{i=1}^{n} 1_{\left\{\hat{y}_{i} \neq y_{i}\right\}}\), where \(\hat{y}_{i}\) is the prediction of the classifier. A linear classifier with threshold \(t\) is either in form (1) \(\hat{y}_{i} = 0\) when \(x_{i} \leq t\) and \(\hat{y}_{i} = 1\) when \(x_{i} > t\) or (2) \(\hat{y}_{i} = 1\) when \(x_{i} \leq t\) and \(\hat{y}_{i} = 0\) when \(x_{i} > t\). For this question, you can try \(t = 0, 1, 2, 3, 4, 5, 6, 7\) and check which one leads to the smallest zero-one cost.
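📗 The brute-force search described in the hint is short to code; the sketch below uses made-up labels \(y = (0, 0, 1, 0, 1, 1)\), so replace them with the labels from your question.
```python
# Minimal sketch: try every threshold t = 0..7 and both label orientations,
# and report the smallest zero-one cost. Labels below are assumed values.

xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 1, 0, 1, 1]                        # assumed labels

best = len(xs)
for t in range(0, 8):                          # candidate thresholds
    for flip in (False, True):                 # form (1) or form (2) in the hint
        preds = [(0 if x <= t else 1) for x in xs]
        if flip:
            preds = [1 - p for p in preds]
        cost = sum(p != y for p, y in zip(preds, ys))
        best = min(best, cost)
print(best)                                    # 1 for these assumed labels
```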
📗 Answer: .
📗 [1 point] Please enter any comments and suggestions including possible mistakes and bugs with the questions and the auto-grading, and materials relevant to solving the questions that you think are not covered well during the lectures. If you have no comments, please enter "None": do not leave it blank.
📗 Please do not modify the content in the above text field: use the "Grade" button to update.
📗 Please wait for the message "Successful submission." to appear after the "Submit" button. If there is an error message or no message appears after 10 seconds, please save the text in the above text box to a file using the button or copy and paste it into a file yourself and submit it to Canvas Assignment M2. You could submit multiple times (but please do not submit too often): only the latest submission will be counted.
📗 You could load your answers from the text (or txt file) in the text box below using the button . The first two lines should be "##m: 2" and "##id: your id", and the format of the remaining lines should be "##1: your answer to question 1" newline "##2: your answer to question 2", etc. Please make sure that your answers are loaded correctly before submitting them.
📗 Some of the past exams referenced in the Hints can be found on Professor Zhu's and Professor Dyer's websites: Link and Link.
📗 Some of the questions are from last year, and I recorded videos going through them; the links are at the bottom of the Week 1 to Week 8 pages, for example: W4 and W8.
📗 The links to the solutions the students volunteered to share on Piazza will be collected in this post around the official deadline: Link.