Wisc ID for in-class quiz: (if your wisc email is "test@wisc.edu", please enter "test")
Token: (will be given during the lectures)
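📗 More generally, a non-linear activation function can be added and the classifier will still be a linear classifier: \(\hat{y}_{i} = g\left(w_{1} x_{i 1} + w_{2} x_{i 2} + ... + w_{m} x_{i m} + b\right)\) for some non-linear function \(g\) called the activation function: Wikipedia. When \(g\) is non-decreasing, thresholding \(g\) is the same as thresholding the linear combination inside it, so the decision boundary is still a hyperplane.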
➩ LTU: the special case where \(g\left(z\right) = 1\) if \(z \geq 0\) and \(g\left(z\right) = 0\) otherwise.
➩ Linear regression (not commonly used for classification problems): \(g\left(z\right) = z\), usually truncated to the interval \(\left[0, 1\right]\) so that it can represent the probability that the predicted label is \(1\): Wikipedia.
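📗 A minimal sketch of the two activation functions above, in Python. The weights, bias, and items are hypothetical values chosen for illustration, and the bias `b` is assumed to be added to the weighted sum.

```python
def linear_combination(w, x, b=0.0):
    """Weighted sum w_1 x_1 + ... + w_m x_m + b for one item."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def ltu(z):
    """Linear threshold unit: 1 if z >= 0, otherwise 0."""
    return 1 if z >= 0 else 0


def truncated_linear(z):
    """Identity activation g(z) = z, truncated to the interval [0, 1]."""
    return min(1.0, max(0.0, z))


# Hypothetical weights, bias, and items with m = 2 features (assumptions).
w, b = [2.0, -1.0], -0.5
items = [[1.0, 0.5], [0.0, 1.0], [0.5, 0.25]]

for x in items:
    z = linear_combination(w, x, b)
    print(x, z, ltu(z), truncated_linear(z))
# [1.0, 0.5] 1.0 1 1.0
# [0.0, 1.0] -1.5 0 0.0
# [0.5, 0.25] 0.25 1 0.25
```

➩ Both activations are applied to the same linear combination, so they differ only in how the number \(z\) is turned into a prediction.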
📗 The weights and biases need to be selected to minimize a loss (or cost) function, which is the sum of the loss from the prediction of each item: \(C = C\left(\hat{y}_{1}, y_{1}\right) + C\left(\hat{y}_{2}, y_{2}\right) + ... + C\left(\hat{y}_{n}, y_{n}\right)\) in case \(\hat{y}_{i}\) is a prediction of either \(0\) or \(1\), or \(C = C\left(a_{1}, y_{1}\right) + C\left(a_{2}, y_{2}\right) + ... + C\left(a_{n}, y_{n}\right)\) in case \(a_{i}\) is a probability prediction in \(\left[0, 1\right]\): Wikipedia.
➩ Zero-one loss counts the number of mistakes: \(C\left(a_{i}, y_{i}\right) = 1\) if \(a_{i} \neq y_{i}\) and \(0\) otherwise.
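📗 A small sketch of the total cost as a sum of per-item losses, using the zero-one loss. The predictions and labels are made-up values, and the helper names are hypothetical.

```python
def zero_one_loss(prediction, label):
    """1 if the prediction is a mistake, 0 otherwise."""
    return 1 if prediction != label else 0


def total_cost(predictions, labels, loss):
    """Sum of the per-item losses C(prediction_i, y_i) over all n items."""
    return sum(loss(p, y) for p, y in zip(predictions, labels))


# Hypothetical predictions and true labels for n = 5 items (assumptions).
y_hat = [1, 0, 1, 1, 0]
y = [1, 1, 1, 0, 0]

mistakes = total_cost(y_hat, y, zero_one_loss)
print(mistakes)  # 2 (the second and fourth items are wrong)
```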
📗 [3 points] Move the sliders below to change the green plane normal so that the total loss from blue points below the plane and the red points above the plane is minimized.
📗 To minimize a loss function by choosing weights and biases, they can be initialized randomly and updated based on the derivative (gradient in the multi-variate case): Link, Wikipedia.
➩ If the derivative is negative and large: increase the weight by a large amount.
➩ If the derivative is negative and small: increase the weight by a small amount.
➩ If the derivative is positive and large: decrease the weight by a large amount.
➩ If the derivative is positive and small: decrease the weight by a small amount.
📗 This can be summarized by the gradient descent formula \(w_{j} \leftarrow w_{j} - \alpha \dfrac{\partial C}{\partial w_{j}}\) for \(j = 1, 2, ..., m\) and \(b \leftarrow b - \alpha \dfrac{\partial C}{\partial b}\).
➩ In multi-variate calculus notation, this is usually written as \(w \leftarrow w - \alpha \nabla_{w} C\) and \(b \leftarrow b - \alpha \dfrac{\partial C}{\partial b}\), where \(\nabla_{w} C = \begin{bmatrix} \dfrac{\partial C}{\partial w_{1}} \\ \dfrac{\partial C}{\partial w_{2}} \\ \vdots \\ \dfrac{\partial C}{\partial w_{m}} \end{bmatrix}\).
➩ \(\alpha\) is called the learning rate and controls how fast the weights are updated.
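📗 The update rule above can be sketched as a short loop. The cost function here is a hypothetical quadratic chosen only because its minimum is known; the function names, starting point, learning rate, and number of steps are all assumptions.

```python
def gradient_descent(grad, w, b, alpha, steps):
    """Repeat w_j <- w_j - alpha * dC/dw_j and b <- b - alpha * dC/db."""
    for _ in range(steps):
        grad_w, grad_b = grad(w, b)
        w = [wj - alpha * gj for wj, gj in zip(w, grad_w)]
        b = b - alpha * grad_b
    return w, b


def toy_gradient(w, b):
    """Gradient of the hypothetical cost
    C(w, b) = (w_1 - 3)^2 + (w_2 + 2)^2 + (b - 1)^2."""
    return [2 * (w[0] - 3), 2 * (w[1] + 2)], 2 * (b - 1)


w, b = gradient_descent(toy_gradient, w=[0.0, 0.0], b=0.0, alpha=0.1, steps=100)
print(w, b)  # close to [3, -2] and 1, the minimizer of the toy cost
```

➩ Because each step is proportional to the derivative, a large negative derivative produces a large increase in the weight and a small positive derivative produces a small decrease, matching the four cases listed above.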
In-class Discussion
📗 [1 point] Move the point and change the learning rate to see the derivatives (slope of tangent line) of the function \(x^{2}\). Find a combination of initial point and learning rate so that gradient descent will not find the global minimum.
[Note] In general, when would gradient descent not find the global minimum?
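📗 As one illustration for the function \(x^{2}\), the gradient descent step is \(x \leftarrow x - \alpha \cdot 2 x = \left(1 - 2 \alpha\right) x\), so the behavior depends on the learning rate. The sketch below uses a hypothetical helper name, an assumed starting point of \(1\), and three assumed learning rates.

```python
def descend_on_square(x0, alpha, steps=20):
    """Gradient descent on f(x) = x^2, whose derivative is 2x.

    Returns the list of points visited, starting with x0.
    """
    x = x0
    path = [x]
    for _ in range(steps):
        x = x - alpha * 2 * x
        path.append(x)
    return path


print(descend_on_square(1.0, 0.5)[:3])  # [1.0, 0.0, 0.0]        reaches the minimum
print(descend_on_square(1.0, 1.0)[:4])  # [1.0, -1.0, 1.0, -1.0] oscillates forever
print(descend_on_square(1.0, 1.5)[:4])  # [1.0, -2.0, 4.0, -8.0] moves away from 0
```

➩ For this function, a learning rate of \(1\) or more keeps any non-zero starting point from ever reaching the minimum at \(0\).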
In-class Discussion
[Note] What is the advantage of using logistic (sigmoid) loss over zero-one loss?
📗 The derivative \(\dfrac{\partial C\left(a_{i}, y_{i}\right)}{\partial w_{j}}\) for the cross entropy cost and the logistic activation function is \(\left(a_{i} - y_{i}\right) x_{ij}\).
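📗 This derivative follows from the chain rule. The sketch below assumes the standard definitions, which are not written out in these notes: the cross entropy cost \(C\left(a_{i}, y_{i}\right) = -y_{i} \log a_{i} - \left(1 - y_{i}\right) \log\left(1 - a_{i}\right)\), the logistic activation \(a_{i} = \dfrac{1}{1 + e^{-z_{i}}}\), and \(z_{i} = w_{1} x_{i 1} + w_{2} x_{i 2} + ... + w_{m} x_{i m} + b\).

```latex
\begin{aligned}
\frac{\partial C}{\partial a_{i}}
  &= -\frac{y_{i}}{a_{i}} + \frac{1 - y_{i}}{1 - a_{i}}
   = \frac{a_{i} - y_{i}}{a_{i} \left(1 - a_{i}\right)}, \\
\frac{\partial a_{i}}{\partial z_{i}}
  &= a_{i} \left(1 - a_{i}\right), \qquad
\frac{\partial z_{i}}{\partial w_{j}} = x_{ij}, \qquad
\frac{\partial z_{i}}{\partial b} = 1, \\
\frac{\partial C}{\partial w_{j}}
  &= \frac{\partial C}{\partial a_{i}} \cdot
     \frac{\partial a_{i}}{\partial z_{i}} \cdot
     \frac{\partial z_{i}}{\partial w_{j}}
   = \left(a_{i} - y_{i}\right) x_{ij}, \qquad
\frac{\partial C}{\partial b} = a_{i} - y_{i}.
\end{aligned}
```

➩ The factor \(a_{i} \left(1 - a_{i}\right)\) cancels, which is why the result is so simple.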
📗 The gradient descent step can be written as \(w \leftarrow w - \alpha \left(\left(a_{1} - y_{1}\right) x_{1} + \left(a_{2} - y_{2}\right) x_{2} + ... + \left(a_{n} - y_{n}\right) x_{n}\right)\) and \(b \leftarrow b - \alpha \left(\left(a_{1} - y_{1}\right) + \left(a_{2} - y_{2}\right) + ... + \left(a_{n} - y_{n}\right)\right)\), which is similar to the Perceptron algorithm formula, except that the derivatives for all items need to be summed up.
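📗 A minimal sketch of this step in plain Python, assuming \(a_{i}\) is the logistic (sigmoid) activation applied to \(w \cdot x_{i} + b\). The data set (the logical AND of two inputs), the zero initialization, the learning rate, and the number of steps are assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    """Logistic activation 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def activation(w, b, x):
    """a_i = sigmoid(w . x_i + b) for one item."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def logistic_gd_step(w, b, xs, ys, alpha):
    """One step: sum (a_i - y_i) x_i and (a_i - y_i) over all items, then update."""
    grad_w = [0.0] * len(w)
    grad_b = 0.0
    for x, y in zip(xs, ys):
        a = activation(w, b, x)
        for j, xj in enumerate(x):
            grad_w[j] += (a - y) * xj
        grad_b += a - y
    new_w = [wj - alpha * gj for wj, gj in zip(w, grad_w)]
    return new_w, b - alpha * grad_b


# Hypothetical training set: label is 1 only when both features are 1.
xs = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
ys = [0, 0, 0, 1]

w, b = [0.0, 0.0], 0.0
for _ in range(2000):
    w, b = logistic_gd_step(w, b, xs, ys, alpha=0.5)

predictions = [1 if activation(w, b, x) >= 0.5 else 0 for x in xs]
print(predictions)  # [0, 0, 0, 1]
```

➩ Unlike the Perceptron update, which changes the weights after each item, every step here first adds up the contributions from all \(n\) items.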
📗 If you have questions, please use (i) Zoom chat, (ii) Piazza: Link, (iii) Office hours and discussion sessions. Please do NOT use Canvas mail, and use email only to contact the course instructor (not TAs) about grading issues.
📗 To get full points on the in-class quizzes for a lecture:
➩ Submit relevant answers to the questions discussed during the lecture: incorrect answers are okay.
➩ Some questions require [notes] to earn the point.
➩ Some questions require special ID (given during the lecture) to earn the point.
➩ Do not submit answers to questions that are not discussed during the lectures. Each such submission will result in a deduction of one point.
➩ Submissions after the lecture, before the midterm (first 14 lectures) and the final exam (last 14 lectures), are accepted. After the exams, no in-class quiz submissions will be accepted.
➩ The grade on Canvas Assignment Q2 is computed as the number of points divided by the number of questions asked (out of 1) and is updated on Canvas every weekend.
📗 If there are any issues with submission on the website, please use this Google form: Link.
📗 Bonus point opportunities during a few lectures (added to in-class quiz above 20 points).
📗 Notes and code adapted from the course taught by Professors Jerry Zhu, Blerina Gkotse, Yudong Chen, Yingyu Liang, Charles Dyer. Some content is generated using Copilot.