Prev: L1, Next: L3




# Activation Function

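📗 More generally, a non-linear activation function can be applied to the linear combination of the features, and the classifier will still be a linear classifier (for an increasing \(g\), thresholding \(g\left(z\right)\) is equivalent to thresholding \(z\), so the decision boundary remains a hyperplane): \(\hat{y}_{i} = g\left(w_{1} x_{i 1} + w_{2} x_{i 2} + ... + w_{m} x_{i m} + b\right)\) for some non-linear function \(g\) called the activation function: Wikipedia.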
➩ LTU: the special case where \(g\left(z\right) = 1\) if \(z \geq 0\) and \(g\left(z\right) = 0\) otherwise.
➩ Linear regression (not commonly used for classification problems): \(g\left(z\right) = z\), usually truncated to a number in \(\left[0, 1\right]\) to represent the probability that the predicted label is \(1\): Wikipedia.
➩ Logistic regression: \(g\left(z\right) = \dfrac{1}{1 + e^{-z}}\): Wikipedia.
📗 There are other activation functions often used in neural networks (multi-layer perceptrons):
➩ ReLU (REctified Linear Unit): \(g\left(z\right) = \displaystyle\max\left(0, z\right)\): Wikipedia.
➩ tanh (Hyperbolic TANgent): \(g\left(z\right) = \dfrac{e^{z} - e^{-z}}{e^{z} + e^{-z}}\).
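The activation functions listed above can be sketched in a few lines of Python. This is a minimal illustration, not code from the course: the function names and the helper `predict` are hypothetical, and the bias term `b` follows the definition of \(z_{i}\) used later in these notes.

```python
import math

def ltu(z):
    """Linear threshold unit: 1 if z >= 0, and 0 otherwise."""
    return 1 if z >= 0 else 0

def logistic(z):
    """Logistic (sigmoid) activation: 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    """Rectified linear unit: max(0, z)."""
    return max(0.0, z)

def tanh(z):
    """Hyperbolic tangent, written out as in the formula above."""
    return (math.exp(z) - math.exp(-z)) / (math.exp(z) + math.exp(-z))

def predict(w, b, x, g):
    """Apply the activation g to the linear combination w . x + b."""
    z = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
    return g(z)

# Evaluate each activation at a few points to compare their shapes.
for z in [-2.0, 0.0, 2.0]:
    print(z, ltu(z), round(logistic(z), 4), relu(z), round(tanh(z), 4))
```

For very negative or very large `z`, `math.exp` can overflow; a production implementation would guard against that, which this sketch does not.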
Example
📗 Plot the function: as a function of from to .




# Loss Function

📗 The weights and biases need to be selected to minimize a loss (or cost) function, which is the sum of the losses from the prediction of each item: \(C = C\left(\hat{y}_{1}, y_{1}\right) + C\left(\hat{y}_{2}, y_{2}\right) + ... + C\left(\hat{y}_{n}, y_{n}\right)\) when \(\hat{y}_{i}\) is a prediction of either \(0\) or \(1\), or \(C = C\left(a_{1}, y_{1}\right) + C\left(a_{2}, y_{2}\right) + ... + C\left(a_{n}, y_{n}\right)\) when \(a_{i}\) is a probability prediction in \(\left[0, 1\right]\): Wikipedia.
➩ Zero-one loss counts the number of mistakes: \(C\left(\hat{y}_{i}, y_{i}\right) = 1\) if \(\hat{y}_{i} \neq y_{i}\) and \(0\) otherwise.
➩ Square loss: \(C\left(a_{i}, y_{i}\right) = \dfrac{1}{2} \left(a_{i} - y_{i}\right)^{2}\).
➩ Hinge loss: \(C\left(a_{i}, y_{i}\right) = \displaystyle\max\left\{0, 1 - \left(2 a_{i} - 1\right) \cdot \left(2 y_{i} - 1\right)\right\}\): Wikipedia.
➩ Cross entropy loss: \(C\left(a_{i}, y_{i}\right) = -y_{i} \log\left(a_{i}\right) - \left(1 - y_{i}\right) \log\left(1 - a_{i}\right)\): Wikipedia.
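As a rough sketch, the four losses above can be written in Python and summed over a small data set. The function names, the toy predictions, the \(0.5\) threshold used to turn a probability into a \(0\)/\(1\) prediction, and the `eps` clipping that avoids \(\log\left(0\right)\) are all assumptions made for illustration.

```python
import math

def zero_one_loss(y_hat, y):
    """1 if the 0/1 prediction is a mistake, and 0 otherwise."""
    return 1 if y_hat != y else 0

def square_loss(a, y):
    """Half of the squared difference between prediction and label."""
    return 0.5 * (a - y) ** 2

def hinge_loss(a, y):
    """Hinge loss after mapping a and y from [0, 1] to [-1, 1]."""
    return max(0.0, 1.0 - (2 * a - 1) * (2 * y - 1))

def cross_entropy_loss(a, y, eps=1e-12):
    """Cross entropy loss; a is clipped away from 0 and 1 (an added assumption)."""
    a = min(max(a, eps), 1 - eps)
    return -y * math.log(a) - (1 - y) * math.log(1 - a)

def total_loss(loss, predictions, labels):
    """Sum the per-item loss over all items."""
    return sum(loss(p, y) for p, y in zip(predictions, labels))

# Made-up probability predictions and labels for three items.
a = [0.9, 0.2, 0.6]
y = [1, 0, 0]
y_hat = [1 if a_i >= 0.5 else 0 for a_i in a]  # assumed 0.5 threshold

print("zero-one:", total_loss(zero_one_loss, y_hat, y))
print("square:", total_loss(square_loss, a, y))
print("hinge:", total_loss(hinge_loss, a, y))
print("cross entropy:", total_loss(cross_entropy_loss, a, y))
```

Only the third item is misclassified, so the zero-one loss is \(1\), while the other losses also penalize correct predictions that are not confident.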
In-class Discussion ID:
📗 [3 points] Move the sliders below to change the green plane normal so that the total loss from blue points below the plane and the red points above the plane is minimized.

The current total cost is ???.
📗 Answers:
\(w_{1}\) = 0
\(w_{2}\) = 0
\(w_{3}\) = 1
\(b\) = 0
[Note] How did you find the optimal weights?


 

In-class Discussion ID:
📗 [1 points] Change the logistic regression weights to minimize the cross entropy loss.

Weight x: 0
Weight y: 0
Bias: 0
Loss: ???, Zero-one

[Note] How did you find the optimal weights?


 




# Gradient Descent

📗 To minimize a loss function by choosing weights and biases, they can be initialized randomly and updated based on the derivative (gradient in the multi-variate case): Link, Wikipedia.
➩ If the derivative is negative and large: increase the weight by a large amount.
➩ If the derivative is negative and small: increase the weight by a small amount.
➩ If the derivative is positive and large: decrease the weight by a large amount.
➩ If the derivative is positive and small: decrease the weight by a small amount.
📗 This can be summarized by the gradient descent formula \(w_{j} \leftarrow w_{j} - \alpha \dfrac{\partial C}{\partial w_{j}}\) for \(j = 1, 2, ..., m\) and \(b \leftarrow b - \alpha \dfrac{\partial C}{\partial b}\).
➩ In multi-variate calculus notation, this is usually written as \(w \leftarrow w - \alpha \nabla_{w} C\) and \(b \leftarrow b - \alpha \dfrac{\partial C}{\partial b}\), where \(\nabla_{w} C = \begin{bmatrix} \dfrac{\partial C}{\partial w_{1}} \\ \dfrac{\partial C}{\partial w_{2}} \\ ... \\ \dfrac{\partial C}{\partial w_{m}} \end{bmatrix}\).
➩ \(\alpha\) is called the learning rate and controls how fast the weights are updated.
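The effect of the learning rate can be seen on the one-variable function \(f\left(x\right) = x^{2}\), whose derivative is \(2 x\). The sketch below applies the update \(x \leftarrow x - \alpha \cdot 2 x\) a fixed number of times; the function name `gradient_descent`, the starting point, and the step count are illustrative choices, not part of the course code.

```python
def gradient_descent(x0, alpha, steps=20):
    """Minimize f(x) = x^2 with the update x <- x - alpha * f'(x), where f'(x) = 2x."""
    x = x0
    for _ in range(steps):
        derivative = 2 * x
        x = x - alpha * derivative
    return x

# Each update multiplies x by (1 - 2 * alpha), so the behavior depends on alpha.
print(gradient_descent(1.0, 0.1))  # shrinks toward the minimum at 0
print(gradient_descent(1.0, 0.5))  # 0.0: jumps to the minimum in one step
print(gradient_descent(1.0, 1.0))  # 1.0: oscillates between 1 and -1
print(gradient_descent(1.0, 1.5))  # 1048576.0: diverges, doubling in size each step
```

Since each step multiplies \(x\) by \(1 - 2 \alpha\), the iterates converge to \(0\) only when \(\left|1 - 2 \alpha\right| < 1\), that is, when \(0 < \alpha < 1\).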
In-class Discussion
📗 [1 points] Move the point and change the learning rate to see the derivatives (slope of tangent line) of the function \(x^{2}\). Find an initial point + learning rate combination so that gradient descent will not find the global minimum.

Point: 0
Learning rate: 0.5
Derivative: 0
Point found after gradient descent: 0

[Note] In general, when would gradient descent not find the global minimum?


 

In-class Discussion
[Note] What is the advantage of using logistic (sigmoid) loss over zero-one loss?


 




# Gradient Descent for Logistic Regression

📗 The derivative \(\dfrac{\partial C\left(a_{i}, y_{i}\right)}{\partial w_{j}}\) for the cross entropy cost and the logistic activation function is \(\left(a_{i} - y_{i}\right) x_{ij}\).
📗 The gradient descent step can be written as \(w \leftarrow w - \alpha \left(\left(a_{1} - y_{1}\right) x_{1} + \left(a_{2} - y_{2}\right) x_{2} + ... + \left(a_{n} - y_{n}\right) x_{n}\right)\) and \(b \leftarrow b - \alpha \left(\left(a_{1} - y_{1}\right) + \left(a_{2} - y_{2}\right) + ... + \left(a_{n} - y_{n}\right)\right)\), which is similar to the Perceptron algorithm formula, except that the derivatives for all items are summed up before the weights are updated.
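A minimal sketch of this update in plain Python is given below. The toy data set, the learning rate \(\alpha = 0.1\), the number of iterations, and the names `gradient_step` and `predictions` are assumptions made for illustration; the update itself follows the formula above.

```python
import math

def logistic(z):
    """Logistic activation: 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))

def gradient_step(w, b, X, y, alpha):
    """One update: w <- w - alpha * sum_i (a_i - y_i) x_i, and the same for b."""
    grad_w = [0.0] * len(w)
    grad_b = 0.0
    for x_i, y_i in zip(X, y):
        a_i = logistic(sum(w_j * x_ij for w_j, x_ij in zip(w, x_i)) + b)
        for j, x_ij in enumerate(x_i):
            grad_w[j] += (a_i - y_i) * x_ij  # derivative with respect to w_j
        grad_b += a_i - y_i                  # derivative with respect to b
    w = [w_j - alpha * g_j for w_j, g_j in zip(w, grad_w)]
    b = b - alpha * grad_b
    return w, b

# Made-up, linearly separable data: the label is 1 when x1 + x2 is large.
X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
y = [0, 0, 0, 1, 1]

w, b = [0.0, 0.0], 0.0  # initialized at zero here instead of randomly
for _ in range(2000):
    w, b = gradient_step(w, b, X, y, alpha=0.1)

predictions = [
    1 if logistic(sum(w_j * x_ij for w_j, x_ij in zip(w, x_i)) + b) >= 0.5 else 0
    for x_i in X
]
print(w, b, predictions)
```

Unlike the Perceptron update, every item contributes to each step, including items that are already classified correctly.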
Math Note If \(C_{i} = -y_{i} \log\left(a_{i}\right) - \left(1 - y_{i}\right) \log\left(1 - a_{i}\right)\) and \(a_{i} = \dfrac{1}{1 + e^{-z_{i}}}\) where \(z_{i} = w_{1} x_{i 1} + w_{2} x_{i 2} + ... + w_{m} x_{i m} + b\), then the chain rule implies,
 \(\dfrac{\partial C_{i}}{\partial w_{j}} = \dfrac{\partial C_{i}}{\partial a_{i}} \dfrac{\partial a_{i}}{\partial z_{i}} \dfrac{\partial z_{i}}{\partial w_{j}}\)
 \(= \left(\dfrac{-y_{i}}{a_{i}} + \dfrac{1 - y_{i}}{1 - a_{i}}\right) \left(\dfrac{1}{1 + e^{-z_{i}}} \dfrac{e^{-z_{i}}}{1 + e^{-z_{i}}}\right) \left(x_{ij}\right)\)
 \(= \dfrac{-y_{i} + a_{i} y_{i} + a_{i} - a_{i} y_{i}}{a_{i} \left(1 - a_{i}\right)} \left(a_{i} \left(1 - a_{i}\right)\right) \left(x_{ij}\right)\)
 \(= \left(a_{i} - y_{i}\right) x_{ij}\),
combining the \(w_{j}\) for \(j = 1, 2, ..., m\), \(\nabla_{w} C_{i} = \left(a_{i} - y_{i}\right) x_{i}\),
and combining the \(C_{i}\) for \(i = 1, 2, ..., n\), \(\nabla_{w} C = \left(a_{1} - y_{1}\right) x_{1} + \left(a_{2} - y_{2}\right) x_{2} + ... + \left(a_{n} - y_{n}\right) x_{n}\).
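One way to sanity-check the derivation is to compare \(\left(a_{i} - y_{i}\right) x_{ij}\) with a finite-difference approximation of the derivative for a single item. The sketch below uses made-up values for \(w\), \(b\), \(x_{i}\), \(y_{i}\) and a central difference with step `h`; all names are hypothetical.

```python
import math

def logistic(z):
    """Logistic activation: 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))

def cross_entropy(w, b, x, y):
    """Cross entropy cost C_i for one item with features x and label y."""
    a = logistic(sum(w_j * x_j for w_j, x_j in zip(w, x)) + b)
    return -y * math.log(a) - (1 - y) * math.log(1 - a)

def analytic_gradient(w, b, x, y):
    """Gradient from the derivation: (a_i - y_i) x_i."""
    a = logistic(sum(w_j * x_j for w_j, x_j in zip(w, x)) + b)
    return [(a - y) * x_j for x_j in x]

def numeric_gradient(w, b, x, y, h=1e-6):
    """Central-difference approximation of dC_i / dw_j for each j."""
    grad = []
    for j in range(len(w)):
        w_plus = list(w)
        w_minus = list(w)
        w_plus[j] += h
        w_minus[j] -= h
        diff = cross_entropy(w_plus, b, x, y) - cross_entropy(w_minus, b, x, y)
        grad.append(diff / (2 * h))
    return grad

# Arbitrary values chosen for illustration.
w, b, x, y = [0.3, -0.7], 0.1, [1.5, 2.0], 1
print(analytic_gradient(w, b, x, y))
print(numeric_gradient(w, b, x, y))
```

The two printed vectors should agree to several decimal places; a large gap would point to an error in either the derivation or the code.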
In-class Discussion ID:
📗 [1 points] Change the regression coefficients to minimize the loss.

Weight: 0
Bias: 0
Loss: ???


 

In-class Quiz ID:
📗 [3 points] Which one of the following is the gradient descent step for w if the activation function is and the cost function is ?
📗 Choices:
\(w = w - \alpha \displaystyle\sum_{i=1}^{n} \left(a_{i} - y_{i}\right)\)
\(w = w - \alpha \displaystyle\sum_{i=1}^{n} \left(a_{i} - y_{i}\right)\)
\(w = w - \alpha \displaystyle\sum_{i=1}^{n} \left(a_{i} - y_{i}\right)\)
\(w = w - \alpha \displaystyle\sum_{i=1}^{n} \left(a_{i} - y_{i}\right)\)
\(w = w - \alpha \displaystyle\sum_{i=1}^{n} \left(a_{i} - y_{i}\right)\)
None of the above
[Note] Use the space to explain the steps or just take notes:


 





# Questions?

📗 If you have questions, please use (i) Zoom chat, (ii) Piazza: Link, or (iii) office hours and discussion sessions. Please do NOT use Canvas mail, and email only the course instructor (not the TAs) for grading issues.






# In-class Quiz Instructions

📗 To get full points on the in-class quizzes for a lecture:
➩ Submit relevant answers to the questions discussed during the lecture: incorrect answers are okay.
➩ Some questions require [notes] to earn the point.
➩ Some questions require special ID (given during the lecture) to earn the point.
➩ Do not submit answers to questions that are not discussed during the lectures. Each such submission will result in a deduction of one point.
➩ Submissions after the lecture, before the midterm (first 14 lectures) and the final exam (last 14 lectures), are accepted. After the exams, no in-class quiz submissions will be accepted.
➩ The grade on Canvas Assignment Q2 is computed as the number of points divided by the number of questions asked (out of 1) and is updated on Canvas every weekend.
📗 If there are any issues with submission on the website, please use this Google form: Link.
📗 There are bonus point opportunities during a few lectures (added to in-class quiz above 20 points).
📗 Notes and code are adapted from the course taught by Professors Jerry Zhu, Blerina Gkotse, Yudong Chen, Yingyu Liang, Charles Dyer. Some content is generated using Copilot.






Last Updated: June 26, 2026 at 3:06 AM