📗 Another simple classifier is a linear classifier: \(\hat{y}_{i} = 1\) if \(w_{1} x_{i 1} + w_{2} x_{i 2} + ... + w_{m} x_{i m} + b \geq 0\) and \(\hat{y}_{i} = 0\) otherwise. This classifier is called an LTU (Linear Threshold Unit) perceptron: Wikipedia.
📗 Given a training set, the weights \(w_{1}, w_{2}, ..., w_{m}\) and bias \(b\) can be estimated based on the data \(\left(x_{1}, y_{1}\right), \left(x_{2}, y_{2}\right), ..., \left(x_{n}, y_{n}\right)\). One algorithm is called the Perceptron Algorithm.
➩ Initialize random weights and bias.
➩ For each item \(x_{i}\), compute the prediction \(\hat{y}_{i}\).
➩ If the prediction is \(\hat{y}_{i} = 0\) and the actual label is \(y_{i} = 1\), increase the weights: \(w \leftarrow w + \alpha x_{i}\), \(b \leftarrow b + \alpha\), where \(\alpha\) is a constant called the learning rate.
➩ If the prediction is \(\hat{y}_{i} = 1\) and the actual label is \(y_{i} = 0\), decrease the weights: \(w \leftarrow w - \alpha x_{i}\), \(b \leftarrow b - \alpha\).
➩ Repeat the process until convergence (the weights are no longer changing); a minimal code sketch of these steps follows the list.
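📗 A minimal sketch of the Perceptron Algorithm above in Python with NumPy (the names `X`, `y`, `alpha`, and `max_epochs` are illustrative assumptions, not part of the notes):
```python
import numpy as np

def perceptron_train(X, y, alpha=1.0, max_epochs=100):
    """Train an LTU perceptron; X is an n-by-m array, y holds 0/1 labels."""
    n, m = X.shape
    w = np.zeros(m)  # weights (a random initialization also works)
    b = 0.0          # bias
    for _ in range(max_epochs):
        changed = False
        for i in range(n):
            y_hat = 1 if X[i] @ w + b >= 0 else 0
            if y_hat == 0 and y[i] == 1:    # predicted 0 but label is 1: increase weights
                w += alpha * X[i]
                b += alpha
                changed = True
            elif y_hat == 1 and y[i] == 0:  # predicted 1 but label is 0: decrease weights
                w -= alpha * X[i]
                b -= alpha
                changed = True
        if not changed:  # convergence: weights no longer changing
            break
    return w, b
```
If the training set is linearly separable, the algorithm converges and the loop stops early; otherwise it stops after `max_epochs` passes over the data.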
TopHat Discussion
📗 [3 points] Move the sliders below to change the green plane normal so that the largest number of the blue points are above the plane and the largest number of the red points are below the plane.
📗 The perceptron algorithm update can be summarized as \(w \leftarrow w - \alpha \left(\hat{y}_{i} - y_{i}\right) x_{i}\) (or for \(j = 1, 2, ..., m\), \(w_{j} \leftarrow w_{j} - \alpha \left(\hat{y}_{i} - y_{i}\right) x_{ij}\)) and \(b \leftarrow b - \alpha \left(\hat{y}_{i} - y_{i}\right)\), where \(\hat{y}_{i} = 1\) if \(w_{1} x_{i 1} + w_{2} x_{i 2} + ... + w_{m} x_{i m} + b \geq 0\) and \(\hat{y}_{i} = 0\) if \(w_{1} x_{i 1} + w_{2} x_{i 2} + ... + w_{m} x_{i m} + b < 0\): Wikipedia.
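📗 As a worked example of one update with made-up numbers (not from the notes): suppose \(w = \begin{bmatrix} 0 \\ 0 \end{bmatrix}\), \(b = 0\), \(x_{i} = \begin{bmatrix} 1 \\ 2 \end{bmatrix}\), \(y_{i} = 0\), and \(\alpha = 1\). The prediction is \(\hat{y}_{i} = 1\) because \(0 \cdot 1 + 0 \cdot 2 + 0 = 0 \geq 0\), so the update is \(w \leftarrow w - 1 \cdot \left(1 - 0\right) \begin{bmatrix} 1 \\ 2 \end{bmatrix} = \begin{bmatrix} -1 \\ -2 \end{bmatrix}\) and \(b \leftarrow 0 - 1 \cdot \left(1 - 0\right) = -1\).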
📗 The learning rate \(\alpha\) controls how fast the weights are updated.
➩ \(\alpha\) can be constant (usually 1).
➩ \(\alpha\) can be a function of the iteration (usually decreasing), for example, \(\alpha_{t} = \dfrac{1}{\sqrt{t}}\).
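📗 A tiny sketch of the decreasing learning rate schedule above, assuming iterations are numbered \(t = 1, 2, ...\):
```python
def learning_rate(t):
    """Decreasing learning rate schedule: alpha_t = 1 / sqrt(t)."""
    return 1.0 / (t ** 0.5)
```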
TopHat Discussion
📗 [3 points] Find the Perceptron weights by using the Perceptron algorithm: select a point on the diagram and click anywhere else to run one iteration of the Perceptron algorithm.
📗 Answer: 0,0.1,0
TopHat Quiz
(Past Exam Question)
📗 [4 points] Consider a Linear Threshold Unit (LTU) perceptron with initial weights \(w\) = and bias \(b\) = trained using the Perceptron Algorithm. Given a new input \(x\) = and \(y\) = . Let the learning rate be \(\alpha\) = , compute the updated weights, \(w', b'\) = :
📗 More generally, a non-linear activation function can be added and the classifier will still be a linear classifier: \(\hat{y}_{i} = g\left(w_{1} x_{i 1} + w_{2} x_{i 2} + ... + w_{m} x_{i m} + b\right)\) for some non-linear function \(g\) called the activation function: Wikipedia.
➩ LTU: the special case where \(g\left(z\right) = 1\) if \(z \geq 0\) and \(g\left(z\right) = 0\) otherwise.
➩ Linear regression (not commonly used for classification problems): \(g\left(z\right) = z\), usually truncated to a number between \(\left[0, 1\right]\) to represent probability that the predicted label is \(1\): Wikipedia.
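📗 A small sketch of the two activations listed above (the function names are illustrative, not standard):
```python
import numpy as np

def ltu(z):
    """LTU activation: g(z) = 1 if z >= 0 and 0 otherwise."""
    return np.where(z >= 0, 1, 0)

def linear_truncated(z):
    """Linear activation g(z) = z, truncated (clipped) to [0, 1]."""
    return np.clip(z, 0.0, 1.0)

def predict(x, w, b, g=ltu):
    """Linear classifier with activation g: y_hat = g(w . x + b)."""
    return g(x @ w + b)
```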
📗 The weights and biases need to be selected to minimize a loss (or cost) function \(C = C\left(\hat{y}_{1}, y_{1}\right) + C\left(\hat{y}_{2}, y_{2}\right) + ... + C\left(\hat{y}_{n}, y_{n}\right)\) or \(C = C\left(a_{1}, y_{1}\right) + C\left(a_{2}, y_{2}\right) + ... + C\left(a_{n}, y_{n}\right)\) (in case \(\hat{y}_{i}\) is a prediction of either \(0\) or \(1\) and \(a_{i}\) is a probability prediction in \(\left[0, 1\right]\)), the sum of loss from the prediction of each item: Wikipedia.
➩ Zero-one loss counts the number of mistakes: \(C\left(a_{i}, y_{i}\right) = 1\) if \(a_{i} \neq y_{i}\) and \(0\) otherwise.
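📗 A minimal sketch of the total cost as a sum of per-item losses, with the zero-one loss as the per-item loss (the function names are made up for illustration):
```python
def zero_one_loss(a_i, y_i):
    """Per-item zero-one loss: 1 if the prediction is wrong, 0 otherwise."""
    return 1 if a_i != y_i else 0

def total_cost(a, y, item_cost=zero_one_loss):
    """Total cost C: the sum of per-item losses over the training set."""
    return sum(item_cost(a_i, y_i) for a_i, y_i in zip(a, y))
```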
📗 [3 points] Move the sliders below to change the green plane normal so that the total loss from blue points below the plane and the red points above the plane is minimized.
📗 To minimize a loss function by choosing weights and biases, they can be initialized randomly and updated based on the derivative (gradient in the multi-variate case): Link, Wikipedia.
➩ If the derivative is negative and large: increase the weight by a large amount.
➩ If the derivative is negative and small: increase the weight by a small amount.
➩ If the derivative is positive and large: decrease the weight by a large amount.
➩ If the derivative is positive and small: decrease the weight by a small amount.
📗 This can be summarized by the gradient descent formula \(w_{j} \leftarrow w_{j} - \alpha \dfrac{\partial C}{\partial w_{j}}\) for \(j = 1, 2, ..., m\) and \(b \leftarrow b - \alpha \dfrac{\partial C}{\partial b}\).
➩ In multi-variate calculus notation, this is usually written as \(w \leftarrow w - \alpha \nabla_{w} C\) and \(b \leftarrow b - \alpha \dfrac{\partial C}{\partial b}\), where \(\nabla_{w} C = \begin{bmatrix} \dfrac{\partial C}{\partial w_{1}} \\ \dfrac{\partial C}{\partial w_{2}} \\ ... \\ \dfrac{\partial C}{\partial w_{m}} \end{bmatrix}\)
➩ \(\alpha\) is called the learning rate and controls how fast the weights are updated.
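📗 A toy example of gradient descent in one variable, minimizing \(C\left(w\right) = w^{2}\) with derivative \(2 w\) (the starting point and learning rate are made-up numbers):
```python
w = 3.0      # arbitrary initial weight
alpha = 0.1  # learning rate
for t in range(100):
    grad = 2 * w          # derivative of C(w) = w^2
    w = w - alpha * grad  # gradient descent step
print(w)  # approximately 0, the minimizer of w^2
```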
TopHat Discussion
📗 [1 point] Move the points to see the derivatives (slope of tangent line) of the function \(x^{2}\):
📗 The derivative \(\dfrac{\partial C\left(a_{i}, y_{i}\right)}{\partial w_{j}}\) for the cross entropy cost and the logistic activation function is \(\left(a_{i} - y_{i}\right) x_{ij}\).
📗 The gradient descent step can be written as \(w \leftarrow w - \alpha \left(\left(a_{1} - y_{1}\right) x_{1} + \left(a_{2} - y_{2}\right) x_{2} + ... + \left(a_{n} - y_{n}\right) x_{n}\right)\) and \(b \leftarrow b - \alpha \left(\left(a_{1} - y_{1}\right) + \left(a_{2} - y_{2}\right) + ... + \left(a_{n} - y_{n}\right)\right)\), which is similar to the Perceptron algorithm formula except that the derivatives from all the items need to be summed up.
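📗 A minimal sketch of one batch gradient descent step for the logistic activation with the cross entropy cost, using the gradient above (the array names are illustrative assumptions):
```python
import numpy as np

def logistic(z):
    """Logistic (sigmoid) activation: 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent_step(X, y, w, b, alpha):
    """One gradient descent step; X is n-by-m, y holds 0/1 labels."""
    a = logistic(X @ w + b)            # activations a_i for all items
    w = w - alpha * X.T @ (a - y)      # w <- w - alpha * sum_i (a_i - y_i) x_i
    b = b - alpha * np.sum(a - y)      # b <- b - alpha * sum_i (a_i - y_i)
    return w, b
```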
📗 [1 point] Change the regression coefficients to minimize the loss.
TopHat Quiz
📗 [3 points] Which one of the following is the gradient descent step for \(w\) if the activation function is and the cost function is ?
📗 Choices:
\(w = w - \alpha \displaystyle\sum_{i=1}^{n} \left(a_{i} - y_{i}\right)\)
None of the above
📗 Notes and code adapted from the course taught by Professors Jerry Zhu, Yingyu Liang, and Charles Dyer.
📗 Content from note blocks marked "optional" and content from Wikipedia and other demo links are helpful for understanding the materials, but will not be explicitly tested on the exams.