# Welcome to CS540

📗 The main components of the course include:
➩ Assignments and Projects (HW1-8 and CP1-4): 60%
➩ Exams: 40%
📗 *NEW* Competitive Projects (CP1-4):
➩ Participate in 2 or more competitions (out of 4 choices).
➩ Create your own training dataset, network architecture, and training algorithm (with strategic considerations).
➩ Strict deadline and format to submit the neural networks you trained.
➩ Grades are based on ranking within the class: in the past, grades were curved at the end, and assignment averages were close to perfect, so the ranking effectively depended only on the exams. Ranking-based assignments, starting this summer, shift some of the weight from the exams back to the assignments, which are the more important part of the course.
📗 *NEW* Use of LLMs:
➩ Students are encouraged to use Large Language Models (LLMs) to generate code for assignments and projects and to solve exam questions.
➩ Remember to give attribution and provide the prompts.
TopHat Discussion
📗 Why are you taking the course?
➩ Learn how to use AI tools like ChatGPT? This is not covered in the course.
➩ Learn how to program AI tools like ChatGPT? Only simple models.
➩ Learn the math and statistics behind AI algorithms? Yes, this is the focus of the course.




# ChatGPT

📗 GPT stands for Generative Pre-trained Transformer.
➩ Unsupervised learning (convert text to numerical vectors).
➩ Supervised learning: (1) discriminative (predict answers based on questions), (2) generative (predict next word based on previous word).
➩ Reinforcement learning (update model based on human feedback).
TopHat Discussion
📗 Have you used ChatGPT (or another Large Language Model)? What did you use LLM for?
➩ Solve homework or exam questions? For CS540, it is possible with some prompt engineering: Link.
➩ Write code for projects? For CS540, you are allowed and encouraged to use large language models (LLMs) to help with writing code (at the moment, most LLMs cannot write complete projects).
➩ Write stories or create images? In the past, there were CS540 assignments asking students to use earlier versions of GPT to perform these tasks and compare the results with human creations.
➩ Other uses?



# Machine Learning

📗 A machine learning data set usually contains features (text, images, ... converted to numerical vectors) and labels (categories, converted to integers).
➩ Features: \(X = \left(x_{1}, x_{2}, ..., x_{n}\right)\), where \(x_{i} = \left(x_{i1}, x_{i2}, ..., x_{im}\right)\), and \(x_{ij}\) is called feature (or attribute) \(j\) of instance (or item) \(i\).
➩ Labels: \(Y = \left(y_{1}, y_{2}, ..., y_{n}\right)\), where \(y_{i}\) is the label of item \(i\).
📗 Supervised learning: given training set \(\left(X, Y\right)\), estimate a prediction function \(y \approx \hat{f}\left(x\right)\) to predict \(y' = \hat{f}\left(x'\right)\) based on a new item \(x'\).
📗 Unsupervised learning: given training set \(\left(X\right)\), put points into groups (discrete groups \(\left\{1, 2, ..., k\right\}\) or "continuous" lower dimensional representations).
📗 Reinforcement learning: given an environment with states \(x\) and reward \(R\left(x_{t}, y_{t}\right)\) when action \(y_{t}\) is performed in state \(x_{t}\), estimate the optimal policy \(y' = f\left(x'\right)\) that selects the best action in state \(x'\) that maximizes the total reward.
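📗 As a concrete illustration of this notation, here is a minimal NumPy sketch of a supervised learning setup; the 2-feature dataset and the prediction function \(\hat{f}\) are made-up assumptions for illustration:
```python
import numpy as np

# Hypothetical training set: n = 4 items, m = 2 features each.
# Row i of X is the feature vector x_i = (x_i1, x_i2).
X = np.array([[0.0, 1.0],
              [1.0, 0.0],
              [2.0, 2.0],
              [3.0, 1.0]])
Y = np.array([0, 0, 1, 1])  # y_i is the label of item i

def f_hat(x, w=np.array([1.0, 1.0]), b=-3.0):
    """A made-up prediction function y' = f_hat(x') (here, a simple linear classifier)."""
    return int(w @ x + b >= 0)

x_new = np.array([2.5, 0.5])   # a new item x'
print(f_hat(x_new))            # predicted label y'
```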



# Linear Classifier

📗 A simple classifier is a linear classifier: \(\hat{y}_{i} = 1\) if \(w_{1} x_{i 1} + w_{2} x_{i 2} + ... + w_{m} x_{i m} + b \geq 0\) and \(\hat{y}_{i} = 0\) otherwise. This classifier is called an LTU (Linear Threshold Unit) perceptron: Wikipedia.
📗 Given a training set, the weights \(w_{1}, w_{2}, ..., w_{m}\) and bias \(b\) can be estimated based on the data \(\left(x_{1}, y_{1}\right), \left(x_{2}, y_{2}\right), ..., \left(x_{n}, y_{n}\right)\). One algorithm is called the Perceptron Algorithm.
➩ Initialize random weights and bias.
➩ For each item \(x_{i}\), compute the prediction \(\hat{y}_{i}\).
➩ If prediction is \(\hat{y}_{i} = 0\) and the actual label is \(y_{i} = 1\), increase the weights by \(w \leftarrow w + \alpha x_{i}, b \leftarrow b + \alpha\), \(\alpha\) is a constant called learning rate.
➩ If prediction is \(\hat{y}_{i} = 1\) and the actual label is \(y_{i} = 0\), decrease the weights by \(w \leftarrow w - \alpha x_{i}, b \leftarrow b - \alpha\).
➩ Repeat the process until convergence (the weights no longer change); a code sketch of the full loop is given below.
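📗 The following is a minimal NumPy sketch of the Perceptron Algorithm as described above; the dataset, learning rate, and iteration cap are made-up assumptions for illustration:
```python
import numpy as np

def perceptron_train(X, Y, alpha=1.0, max_epochs=100):
    """Train an LTU perceptron: y_hat = 1 if w . x + b >= 0, else 0."""
    n, m = X.shape
    rng = np.random.default_rng(0)
    w = rng.normal(size=m)   # initialize random weights
    b = rng.normal()         # and random bias
    for _ in range(max_epochs):
        changed = False
        for i in range(n):
            y_hat = 1 if X[i] @ w + b >= 0 else 0
            if y_hat == 0 and Y[i] == 1:      # predicted 0, label 1: increase
                w, b, changed = w + alpha * X[i], b + alpha, True
            elif y_hat == 1 and Y[i] == 0:    # predicted 1, label 0: decrease
                w, b, changed = w - alpha * X[i], b - alpha, True
        if not changed:                       # converged: weights no longer change
            break
    return w, b

# Tiny made-up linearly separable dataset with labels 0 or 1.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
Y = np.array([0, 0, 1, 1])
w, b = perceptron_train(X, Y)
print(w, b)
```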
TopHat Discussion ID:
📗 [3 points] Move the sliders below to change the green plane normal so that the largest number of the blue points are above the plane and the largest number of the red points are below the plane.

📗 Answers:
\(w_{1}\) = 0
\(w_{2}\) = 0
\(w_{3}\) = 1
\(b\) = 0



# Perceptron Algorithm

📗 The perceptron algorithm update can be summarized as \(w \leftarrow w - \alpha \left(\hat{y}_{i} - y_{i}\right) x_{i}\) (or for \(j = 1, 2, ..., m\), \(w_{j} \leftarrow w_{j} - \alpha \left(\hat{y}_{i} - y_{i}\right) x_{ij}\)) and \(b \leftarrow b - \alpha \left(\hat{y}_{i} - y_{i}\right)\), where \(\hat{y}_{i} = 1\) if \(w_{1} x_{i 1} + w_{2} x_{i 2} + ... + w_{m} x_{i m} + b \geq 0\) and \(\hat{y}_{i} = 0\) if \(w_{1} x_{i 1} + w_{2} x_{i 2} + ... + w_{m} x_{i m} + b < 0\): Wikipedia. A small worked example with made-up numbers is given after the notes on the learning rate below.
📗 The learning rate \(\alpha\) controls how fast the weights are updated.
➩ \(\alpha\) can be constant (usually 1).
➩ \(\alpha\) can be a function of the iteration (usually decreasing), for example, \(\alpha_{t} = \dfrac{1}{\sqrt{t}}\).
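Example
📗 A worked example with made-up numbers: suppose \(w = \left(1, -1\right)\), \(b = 0\), \(\alpha = 0.5\), and the next item is \(x_{i} = \left(1, 2\right)\) with \(y_{i} = 1\). Then \(w_{1} x_{i 1} + w_{2} x_{i 2} + b = 1 - 2 + 0 = -1 < 0\), so \(\hat{y}_{i} = 0\), and the update gives \(w \leftarrow w - \alpha \left(\hat{y}_{i} - y_{i}\right) x_{i} = \left(1, -1\right) + 0.5 \left(1, 2\right) = \left(1.5, 0\right)\) and \(b \leftarrow b - \alpha \left(\hat{y}_{i} - y_{i}\right) = 0 + 0.5 = 0.5\).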
TopHat Discussion ID:
📗 [3 points] Find the Perceptron weights by using the Perceptron algorithm: select a point on the diagram and click anywhere else to run one iteration of the Perceptron algorithm.

📗 Answer: 0,0.1,0

TopHat Quiz (Past Exam Question) ID:
📗 [4 points] Consider a Linear Threshold Unit (LTU) perceptron with initial weights \(w\) = and bias \(b\) = trained using the Perceptron Algorithm. Given a new input \(x\) = and \(y\) = . Let the learning rate be \(\alpha\) = , compute the updated weights, \(w', b'\) = :
📗 Answer (comma separated vector): .




# Activation Function

📗 More generally, a non-linear activation function can be applied and the classifier is still a linear classifier: \(\hat{y}_{i} = g\left(w_{1} x_{i 1} + w_{2} x_{i 2} + ... + w_{m} x_{i m} + b\right)\) for some non-linear function \(g\) called the activation function: Wikipedia.
➩ LTU: the special case where \(g\left(z\right) = 1\) if \(z \geq 0\) and \(g\left(z\right) = 0\) otherwise.
➩ Linear regression (not commonly used for classification problems): \(g\left(z\right) = z\), usually truncated to the interval \(\left[0, 1\right]\) so it can represent the probability that the predicted label is \(1\): Wikipedia.
➩ Logistic regression: \(g\left(z\right) = \dfrac{1}{1 + e^{-z}}\): Wikipedia.
📗 There are other activation functions often used in neural networks (multi-layer perceptrons):
➩ ReLU (REctified Linear Unit): \(g\left(z\right) = \displaystyle\max\left(0, z\right)\): Wikipedia.
➩ tanh (Hyperbolic TANgent): \(g\left(z\right) = \dfrac{e^{z} - e^{-z}}{e^{z} + e^{-z}}\).
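📗 A minimal NumPy sketch of these activation functions (the function names are mine; the formulas follow the definitions above):
```python
import numpy as np

def ltu(z):
    """Linear Threshold Unit: 1 if z >= 0, else 0."""
    return np.where(z >= 0, 1.0, 0.0)

def linear(z):
    """Identity activation, clipped to [0, 1] when used as a probability."""
    return np.clip(z, 0.0, 1.0)

def logistic(z):
    """Logistic (sigmoid) activation: 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    """Rectified Linear Unit: max(0, z)."""
    return np.maximum(0.0, z)

def tanh(z):
    """Hyperbolic tangent: (e^z - e^-z) / (e^z + e^-z)."""
    return np.tanh(z)

z = np.linspace(-3, 3, 7)
print(logistic(z))  # e.g. logistic(0) = 0.5
```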




# Loss Function

📗 The weights and biases need to be selected to minimize a loss (or cost) function \(C = C\left(\hat{y}_{1}, y_{1}\right) + C\left(\hat{y}_{2}, y_{2}\right) + ... + C\left(\hat{y}_{n}, y_{n}\right)\) or \(C = C\left(a_{1}, y_{1}\right) + C\left(a_{2}, y_{2}\right) + ... + C\left(a_{n}, y_{n}\right)\) (in case \(\hat{y}_{i}\) is a prediction of either \(0\) or \(1\) and \(a_{i}\) is a probability prediction in \(\left[0, 1\right]\)), the sum of loss from the prediction of each item: Wikipedia.
➩ Zero-one loss counts the number of mistakes: \(C\left(a_{i}, y_{i}\right) = 1\) if \(a_{i} \neq y_{i}\) and \(0\) otherwise.
➩ Square loss: \(C\left(a_{i}, y_{i}\right) = \dfrac{1}{2} \left(a_{i} - y_{i}\right)^{2}\).
➩ Hinge loss: \(C\left(a_{i}, y_{i}\right) = \displaystyle\max\left\{0, 1 - \left(2 a_{i} - 1\right) \cdot \left(2 y_{i} - 1\right)\right\}\): Wikipedia.
➩ Cross entropy loss: \(C\left(a_{i}, y_{i}\right) = -y_{i} \log\left(a_{i}\right) - \left(1 - y_{i}\right) \log\left(1 - a_{i}\right)\): Wikipedia.
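📗 A minimal NumPy sketch of these loss functions for a single item, where \(a_{i}\) is a probability prediction in \(\left[0, 1\right]\) and \(y_{i}\) is 0 or 1 (the example values at the end are made up):
```python
import numpy as np

def zero_one_loss(a, y):
    """Counts a mistake when the (0 or 1) prediction differs from the label."""
    return float(a != y)

def square_loss(a, y):
    """0.5 * (a - y)^2."""
    return 0.5 * (a - y) ** 2

def hinge_loss(a, y):
    """max(0, 1 - (2a - 1)(2y - 1)); maps {0, 1} to {-1, +1} before comparing."""
    return max(0.0, 1.0 - (2 * a - 1) * (2 * y - 1))

def cross_entropy_loss(a, y, eps=1e-12):
    """-y log(a) - (1 - y) log(1 - a), clipped to avoid log(0)."""
    a = np.clip(a, eps, 1 - eps)
    return -y * np.log(a) - (1 - y) * np.log(1 - a)

# The total cost C sums the loss over all items (made-up probabilities a_i here).
a = np.array([0.9, 0.2, 0.7])
y = np.array([1, 0, 0])
print(sum(cross_entropy_loss(ai, yi) for ai, yi in zip(a, y)))
```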
TopHat Discussion ID:
📗 [3 points] Move the sliders below to change the green plane normal so that the total loss from blue points below the plane and the red points above the plane is minimized.

📗 Answers:
\(w_{1}\) = 0
\(w_{2}\) = 0
\(w_{3}\) = 1
\(b\) = 0
TopHat Discussion ID:
📗 [1 points] Change the logistic regression weights to minimize the cross entropy loss.





# Gradient Descent

📗 To minimize a loss function by choosing weights and biases, they can be initialized randomly and updated based on the derivative (gradient in the multi-variate case): Link, Wikipedia.
➩ If the derivative is negative and large in magnitude: increase the weight by a large amount.
➩ If the derivative is negative and small in magnitude: increase the weight by a small amount.
➩ If the derivative is positive and large in magnitude: decrease the weight by a large amount.
➩ If the derivative is positive and small in magnitude: decrease the weight by a small amount.
📗 This can be summarized by the gradient descent formula \(w_{j} \leftarrow w_{j} - \alpha \dfrac{\partial C}{\partial w_{j}}\) for \(j = 1, 2, ..., m\) and \(b \leftarrow b - \alpha \dfrac{\partial C}{\partial b}\).
➩ In multi-variate calculus notation, this is usually written as \(w \leftarrow w - \alpha \nabla_{w} C\) and \(b \leftarrow b - \alpha \dfrac{\partial C}{\partial b}\), where \(\nabla_{w} C = \begin{bmatrix} \dfrac{\partial C}{\partial w_{1}} \\ \dfrac{\partial C}{\partial w_{2}} \\ ... \\ \dfrac{\partial C}{\partial w_{m}} \end{bmatrix}\)
➩ \(\alpha\) is called the learning rate and controls how fast the weights are updated.
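📗 A minimal sketch of gradient descent on the single-variable function \(x^{2}\) (matching the demo below); the starting point, learning rate, and number of steps are made-up assumptions:
```python
def gradient_descent(grad, x0, alpha=0.5, steps=10):
    """Repeatedly move against the derivative: x <- x - alpha * grad(x)."""
    x = x0
    for _ in range(steps):
        x = x - alpha * grad(x)
    return x

# Minimize f(x) = x^2, whose derivative is f'(x) = 2x.
print(gradient_descent(lambda x: 2 * x, x0=3.0, alpha=0.25, steps=20))
# the result approaches the minimizer x = 0
```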
TopHat Discussion
📗 [1 points] Move the points to see the derivatives (slope of tangent line) of the function \(x^{2}\):





# Gradient Descent for Logistic Regression

📗 The derivative \(\dfrac{\partial C\left(a_{i}, y_{i}\right)}{\partial w_{j}}\) for the cross entropy cost and the logistic activation function is \(\left(a_{i} - y_{i}\right) x_{ij}\).
📗 The gradient descent step can be written as \(w \leftarrow w - \alpha \left(\left(a_{1} - y_{1}\right) x_{1} + \left(a_{2} - y_{2}\right) x_{2} + ... + \left(a_{n} - y_{n}\right) x_{n}\right)\) and \(b \leftarrow b - \alpha \left(\left(a_{1} - y_{1}\right) + \left(a_{2} - y_{2}\right) + ... + \left(a_{n} - y_{n}\right)\right)\), which is similar to the Perceptron algorithm formula except that the derivatives for all items need to be summed up.
Math Note If \(C_{i} = -y_{i} \log\left(a_{i}\right) - \left(1 - y_{i}\right) \log\left(1 - a_{i}\right)\) and \(a_{i} = \dfrac{1}{1 + e^{-z_{i}}}\) where \(z_{i} = w_{1} x_{i 1} + w_{2} x_{i 2} + ... + w_{m} x_{i m} + b\), then the chain rule implies,
 \(\dfrac{\partial C_{i}}{\partial w_{j}} = \dfrac{\partial C_{i}}{\partial a_{i}} \dfrac{\partial a_{i}}{\partial z_{i}} \dfrac{\partial z_{i}}{\partial w_{j}}\)
 \(= \left(\dfrac{-y_{i}}{a_{i}} + \dfrac{1 - y_{i}}{1 - a_{i}}\right) \left(\dfrac{1}{1 + e^{-z_{i}}} \dfrac{e^{-z_{i}}}{1 + e^{-z_{i}}}\right) \left(x_{ij}\right)\)
 \(= \dfrac{-y_{i} + a_{i} y_{i} + a_{i} - a_{i} y_{i}}{a_{i} \left(1 - a_{i}\right)} \left(a_{i} \left(1 - a_{i}\right)\right) \left(x_{ij}\right)\)
 \(= \left(a_{i} - y_{i}\right) x_{ij}\),
combining the \(w_{j}\) for \(j = 1, 2, ..., m\), \(\nabla_{w} C_{i} = \left(a_{i} - y_{i}\right) x_{i}\),
and combining the \(C_{i}\) for \(i = 1, 2, ..., n\), \(\nabla_{w} C = \left(a_{1} - y_{1}\right) x_{1} + \left(a_{2} - y_{2}\right) x_{2} + ... + \left(a_{n} - y_{n}\right) x_{n}\).
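📗 A minimal NumPy sketch of (batch) gradient descent for logistic regression with the cross entropy loss, using the gradient derived above; the dataset, learning rate, and iteration count are made-up assumptions:
```python
import numpy as np

def train_logistic_regression(X, Y, alpha=0.1, epochs=1000):
    """Batch gradient descent using grad_w C = sum_i (a_i - y_i) x_i."""
    n, m = X.shape
    w, b = np.zeros(m), 0.0
    for _ in range(epochs):
        a = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # activations a_i
        w = w - alpha * (X.T @ (a - Y))          # w <- w - alpha * sum_i (a_i - y_i) x_i
        b = b - alpha * np.sum(a - Y)            # b <- b - alpha * sum_i (a_i - y_i)
    return w, b

# Tiny made-up dataset with labels 0 or 1.
X = np.array([[0.0, 0.5], [1.0, 1.0], [2.0, 2.5], [3.0, 2.0]])
Y = np.array([0, 0, 1, 1])
w, b = train_logistic_regression(X, Y)
print(w, b, 1.0 / (1.0 + np.exp(-(X @ w + b))))  # fitted probabilities a_i
```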
TopHat Discussion ID:
📗 [1 points] Change the regression coefficients to minimize the loss.


TopHat Quiz ID:
📗 [3 points] Which one of the following is the gradient descent step for w if the activation function is and the cost function is ?
📗 Choices:
\(w = w - \alpha \displaystyle\sum_{i=1}^{n} \left(a_{i} - y_{i}\right)\)
\(w = w - \alpha \displaystyle\sum_{i=1}^{n} \left(a_{i} - y_{i}\right)\)
\(w = w - \alpha \displaystyle\sum_{i=1}^{n} \left(a_{i} - y_{i}\right)\)
\(w = w - \alpha \displaystyle\sum_{i=1}^{n} \left(a_{i} - y_{i}\right)\)
\(w = w - \alpha \displaystyle\sum_{i=1}^{n} \left(a_{i} - y_{i}\right)\)
None of the above




📗 Notes and code adapted from the course taught by Professors Jerry Zhu, Yudong Chen, Yingyu Liang, and Charles Dyer.
📗 Content from note blocks marked "optional" and content from Wikipedia and other demo links are helpful for understanding the materials, but will not be explicitly tested on the exams.
📗 If there is an issue with TopHat during the lectures, please submit your answers on paper (include your Wisc ID and answers) or this Google form Link at the end of the lecture.
📗 Anonymous feedback can be submitted to: Form. Non-anonymous feedback and questions can be posted on Piazza: Link






Last Updated: August 11, 2025 at 1:30 PM