# Linear Classifier

📗 Another simple classifier is a linear classifier: \(\hat{y}_{i} = 1\) if \(w_{1} x_{i 1} + w_{2} x_{i 2} + ... + w_{m} x_{i m} + b \geq 0\) and \(\hat{y}_{i} = 0\) otherwise. This classifier is called an LTU (Linear Threshold Unit) perceptron: Wikipedia.
📗 Given a training set, the weights \(w_{1}, w_{2}, ..., w_{m}\) and bias \(b\) can be estimated based on the data \(\left(x_{1}, y_{1}\right), \left(x_{2}, y_{2}\right), ..., \left(x_{n}, y_{n}\right)\). One algorithm is called the Perceptron Algorithm.
➩ Initialize random weights and bias.
➩ For some item \(x_{i}\), compute the prediction \(\hat{y}_{i}\).
➩ If the prediction is \(\hat{y}_{i} = 0\) and the actual label is \(y_{i} = 1\), increase the weights by \(w = w + \alpha x_{i}\) and \(b = b + \alpha\), where \(\alpha\) is a constant called the learning rate.
➩ If the prediction is \(\hat{y}_{i} = 1\) and the actual label is \(y_{i} = 0\), decrease the weights by \(w = w - \alpha x_{i}\) and \(b = b - \alpha\).
➩ Repeat the process until convergence (the weights are no longer changing); a small code sketch of these steps is given below.
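📗 The following is a minimal NumPy sketch of the Perceptron Algorithm above; the toy data set, learning rate, and epoch limit are made up for illustration.

```python
import numpy as np

def perceptron_train(X, y, alpha=1.0, max_epochs=100):
    """LTU perceptron trained with the Perceptron Algorithm; X is n-by-m, y holds 0/1 labels."""
    n, m = X.shape
    rng = np.random.default_rng(0)
    w, b = rng.normal(size=m), rng.normal()          # random initial weights and bias
    for _ in range(max_epochs):
        changed = False
        for i in range(n):
            y_hat = 1 if X[i] @ w + b >= 0 else 0    # prediction for item i
            if y_hat == 0 and y[i] == 1:             # predicted 0, actual 1: increase
                w, b, changed = w + alpha * X[i], b + alpha, True
            elif y_hat == 1 and y[i] == 0:           # predicted 1, actual 0: decrease
                w, b, changed = w - alpha * X[i], b - alpha, True
        if not changed:                              # convergence: no weight changed this pass
            break
    return w, b

# made-up linearly separable data: y = 1 when x1 + x2 >= 1
X = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.]])
y = np.array([0, 1, 1, 1])
print(perceptron_train(X, y))
```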
TopHat Discussion
📗 [3 points] Move the sliders to change the green plane normal so that as many of the blue points as possible are above the plane and as many of the red points as possible are below the plane.

📗 Answers:
\(w_{1}\) = 0
\(w_{2}\) = 0
\(w_{3}\) = 1
\(b\) = 0



# Perceptron Algorithm

📗 The perceptron algorithm update can be summarized as \(w = w - \alpha \left(a_{i} - y_{i}\right) x_{i}\) (or for \(j = 1, 2, ..., m\), \(w_{j} = w_{j} - \alpha \left(a_{i} - y_{i}\right) x_{ij}\)) and \(b = b - \alpha \left(a_{i} - y_{i}\right)\), where \(a_{i} = \hat{y}_{i} = 1\) if \(w_{1} x_{i 1} + w_{2} x_{i 2} + ... + w_{m} x_{i m} + b \geq 0\) and \(a_{i} = \hat{y}_{i} = 0\) if \(w_{1} x_{i 1} + w_{2} x_{i 2} + ... + w_{m} x_{i m} + b < 0\): Wikipedia.
📗 The learning rate \(\alpha\) controls how fast the weights are updated.
➩ \(\alpha\) can be constant (usually 1).
➩ \(\alpha\) can be a function of the iteration (usually decreasing), for example, \(\alpha_{t} = \dfrac{1}{\sqrt{t}}\).
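📗 For a concrete example of this update with made-up numbers: suppose \(m = 2\), \(w = \left(0, 0\right)\), \(b = 0\), \(\alpha = 1\), \(x_{i} = \left(1, 2\right)\), and \(y_{i} = 0\). Then \(w_{1} x_{i 1} + w_{2} x_{i 2} + b = 0 \geq 0\), so \(a_{i} = 1\) and \(a_{i} - y_{i} = 1\), and the update gives \(w = \left(0, 0\right) - 1 \cdot 1 \cdot \left(1, 2\right) = \left(-1, -2\right)\) and \(b = 0 - 1 \cdot 1 = -1\). If instead \(y_{i} = 1\), then \(a_{i} - y_{i} = 0\) and the weights do not change.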
TopHat Discussion
📗 [3 points] Find the perceptron weights using the Perceptron Algorithm: select a point on the diagram and click anywhere else to run one iteration of the algorithm.

📗 Answer: 0,0.1,0

TopHat Quiz (Past Exam Question)
📗 [4 points] Consider a Linear Threshold Unit (LTU) perceptron with initial weights \(w\) = and bias \(b\) = trained using the Perceptron Algorithm. Given a new input \(x\) = and \(y\) = . Let the learning rate be \(\alpha\) = , compute the updated weights, \(w', b'\) = :
📗 Answer (comma separated vector): .




# Activation Function

📗 More generally, a non-linear activation function \(g\) can be applied to the linear combination of features, and the classifier is still a linear classifier: \(\hat{y}_{i} = g\left(w_{1} x_{i 1} + w_{2} x_{i 2} + ... + w_{m} x_{i m} + b\right)\), where the non-linear function \(g\) is called the activation function: Wikipedia. Common choices are listed below, with a small code sketch after the list.
➩ LTU: the special case where \(g\left(z\right) = 1\) if \(z \geq 0\) and \(g\left(z\right) = 0\) otherwise.
➩ Linear regression (also called linear probability model): \(g\left(z\right) = z\), usually truncated to a number between \(\left[0, 1\right]\) to represent probability that the predicted label is \(1\): Wikipedia.
➩ Logistic regression: \(g\left(z\right) = \dfrac{1}{1 + e^{-z}}\): Wikipedia.
📗 There are other activation functions often used in neural networks (multi-layer perceptrons):
➩ ReLU (REctified Linear Unit): \(g\left(z\right) = \displaystyle\max\left(0, z\right)\): Wikipedia.
➩ tanh (Hyperbolic TANgent): \(g\left(z\right) = \dfrac{e^{z} - e^{-z}}{e^{z} + e^{-z}}\).
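📗 The following is a minimal NumPy sketch of these activation functions; the function names are only illustrative.

```python
import numpy as np

def ltu(z):
    """LTU: 1 if z >= 0, and 0 otherwise."""
    return np.where(z >= 0, 1.0, 0.0)

def linear(z):
    """Linear regression / linear probability model, truncated to [0, 1]."""
    return np.clip(z, 0.0, 1.0)

def logistic(z):
    """Logistic (sigmoid): 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    """ReLU: max(0, z)."""
    return np.maximum(0.0, z)

def tanh(z):
    """Hyperbolic tangent: (e^z - e^-z) / (e^z + e^-z)."""
    return np.tanh(z)

z = np.linspace(-3.0, 3.0, 7)
print(logistic(z), relu(z))
```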




# Loss Function

📗 The weights and biases need to be selected to minimize a loss (or cost) function \(C = C\left(\hat{y}_{1}, y_{1}\right) + C\left(\hat{y}_{2}, y_{2}\right) + ... + C\left(\hat{y}_{n}, y_{n}\right)\) or \(C = C\left(a_{1}, y_{1}\right) + C\left(a_{2}, y_{2}\right) + ... + C\left(a_{n}, y_{n}\right)\) (where \(\hat{y}_{i}\) is a prediction of either \(0\) or \(1\) and \(a_{i}\) is a probability prediction in \(\left[0, 1\right]\)), the sum of the losses from the predictions of each item: Wikipedia. Common choices are listed below, with a small code sketch after the list.
➩ Zero-one loss counts the number of mistakes: \(C\left(a_{i}, y_{i}\right) = 1\) if \(a_{i} \neq y_{i}\) and \(0\) otherwise.
➩ Square loss: \(C\left(a_{i}, y_{i}\right) = \dfrac{1}{2} \left(a_{i} - y_{i}\right)^{2}\).
➩ Hinge loss: \(C\left(a_{i}, y_{i}\right) = \displaystyle\max\left\{0, 1 - \left(2 a_{i} - 1\right) \cdot \left(2 y_{i} - 1\right)\right\}\): Wikipedia.
➩ Cross entropy loss: \(C\left(a_{i}, y_{i}\right) = -y_{i} \log\left(a_{i}\right) - \left(1 - y_{i}\right) \log\left(1 - a_{i}\right)\): Wikipedia.
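📗 The following is a minimal NumPy sketch of these loss functions, assuming \(a\) holds the predictions (probabilities for the cross entropy loss) and \(y\) holds the \(0\)/\(1\) labels; the example values are made up.

```python
import numpy as np

def zero_one_loss(a, y):
    """Counts one mistake for each prediction that differs from the label."""
    return np.sum(a != y)

def square_loss(a, y):
    """Sum of 1/2 (a_i - y_i)^2 over all items."""
    return np.sum(0.5 * (a - y) ** 2)

def hinge_loss(a, y):
    """Sum of max(0, 1 - (2 a_i - 1)(2 y_i - 1)), mapping 0/1 to -1/+1."""
    return np.sum(np.maximum(0.0, 1.0 - (2 * a - 1) * (2 * y - 1)))

def cross_entropy_loss(a, y):
    """Sum of -y_i log(a_i) - (1 - y_i) log(1 - a_i) for probabilities a_i."""
    return np.sum(-y * np.log(a) - (1 - y) * np.log(1 - a))

# made-up probability predictions and labels
y = np.array([1, 0, 1])
a = np.array([0.9, 0.2, 0.6])
print(cross_entropy_loss(a, y), square_loss(a, y), hinge_loss(a, y))
```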
TopHat Discussion
📗 [3 points] Move the sliders to change the green plane normal so that the total loss from the blue points below the plane and the red points above the plane is minimized.

📗 Answers:
\(w_{1}\) = 0
\(w_{2}\) = 0
\(w_{3}\) = 1
\(b\) = 0
TopHat Discussion
📗 [1 point] Change the logistic regression weights to minimize the cross entropy loss.





# Gradient Descent

📗 To minimize a loss function by choosing weights and biases, they can be initialized randomly and updated based on the derivative (gradient in the multi-variate case): Link, Wikipedia.
➩ If the derivative is negative and large: increase the weight by a large amount.
➩ If the derivative is negative and small: increase the weight by a small amount.
➩ If the derivative is positive and large: decrease the weight by a large amount.
➩ If the derivative is positive and small: decrease the weight by a small amount.
📗 This can be summarized by the gradient descent formula \(w_{j} = w_{j} - \alpha \dfrac{\partial C}{\partial w_{j}}\) for \(j = 1, 2, ..., m\) and \(b = b - \alpha \dfrac{\partial C}{\partial b}\).
➩ In multi-variate calculus notation, this is usually written as \(w = w - \alpha \nabla_{w} C\) and \(b = b - \alpha \dfrac{\partial C}{\partial b}\), where \(\nabla_{w} C = \begin{bmatrix} \dfrac{\partial C}{\partial w_{1}} \\ \dfrac{\partial C}{\partial w_{2}} \\ ... \\ \dfrac{\partial C}{\partial w_{m}} \end{bmatrix}\)
➩ \(\alpha\) is called the learning rate and controls how fast the weights are updated.
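📗 The following is a tiny sketch of this update rule on a single-variable example, minimizing \(C\left(w\right) = w^{2}\) whose derivative is \(2 w\); the starting point, learning rate, and number of steps are arbitrary choices for illustration.

```python
def gradient_descent(dC_dw, w0, alpha=0.1, steps=100):
    """Repeatedly apply the update w = w - alpha * dC/dw."""
    w = w0
    for _ in range(steps):
        w = w - alpha * dC_dw(w)
    return w

# minimize C(w) = w^2, whose derivative is 2 w; the minimizer is w = 0
print(gradient_descent(lambda w: 2 * w, w0=5.0))  # close to 0
```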
TopHat Discussion
📗 [1 point] Move the points to see the derivatives (slopes of the tangent lines) of the function \(x^{2}\):





# Gradient Descent for Logistic Regression

📗 The derivative \(\dfrac{\partial C\left(a_{i}, y_{i}\right)}{\partial w_{j}}\) for the cross entropy cost and the logistic activation function is \(\left(a_{i} - y_{i}\right) x_{ij}\), or in vector form, \(\nabla_{w} C\left(a_{i}, y_{i}\right) = \left(a_{i} - y_{i}\right) x_{i}\).
📗 The gradient descent step can be written as \(w = w - \alpha \left(\left(a_{1} - y_{1}\right) x_{1} + \left(a_{2} - y_{2}\right) x_{2} + ... + \left(a_{n} - y_{n}\right) x_{n}\right)\) and \(b = b - \alpha \left(\left(a_{1} - y_{1}\right) + \left(a_{2} - y_{2}\right) + ... + \left(a_{n} - y_{n}\right)\right)\), which is similar to the Perceptron algorithm formula except that the derivatives for all items are summed up. A small code sketch follows the math note below.
Math Note If \(C_{i} = -y_{i} \log\left(a_{i}\right) - \left(1 - y_{i}\right) \log\left(1 - a_{i}\right)\) and \(a_{i} = \dfrac{1}{1 + e^{-z_{i}}}\) where \(z_{i} = w_{1} x_{i 1} + w_{2} x_{i 2} + ... + w_{m} x_{i m} + b\), then the chain rule implies,
 \(\dfrac{\partial C_{i}}{\partial w_{j}} = \dfrac{\partial C_{i}}{\partial a_{i}} \dfrac{\partial a_{i}}{\partial z_{i}} \dfrac{\partial z_{i}}{\partial w_{j}}\)
 \(= \left(\dfrac{-y_{i}}{a_{i}} + \dfrac{1 - y_{i}}{1 - a_{i}}\right) \left(\dfrac{1}{1 + e^{-z_{i}}} \dfrac{e^{-z_{i}}}{1 + e^{-z_{i}}}\right) \left(x_{ij}\right)\)
 \(= \dfrac{-y_{i} + a_{i} y_{i} + a_{i} - a_{i} y_{i}}{a_{i} \left(1 - a_{i}\right)} \left(a_{i} \left(1 - a_{i}\right)\right) \left(x_{ij}\right)\)
 \(= \left(a_{i} - y_{i}\right) x_{ij}\),
combining the \(w_{j}\) for \(j = 1, 2, ..., m\), \(\nabla_{w} C_{i} = \left(a_{i} - y_{i}\right) x_{i}\),
and combining the \(C_{i}\) for \(i = 1, 2, ..., n\), \(\nabla_{w} C = \left(a_{1} - y_{1}\right) x_{1} + \left(a_{2} - y_{2}\right) x_{2} + ... + \left(a_{n} - y_{n}\right) x_{n}\).
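📗 The following is a compact NumPy sketch of these batch gradient descent updates for logistic regression with the cross entropy loss; the data set, learning rate, and iteration count are made up for illustration.

```python
import numpy as np

def logistic_regression_gd(X, y, alpha=0.1, iterations=1000):
    """Batch gradient descent: w = w - alpha * sum_i (a_i - y_i) x_i, and similarly for b."""
    n, m = X.shape
    w, b = np.zeros(m), 0.0
    for _ in range(iterations):
        a = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # a_i = logistic(w . x_i + b)
        w = w - alpha * X.T @ (a - y)            # sum over items of (a_i - y_i) x_i
        b = b - alpha * np.sum(a - y)            # sum over items of (a_i - y_i)
    return w, b

# made-up linearly separable data: y = 1 when x1 + x2 >= 1
X = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.]])
y = np.array([0., 1., 1., 1.])
print(logistic_regression_gd(X, y))
```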
TopHat Discussion
📗 [1 point] Change the regression coefficients to minimize the loss.


TopHat Quiz
📗 [3 points] Which one of the following is the gradient descent step for w if the activation function is and the cost function is ?
📗 Choices:
\(w = w - \alpha \displaystyle\sum_{i=1}^{n} \left(a_{i} - y_{i}\right)\)
None of the above




# Classification vs Regression

📗 Classification is the problem when \(y\) is categorical.
➩ When \(y \in \left\{0, 1\right\}\), the problem is binary classification.
➩ When \(y \in \left\{0, 1, ..., K\right\}\), the problem is multi-class classification.
📗 Regression is the problem when \(y\) is continuous.
➩ Logistic regression is usually used for classification problems, but since it predicts a continuous \(y\) in \(\left[0, 1\right]\), or the probability that \(y\) is in class \(1\), it is called "regression".



# Linear Regression

📗 The regression coefficients are usually estimated by \(w = \left(X^\top X\right)^{-1} X^\top y\), where \(X\) is the design matrix whose rows are the items and the columns are the features (a column of \(1\)s can be added so that the corresponding weight is the bias): Wikipedia.
📗 Gradient descent can be used with squared loss and the weights should converge to the same estimates: \(w = w - \alpha \left(\left(a_{1} - y_{1}\right) x_{1} + \left(a_{2} - y_{2}\right) x_{2} + ... + \left(a_{n} - y_{n}\right) x_{n}\right)\) and \(b = b - \alpha \left(\left(a_{1} - y_{1}\right) + \left(a_{2} - y_{2}\right) + ... + \left(a_{n} - y_{n}\right)\right)\).
Math Note In stats, the coefficients are usually derived as \(y = X w\), implying \(X^\top y = X^\top X w\) or \(w = \left(X^\top X\right)^{-1} X^\top y\).
The gradient descent step follows from \(C_{i} = \dfrac{1}{2} \left(a_{i} - y_{i}\right)^{2}\) and \(a_{i} = w_{1} x_{i 1} + w_{2} x_{i 2} + ... + w_{m} x_{i m} + b\), so,
 \(\dfrac{\partial C_{i}}{\partial w_{j}} = \dfrac{\partial C_{i}}{\partial a_{i}} \dfrac{\partial a_{i}}{\partial w_{j}}\)
 \(= \left(a_{i} - y_{i}\right) x_{ij}\),
combining the \(w_{j}\) for \(j = 1, 2, ..., m\), \(\nabla_{w} C_{i} = \left(a_{i} - y_{i}\right) x_{i}\),
and combining the \(C_{i}\) for \(i = 1, 2, ..., n\), \(\nabla_{w} C = \left(a_{1} - y_{1}\right) x_{1} + \left(a_{2} - y_{2}\right) x_{2} + ... + \left(a_{n} - y_{n}\right) x_{n}\).
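📗 The following is a small NumPy sketch comparing the closed-form estimate \(w = \left(X^\top X\right)^{-1} X^\top y\) with gradient descent under the squared loss; the data set is made up, and a column of \(1\)s is appended so that the bias is estimated as one of the weights.

```python
import numpy as np

# made-up data: y is roughly 2 * x + 1
X = np.array([[0.], [1.], [2.], [3.]])
y = np.array([1.1, 2.9, 5.2, 6.8])

# closed form: w = (X^T X)^{-1} X^T y, with a column of 1s for the bias
X1 = np.hstack([X, np.ones((X.shape[0], 1))])
w_closed = np.linalg.solve(X1.T @ X1, X1.T @ y)

# gradient descent with squared loss: w = w - alpha * sum_i (a_i - y_i) x_i
w, alpha = np.zeros(X1.shape[1]), 0.01
for _ in range(10000):
    a = X1 @ w                      # a_i = w . x_i (bias handled by the 1s column)
    w = w - alpha * X1.T @ (a - y)  # gradient of the total squared loss

print(w_closed, w)  # the two estimates should be close
```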
TopHat Discussion
📗 [1 point] Change the regression coefficients to minimize the loss.





# Linear Probability Model

📗 Using linear regression to estimate probabilities is inappropriate, since the loss \(\dfrac{1}{2} \left(a_{i} - y_{i}\right)^{2}\) is large when \(y_{i} = 1\) and \(a_{i} > 1\) or when \(y_{i} = 0\) and \(a_{i} < 0\), so linear regression penalizes predictions that are "very correct" (for example, if \(y_{i} = 1\) and \(a_{i} = 2\), the loss is \(\dfrac{1}{2} \left(2 - 1\right)^{2} = \dfrac{1}{2}\) even though the prediction is confidently on the correct side): Wikipedia.
📗 Using linear regression and rounding \(y\) to the nearest integer is also inappropriate for multi-class classification, since the classes should not be ordered by their labels.
TopHat Discussion
📗 [1 point] Change the regression coefficients to minimize the loss.


 


📗 Notes and code adapted from the course taught by Professors Jerry Zhu, Yingyu Liang, and Charles Dyer.






Last Updated: July 03, 2024 at 12:23 PM