📗 Another simple classifier is a linear classifier: \(\hat{y}_{i} = 1\) if \(w_{1} x_{i 1} + w_{2} x_{i 2} + ... + w_{m} x_{i m} + b \geq 0\) and \(\hat{y}_{i} = 0\) otherwise. This classifier is called an LTU (Linear Threshold Unit) perceptron: Wikipedia.
📗 Given a training set, the weights \(w_{1}, w_{2}, ..., w_{m}\) and bias \(b\) can be estimated based on the data \(\left(x_{1}, y_{1}\right), \left(x_{2}, y_{2}\right), ..., \left(x_{n}, y_{n}\right)\). One algorithm is called the Perceptron Algorithm.
➩ Initialize random weights and bias.
➩ For each item \(x_{i}\), compute the prediction \(\hat{y}_{i}\).
➩ If the prediction is \(\hat{y}_{i} = 0\) and the actual label is \(y_{i} = 1\), increase the weights: \(w \leftarrow w + \alpha x_{i}\), \(b \leftarrow b + \alpha\), where \(\alpha\) is a constant called the learning rate.
➩ If the prediction is \(\hat{y}_{i} = 1\) and the actual label is \(y_{i} = 0\), decrease the weights: \(w \leftarrow w - \alpha x_{i}\), \(b \leftarrow b - \alpha\).
➩ Repeat the process until convergence (the weights are no longer changing); a minimal code sketch of these steps follows the list.
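📗 A minimal sketch of the Perceptron Algorithm above in Python with NumPy (the names `X`, `y`, `alpha`, and `max_epochs` are illustrative assumptions, not part of the notes):
```python
import numpy as np

def perceptron_train(X, y, alpha=1.0, max_epochs=100):
    """Train an LTU perceptron; X is an n-by-m array, y holds 0/1 labels."""
    n, m = X.shape
    w = np.zeros(m)  # weights (a random initialization also works)
    b = 0.0          # bias
    for _ in range(max_epochs):
        changed = False
        for i in range(n):
            y_hat = 1 if X[i] @ w + b >= 0 else 0
            if y_hat == 0 and y[i] == 1:    # predicted 0 but label is 1: increase weights
                w += alpha * X[i]
                b += alpha
                changed = True
            elif y_hat == 1 and y[i] == 0:  # predicted 1 but label is 0: decrease weights
                w -= alpha * X[i]
                b -= alpha
                changed = True
        if not changed:  # convergence: weights no longer changing
            break
    return w, b
```
If the training set is linearly separable, the algorithm converges and the loop stops early; otherwise it stops after `max_epochs` passes over the data.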
TopHat Discussion
📗 [3 points] Move the sliders below to change the green plane normal so that the largest number of the blue points are above the plane and the largest number of the red points are below the plane.
📗 The perceptron algorithm update can be summarized as \(w \leftarrow w - \alpha \left(\hat{y}_{i} - y_{i}\right) x_{i}\) (or for \(j = 1, 2, ..., m\), \(w_{j} \leftarrow w_{j} - \alpha \left(\hat{y}_{i} - y_{i}\right) x_{ij}\)) and \(b \leftarrow b - \alpha \left(\hat{y}_{i} - y_{i}\right)\), where \(\hat{y}_{i} = 1\) if \(w_{1} x_{i 1} + w_{2} x_{i 2} + ... + w_{m} x_{i m} + b \geq 0\) and \(\hat{y}_{i} = 0\) if \(w_{1} x_{i 1} + w_{2} x_{i 2} + ... + w_{m} x_{i m} + b < 0\): Wikipedia.
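📗 As a worked example of one update with made-up numbers (not from the notes): suppose \(w = \begin{bmatrix} 0 \\ 0 \end{bmatrix}\), \(b = 0\), \(x_{i} = \begin{bmatrix} 1 \\ 2 \end{bmatrix}\), \(y_{i} = 0\), and \(\alpha = 1\). The prediction is \(\hat{y}_{i} = 1\) because \(0 \cdot 1 + 0 \cdot 2 + 0 = 0 \geq 0\), so the update is \(w \leftarrow w - 1 \cdot \left(1 - 0\right) \begin{bmatrix} 1 \\ 2 \end{bmatrix} = \begin{bmatrix} -1 \\ -2 \end{bmatrix}\) and \(b \leftarrow 0 - 1 \cdot \left(1 - 0\right) = -1\).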
📗 The learning rate \(\alpha\) controls how fast the weights are updated.
➩ \(\alpha\) can be constant (usually 1).
➩ \(\alpha\) can be a function of the iteration (usually decreasing), for example, \(\alpha_{t} = \dfrac{1}{\sqrt{t}}\).
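📗 A tiny sketch of the decreasing learning rate schedule above, assuming iterations are numbered \(t = 1, 2, ...\):
```python
def learning_rate(t):
    """Decreasing learning rate schedule: alpha_t = 1 / sqrt(t)."""
    return 1.0 / (t ** 0.5)
```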
TopHat Discussion
📗 [3 points] Find the Perceptron weights by using the Perceptron algorithm: select a point on the diagram and click anywhere else to run one iteration of the Perceptron algorithm.
📗 Answer: 0,0.1,0
TopHat Quiz
(Past Exam Question)
📗 [4 points] Consider a Linear Threshold Unit (LTU) perceptron with initial weights \(w\) = and bias \(b\) = trained using the Perceptron Algorithm. Given a new input \(x\) = and \(y\) = . Let the learning rate be \(\alpha\) = , compute the updated weights, \(w', b'\) = :
📗 More generally, a non-linear activation function can be added and the classifier will still be a linear classifier: \(\hat{y}_{i} = g\left(w_{1} x_{i 1} + w_{2} x_{i 2} + ... + w_{m} x_{i m} + b\right)\) for some non-linear function \(g\) called the activation function: Wikipedia.
➩ LTU: the special case where \(g\left(z\right) = 1\) if \(z \geq 0\) and \(g\left(z\right) = 0\) otherwise.
➩ Linear regression (not commonly used for classification problems): \(g\left(z\right) = z\), usually truncated to a number between \(\left[0, 1\right]\) to represent probability that the predicted label is \(1\): Wikipedia.
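📗 A small sketch of the two activations listed above (the function names are illustrative, not standard):
```python
import numpy as np

def ltu(z):
    """LTU activation: g(z) = 1 if z >= 0 and 0 otherwise."""
    return np.where(z >= 0, 1, 0)

def linear_truncated(z):
    """Linear activation g(z) = z, truncated (clipped) to [0, 1]."""
    return np.clip(z, 0.0, 1.0)

def predict(x, w, b, g=ltu):
    """Linear classifier with activation g: y_hat = g(w . x + b)."""
    return g(x @ w + b)
```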
📗 The weights and biases need to be selected to minimize a loss (or cost) function \(C = C\left(\hat{y}_{1}, y_{1}\right) + C\left(\hat{y}_{2}, y_{2}\right) + ... + C\left(\hat{y}_{n}, y_{n}\right)\) or \(C = C\left(a_{1}, y_{1}\right) + C\left(a_{2}, y_{2}\right) + ... + C\left(a_{n}, y_{n}\right)\) (in case \(\hat{y}_{i}\) is a prediction of either \(0\) or \(1\) and \(a_{i}\) is a probability prediction in \(\left[0, 1\right]\)), the sum of loss from the prediction of each item: Wikipedia.
➩ Zero-one loss counts the number of mistakes: \(C\left(a_{i}, y_{i}\right) = 1\) if \(a_{i} \neq y_{i}\) and \(0\) otherwise.
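📗 A minimal sketch of the total cost as a sum of per-item losses, with the zero-one loss as the per-item loss (the function names are made up for illustration):
```python
def zero_one_loss(a_i, y_i):
    """Per-item zero-one loss: 1 if the prediction is wrong, 0 otherwise."""
    return 1 if a_i != y_i else 0

def total_cost(a, y, item_cost=zero_one_loss):
    """Total cost C: the sum of per-item losses over the training set."""
    return sum(item_cost(a_i, y_i) for a_i, y_i in zip(a, y))
```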
📗 [3 points] Move the sliders below to change the green plane normal so that the total loss from blue points below the plane and the red points above the plane is minimized.
📗 To minimize a loss function by choosing weights and biases, they can be initialized randomly and updated based on the derivative (gradient in the multi-variate case): Link, Wikipedia.
➩ If the derivative is negative and large: increase the weight by a large amount.
➩ If the derivative is negative and small: increase the weight by a small amount.
➩ If the derivative is positive and large: decrease the weight by a large amount.
➩ If the derivative is positive and small: decrease the weight by a small amount.
📗 This can be summarized by the gradient descent formula \(w_{j} \leftarrow w_{j} - \alpha \dfrac{\partial C}{\partial w_{j}}\) for \(j = 1, 2, ..., m\) and \(b \leftarrow b - \alpha \dfrac{\partial C}{\partial b}\).
➩ In multi-variate calculus notation, this is usually written as \(w \leftarrow w - \alpha \nabla_{w} C\) and \(b \leftarrow b - \alpha \dfrac{\partial C}{\partial b}\), where \(\nabla_{w} C = \begin{bmatrix} \dfrac{\partial C}{\partial w_{1}} \\ \dfrac{\partial C}{\partial w_{2}} \\ ... \\ \dfrac{\partial C}{\partial w_{m}} \end{bmatrix}\)
➩ \(\alpha\) is called the learning rate and controls how fast the weights are updated.
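📗 A toy example of gradient descent in one variable, minimizing \(C\left(w\right) = w^{2}\) with derivative \(2 w\) (the starting point and learning rate are made-up numbers):
```python
w = 3.0      # arbitrary initial weight
alpha = 0.1  # learning rate
for t in range(100):
    grad = 2 * w          # derivative of C(w) = w^2
    w = w - alpha * grad  # gradient descent step
print(w)  # approximately 0, the minimizer of w^2
```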
TopHat Discussion
📗 [1 point] Move the points to see the derivatives (slope of tangent line) of the function \(x^{2}\):
📗 The derivative \(\dfrac{\partial C\left(a_{i}, y_{i}\right)}{\partial w_{j}}\) for the cross entropy cost and the logistic activation function is \(\left(a_{i} - y_{i}\right) x_{ij}\).
📗 The gradient descent step can be written as \(w \leftarrow w - \alpha \left(\left(a_{1} - y_{1}\right) x_{1} + \left(a_{2} - y_{2}\right) x_{2} + ... + \left(a_{n} - y_{n}\right) x_{n}\right)\) and \(b \leftarrow b - \alpha \left(\left(a_{1} - y_{1}\right) + \left(a_{2} - y_{2}\right) + ... + \left(a_{n} - y_{n}\right)\right)\), which is similar to the Perceptron algorithm formula except that the derivatives from all the items need to be summed up.
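📗 A minimal sketch of one batch gradient descent step for the logistic activation with the cross entropy cost, using the gradient above (the array names are illustrative assumptions):
```python
import numpy as np

def logistic(z):
    """Logistic (sigmoid) activation: 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent_step(X, y, w, b, alpha):
    """One gradient descent step; X is n-by-m, y holds 0/1 labels."""
    a = logistic(X @ w + b)            # activations a_i for all items
    w = w - alpha * X.T @ (a - y)      # w <- w - alpha * sum_i (a_i - y_i) x_i
    b = b - alpha * np.sum(a - y)      # b <- b - alpha * sum_i (a_i - y_i)
    return w, b
```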
📗 [1 point] Change the regression coefficients to minimize the loss.
TopHat Quiz
📗 [3 points] Which one of the following is the gradient descent step for \(w\) if the activation function is and the cost function is ?
📗 Choices:
\(w = w - \alpha \displaystyle\sum_{i=1}^{n} \left(a_{i} - y_{i}\right)\)
None of the above
📗 Notes and code adapted from the course taught by Professors Jerry Zhu, Yingyu Liang, and Charles Dyer.
📗 Content from note blocks marked "optional" and content from Wikipedia and other demo links are helpful for understanding the materials, but will not be explicitly tested on the exams.