📗 The lecture is in person, but you can join Zoom: 8:50-9:40 or 11:00-11:50. Zoom recordings can be viewed on Canvas -> Zoom -> Cloud Recordings. They will be moved to Kaltura over the weekends.
📗 The in-class (participation) quizzes should be submitted on TopHat (Code: 741565), but you can also submit your answers through the Form at the end of the lectures.
📗 The Python notebooks used during the lectures can also be found on: GitHub. They will be updated weekly.
➩ A classifier is linear if the decision boundary is linear (line in 2D, plane in 3D, hyperplane in higher dimensions).
➩ Points above the plane (in the direction of the normal of the plane) are predicted as 1, and points below the plane are predicted as 0.
➩ A set of linear coefficients, usually estimated from the training data, called weights, \(w_{1}, w_{2}, ..., w_{m}\), and a bias term \(b\), so that the classifier can be written in the form \(\hat{y} = 1\) if \(w_{1} x'_{1} + w_{2} x'_{2} + ... + w_{m} x'_{m} + b \geq 0\) and \(\hat{y} = 0\) if \(w_{1} x'_{1} + w_{2} x'_{2} + ... + w_{m} x'_{m} + b < 0\).
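➩ A minimal sketch (with made-up weights, bias, and feature vector, purely for illustration) of how such a classifier produces a prediction in numpy:

    import numpy as np

    # hypothetical weights, bias, and new feature vector (m = 3 features)
    w = np.array([0.5, -1.0, 2.0])
    b = -0.25
    x_new = np.array([1.0, 0.5, 0.2])

    # predict 1 if the weighted sum plus bias is non-negative, 0 otherwise
    y_hat = 1 if w @ x_new + b >= 0 else 0
    print(y_hat)  # 1, since 0.5 * 1.0 - 1.0 * 0.5 + 2.0 * 0.2 - 0.25 = 0.15 >= 0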
📗 Activation Function
➩ Sometimes, if a probabilistic prediction is needed, \(\hat{f}\left(x'\right) = g\left(w_{1} x'_{1} + w_{2} x'_{2} + ... + w_{m} x'_{m} + b\right)\) outputs a number between 0 and 1, and the classifier can be written in the form \(\hat{y} = 1\) if \(\hat{f}\left(x'\right) \geq 0.5\) and \(\hat{y} = 0\) if \(\hat{f}\left(x'\right) < 0.5\).
➩ The function \(g\) can be non-linear, but the resulting classifier is still linear: for a monotonic \(g\) such as the logistic function, \(g\left(z\right) \geq 0.5\) exactly when \(z\) is above a fixed threshold, so the decision boundary is still a hyperplane.
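➩ A short numpy check (using the logistic function defined in the examples below and some arbitrary values of \(z\)) that thresholding \(g\left(z\right)\) at 0.5 gives the same labels as thresholding \(z\) at 0:

    import numpy as np

    def g(z):
        # logistic (sigmoid) activation function
        return 1 / (1 + np.exp(-z))

    z = np.array([-2.0, -0.1, 0.0, 0.3, 5.0])  # hypothetical values of w @ x' + b
    print((g(z) >= 0.5).astype(int))  # [0 0 1 1 1]
    print((z >= 0).astype(int))       # [0 0 1 1 1], the same labels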
Examples of Linear Classifiers
➩ Linear threshold unit (LTU) perceptron: \(g\left(z\right) = 1\) if \(z > 0\) and \(g\left(z\right) = 0\) otherwise.
➩ Logistic regression: \(g\left(z\right) = \dfrac{1}{1 + e^{-z}}\), see Link.
➩ Support vector machine (SVM): \(g\left(z\right) = 1\) if \(z > 0\) and \(g\left(z\right) = 0\) otherwise, but with a different method to find the weights (that maximizes the separation between the two classes).
➩ Note: Nearest neighbors and decision trees have piecewise linear decision boundaries and are usually not considered linear classifiers.
➩ Note: Naive Bayes classifiers are not always linear under general distribution assumptions.
Decision Boundary Example
➩ Use various classifiers on randomly generated 2D datasets and compare the decision boundaries.
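➩ One way to set up such a comparison (a sketch assuming scikit-learn and a dataset generated with make_blobs; the lecture notebook may use different data and plotting code):

    import numpy as np
    from sklearn.datasets import make_blobs
    from sklearn.linear_model import LogisticRegression, Perceptron
    from sklearn.svm import LinearSVC

    # randomly generated 2D dataset with two classes
    X, y = make_blobs(n_samples=200, centers=2, n_features=2, random_state=0)

    for model in (Perceptron(), LogisticRegression(), LinearSVC()):
        model.fit(X, y)
        w, b = model.coef_[0], model.intercept_[0]
        # each decision boundary is the line w[0] * x1 + w[1] * x2 + b = 0,
        # but the estimated weights (and therefore the boundary) differ
        print(type(model).__name__, w, b, "training accuracy:", model.score(X, y))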
➩ Logistic regression is usually used for classification, not regression.
➩ The function \(g\left(z\right) = \dfrac{1}{1 + e^{-z}}\) is called the logistic activation function (or sigmoid function).
➩ How well the weights fit the training data is measured by a loss (or cost) function, which is the sum of the losses from the individual training items, \(C\left(w\right) = C_{1}\left(w\right) + C_{2}\left(w\right) + ... + C_{n}\left(w\right)\). For logistic regression, the cross-entropy loss \(C_{i}\left(w\right) = - y_{i} \log g\left(z_{i}\right) - \left(1 - y_{i}\right) \log \left(1 - g\left(z_{i}\right)\right)\), where \(z_{i} = w_{1} x_{i1} + w_{2} x_{i2} + ... + w_{m} x_{im} + b\), is usually used.
➩ Given the weights and bias \(w = \left(w_{1}, w_{2}, ..., w_{m}\right), b\), the probabilistic prediction that a new item \(x' = \left(x'_{1}, x'_{2}, ..., x'_{m}\right)\) belongs to class 1 can be computed as \(\dfrac{1}{1 + e^{- \left(w_{1} x'_{1} + w_{2} x'_{2} + ... + w_{m} x'_{m} + b\right)}}\), or 1 / (1 + exp(- (x' @ w + b))), where @ is the dot product (or matrix product in general).
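➩ A minimal sketch (with a made-up training set, weights, and bias, purely for illustration) of the cross-entropy loss and the probabilistic prediction written out in numpy:

    import numpy as np

    def g(z):
        # logistic (sigmoid) activation function
        return 1 / (1 + np.exp(-z))

    # hypothetical training set with n = 4 items and m = 2 features
    X = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 0.0], [3.0, 1.0]])
    y = np.array([0, 0, 1, 1])
    w = np.array([1.5, -1.0])  # hypothetical weights
    b = -2.0                   # hypothetical bias

    z = X @ w + b  # z_i = w_1 x_i1 + w_2 x_i2 + b for every training item
    C = -y * np.log(g(z)) - (1 - y) * np.log(1 - g(z))  # cross-entropy loss of each item
    print("total loss C(w):", C.sum())

    # probabilistic prediction that a new item x' belongs to class 1
    x_new = np.array([1.0, 2.0])
    print("P(y = 1):", 1 / (1 + np.exp(-(x_new @ w + b))))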
TopHat Discussion
Matrix Multiplication
➩ In numpy, v @ w computes the dot product between v and w, for example, numpy.array([a, b, c]) @ numpy.array([A, B, C]) means a * A + b * B + c * C.
➩ If v and w are matrices (2D arrays), then v @ w computes the matrix product, which is the dot product between the rows of v and the columns of w, for example, numpy.array([[a, b, c], [d, e, f], [g, h, i]]) @ numpy.array([[A, B, C], [D, E, F], [G, H, I]]) means numpy.array([[a * A + b * D + c * G, a * B + b * E + c * H, a * C + b * F + c * I], [d * A + e * D + f * G, d * B + e * E + f * H, d * C + e * F + f * I], [g * A + h * D + i * G, g * B + h * E + i * H, g * C + h * F + i * I]]).
➩ In matrix notation, \(\begin{bmatrix} a & b & c \\ d & e & f \\ g & h & i \end{bmatrix} \begin{bmatrix} A & B & C \\ D & E & F \\ G & H & I \end{bmatrix}\) = \(\begin{bmatrix} a A + b D + c G & a B + b E + c H & a C + b F + c I \\ d A + e D + f G & d B + e E + f H & d C + e F + f I \\ g A + h D + i G & g B + h E + i H & g C + h F + i I \end{bmatrix}\).
➩ A single matrix multiplication can be performed more efficiently (faster) than looping over the rows and columns and computing the dot products one at a time, as checked in the sketch below. More details about matrix operations are in the Matrix Algebra and Linear Programming lectures.
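➩ A small numpy check (with arbitrary values) that @ matches the explicit double loop of dot products:

    import numpy as np

    v = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]])
    w = np.array([[9.0, 8.0, 7.0], [6.0, 5.0, 4.0], [3.0, 2.0, 1.0]])

    # matrix product computed in one call (uses optimized routines internally)
    fast = v @ w

    # the same product as an explicit loop of row-column dot products
    # (much slower for large matrices)
    slow = np.zeros((3, 3))
    for i in range(3):
        for j in range(3):
            slow[i, j] = v[i, :] @ w[:, j]

    print(np.allclose(fast, slow))  # True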
Matrix Algebra Example
➩ Compare the probability predictions from sklearn's predict_proba with manually computing the activation function, as sketched below.
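➩ A sketch of that comparison (assuming a LogisticRegression model fit on a small randomly generated dataset; the lecture notebook may use different data):

    import numpy as np
    from sklearn.datasets import make_blobs
    from sklearn.linear_model import LogisticRegression

    X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=0)
    model = LogisticRegression().fit(X, y)

    w = model.coef_[0]       # estimated weights w_1, ..., w_m
    b = model.intercept_[0]  # estimated bias b

    p_sklearn = model.predict_proba(X)[:, 1]   # probability of class 1 from sklearn
    p_manual = 1 / (1 + np.exp(-(X @ w + b)))  # the same probability computed manually

    print(np.allclose(p_sklearn, p_manual))  # True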
➩ The weight \(w_{j}\) in front of feature \(j\) can be interpreted as the increase in the log-odds of the label \(y\) being 1, associated with the increase of 1 unit in \(x_{j}\), holding all other variables constant.
➩ This means, if the feature \(x'_{j}\) increases by 1, then the odds of \(y\) being 1 are multiplied by \(e^{w_{j}}\) (the odds ratio is \(e^{w_{j}}\)).
➩ The bias \(b\) is the log-odds of \(y\) being 1, when \(x'_{1} = x'_{2} = ... = x'_{m} = 0\).
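➩ A short numerical check of this interpretation (with hypothetical weights and bias): increasing feature 1 by one unit multiplies the predicted odds \(\dfrac{p}{1 - p}\) by \(e^{w_{1}}\):

    import numpy as np

    w = np.array([0.8, -0.5])  # hypothetical weights
    b = 0.2                    # hypothetical bias

    def prob(x):
        # probability of class 1 from the logistic activation
        return 1 / (1 + np.exp(-(x @ w + b)))

    x = np.array([1.0, 2.0])
    x_plus = np.array([2.0, 2.0])  # the same item with feature 1 increased by one unit

    odds = prob(x) / (1 - prob(x))
    odds_plus = prob(x_plus) / (1 - prob(x_plus))
    print(odds_plus / odds, np.exp(w[0]))  # both are e^{w_1}, about 2.2255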
➩ (Interactive demo) Move the sliders to change the green plane normal so that the largest number of the blue points are above the plane and the largest number of the red points are below the plane; the demo displays the current number of mistakes.
➩ (Interactive demo) Move the sliders to change the green plane normal so that the total loss from the blue points below the plane and the red points above the plane is minimized; the demo displays the current total cost.
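➩ A sketch of how the number of mistakes in the first demo could be counted for a given plane normal and bias (the points and the plane here are made up):

    import numpy as np

    # hypothetical 2D points: blue points labeled 1, red points labeled 0
    X = np.array([[1.0, 2.0], [2.0, 2.5], [3.0, 3.0], [0.0, 0.5], [1.0, 0.0], [2.0, 1.5]])
    y = np.array([1, 1, 1, 0, 0, 0])

    # a candidate plane (a line in 2D) with normal w and bias b, as set by the sliders
    w = np.array([0.0, 1.0])
    b = -1.25

    y_hat = (X @ w + b >= 0).astype(int)  # 1 if the point is above the plane, 0 if below
    print("number of mistakes:", np.sum(y_hat != y))  # 1 mistake: the red point at (2.0, 1.5)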