📗 Non-linear classifiers are classifiers with non-linear decision boundaries.
➩ Non-linear models are difficult to estimate directly in general.
➩ Two ways of creating non-linear classifiers are:
(1) Non-linear transformations of the features (for example, kernel support vector machine)
(2) Combining multiple copies of linear classifiers (for example, neural network, decision tree)
📗 sklearn.svm.SVC can be used to train kernel SVMs, possibly with an infinite number of new features, efficiently through dual optimization (more details in the Linear Programming lecture): Doc
➩ Available kernel functions include: linear (no new features), polynomial (degree \(d\) polynomial features), and rbf (Radial Basis Function, an infinite number of new features).
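The difference between these kernels can be sketched with a small experiment. This is an illustrative example, assuming scikit-learn is installed; the dataset (concentric circles from make_circles) and all parameters are chosen for demonstration only.

```python
# Sketch: comparing a linear kernel and an RBF kernel on data
# that is not linearly separable (two concentric circles).
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# A linear kernel adds no new features, so no line can separate the circles;
# the RBF kernel implicitly uses infinitely many new features.
linear_acc = SVC(kernel="linear").fit(X, y).score(X, y)
rbf_acc = SVC(kernel="rbf").fit(X, y).score(X, y)
print(linear_acc, rbf_acc)
```

On this dataset the RBF kernel fits the circular boundary while the linear kernel cannot do much better than chance.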
Kernel Trick Example
➩ Transform the points (using the kernel) and move the plane such that the plane separates the two classes.
📗 Neural networks (also called multilayer perceptron) can be viewed as multiple layers of logistic regressions (or perceptrons with other activation functions).
➩ The outputs of the previous layers are used as the inputs in the next layer.
➩ The layers in between the inputs \(x\) and output \(y\) are hidden layers and can be viewed as additional internal features generated by the neural network.
➩ sklearn.neural_network.MLPClassifier can be used to train fully connected neural networks without convolutional layers or transformer modules. The activation functions logistic, tanh, and relu can be used: Doc
➩ PyTorch is a popular package for training more general neural networks with special layers and modules, and with custom activation functions: Link.
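A minimal sketch of training a fully connected network with MLPClassifier, assuming scikit-learn is installed; the two-moons dataset, the (16, 16) architecture, and the iteration count are illustrative choices, not fixed requirements.

```python
# Sketch: a two-hidden-layer network (16 units each, ReLU activation)
# trained on a dataset with a non-linear decision boundary.
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=200, noise=0.1, random_state=0)

clf = MLPClassifier(hidden_layer_sizes=(16, 16), activation="relu",
                    max_iter=2000, random_state=0)
clf.fit(X, y)
print(clf.score(X, y))
```

Swapping activation="relu" for "tanh" or "logistic" changes the shape of the learned boundary, which is what the TopHat activity below explores.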
TopHat Activity
➩ Compare neural networks with different architectures (number of hidden layers, units) and different activation functions (ReLU, tanh, Sigmoid (logistic)) here: Link.
➩ Discuss how they behave differently on different datasets, for example, training speed, decision boundary, number of non-zero weights, etc.
📗 Logistic regression, neural network, and linear regression compute the best weights and biases by solving an optimization problem: \(\displaystyle\min_{w,b} C\left(w, b\right)\), where \(C\) is the loss (cost) function that measures the amount of error the model is making.
➩ The search strategy is to start with a random set of weights and biases and iteratively move to another set of weights and biases with a lower loss: Link.
📗 For a single-variable function \(C\left(w\right)\), if the derivative at \(w\), \(C'\left(w\right)\), is positive then decreasing \(w\) would decrease \(C\left(w\right)\), and if the derivative at \(w\) is negative then increasing \(w\) would decrease \(C\left(w\right)\).
➩ \(C'\left(w\right)\) positive, \(\left| C'\left(w\right) \right|\) small: decrease \(w\) by a little.
➩ \(C'\left(w\right)\) positive, \(\left| C'\left(w\right) \right|\) large: decrease \(w\) by a lot.
➩ \(C'\left(w\right)\) negative, \(\left| C'\left(w\right) \right|\) small: increase \(w\) by a little.
➩ \(C'\left(w\right)\) negative, \(\left| C'\left(w\right) \right|\) large: increase \(w\) by a lot.
➩ In general, the update \(w = w - \alpha C'\left(w\right)\) moves \(w\) in the direction that decreases \(C\left(w\right)\), and by an amount proportional to the magnitude of the derivative.
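The single-variable update rule above can be sketched directly. This is an illustrative example: the loss \(C(w) = (w - 3)^2\), the starting point, and the learning rate are all assumptions chosen so the minimum (at \(w = 3\)) is easy to check.

```python
# Sketch: single-variable gradient descent on C(w) = (w - 3)^2.
def C(w):
    return (w - 3) ** 2  # loss, minimized at w = 3

def C_prime(w):
    return 2 * (w - 3)   # derivative of C

w = 0.0       # starting point
alpha = 0.1   # constant learning rate
for _ in range(100):
    w = w - alpha * C_prime(w)  # the update rule from the notes
print(w)
```

Each step multiplies the distance to the minimum by \(1 - 2\alpha = 0.8\), so after 100 steps \(w\) is essentially at 3.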
Derivative Example
➩ Move the points to see the derivatives (slope of tangent line) of the function \(x^{2}\):
📗 If there is more than one feature, the vector of derivatives, one for each weight, is called the gradient vector, denoted by \(\nabla_{w} C\) = \(\begin{bmatrix} \dfrac{\partial C}{\partial w_{1}} \\ \dfrac{\partial C}{\partial w_{2}} \\ \vdots \\ \dfrac{\partial C}{\partial w_{m}} \end{bmatrix}\), pronounced as "the gradient of C with respect to w" or "D w C" (not "Delta w C").
➩ The gradient vector represents the rate and direction of the fastest increase of \(C\).
➩ \(w = w - \alpha \nabla_{w} C\) is called gradient descent; it updates \(w\) in the direction that locally decreases \(C\left(w\right)\) the fastest.
📗 The \(\alpha\) in \(w = w - \alpha \nabla_{w} C\) is called the learning rate and determines how large each gradient descent step will be.
➩ The learning rate can be constant, for example, \(\alpha = 1\), \(\alpha = 0.1\), or \(\alpha = 0.01\); or decreasing in iteration \(t\), for example, \(\alpha = \dfrac{1}{t}\), \(\alpha = \dfrac{0.1}{t}\), or \(\alpha = \dfrac{1}{\sqrt{t}}\); it can also be adaptive, based on the gradients of previous iterations or the second derivative (Newton's method).
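A minimal sketch of multivariate gradient descent with a decreasing learning rate \(\alpha = 1/t\), assuming NumPy; the quadratic loss and its target vector are illustrative assumptions (for this simple loss the iterates happen to converge very quickly).

```python
# Sketch: gradient descent on C(w) = sum((w - target)^2)
# with decreasing learning rate alpha = 1/t.
import numpy as np

target = np.array([1.0, -2.0, 0.5])  # minimizer of C
w = np.zeros(3)                      # starting weights

for t in range(1, 501):
    grad = 2 * (w - target)  # gradient vector: one partial derivative per weight
    alpha = 1.0 / t          # decreasing step size
    w = w - alpha * grad     # gradient descent update
print(w)
```

The same loop structure underlies the optimizers used for logistic regression and neural networks; only the loss and its gradient change.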
📗 Notes and code adapted from the course taught by Professors Gurmail Singh, Yiyin Shen, Tyler Caraza-Harter.
📗 If there is an issue with TopHat during the lectures, please submit your answers on paper (include your Wisc ID and answers) or this Google form Link at the end of the lecture.
📗 Anonymous feedback can be submitted to: Form. Non-anonymous feedback and questions can be posted on Piazza: Link