📗 Non-linear classifiers are classifiers with non-linear decision boundaries.
➩ Non-linear models are difficult to estimate directly in general.
➩ Two ways of creating non-linear classifiers are:
(1) Non-linear transformations of the features (for example, kernel support vector machine)
(2) Combining multiple copies of linear classifiers (for example, neural network, decision tree)
📗 sklearn.svm.SVC can be used to train kernel SVMs, possibly with an infinite number of new features, efficiently through dual optimization (more details in the Linear Programming lecture): Doc
➩ Available kernel functions include: linear (no new features), polynomial (degree \(d\) polynomial features), and rbf (Radial Basis Function, an infinite number of new features).
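The difference between these kernels can be sketched with a small experiment. This is an illustrative example, assuming scikit-learn is installed; the dataset (concentric circles from make_circles) and all parameters are chosen for demonstration only.

```python
# Sketch: comparing a linear kernel and an RBF kernel on data
# that is not linearly separable (two concentric circles).
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# A linear kernel adds no new features, so no line can separate the circles;
# the RBF kernel implicitly uses infinitely many new features.
linear_acc = SVC(kernel="linear").fit(X, y).score(X, y)
rbf_acc = SVC(kernel="rbf").fit(X, y).score(X, y)
print(linear_acc, rbf_acc)
```

On this dataset the RBF kernel fits the circular boundary while the linear kernel cannot do much better than chance.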
Kernel Trick Example
➩ Transform the points (using the kernel) and move the plane such that the plane separates the two classes.
📗 Neural networks (also called multilayer perceptron) can be viewed as multiple layers of logistic regressions (or perceptrons with other activation functions).
➩ The outputs of the previous layers are used as the inputs in the next layer.
➩ The layers in between the inputs \(x\) and output \(y\) are hidden layers and can be viewed as additional internal features generated by the neural network.
➩ sklearn.neural_network.MLPClassifier can be used to train fully connected neural networks without convolutional layers or transformer modules. The activation functions logistic, tanh, and relu can be used: Doc
➩ PyTorch is a popular package for training more general neural networks with special layers and modules, and with custom activation functions: Link.
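A minimal sketch of training a fully connected network with MLPClassifier, assuming scikit-learn is installed; the two-moons dataset, the (16, 16) architecture, and the iteration count are illustrative choices, not fixed requirements.

```python
# Sketch: a two-hidden-layer network (16 units each, ReLU activation)
# trained on a dataset with a non-linear decision boundary.
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=200, noise=0.1, random_state=0)

clf = MLPClassifier(hidden_layer_sizes=(16, 16), activation="relu",
                    max_iter=2000, random_state=0)
clf.fit(X, y)
print(clf.score(X, y))
```

Swapping activation="relu" for "tanh" or "logistic" changes the shape of the learned boundary, which is what the TopHat activity below explores.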
TopHat Activity
➩ Compare neural networks with different architectures (number of hidden layers, units) and different activation functions (ReLU, tanh, Sigmoid (logistic)) here: Link.
➩ Discuss how they behave differently on different datasets, for example, training speed, decision boundary, number of non-zero weights, etc.
📗 Logistic regression, neural network, and linear regression compute the best weights and biases by solving an optimization problem: \(\displaystyle\min_{w,b} C\left(w, b\right)\), where \(C\) is the loss (cost) function that measures the amount of error the model is making.
➩ The search strategy is to start with a random set of weights and biases and iteratively move to another set of weights and biases with a lower loss: Link.
📗 For a single-variable function \(C\left(w\right)\), if the derivative at \(w\), \(C'\left(w\right)\), is positive then decreasing \(w\) would decrease \(C\left(w\right)\), and if the derivative at \(w\) is negative then increasing \(w\) would decrease \(C\left(w\right)\).
➩ \(C'\left(w\right)\) positive, \(\left| C'\left(w\right) \right|\) small: decrease \(w\) by a little.
➩ \(C'\left(w\right)\) positive, \(\left| C'\left(w\right) \right|\) large: decrease \(w\) by a lot.
➩ \(C'\left(w\right)\) negative, \(\left| C'\left(w\right) \right|\) small: increase \(w\) by a little.
➩ \(C'\left(w\right)\) negative, \(\left| C'\left(w\right) \right|\) large: increase \(w\) by a lot.
➩ In general, the update \(w = w - \alpha C'\left(w\right)\) moves \(w\) in the direction that decreases \(C\left(w\right)\), and by an amount proportional to the magnitude of the derivative.
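The single-variable update rule above can be sketched directly. This is an illustrative example: the loss \(C(w) = (w - 3)^2\), the starting point, and the learning rate are all assumptions chosen so the minimum (at \(w = 3\)) is easy to check.

```python
# Sketch: single-variable gradient descent on C(w) = (w - 3)^2.
def C(w):
    return (w - 3) ** 2  # loss, minimized at w = 3

def C_prime(w):
    return 2 * (w - 3)   # derivative of C

w = 0.0       # starting point
alpha = 0.1   # constant learning rate
for _ in range(100):
    w = w - alpha * C_prime(w)  # the update rule from the notes
print(w)
```

Each step multiplies the distance to the minimum by \(1 - 2\alpha = 0.8\), so after 100 steps \(w\) is essentially at 3.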
Derivative Example
➩ Move the points to see the derivatives (slope of tangent line) of the function \(x^{2}\):
📗 If there is more than one feature, the vector of derivatives, one for each weight, is called the gradient vector, denoted by \(\nabla_{w} C\) = \(\begin{bmatrix} \dfrac{\partial C}{\partial w_{1}} \\ \dfrac{\partial C}{\partial w_{2}} \\ \vdots \\ \dfrac{\partial C}{\partial w_{m}} \end{bmatrix}\), pronounced as "the gradient of C with respect to w" or "D w C" (not "Delta w C").
➩ The gradient vector represents the rate and direction of the fastest increase of \(C\).
➩ \(w = w - \alpha \nabla_{w} C\) is called gradient descent; it updates \(w\) in the direction that locally decreases \(C\left(w\right)\) the fastest.
📗 The \(\alpha\) in \(w = w - \alpha \nabla_{w} C\) is called the learning rate and determines how large each gradient descent step will be.
➩ The learning rate can be constant, for example, \(\alpha = 1\), \(\alpha = 0.1\), or \(\alpha = 0.01\); or decreasing in iteration \(t\), for example, \(\alpha = \dfrac{1}{t}\), \(\alpha = \dfrac{0.1}{t}\), or \(\alpha = \dfrac{1}{\sqrt{t}}\); it can also be adaptive, based on the gradients of previous iterations or the second derivative (Newton's method).
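A minimal sketch of multivariate gradient descent with a decreasing learning rate \(\alpha = 1/t\), assuming NumPy; the quadratic loss and its target vector are illustrative assumptions (for this simple loss the iterates happen to converge very quickly).

```python
# Sketch: gradient descent on C(w) = sum((w - target)^2)
# with decreasing learning rate alpha = 1/t.
import numpy as np

target = np.array([1.0, -2.0, 0.5])  # minimizer of C
w = np.zeros(3)                      # starting weights

for t in range(1, 501):
    grad = 2 * (w - target)  # gradient vector: one partial derivative per weight
    alpha = 1.0 / t          # decreasing step size
    w = w - alpha * grad     # gradient descent update
print(w)
```

The same loop structure underlies the optimizers used for logistic regression and neural networks; only the loss and its gradient change.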
📗 Notes and code adapted from the course taught by Professors Gurmail Singh, Yiyin Shen, Tyler Caraza-Harter.
📗 If there is an issue with TopHat during the lectures, please submit your answers on paper (include your Wisc ID and answers) or this Google form Link at the end of the lecture.
📗 Anonymous feedback can be submitted to: Form. Non-anonymous feedback and questions can be posted on Piazza: Link