
# Lecture

📗 The lecture is in person, but you can join Zoom: 8:50-9:40 or 11:00-11:50. Zoom recordings can be viewed on Canvas -> Zoom -> Cloud Recordings. They will be moved to Kaltura over the weekends.
📗 The in-class (participation) quizzes should be submitted on TopHat (Code: 741565), but you can also submit your answers through the Form at the end of the lectures.
📗 The Python notebooks used during the lectures can also be found on: GitHub. They will be updated weekly.


# Lecture Notes

📗 Loss Minimization
➭ Logistic regression, neural network, and linear regression compute the best weights and biases by solving an optimization problem: \(\displaystyle\min_{w,b} C\left(w, b\right)\), where \(C\) is the loss (cost) function that measures the amount of error the model is making.
➭ The search strategy is to start with a random set of weights and biases and iteratively move to another set of weights and biases with a lower loss: Link.

Logistic Regression Example ➭ Write out the loss minimization problem for logistic regression in 1D and compare the solutions with the coefficients from sklearn.
➭ Code for logistic regression: Notebook.
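➭ A minimal sketch of this comparison (with made-up synthetic 1D data, separate from the notebook above): minimize the cross-entropy loss directly with scipy and compare the result to sklearn's coefficients, using a large C in LogisticRegression so that its default regularization is negligible.

```python
# Sketch: minimize the 1D logistic regression loss directly and compare
# with sklearn's fitted coefficients (synthetic data, made up for illustration).
import numpy as np
from scipy.optimize import minimize
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
x = rng.normal(size=100)                                  # one feature
y = (x + 0.5 * rng.normal(size=100) > 0).astype(float)    # binary labels

def cross_entropy(params):
    w, b = params
    p = 1 / (1 + np.exp(-(w * x + b)))                    # predicted probabilities
    eps = 1e-12                                           # avoid log(0)
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

result = minimize(cross_entropy, x0=[0.0, 0.0])           # default method is BFGS
print("minimize:", result.x)

# sklearn adds L2 regularization by default; a large C makes it negligible
model = LogisticRegression(C=1e10).fit(x.reshape(-1, 1), y)
print("sklearn: ", model.coef_[0][0], model.intercept_[0])
```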

📗 Nelder Mead
➭ Nelder Mead method (downhill simplex method) is a derivative-free method that only requires the knowledge of the objective function, and not its derivatives.
➭ Start with a random simplex: a line in 1D, a triangle in 2D, a tetrahedron (triangular pyramid) in 3D, ..., a polytope with \(n + 1\)  vertices in \(n\)D.
➭ In every iteration, replace the worst point (the one with the largest loss among the \(n + 1\) vertices) with its reflection through the centroid of the remaining \(n\) points.
➭ If the new point is better than the original point, expand the simplex; if the new point is worse than the original point, shrink the simplex.
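➭ A small sketch of running the downhill simplex method through scipy (the bowl-shaped objective below is made up for illustration); only function values are used, no derivatives.

```python
# Sketch: Nelder-Mead (downhill simplex) on a made-up 2D objective.
import numpy as np
from scipy.optimize import minimize

def loss(w):
    return (w[0] - 3) ** 2 + (w[1] + 1) ** 2   # minimum at (3, -1)

result = minimize(loss, x0=np.array([0.0, 0.0]), method="Nelder-Mead")
print(result.x)     # approximately [3, -1]
print(result.nit)   # number of simplex iterations performed
```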



📗 Derivatives
➭ For a single-variable function \(C\left(w\right)\), if the derivative at \(w\), \(C'\left(w\right)\), is positive then decreasing \(w\) would decrease \(C\left(w\right)\), and if the derivative at \(w\) is negative then increasing \(w\) would decrease \(C\left(w\right)\).

| \(C'\left(w\right)\) sign | \(\lvert C'\left(w\right) \rvert\) magnitude | How to decrease \(C\left(w\right)\) |
| --- | --- | --- |
| Positive | Small | Decrease \(w\) by a little |
| Positive | Large | Decrease \(w\) by a lot |
| Negative | Small | Increase \(w\) by a little |
| Negative | Large | Increase \(w\) by a lot |


➭ In general, the update \(w = w - \alpha C'\left(w\right)\) moves \(w\) in the direction that decreases \(C\left(w\right)\), taking a larger step when the magnitude of the derivative is larger.
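➭ A minimal sketch of this 1D update rule, using the made-up loss \(C\left(w\right) = \left(w - 2\right)^{2}\) with derivative \(2 \left(w - 2\right)\):

```python
# Sketch: repeatedly apply w = w - alpha * C'(w) on C(w) = (w - 2)^2.
def C(w):
    return (w - 2) ** 2

def C_prime(w):
    return 2 * (w - 2)

w = 10.0        # starting point
alpha = 0.1     # learning rate
for t in range(50):
    w = w - alpha * C_prime(w)   # step against the sign of the derivative

print(w, C(w))  # w is close to 2, the minimizer
```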

Derivative Example

📗 Gradient
➭ If there is more than one weight, the vector of partial derivatives, one for each weight, is called the gradient vector, denoted by \(\nabla_{w} C = \begin{bmatrix} \dfrac{\partial C}{\partial w_{1}} \\ \dfrac{\partial C}{\partial w_{2}} \\ \vdots \\ \dfrac{\partial C}{\partial w_{m}} \end{bmatrix}\), pronounced "the gradient of C with respect to w" or "del w C" (not "Delta w C", since \(\nabla\) is not \(\Delta\)).
➭ The gradient vector represents the rate and direction of the fastest increase of \(C\); the negative gradient points in the direction of the fastest decrease.
➭ The update \(w = w - \alpha \nabla_{w} C\) is called gradient descent; it moves \(w\) in the direction that decreases \(C\left(w\right)\) the fastest.
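➭ A sketch of gradient descent with an analytic gradient on the made-up two-variable loss \(C\left(w\right) = \left(w_{1} - 1\right)^{2} + 2 \left(w_{2} + 3\right)^{2}\):

```python
# Sketch: gradient descent w = w - alpha * grad(C) on a made-up 2D loss.
import numpy as np

def gradient(w):
    return np.array([2 * (w[0] - 1), 4 * (w[1] + 3)])

w = np.zeros(2)   # initial weights
alpha = 0.1       # learning rate
for t in range(200):
    w = w - alpha * gradient(w)

print(w)          # approximately [1, -3]
```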

📗 Learning Rates
➭ The \(\alpha\) in \(w = w - \alpha \nabla_{w} C\) is called the learning rate and determines how large each gradient descent step will be.
➭ The learning rate can be constant, for example, \(\alpha = 1\), \(\alpha = 0.1\), or \(\alpha = 0.01\); decreasing, for example, \(\alpha = \dfrac{1}{t}\), \(\alpha = \dfrac{0.1}{t}\), or \(\alpha = \dfrac{1}{\sqrt{t}}\) in iteration \(t\); or adaptive, based on the gradients of previous iterations or on the second derivative (Newton's method).
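➭ A small sketch comparing a constant learning rate with decreasing schedules on the made-up loss \(C\left(w\right) = w^{2}\) (gradient \(2 w\)); the particular values are illustrative only.

```python
# Sketch: the same gradient descent loop with different learning rate schedules.
def run(schedule, steps=100):
    w = 5.0
    for t in range(1, steps + 1):
        w = w - schedule(t) * 2 * w   # gradient of w^2 is 2w
    return w

print(run(lambda t: 0.01))              # constant alpha = 0.01
print(run(lambda t: 0.1 / t))           # decreasing alpha = 0.1 / t
print(run(lambda t: 0.1 / t ** 0.5))    # decreasing alpha = 0.1 / sqrt(t)
```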



📗 Newton's Method
➭ The Hessian matrix is the matrix of second derivatives. For a single-variable function \(C\left(w\right)\), it is just \(C''\left(w\right)\); for multiple variables, it is \(\nabla^{2}_{w} C = \begin{bmatrix} \dfrac{\partial^2 C}{\partial w_{1}^2} & \dfrac{\partial^2 C}{\partial w_{1} \partial w_{2}} & \cdots & \dfrac{\partial^2 C}{\partial w_{1} \partial w_{m}} \\ \dfrac{\partial^2 C}{\partial w_{2} \partial w_{1}} & \dfrac{\partial^2 C}{\partial w_{2}^2} & \cdots & \dfrac{\partial^2 C}{\partial w_{2} \partial w_{m}} \\ \vdots & \vdots & \ddots & \vdots \\ \dfrac{\partial^2 C}{\partial w_{m} \partial w_{1}} & \dfrac{\partial^2 C}{\partial w_{m} \partial w_{2}} & \cdots & \dfrac{\partial^2 C}{\partial w_{m}^2} \end{bmatrix}\).
➭ It can be used to compute the step size adaptively, but it is usually too costly to compute and invert the Hessian matrix.
➭ Newton's method uses the iterative update formula \(w = w - \alpha \left[\nabla^{2}_{w} C\right]^{-1} \nabla_{w} C\).
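➭ A sketch of the Newton update with \(\alpha = 1\) on a made-up quadratic loss; since the Hessian of a quadratic is constant, a single step lands exactly on the minimizer.

```python
# Sketch: Newton's method w = w - [Hessian]^(-1) * gradient (alpha = 1)
# on a made-up quadratic loss with constant Hessian [[2, 1], [1, 4]].
import numpy as np

def gradient(w):
    return np.array([2 * (w[0] - 1) + w[1], 4 * (w[1] + 3) + w[0]])

def hessian(w):
    return np.array([[2.0, 1.0], [1.0, 4.0]])

w = np.array([10.0, 10.0])
for t in range(3):
    # solve Hessian * step = gradient instead of explicitly inverting
    w = w - np.linalg.solve(hessian(w), gradient(w))

print(w, gradient(w))   # the gradient is (numerically) zero at the minimizer
```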

📗 Broyden–Fletcher–Goldfarb–Shanno (BFGS) Method
➭ To avoid computing and inverting the Hessian matrix, BFGS uses the gradients from previous steps to approximate the inverse of the Hessian and performs a line search to find the step size.
➭ This method is a quasi-Newton method and does not require specifying the Hessian matrix.
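➭ A sketch of calling BFGS through scipy with the gradient supplied as jac (the objective below is made up); the inverse Hessian approximation and line search happen internally.

```python
# Sketch: BFGS only needs the function and its gradient.
import numpy as np
from scipy.optimize import minimize

def loss(w):
    return (w[0] - 3) ** 2 + (w[1] + 1) ** 2

def gradient(w):
    return np.array([2 * (w[0] - 3), 2 * (w[1] + 1)])

result = minimize(loss, x0=np.zeros(2), method="BFGS", jac=gradient)
print(result.x)   # approximately [3, -1]
```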

📗 Scipy Optimization Function
➭ scipy.optimize.minimize(f, x0) minimizes the function f starting from the initial guess x0. The available methods include derivative-free methods such as Nelder-Mead; gradient methods such as BFGS (the gradient can be supplied through the jac parameter, named after the Jacobian matrix, whose rows are gradient vectors); and methods that use the Hessian such as Newton-CG (CG stands for conjugate gradient, a way to approximately compute \(\left[\nabla^{2}_{w} C\right]^{-1} \nabla_{w} C\) without explicitly inverting the Hessian; the Hessian can be supplied through the hess parameter).
➭ maxiter specifies the maximum number of iterations to perform; a message is displayed if the optimization has not converged by the time maxiter is reached.
➭ tol specifies when the optimization is considered converged: usually either the change in the function (loss) value between two consecutive iterations is less than tol, or the change in the arguments (weights) is less than tol.
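➭ A quick sketch (not the course notebook) of the three kinds of methods using scipy's built-in Rosenbrock helpers rosen, rosen_der, and rosen_hess:

```python
# Sketch: derivative-free, gradient, and Hessian methods on the Rosenbrock function.
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der, rosen_hess

x0 = np.zeros(2)

r1 = minimize(rosen, x0, method="Nelder-Mead",
              options={"maxiter": 1000})          # derivative-free
r2 = minimize(rosen, x0, method="BFGS",
              jac=rosen_der, tol=1e-8)            # uses the gradient (jac)
r3 = minimize(rosen, x0, method="Newton-CG",
              jac=rosen_der, hess=rosen_hess)     # uses the Hessian (hess)

for r in (r1, r2, r3):
    print(r.x.round(3), r.nit, r.message)         # minimum is at [1, 1]
```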

📗 Local Minima
➭ Gradient methods may get stuck at a local minimum, where the gradient at the point is 0, but the point is not the (global) minimum of the function.
➭ Multiple random initial points may be used to find multiple local minima, which can then be compared to find the best one (see the sketch after this list).
➭ If a function is convex, then the only local minimum is the global minimum.
➭ The reason cross-entropy loss is used to measure the error in logistic regression is that the resulting \(C\left(w\right)\) is convex and differentiable, and thus easy to minimize using gradient methods.
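➭ A sketch of random restarts on a made-up non-convex function with many local minima; the run with the lowest loss is kept.

```python
# Sketch: multiple random initial points, keep the best local minimum found.
import numpy as np
from scipy.optimize import minimize

def loss(w):
    return np.sin(3 * w[0]) + 0.1 * w[0] ** 2   # many local minima

rng = np.random.default_rng(0)
results = [minimize(loss, x0=rng.uniform(-10, 10, size=1)) for _ in range(20)]
best = min(results, key=lambda r: r.fun)
print(best.x, best.fun)   # the lowest local minimum among the 20 runs
```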

📗 Numerical Gradient
➭ The gradient can be provided as functions of \(w\), but if it is not easy to compute analytically, numerical approximations can be used too.
➭ Newton's finite difference approximation of the derivative is \(C'\left(w\right) \approx \dfrac{C\left(w + h\right) - C\left(w\right)}{h}\).
➭ Another two-point symmetric difference approximation is \(C'\left(w\right) \approx \dfrac{C\left(w + h\right) - C\left(w - h\right)}{2 h}\).
➭ The five-point stencil approximation is \(C'\left(w\right) \approx \dfrac{-C\left(w + 2 h\right) + 8 C\left(w + h\right) - 8 C\left(w - h\right) + C\left(w - 2 h\right)}{12 h}\).
➭ The Hessian matrix can be similarly approximated.
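➭ A quick sketch comparing the three approximations above against the exact derivative of the made-up function \(C\left(w\right) = w^{3}\) at \(w = 2\) (exact value \(3 \cdot 2^{2} = 12\)):

```python
# Sketch: forward difference, symmetric difference, and five-point stencil.
def C(w):
    return w ** 3

w, h = 2.0, 1e-3
forward = (C(w + h) - C(w)) / h
symmetric = (C(w + h) - C(w - h)) / (2 * h)
stencil = (-C(w + 2 * h) + 8 * C(w + h) - 8 * C(w - h) + C(w - 2 * h)) / (12 * h)

print(forward, symmetric, stencil)   # each is closer to 12 than the previous one
```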

Comparison of Methods ➭ Compare the iteration of the methods on the Rosenbrock function: Link.
➭ Code for minimizing Rosenbrock: Notebook.

Additional Example ➭ The gradient vector at the current iteration of gradient descent is \(\nabla_{w} C \left(\begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \end{bmatrix}\right) = \begin{bmatrix} -1 \\ -2 \\ -3 \\ -4 \end{bmatrix}\) and the learning rate is \(\alpha = 1\); what will the weights be in the next iteration?
➭ Use the formula: \(w = w - \alpha \nabla_{w} C\) which is \(\begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \end{bmatrix} - 1 \begin{bmatrix} -1 \\ -2 \\ -3 \\ -4 \end{bmatrix}\) = \(\begin{bmatrix} 2 \\ 4 \\ 6 \\ 8 \end{bmatrix}\).


➭ Interactive demo: move the point to see the derivative (slope of the tangent line) of the function \(x^{2}\) and the corresponding gradient descent step with learning rate 0.5.





📗 Notes and code adapted from the course taught by Yiyin Shen Link and Tyler Caraza-Harter Link






Last Updated: April 29, 2024 at 1:10 AM