# Lecture Notes
📗 Least Squares Regression
➩ If the label \(y\) is continuous, it can still be predicted using \(\hat{f}\left(x'\right) = w_{1} x'_{1} + w_{2} x'_{2} + ... + w_{m} x'_{m} + b\).
➩ `scipy.linalg.lstsq(x, y)` can be used to find the weights \(w\) and the bias \(b\): Doc.
➩ It computes the least-squares solution to \(X w = y\), or the \(w\) such that \(\left\|y - X w\right\|^{2} = \displaystyle\sum_{i=1}^{n} \left(y_{i} - w_{1} x_{i1} - w_{2} x_{i2} - ... - w_{m} x_{im} - b\right)^{2}\) is minimized (the last entry of \(w\) is the bias \(b\)).
➩ `sklearn.linear_model.LinearRegression` performs the same linear regression.
| Item | Input (Features) | Output (Labels) |  |
| --- | --- | --- | --- |
| 1 | \(\left(x_{11}, x_{12}, ..., x_{1m}\right)\) | \(y_{1} \in \mathbb{R}\) | training data |
| 2 | \(\left(x_{21}, x_{22}, ..., x_{2m}\right)\) | \(y_{2} \in \mathbb{R}\) |  |
| 3 | \(\left(x_{31}, x_{32}, ..., x_{3m}\right)\) | \(y_{3} \in \mathbb{R}\) |  |
| ... | ... | ... | ... |
| n | \(\left(x_{n1}, x_{n2}, ..., x_{nm}\right)\) | \(y_{n} \in \mathbb{R}\) | used to figure out \(y \approx \hat{f}\left(x\right)\) |
| new | \(\left(x'_{1}, x'_{2}, ..., x'_{m}\right)\) | \(y' \in \mathbb{R}\) | guess \(y' = \hat{f}\left(x'\right)\) |
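➩ For concreteness, here is a minimal sketch (using synthetic data, not data from the course) of fitting the weights and bias both ways: `scipy.linalg.lstsq` on a design matrix with a column of ones appended, and `sklearn.linear_model.LinearRegression`, which fits the bias automatically.

```python
import numpy as np
from scipy.linalg import lstsq
from sklearn.linear_model import LinearRegression

# synthetic training data (assumed for illustration): n = 100 items, m = 2 features
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 2))
y = 3 * x[:, 0] - 2 * x[:, 1] + 5 + rng.normal(scale=0.1, size=100)

# design matrix: each row is the features of one item plus a 1 at the end
X = np.hstack([x, np.ones((100, 1))])

# least-squares solution to X w = y; the last entry of w is the bias b
w, residues, rank, singular_values = lstsq(X, y)
print(w)  # approximately [3, -2, 5]

# the same regression with sklearn; the bias is stored separately as intercept_
model = LinearRegression().fit(x, y)
print(model.coef_, model.intercept_)  # approximately [3, -2] and 5
```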
📗 Design Matrix
➩ \(X\) is a matrix with \(n\) rows and \(m + 1\) columns, called the design matrix, where each row of \(X\) is a list of features of a training item plus a \(1\) at the end, meaning row \(i\) of \(X\) is \(\left(x_{i1}, x_{i2}, x_{i3}, ..., x_{im}, 1\right)\).
➩ The transpose of \(X\), denoted by \(X^\top\), flips the matrix over its diagonal, which means each column of \(X^\top\) is a training item with a \(1\) at the bottom.
📗 Matrix Inversion
➩ \(X w = y\) can be solved using \(w = y / X\) (not proper notation) or \(w = X^{-1} y\) only if \(X\) is square and invertible.
➩ \(X\) has \(n\) rows and \(m + 1\) columns, so it is usually not square and thus not invertible.
➩ \(X^\top X\) has \(m + 1\) rows and \(m + 1\) columns and is invertible if \(X\) has linearly independent columns (the features are not linearly related).
➩ \(X^\top X w = X^\top y\) is used instead of \(X w = y\), which can be solved as \(w = \left(X^\top X\right)^{-1} \left(X^\top y\right)\).
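➩ A minimal sketch (with made-up numbers) of solving \(X^\top X w = X^\top y\) directly:

```python
import numpy as np

# made-up training data: n = 5 items, m = 2 features, with y exactly 2*x1 + x2 - 1
x = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = np.array([3.0, 4.0, 9.0, 10.0, 14.0])

# design matrix with a column of ones at the end for the bias
X = np.hstack([x, np.ones((5, 1))])

# X is 5 by 3, so it is not invertible, but X^T X is 3 by 3 and invertible here
A = X.T @ X
b = X.T @ y

# w = (X^T X)^{-1} (X^T y); expect approximately [2, 1, -1]
w = np.linalg.inv(A) @ b
print(w)
```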
📗 Matrix Inverses
➩ `scipy.linalg.inv(A)` can be used to compute the inverse of `A`: Doc.
➩ `scipy.linalg.solve(A, b)` can be used to solve for \(w\) in \(A w = b\) and is faster than computing the inverse: Doc.
➩ The reason is that computing the inverse is effectively solving the \(n\) systems \(A w = e_{1}\), \(A w = e_{2}\), ..., \(A w = e_{n}\), where \(e_{j}\) is the vector with a \(1\) at position \(j\) and \(0\) everywhere else, for example, \(e_{1} = \begin{bmatrix} 1 \\ 0 \\ 0 \\ ... \end{bmatrix}\), \(e_{2} = \begin{bmatrix} 0 \\ 1 \\ 0 \\ ... \end{bmatrix}\), \(e_{3} = \begin{bmatrix} 0 \\ 0 \\ 1 \\ ... \end{bmatrix}\) ...
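➩ A minimal sketch (with made-up numbers) comparing the two approaches; both give the same answer, but `solve` avoids forming the inverse:

```python
import numpy as np
from scipy.linalg import inv, solve

# a small invertible system (made-up numbers)
A = np.array([[4.0, 2.0, 1.0],
              [2.0, 5.0, 3.0],
              [1.0, 3.0, 6.0]])
b = np.array([1.0, 2.0, 3.0])

# slower: form the inverse (effectively solving one system per column of I), then multiply
w_inv = inv(A) @ b

# faster: solve A w = b directly
w_solve = solve(A, b)

print(np.allclose(w_inv, w_solve))  # True
```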
📗 Grade Regression Example
➩ Find the linear relationship between exam 1 and exam 2 grades.
➩ Code for simple linear regression: Notebook.
➩ Code for multiple regression: Notebook.
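➩ The actual code is in the linked notebooks; as a rough sketch, the simple (one-feature) version might look like the following, where the column names `exam1` and `exam2` and the grades are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.linalg import lstsq

# hypothetical grade data; the real data comes from the linked notebooks
df = pd.DataFrame({"exam1": [70, 80, 90, 60, 85],
                   "exam2": [72, 83, 91, 65, 88]})

# design matrix: the exam 1 grade plus a column of ones for the bias
X = np.column_stack([df["exam1"], np.ones(len(df))])
y = df["exam2"].to_numpy()

# fit exam2 ≈ w * exam1 + b
(w, b), *_ = lstsq(X, y)
print(w, b)
```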
📗 LU Decomposition
➩ A square matrix \(A\) can often be written as \(A = L U\), where \(L\) is a lower triangular matrix and \(U\) is an upper triangular matrix.
➩ For example, if \(A\) is 3 by 3, then \(\begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix} = \begin{bmatrix} l_{11} & 0 & 0 \\ l_{21} & l_{22} & 0 \\ l_{31} & l_{32} & l_{33} \end{bmatrix} \begin{bmatrix} u_{11} & u_{12} & u_{13} \\ 0 & u_{22} & u_{23} \\ 0 & 0 & u_{33} \end{bmatrix}\).
➩ Sometimes, a permutation matrix is required to reorder the rows of \(A\), so \(P A = L U\) is used, where \(P\) is a permutation matrix (reordering of the rows of the identity matrix \(I\)).
➩ `scipy.linalg.lu(A)` can be used to find the \(P, L, U\) matrices: Doc.
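➩ A minimal sketch (with a made-up matrix); note that `scipy.linalg.lu` returns \(P\) so that \(A = P L U\), which is the same factorization with \(P^\top\) playing the role of the \(P\) in \(P A = L U\) above.

```python
import numpy as np
from scipy.linalg import lu

# a made-up square matrix
A = np.array([[2.0, 1.0, 1.0],
              [4.0, 3.0, 3.0],
              [8.0, 7.0, 9.0]])

# P is a permutation matrix, L is lower triangular, U is upper triangular
P, L, U = lu(A)

# scipy's convention is A = P L U (equivalently P^T A = L U)
print(np.allclose(A, P @ L @ U))    # True
print(np.allclose(P.T @ A, L @ U))  # True
```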
📗 LU Decomposition Solve
➩ Solving \(A w = b\) and \(A w = c\) with two separate calls to `solve` computes the same LU decomposition of \(A\) twice.
➩ It is faster to compute the LU decomposition once and then solve using the LU matrices instead of \(A\).
➩ `scipy.linalg.lu_factor(A)` can be used to find the \(L, U\) matrices: Doc.
➩ `scipy.linalg.lu_solve((lu, p), b)` can be used to solve \(A w = b\), where `lu` is the LU decomposition and `p` is the permutation.
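➩ A minimal sketch (with made-up numbers) of factoring once and solving two systems:

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

# a made-up invertible matrix and two right-hand sides
A = np.array([[2.0, 1.0, 1.0],
              [4.0, 3.0, 3.0],
              [8.0, 7.0, 9.0]])
b = np.array([1.0, 2.0, 3.0])
c = np.array([3.0, 2.0, 1.0])

# factor A once: lu packs L and U into one array, p holds the pivot indices
lu, p = lu_factor(A)

# reuse the same factorization for both right-hand sides
w_b = lu_solve((lu, p), b)
w_c = lu_solve((lu, p), c)

print(np.allclose(A @ w_b, b), np.allclose(A @ w_c, c))  # True True
```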
📗 Comparison for Solving Multiple Systems
➩ To solve \(A w = b\), \(A w = c\) for square invertible \(A\):
| Method | Procedure | Speed comparison |
| --- | --- | --- |
| 1 | `inv(A) @ b` then `inv(A) @ c` | Slow |
| 2 | `solve(A, b)` then `solve(A, c)` | Fast |
| 3 | `lu, p = lu_factor(A)` then `lu_solve((lu, p), b)` then `lu_solve((lu, p), c)` | Faster |
➩ When \(A = X^\top X\) and \(b = X^\top y\), solving \(A w = b\) gives the solution to the linear regression problem. If the same features \(X\) are used to make predictions for several different label vectors \(y\), it is faster to factor \(X^\top X\) once and use `lu_solve` for each one, as in the sketch below.
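➩ A minimal sketch of this reuse (synthetic data, following method 3 from the table above):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

# synthetic design matrix (with a column of ones) and two different label vectors
rng = np.random.default_rng(0)
X = np.hstack([rng.normal(size=(100, 3)), np.ones((100, 1))])
y1 = rng.normal(size=100)
y2 = rng.normal(size=100)

# factor X^T X once
lu, p = lu_factor(X.T @ X)

# solve the normal equations for each label vector with the same factorization
w1 = lu_solve((lu, p), X.T @ y1)
w2 = lu_solve((lu, p), X.T @ y2)
print(w1, w2)
```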
📗 Numerical Instability
➩ Division by a small number close to 0 may lead to inaccurate answers.
➩ Inverting a matrix that is close to singular (close to not being invertible), or solving a system with such a matrix, can also lead to inaccurate solutions.
➩ How close a matrix is to being singular is usually measured by its condition number, not its determinant.
➩ `numpy.linalg.cond` can be used to find the condition number: Doc.
➩ A larger condition number means the solution can be more inaccurate.
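➩ A minimal sketch (made-up matrices) of checking the condition number:

```python
import numpy as np

# nearly singular: the second row is almost a multiple of the first
A = np.array([[1.0, 2.0],
              [2.0, 4.0001]])
print(np.linalg.cond(A))  # very large, so solving A w = b is numerically unstable

# a well-conditioned matrix for comparison
print(np.linalg.cond(np.eye(2)))  # 1.0
```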
📗 TopHat Invertibility Discussion
➩ Discuss what the solution should be and why Python computes it incorrectly.
📗 Multicollinearity
➩ In linear regression, a large condition number of the design matrix is related to multicollinearity.
➩ Multicollinearity occurs when multiple features are highly linearly correlated.
➩ One simple rule of thumb is that the regression has multicollinearity if the condition number is larger than 30.
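➩ A minimal sketch (synthetic features) showing how a nearly duplicated feature drives the condition number of the design matrix far above 30:

```python
import numpy as np

# synthetic features where the second feature is almost identical to the first
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = x1 + rng.normal(scale=1e-4, size=50)  # highly correlated with x1

# design matrix with a column of ones at the end
X = np.column_stack([x1, x2, np.ones(50)])

# a condition number far above 30 signals multicollinearity
print(np.linalg.cond(X))
```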
Notes and code adapted from the course taught by Yiyin Shen and Tyler Caraza-Harter.