📗 If the label \(y\) is continuous, it can be predicted using \(\hat{f}\left(x'\right) = w_{1} x'_{1} + w_{2} x'_{2} + ... + w_{m} x'_{m} + b\).
➩ scipy.linalg.lstsq(x, y) can be used to find the weights \(w\) and the bias \(b\): Doc.
➩ It computes the least-squares solution to \(X w = y\), that is, the \(w\) such that \(\left\|y - X w\right\|^{2}\) = \(\displaystyle\sum_{i=1}^{n} \left(y_{i} - w_{1} x_{i1} - w_{2} x_{i2} - ... - w_{m} x_{im} - b\right)^{2}\) is minimized, where \(X\) is the feature matrix with a column of ones appended so that the last entry of \(w\) is the bias \(b\).
➩ sklearn.linear_model.LinearRegression performs the same linear regression.
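A minimal sketch of this workflow on a small made-up dataset (the weights 2, -1 and bias 3 are hypothetical), appending a column of ones so that lstsq also recovers the bias:

```python
import numpy as np
from scipy.linalg import lstsq

# Made-up data generated from hypothetical weights (2, -1) and bias 3
x = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.], [2., 1.]])
y = 2 * x[:, 0] - x[:, 1] + 3

# Append a column of ones so the last entry of w acts as the bias b
X = np.hstack([x, np.ones((x.shape[0], 1))])
w, residues, rank, sv = lstsq(X, y)
print(w)  # approximately [ 2. -1.  3.]
```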
Item | Input (Features) | Output (Labels) | Notes
--- | --- | --- | ---
1 | \(\left(x_{11}, x_{12}, ..., x_{1m}\right)\) | \(y_{1} \in \mathbb{R}\) | training data
2 | \(\left(x_{21}, x_{22}, ..., x_{2m}\right)\) | \(y_{2} \in \mathbb{R}\) | 
3 | \(\left(x_{31}, x_{32}, ..., x_{3m}\right)\) | \(y_{3} \in \mathbb{R}\) | 
... | ... | ... | ...
n | \(\left(x_{n1}, x_{n2}, ..., x_{nm}\right)\) | \(y_{n} \in \mathbb{R}\) | used to figure out \(y \approx \hat{f}\left(x\right)\)
📗 R squared, or coefficient of determination, \(R^{2}\) is the fraction of the variation of \(y\) that can be explained by the variation of \(x\).
➩ The formula for computing R squared is: \(R^{2} = \dfrac{\displaystyle\sum_{i=1}^{n} \left(\hat{y}_{i} - \overline{y}\right)^{2}}{\displaystyle\sum_{i=1}^{n} \left(y_{i} - \overline{y}\right)^{2}}\) or \(R^{2} = 1 - \dfrac{\displaystyle\sum_{i=1}^{n} \left(y_{i} - \hat{y}_{i}\right)^{2}}{\displaystyle\sum_{i=1}^{n} \left(y_{i} - \overline{y}\right)^{2}}\), where \(y_{i}\) is the true value of the output (label) of item \(i\), \(\hat{y}_{i}\) is the predicted value, and \(\overline{y}\) is the average value.
➩ Since R squared increases as the number of features increases (adding a feature never decreases it on the training set), it is not a good measure of fit by itself, and sometimes adjusted R squared is used: \(\overline{R}^{2} = 1 - \left(1 - R^{2}\right) \dfrac{n - 1}{n - m - 1}\).
➩ If model is a sklearn.linear_model.LinearRegression, then model.score(x, y) will compute the \(R^{2}\) on the training set \(x, y\).
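A quick check of the second formula against model.score, on made-up noisy data (all values hypothetical):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up noisy data (values are hypothetical)
x = np.array([[1.], [2.], [3.], [4.], [5.]])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

model = LinearRegression().fit(x, y)
y_hat = model.predict(x)

# R^2 = 1 - SS_residual / SS_total
r2_manual = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - np.mean(y)) ** 2)
print(np.isclose(r2_manual, model.score(x, y)))  # True
```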
📗 The 95 percent confidence interval is the interval such that with 95 percent probability, the true weights (coefficients) are contained in this interval.
➩ If the 95 percent confidence interval of a weight does not contain 0, then the corresponding feature is called statistically significant, or having a statistically significant relationship with the output (label).
➩ Alternatively, the p-value of a specific feature is the probability of observing data at least as extreme as the dataset, under the assumption (the null hypothesis) that the weight (coefficient) of that feature is 0.
➩ Given a threshold significance level of 5 percent, if the p-value is less than 5 percent, then the corresponding feature is called statistically significant, or having a statistically significant relationship with the output (label).
➩ The statsmodels library can be used for statistical inference tasks: Link.
📗 If a feature is continuous, then the weight (coefficient) represents the expected change in the output if the feature changes by 1 unit, holding every other feature constant.
➩ Similar to the kernel trick and polynomial features for classification, polynomial features can be added using sklearn.preprocessing.PolynomialFeatures(degree = n). Note that the interaction terms are added as well, for example, if there are two columns \(x_{1}, x_{2}\), then the columns that will be added when degree = 2 are \(1, x_{1}, x_{2}, x_{1}^{2}, x_{2}^{2}, x_{1} x_{2}\).
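For example (note that sklearn orders the degree-2 columns as \(1, x_{1}, x_{2}, x_{1}^{2}, x_{1} x_{2}, x_{2}^{2}\)):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x = np.array([[2., 3.]])  # one row with x1 = 2, x2 = 3
poly = PolynomialFeatures(degree=2)
# columns: 1, x1, x2, x1^2, x1*x2, x2^2
print(poly.fit_transform(x))  # [[1. 2. 3. 4. 6. 9.]]
```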
📗 Discrete features are converted using one hot encoding.
➩ One of the categories should be treated as the base category, since if all categories are included, the design matrix will not be invertible: sklearn.preprocessing.OneHotEncoder(drop = "first") could be used to add the one hot encoding columns excluding the first category.
➩ The weight (coefficient) of \(1_{x_{j} = k}\) represents the expected change in the output if the feature \(j\) is in the category \(k\) instead of the base class.
📗 \(X\) is a matrix with \(n\) rows and \(m + 1\) columns, called the design matrix, where each row of \(X\) is a list of features of a training item plus a \(1\) at the end, meaning row \(i\) of \(X\) is \(\left(x_{i1}, x_{i2}, x_{i3}, ..., x_{im}, 1\right)\).
➩ The transpose of \(X\), denoted by \(X^\top\), flips the matrix over its diagonal, which means each column of \(X^\top\) is a training item with a \(1\) at the bottom.
➩ In numpy, v @ w computes the dot product between v and w, for example, numpy.array([a, b, c]) @ numpy.array([A, B, C]) means a * A + b * B + c * C.
➩ If v and w are matrices (2D arrays), then v @ w computes the matrix product, which is the dot product between the rows of v and the columns of w, for example, numpy.array([[a, b, c], [d, e, f], [g, h, i]]) @ numpy.array([[A, B, C], [D, E, F], [G, H, I]]) means numpy.array([[a * A + b * D + c * G, a * B + b * E + c * H, a * C + b * F + c * I], [d * A + e * D + f * G, d * B + e * E + f * H, d * C + e * F + f * I], [g * A + h * D + i * G, g * B + h * E + i * H, g * C + h * F + i * I]]).
➩ In matrix notation, \(\begin{bmatrix} a & b & c \\ d & e & f \\ g & h & i \end{bmatrix} \begin{bmatrix} A & B & C \\ D & E & F \\ G & H & I \end{bmatrix}\) = \(\begin{bmatrix} a A + b D + c G & a B + b E + c H & a C + b F + c I \\ d A + e D + f G & d B + e E + f H & d C + e F + f I \\ g A + h D + i G & g B + h E + i H & g C + h F + i I \end{bmatrix}\).
➩ Matrix multiplication routines perform this computation more efficiently (faster) than manually looping over the rows and columns and computing each dot product.
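A small sanity check of the rule above, comparing v @ w with the dot products computed in an explicit loop (the values are arbitrary):

```python
import numpy as np

v = np.array([[1., 2.], [3., 4.]])
w = np.array([[5., 6.], [7., 8.]])

# Entry (i, j) of v @ w is the dot product of row i of v with column j of w
product = v @ w
manual = np.array([[v[i] @ w[:, j] for j in range(2)] for i in range(2)])
print(product)                          # [[19. 22.] [43. 50.]]
print(np.array_equal(product, manual))  # True
```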
📗 \(X w = y\) can be solved using \(w = y / X\) (not proper notation) or \(w = X^{-1} y\) only if \(X\) is square and invertible.
➩ \(X\) has \(n\) rows and \(m + 1\) columns, so it is usually not square and thus has no inverse.
➩ \(X^\top X\) has \(m + 1\) rows and \(m + 1\) columns and is invertible if \(X\) has linearly independent columns (the features are not linearly related).
➩ \(X^\top X w = X^\top y\) is used instead of \(X w = y\), which can be solved as \(w = \left(X^\top X\right)^{-1} \left(X^\top y\right)\).
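The normal equations above can be sketched directly (same made-up data as before; the weights 2, -1 and bias 3 are hypothetical):

```python
import numpy as np
from scipy.linalg import solve

# Made-up data generated from hypothetical weights (2, -1) and bias 3
x = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.], [2., 1.]])
y = 2 * x[:, 0] - x[:, 1] + 3

# Design matrix: features plus a column of ones for the bias
X = np.hstack([x, np.ones((x.shape[0], 1))])
# Solve the normal equations X^T X w = X^T y
w = solve(X.T @ X, X.T @ y)
print(w)  # approximately [ 2. -1.  3.]
```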
📗 scipy.linalg.inv(A) can be used to compute the inverse of A: Doc.
➩ scipy.linalg.solve(A, b) can be used to solve for \(w\) in \(A w = b\) and is faster than computing the inverse: Doc.
➩ The reason is that computing the inverse is effectively solving \(A w = e_{1}\), \(A w = e_{2}\), ... \(A w = e_{n}\) (one system per column of the inverse), where \(e_{j}\) is the vector with \(1\) at position \(j\) and \(0\) everywhere else, for example, \(e_{1} = \begin{bmatrix} 1 \\ 0 \\ 0 \\ ... \end{bmatrix}\), \(e_{2} = \begin{bmatrix} 0 \\ 1 \\ 0 \\ ... \end{bmatrix}\), \(e_{3} = \begin{bmatrix} 0 \\ 0 \\ 1 \\ ... \end{bmatrix}\) ...
➩ Sometimes, a permutation matrix is required to reorder the rows of \(A\), so \(P A = L U\) is used, where \(P\) is a permutation matrix (reordering of the rows of the identity matrix \(I\)).
➩ scipy.linalg.lu(A) can be used to find \(P, L, U\) matrices: Doc.
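A minimal example on a made-up matrix whose first pivot is 0, so pivoting must reorder the rows; note scipy returns \(P\) such that \(A = P L U\) (its \(P\) is the transpose of the \(P\) in \(P A = L U\)):

```python
import numpy as np
from scipy.linalg import lu

# Made-up matrix whose first pivot is 0, forcing a row reordering
A = np.array([[0., 2.], [1., 3.]])
P, L, U = lu(A)
# scipy's convention: A = P @ L @ U
print(np.allclose(P @ L @ U, A))  # True
```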
📗 Solving \(A w = b\) and \(A w = c\) involves computing the same LU decomposition for \(A\) twice.
➩ It is faster to compute the LU decomposition once and then solve using the LU matrices instead of \(A\).
➩ scipy.linalg.lu_factor(A) can be used to find the \(L, U\) matrices: Doc.
➩ scipy.linalg.lu_solve((lu, p), b) can be used to solve \(A w = b\) where lu is the LU decomposition and p is the permutation.
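A sketch of reusing one factorization for two right-hand sides (the matrix and vectors are made up):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

A = np.array([[4., 2.], [2., 3.]])  # made-up square invertible matrix
b = np.array([8., 7.])
c = np.array([6., 5.])

lu, piv = lu_factor(A)        # factor A (with pivoting) once
wb = lu_solve((lu, piv), b)   # reuse the factorization per right-hand side
wc = lu_solve((lu, piv), c)
print(np.allclose(A @ wb, b), np.allclose(A @ wc, c))  # True True
```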
📗 Comparison for Solving Multiple Systems
➩ To solve \(A w = b\), \(A w = c\) for square invertible \(A\):
Method | Procedure | Speed comparison
--- | --- | ---
1 | inv(A) @ b then inv(A) @ c | Slow
2 | solve(A, b) then solve(A, c) | Fast
3 | lu, p = lu_factor(A) then lu_solve((lu, p), b) then lu_solve((lu, p), c) | Faster
➩ When \(A = X^\top X\) and \(b = X^\top y\), solving \(A w = b\) gives the solution to the linear regression problem. If the same features are used to make predictions for several different output (label) variables, it is faster to factor \(X^\top X\) once with lu_factor and reuse lu_solve for each label.
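A sketch of this reuse for regression with two made-up label vectors sharing the same features (all coefficients hypothetical):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

# Same features, two different made-up label vectors
x = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.], [2., 1.]])
X = np.hstack([x, np.ones((x.shape[0], 1))])  # design matrix with ones column
y1 = 2 * x[:, 0] - x[:, 1] + 3
y2 = -x[:, 0] + 4 * x[:, 1] + 1

lu, piv = lu_factor(X.T @ X)        # factor X^T X once
w1 = lu_solve((lu, piv), X.T @ y1)  # weights for the first label
w2 = lu_solve((lu, piv), X.T @ y2)  # weights for the second label
print(w1)  # approximately [ 2. -1.  3.]
print(w2)  # approximately [-1.  4.  1.]
```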
📗 Notes and code adapted from the course taught by Professors Gurmail Singh, Yiyin Shen, Tyler Caraza-Harter.
📗 If there is an issue with TopHat during the lectures, please submit your answers on paper (include your Wisc ID and answers) or this Google form Link at the end of the lecture.
📗 Anonymous feedback can be submitted to: Form. Non-anonymous feedback and questions can be posted on Piazza: Link