📗 If the label \(y\) is continuous, it can be predicted using \(\hat{f}\left(x'\right) = w_{1} x'_{1} + w_{2} x'_{2} + ... + w_{m} x'_{m} + b\).
➩ scipy.linalg.lstsq(x, y) can be used to find the weights \(w\) and the bias \(b\): Doc.
➩ It computes the least-squares solution to \(X w = y\), that is, the \(w\) such that \(\left\|y - X w\right\|^{2}\) = \(\displaystyle\sum_{i=1}^{n} \left(y_{i} - w_{1} x_{i1} - w_{2} x_{i2} - ... - w_{m} x_{im} - b\right)^{2}\) is minimized, where \(X\) is the feature matrix with a column of ones appended so that the last entry of \(w\) is the bias \(b\).
➩ sklearn.linear_model.LinearRegression performs the same linear regression.
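A minimal sketch of this workflow on a small made-up dataset (the weights 2, -1 and bias 3 are hypothetical), appending a column of ones so that lstsq also recovers the bias:

```python
import numpy as np
from scipy.linalg import lstsq

# Made-up data generated from hypothetical weights (2, -1) and bias 3
x = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.], [2., 1.]])
y = 2 * x[:, 0] - x[:, 1] + 3

# Append a column of ones so the last entry of w acts as the bias b
X = np.hstack([x, np.ones((x.shape[0], 1))])
w, residues, rank, sv = lstsq(X, y)
print(w)  # approximately [ 2. -1.  3.]
```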
Item | Input (Features) | Output (Labels) | Notes
--- | --- | --- | ---
1 | \(\left(x_{11}, x_{12}, ..., x_{1m}\right)\) | \(y_{1} \in \mathbb{R}\) | training data
2 | \(\left(x_{21}, x_{22}, ..., x_{2m}\right)\) | \(y_{2} \in \mathbb{R}\) | 
3 | \(\left(x_{31}, x_{32}, ..., x_{3m}\right)\) | \(y_{3} \in \mathbb{R}\) | 
... | ... | ... | ...
n | \(\left(x_{n1}, x_{n2}, ..., x_{nm}\right)\) | \(y_{n} \in \mathbb{R}\) | used to figure out \(y \approx \hat{f}\left(x\right)\)
📗 R squared, or coefficient of determination, \(R^{2}\) is the fraction of the variation of \(y\) that can be explained by the variation of \(x\).
➩ The formula for computing R squared is: \(R^{2} = \dfrac{\displaystyle\sum_{i=1}^{n} \left(\hat{y}_{i} - \overline{y}\right)^{2}}{\displaystyle\sum_{i=1}^{n} \left(y_{i} - \overline{y}\right)^{2}}\) or \(R^{2} = 1 - \dfrac{\displaystyle\sum_{i=1}^{n} \left(y_{i} - \hat{y}_{i}\right)^{2}}{\displaystyle\sum_{i=1}^{n} \left(y_{i} - \overline{y}\right)^{2}}\), where \(y_{i}\) is the true value of the output (label) of item \(i\), \(\hat{y}_{i}\) is the predicted value, and \(\overline{y}\) is the average value.
➩ Since R squared increases as the number of features increases (adding a feature never decreases it on the training set), it is not a good measure of fit by itself, and sometimes adjusted R squared is used: \(\overline{R}^{2} = 1 - \left(1 - R^{2}\right) \dfrac{n - 1}{n - m - 1}\).
➩ If model is a sklearn.linear_model.LinearRegression, then model.score(x, y) will compute the \(R^{2}\) on the training set \(x, y\).
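A quick check of the second formula against model.score, on made-up noisy data (all values hypothetical):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up noisy data (values are hypothetical)
x = np.array([[1.], [2.], [3.], [4.], [5.]])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

model = LinearRegression().fit(x, y)
y_hat = model.predict(x)

# R^2 = 1 - SS_residual / SS_total
r2_manual = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - np.mean(y)) ** 2)
print(np.isclose(r2_manual, model.score(x, y)))  # True
```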
📗 The 95 percent confidence interval is the interval such that with 95 percent probability, the true weights (coefficients) are contained in this interval.
➩ If the 95 percent confidence interval of a weight does not contain 0, then the corresponding feature is called statistically significant, or having a statistically significant relationship with the output (label).
➩ Alternatively, the p-value of a specific feature is the probability of observing data at least as extreme as the dataset, under the assumption (the null hypothesis) that the weight (coefficient) of that feature is 0.
➩ Given a threshold significance level of 5 percent, if the p-value is less than 5 percent, then the corresponding feature is called statistically significant, or having a statistically significant relationship with the output (label).
➩ The statsmodels library can be used for statistical inference tasks: Link.
📗 If a feature is continuous, then the weight (coefficient) represents the expected change in the output if the feature changes by 1 unit, holding every other feature constant.
➩ Similar to the kernel trick and polynomial features for classification, polynomial features can be added using sklearn.preprocessing.PolynomialFeatures(degree = n). Note that the interaction terms are added as well, for example, if there are two columns \(x_{1}, x_{2}\), then the columns that will be added when degree = 2 are \(1, x_{1}, x_{2}, x_{1}^{2}, x_{2}^{2}, x_{1} x_{2}\).
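For example (note that sklearn orders the degree-2 columns as \(1, x_{1}, x_{2}, x_{1}^{2}, x_{1} x_{2}, x_{2}^{2}\)):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x = np.array([[2., 3.]])  # one row with x1 = 2, x2 = 3
poly = PolynomialFeatures(degree=2)
# columns: 1, x1, x2, x1^2, x1*x2, x2^2
print(poly.fit_transform(x))  # [[1. 2. 3. 4. 6. 9.]]
```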
📗 Discrete features are converted using one hot encoding.
➩ One of the categories should be treated as the base category, since if all categories are included, the design matrix will not be invertible: sklearn.preprocessing.OneHotEncoder(drop = "first") could be used to add the one hot encoding columns excluding the first category.
➩ The weight (coefficient) of \(1_{x_{j} = k}\) represents the expected change in the output if the feature \(j\) is in the category \(k\) instead of the base class.
📗 \(X\) is a matrix with \(n\) rows and \(m + 1\) columns, called the design matrix, where each row of \(X\) is a list of features of a training item plus a \(1\) at the end, meaning row \(i\) of \(X\) is \(\left(x_{i1}, x_{i2}, x_{i3}, ..., x_{im}, 1\right)\).
➩ The transpose of \(X\), denoted by \(X^\top\), flips the matrix over its diagonal, which means each column of \(X^\top\) is a training item with a \(1\) at the bottom.
➩ In numpy, v @ w computes the dot product between v and w, for example, numpy.array([a, b, c]) @ numpy.array([A, B, C]) means a * A + b * B + c * C.
➩ If v and w are matrices (2D arrays), then v @ w computes the matrix product, which is the dot product between the rows of v and the columns of w, for example, numpy.array([[a, b, c], [d, e, f], [g, h, i]]) @ numpy.array([[A, B, C], [D, E, F], [G, H, I]]) means numpy.array([[a * A + b * D + c * G, a * B + b * E + c * H, a * C + b * F + c * I], [d * A + e * D + f * G, d * B + e * E + f * H, d * C + e * F + f * I], [g * A + h * D + i * G, g * B + h * E + i * H, g * C + h * F + i * I]]).
➩ In matrix notation, \(\begin{bmatrix} a & b & c \\ d & e & f \\ g & h & i \end{bmatrix} \begin{bmatrix} A & B & C \\ D & E & F \\ G & H & I \end{bmatrix}\) = \(\begin{bmatrix} a A + b D + c G & a B + b E + c H & a C + b F + c I \\ d A + e D + f G & d B + e E + f H & d C + e F + f I \\ g A + h D + i G & g B + h E + i H & g C + h F + i I \end{bmatrix}\).
➩ Matrix multiplication routines perform this computation more efficiently (faster) than manually looping over the rows and columns and computing each dot product.
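A small sanity check of the rule above, comparing v @ w with the dot products computed in an explicit loop (the values are arbitrary):

```python
import numpy as np

v = np.array([[1., 2.], [3., 4.]])
w = np.array([[5., 6.], [7., 8.]])

# Entry (i, j) of v @ w is the dot product of row i of v with column j of w
product = v @ w
manual = np.array([[v[i] @ w[:, j] for j in range(2)] for i in range(2)])
print(product)                          # [[19. 22.] [43. 50.]]
print(np.array_equal(product, manual))  # True
```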
📗 \(X w = y\) can be solved using \(w = y / X\) (not proper notation) or \(w = X^{-1} y\) only if \(X\) is square and invertible.
➩ \(X\) has \(n\) rows and \(m + 1\) columns, so it is usually not square and thus has no inverse.
➩ \(X^\top X\) has \(m + 1\) rows and \(m + 1\) columns and is invertible if \(X\) has linearly independent columns (the features are not linearly related).
➩ \(X^\top X w = X^\top y\) is used instead of \(X w = y\), which can be solved as \(w = \left(X^\top X\right)^{-1} \left(X^\top y\right)\).
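The normal equations above can be sketched directly (same made-up data as before; the weights 2, -1 and bias 3 are hypothetical):

```python
import numpy as np
from scipy.linalg import solve

# Made-up data generated from hypothetical weights (2, -1) and bias 3
x = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.], [2., 1.]])
y = 2 * x[:, 0] - x[:, 1] + 3

# Design matrix: features plus a column of ones for the bias
X = np.hstack([x, np.ones((x.shape[0], 1))])
# Solve the normal equations X^T X w = X^T y
w = solve(X.T @ X, X.T @ y)
print(w)  # approximately [ 2. -1.  3.]
```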
📗 scipy.linalg.inv(A) can be used to compute the inverse of A: Doc.
➩ scipy.linalg.solve(A, b) can be used to solve for \(w\) in \(A w = b\) and is faster than computing the inverse: Doc.
➩ The reason is that computing the inverse is effectively solving \(A w = e_{1}\), \(A w = e_{2}\), ... \(A w = e_{n}\) (one system per column of the inverse), where \(e_{j}\) is the vector with \(1\) at position \(j\) and \(0\) everywhere else, for example, \(e_{1} = \begin{bmatrix} 1 \\ 0 \\ 0 \\ ... \end{bmatrix}\), \(e_{2} = \begin{bmatrix} 0 \\ 1 \\ 0 \\ ... \end{bmatrix}\), \(e_{3} = \begin{bmatrix} 0 \\ 0 \\ 1 \\ ... \end{bmatrix}\) ...
➩ Sometimes, a permutation matrix is required to reorder the rows of \(A\), so \(P A = L U\) is used, where \(P\) is a permutation matrix (reordering of the rows of the identity matrix \(I\)).
➩ scipy.linalg.lu(A) can be used to find \(P, L, U\) matrices: Doc.
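A minimal example on a made-up matrix whose first pivot is 0, so pivoting must reorder the rows; note scipy returns \(P\) such that \(A = P L U\) (its \(P\) is the transpose of the \(P\) in \(P A = L U\)):

```python
import numpy as np
from scipy.linalg import lu

# Made-up matrix whose first pivot is 0, forcing a row reordering
A = np.array([[0., 2.], [1., 3.]])
P, L, U = lu(A)
# scipy's convention: A = P @ L @ U
print(np.allclose(P @ L @ U, A))  # True
```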
📗 Solving \(A w = b\) and \(A w = c\) involves computing the same LU decomposition for \(A\) twice.
➩ It is faster to compute the LU decomposition once and then solve using the LU matrices instead of \(A\).
➩ scipy.linalg.lu_factor(A) can be used to find the \(L, U\) matrices: Doc.
➩ scipy.linalg.lu_solve((lu, p), b) can be used to solve \(A w = b\) where lu is the LU decomposition and p is the permutation.
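A sketch of reusing one factorization for two right-hand sides (the matrix and vectors are made up):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

A = np.array([[4., 2.], [2., 3.]])  # made-up square invertible matrix
b = np.array([8., 7.])
c = np.array([6., 5.])

lu, piv = lu_factor(A)        # factor A (with pivoting) once
wb = lu_solve((lu, piv), b)   # reuse the factorization per right-hand side
wc = lu_solve((lu, piv), c)
print(np.allclose(A @ wb, b), np.allclose(A @ wc, c))  # True True
```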
📗 Comparison for Solving Multiple Systems
➩ To solve \(A w = b\), \(A w = c\) for square invertible \(A\):
Method | Procedure | Speed comparison
--- | --- | ---
1 | inv(A) @ b then inv(A) @ c | Slow
2 | solve(A, b) then solve(A, c) | Fast
3 | lu, p = lu_factor(A) then lu_solve((lu, p), b) then lu_solve((lu, p), c) | Faster
➩ When \(A = X^\top X\) and \(b = X^\top y\), solving \(A w = b\) gives the solution to the linear regression problem. If the same features are used to make predictions for several different output (label) variables, it is faster to factor \(X^\top X\) once with lu_factor and reuse lu_solve for each label.
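A sketch of this reuse for regression with two made-up label vectors sharing the same features (all coefficients hypothetical):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

# Same features, two different made-up label vectors
x = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.], [2., 1.]])
X = np.hstack([x, np.ones((x.shape[0], 1))])  # design matrix with ones column
y1 = 2 * x[:, 0] - x[:, 1] + 3
y2 = -x[:, 0] + 4 * x[:, 1] + 1

lu, piv = lu_factor(X.T @ X)        # factor X^T X once
w1 = lu_solve((lu, piv), X.T @ y1)  # weights for the first label
w2 = lu_solve((lu, piv), X.T @ y2)  # weights for the second label
print(w1)  # approximately [ 2. -1.  3.]
print(w2)  # approximately [-1.  4.  1.]
```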
📗 Notes and code adapted from the course taught by Professors Gurmail Singh, Yiyin Shen, Tyler Caraza-Harter.
📗 If there is an issue with TopHat during the lectures, please submit your answers on paper (include your Wisc ID and answers) or this Google form Link at the end of the lecture.
📗 Anonymous feedback can be submitted to: Form. Non-anonymous feedback and questions can be posted on Piazza: Link