
# Lecture

📗 The lecture is in person, but you can join Zoom: 8:50-9:40 or 11:00-11:50. Zoom recordings can be viewed on Canvas -> Zoom -> Cloud Recordings. They will be moved to Kaltura over the weekends.
📗 The in-class (participation) quizzes should be submitted on TopHat (Code: 741565), but you can also submit your answers through the Form at the end of the lectures.
📗 The Python notebooks used during the lectures can also be found on: GitHub. They will be updated weekly.


# Lecture Notes

📗 Loss Function and R Squared
➭ R squared, or coefficient of determination, \(R^{2}\) is the fraction of the variation of \(y\) that can be explained by the variation of \(x\).
➭ The formula for computing R squared is: \(R^{2} = \dfrac{\displaystyle\sum_{i=1}^{n} \left(\hat{y}_{i} - \overline{y}\right)^{2}}{\displaystyle\sum_{i=1}^{n} \left(y_{i} - \overline{y}\right)^{2}}\) or \(R^{2} = 1 - \dfrac{\displaystyle\sum_{i=1}^{n} \left(y_{i} - \hat{y}_{i}\right)^{2}}{\displaystyle\sum_{i=1}^{n} \left(y_{i} - \overline{y}\right)^{2}}\), where \(y_{i}\) is the true value of the output (label) of item \(i\), \(\hat{y}_{i}\) is the predicted value, and \(\overline{y}\) is the average value.
➭ Since R squared increases as the number of features increases, it is not a good measure of fit on its own, and the adjusted R squared is sometimes used instead: \(\overline{R}^{2} = 1 - \left(1 - R^{2}\right) \dfrac{n - 1}{n - m - 1}\), where \(n\) is the number of items and \(m\) is the number of features.
➭ If model is a sklearn.linear_model.LinearRegression, then model.score(x, y) will compute the \(R^{2}\) on the training set \(x, y\).
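➭ A minimal sketch (on synthetic data, with placeholder variable names) of how model.score relates to the formulas above, including the adjusted R squared:
```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 2))                   # n = 50 items, m = 2 features
y = 3 * x[:, 0] - 2 * x[:, 1] + rng.normal(size=50)

model = LinearRegression().fit(x, y)
r2 = model.score(x, y)                         # R^2 on the training set

# Same value from R^2 = 1 - sum((y - yhat)^2) / sum((y - ybar)^2)
yhat = model.predict(x)
r2_manual = 1 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)

# Adjusted R^2 penalizes the number of features m
n, m = x.shape
adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - m - 1)
print(r2, r2_manual, adjusted_r2)
```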

Grade Regression Example, Again ➭ Find the best predictor of exam 2 grade.
➭ Code for multiple regression model selection: Notebook.
➭ In general, when removing one feature at a time based on the R squared score, drop the feature whose removal decreases the score the least; the important predictors lead to a large decrease in the score when they are dropped (see the sketch after this example).
➭ In this example, "Exam1" is the most important feature according to the scores, and "Project" and "Lab" are the least important features and could be dropped without affecting the effectiveness of the regression.
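➭ A sketch of one such selection step on synthetic data (the column names below are placeholders standing in for the grade data, not the actual dataset):
```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def score_without(df, target, dropped):
    """Training R^2 of a linear regression fit without one feature."""
    features = [c for c in df.columns if c not in (target, dropped)]
    model = LinearRegression().fit(df[features], df[target])
    return model.score(df[features], df[target])

# Synthetic stand-in for the grade data.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 4)), columns=["Exam1", "Quiz", "Project", "Lab"])
df["Exam2"] = 0.8 * df["Exam1"] + 0.2 * df["Quiz"] + rng.normal(scale=0.2, size=100)

# Score after dropping each feature in turn; the feature whose removal keeps the
# score highest (smallest decrease) is the least important and is dropped first.
scores = {c: score_without(df, "Exam2", c) for c in df.columns if c != "Exam2"}
least_important = max(scores, key=scores.get)
print(scores, least_important)
```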



📗 Statistical Inference
➭ The 95 percent confidence interval is the interval such that with 95 percent probability, the true weights (coefficients) are contained in this interval.
➭ If the 95 percent confidence interval of a weight does not contain 0, then the corresponding feature is called statistically significant, or having a statistically significant relationship with the output (label).
➭ Alternatively, the p-value of a specific feature is the probability of observing a dataset at least as extreme as the one observed if the true weight (coefficient) of that feature were 0.
➭ Given a threshold significance level of 5 percent, if the p-value is less than 5 percent, then the corresponding feature is called statistically significant, or having a statistically significant relationship with the output (label).
➭ The statsmodels library can be used for statistical inference tasks: Link.
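➭ A minimal sketch of computing confidence intervals and p-values with statsmodels (synthetic data; only the first feature actually affects the output):
```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 2))
y = 2 * x[:, 0] + rng.normal(size=100)

X = sm.add_constant(x)                # add the intercept column
results = sm.OLS(y, X).fit()

print(results.conf_int(alpha=0.05))   # 95 percent confidence intervals for each weight
print(results.pvalues)                # small for the first feature, large for the second
print(results.summary())              # full regression table
```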



📗 Continuous Features
➭ If a feature is continuous, then the weight (coefficient) represents the expected change in the output if the feature changes by 1 unit, holding every other feature constant.
➭ Similar to the kernel trick and polynomial features for classification, polynomial features can be added using sklearn.preprocessing.PolynomialFeatures(degree = n). Note that the interaction terms are added as well; for example, if there are two columns \(x_{1}, x_{2}\), then the output columns when degree = 2 are \(1, x_{1}, x_{2}, x_{1}^{2}, x_{1} x_{2}, x_{2}^{2}\).
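➭ A small sketch of the degree-2 expansion on a single item (get_feature_names_out assumes a recent version of sklearn):
```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x = np.array([[2.0, 3.0]])            # one item with features x1 = 2, x2 = 3
poly = PolynomialFeatures(degree=2)
print(poly.fit_transform(x))          # [[1. 2. 3. 4. 6. 9.]] -> 1, x1, x2, x1^2, x1*x2, x2^2
print(poly.get_feature_names_out(["x1", "x2"]))
```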

📗 Discrete Features
➭ Discrete features are converted using one hot encoding.
➭ One of the categories should be treated as the base category, since if all categories are included, the design matrix will not be invertible: sklearn.preprocessing.OneHotEncoder(drop = "first") could be used to add the one hot encoding columns excluding the first category.
➭ The weight (coefficient) of \(1_{x_{j} = k}\) represents the expected change in the output if the feature \(j\) is in the category \(k\) instead of the base category.
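➭ A minimal sketch of one hot encoding a single discrete feature while dropping the base category (the category labels are made up):
```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

x = np.array([["A"], ["B"], ["C"], ["B"]])    # one discrete feature with categories A, B, C
enc = OneHotEncoder(drop="first")
print(enc.fit_transform(x).toarray())         # columns for B and C only; A is the base category
print(enc.categories_)
```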

📗 Log Transforms of Features
➭ Log transformations can be used on the features too.
➭ The weight (coefficient) of \(\log\left(x_{j}\right)\) divided by 100 is approximately the expected change in the output if the feature is increased by 1 percent (a 1 percent increase in \(x_{j}\) increases \(\log\left(x_{j}\right)\) by about 0.01).
➭ If \(\log\left(y\right)\) is used in place of \(y\), then 100 times a weight is approximately the percentage change in \(y\) due to a one unit increase in the corresponding feature; if both \(y\) and \(x_{j}\) are logged, the weight is approximately the percentage change in \(y\) due to a 1 percent increase in \(x_{j}\).
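➭ A sketch of the log-log case on synthetic data, where \(y\) is roughly proportional to \(x^{2}\), so the fitted weight should be close to 2:
```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(1, 10, size=(100, 1))
y = 5 * x[:, 0] ** 2 * np.exp(rng.normal(scale=0.1, size=100))

# Regress log(y) on log(x): the weight is approximately the percent change in y
# per 1 percent change in x.
model = LinearRegression().fit(np.log(x), np.log(y))
print(model.coef_)    # close to 2: a 1 percent increase in x gives about a 2 percent increase in y
```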

Additional Example ➭ More examples of how polynomial features are added, including interaction terms.
➭ Code for polynomial features: Notebook.




📗 Notes and code adapted from the course taught by Yiyin Shen Link and Tyler Caraza-Harter Link






Last Updated: April 29, 2024 at 1:10 AM