📗 Classification is the problem when \(y\) is categorical.
➩ When \(y = \left\{0, 1\right\}\), the problem is binary classification.
➩ When \(y = \left\{0, 1, ..., K\right\}\), the problem is multi-class classification.
📗 Regression is the problem when \(y\) is continuous.
➩ Logistic regression is usually used for classification problems, but since it predicts a continuous value in \(\left[0, 1\right]\), the probability that \(y\) is in class \(1\), it is called "regression".
📗 The regression coefficients are usually estimated by \(w = \left(X^\top X\right)^{-1} X^\top y\), where \(X\) is the design matrix whose rows are the items and whose columns are the features (a column of \(1\)s can be added so that the corresponding weight is the bias): Wikipedia.
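📗 A minimal sketch of the normal equation in NumPy (the dataset below is synthetic and purely illustrative):
```python
import numpy as np

# Small synthetic dataset: n = 5 items, m = 2 features.
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 5.0]])
y = np.array([3.5, 3.0, 7.5, 7.0, 10.5])

# Append a column of 1s so the last coefficient plays the role of the bias b.
X1 = np.hstack([X, np.ones((X.shape[0], 1))])

# Normal equation w = (X^T X)^{-1} X^T y, solved without forming the explicit inverse.
w = np.linalg.solve(X1.T @ X1, X1.T @ y)
print(w)  # weights for the two features followed by the bias
```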
📗 Gradient descent can also be used with the squared loss, and with a sufficiently small learning rate \(\alpha\) the weights converge to the same estimates: \(w = w - \alpha \left(\left(a_{1} - y_{1}\right) x_{1} + \left(a_{2} - y_{2}\right) x_{2} + ... + \left(a_{n} - y_{n}\right) x_{n}\right)\) and \(b = b - \alpha \left(\left(a_{1} - y_{1}\right) + \left(a_{2} - y_{2}\right) + ... + \left(a_{n} - y_{n}\right)\right)\).
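📗 A sketch of the batch gradient descent update above, on the same kind of synthetic data (the learning rate and iteration count are illustrative choices):
```python
import numpy as np

# Batch gradient descent for linear regression with squared loss, using the
# updates w <- w - alpha * sum_i (a_i - y_i) x_i and b <- b - alpha * sum_i (a_i - y_i).
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = np.array([3.5, 3.0, 7.5, 7.0, 10.5])

alpha = 0.005                    # illustrative learning rate
w = np.zeros(X.shape[1])
b = 0.0
for _ in range(20000):
    a = X @ w + b                # predictions a_i
    error = a - y                # residuals a_i - y_i
    w -= alpha * (X.T @ error)   # sum over items of (a_i - y_i) x_i
    b -= alpha * error.sum()     # sum over items of (a_i - y_i)
print(w, b)                      # should be close to the normal equation estimates
```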
Math Note
📗 The coefficients can also be derived from the (generally over-determined) system \(y = X w\): multiplying both sides by \(X^\top\) gives \(X^\top y = X^\top X w\), so \(w = \left(X^\top X\right)^{-1} X^\top y\) when \(X^\top X\) is invertible.
📗 The gradient descent step follows from the cost \(C_{i} = \dfrac{1}{2} \left(a_{i} - y_{i}\right)^{2}\) with \(a_{i} = w_{1} x_{i 1} + w_{2} x_{i 2} + ... + w_{m} x_{i m} + b\), so \(\dfrac{\partial C_{i}}{\partial w_{j}} = \left(a_{i} - y_{i}\right) x_{i j}\) and \(\dfrac{\partial C_{i}}{\partial b} = a_{i} - y_{i}\), which gives the update rules above.
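📗 A quick finite-difference check of the derivative above (a sketch with hypothetical values for one item and the current weights):
```python
import numpy as np

# Check that dC_i/dw_j = (a_i - y_i) * x_ij for C_i = 1/2 (a_i - y_i)^2
# with a_i = w . x_i + b (all numbers below are hypothetical).
x_i = np.array([2.0, -1.0])
y_i = 1.5
w = np.array([0.3, 0.7])
b = 0.1

def loss(w):
    a_i = w @ x_i + b
    return 0.5 * (a_i - y_i) ** 2

eps = 1e-6
j = 0                                           # check the derivative for weight j = 0
e_j = np.eye(2)[j]
numeric = (loss(w + eps * e_j) - loss(w - eps * e_j)) / (2 * eps)
analytic = (w @ x_i + b - y_i) * x_i[j]
print(numeric, analytic)                        # the two values should agree closely
```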
📗 Using linear regression to estimate probabilities is inappropriate, since the loss \(\dfrac{1}{2} \left(a_{i} - y_{i}\right)^{2}\) is large when \(y_{i} = 1\) and \(a_{i} > 1\) or when \(y_{i} = 0\) and \(a_{i} < 0\), so linear regression penalizes predictions that are "very correct": Wikipedia.
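📗 A small numeric check of this point (the predictions \(a_{i}\) below are hypothetical): for \(y_{i} = 1\), a borderline prediction \(a_{i} = 0.6\) has loss \(0.08\), while a "very correct" prediction \(a_{i} = 2\) has the much larger loss \(0.5\).
```python
# Squared loss 1/2 (a_i - y_i)^2 for y_i = 1 with two hypothetical predictions.
def squared_loss(a, y):
    return 0.5 * (a - y) ** 2

print(squared_loss(0.6, 1))  # borderline prediction: loss 0.08
print(squared_loss(2.0, 1))  # "very correct" prediction with a_i > 1: loss 0.5
```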
📗 Using linear regression and rounding the prediction to the nearest integer is also inappropriate for multi-class classification, since the class labels are arbitrary and the classes should not be ordered by their labels.
TopHat Discussion
📗 [1 point] Change the regression coefficients to minimize the loss.
📗 Given a new item \(x_{i'}\) (indexed by \(i'\)) with features \(\left(x_{i' 1}, x_{i' 2}, ..., x_{i' m}\right)\), the predicted \(y_{i'}\) is given by \(\hat{y}_{i'} = w_{1} x_{i' 1} + w_{2} x_{i' 2} + ... + w_{m} x_{i' m} + b\).
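📗 A sketch of this prediction in NumPy (the fitted \(w\), \(b\), and the new item's features are hypothetical):
```python
import numpy as np

# Prediction for a new item x_i' given fitted weights w and bias b.
w = np.array([1.0, 1.0])        # hypothetical fitted weights w_1, w_2
b = 0.5                         # hypothetical fitted bias
x_new = np.array([2.5, 3.0])    # features (x_i'1, x_i'2) of the new item
y_hat = w @ x_new + b           # w_1 x_i'1 + w_2 x_i'2 + b
print(y_hat)                    # 6.0
```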
📗 The weight (coefficient) for feature \(j\) is usually interpreted as the expected (average) change in \(y_{i'}\) when \(x_{i' j}\) increases by one unit with the other features held constant.
📗 The bias (intercept) is usually interpreted as the expected (average) value of \(y\) when all features have value \(0\), or \(x_{i' 1} = x_{i' 2} = ... = x_{i' m} = 0\).
➩ This interpretation assumes that \(0\) is a valid value for all features (or \(0\) is in the range of all features).
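📗 A sketch illustrating these interpretations, with the same hypothetical \(w\), \(b\), and features as in the previous sketch:
```python
import numpy as np

# Interpretation of a weight and the bias (all numbers are hypothetical).
w = np.array([1.0, 1.0])
b = 0.5
x_new = np.array([2.5, 3.0])

base = w @ x_new + b
bumped = w @ (x_new + np.array([1.0, 0.0])) + b   # feature 1 increased by one unit
print(bumped - base)                              # 1.0, which equals w[0]

all_zero = w @ np.zeros(2) + b                    # all features set to 0
print(all_zero)                                   # 0.5, which equals the bias b
```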
📗 Notes and code adapted from the course taught by Professors Blerina Gkotse, Jerry Zhu, Yudong Chen, Yingyu Liang, Charles Dyer.
📗 If there is an issue with TopHat during the lectures, please submit your answers on paper (include your Wisc ID and answers) or this Google form Link at the end of the lecture.
📗 Anonymous feedback can be submitted to: Form. Non-anonymous feedback and questions can be posted on Piazza: Link