📗 Classification is the problem when \(y\) is categorical.
➩ When \(y \in \left\{0, 1\right\}\), the problem is binary classification.
➩ When \(y \in \left\{0, 1, ..., K\right\}\), the problem is multi-class classification.
📗 Regression is the problem when \(y\) is continuous.
➩ Logistic regression is usually used for classification problems, but since it predicts a continuous value in \(\left[0, 1\right]\), interpreted as the probability that \(y\) is in class \(1\), it is called a "regression".
📗 The regression coefficients are usually estimated by \(w = \left(X^\top X\right)^{-1} X^\top y\), where \(X\) is the design matrix whose rows are the items and whose columns are the features (a column of \(1\)s can be added so that the corresponding weight is the bias): Wikipedia.
📗 Gradient descent can be used with squared loss and the weights should converge to the same estimates: \(w = w - \alpha \left(\left(a_{1} - y_{1}\right) x_{1} + \left(a_{2} - y_{2}\right) x_{2} + ... + \left(a_{n} - y_{n}\right) x_{n}\right)\) and \(b = b - \alpha \left(\left(a_{1} - y_{1}\right) + \left(a_{2} - y_{2}\right) + ... + \left(a_{n} - y_{n}\right)\right)\).
Math Note
📗 The coefficients are sometimes derived from \(y = X w\): multiplying both sides by \(X^\top\) gives \(X^\top y = X^\top X w\), or \(w = \left(X^\top X\right)^{-1} X^\top y\).
📗 The gradient descent step is derived from \(C_{i} = \dfrac{1}{2} \left(a_{i} - y_{i}\right)^{2}\) and \(a_{i} = w_{1} x_{i 1} + w_{2} x_{i 2} + ... + w_{m} x_{i m}\), so \(\nabla_{w} C_{i} = \left(a_{i} - y_{i}\right) x_{i}\) and \(\dfrac{\partial C_{i}}{\partial b} = a_{i} - y_{i}\), which sum over the items to give the update formulas above.
➩ In statistics, MSE is usually used instead of the squared loss, \(C = \dfrac{1}{n} \left(\left(a_{1} - y_{1}\right)^{2} + \left(a_{2} - y_{2}\right)^{2} + ... + \left(a_{n} - y_{n}\right)^{2}\right)\). The gradient descent formula is similar except the gradient is scaled differently, \(\nabla_{w} C = \dfrac{2}{n} \left(\left(a_{1} - y_{1}\right) x_{1} + \left(a_{2} - y_{2}\right) x_{2} + ... + \left(a_{n} - y_{n}\right) x_{n}\right)\), and the constant factor can be absorbed into the learning rate.
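📗 A minimal numpy sketch of both estimation methods (the data, learning rate, and iteration count below are made up for illustration): the closed form solves the normal equations, and gradient descent on the squared loss converges to approximately the same weights.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                    # toy design matrix: 100 items, 2 features
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.0          # toy targets with known weights and bias

Xb = np.hstack([X, np.ones((100, 1))])           # add a column of 1s for the bias
w_closed = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)  # w = (X^T X)^{-1} X^T y

w, b, alpha = np.zeros(2), 0.0, 0.001            # gradient descent on C = (1/2) sum (a_i - y_i)^2
for _ in range(5000):
    a = X @ w + b                                # activations a_i
    w = w - alpha * X.T @ (a - y)                # w = w - alpha sum (a_i - y_i) x_i
    b = b - alpha * np.sum(a - y)                # b = b - alpha sum (a_i - y_i)
print(w_closed)                                  # approximately [3, -2, 1]
print(w, b)                                      # approximately the same estimates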
TopHat Discussion
📗 [1 point] Change the regression coefficients to minimize the loss.
📗 Using linear regression to estimate probabilities is inappropriate, since the loss \(\dfrac{1}{2} \left(a_{i} - y_{i}\right)^{2}\) is large when \(y_{i} = 1\) and \(a_{i} > 1\) or \(y_{i} = 0\) and \(a_{i} < 0\), so linear regression is penalizing predictions that are "very correct": Wikipedia.
📗 Using linear regression and rounding the prediction to the nearest integer is also inappropriate for multi-class classification, since the classes should not be ordered by their labels.
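📗 A quick numeric check of the "very correct" problem (the numbers are made up): with squared loss, a confidently correct prediction can be penalized more than a borderline one.
loss = lambda a, y: 0.5 * (a - y) ** 2  # squared loss for a single item
print(loss(3.0, 1.0))                   # 2.0: y_i = 1, a_i = 3 is "very correct" but gets a large loss
print(loss(0.6, 1.0))                   # 0.08: a borderline prediction gets a much smaller loss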
📗 Given a new item \(x_{i'}\) (indexed by \(i'\)) with features \(\left(x_{i' 1}, x_{i' 2}, ..., x_{i' m}\right)\), the predicted \(y_{i'}\) is given by \(\hat{y}_{i'} = w_{1} x_{i' 1} + w_{2} x_{i' 2} + ... + w_{m} x_{i' m} + b\).
📗 The weight (coefficient) for feature \(j\) is usually interpreted as the expected (average) change in \(y_{i'}\) when \(x_{i' j}\) increases by one unit with the other features held constant.
📗 The bias (intercept) is usually interpreted as the expected (average) value of \(y\) when all features have value \(0\), or \(x_{i' 1} = x_{i' 2} = ... = x_{i' m} = 0\).
➩ This interpretation assumes that \(0\) is a valid value for all features (or \(0\) is in the range of all features).
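📗 A small sketch of the prediction formula and these two interpretations (the coefficients and the new item below are hypothetical):
import numpy as np

w = np.array([3.0, -2.0])                        # hypothetical weights for m = 2 features
b = 1.0                                          # hypothetical bias

x_new = np.array([0.5, 2.0])                     # a new item x_{i'}
y_hat = w @ x_new + b                            # predicted y_{i'} = w_1 x_{i'1} + w_2 x_{i'2} + b
y_plus = w @ (x_new + np.array([1.0, 0.0])) + b  # increase feature 1 by one unit
print(y_plus - y_hat)                            # 3.0 = w_1: change in prediction per unit of feature 1
print(w @ np.zeros(2) + b)                       # 1.0 = b: prediction when all features are 0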
📗 A single perceptron (with a possibly non-linear activation function) still produces a linear decision boundary (the two classes are separated by a hyperplane, which is a line in two dimensions).
📗 Multiple perceptrons can be combined in a way that the output of one perceptron is the input of another perceptron:
➩ \(a^{\left(1\right)} = g\left(w^{\left(1\right)} x + b^{\left(1\right)}\right)\), \(a^{\left(2\right)} = g\left(w^{\left(2\right)} a^{\left(1\right)} + b^{\left(2\right)}\right)\), and so on, where \(g\) is the activation function (see the XOR sketch after this list).
➩ A 2-layer network (1 hidden layer) can approximate any continuous function arbitrarily closely with enough hidden units.
➩ A 3-layer network (2 hidden layers) can approximate any function arbitrarily closely with enough hidden units.
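📗 A minimal sketch with hand-picked LTU weights (one valid choice among many): XOR is not linearly separable, so no single perceptron classifies it correctly, but a 2-layer network with 2 hidden units does.
import numpy as np

g = lambda z: (z >= 0).astype(float)     # LTU activation
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
# XOR labels are [0, 1, 1, 0]: no single line separates the two classes

W1 = np.array([[1.0, 1.0], [1.0, 1.0]])  # hidden unit 1 acts as OR, hidden unit 2 as AND
b1 = np.array([-0.5, -1.5])
w2 = np.array([1.0, -1.0])               # output fires when OR is on and AND is off
b2 = -0.5

a1 = g(X @ W1 + b1)                      # hidden layer activations a^{(1)}
a2 = g(a1 @ w2 + b2)                     # output layer activation a^{(2)}
print(a2)                                # [0. 1. 1. 0.] matches XOR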
TopHat Discussion
📗 Try different combinations of activation function and network architecture (number of layers and units in each layer), and compare which ones are good for the spiral dataset: Link.
📗 The perceptrons will be organized in layers: \(l = 1, 2, ..., L\) and \(m^{\left(l\right)}\) units will be used in layer \(l\).
➩ \(w_{j k}^{\left(l\right)}\) is the weight from unit \(j\) in layer \(l - 1\) to unit \(k\) in layer \(l\), and in the output layer, there is only one unit, so the weights are \(w_{j}^{\left(L\right)}\).
➩ \(b_{j}^{\left(l\right)}\) is the bias for unit \(j\) in layer \(l\), and in the output layer, there is only one unit, so the bias is \(b^{\left(L\right)}\).
➩ \(a_{i j}^{\left(l\right)}\) is the activation for training item \(i\) unit \(j\) in layer \(l\), where \(a_{i j}^{\left(0\right)} = x_{i j}\) can be viewed as unit \(j\) in layer \(0\) (alternatively, \(a_{i j}^{\left(l\right)}\) can be viewed as internal features), and \(a_{i}^{\left(L\right)}\) is the output representing the predicted probability that \(x_{i}\) belongs to class \(1\) or \(\mathbb{P}\left\{\hat{y}_{i} = 1\right\}\).
📗 The way the hidden (internal) units are connected is called the architecture of the network.
📗 In a fully connected network, all units in layer \(l\) are connected to every unit in layer \(l - 1\).
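📗 A forward-pass sketch of a fully connected network in this notation (the weights and inputs are random placeholders; the layer sizes match the diagram below):
import numpy as np

g = lambda z: 1.0 / (1.0 + np.exp(-z))  # sigmoid, one common choice of activation
sizes = [4, 3, 2, 1]                    # 4 inputs, m^{(1)} = 3, m^{(2)} = 2, and 1 output unit
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(sizes[l - 1], sizes[l])) for l in range(1, len(sizes))]  # W^{(l)}[j, k] = w_{jk}^{(l)}
bs = [rng.normal(size=(sizes[l],)) for l in range(1, len(sizes))]               # b^{(l)}

a = rng.normal(size=(5, 4))             # 5 items as rows, so a^{(0)}_{ij} = x_{ij}
for W, b in zip(Ws, bs):
    a = g(a @ W + b)                    # a^{(l)} = g(a^{(l-1)} W^{(l)} + b^{(l)})
print(a)                                # a^{(L)}_i, the predicted probability that each x_i is in class 1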
📗 [1 point] The following is a diagram of a fully connected neural network with \(4\) input units, \(3\) units in hidden layer \(1\), \(2\) units in hidden layer \(2\), and \(1\) output unit: highlight an edge (mouse or touch drag from one node to another node) to see the name of the weight (highlight the same edge to hide the name).
TopHat Quiz
(Past Exam Question)
📗 [4 points] Given the following neural network that classifies all the training instances correctly, what are the labels (\(0\) or \(1\)) of the training data? The activation functions are LTU for all units: \(1_{\left\{z \geq 0\right\}}\). The first layer weight matrix, the first layer bias vector, the second layer weight vector, and the second layer bias are given in the diagram.
\(x_{i1}\) | \(x_{i2}\) | \(y_{i}\) or \(a^{\left(2\right)}_{1}\)
0 | 0 | ?
0 | 1 | ?
1 | 0 | ?
1 | 1 | ?
Note: if the weights are not shown clearly, you could move the nodes around with mouse or touch.
📗 Answer (comma separated vector): .
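📗 The quiz weights are not reproduced in the text above, so the following sketch only shows how the labels would be computed once the weights are read off the diagram; the weight values below are placeholders, not the actual quiz values.
import numpy as np

g = lambda z: (z >= 0).astype(float)       # LTU activation 1_{z >= 0}

W1 = np.array([[1.0, -1.0], [-1.0, 1.0]])  # placeholder first layer weight matrix
b1 = np.array([-0.5, -0.5])                # placeholder first layer bias vector
w2 = np.array([1.0, 1.0])                  # placeholder second layer weight vector
b2 = -0.5                                  # placeholder second layer bias

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # the four items in the table
labels = g(g(X @ W1 + b1) @ w2 + b2)
print(labels)                              # the four ? entries, for these placeholder weights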
📗 Notes and code adapted from the course taught by Professors Blerina Gkotse, Jerry Zhu, Yudong Chen, Yingyu Liang, Charles Dyer.
📗 If there is an issue with TopHat during the lectures, please submit your answers on paper (include your Wisc ID and answers) or this Google form Link at the end of the lecture.
📗 Anonymous feedback can be submitted to: Form. Non-anonymous feedback and questions can be posted on Piazza: Link