# Gradient Methods

📗 Gradient descent can be used to minimize the loss of a neural network. The weights and biases can be updated using the gradient descent formulas \(w = w - \alpha \dfrac{\partial C}{\partial w}\) and \(b = b - \alpha \dfrac{\partial C}{\partial b}\), for some learning rate \(\alpha\).
📗 The gradient (partial derivatives) for a neural network can be computed using the chain rule.
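📗 A minimal sketch of this update step in Python (illustrative only; the function and argument names and the example learning rate are assumptions, not part of the notes):

```python
def gradient_descent_step(w, b, grad_w, grad_b, alpha=0.1):
    # One gradient descent update: w = w - alpha * dC/dw, b = b - alpha * dC/db
    return w - alpha * grad_w, b - alpha * grad_b
```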

📗 For a neural network with \(m\) input units, 1 hidden layer with \(H\) hidden units, and 1 output unit (a code sketch follows this list):
➩ Compute the activation for training item \(i\): \(h_{i j} = g\left(\displaystyle\sum_{l=1}^{m} w_{l j} x_{i l} + b_{j}\right)\) and \(o_{i} = g\left(\displaystyle\sum_{j=1}^{H} w_{j} h_{i j} + b\right)\) where \(g\left(z\right) = \dfrac{1}{1 + e^{-z}}\) and \(g'\left(z\right) = g\left(z\right) \cdot \left(1 - g\left(z\right)\right)\).
➩ Compute the derivatives for the second layer weights and biases: \(\dfrac{\partial C}{\partial w_{j}} = \displaystyle\sum_{i=1}^{n} \dfrac{\partial C_{i}}{\partial o_{i}} \dfrac{\partial o_{i}}{\partial w_{j}} = \displaystyle\sum_{i=1}^{n} \left(o_{i} - y_{i}\right) h_{i j}\) and \(\dfrac{\partial C}{\partial b} = \displaystyle\sum_{i=1}^{n} \left(o_{i} - y_{i}\right)\) where \(C = \displaystyle\sum_{i=1}^{n} C_{i}\) and \(C_{i}\left(o_{i}, y_{i}\right) = -\left(y_{i} \log\left(o_{i}\right) + \left(1 - y_{i}\right) \log\left(1 - o_{i}\right)\right)\) is the cross-entropy cost.
➩ Compute the derivatives for the first layer weights and biases: \(\dfrac{\partial C}{\partial w_{l j}} = \displaystyle\sum_{i=1}^{n} \dfrac{\partial C_{i}}{\partial o_{i}} \dfrac{\partial o_{i}}{\partial h_{i j}} \dfrac{\partial h_{i j}}{\partial w_{l j}} = \displaystyle\sum_{i=1}^{n} \left(o_{i} - y_{i}\right) w_{j} h_{i j} \left(1 - h_{i j}\right) x_{i l}\) and \(\dfrac{\partial C}{\partial b_{j}} = \displaystyle\sum_{i=1}^{n} \left(o_{i} - y_{i}\right) w_{j} h_{i j} \left(1 - h_{i j}\right)\).
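📗 A minimal NumPy sketch of these steps (illustrative only; it assumes `X` is the \(n \times m\) input matrix, `W1` is \(m \times H\), `w2` is a length-\(H\) vector, and `y` is the length-\(n\) label vector; the function name and shapes are assumptions, not part of the notes):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradients_binary(X, y, W1, b1, w2, b2):
    # Forward pass: h_ij = g(sum_l w_lj x_il + b_j), o_i = g(sum_j w_j h_ij + b)
    h = sigmoid(X @ W1 + b1)                         # n x H hidden activations
    o = sigmoid(h @ w2 + b2)                         # n output activations
    # Second layer: dC/dw_j = sum_i (o_i - y_i) h_ij, dC/db = sum_i (o_i - y_i)
    delta_o = o - y                                  # length n
    grad_w2 = h.T @ delta_o                          # length H
    grad_b2 = delta_o.sum()
    # First layer: dC/dw_lj = sum_i (o_i - y_i) w_j h_ij (1 - h_ij) x_il
    delta_h = np.outer(delta_o, w2) * h * (1 - h)    # n x H
    grad_W1 = X.T @ delta_h                          # m x H
    grad_b1 = delta_h.sum(axis=0)                    # length H
    return grad_W1, grad_b1, grad_w2, grad_b2
```
➩ The returned gradients can then be plugged into the update rule \(w = w - \alpha \dfrac{\partial C}{\partial w}\) above.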

📗 For a neural network with \(m\) input units, 1 hidden layer with \(H\) hidden units, and \(K\) output units (a code sketch follows this list):
➩ Compute the activation for training item \(i\): \(h_{i j} = g\left(\displaystyle\sum_{l=1}^{m} w_{l j} x_{i l} + b_{j}\right)\) and \(o_{i k} = f\left(\displaystyle\sum_{j=1}^{H} w_{j k} h_{i j} + b_{k}\right)\) where \(f\left(z_{k}\right) = \dfrac{e^{z_{k}}}{\displaystyle\sum_{l=1}^{K} e^{z_{l}}}\) is the softmax function; its diagonal derivative is \(\dfrac{\partial f\left(z_{k}\right)}{\partial z_{k}} = f\left(z_{k}\right) \cdot \left(1 - f\left(z_{k}\right)\right)\), and combined with the cross-entropy cost below the gradient with respect to \(z_{k}\) simplifies to \(o_{i k} - y_{i k}\).
➩ Compute the derivatives for the second layer weights and biases: \(\dfrac{\partial C}{\partial w_{j k}} = \displaystyle\sum_{i=1}^{n} \displaystyle\sum_{k'=1}^{K} \dfrac{\partial C_{i}}{\partial o_{i k'}} \dfrac{\partial o_{i k'}}{\partial w_{j k}} = \displaystyle\sum_{i=1}^{n} \left(o_{i k} - y_{i k}\right) h_{i j}\) and \(\dfrac{\partial C}{\partial b_{k}} = \displaystyle\sum_{i=1}^{n} \left(o_{i k} - y_{i k}\right)\) where \(C_{i}\left(o_{i}, y_{i}\right) = -\displaystyle\sum_{k=1}^{K} y_{i k} \log\left(o_{i k}\right)\) and \(y_{i k} = 1_{\left\{y_{i} = k\right\}}\).
➩ Compute the derivatives for the first layer weights and biases: \(\dfrac{\partial C}{\partial w_{l j}} = \displaystyle\sum_{i=1}^{n} \displaystyle\sum_{k=1}^{K} \dfrac{\partial C_{i}}{\partial o_{i k}} \dfrac{\partial o_{i k}}{\partial h_{i j}} \dfrac{\partial h_{i j}}{\partial w_{l j}} = \displaystyle\sum_{i=1}^{n} \displaystyle\sum_{k=1}^{K} \left(o_{i k} - y_{i k}\right) w_{j k} h_{i j} \left(1 - h_{i j}\right) x_{i l}\) and \(\dfrac{\partial C}{\partial b_{j}} = \displaystyle\sum_{i=1}^{n} \displaystyle\sum_{k=1}^{K} \left(o_{i k} - y_{i k}\right) w_{j k} h_{i j} \left(1 - h_{i j}\right)\).
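📗 A corresponding sketch for the \(K\)-output softmax network (again illustrative; it assumes `Y` is the \(n \times K\) one-hot matrix with \(y_{i k} = 1_{\left\{y_{i} = k\right\}}\) and `W2` is \(H \times K\)):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(Z):
    # Row-wise softmax f(z_k) = exp(z_k) / sum_l exp(z_l), shifted for numerical stability
    E = np.exp(Z - Z.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def gradients_multiclass(X, Y, W1, b1, W2, b2):
    # Forward pass: h_ij = g(sum_l w_lj x_il + b_j), o_ik = softmax_k(sum_j w_jk h_ij + b_k)
    h = sigmoid(X @ W1 + b1)                         # n x H hidden activations
    o = softmax(h @ W2 + b2)                         # n x K output probabilities
    # Second layer: dC/dw_jk = sum_i (o_ik - y_ik) h_ij, dC/db_k = sum_i (o_ik - y_ik)
    delta_o = o - Y                                  # n x K
    grad_W2 = h.T @ delta_o                          # H x K
    grad_b2 = delta_o.sum(axis=0)                    # length K
    # First layer: dC/dw_lj = sum_i sum_k (o_ik - y_ik) w_jk h_ij (1 - h_ij) x_il
    delta_h = (delta_o @ W2.T) * h * (1 - h)         # n x H
    grad_W1 = X.T @ delta_h                          # m x H
    grad_b1 = delta_h.sum(axis=0)                    # length H
    return grad_W1, grad_b1, grad_W2, grad_b2
```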

Last Updated: July 16, 2024 at 11:51 AM