
# Overview

📗 Readings: MARL Chapter 8.
📗 Wikipedia page: Link

# Deep Q Network

📗 Neural networks can be trained on offline data sets using gradient descent to minimize some loss function, \(C\left(Q\left(s, a\right), \hat{Q}\left(s, a; w, b\right)\right)\).
➭ Squared loss: \(C\left(Q, \hat{Q}\right) = \left(Q - \hat{Q}\right)^{2}\).
➭ Cross-entropy loss: \(C\left(Q, \hat{Q}\right) = - Q \log\left(\hat{Q}\right) - \left(1 - Q\right) \log\left(1 - \hat{Q}\right)\).
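The two loss functions above can be sketched directly; this is a minimal illustration, with the target \(Q\) and prediction \(\hat{Q}\) values chosen arbitrarily:

```python
import math

def squared_loss(q, q_hat):
    # C(Q, Q_hat) = (Q - Q_hat)^2
    return (q - q_hat) ** 2

def cross_entropy_loss(q, q_hat):
    # C(Q, Q_hat) = -Q log(Q_hat) - (1 - Q) log(1 - Q_hat)
    return -q * math.log(q_hat) - (1 - q) * math.log(1 - q_hat)

sq = squared_loss(1.0, 0.8)
ce = cross_entropy_loss(1.0, 0.8)
```

Note that the cross-entropy loss assumes \(Q \in [0, 1]\) and \(\hat{Q} \in (0, 1)\), as in a classification setting.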
📗 Gradient descent can then be used to update the weights iteratively: \(w = w - \lambda \dfrac{\partial C}{\partial w}\) and \(b = b - \lambda \dfrac{\partial C}{\partial b}\) for every weight and bias, where \(\lambda\) is the learning rate.
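A minimal sketch of one such update, assuming a single linear unit \(\hat{Q} = w x + b\) with the squared loss (the input \(x\), target, and learning rate below are illustrative):

```python
def gradient_step(w, b, x, q_target, lam):
    # One gradient-descent step on C = (Q - Q_hat)^2 for Q_hat = w * x + b.
    q_hat = w * x + b
    # By the chain rule: dC/dw = 2 (Q_hat - Q) x and dC/db = 2 (Q_hat - Q).
    grad = 2 * (q_hat - q_target)
    w = w - lam * grad * x
    b = b - lam * grad
    return w, b

w, b = gradient_step(0.5, 0.0, x=1.0, q_target=1.0, lam=0.1)
```

After the step, \(\hat{Q} = 0.6 \cdot 1 + 0.1 = 0.7\), closer to the target of 1 than the initial prediction of 0.5.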
📗 Neural networks can also be trained using genetic algorithms.
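A toy sketch of the genetic-algorithm alternative, assuming a one-weight "network" whose fitness is the negative squared loss against a target Q value (the population size, mutation scale, and target are illustrative):

```python
import random

random.seed(0)

def fitness(w, x=1.0, q_target=1.0):
    # Fitness = negative squared loss, so higher is better.
    return -(q_target - w * x) ** 2

# Initialize a random population of candidate weights.
population = [random.uniform(-1, 1) for _ in range(20)]
for _ in range(50):
    # Select the fitter half, then refill with mutated copies of the survivors.
    population.sort(key=fitness, reverse=True)
    survivors = population[:10]
    population = survivors + [w + random.gauss(0, 0.1) for w in survivors]

best = max(population, key=fitness)
```

No gradients are computed here; selection plus mutation alone drives the best weight toward the optimum of 1.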

📗 For Q learning, gradient descent can be combined with Q iteration during online learning: one such algorithm is Deep Q Network with experience replay (DQN), where the cost function is given by \(C\left(r_{t} + \gamma \displaystyle\max_{a} \hat{Q}\left(s_{t+1}, a; w, b\right), \hat{Q}\left(s_{t}, a_{t}; w, b\right)\right)\).
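A minimal sketch of DQN with experience replay, assuming a linear approximator \(\hat{Q}(s, a) = w_{a} \cdot s + b_{a}\) in place of a deep network (the transitions, rewards, and hyperparameters below are illustrative):

```python
import random
from collections import deque

random.seed(0)
GAMMA, LAM, N_ACTIONS, DIM = 0.5, 0.05, 2, 3

# Weights and biases of the linear approximator, one set per action.
w = [[0.0] * DIM for _ in range(N_ACTIONS)]
b = [0.0] * N_ACTIONS
replay = deque(maxlen=1000)  # experience replay buffer

def q_hat(s, a):
    return sum(wi * si for wi, si in zip(w[a], s)) + b[a]

def train_step(batch_size=4):
    # Sample a random minibatch of stored transitions and do one
    # gradient step on the squared loss against the bootstrapped target.
    batch = random.sample(replay, min(batch_size, len(replay)))
    for s, a, r, s_next in batch:
        target = r + GAMMA * max(q_hat(s_next, ap) for ap in range(N_ACTIONS))
        err = q_hat(s, a) - target
        for i in range(DIM):
            w[a][i] -= LAM * 2 * err * s[i]  # dC/dw = 2 (Q_hat - target) s_i
        b[a] -= LAM * 2 * err                # dC/db = 2 (Q_hat - target)

# Store fake transitions (s, a, r, s'): action 0 earns reward 1, action 1
# earns 0, and the state self-loops. Then train on replayed minibatches.
for _ in range(200):
    s = [random.random() for _ in range(DIM)]
    a = random.randrange(N_ACTIONS)
    replay.append((s, a, 1.0 if a == 0 else 0.0, s))
    train_step()
```

Sampling minibatches from the buffer, rather than training only on the most recent transition, breaks the correlation between consecutive samples; a full DQN would also use a separate target network for the \(\max_{a} \hat{Q}(s_{t+1}, a)\) term.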

Last Updated: May 07, 2024 at 12:22 AM