# Policy Gradient

📗 \(\hat{\pi}\) can be represented by a neural network itself (with or without the Q network), and can be chosen to maximize the total reward \(V\left(\pi\right) = \mathbb{E}\left[\displaystyle\sum_{t=1}^{T} r_{t}\right]\).
📗 When \(\pi\left(s; w, b\right)\) is parameterized by a neural network, the derivative can be computed by the Policy Gradient Theorem: \(\dfrac{\partial V}{\partial \left(w, b\right)} = \mathbb{E}\left[\displaystyle\sum_{t=1}^{T} \dfrac{\partial \log \pi\left(s_{t} ; w, b\right)}{\partial \left(w, b\right)} \displaystyle\sum_{t'=t}^{T} \beta^{t' - t} r_{t'}\right]\), and gradient ascent (for maximization) updates \(\left(w, b\right) \leftarrow \left(w, b\right) + \alpha \dfrac{\partial V}{\partial \left(w, b\right)}\). This algorithm is called the REINFORCE algorithm: REward Increment = Non-negative Factor x Offset reinforcement x Characteristic Eligibility.
📗 Equivalently, the REINFORCE algorithm computes the gradient based on the loss \(-\dfrac{1}{T} \displaystyle\sum_{t=1}^{T} \left(\displaystyle\sum_{t' = t}^{T} \beta^{t' - t} r_{t'}\right) \log \pi\left(s_{t} ; w, b\right)\), where the inner sum is the discounted return starting from time \(t\).
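📗 A minimal sketch of one REINFORCE update, assuming PyTorch. The environment interface (`env.reset()` returning a state, `env.step(a)` returning state, reward, and a done flag), the network sizes, and the learning rate are illustrative assumptions, not part of the notes:

```python
import torch
import torch.nn as nn

# Hypothetical policy network: 4 state features -> logits for 2 actions.
policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
beta = 0.99  # discount factor

def run_episode(env):
    """Roll out one episode, recording log pi(s_t; w, b) and r_t."""
    log_probs, rewards = [], []
    s, done = env.reset(), False
    while not done:
        logits = policy(torch.as_tensor(s, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        a = dist.sample()
        s, r, done = env.step(a.item())  # assumed environment interface
        log_probs.append(dist.log_prob(a))
        rewards.append(r)
    return log_probs, rewards

def reinforce_update(log_probs, rewards):
    # Discounted returns G_t = sum_{t'=t}^{T} beta^{t'-t} r_{t'},
    # computed backwards in a single pass.
    G, returns = 0.0, []
    for r in reversed(rewards):
        G = r + beta * G
        returns.append(G)
    returns.reverse()
    # Loss = -(1/T) sum_t G_t log pi(s_t; w, b); minimizing this loss
    # performs the gradient ascent step (w, b) <- (w, b) + alpha dV/d(w, b).
    loss = -torch.stack([G_t * lp for G_t, lp in zip(returns, log_probs)]).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```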

📗 \(\displaystyle\sum_{t'=t}^{T} \beta^{t' - t} r_{t'}\) is an estimate of \(Q\left(s_{t}, a_{t}\right)\) and can be replaced by the advantage \(A\left(s_{t}, a_{t}\right) = r\left(s_{t}, a_{t}\right) + \beta V\left(s_{t+1}\right) - V\left(s_{t}\right)\). In this case, \(V\) can be represented by another neural network (the critic network), and combined with the \(\pi\) network (the actor network); the resulting algorithm is called the actor-critic (A2C, or Advantage Actor Critic) algorithm.
📗 The actor and critic networks can be combined into one network so that they share hidden units (hidden features).
📗 The A2C algorithm computes the gradient based on the losses: (for the actor) \(-\left(r_{t} + \beta V\left(s_{t+1} ; w, b\right) - V\left(s_{t} ; w, b\right)\right) \log \pi\left(s_{t} ; w, b\right)\), where the advantage factor is treated as a constant when differentiating, and (for the critic) \(\left(r_{t} + \beta V\left(s_{t+1} ; w, b\right) - V\left(s_{t} ; w, b\right)\right)^{2}\).
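📗 A minimal sketch of the two A2C losses for a single transition, again assuming PyTorch. Here `actor` and `critic` are separate hypothetical networks (the notes point out they can instead share hidden units); the advantage is detached so the actor loss does not back-propagate through the critic:

```python
import torch
import torch.nn as nn

# Hypothetical networks: 4 state features, 2 actions.
actor = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))   # logits
critic = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))  # V(s)
optimizer = torch.optim.Adam(
    list(actor.parameters()) + list(critic.parameters()), lr=1e-3)
beta = 0.99

def a2c_loss(s, a, r, s_next, done):
    """Actor and critic losses for one transition (s, a, r, s_next)."""
    s = torch.as_tensor(s, dtype=torch.float32)
    s_next = torch.as_tensor(s_next, dtype=torch.float32)
    v = critic(s).squeeze(-1)
    v_next = torch.zeros(()) if done else critic(s_next).squeeze(-1)
    # Advantage estimate (TD error): r_t + beta V(s_{t+1}) - V(s_t).
    advantage = r + beta * v_next - v
    log_prob = torch.distributions.Categorical(
        logits=actor(s)).log_prob(torch.as_tensor(a))
    actor_loss = -advantage.detach() * log_prob  # advantage held constant
    critic_loss = advantage.pow(2)               # squared TD error
    return actor_loss + critic_loss
```

📗 A single update is then `a2c_loss(s, a, r, s_next, done).backward()` followed by `optimizer.step()`; in practice the critic's target \(r_{t} + \beta V\left(s_{t+1}\right)\) is often also detached so the critic regresses toward a fixed target.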

Last Updated: July 16, 2024 at 11:51 AM