Reinforcement Learning - Policy Gradient Algorithms
The algorithms that popularized RLHF for language models were policy-gradient methods. When RLHF rose to prominence with ChatGPT, it was widely understood that a variant of PPO was used, and many initial methods were built around it. Over time, multiple research projects have shown the promise of REINFORCE-style algorithms, touted for their simplicity relative to PPO: they drop the learned value model (saving memory and thus the number of GPUs required) and use simpler value estimation.
Policy Gradient Algorithms
The objective of the agent in reinforcement learning is to maximize its discounted cumulative future reward, known as the return, defined as:
\begin{equation} G_t = R_{t+1} + \gamma R_{t+2} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \end{equation}
where $\gamma \in [0, 1]$ is the discount factor, which prioritizes near-term rewards over long-term rewards.
The return can also be defined in terms of future return as:
\begin{equation} G_t = R_{t+1} + \gamma G_{t+1} \end{equation}
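For a finite episode, the return at every step can be computed by sweeping backwards over the rewards using the recursive form above. A minimal sketch in plain Python; the reward values and discount factor are illustrative:

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute G_t for every step of a finite episode.

    rewards[t] plays the role of R_{t+1}; the loop applies
    G_t = R_{t+1} + gamma * G_{t+1}, working backwards from the end.
    """
    returns = [0.0] * len(rewards)
    future_return = 0.0
    for t in reversed(range(len(rewards))):
        future_return = rewards[t] + gamma * future_return
        returns[t] = future_return
    return returns

# Example: a three-step episode.
print(discounted_returns([1.0, 0.0, 2.0], gamma=0.9))  # [2.62, 1.8, 2.0]
```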
This return is the basis for learning a value function $V(s)$, the expected future return from a given state $s$: \begin{equation} V(s) = \mathbb{E}[G_t \mid S_t = s] \end{equation}
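Because $V(s)$ is an expectation over returns, one simple (if sample-hungry) way to estimate it is to average the returns of many rollouts that start in $s$. A minimal sketch, reusing the `discounted_returns` helper above and assuming a hypothetical `sample_episode_from(s)` that runs the current policy from $s$ and returns its list of rewards:

```python
def monte_carlo_value(s, sample_episode_from, gamma=0.99, num_episodes=1000):
    """Estimate V(s) = E[G_t | S_t = s] by averaging sampled returns."""
    total = 0.0
    for _ in range(num_episodes):
        rewards = sample_episode_from(s)                 # one rollout under the policy
        total += discounted_returns(rewards, gamma)[0]   # G_0: the return from state s
    return total / num_episodes
```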
Policy gradient methods optimize a policy $\pi(a \mid s)$ directly with respect to the value function it induces.
Where $d_{\pi}(s)$ is the stationary distribution of states under policy $\pi(a \mid s)$, the optimization objective is defined as:
\begin{equation} J(\theta) = \sum_{s} d_{\pi}(s) V_{\pi}(s) \end{equation}
The core of policy gradient methods is computing the gradient of the finite-time expected return under the current policy with respect to the policy parameters. With this expected return $J$, the parameter update can be computed as follows, where $\alpha$ is the learning rate:
\begin{equation} \theta \leftarrow \theta + \alpha \nabla_{\theta} J(\theta) \end{equation}
The core implementation detail is how to compute the policy gradient $\nabla_{\theta} J(\theta)$.
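In an autodiff framework this update is plain gradient ascent on $J(\theta)$. A minimal PyTorch-style sketch, assuming a hypothetical differentiable `surrogate_objective(policy)` whose gradient estimates $\nabla_{\theta} J(\theta)$ (constructing that estimator is exactly what the rest of this section builds toward):

```python
import torch

policy = torch.nn.Linear(4, 2)  # toy stand-in for pi_theta
optimizer = torch.optim.SGD(policy.parameters(), lr=1e-3)  # lr plays the role of alpha

def gradient_ascent_step(surrogate_objective):
    loss = -surrogate_objective(policy)  # negate: optimizers minimize, we maximize J
    optimizer.zero_grad()
    loss.backward()                      # gradients with respect to theta
    optimizer.step()                     # theta <- theta + alpha * grad_theta J(theta)
```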
Another way to pose the RL objective we want to maximize is the expected return over trajectories $\tau$ sampled from the policy $\pi_{\theta}$:
\begin{equation} J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}} [R(\tau)] \end{equation}
where $\tau = (s_0, a_0, s_1, a_1, \ldots)$ is a trajectory and $R(\tau) = \sum_{t=0}^{\infty} r_t$ is the total reward of the trajectory. We can write the expectation as an integral over all possible trajectories:
\begin{equation} J(\theta) = \int_{\tau} p_{\theta}(\tau) R(\tau) \, d\tau \end{equation}
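Because $J(\theta)$ is just an expectation, it can be approximated by sampling: roll out a batch of trajectories under the current policy and average their total rewards. A minimal sketch, assuming a hypothetical `sample_trajectory(policy)` that returns the per-step rewards $r_t$ of one rollout:

```python
def estimate_objective(policy, sample_trajectory, num_trajectories=64):
    """Monte Carlo estimate of J(theta) = E_{tau ~ pi_theta}[R(tau)]."""
    total = 0.0
    for _ in range(num_trajectories):
        rewards = sample_trajectory(policy)  # one tau sampled from pi_theta
        total += sum(rewards)                # R(tau) = sum_t r_t
    return total / num_trajectories
```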
Now, notice we can express the trajectory distribution $p_{\theta}(\tau)$ as:
\begin{equation} p_{\theta}(\tau) = p(s_0) \prod_{t=0}^{\infty} \pi_{\theta}(a_t \mid s_t) p(s_{t+1} \mid s_t, a_t) \end{equation}
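A useful consequence of this factorization is that the log-probability of a trajectory splits into a sum of per-step terms, and only the $\log \pi_{\theta}(a_t \mid s_t)$ terms depend on $\theta$. A minimal sketch with the per-step log-probabilities passed in as lists (the argument names are illustrative):

```python
def log_prob_trajectory(logp_s0, logp_actions, logp_dynamics):
    """log p_theta(tau) under the factorization above.

    logp_s0       : log p(s_0)
    logp_actions  : [log pi_theta(a_t | s_t) for each t]   -- depends on theta
    logp_dynamics : [log p(s_{t+1} | s_t, a_t) for each t] -- independent of theta
    """
    return logp_s0 + sum(logp_actions) + sum(logp_dynamics)
```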
If we take the gradient of $J(\theta)$ with respect to the policy parameters $\theta$, we get:
\begin{equation} \nabla_{\theta} J(\theta) = \int_{\tau} \nabla_{\theta} p_{\theta}(\tau) R(\tau) \, d\tau \end{equation}