Which policy gradient algorithm are you using?
So, you probably all know the formula for updating the policy network using the policy gradient theorem:

$$\nabla_\theta J(\theta) = \mathbb{E}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{G}_t\right]$$

Here, the action $a_t$ is sampled from the current policy $\pi_\theta(\cdot \mid s_t)$, and $\hat{G}_t$ is an estimate of the return obtained from state $s_t$ onwards.

You've probably also heard that there are different algorithms for computing the return estimate $\hat{G}_t$.
Disclaimer: The code snippets given in this blog post will probably not work directly out of the box. The full working code, as well as results from training these agents on the LunarLander environment, can be found on GitHub.
VANILLA POLICY GRADIENT⌗
In vanilla policy gradient (VPG) there is nothing fancy; we just calculate the return as the sum of rewards obtained from the given time step until the end of the episode:

$$\hat{G}_t = r_t + r_{t+1} + \dots + r_T = \sum_{k=t}^{T} r_k$$

Finally we compute the gradient:

$$\nabla_\theta J(\theta) \approx \frac{1}{T} \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{G}_t$$
Each of the states was visited using the current policy and the state distribution function of the environment, so for each state-action pair we can compute an estimate of the gradient. Then, we average over the samples, much like a mini-batch update in supervised learning.
To implement this algorithm we will first make use of a function that describes an agent-environment loop. We will assume that `env` is a non-vectorized environment that conforms to the OpenAI Gym API.
The environment loop will run multiple iterations and on each iteration it will rollout an episode using the current policy, and then it will update the policy using the sampled data.
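A minimal sketch of what such a loop could look like is shown below. It assumes PyTorch, a discrete action space, the five-tuple `step` API, and an `update` callable that performs the actual learning step; these names and signatures are placeholders rather than the exact code from the repository.

```python
import torch

def environment_loop(env, policy, optimizer, update, num_iterations):
    """Roll out one episode per iteration with the current policy, then update."""
    for _ in range(num_iterations):
        observations, actions, rewards = [], [], []
        obs, _ = env.reset()
        done = False
        while not done:
            # Sample an action from the current policy.
            with torch.no_grad():
                logits = policy(torch.as_tensor(obs, dtype=torch.float32))
                action = torch.distributions.Categorical(logits=logits).sample().item()
            next_obs, reward, terminated, truncated, _ = env.step(action)
            observations.append(obs)
            actions.append(action)
            rewards.append(reward)
            obs = next_obs
            done = terminated or truncated
        # Update the policy using the data sampled during the episode.
        update(policy, optimizer, observations, actions, rewards)
```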
The returns are calculated by simply adding the rewards obtained from the given time-step onwards. I’ve used a simple vector-matrix multiplication here where I multiply the rewards vector by a lower-triangular matrix of ones in order to get the desired sums.
One trick that is usually done in practice is to normalize the returns before computing the gradient. This is done in order to stabilize the training of the neural network. Note that this modification does not change the expectation of the gradient estimation.
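Putting the return computation and the normalization together, a sketch of the VPG update function might look like this (again assuming PyTorch and a categorical policy; the `policy` and `optimizer` arguments match the loop sketched above):

```python
import torch

def update(policy, optimizer, observations, actions, rewards):
    """One vanilla policy gradient step computed from a single episode."""
    obs = torch.as_tensor(observations, dtype=torch.float32)  # (T, obs_dim)
    acts = torch.as_tensor(actions, dtype=torch.int64)        # (T,)
    rews = torch.as_tensor(rewards, dtype=torch.float32)      # (T,)

    # Returns-to-go: multiplying the rewards vector by a lower-triangular
    # matrix of ones gives returns[t] = rews[t] + rews[t+1] + ... + rews[T-1].
    T = rews.shape[0]
    returns = rews @ torch.tril(torch.ones(T, T))

    # Normalize the returns to stabilize training.
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)

    # Policy gradient loss: -log pi(a_t | s_t) * G_t, averaged over time steps,
    # so that backprop produces the gradient estimate from above.
    logp = torch.distributions.Categorical(logits=policy(obs)).log_prob(acts)
    loss = -(logp * returns).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```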
VPG WITH BASELINE⌗
Since the single-sample Monte-Carlo estimate of the returns might have a lot of variance, we may want to perform multiple rollouts instead of just one. That is, roll out several episodes and only then calculate the gradient and backpropagate. However, it is not clear how helpful that would be, because you will still get (mostly) single-sample Monte-Carlo estimates; it's just that you will have more data points in the batch. To see why that is, consider rolling out a second episode with the same policy. At some step the second episode will diverge from the first and visit states that no other episode visits, so the return estimate for each of those states is still based on a single sample.
A much better approach to reduce the variance of the estimation would be to add a baseline to the estimate of the return.
We will baseline the returns using the value function

$$V^\pi(s) = \mathbb{E}_\pi\left[\hat{G}_t \mid s_t = s\right],$$

and the baseline that we are using is $V^\pi(s_t)$. The formula for the gradient now becomes:

$$\nabla_\theta J(\theta) \approx \frac{1}{T} \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t,$$

where $\hat{A}_t = \hat{G}_t - V^\pi(s_t)$ is the advantage estimate.
The value function is approximated using a second neural network which is trained concurrently with the policy. The update function will be modified like this:
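For illustration, a sketch of such a modified update is given below, where `value_net` and `value_opt` are the assumed second network and its optimizer:

```python
import torch
import torch.nn.functional as F

def update(policy, value_net, policy_opt, value_opt,
           observations, actions, rewards):
    """VPG update with a learned value-function baseline (a sketch)."""
    obs = torch.as_tensor(observations, dtype=torch.float32)
    acts = torch.as_tensor(actions, dtype=torch.int64)
    rews = torch.as_tensor(rewards, dtype=torch.float32)

    # Monte-Carlo returns, as before.
    T = rews.shape[0]
    returns = rews @ torch.tril(torch.ones(T, T))

    # Baseline the returns with the value estimates, then normalize.
    values = value_net(obs).squeeze(-1)
    advantages = returns - values.detach()
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

    # Policy update.
    logp = torch.distributions.Categorical(logits=policy(obs)).log_prob(acts)
    policy_loss = -(logp * advantages).mean()
    policy_opt.zero_grad()
    policy_loss.backward()
    policy_opt.step()

    # Value update: regress the value network onto the Monte-Carlo returns.
    value_loss = F.mse_loss(values, returns)
    value_opt.zero_grad()
    value_loss.backward()
    value_opt.step()
```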
Note that even though we baseline the returns we are still normalizing the advantages after that. This effectively means that we are using a different baseline and not simply $V^\pi(s_t)$.
ONLINE ADVANTAGE ACTOR-CRITIC⌗
Now that we have a value network we can actually compute the return by bootstrapping instead of using a Monte-Carlo estimate:

$$\hat{G}_t = r_t + \gamma V^\pi(s_{t+1})$$
This is the so-called actor-critic setup, where we have an actor (the policy network) that selects actions to perform the rollout and a critic (the value network) that is used to compute the returns, i.e. it grades the performance.
Combining the actor-critic with a baseline we get the advantage actor-critic (A2C) [2]. The formula for the gradient now becomes:

$$\nabla_\theta J(\theta) \approx \frac{1}{T} \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t$$

Here $\hat{A}_t = r_t + \gamma V^\pi(s_{t+1}) - V^\pi(s_t)$ is the advantage computed with the bootstrapped return.
With this setup we actually don’t have to run episodes until the end. In fact we can update the policy (and the value network) at every single step. But that is not a very good idea because we will be estimating the gradient using a single sample. Instead, what we could do is:
- either run multiple environments in parallel in order to obtain multiple samples at every step,
- or rollout the episode for several steps and only then perform the update.
We will actually do both.
We will now assume that `env` is a vectorized environment. The environment loop will run multiple iterations and on each iteration it will roll out the policy for a fixed number of steps. Note that vectorized environments will auto-reset any sub-environment that is terminated or truncated. This means that a given segment of experiences might contain data from multiple episodes, and for this reason we will use the `done` tensor to indicate which of the states are terminal states.
Once we update, we continue stepping the vectorized environment from where we left off, i.e. the environment is not reset at the beginning of the new iteration. This is actually very helpful in the case of long horizon tasks where episodes could be extremely long (think 100K steps).
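A sketch of this vectorized environment loop, assuming a `gym.vector`-style environment, the same PyTorch policy as before, and an `update` callable that already captures the networks and optimizers, could look like this:

```python
import torch

def environment_loop(env, policy, update, num_iterations, rollout_steps=128):
    """Roll out a vectorized env for a fixed number of steps per iteration."""
    obs, _ = env.reset()
    for _ in range(num_iterations):
        obs_buf, act_buf, rew_buf, done_buf = [], [], [], []
        for _ in range(rollout_steps):
            with torch.no_grad():
                logits = policy(torch.as_tensor(obs, dtype=torch.float32))
                actions = torch.distributions.Categorical(logits=logits).sample()
            next_obs, rewards, terminated, truncated, _ = env.step(actions.numpy())
            obs_buf.append(obs)
            act_buf.append(actions.numpy())
            rew_buf.append(rewards)
            # Terminal / truncated sub-environments are auto-reset by the
            # vectorized env, so we record where the episode boundaries are.
            done_buf.append(terminated | truncated)
            obs = next_obs
        # Update from the collected segment; pass the latest observations so
        # that the update can bootstrap from them. The environment is NOT
        # reset - the next iteration continues from where we left off.
        update(obs_buf, act_buf, rew_buf, done_buf, obs)
```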
The update function is not much different, but note that it now accepts additional parameters:
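A sketch of what this could look like is given below; the tensor shapes in the comments are assumptions, and the rollout buffers are assumed to have been stacked into tensors already:

```python
import torch
import torch.nn.functional as F

def update(policy, value_net, policy_opt, value_opt,
           obs, actions, rewards, done, last_obs, gamma=0.99):
    """Online A2C update with a one-step bootstrapped return (a sketch).

    obs: (T, N, obs_dim), actions/rewards/done: (T, N), last_obs: (N, obs_dim),
    where T is the segment length and N the number of parallel environments.
    """
    # Values of the "next" states. After a terminal step the next observation
    # is already the auto-reset one, but its value is masked out below anyway.
    with torch.no_grad():
        next_obs = torch.cat([obs[1:], last_obs.unsqueeze(0)], dim=0)
        next_values = value_net(next_obs).squeeze(-1)

    # One-step bootstrapped returns. If `done`, the reward is the full return
    # for that state - no bootstrapping.
    returns = rewards + gamma * (1.0 - done.float()) * next_values

    values = value_net(obs).squeeze(-1)
    advantages = returns - values.detach()
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

    logp = torch.distributions.Categorical(logits=policy(obs)).log_prob(actions)
    policy_loss = -(logp * advantages).mean()
    policy_opt.zero_grad()
    policy_loss.backward()
    policy_opt.step()

    value_loss = F.mse_loss(values, returns)
    value_opt.zero_grad()
    value_loss.backward()
    value_opt.step()
```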
Here we can see how we are actually using the information provided by `done`: in case our agent steps a sub-environment from its current state into a terminal state, then we should treat the obtained reward as the true return for that state - no bootstrapping.
MULTI-STEP A2C⌗
The A2C algorithm provides additional variance reduction by bootstrapping the estimate of the return, but it also adds bias to the estimate. While the vanilla policy gradient provides an unbiased estimator of the gradient, the A2C estimator is biased.
With the current setup there is one very obvious improvement we could make to reduce the bias of the gradient estimate. Note that we are rolling out the policy for multiple steps, but we are computing the return using a single-step bootstrap. What we could do instead is use an n-step bootstrap estimate: the return for each state will be computed by summing all the rewards along the current trajectory and only at the end we will bootstrap:

$$\hat{G}_t = r_t + \gamma r_{t+1} + \dots + \gamma^{n-1} r_{t+n-1} + \gamma^{n} V^\pi(s_{t+n})$$
The computation of the returns might seem a bit confusing at first, because it runs the loop in reverse, starting from the end. We only need to bootstrap at the end for non-terminal states, and then we just keep adding the rewards from the previous steps to the current estimate of the return. If at some point we reach a terminal state, then the return is simply the obtained reward - no bootstrapping.
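In code, this reversed loop could look roughly as follows; `last_values` stands for the critic's estimates of the observations after the final step of the segment (an assumed name):

```python
import torch

def compute_returns(rewards, done, last_values, gamma=0.99):
    """N-step bootstrapped returns, computed by iterating the segment in reverse.

    rewards, done: (T, N) tensors for a rollout segment of T steps and N envs;
    last_values: (N,) value estimates of the observations after the final step.
    """
    returns = torch.zeros_like(rewards)
    # Bootstrap at the very end (masked out below for terminal states).
    running = last_values
    for t in reversed(range(rewards.shape[0])):
        # At a terminal state the return is just the reward - no bootstrapping
        # and nothing carried over from the (auto-reset) next episode.
        running = rewards[t] + gamma * (1.0 - done[t].float()) * running
        returns[t] = running
    return returns
```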
GAE⌗
Note that estimating the return using an n-step bootstrap reduces the bias, but
increases the variance of the estimate. There is an obvious trade-off between
bias and variance when choosing how large $n$ should be.
An effective approach to reduce the variance while controlling the trade-off is Generalized Advantage Estimation (GAE) [3]. The formula is very similar to the n-step return, but instead of fixing a single $n$, it takes an exponentially-weighted average of all n-step returns:

$$\hat{G}_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} \hat{G}_t^{(n)}$$

Here $\hat{G}_t^{(n)}$ denotes the n-step return from above, and $\lambda \in [0, 1]$ controls the bias-variance trade-off: $\lambda = 0$ recovers the one-step bootstrap and $\lambda = 1$ recovers the Monte-Carlo return.

To get the generalized advantage estimator we simply subtract the value of the current state:

$$\hat{A}_t = \hat{G}_t^{\lambda} - V^\pi(s_t)$$

Let us define the temporal-difference residual

$$\delta_t = r_t + \gamma V^\pi(s_{t+1}) - V^\pi(s_t).$$

Finally, for the GAE we get:

$$\hat{A}_t^{GAE(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma \lambda)^{l}\, \delta_{t+l}$$
Again, I am glossing over the details of how this formula is derived, and why it works, but if you want to read more I suggest checking out this blog post.
To use GAE in our A2C agent we simply have to modify the way we estimate the advantages:
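One way to sketch this computation, with `lam` standing in for the GAE parameter $\lambda$, is:

```python
import torch

def compute_gae(rewards, values, done, last_values, gamma=0.99, lam=0.95):
    """GAE(gamma, lambda) over a rollout segment (a sketch).

    rewards, done, values: (T, N) tensors, with values[t] = V(s_t);
    last_values: (N,) value estimates of the observations after the final step.
    """
    advantages = torch.zeros_like(rewards)
    not_done = 1.0 - done.float()
    running = torch.zeros_like(last_values)
    next_values = last_values
    for t in reversed(range(rewards.shape[0])):
        # TD residual; bootstrap only on non-terminal states.
        delta = rewards[t] + gamma * not_done[t] * next_values - values[t]
        # Running sum of deltas, decayed by gamma * lambda; reset at terminal states.
        running = delta + gamma * lam * not_done[t] * running
        advantages[t] = running
        next_values = values[t]
    # The returns used as value-function targets are recovered from the advantages.
    returns = advantages + values
    return advantages, returns
```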
Again, the computation of the advantages might seem a bit confusing, but it follows a simple logic:
- bootstrap at the end for non-terminal states,
- at each step compute the current $\delta_t$, making sure not to bootstrap on terminal states,
- compute the advantage by adding $\delta_t$ to the running sum of deltas decayed by $\gamma \lambda$. Again, make sure not to add anything if at a terminal state.
Since we are directly computing the advantages, the returns are actually derived by reversing the equation:

$$\hat{G}_t = \hat{A}_t + V^\pi(s_t)$$
PPO⌗
Note that all of the policy updates that we did until now are simple "vanilla" updates: we simply update the policy once, using the estimate of the gradient, and then we discard the collected data. However, there is nothing stopping us from using a more sophisticated update algorithm. In fact, for the experiments in the GAE paper the authors use the TRPO [4] update rule.
(Not really shocking news, given that both TRPO and GAE are authored by the same people.)
Here we will take a look at PPO [5]. The problem the authors are trying to solve is to come up with an algorithm that allows us to take the biggest possible update step on the policy parameters before throwing out the collected rollout data. The approach taken in the paper is to allow for multiple update steps that, when combined, approximate this maximum possible update.
Note that after we update the policy parameters once, the data in our buffer was collected with the old policy $\pi_{\theta_{\text{old}}}$ and is no longer on-policy for the updated policy $\pi_\theta$.

It seems like we could compute the objective to update the new policy weights using the data collected with the old policy weights, as long as we correct with the importance sampling weight:

$$L(\theta) = \mathbb{E}\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}\, \hat{A}_t\right]$$
However, note that, in order to compute the correct gradient estimate, the states and actions have to be sampled under the current policy $\pi_\theta$, while our data was collected with $\pi_{\theta_{\text{old}}}$. The importance weight corrects the action distribution at each visited state, but the visited states themselves still come from the old policy.

How bad is that?

It turns out that if $\pi_\theta$ stays close to $\pi_{\theta_{\text{old}}}$, the mismatch is small and the surrogate objective above remains a good local approximation of the true objective.
There are two different proximal policy algorithms, each using a different heuristic to try to ensure that $\pi_\theta$ does not move too far away from $\pi_{\theta_{\text{old}}}$:
PPO-Penalty - constrains the KL divergence between the two distributions by adding it as a penalty to the objective:

$$L(\theta) = \mathbb{E}\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}\, \hat{A}_t - \beta\, \mathrm{KL}\!\left(\pi_{\theta_{\text{old}}}(\cdot \mid s_t) \,\|\, \pi_\theta(\cdot \mid s_t)\right)\right]$$
PPO-CLIP - clips the objective function if the probability ratio $r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$ deviates too much from $1$:

$$L(\theta) = \mathbb{E}\left[\min\!\left(r_t(\theta)\, \hat{A}_t,\ \operatorname{clip}\!\left(r_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon\right) \hat{A}_t\right)\right]$$
The algorithm implemented here is PPO-CLIP augmented with a check for early stopping. We will split the collected rollout data into mini-batches and we will iterate for several epochs over the batches. At the end of every epoch we will check the KL divergence between the original policy $\pi_{\theta_{\text{old}}}$ and the updated policy $\pi_\theta$, and if it exceeds a given threshold we stop updating and move on to collecting fresh rollout data.
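A rough sketch of such an update is given below; the clipping threshold, number of epochs, mini-batch size, and KL threshold are assumed hyper-parameters, not necessarily the values used in the actual experiments:

```python
import torch

def ppo_update(policy, value_net, policy_opt, value_opt,
               obs, actions, old_logp, old_values, advantages, returns,
               clip_eps=0.2, epochs=4, batch_size=64, target_kl=0.015):
    """PPO-CLIP update with value clipping and KL-based early stopping (a sketch)."""
    N = obs.shape[0]  # rollout data flattened to (N, ...) over steps and envs
    for _ in range(epochs):
        perm = torch.randperm(N)
        for start in range(0, N, batch_size):
            idx = perm[start:start + batch_size]

            # Normalize the advantages at the mini-batch level.
            mb_adv = advantages[idx]
            mb_adv = (mb_adv - mb_adv.mean()) / (mb_adv.std() + 1e-8)

            # Clipped surrogate objective for the policy.
            logp = torch.distributions.Categorical(
                logits=policy(obs[idx])).log_prob(actions[idx])
            ratio = torch.exp(logp - old_logp[idx])
            surr1 = ratio * mb_adv
            surr2 = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * mb_adv
            policy_loss = -torch.min(surr1, surr2).mean()
            policy_opt.zero_grad()
            policy_loss.backward()
            policy_opt.step()

            # Clipped value loss.
            values = value_net(obs[idx]).squeeze(-1)
            clipped = old_values[idx] + torch.clamp(values - old_values[idx],
                                                    -clip_eps, clip_eps)
            value_loss = torch.max((values - returns[idx]) ** 2,
                                   (clipped - returns[idx]) ** 2).mean()
            value_opt.zero_grad()
            value_loss.backward()
            value_opt.step()

        # Early stopping: estimate KL(pi_old || pi) from the sampled actions.
        with torch.no_grad():
            new_logp = torch.distributions.Categorical(
                logits=policy(obs)).log_prob(actions)
            kl = (old_logp - new_logp).mean()
        if kl > target_kl:
            break
```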
Note that the advantages are normalized at the mini-batch level, at the beginning of every iteration, before computing the loss. This should come as no surprise because, as we mentioned earlier, the goal of this normalization is to stabilize the gradient updates of the neural network. It has nothing to do with the baseline for the return.
In addition to clipping the objective for the policy, we are also clipping the value function loss before updating the parameters of the value network. This approach was first introduced in the GAE paper [3] (see Section 5), and was later also applied in the official PPO implementation.
Finally, to compute the KL divergence between the new and the old policy we use the following equality:

$$\mathrm{KL}\!\left(\pi_{\theta_{\text{old}}} \,\|\, \pi_\theta\right) = \mathbb{E}_{a \sim \pi_{\theta_{\text{old}}}}\left[\log \pi_{\theta_{\text{old}}}(a \mid s) - \log \pi_\theta(a \mid s)\right]$$
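Since the actions in the rollout were sampled under $\pi_{\theta_{\text{old}}}$, this expectation can be approximated directly from the stored log-probabilities, e.g.:

```python
import torch

def approx_kl(old_logp: torch.Tensor, new_logp: torch.Tensor) -> torch.Tensor:
    """Sample-based estimate of KL(pi_old || pi) from stored log-probabilities.

    Both tensors hold log-probabilities of the *same* rollout actions:
    old_logp under the data-collecting policy, new_logp under the updated one.
    """
    return (old_logp - new_logp).mean()
```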
[1] "The optimal reward baseline for gradient-based reinforcement learning" by Lex Weaver, Nigel Tao, 2013.
[2] "Asynchronous methods for deep reinforcement learning" by Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, Koray Kavukcuoglu, 2016.
[3] "High-dimensional continuous control using generalized advantage estimation" by John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, Pieter Abbeel, 2015.
[4] "Trust Region Policy Optimization" by John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, Pieter Abbeel, 2015.
[5] "Proximal Policy Optimization Algorithms" by John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov, 2017.