This adds support for the V-MPO algorithm, which was published at ICLR 2020 by DeepMind (https://openreview.net/forum?id=SylOlp4FvH). As far as I know, this is the first public implementation. Unfortunately, my computational budget is very constrained: I was only able to make one run on Ant-v3, for a bit over 1 billion steps. It matches the results in the paper very closely.
However, the paper was a bit unclear on some aspects:
The batch size and model architecture were missing for the Ant environment. I've used a batch size of 64 and a simple feed-forward architecture with separate policy and value networks. I think using separate networks is quite important, but I would like to test this again. Furthermore, one of the authors told me to use tanh(layer_norm(obs)) for the input, which improved performance a bit.
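For reference, here is a minimal sketch of that input transform (the module name and structure are my own, not taken from the paper or from the code in this PR):

```python
import torch
import torch.nn as nn

class NormalizedObsInput(nn.Module):
    """tanh(layer_norm(obs)) applied before the policy and value MLPs."""

    def __init__(self, obs_dim):
        super().__init__()
        self.layer_norm = nn.LayerNorm(obs_dim)

    def forward(self, obs):
        # Layer norm puts the observation dimensions on a common scale,
        # tanh then bounds the result to (-1, 1).
        return torch.tanh(self.layer_norm(obs))
```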
The paper also didn't mention how the Gaussian was parameterized. One of the authors said they used the mean and a diagonal covariance with a softmax. I have also tried the existing rlpyt implementation, where the network outputs log_std and the std is obtained with exp; however, this seems to be slightly worse. Both versions exist in this pull request; some lines have to be (un)commented in the model, the agent, and continuous_action_loss in the algo.
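To make the difference concrete, here is a rough sketch of the two std parameterizations (the function names are mine, and the softmax version is just how I read the author's description; in the PR the choice is made by (un)commenting lines):

```python
import torch

def std_from_softmax(raw_std):
    # Reported by one of the authors: a softmax over the raw network outputs
    # gives the diagonal std (entries are positive and sum to one).
    return torch.softmax(raw_std, dim=-1)

def std_from_log_std(log_std):
    # Existing rlpyt-style parameterization: the network outputs log_std
    # and the std is recovered with exp.
    return torch.exp(log_std)
```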
Here are some things that are still a bit ugly:
The number of input_features in the PopArt layer has to be changed manually to match the size of the last hidden layer of the value network.
I am using the Normal distribution from torch directly in order to use the softmax for the std. Maybe the Gaussian distribution in rlpyt should have an option for softmax vs. exp for the std, and it could also implement the decoupled KL used in V-MPO. This would simplify the code in the V-MPO loss function (a sketch of the decoupled KL follows below).
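For reference, a rough sketch of the decoupled KL for diagonal Gaussians as defined in the MPO/V-MPO papers (the function name and the reduction over action dimensions are my choices, not taken from this PR's code):

```python
import torch

def decoupled_gaussian_kl(mu_old, std_old, mu_new, std_new):
    """Decoupled KL(pi_old || pi_new) for diagonal Gaussians.

    Returns the mean term (computed with the old covariance) and the
    covariance term separately, so each can get its own constraint.
    """
    var_old, var_new = std_old.pow(2), std_new.pow(2)
    d = mu_old.shape[-1]
    kl_mu = 0.5 * ((mu_new - mu_old).pow(2) / var_old).sum(dim=-1)
    kl_sigma = 0.5 * ((var_old / var_new).sum(dim=-1) - d
                      + torch.log(var_new / var_old).sum(dim=-1))
    return kl_mu, kl_sigma
```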
This version uses the MinibatchRlEval runner, and it has to use a very large batch_B in order to mimic the T_target_steps of the asynchronous version described in the paper. For Ant-v3 this means 6400 environments, which use about 40 GB of RAM. This problem was also mentioned in #193.
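For illustration, this is roughly the kind of launch configuration I mean. The sampler and runner wiring is standard rlpyt usage, but the V-MPO agent/algo names are placeholders for the classes added in this PR, and batch_T and the eval settings are illustrative rather than the values from my run:

```python
from rlpyt.samplers.parallel.cpu.sampler import CpuSampler
from rlpyt.runners.minibatch_rl import MinibatchRlEval
from rlpyt.envs.gym import gym_make

sampler = CpuSampler(
    EnvCls=gym_make,
    env_kwargs=dict(id="Ant-v3"),
    batch_T=40,          # steps per environment between updates (illustrative)
    batch_B=6400,        # environments; this is what uses the ~40 GB of RAM
    eval_env_kwargs=dict(id="Ant-v3"),
    eval_n_envs=4,
    eval_max_steps=10_000,
)
# agent = VmpoAgent(...)   # placeholder names; see the classes added in this PR
# algo = VMPO(...)
# runner = MinibatchRlEval(algo=algo, agent=agent, sampler=sampler,
#                          n_steps=int(1e9), log_interval_steps=int(1e6))
# runner.train()
```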
I have also developed an asynchronous version of V-MPO. However, the problem there is synchronizing the weights at the right time: currently, the weights are updated slightly after the sampler has started to sample for the next T_target_steps. Maybe the sampler should update its weights more often.
Here is the average return, with the number of environment steps on the x-axis. This is on Ant-v3 instead of Ant-v1 as in the paper; however, in my experience they behave almost the same.