
Add V-MPO #194

Open
wants to merge 1 commit into master
Conversation

AlexanderKoch-Koch (Contributor)
This adds support for the V-MPO algorithm, which was published at ICLR 2020 by DeepMind (https://openreview.net/forum?id=SylOlp4FvH). As far as I know, this is the first public implementation. Unfortunately, my computational budget is very constrained, so I was only able to do a single run on Ant-v3 for a bit over 1 billion steps. It matches the results in the paper very closely.
However, the paper was a bit unclear on some aspects:

  1. The batch size and model architecture were not given for the Ant environment. I've used a batch size of 64 and a simple feed-forward architecture with separate policy and value networks. I think using separate networks is quite important, but I would like to test this again. Furthermore, one of the authors told me to use tanh(layer_norm(obs)) for the input, which improved performance a bit (see the first sketch after this list).

  2. The paper didn't mention how the Gaussian parameters were produced. One of the authors said they used the mean and a diagonal covariance with a softmax. I have also tried the existing rlpyt approach with a log_std network output and exp for the std, but this seems to be slightly worse. Both versions exist in this pull request; some lines have to be (un)commented in the model, the agent, and continuous_action_loss in the algo (see the second sketch after this list).
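
For illustration, here is a minimal sketch of the architecture described in point 1, with a tanh(layer_norm(obs)) input transform and separate feed-forward policy and value networks. The class name, layer sizes, and heads are illustrative assumptions, not the exact code in this PR:

```python
import torch
import torch.nn as nn


class SeparatePiVModel(nn.Module):
    """Illustrative model: tanh(layer_norm(obs)) input, separate policy and value MLPs."""

    def __init__(self, obs_dim, action_dim, hidden_size=256):
        super().__init__()
        self.obs_norm = nn.LayerNorm(obs_dim)  # layer norm over the raw observation
        self.pi_mlp = nn.Sequential(
            nn.Linear(obs_dim, hidden_size), nn.ReLU(),
            nn.Linear(hidden_size, hidden_size), nn.ReLU(),
        )
        self.mu_head = nn.Linear(hidden_size, action_dim)    # Gaussian mean
        self.std_head = nn.Linear(hidden_size, action_dim)   # raw std parameters (see point 2)
        self.v_mlp = nn.Sequential(
            nn.Linear(obs_dim, hidden_size), nn.ReLU(),
            nn.Linear(hidden_size, hidden_size), nn.ReLU(),
            nn.Linear(hidden_size, 1),                        # the PopArt layer replaces this head
        )

    def forward(self, obs):
        x = torch.tanh(self.obs_norm(obs))   # tanh(layer_norm(obs)) input transform
        pi_features = self.pi_mlp(x)
        mu = self.mu_head(pi_features)
        raw_std = self.std_head(pi_features)
        value = self.v_mlp(x).squeeze(-1)
        return mu, raw_std, value
```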

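And a second sketch of the two std parameterizations from point 2, softmax of the network output vs. exp of a log_std output; the exact placement of the softmax in the PR code may differ:

```python
import torch
import torch.nn.functional as F
from torch.distributions import Normal


def make_policy_distribution(mu, raw_std, use_softmax_std=True):
    """Build the diagonal Gaussian policy from the network outputs.

    use_softmax_std=True:  std = softmax(raw_std) over the action dimension.
    use_softmax_std=False: raw_std is treated as log_std and std = exp(log_std),
                           as in the existing rlpyt Gaussian distribution.
    """
    if use_softmax_std:
        std = F.softmax(raw_std, dim=-1)
    else:
        std = torch.exp(raw_std)
    return Normal(loc=mu, scale=std)
```
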
Here is some stuff that is still a bit ugly:
The number of input_features in the PopArt layer has to be changed manually to match the size of the last hidden layer of the value network.
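
For context, a rough sketch of what deriving input_features from the value-network configuration could look like; the PopArtLayer shown here is a simplified stand-in (the statistics update and weight rescaling are omitted), not the layer in this PR:

```python
import torch
import torch.nn as nn


class PopArtLayer(nn.Module):
    """Simplified stand-in: linear value head plus running return statistics
    (the statistics update and weight rescaling are omitted here)."""

    def __init__(self, input_features):
        super().__init__()
        self.linear = nn.Linear(input_features, 1)
        self.register_buffer("mu", torch.zeros(1))     # running mean of returns
        self.register_buffer("sigma", torch.ones(1))   # running std of returns

    def forward(self, features):
        normalized_value = self.linear(features).squeeze(-1)
        return normalized_value, normalized_value * self.sigma + self.mu


# Deriving input_features from the value-network config instead of editing it by hand
# (value_hidden_sizes is a hypothetical config entry):
value_hidden_sizes = [256, 256]
popart = PopArtLayer(input_features=value_hidden_sizes[-1])
```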

I am using the Normal distribution from torch directly in order to use the softmax for the std. Maybe the Gaussian distribution in rlpyt should have an option for softmax vs. exp for the std, and it could also implement the decoupled KL used in V-MPO. This would simplify the code in the V-MPO loss function.
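
For reference, the decoupled KL from the V-MPO paper for a diagonal Gaussian can be computed directly with torch.distributions; a sketch of what such an option could expose (not the code in this PR):

```python
import torch
from torch.distributions import Normal, kl_divergence


def decoupled_kl(mu_old, std_old, mu_new, std_new):
    """Decoupled KL(pi_old || pi_new) for a diagonal Gaussian, as in V-MPO:
    one term for the mean with the std held at its old value, and one term
    for the std with the mean held at its old value. Each term can then get
    its own epsilon / Lagrange multiplier in the constraint."""
    kl_mean = kl_divergence(Normal(mu_old, std_old), Normal(mu_new, std_old)).sum(dim=-1)
    kl_std = kl_divergence(Normal(mu_old, std_old), Normal(mu_old, std_new)).sum(dim=-1)
    return kl_mean, kl_std
```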

This version uses the MinibatchRlEval runner, and it has to use a very large batch_B in order to mimic the T_target_steps of the asynchronous version described in the paper. For Ant-v3 this means 6400 environments, which uses about 40 GB of RAM. This problem was also mentioned in #193.
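
To illustrate how batch_T * batch_B stands in for T_target_steps, a rough configuration sketch; the commented-out VMPO/VmpoAgent names are placeholders for the classes added in this PR, the other arguments are illustrative, and in practice a parallel sampler would replace SerialSampler:

```python
from rlpyt.samplers.serial.sampler import SerialSampler  # a parallel sampler would be used in practice
from rlpyt.envs.gym import make as gym_make
from rlpyt.runners.minibatch_rl import MinibatchRlEval

batch_T = 40      # time steps per sampler iteration (illustrative)
batch_B = 6400    # parallel environments; batch_T * batch_B plays the role of T_target_steps

sampler = SerialSampler(
    EnvCls=gym_make,
    env_kwargs=dict(id="Ant-v3"),
    batch_T=batch_T,
    batch_B=batch_B,
    eval_n_envs=2,
    eval_env_kwargs=dict(id="Ant-v3"),
    eval_max_steps=int(10e3),
)

# runner = MinibatchRlEval(algo=VMPO(...), agent=VmpoAgent(...), sampler=sampler,
#                          n_steps=int(1e9), log_interval_steps=int(1e6))
```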

I have also developed an asynchronous version of V-MPO. However, the problem there is synchronizing the weights at the right time. Currently, the weights are updated slightly after the sampler has started to sample the next T_target_steps. Maybe the sampler should update its weights more often.



Here is the average return with the number of environment steps on the x-axis. This is on Ant-v3 instead of Ant-v1 as in the paper; however, in my experience they behave almost the same.
[Plot: Return_Average on Ant-v3 vs. environment steps]
