This adds support for the V-MPO algorithm, which was published at ICLR 2020 by DeepMind (https://openreview.net/forum?id=SylOlp4FvH). As far as I know, this is the first public implementation. Unfortunately, my computational budget is very constrained: I was only able to make one run on Ant-v3, for a bit over 1 billion steps. It matches the results in the paper very closely.
However, the paper was a bit unclear on some aspects:
The batch size and model architecture were missing for the Ant environment. I've used a batch size of 64 and a simple feed-forward architecture with separate policy and value networks. I think using separate networks is quite important, but I would like to test this again. Furthermore, one of the authors told me to use tanh(layer_norm(obs)) for the input, which improved performance a bit.
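For reference, here is a minimal sketch of that input transform (the module name and structure are my own, not taken from the paper or from the code in this PR):

```python
import torch
import torch.nn as nn

class NormalizedObsInput(nn.Module):
    """tanh(layer_norm(obs)) applied before the policy and value MLPs."""

    def __init__(self, obs_dim):
        super().__init__()
        self.layer_norm = nn.LayerNorm(obs_dim)

    def forward(self, obs):
        # Layer norm puts the observation dimensions on a common scale,
        # tanh then bounds the result to (-1, 1).
        return torch.tanh(self.layer_norm(obs))
```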
The paper also didn't mention how the Gaussian was parameterized. One of the authors said they used the mean and a diagonal covariance with a softmax. I have also tried the existing rlpyt implementation, where the network outputs log_std and the std is obtained with exp; however, this seems to be slightly worse. Both versions exist in this pull request; some lines have to be (un)commented in the model, the agent, and continuous_action_loss in the algo.
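To make the difference concrete, here is a rough sketch of the two std parameterizations (the function names are mine, and the softmax version is just how I read the author's description; in the PR the choice is made by (un)commenting lines):

```python
import torch

def std_from_softmax(raw_std):
    # Reported by one of the authors: a softmax over the raw network outputs
    # gives the diagonal std (entries are positive and sum to one).
    return torch.softmax(raw_std, dim=-1)

def std_from_log_std(log_std):
    # Existing rlpyt-style parameterization: the network outputs log_std
    # and the std is recovered with exp.
    return torch.exp(log_std)
```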
Here are some things that are still a bit ugly:
The number of input_features in the PopArt layer has to be changed manually to match the size of the last hidden layer of the value network.
I am using the Normal distribution from torch directly in order to use the softmax for the std. Maybe the Gaussian distribution in rlpyt should have an option for softmax vs. exp for the std, and it could also implement the decoupled KL used in V-MPO. This would simplify the code in the V-MPO loss function (a sketch of the decoupled KL follows below).
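For reference, a rough sketch of the decoupled KL for diagonal Gaussians as defined in the MPO/V-MPO papers (the function name and the reduction over action dimensions are my choices, not taken from this PR's code):

```python
import torch

def decoupled_gaussian_kl(mu_old, std_old, mu_new, std_new):
    """Decoupled KL(pi_old || pi_new) for diagonal Gaussians.

    Returns the mean term (computed with the old covariance) and the
    covariance term separately, so each can get its own constraint.
    """
    var_old, var_new = std_old.pow(2), std_new.pow(2)
    d = mu_old.shape[-1]
    kl_mu = 0.5 * ((mu_new - mu_old).pow(2) / var_old).sum(dim=-1)
    kl_sigma = 0.5 * ((var_old / var_new).sum(dim=-1) - d
                      + torch.log(var_new / var_old).sum(dim=-1))
    return kl_mu, kl_sigma
```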
This version uses the MinibatchRlEval runner, and it has to use a very large batch_B in order to mimic the T_target_steps of the asynchronous version described in the paper. For Ant-v3 this means 6400 environments, which use about 40 GB of RAM. This problem was also mentioned in #193.
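For illustration, this is roughly the kind of launch configuration I mean. The sampler and runner wiring is standard rlpyt usage, but the V-MPO agent/algo names are placeholders for the classes added in this PR, and batch_T and the eval settings are illustrative rather than the values from my run:

```python
from rlpyt.samplers.parallel.cpu.sampler import CpuSampler
from rlpyt.runners.minibatch_rl import MinibatchRlEval
from rlpyt.envs.gym import gym_make

sampler = CpuSampler(
    EnvCls=gym_make,
    env_kwargs=dict(id="Ant-v3"),
    batch_T=40,          # steps per environment between updates (illustrative)
    batch_B=6400,        # environments; this is what uses the ~40 GB of RAM
    eval_env_kwargs=dict(id="Ant-v3"),
    eval_n_envs=4,
    eval_max_steps=10_000,
)
# agent = VmpoAgent(...)   # placeholder names; see the classes added in this PR
# algo = VMPO(...)
# runner = MinibatchRlEval(algo=algo, agent=agent, sampler=sampler,
#                          n_steps=int(1e9), log_interval_steps=int(1e6))
# runner.train()
```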
I have also developed an asynchronous version of V-MPO. However, the problem there is synchronizing the weights at the right time: currently, the weights are updated slightly after the sampler has started to sample for the next T_target_steps. Maybe the sampler should update its weights more often.
Here is the average return, with the number of environment steps on the x-axis. This is on Ant-v3 instead of Ant-v1 as in the paper; however, in my experience they behave almost the same.