DDPG + HER - ParkingEnv-v0 #15

Closed
araffin opened this issue Apr 21, 2019 · 4 comments
@araffin
Contributor

araffin commented Apr 21, 2019

Hello,

I'm currently checking the performance of a new HER implementation for stable-baselines on ParkingEnv (see hill-a/stable-baselines#273), and I was wondering what hyperparameters you used for that environment?

In particular, how many steps did you train for, and what were the DDPG and HER hyperparameters (and which implementations)?
I'm also interested in knowing the best mean reward achieved in your experiment ;)

Currently, after 1e6 steps with default hyperparameters, normal action noise of std 0.15, and the 'future' goal selection strategy with k=4, I get a mean reward around 9.
The learned policy looks OK, but not as good as your result.
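
For reference, the setup described above corresponds roughly to the following (just a sketch; exact defaults on the dev branch may differ):

```python
# Rough sketch of the setup described above (DDPG + HER, normal action noise of std 0.15,
# 'future' goal selection with k=4); not an exact reproduction of the dev-branch defaults.
import gym
import highway_env  # noqa: F401  (registers highway-parking-v0 with gym)
import numpy as np

from stable_baselines import HER, DDPG
from stable_baselines.ddpg import NormalActionNoise

env = gym.make("highway-parking-v0")
n_actions = env.action_space.shape[0]
action_noise = NormalActionNoise(mean=np.zeros(n_actions),
                                 sigma=0.15 * np.ones(n_actions))

model = HER('MlpPolicy', env, DDPG,
            n_sampled_goal=4,  # k = 4
            goal_selection_strategy='future',
            action_noise=action_noise, verbose=1)
model.learn(int(1e6))
```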

PS: It seems that you are using a deprecated feature of gym, but I can open another issue for that.
The warning:

warnings.warn("DEPRECATION WARNING wrapper_config.TimeLimit has been deprecated. 
Replace any calls to `register(tags={'wrapper_config.TimeLimit.max_episode_steps': 200)}`
with `register(max_episode_steps=200)`.
This change was made 2017/1/31 and is included in gym version 0.8.0.
If you are getting many of these warnings, you may need to update switch from universe 0.21.3 to retro (https://github.com/openai/retro)")
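
For reference, this is the kind of change the warning is asking for (just an illustration; the entry point and step limit below are placeholders, not necessarily what highway-env actually registers):

```python
from gym.envs.registration import register

# Deprecated style (pre gym 0.8.0):
# register(id='highway-parking-v0',
#          entry_point='highway_env.envs:ParkingEnv',
#          tags={'wrapper_config.TimeLimit.max_episode_steps': 200})

# Current style:
register(id='highway-parking-v0',
         entry_point='highway_env.envs:ParkingEnv',
         max_episode_steps=200)
```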
@eleurent
Collaborator

Hi Antonin,
I'm using the original https://github.com/openai/baselines implementation, with default hyperparameters for both HER and DDPG, and num_timesteps=1e4.
You can try the test script at /scripts/baselines_run.py.

Here is a sample output, listing the default hyperparameters and training stats:

```
{'load_path': '~/models/latest', 'network': 'default'}
T: 20
_Q_lr: 0.001
_action_l2: 1.0
_batch_size: 256
_buffer_size: 1000000
_clip_obs: 200.0
_hidden: 256
_layers: 3
_max_u: 1.0
_network_class: baselines.her.actor_critic:ActorCritic
_norm_clip: 5
_norm_eps: 0.01
_pi_lr: 0.001
_polyak: 0.95
_relative_goals: False
_scope: ddpg
aux_loss_weight: 0.0078
bc_loss: 0
ddpg_params: {'buffer_size': 1000000, 'hidden': 256, 'layers': 3, 'network_class': 'baselines.her.actor_critic:ActorCritic', 'polyak': 0.95, 'batch_size': 256, 'Q_lr': 0.001, 'pi_lr': 0.001, 'norm_eps': 0.01, 'norm_clip': 5, 'max_u': 1.0, 'action_l2': 1.0, 'clip_obs': 200.0, 'scope': 'ddpg', 'relative_goals': False}
demo_batch_size: 128
env_name: highway-parking-v0
gamma: 0.95
make_env: <function prepare_params.<locals>.make_env at 0x00000182FE639598>
n_batches: 40
n_cycles: 50
n_test_rollouts: 10
noise_eps: 0.2
num_demo: 100
prm_loss_weight: 0.001
q_filter: 0
random_eps: 0.3
replay_k: 4
replay_strategy: future
rollout_batch_size: 1
test_with_polyak: False

*** Warning ***
You are running HER with just a single MPI worker. This will work, but the experiments that we report in Plappert et al. (2018, https://arxiv.org/abs/1802.09464) were obtained with --num_cpu 19. This makes a significant difference and if you are looking to reproduce those results, be aware of this. Please also refer to https://github.com/openai/baselines/issues/314 for further details.
****************

Creating a DDPG agent with action space 2 x 1.0...
Training...
--------------------------------------
| epoch              | 0             |
| stats_g/mean       | -0.0012410134 |
| stats_g/std        | 0.071867436   |
| stats_o/mean       | -0.0004943212 |
| stats_o/std        | 0.063166946   |
| test/episode       | 10.0          |
| test/mean_Q        | -2.371805     |
| test/success_rate  | 0.4           |
| train/episode      | 50.0          |
| train/success_rate | 0.0           |
--------------------------------------
--------------------------------------
| epoch              | 1             |
| stats_g/mean       | -0.0012013107 |
| stats_g/std        | 0.072002545   |
| stats_o/mean       | -0.0004833463 |
| stats_o/std        | 0.06337828    |
| test/episode       | 20.0          |
| test/mean_Q        | -2.3061156    |
| test/success_rate  | 0.3           |
| train/episode      | 100.0         |
| train/success_rate | 0.04          |
--------------------------------------
---------------------------------------
| epoch              | 2              |
| stats_g/mean       | -0.0011480862  |
| stats_g/std        | 0.072064586    |
| stats_o/mean       | -0.00043134554 |
| stats_o/std        | 0.06350243     |
| test/episode       | 30.0           |
| test/mean_Q        | -2.392759      |
| test/success_rate  | 0.8            |
| train/episode      | 150.0          |
| train/success_rate | 0.0            |
---------------------------------------
---------------------------------------
| epoch              | 3              |
| stats_g/mean       | -0.0010096101  |
| stats_g/std        | 0.072232425    |
| stats_o/mean       | -0.00035306226 |
| stats_o/std        | 0.06374396     |
| test/episode       | 40.0           |
| test/mean_Q        | -1.9268663     |
| test/success_rate  | 0.6            |
| train/episode      | 200.0          |
| train/success_rate | 0.04           |
---------------------------------------
---------------------------------------
| epoch              | 4              |
| stats_g/mean       | -0.0011195819  |
| stats_g/std        | 0.072448455    |
| stats_o/mean       | -0.00047702927 |
| stats_o/std        | 0.06399239     |
| test/episode       | 50.0           |
| test/mean_Q        | -1.815284      |
| test/success_rate  | 0.8            |
| train/episode      | 250.0          |
| train/success_rate | 0.0            |
---------------------------------------
---------------------------------------
| epoch              | 5              |
| stats_g/mean       | -0.001369102   |
| stats_g/std        | 0.07245933     |
| stats_o/mean       | -0.00071890303 |
| stats_o/std        | 0.064110346    |
| test/episode       | 60.0           |
| test/mean_Q        | -1.8090712     |
| test/success_rate  | 1.0            |
| train/episode      | 300.0          |
| train/success_rate | 0.08           |
---------------------------------------
--------------------------------------
| epoch              | 6             |
| stats_g/mean       | -0.0013151062 |
| stats_g/std        | 0.07252834    |
| stats_o/mean       | -0.0006881503 |
| stats_o/std        | 0.0642819     |
| test/episode       | 70.0          |
| test/mean_Q        | -1.9746447    |
| test/success_rate  | 0.9           |
| train/episode      | 350.0         |
| train/success_rate | 0.06          |
--------------------------------------
--------------------------------------
| epoch              | 7             |
| stats_g/mean       | -0.0013556568 |
| stats_g/std        | 0.07271606    |
| stats_o/mean       | -0.0007181207 |
| stats_o/std        | 0.0645066     |
| test/episode       | 80.0          |
| test/mean_Q        | -2.0683777    |
| test/success_rate  | 1.0           |
| train/episode      | 400.0         |
| train/success_rate | 0.06          |
--------------------------------------
---------------------------------------
| epoch              | 8              |
| stats_g/mean       | -0.0014176448  |
| stats_g/std        | 0.07286355     |
| stats_o/mean       | -0.00080151745 |
| stats_o/std        | 0.06478638     |
| test/episode       | 90.0           |
| test/mean_Q        | -2.0549748     |
| test/success_rate  | 0.9            |
| train/episode      | 450.0          |
| train/success_rate | 0.02           |
---------------------------------------
--------------------------------------
| epoch              | 9             |
| stats_g/mean       | -0.0013835019 |
| stats_g/std        | 0.07291278    |
| stats_o/mean       | -0.0008103596 |
| stats_o/std        | 0.064902104   |
| test/episode       | 100.0         |
| test/mean_Q        | -1.9726061    |
| test/success_rate  | 1.0           |
| train/episode      | 500.0         |
| train/success_rate | 0.04          |
--------------------------------------
```

Does that help?

Thanks for letting me know about that deprecation warning, I'm probably still using an old version of gym >_<

@araffin
Contributor Author

araffin commented Apr 22, 2019

Ok, thanks, this should be enough ;). And how many workers did you use?

EDIT: apparently it was one (I just saw that in the logs).

@araffin
Contributor Author

araffin commented Apr 23, 2019

Hi @eleurent,

Thanks for the hyperparameters, I'm getting much better results now, even with SAC (training still in progress, but it looks much better than before, with a training success rate around 20% after 3e5 steps, which corresponds to a mean training episode reward of -5.8).
I think I will close this issue soon then ;)

EDIT: I updated the network architecture and it converges much faster now (train success rate of 13% in 5e4 steps).

@araffin
Contributor Author

araffin commented Apr 23, 2019

Update: I managed to reproduce your results using (on my dev branch HER-2):

```python
import time

import gym
import highway_env
import numpy as np

from stable_baselines import HER, SAC, DDPG
from stable_baselines.ddpg import NormalActionNoise

env = gym.make("highway-parking-v0")

n_actions = env.action_space.shape[0]
noise_std = 0.2
action_noise = NormalActionNoise(mean=np.zeros(n_actions), sigma=noise_std * np.ones(n_actions))

n_sampled_goal = 4

# SAC hyperparams:
model = HER('MlpPolicy', env, SAC, n_sampled_goal=n_sampled_goal,
            goal_selection_strategy='future',
            verbose=1, buffer_size=int(1e6),
            learning_rate=1e-3,
            gamma=0.95, batch_size=256,
            policy_kwargs=dict(layers=[256, 256, 256]))

# DDPG Hyperparams:
# NOTE: it works even without action noise
# model = HER('MlpPolicy', env, DDPG, n_sampled_goal=n_sampled_goal,
#             goal_selection_strategy='future',
#             verbose=1, buffer_size=int(1e6),
#             actor_lr=1e-3, critic_lr=1e-3, action_noise=action_noise,
#             gamma=0.95, batch_size=256,
#             policy_kwargs=dict(layers=[256, 256, 256]))


model.learn(int(2e5))
model.save('sac_her_{}'.format(int(time.time())))
```
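
To replay the saved policy afterwards, a minimal sketch (the filename is a placeholder for whatever model.save() produced above):

```python
# Load the saved agent and run it in the environment.
from stable_baselines import HER

model = HER.load('sac_her_parking', env=env)  # placeholder path

obs = env.reset()
for _ in range(1000):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)
    env.render()
    if done:
        obs = env.reset()
```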

Closing this issue then, thanks for the help.
