DDPG + HER - ParkingEnv-v0 #15

Closed
araffin opened this issue Apr 21, 2019 · 4 comments
@araffin
Contributor

araffin commented Apr 21, 2019

Hello,

I'm currently checking the performance of a new HER implementation for stable-baselines on ParkingEnv (see hill-a/stable-baselines#273), and I was wondering what hyperparameters you used for that environment?

In particular, how many steps did you train for, and what were the DDPG and HER hyperparameters (and which implementations)?
I'm also interested in knowing the best mean reward achieved in your experiment ;)

Currently, after 1e6 steps with default hyperparameters, normal action noise of std 0.15, and the 'future' goal selection strategy with k=4, I get a mean reward around 9.
The learned policy looks OK, but not as good as your result.
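
For reference, the setup described above corresponds roughly to the following (just a sketch; exact defaults on the dev branch may differ):

```python
# Rough sketch of the setup described above (DDPG + HER, normal action noise of std 0.15,
# 'future' goal selection with k=4); not an exact reproduction of the dev-branch defaults.
import gym
import highway_env  # noqa: F401  (registers highway-parking-v0 with gym)
import numpy as np

from stable_baselines import HER, DDPG
from stable_baselines.ddpg import NormalActionNoise

env = gym.make("highway-parking-v0")
n_actions = env.action_space.shape[0]
action_noise = NormalActionNoise(mean=np.zeros(n_actions),
                                 sigma=0.15 * np.ones(n_actions))

model = HER('MlpPolicy', env, DDPG,
            n_sampled_goal=4,  # k = 4
            goal_selection_strategy='future',
            action_noise=action_noise, verbose=1)
model.learn(int(1e6))
```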

PS: It seems that you are using a deprecated feature of gym, but I can open another issue for that.
The warning:

warnings.warn("DEPRECATION WARNING wrapper_config.TimeLimit has been deprecated. 
Replace any calls to `register(tags={'wrapper_config.TimeLimit.max_episode_steps': 200)}`
with `register(max_episode_steps=200)`.
This change was made 2017/1/31 and is included in gym version 0.8.0.
If you are getting many of these warnings, you may need to update switch from universe 0.21.3 to retro (https://github.com/openai/retro)")
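
For reference, this is the kind of change the warning is asking for (just an illustration; the entry point and step limit below are placeholders, not necessarily what highway-env actually registers):

```python
from gym.envs.registration import register

# Deprecated style (pre gym 0.8.0):
# register(id='highway-parking-v0',
#          entry_point='highway_env.envs:ParkingEnv',
#          tags={'wrapper_config.TimeLimit.max_episode_steps': 200})

# Current style:
register(id='highway-parking-v0',
         entry_point='highway_env.envs:ParkingEnv',
         max_episode_steps=200)
```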
@eleurent
Collaborator

Hi Antonin,
I'm using the original https://github.com/openai/baselines implementation, with default hyperparameters for both HER and DDPG, and num_timesteps=1e4.
You can try the test script at /scripts/baselines_run.py.

Here is a sample output, listing the default hyperparameters and training stats:

```
{'load_path': '~/models/latest', 'network': 'default'}
T: 20
_Q_lr: 0.001
_action_l2: 1.0
_batch_size: 256
_buffer_size: 1000000
_clip_obs: 200.0
_hidden: 256
_layers: 3
_max_u: 1.0
_network_class: baselines.her.actor_critic:ActorCritic
_norm_clip: 5
_norm_eps: 0.01
_pi_lr: 0.001
_polyak: 0.95
_relative_goals: False
_scope: ddpg
aux_loss_weight: 0.0078
bc_loss: 0
ddpg_params: {'buffer_size': 1000000, 'hidden': 256, 'layers': 3, 'network_class': 'baselines.her.actor_critic:ActorCritic', 'polyak': 0.95, 'batch_size': 256, 'Q_lr': 0.001, 'pi_lr': 0.001, 'norm_eps': 0.01, 'norm_clip': 5, 'max_u': 1.0, 'action_l2': 1.0, 'clip_obs': 200.0, 'scope': 'ddpg', 'relative_goals': False}
demo_batch_size: 128
env_name: highway-parking-v0
gamma: 0.95
make_env: <function prepare_params.<locals>.make_env at 0x00000182FE639598>
n_batches: 40
n_cycles: 50
n_test_rollouts: 10
noise_eps: 0.2
num_demo: 100
prm_loss_weight: 0.001
q_filter: 0
random_eps: 0.3
replay_k: 4
replay_strategy: future
rollout_batch_size: 1
test_with_polyak: False

*** Warning ***
You are running HER with just a single MPI worker. This will work, but the experiments that we report in Plappert et al. (2018, https://arxiv.org/abs/1802.09464) were obtained with --num_cpu 19. This makes a significant difference and if you are looking to reproduce those results, be aware of this. Please also refer to https://github.com/openai/baselines/issues/314 for further details.
****************

Creating a DDPG agent with action space 2 x 1.0...
Training...
--------------------------------------
| epoch              | 0             |
| stats_g/mean       | -0.0012410134 |
| stats_g/std        | 0.071867436   |
| stats_o/mean       | -0.0004943212 |
| stats_o/std        | 0.063166946   |
| test/episode       | 10.0          |
| test/mean_Q        | -2.371805     |
| test/success_rate  | 0.4           |
| train/episode      | 50.0          |
| train/success_rate | 0.0           |
--------------------------------------
--------------------------------------
| epoch              | 1             |
| stats_g/mean       | -0.0012013107 |
| stats_g/std        | 0.072002545   |
| stats_o/mean       | -0.0004833463 |
| stats_o/std        | 0.06337828    |
| test/episode       | 20.0          |
| test/mean_Q        | -2.3061156    |
| test/success_rate  | 0.3           |
| train/episode      | 100.0         |
| train/success_rate | 0.04          |
--------------------------------------
---------------------------------------
| epoch              | 2              |
| stats_g/mean       | -0.0011480862  |
| stats_g/std        | 0.072064586    |
| stats_o/mean       | -0.00043134554 |
| stats_o/std        | 0.06350243     |
| test/episode       | 30.0           |
| test/mean_Q        | -2.392759      |
| test/success_rate  | 0.8            |
| train/episode      | 150.0          |
| train/success_rate | 0.0            |
---------------------------------------
---------------------------------------
| epoch              | 3              |
| stats_g/mean       | -0.0010096101  |
| stats_g/std        | 0.072232425    |
| stats_o/mean       | -0.00035306226 |
| stats_o/std        | 0.06374396     |
| test/episode       | 40.0           |
| test/mean_Q        | -1.9268663     |
| test/success_rate  | 0.6            |
| train/episode      | 200.0          |
| train/success_rate | 0.04           |
---------------------------------------
---------------------------------------
| epoch              | 4              |
| stats_g/mean       | -0.0011195819  |
| stats_g/std        | 0.072448455    |
| stats_o/mean       | -0.00047702927 |
| stats_o/std        | 0.06399239     |
| test/episode       | 50.0           |
| test/mean_Q        | -1.815284      |
| test/success_rate  | 0.8            |
| train/episode      | 250.0          |
| train/success_rate | 0.0            |
---------------------------------------
---------------------------------------
| epoch              | 5              |
| stats_g/mean       | -0.001369102   |
| stats_g/std        | 0.07245933     |
| stats_o/mean       | -0.00071890303 |
| stats_o/std        | 0.064110346    |
| test/episode       | 60.0           |
| test/mean_Q        | -1.8090712     |
| test/success_rate  | 1.0            |
| train/episode      | 300.0          |
| train/success_rate | 0.08           |
---------------------------------------
--------------------------------------
| epoch              | 6             |
| stats_g/mean       | -0.0013151062 |
| stats_g/std        | 0.07252834    |
| stats_o/mean       | -0.0006881503 |
| stats_o/std        | 0.0642819     |
| test/episode       | 70.0          |
| test/mean_Q        | -1.9746447    |
| test/success_rate  | 0.9           |
| train/episode      | 350.0         |
| train/success_rate | 0.06          |
--------------------------------------
--------------------------------------
| epoch              | 7             |
| stats_g/mean       | -0.0013556568 |
| stats_g/std        | 0.07271606    |
| stats_o/mean       | -0.0007181207 |
| stats_o/std        | 0.0645066     |
| test/episode       | 80.0          |
| test/mean_Q        | -2.0683777    |
| test/success_rate  | 1.0           |
| train/episode      | 400.0         |
| train/success_rate | 0.06          |
--------------------------------------
---------------------------------------
| epoch              | 8              |
| stats_g/mean       | -0.0014176448  |
| stats_g/std        | 0.07286355     |
| stats_o/mean       | -0.00080151745 |
| stats_o/std        | 0.06478638     |
| test/episode       | 90.0           |
| test/mean_Q        | -2.0549748     |
| test/success_rate  | 0.9            |
| train/episode      | 450.0          |
| train/success_rate | 0.02           |
---------------------------------------
--------------------------------------
| epoch              | 9             |
| stats_g/mean       | -0.0013835019 |
| stats_g/std        | 0.07291278    |
| stats_o/mean       | -0.0008103596 |
| stats_o/std        | 0.064902104   |
| test/episode       | 100.0         |
| test/mean_Q        | -1.9726061    |
| test/success_rate  | 1.0           |
| train/episode      | 500.0         |
| train/success_rate | 0.04          |
--------------------------------------
```

Does that help?

Thanks for letting me know about that deprecation warning, I'm probably still using an old version of gym >_<

@araffin
Contributor Author

araffin commented Apr 22, 2019

Ok, thanks, this should be enough ;). And how many workers did you use?

EDIT: apparently it was one (I just saw that in the logs).

@araffin
Contributor Author

araffin commented Apr 23, 2019

Hi @eleurent,

Thanks for the hyperparameters, I'm getting much better results now, even with SAC (training still in progress, but it looks much better than before, with a training success rate around 20% after 3e5 steps, which corresponds to a mean training episode reward of -5.8).
I think I will close this issue soon then ;)

EDIT: I updated the network architecture and it converges much faster now (train success rate of 13% in 5e4 steps).

@araffin
Contributor Author

araffin commented Apr 23, 2019

Update: I managed to reproduce your results using (on my dev branch HER-2):

```python
import time

import gym
import highway_env
import numpy as np

from stable_baselines import HER, SAC, DDPG
from stable_baselines.ddpg import NormalActionNoise

env = gym.make("highway-parking-v0")

n_actions = env.action_space.shape[0]
noise_std = 0.2
action_noise = NormalActionNoise(mean=np.zeros(n_actions), sigma=noise_std * np.ones(n_actions))

n_sampled_goal = 4

# SAC hyperparams:
model = HER('MlpPolicy', env, SAC, n_sampled_goal=n_sampled_goal,
            goal_selection_strategy='future',
            verbose=1, buffer_size=int(1e6),
            learning_rate=1e-3,
            gamma=0.95, batch_size=256,
            policy_kwargs=dict(layers=[256, 256, 256]))

# DDPG Hyperparams:
# NOTE: it works even without action noise
# model = HER('MlpPolicy', env, DDPG, n_sampled_goal=n_sampled_goal,
#             goal_selection_strategy='future',
#             verbose=1, buffer_size=int(1e6),
#             actor_lr=1e-3, critic_lr=1e-3, action_noise=action_noise,
#             gamma=0.95, batch_size=256,
#             policy_kwargs=dict(layers=[256, 256, 256]))


model.learn(int(2e5))
model.save('sac_her_{}'.format(int(time.time())))
```
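
To replay the saved policy afterwards, a minimal sketch (the filename is a placeholder for whatever model.save() produced above):

```python
# Load the saved agent and run it in the environment.
from stable_baselines import HER

model = HER.load('sac_her_parking', env=env)  # placeholder path

obs = env.reset()
for _ in range(1000):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)
    env.render()
    if done:
        obs = env.reset()
```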

Closing this issue then, thanks for the help.
