Hindsight Experience Replay (HER) - Reloaded #273

Merged: 51 commits, Jun 4, 2019

Commits (51)

a615b2a
Add bit flipping env
araffin Apr 11, 2019
5bfa61c
HER reloaded (WIP)
araffin Apr 12, 2019
7ff5208
DQN + HER
araffin Apr 15, 2019
3e67330
Add support for SAC and DDPG
araffin Apr 16, 2019
dab5647
Add tests for SAC and DDPG + HER
araffin Apr 20, 2019
9e42f1e
Bug fix + add comments
araffin Apr 20, 2019
63ffc83
Add action noise for SAC
araffin Apr 20, 2019
2e79261
Add note about pop-art normalization
araffin Apr 21, 2019
12ab42e
Merge branch 'master' into HER-2
araffin Apr 21, 2019
eb0da05
Merge branch 'master' into HER-2
araffin Apr 22, 2019
a9f43af
Add saving/loading
araffin Apr 22, 2019
ca32a5f
Add success rate
araffin Apr 22, 2019
8023bbc
Fix HER learning method
araffin Apr 23, 2019
abe17f3
Merge branch 'master' into HER-2
araffin Apr 23, 2019
09e514d
Add support for VecEnv
araffin Apr 27, 2019
c6479e4
Update documentation
araffin Apr 27, 2019
c72e760
Add HER example
araffin Apr 28, 2019
fc3d592
Merge branch 'master' into HER-2
araffin Apr 28, 2019
20fda69
Merge branch 'master' into HER-2
araffin Apr 28, 2019
36fd201
Merge branch 'master' into HER-2
araffin Apr 30, 2019
5799fd9
Merge branch 'master' into HER-2
araffin May 4, 2019
88cb4e5
Removed unused dependencies (tdqm, dill, progressbar2, seaborn, glob2…
araffin May 4, 2019
6c7f5bb
Remove note on the replay buffer
araffin May 4, 2019
65d21e2
Update doc + add a check for VecEnvWrapper with HER
araffin May 5, 2019
8723869
Update examples + add notebook for HER
araffin May 5, 2019
ea1238b
Merge branch 'master' into HER-2
araffin May 9, 2019
0a3b789
Merge branch 'master' into HER-2
araffin May 11, 2019
6ef753d
Merge branch 'master' into HER-2
araffin May 15, 2019
157b005
Merge branch 'master' into HER-2
araffin May 18, 2019
0be6f84
Add random exploration for SAC and DDPG
araffin May 19, 2019
b208889
Typo in docstring
araffin May 19, 2019
27699bf
Doc update: add fix for DDPG saved models
araffin May 19, 2019
3dfe6b1
Merge branch 'master' into HER-2
araffin May 21, 2019
87db166
Test with reward offset
araffin May 22, 2019
1a7e090
Add GoalEnvNormalize draft
araffin May 22, 2019
7592bbd
Remove GoalEnvNormalize
araffin May 23, 2019
aebdfe9
Merge branch 'master' into HER-2
araffin May 23, 2019
edfe3c3
Merge branch 'master' into HER-2
araffin May 30, 2019
730b171
Fix typo
araffin May 31, 2019
635c7d0
Bug fix for HER + VecEnv
araffin Jun 1, 2019
bf363ad
Fix HER test env
araffin Jun 1, 2019
ccbc5c7
Fixed key order
araffin Jun 1, 2019
e1e344b
Add support for discrete obs space
araffin Jun 2, 2019
096f045
Update doc about reproducing experiments
araffin Jun 2, 2019
7688838
Update doc: DDPG supports multiprocessing with MPI
araffin Jun 2, 2019
5c24590
Merge branch 'master' into HER-2
araffin Jun 2, 2019
cd18225
Fix for new abstract method
araffin Jun 2, 2019
65ef631
Update changelog
araffin Jun 2, 2019
84af166
Fix custom policy example
araffin Jun 4, 2019
e2408eb
Add replay_wrapper to base OffPolicy class
araffin Jun 4, 2019
6ed497d
Fix reimport
araffin Jun 4, 2019
32 changes: 16 additions & 16 deletions README.md
@@ -28,14 +28,14 @@ This toolset is a fork of OpenAI Baselines, with a major structural refactoring,
| Common interface | :heavy_check_mark: | :heavy_minus_sign: <sup>(3)</sup> |
| Tensorboard support | :heavy_check_mark: | :heavy_minus_sign: <sup>(4)</sup> |
| Ipython / Notebook friendly | :heavy_check_mark: | :x: |
| PEP8 code style | :heavy_check_mark: | :heavy_minus_sign: <sup>(5)</sup> |
| PEP8 code style | :heavy_check_mark: | :heavy_check_mark: <sup>(5)</sup> |
| Custom callback | :heavy_check_mark: | :heavy_minus_sign: <sup>(6)</sup> |

<sup><sup>(1): Forked from previous version of OpenAI baselines, however missing refactoring for HER.</sup></sup><br>
<sup><sup>(1): Forked from previous version of OpenAI baselines, now with SAC in addition.</sup></sup><br>
<sup><sup>(2): Currently not available for DDPG, and only from the run script. </sup></sup><br>
<sup><sup>(3): Only via the run script.</sup></sup><br>
<sup><sup>(4): Rudimentary logging of training information (no loss nor graph). </sup></sup><br>
<sup><sup>(5): WIP on OpenAI's side (you can do it OpenAI! :cat:)</sup></sup><br>
<sup><sup>(5): EDIT: you did it OpenAI! :cat:</sup></sup><br>
<sup><sup>(6): Passing a callback function is only available for DQN</sup></sup><br>

## Documentation
@@ -144,25 +144,25 @@ All the following examples can be executed online using Google colab notebooks:

| **Name** | **Refactored**<sup>(1)</sup> | **Recurrent** | ```Box``` | ```Discrete``` | ```MultiDiscrete``` | ```MultiBinary``` | **Multi Processing** |
| ------------------- | ---------------------------- | ------------------ | ------------------ | ------------------ | ------------------- | ------------------ | --------------------------------- |
| A2C | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
| A2C | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
| ACER | :heavy_check_mark: | :heavy_check_mark: | :x: <sup>(5)</sup> | :heavy_check_mark: | :x: | :x: | :heavy_check_mark: |
| ACKTR | :heavy_check_mark: | :heavy_check_mark: | :x: <sup>(5)</sup> | :heavy_check_mark: | :x: | :x: | :heavy_check_mark: |
| DDPG | :heavy_check_mark: | :x: | :heavy_check_mark: | :x: | :x: | :x: | :x: |
| DDPG | :heavy_check_mark: | :x: | :heavy_check_mark: | :x: | :x: | :x: | :heavy_check_mark: <sup>(4)</sup>|
| DQN | :heavy_check_mark: | :x: | :x: | :heavy_check_mark: | :x: | :x: | :x: |
| GAIL <sup>(2)</sup> | :heavy_check_mark: | :x: | :heavy_check_mark: |:heavy_check_mark:| :x: | :x: | :heavy_check_mark: <sup>(4)</sup> |
| HER <sup>(3)</sup> | :x: <sup>(5)</sup> | :x: | :heavy_check_mark: | :x: | :x: | :x: | :x: |
| HER <sup>(3)</sup> | :heavy_check_mark: | :x: | :heavy_check_mark: | :heavy_check_mark: | :x: | :heavy_check_mark:| :x: |
| PPO1 | :heavy_check_mark: | :x: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: <sup>(4)</sup> |
| PPO2 | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
| SAC | :heavy_check_mark: | :x: | :heavy_check_mark: | :x: | :x: | :x: | :x: |
| TRPO | :heavy_check_mark: | :x: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: <sup>(4)</sup> |

<sup><sup>(1): Whether or not the algorithm has been refactored to fit the ```BaseRLModel``` class.</sup></sup><br>
<sup><sup>(2): Only implemented for TRPO.</sup></sup><br>
<sup><sup>(3): Only implemented for DDPG.</sup></sup><br>
<sup><sup>(3): Re-implemented from scratch</sup></sup><br>
<sup><sup>(4): Multi Processing with [MPI](https://mpi4py.readthedocs.io/en/stable/).</sup></sup><br>
<sup><sup>(5): TODO, in project scope.</sup></sup>

NOTE: Soft Actor-Critic (SAC) was not part of the original baselines.
NOTE: Soft Actor-Critic (SAC) was not part of the original baselines and HER was reimplemented from scratch.

Actions ```gym.spaces```:
* ```Box```: An N-dimensional box that contains every point in the action space.
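
For illustration, here is a minimal snippet constructing such action spaces with gym; the shapes and bounds are arbitrary examples:

```python
import numpy as np
from gym import spaces

# A 2-dimensional continuous action space with values in [-1, 1]
box_space = spaces.Box(low=-1.0, high=1.0, shape=(2,), dtype=np.float32)
# A discrete action space with 4 possible actions
discrete_space = spaces.Discrete(4)

print(box_space.sample())       # e.g. [ 0.31 -0.74]
print(discrete_space.sample())  # e.g. 2
```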
@@ -191,14 +191,14 @@ please tell us if you want your project to appear on this page ;)
To cite this repository in publications:

```
@misc{stable-baselines,
author = {Hill, Ashley and Raffin, Antonin and Ernestus, Maximilian and Gleave, Adam and Traore, Rene and Dhariwal, Prafulla and Hesse, Christopher and Klimov, Oleg and Nichol, Alex and Plappert, Matthias and Radford, Alec and Schulman, John and Sidor, Szymon and Wu, Yuhuai},
title = {Stable Baselines},
year = {2018},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/hill-a/stable-baselines}},
}
@misc{stable-baselines,
author = {Hill, Ashley and Raffin, Antonin and Ernestus, Maximilian and Gleave, Adam and Traore, Rene and Dhariwal, Prafulla and Hesse, Christopher and Klimov, Oleg and Nichol, Alex and Plappert, Matthias and Radford, Alec and Schulman, John and Sidor, Szymon and Wu, Yuhuai},
title = {Stable Baselines},
year = {2018},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/hill-a/stable-baselines}},
}
```

## Maintainers
25 changes: 11 additions & 14 deletions docs/guide/algos.rst
@@ -11,34 +11,31 @@ along with some useful characteristics: support for recurrent policies, discrete
.. A2C ✔️
.. ===== ======================== ========= ======= ============ ================= =============== ================

.. There is an issue with Read The Docs for building the table when the "HER" row is present:
.. Apparently a problem of spacing
.. HER [#f3]_ ❌ [#f5]_ ❌ ✔️ ❌ ❌


============ ======================== ========= =========== ============ ================
Name Refactored [#f1]_ Recurrent ``Box`` ``Discrete`` Multi Processing
============ ======================== ========= =========== ============ ================
A2C ✔️ ✔️ ✔️ ✔️ ✔️
ACER ✔️ ✔️ ❌ [#f5]_ ✔️ ✔️
ACKTR ✔️ ✔️ ❌ [#f5]_ ✔️ ✔️
DDPG ✔️ ❌ ✔️ ❌
ACER ✔️ ✔️ ❌ [#f4]_ ✔️ ✔️
ACKTR ✔️ ✔️ ❌ [#f4]_ ✔️ ✔️
DDPG ✔️ ❌ ✔️ ❌ ✔️ [#f3]_
DQN ✔️ ❌ ❌ ✔️ ❌
GAIL [#f2]_ ✔️ ✔️ ✔️ ✔️ ✔️ [#f4]_
PPO1 ✔️ ❌ ✔️ ✔️ ✔️ [#f4]_
HER ✔️ ❌ ✔️ ✔️ ❌
GAIL [#f2]_ ✔️ ✔️ ✔️ ✔️ ✔️ [#f3]_
PPO1 ✔️ ❌ ✔️ ✔️ ✔️ [#f3]_
PPO2 ✔️ ✔️ ✔️ ✔️ ✔️
SAC ✔️ ❌ ✔️ ❌ ❌
TRPO ✔️ ❌ ✔️ ✔️ ✔️ [#f4]_
TRPO ✔️ ❌ ✔️ ✔️ ✔️ [#f3]_
============ ======================== ========= =========== ============ ================

.. [#f1] Whether or not the algorithm has been refactored to fit the ``BaseRLModel`` class.
.. [#f2] Only implemented for TRPO.
.. [#f3] Only implemented for DDPG.
.. [#f4] Multi Processing with `MPI`_.
.. [#f5] TODO, in project scope.
.. [#f3] Multi Processing with `MPI`_.
.. [#f4] TODO, in project scope.

.. note::
Non-array spaces such as `Dict` or `Tuple` are not currently supported by any algorithm.
Non-array spaces such as `Dict` or `Tuple` are not currently supported by any algorithm,
except HER, which handles `Dict` observations when working with a `gym.GoalEnv`.
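
For illustration, a minimal sketch of the dict observation space expected from a ``gym.GoalEnv`` (the shapes below are arbitrary; only the three keys are fixed by gym):

.. code-block:: python

    import numpy as np
    from gym import spaces

    # Dict observation layout used by gym.GoalEnv (the three keys are fixed by gym)
    goal_obs_space = spaces.Dict({
        'observation': spaces.Box(-np.inf, np.inf, shape=(6,), dtype=np.float32),
        'achieved_goal': spaces.Box(-np.inf, np.inf, shape=(3,), dtype=np.float32),
        'desired_goal': spaces.Box(-np.inf, np.inf, shape=(3,), dtype=np.float32),
    })

    sample = goal_obs_space.sample()
    print(sample['observation'].shape, sample['desired_goal'].shape)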

Actions ``gym.spaces``:

5 changes: 2 additions & 3 deletions docs/guide/custom_env.rst
@@ -8,9 +8,8 @@ That is to say, your environment must implement the following methods (and inher


.. note::

If you are using images as input, the input values must be in [0, 255] as the observation
is normalized (dividing by 255 to have values in [0, 1]) when using CNN policies.
If you are using images as input, the input values must be in [0, 255] as the observation
is normalized (dividing by 255 to have values in [0, 1]) when using CNN policies.
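
For example, an image observation space declared with raw pixel values in [0, 255] might look like the sketch below (the 84x84x3 shape is just an example):

.. code-block:: python

    import numpy as np
    from gym import spaces

    # Raw pixel observations in [0, 255]; CNN policies rescale them to [0, 1] internally
    observation_space = spaces.Box(low=0, high=255, shape=(84, 84, 3), dtype=np.uint8)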



14 changes: 6 additions & 8 deletions docs/guide/custom_policy.rst
@@ -216,32 +216,30 @@ If your task requires even more granular control over the policy architecture, y
value_fn = tf.layers.dense(vf_h, 1, name='vf')
vf_latent = vf_h

self.proba_distribution, self.policy, self.q_value = \
self._proba_distribution, self._policy, self.q_value = \
self.pdtype.proba_distribution_from_latent(pi_latent, vf_latent, init_scale=0.01)

self.value_fn = value_fn
self.initial_state = None
self._value_fn = value_fn
self._setup_init()

def step(self, obs, state=None, mask=None, deterministic=False):
if deterministic:
action, value, neglogp = self.sess.run([self.deterministic_action, self._value, self.neglogp],
action, value, neglogp = self.sess.run([self.deterministic_action, self.value_flat, self.neglogp],
{self.obs_ph: obs})
else:
action, value, neglogp = self.sess.run([self.action, self._value, self.neglogp],
action, value, neglogp = self.sess.run([self.action, self.value_flat, self.neglogp],
{self.obs_ph: obs})
return action, value, self.initial_state, neglogp

def proba_step(self, obs, state=None, mask=None):
return self.sess.run(self.policy_proba, {self.obs_ph: obs})

def value(self, obs, state=None, mask=None):
return self.sess.run(self._value, {self.obs_ph: obs})
return self.sess.run(self.value_flat, {self.obs_ph: obs})


# Create and wrap the environment
env = gym.make('Breakout-v0')
env = DummyVecEnv([lambda: env])
env = DummyVecEnv([lambda: gym.make('Breakout-v0')])

model = A2C(CustomPolicy, env, verbose=1)
# Train the agent
122 changes: 99 additions & 23 deletions docs/guide/examples.rst
@@ -13,29 +13,31 @@ notebooks:
- `Monitor Training and Plotting`_
- `Atari Games`_
- `Breakout`_ (trained agent included)
- `Hindsight Experience Replay`_
- `RL Baselines zoo`_

.. _Getting Started: https://colab.research.google.com/drive/1_1H5bjWKYBVKbbs-Kj83dsfuZieDNcFU
.. _Training, Saving, Loading: https://colab.research.google.com/drive/1KoAQ1C_BNtGV3sVvZCnNZaER9rstmy0s
.. _Training, Saving, Loading: https://colab.research.google.com/drive/16QritJF5kgT3mtnODepld1fo5tFnFCoc
.. _Multiprocessing: https://colab.research.google.com/drive/1ZzNFMUUi923foaVsYb4YjPy4mjKtnOxb
.. _Monitor Training and Plotting: https://colab.research.google.com/drive/1L_IMo6v0a0ALK8nefZm6PqPSy0vZIWBT
.. _Atari Games: https://colab.research.google.com/drive/1iYK11yDzOOqnrXi1Sfjm1iekZr4cxLaN
.. _Breakout: https://colab.research.google.com/drive/14NwwEHwN4hdNgGzzySjxQhEVDff-zr7O
.. _Hindsight Experience Replay: https://colab.research.google.com/drive/1VDD0uLi8wjUXIqAdLKiK15XaEe0z2FOc
.. _RL Baselines zoo: https://colab.research.google.com/drive/1cPGK3XrCqEs3QLqiijsfib9OFht3kObX

.. |colab| image:: ../_static/img/colab.svg

Basic Usage: Training, Saving, Loading
--------------------------------------

In the following example, we will train, save and load an A2C model on the Lunar Lander environment.
In the following example, we will train, save and load a DQN model on the Lunar Lander environment.

.. image:: ../_static/img/try_it.png
:scale: 30 %
:target: https://colab.research.google.com/drive/1KoAQ1C_BNtGV3sVvZCnNZaER9rstmy0s
:target: https://colab.research.google.com/drive/16QritJF5kgT3mtnODepld1fo5tFnFCoc


.. figure:: https://cdn-images-1.medium.com/max/960/1*W7X69nxINgZEcJEAyoHCVw.gif
.. figure:: https://cdn-images-1.medium.com/max/960/1*f4VZPKOI0PYNWiwt0la0Rg.gif

Lunar Lander Environment

@@ -53,25 +55,21 @@ In the following example, we will train, save and load an A2C model on the Lunar

import gym

from stable_baselines.common.policies import MlpPolicy
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines import A2C
from stable_baselines import DQN

# Create and wrap the environment
# Create environment
env = gym.make('LunarLander-v2')
env = DummyVecEnv([lambda: env])

# Alternatively, you can directly use:
# model = A2C('MlpPolicy', 'LunarLander-v2', ent_coef=0.1, verbose=1)
model = A2C(MlpPolicy, env, ent_coef=0.1, verbose=1)
# Instantiate the agent
model = DQN('MlpPolicy', env, learning_rate=1e-3, prioritized_replay=True, verbose=1)
# Train the agent
model.learn(total_timesteps=100000)
model.learn(total_timesteps=int(2e5))
# Save the agent
model.save("a2c_lunar")
model.save("dqn_lunar")
del model # delete trained model to demonstrate loading

# Load the trained agent
model = A2C.load("a2c_lunar")
model = DQN.load("dqn_lunar")

# Enjoy trained agent
obs = env.reset()
@@ -159,12 +157,11 @@ If your callback returns False, training is aborted early.
import numpy as np
import matplotlib.pyplot as plt

from stable_baselines.ddpg.policies import MlpPolicy
from stable_baselines.common.vec_env.dummy_vec_env import DummyVecEnv
from stable_baselines.ddpg.policies import LnMlpPolicy
from stable_baselines.bench import Monitor
from stable_baselines.results_plotter import load_results, ts2xy
from stable_baselines import DDPG
from stable_baselines.ddpg.noise import AdaptiveParamNoiseSpec
from stable_baselines.ddpg import AdaptiveParamNoiseSpec


best_mean_reward, n_steps = -np.inf, 0
@@ -178,7 +175,7 @@ If your callback returns False, training is aborted early.
global n_steps, best_mean_reward
# Print stats every 1000 calls
if (n_steps + 1) % 1000 == 0:
# Evaluate policy performance
# Evaluate policy training performance
x, y = ts2xy(load_results(log_dir), 'timesteps')
if len(x) > 0:
mean_reward = np.mean(y[-100:])
@@ -202,13 +199,14 @@ If your callback returns False, training is aborted early.
# Create and wrap the environment
env = gym.make('LunarLanderContinuous-v2')
env = Monitor(env, log_dir, allow_early_resets=True)
env = DummyVecEnv([lambda: env])

# Add some param noise for exploration
param_noise = AdaptiveParamNoiseSpec(initial_stddev=0.2, desired_action_stddev=0.2)
model = DDPG(MlpPolicy, env, param_noise=param_noise, memory_limit=int(1e6), verbose=0)
param_noise = AdaptiveParamNoiseSpec(initial_stddev=0.1, desired_action_stddev=0.1)
# Because we use parameter noise, we should use a MlpPolicy with layer normalization
model = DDPG(LnMlpPolicy, env, param_noise=param_noise, verbose=0)
# Train the agent
model.learn(total_timesteps=200000, callback=callback)
model.learn(total_timesteps=int(1e5), callback=callback)


Atari Games
-----------
@@ -440,6 +438,84 @@ This example demonstrates how to train a recurrent policy and how to test it prop
env.render()


Hindsight Experience Replay (HER)
---------------------------------

For this example, we are using `Highway-Env <https://github.com/eleurent/highway-env>`_ by `@eleurent <https://github.com/eleurent>`_.


.. image:: ../_static/img/try_it.png
:scale: 30 %
:target: https://colab.research.google.com/drive/1VDD0uLi8wjUXIqAdLKiK15XaEe0z2FOc


.. figure:: https://raw.githubusercontent.com/eleurent/highway-env/gh-media/docs/media/parking-env.gif

The parking-v0 environment from highway-env.

The parking env is a goal-conditioned continuous control task, in which the vehicle must park in a given space with the appropriate heading.

.. note::

The hyperparameters in the following example were optimized for that environment.


.. code-block:: python

import gym
import highway_env
import numpy as np

from stable_baselines import HER, SAC, DDPG
from stable_baselines.ddpg import NormalActionNoise

env = gym.make("parking-v0")

# Create 4 artificial transitions per real transition
n_sampled_goal = 4

# SAC hyperparams:
model = HER('MlpPolicy', env, SAC, n_sampled_goal=n_sampled_goal,
goal_selection_strategy='future',
verbose=1, buffer_size=int(1e6),
learning_rate=1e-3,
gamma=0.95, batch_size=256,
policy_kwargs=dict(layers=[256, 256, 256]))

# DDPG Hyperparams:
# NOTE: it works even without action noise
# n_actions = env.action_space.shape[0]
# noise_std = 0.2
# action_noise = NormalActionNoise(mean=np.zeros(n_actions), sigma=noise_std * np.ones(n_actions))
# model = HER('MlpPolicy', env, DDPG, n_sampled_goal=n_sampled_goal,
# goal_selection_strategy='future',
# verbose=1, buffer_size=int(1e6),
# actor_lr=1e-3, critic_lr=1e-3, action_noise=action_noise,
# gamma=0.95, batch_size=256,
# policy_kwargs=dict(layers=[256, 256, 256]))


model.learn(int(2e5))
model.save('her_sac_highway')

# Load saved model
model = HER.load('her_sac_highway', env=env)

obs = env.reset()

# Evaluate the agent
episode_reward = 0
for _ in range(100):
action, _ = model.predict(obs)
obs, reward, done, info = env.step(action)
env.render()
episode_reward += reward
if done or info.get('is_success', False):
print("Reward:", episode_reward, "Success?", info.get('is_success', False))
episode_reward = 0.0
obs = env.reset()
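
HER can also be combined with DQN on discrete goal-based tasks. Below is a minimal sketch using the bit flipping environment added in this PR; the import path and the constructor arguments (number of bits, ``continuous``, ``max_steps``) are assumptions, so check the actual signature before relying on it.

.. code-block:: python

    from stable_baselines import HER, DQN
    # Assumed import path for the bit flipping env added in this PR
    from stable_baselines.common.bit_flipping_env import BitFlippingEnv

    N_BITS = 10
    # Assumed constructor: number of bits, discrete actions, episode length
    env = BitFlippingEnv(N_BITS, continuous=False, max_steps=N_BITS)

    model = HER('MlpPolicy', env, DQN, n_sampled_goal=4,
                goal_selection_strategy='future', verbose=1)
    model.learn(total_timesteps=10000)

    obs = env.reset()
    for _ in range(20):
        action, _ = model.predict(obs)
        obs, reward, done, info = env.step(action)
        if done:
            obs = env.reset()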



Continual Learning
------------------