Hindsight Experience Replay (HER) - Reloaded #273

Merged: 51 commits, Jun 4, 2019

Commits (51)

a615b2a
Add bit flipping env
araffin Apr 11, 2019
5bfa61c
HER reloaded (WIP)
araffin Apr 12, 2019
7ff5208
DQN + HER
araffin Apr 15, 2019
3e67330
Add support for SAC and DDPG
araffin Apr 16, 2019
dab5647
Add tests for SAC and DDPG + HER
araffin Apr 20, 2019
9e42f1e
Bug fix + add comments
araffin Apr 20, 2019
63ffc83
Add action noise for SAC
araffin Apr 20, 2019
2e79261
Add note about pop-art normalization
araffin Apr 21, 2019
12ab42e
Merge branch 'master' into HER-2
araffin Apr 21, 2019
eb0da05
Merge branch 'master' into HER-2
araffin Apr 22, 2019
a9f43af
Add saving/loading
araffin Apr 22, 2019
ca32a5f
Add success rate
araffin Apr 22, 2019
8023bbc
Fix HER learning method
araffin Apr 23, 2019
abe17f3
Merge branch 'master' into HER-2
araffin Apr 23, 2019
09e514d
Add support for VecEnv
araffin Apr 27, 2019
c6479e4
Update documentation
araffin Apr 27, 2019
c72e760
Add HER example
araffin Apr 28, 2019
fc3d592
Merge branch 'master' into HER-2
araffin Apr 28, 2019
20fda69
Merge branch 'master' into HER-2
araffin Apr 28, 2019
36fd201
Merge branch 'master' into HER-2
araffin Apr 30, 2019
5799fd9
Merge branch 'master' into HER-2
araffin May 4, 2019
88cb4e5
Removed unused dependencies (tdqm, dill, progressbar2, seaborn, glob2…
araffin May 4, 2019
6c7f5bb
Remove note on the replay buffer
araffin May 4, 2019
65d21e2
Update doc + add a check for VecEnvWrapper with HER
araffin May 5, 2019
8723869
Update examples + add notebook for HER
araffin May 5, 2019
ea1238b
Merge branch 'master' into HER-2
araffin May 9, 2019
0a3b789
Merge branch 'master' into HER-2
araffin May 11, 2019
6ef753d
Merge branch 'master' into HER-2
araffin May 15, 2019
157b005
Merge branch 'master' into HER-2
araffin May 18, 2019
0be6f84
Add random exploration for SAC and DDPG
araffin May 19, 2019
b208889
Typo in docstring
araffin May 19, 2019
27699bf
Doc update: add fix for DDPG saved models
araffin May 19, 2019
3dfe6b1
Merge branch 'master' into HER-2
araffin May 21, 2019
87db166
Test with reward offset
araffin May 22, 2019
1a7e090
Add GoalEnvNormalize draft
araffin May 22, 2019
7592bbd
Remove GoalEnvNormalize
araffin May 23, 2019
aebdfe9
Merge branch 'master' into HER-2
araffin May 23, 2019
edfe3c3
Merge branch 'master' into HER-2
araffin May 30, 2019
730b171
Fix typo
araffin May 31, 2019
635c7d0
Bug fix for HER + VecEnv
araffin Jun 1, 2019
bf363ad
Fix HER test env
araffin Jun 1, 2019
ccbc5c7
Fixed key order
araffin Jun 1, 2019
e1e344b
Add support for discrete obs space
araffin Jun 2, 2019
096f045
Update doc about reproducing experiments
araffin Jun 2, 2019
7688838
Update doc: DDPG supports multiprocessing with MPI
araffin Jun 2, 2019
5c24590
Merge branch 'master' into HER-2
araffin Jun 2, 2019
cd18225
Fix for new abstract method
araffin Jun 2, 2019
65ef631
Update changelog
araffin Jun 2, 2019
84af166
Fix custom policy example
araffin Jun 4, 2019
e2408eb
Add replay_wrapper to base OffPolicy class
araffin Jun 4, 2019
6ed497d
Fix reimport
araffin Jun 4, 2019
32 changes: 16 additions & 16 deletions README.md
@@ -28,14 +28,14 @@ This toolset is a fork of OpenAI Baselines, with a major structural refactoring,
| Common interface | :heavy_check_mark: | :heavy_minus_sign: <sup>(3)</sup> |
| Tensorboard support | :heavy_check_mark: | :heavy_minus_sign: <sup>(4)</sup> |
| Ipython / Notebook friendly | :heavy_check_mark: | :x: |
| PEP8 code style | :heavy_check_mark: | :heavy_minus_sign: <sup>(5)</sup> |
| PEP8 code style | :heavy_check_mark: | :heavy_check_mark: <sup>(5)</sup> |
| Custom callback | :heavy_check_mark: | :heavy_minus_sign: <sup>(6)</sup> |

<sup><sup>(1): Forked from previous version of OpenAI baselines, however missing refactoring for HER.</sup></sup><br>
<sup><sup>(1): Forked from previous version of OpenAI baselines, now with SAC in addition.</sup></sup><br>
<sup><sup>(2): Currently not available for DDPG, and only from the run script. </sup></sup><br>
<sup><sup>(3): Only via the run script.</sup></sup><br>
<sup><sup>(4): Rudimentary logging of training information (no loss nor graph). </sup></sup><br>
<sup><sup>(5): WIP on OpenAI's side (you can do it OpenAI! :cat:)</sup></sup><br>
<sup><sup>(5): EDIT: you did it OpenAI! :cat:</sup></sup><br>
<sup><sup>(6): Passing a callback function is only available for DQN</sup></sup><br>

## Documentation
@@ -144,25 +144,25 @@ All the following examples can be executed online using Google colab notebooks:

| **Name** | **Refactored**<sup>(1)</sup> | **Recurrent** | ```Box``` | ```Discrete``` | ```MultiDiscrete``` | ```MultiBinary``` | **Multi Processing** |
| ------------------- | ---------------------------- | ------------------ | ------------------ | ------------------ | ------------------- | ------------------ | --------------------------------- |
| A2C | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
| A2C | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
| ACER | :heavy_check_mark: | :heavy_check_mark: | :x: <sup>(5)</sup> | :heavy_check_mark: | :x: | :x: | :heavy_check_mark: |
| ACKTR | :heavy_check_mark: | :heavy_check_mark: | :x: <sup>(5)</sup> | :heavy_check_mark: | :x: | :x: | :heavy_check_mark: |
| DDPG | :heavy_check_mark: | :x: | :heavy_check_mark: | :x: | :x: | :x: | :x: |
| DDPG | :heavy_check_mark: | :x: | :heavy_check_mark: | :x: | :x: | :x: | :heavy_check_mark: <sup>(4)</sup>|
| DQN | :heavy_check_mark: | :x: | :x: | :heavy_check_mark: | :x: | :x: | :x: |
| GAIL <sup>(2)</sup> | :heavy_check_mark: | :x: | :heavy_check_mark: |:heavy_check_mark:| :x: | :x: | :heavy_check_mark: <sup>(4)</sup> |
| HER <sup>(3)</sup> | :x: <sup>(5)</sup> | :x: | :heavy_check_mark: | :x: | :x: | :x: | :x: |
| HER <sup>(3)</sup> | :heavy_check_mark: | :x: | :heavy_check_mark: | :heavy_check_mark: | :x: | :heavy_check_mark:| :x: |
| PPO1 | :heavy_check_mark: | :x: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: <sup>(4)</sup> |
| PPO2 | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
| SAC | :heavy_check_mark: | :x: | :heavy_check_mark: | :x: | :x: | :x: | :x: |
| TRPO | :heavy_check_mark: | :x: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: <sup>(4)</sup> |

<sup><sup>(1): Whether or not the algorithm has been refactored to fit the ```BaseRLModel``` class.</sup></sup><br>
<sup><sup>(2): Only implemented for TRPO.</sup></sup><br>
<sup><sup>(3): Only implemented for DDPG.</sup></sup><br>
<sup><sup>(3): Re-implemented from scratch</sup></sup><br>
<sup><sup>(4): Multi Processing with [MPI](https://mpi4py.readthedocs.io/en/stable/).</sup></sup><br>
<sup><sup>(5): TODO, in project scope.</sup></sup>

NOTE: Soft Actor-Critic (SAC) was not part of the original baselines.
NOTE: Soft Actor-Critic (SAC) was not part of the original baselines and HER was reimplemented from scratch.

Actions ```gym.spaces```:
* ```Box```: An N-dimensional box that contains every point in the action space.
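
For illustration, here is a minimal snippet constructing such action spaces with gym; the shapes and bounds are arbitrary examples:

```python
import numpy as np
from gym import spaces

# A 2-dimensional continuous action space with values in [-1, 1]
box_space = spaces.Box(low=-1.0, high=1.0, shape=(2,), dtype=np.float32)
# A discrete action space with 4 possible actions
discrete_space = spaces.Discrete(4)

print(box_space.sample())       # e.g. [ 0.31 -0.74]
print(discrete_space.sample())  # e.g. 2
```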
@@ -191,14 +191,14 @@ please tell us if you want your project to appear on this page ;)
To cite this repository in publications:

```
@misc{stable-baselines,
author = {Hill, Ashley and Raffin, Antonin and Ernestus, Maximilian and Gleave, Adam and Traore, Rene and Dhariwal, Prafulla and Hesse, Christopher and Klimov, Oleg and Nichol, Alex and Plappert, Matthias and Radford, Alec and Schulman, John and Sidor, Szymon and Wu, Yuhuai},
title = {Stable Baselines},
year = {2018},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/hill-a/stable-baselines}},
}
@misc{stable-baselines,
author = {Hill, Ashley and Raffin, Antonin and Ernestus, Maximilian and Gleave, Adam and Traore, Rene and Dhariwal, Prafulla and Hesse, Christopher and Klimov, Oleg and Nichol, Alex and Plappert, Matthias and Radford, Alec and Schulman, John and Sidor, Szymon and Wu, Yuhuai},
title = {Stable Baselines},
year = {2018},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/hill-a/stable-baselines}},
}
```

## Maintainers
25 changes: 11 additions & 14 deletions docs/guide/algos.rst
@@ -11,34 +11,31 @@ along with some useful characteristics: support for recurrent policies, discrete
.. A2C ✔️
.. ===== ======================== ========= ======= ============ ================= =============== ================

.. There is an issue with Read The Docs for building the table when the "HER" row is present:
.. Apparently a problem of spacing
.. HER [#f3]_ ❌ [#f5]_ ❌ ✔️ ❌ ❌


============ ======================== ========= =========== ============ ================
Name Refactored [#f1]_ Recurrent ``Box`` ``Discrete`` Multi Processing
============ ======================== ========= =========== ============ ================
A2C ✔️ ✔️ ✔️ ✔️ ✔️
ACER ✔️ ✔️ ❌ [#f5]_ ✔️ ✔️
ACKTR ✔️ ✔️ ❌ [#f5]_ ✔️ ✔️
DDPG ✔️ ❌ ✔️ ❌
ACER ✔️ ✔️ ❌ [#f4]_ ✔️ ✔️
ACKTR ✔️ ✔️ ❌ [#f4]_ ✔️ ✔️
DDPG ✔️ ❌ ✔️ ❌ ✔️ [#f3]_
DQN ✔️ ❌ ❌ ✔️ ❌
GAIL [#f2]_ ✔️ ✔️ ✔️ ✔️ ✔️ [#f4]_
PPO1 ✔️ ❌ ✔️ ✔️ ✔️ [#f4]_
HER ✔️ ❌ ✔️ ✔️ ❌
GAIL [#f2]_ ✔️ ✔️ ✔️ ✔️ ✔️ [#f3]_
PPO1 ✔️ ❌ ✔️ ✔️ ✔️ [#f3]_
PPO2 ✔️ ✔️ ✔️ ✔️ ✔️
SAC ✔️ ❌ ✔️ ❌ ❌
TRPO ✔️ ❌ ✔️ ✔️ ✔️ [#f4]_
TRPO ✔️ ❌ ✔️ ✔️ ✔️ [#f3]_
============ ======================== ========= =========== ============ ================

.. [#f1] Whether or not the algorithm has been refactored to fit the ``BaseRLModel`` class.
.. [#f2] Only implemented for TRPO.
.. [#f3] Only implemented for DDPG.
.. [#f4] Multi Processing with `MPI`_.
.. [#f5] TODO, in project scope.
.. [#f3] Multi Processing with `MPI`_.
.. [#f4] TODO, in project scope.

.. note::
Non-array spaces such as `Dict` or `Tuple` are not currently supported by any algorithm.
Non-array spaces such as `Dict` or `Tuple` are not currently supported by any algorithm,
except HER, which handles `Dict` observations when working with a `gym.GoalEnv`.
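
For illustration, a minimal sketch of the dict observation space expected from a ``gym.GoalEnv`` (the shapes below are arbitrary; only the three keys are fixed by gym):

.. code-block:: python

    import numpy as np
    from gym import spaces

    # Dict observation layout used by gym.GoalEnv (the three keys are fixed by gym)
    goal_obs_space = spaces.Dict({
        'observation': spaces.Box(-np.inf, np.inf, shape=(6,), dtype=np.float32),
        'achieved_goal': spaces.Box(-np.inf, np.inf, shape=(3,), dtype=np.float32),
        'desired_goal': spaces.Box(-np.inf, np.inf, shape=(3,), dtype=np.float32),
    })

    sample = goal_obs_space.sample()
    print(sample['observation'].shape, sample['desired_goal'].shape)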

Actions ``gym.spaces``:

5 changes: 2 additions & 3 deletions docs/guide/custom_env.rst
@@ -8,9 +8,8 @@ That is to say, your environment must implement the following methods (and inher


.. note::

If you are using images as input, the input values must be in [0, 255] as the observation
is normalized (dividing by 255 to have values in [0, 1]) when using CNN policies.
If you are using images as input, the input values must be in [0, 255] as the observation
is normalized (dividing by 255 to have values in [0, 1]) when using CNN policies.
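
For example, an image observation space declared with raw pixel values in [0, 255] might look like the sketch below (the 84x84x3 shape is just an example):

.. code-block:: python

    import numpy as np
    from gym import spaces

    # Raw pixel observations in [0, 255]; CNN policies rescale them to [0, 1] internally
    observation_space = spaces.Box(low=0, high=255, shape=(84, 84, 3), dtype=np.uint8)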



14 changes: 6 additions & 8 deletions docs/guide/custom_policy.rst
@@ -216,32 +216,30 @@ If your task requires even more granular control over the policy architecture, y
value_fn = tf.layers.dense(vf_h, 1, name='vf')
vf_latent = vf_h

self.proba_distribution, self.policy, self.q_value = \
self._proba_distribution, self._policy, self.q_value = \
self.pdtype.proba_distribution_from_latent(pi_latent, vf_latent, init_scale=0.01)

self.value_fn = value_fn
self.initial_state = None
self._value_fn = value_fn
self._setup_init()

def step(self, obs, state=None, mask=None, deterministic=False):
if deterministic:
action, value, neglogp = self.sess.run([self.deterministic_action, self._value, self.neglogp],
action, value, neglogp = self.sess.run([self.deterministic_action, self.value_flat, self.neglogp],
{self.obs_ph: obs})
else:
action, value, neglogp = self.sess.run([self.action, self._value, self.neglogp],
action, value, neglogp = self.sess.run([self.action, self.value_flat, self.neglogp],
{self.obs_ph: obs})
return action, value, self.initial_state, neglogp

def proba_step(self, obs, state=None, mask=None):
return self.sess.run(self.policy_proba, {self.obs_ph: obs})

def value(self, obs, state=None, mask=None):
return self.sess.run(self._value, {self.obs_ph: obs})
return self.sess.run(self.value_flat, {self.obs_ph: obs})


# Create and wrap the environment
env = gym.make('Breakout-v0')
env = DummyVecEnv([lambda: env])
env = DummyVecEnv([lambda: gym.make('Breakout-v0')])

model = A2C(CustomPolicy, env, verbose=1)
# Train the agent
122 changes: 99 additions & 23 deletions docs/guide/examples.rst
@@ -13,29 +13,31 @@ notebooks:
- `Monitor Training and Plotting`_
- `Atari Games`_
- `Breakout`_ (trained agent included)
- `Hindsight Experience Replay`_
- `RL Baselines zoo`_

.. _Getting Started: https://colab.research.google.com/drive/1_1H5bjWKYBVKbbs-Kj83dsfuZieDNcFU
.. _Training, Saving, Loading: https://colab.research.google.com/drive/1KoAQ1C_BNtGV3sVvZCnNZaER9rstmy0s
.. _Training, Saving, Loading: https://colab.research.google.com/drive/16QritJF5kgT3mtnODepld1fo5tFnFCoc
.. _Multiprocessing: https://colab.research.google.com/drive/1ZzNFMUUi923foaVsYb4YjPy4mjKtnOxb
.. _Monitor Training and Plotting: https://colab.research.google.com/drive/1L_IMo6v0a0ALK8nefZm6PqPSy0vZIWBT
.. _Atari Games: https://colab.research.google.com/drive/1iYK11yDzOOqnrXi1Sfjm1iekZr4cxLaN
.. _Breakout: https://colab.research.google.com/drive/14NwwEHwN4hdNgGzzySjxQhEVDff-zr7O
.. _Hindsight Experience Replay: https://colab.research.google.com/drive/1VDD0uLi8wjUXIqAdLKiK15XaEe0z2FOc
.. _RL Baselines zoo: https://colab.research.google.com/drive/1cPGK3XrCqEs3QLqiijsfib9OFht3kObX

.. |colab| image:: ../_static/img/colab.svg

Basic Usage: Training, Saving, Loading
--------------------------------------

In the following example, we will train, save and load an A2C model on the Lunar Lander environment.
In the following example, we will train, save and load a DQN model on the Lunar Lander environment.

.. image:: ../_static/img/try_it.png
:scale: 30 %
:target: https://colab.research.google.com/drive/1KoAQ1C_BNtGV3sVvZCnNZaER9rstmy0s
:target: https://colab.research.google.com/drive/16QritJF5kgT3mtnODepld1fo5tFnFCoc


.. figure:: https://cdn-images-1.medium.com/max/960/1*W7X69nxINgZEcJEAyoHCVw.gif
.. figure:: https://cdn-images-1.medium.com/max/960/1*f4VZPKOI0PYNWiwt0la0Rg.gif

Lunar Lander Environment

@@ -53,25 +55,21 @@ In the following example, we will train, save and load an A2C model on the Lunar

import gym

from stable_baselines.common.policies import MlpPolicy
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines import A2C
from stable_baselines import DQN

# Create and wrap the environment
# Create environment
env = gym.make('LunarLander-v2')
env = DummyVecEnv([lambda: env])

# Alternatively, you can directly use:
# model = A2C('MlpPolicy', 'LunarLander-v2', ent_coef=0.1, verbose=1)
model = A2C(MlpPolicy, env, ent_coef=0.1, verbose=1)
# Instantiate the agent
model = DQN('MlpPolicy', env, learning_rate=1e-3, prioritized_replay=True, verbose=1)
# Train the agent
model.learn(total_timesteps=100000)
model.learn(total_timesteps=int(2e5))
# Save the agent
model.save("a2c_lunar")
model.save("dqn_lunar")
del model # delete trained model to demonstrate loading

# Load the trained agent
model = A2C.load("a2c_lunar")
model = DQN.load("dqn_lunar")

# Enjoy trained agent
obs = env.reset()
@@ -159,12 +157,11 @@ If your callback returns False, training is aborted early.
import numpy as np
import matplotlib.pyplot as plt

from stable_baselines.ddpg.policies import MlpPolicy
from stable_baselines.common.vec_env.dummy_vec_env import DummyVecEnv
from stable_baselines.ddpg.policies import LnMlpPolicy
from stable_baselines.bench import Monitor
from stable_baselines.results_plotter import load_results, ts2xy
from stable_baselines import DDPG
from stable_baselines.ddpg.noise import AdaptiveParamNoiseSpec
from stable_baselines.ddpg import AdaptiveParamNoiseSpec


best_mean_reward, n_steps = -np.inf, 0
@@ -178,7 +175,7 @@ If your callback returns False, training is aborted early.
global n_steps, best_mean_reward
# Print stats every 1000 calls
if (n_steps + 1) % 1000 == 0:
# Evaluate policy performance
# Evaluate policy training performance
x, y = ts2xy(load_results(log_dir), 'timesteps')
if len(x) > 0:
mean_reward = np.mean(y[-100:])
@@ -202,13 +199,14 @@ If your callback returns False, training is aborted early.
# Create and wrap the environment
env = gym.make('LunarLanderContinuous-v2')
env = Monitor(env, log_dir, allow_early_resets=True)
env = DummyVecEnv([lambda: env])

# Add some param noise for exploration
param_noise = AdaptiveParamNoiseSpec(initial_stddev=0.2, desired_action_stddev=0.2)
model = DDPG(MlpPolicy, env, param_noise=param_noise, memory_limit=int(1e6), verbose=0)
param_noise = AdaptiveParamNoiseSpec(initial_stddev=0.1, desired_action_stddev=0.1)
# Because we use parameter noise, we should use a MlpPolicy with layer normalization
model = DDPG(LnMlpPolicy, env, param_noise=param_noise, verbose=0)
# Train the agent
model.learn(total_timesteps=200000, callback=callback)
model.learn(total_timesteps=int(1e5), callback=callback)


Atari Games
-----------
@@ -440,6 +438,84 @@ This example demonstrates how to train a recurrent policy and how to test it prop
env.render()


Hindsight Experience Replay (HER)
---------------------------------

For this example, we are using `Highway-Env <https://github.com/eleurent/highway-env>`_ by `@eleurent <https://github.com/eleurent>`_.


.. image:: ../_static/img/try_it.png
:scale: 30 %
:target: https://colab.research.google.com/drive/1VDD0uLi8wjUXIqAdLKiK15XaEe0z2FOc


.. figure:: https://raw.githubusercontent.com/eleurent/highway-env/gh-media/docs/media/parking-env.gif

The parking-v0 environment from highway-env.

The parking env is a goal-conditioned continuous control task, in which the vehicle must park in a given space with the appropriate heading.

.. note::

The hyperparameters in the following example were optimized for that environment.


.. code-block:: python

import gym
import highway_env
import numpy as np

from stable_baselines import HER, SAC, DDPG
from stable_baselines.ddpg import NormalActionNoise

env = gym.make("parking-v0")

# Create 4 artificial transitions per real transition
n_sampled_goal = 4

# SAC hyperparams:
model = HER('MlpPolicy', env, SAC, n_sampled_goal=n_sampled_goal,
goal_selection_strategy='future',
verbose=1, buffer_size=int(1e6),
learning_rate=1e-3,
gamma=0.95, batch_size=256,
policy_kwargs=dict(layers=[256, 256, 256]))

# DDPG Hyperparams:
# NOTE: it works even without action noise
# n_actions = env.action_space.shape[0]
# noise_std = 0.2
# action_noise = NormalActionNoise(mean=np.zeros(n_actions), sigma=noise_std * np.ones(n_actions))
# model = HER('MlpPolicy', env, DDPG, n_sampled_goal=n_sampled_goal,
# goal_selection_strategy='future',
# verbose=1, buffer_size=int(1e6),
# actor_lr=1e-3, critic_lr=1e-3, action_noise=action_noise,
# gamma=0.95, batch_size=256,
# policy_kwargs=dict(layers=[256, 256, 256]))


model.learn(int(2e5))
model.save('her_sac_highway')

# Load saved model
model = HER.load('her_sac_highway', env=env)

obs = env.reset()

# Evaluate the agent
episode_reward = 0
for _ in range(100):
action, _ = model.predict(obs)
obs, reward, done, info = env.step(action)
env.render()
episode_reward += reward
if done or info.get('is_success', False):
print("Reward:", episode_reward, "Success?", info.get('is_success', False))
episode_reward = 0.0
obs = env.reset()
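
HER can also be combined with DQN on discrete goal-based tasks. Below is a minimal sketch using the bit flipping environment added in this PR; the import path and the constructor arguments (number of bits, ``continuous``, ``max_steps``) are assumptions, so check the actual signature before relying on it.

.. code-block:: python

    from stable_baselines import HER, DQN
    # Assumed import path for the bit flipping env added in this PR
    from stable_baselines.common.bit_flipping_env import BitFlippingEnv

    N_BITS = 10
    # Assumed constructor: number of bits, discrete actions, episode length
    env = BitFlippingEnv(N_BITS, continuous=False, max_steps=N_BITS)

    model = HER('MlpPolicy', env, DQN, n_sampled_goal=4,
                goal_selection_strategy='future', verbose=1)
    model.learn(total_timesteps=10000)

    obs = env.reset()
    for _ in range(20):
        action, _ = model.predict(obs)
        obs, reward, done, info = env.step(action)
        if done:
            obs = env.reset()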



Continual Learning
------------------