[feature request] Implement goal-parameterized algorithms (HER) #198
Some questions to clarify my understanding:
You meant the second, no?
Good news then =)
For debugging and for your tests, the Fetch env is ok. However, we will need either to adapt an existing env (e.g. Pendulum-v0) or create an artificial one (e.g. an Identity env) in order to write unit tests for it (MuJoCo requires a license and we don't have one for Travis...). For an open-source GoalEnv, you can also look into the parking env by @eleurent in https://github.com/eleurent/highway-env
(Yes, I meant the second, as Ashley did.) I guess we could easily modify Pendulum or MountainCarContinuous, yes.
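To make the idea of an artificial test env concrete, here is a minimal sketch of what such a GoalEnv could look like, assuming the standard gym.GoalEnv dict interface; the class name, threshold and episode length are illustrative, not the final test env:

```python
import numpy as np
import gym
from gym import spaces

# Hypothetical artificial GoalEnv for unit tests (no MuJoCo needed): the agent
# must output an action that matches a randomly sampled goal.
class IdentityGoalEnv(gym.GoalEnv):
    def __init__(self, dim=2, ep_length=50):
        self.dim = dim
        self.ep_length = ep_length
        self.action_space = spaces.Box(-1.0, 1.0, shape=(dim,), dtype=np.float32)
        box = spaces.Box(-1.0, 1.0, shape=(dim,), dtype=np.float32)
        self.observation_space = spaces.Dict({
            'observation': box,
            'achieved_goal': box,
            'desired_goal': box,
        })

    def reset(self):
        self.current_step = 0
        self.goal = self.observation_space.spaces['desired_goal'].sample()
        self.state = np.zeros(self.dim, dtype=np.float32)
        return self._get_obs()

    def step(self, action):
        self.state = np.clip(action, -1.0, 1.0).astype(np.float32)
        self.current_step += 1
        obs = self._get_obs()
        reward = self.compute_reward(obs['achieved_goal'], obs['desired_goal'], {})
        done = self.current_step >= self.ep_length
        return obs, reward, done, {}

    def compute_reward(self, achieved_goal, desired_goal, info):
        # Sparse reward in the Fetch style: 0 when close to the goal, -1 otherwise
        return 0.0 if np.linalg.norm(achieved_goal - desired_goal) < 0.05 else -1.0

    def _get_obs(self):
        return {'observation': self.state.copy(),
                'achieved_goal': self.state.copy(),
                'desired_goal': self.goal.copy()}
```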
I thought return meant the sum of discounted rewards for one episode... In the Fetch env, I think it is a sparse reward (0 almost everywhere except when reaching the goal), so the last reward is the same as the sum, no?
In Fetch, the reward is always -1, and 0 when the goal is touched. It's only for logging; the algorithm uses the transition-based rewards. In that case, saying that the sum is -49 for 50 steps, for instance, does not indicate whether the goal was reached in the middle of the episode (in that case the episode is not solved) or at the end (episode solved).
I see... then for GoalEnv, it makes sense to have that feature (showing only the last return).
The problem is that it's computed outside of the env, in the learn function of the algorithm. That would require updating all algorithms to allow that.
I would only update algorithms that can be used with HER (and therefore GoalEnv); I'm not sure it makes sense for other types of env.
Is this still being worked on? This feature is something I'm quite interested in.
Update: In the meantime, I also tried implementing it on top of Spinning Up's SAC. It learns perfectly on FetchReach but not at all on FetchPush. I'm quite puzzled because I already tried to reproduce results on FetchPush last year using a TD3 base, and it also worked perfectly on FetchReach and not on FetchPush. Either I'm doing something wrong, or there is some special trick in the OpenAI Baselines that I didn't catch. Their version uses 19 workers in parallel, each doing 2 rollouts, computing an update using a batch of 256 and summing the 19 updates (yes, they sum, they don't average). I would say it's roughly equivalent to doing 38 collection rollouts, then using a 19 times bigger batch size and a 19 times bigger learning rate (the sum of 19 updates). I tried this as well but it got even worse. I don't have much time these days, so it's on pause right now.
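For reference, a rough sketch of the equivalence argument above (illustrative code, not the OpenAI Baselines implementation): summing the per-worker updates matches a single update on the concatenated batch with a learning rate scaled by the number of workers, assuming equal batch sizes and that all gradients are taken at the same parameters.

```python
import numpy as np

def summed_worker_updates(grad_fn, worker_batches, lr):
    # One gradient per worker (each a mean over its own batch), updates summed.
    return sum(-lr * grad_fn(batch) for batch in worker_batches)

def big_batch_update(grad_fn, worker_batches, lr):
    # Single update on the concatenated data with the learning rate scaled by
    # the number of workers: lr * n * mean(all) == lr * sum of per-worker means
    # (exact only for equal-sized batches and gradients at the same parameters).
    n_workers = len(worker_batches)
    big_batch = np.concatenate(worker_batches, axis=0)
    return -(lr * n_workers) * grad_fn(big_batch)
```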
@ccolas don't hesitate to ping me if you need some help to double check some parts ;)
@ccolas @hill-a I'm taking over for this one. Here is my current plan (and current progress):
To overcome a current limitation of stable-baselines (dict obs are not supported), I'll do something similar to @hill-a, using a wrapper over the environment. Roadmap:
Note: I don't consider the last point to be a priority.
@ccolas In short, I made those choices, which are mostly wrappers:
To sum it up, the implementation now looks pretty simple: it is just a wrapper around a model and an env (I still don't understand why the original baselines were so complicated).
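To illustrate the env-side part of that wrapper idea, here is a minimal sketch of flattening a GoalEnv dict observation into a single vector (class and attribute names are hypothetical, not the actual stable-baselines code):

```python
import numpy as np
import gym

KEY_ORDER = ['observation', 'achieved_goal', 'desired_goal']

class GoalEnvFlattenWrapper(gym.Wrapper):
    """Flatten the dict observation of a GoalEnv into one vector so that any
    off-policy model that only handles flat observations can be reused."""
    def __init__(self, env):
        super(GoalEnvFlattenWrapper, self).__init__(env)
        dict_spaces = env.observation_space.spaces
        # Keep the size of each component so the replay buffer can split the
        # flat vector back into obs / achieved goal / desired goal.
        self.sizes = {key: int(np.prod(dict_spaces[key].shape)) for key in KEY_ORDER}
        low = np.concatenate([dict_spaces[key].low.ravel() for key in KEY_ORDER])
        high = np.concatenate([dict_spaces[key].high.ravel() for key in KEY_ORDER])
        self.observation_space = gym.spaces.Box(low, high, dtype=np.float32)

    def _flatten(self, obs_dict):
        return np.concatenate([obs_dict[key].ravel() for key in KEY_ORDER])

    def reset(self, **kwargs):
        return self._flatten(self.env.reset(**kwargs))

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        return self._flatten(obs), reward, done, info
```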
Super cool, thanks!
@ccolas In my experiments, I found that SAC is quite hard to tune for problems with deceptive rewards (I have not found good hyperparameters yet for MountainCar-v0, for instance), so this can be an issue when working with problems like FetchPush. I think I will open an issue on the original repo. Also, the original baselines have several tricks (and I don't know which ones are useful or not) compared to the original HER paper:
PS: SAC and DDPG are now supported on my dev branch ;) (just missing saving/loading for now)
@ccolas interesting discussion on SAC with sparse rewards is happening there ;):
Very good work, thanks!
I'll try to run it on the MuJoCo envs to see how it does!
Yes, in my experience, it allows better exploration and changes a lot.
That's true, but I also feel it will be less clear in the implementation, no? Among the tricks I forgot, there is also an L2 loss on the actions (and I have to check the DDPG implementation, to see what is different from the one in the baselines). Linking the related issue of highway-env: Farama-Foundation/HighwayEnv#15. Edit: another additional trick: random_eps, they perform pure random exploration a fraction of the time.
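For concreteness, a small sketch of those two tricks as they are usually described (names and default values are illustrative; the exact form in the OpenAI Baselines may differ):

```python
import numpy as np

def exploration_action(policy_action, action_space, random_eps=0.3, noise_std=0.2):
    # random_eps trick: with probability random_eps, take a purely random action;
    # otherwise add Gaussian noise to the deterministic policy action.
    if np.random.rand() < random_eps:
        return action_space.sample()
    noisy_action = policy_action + noise_std * np.random.randn(*policy_action.shape)
    return np.clip(noisy_action, action_space.low, action_space.high)

def actor_loss_with_action_penalty(q_values, actions, action_l2_coef=1.0):
    # L2 action penalty trick: DDPG-style actor loss plus a term that keeps
    # the (normalized) actions small.
    return -np.mean(q_values) + action_l2_coef * np.mean(np.square(actions))
```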
@ccolas The current hyperparameters I'm using for the highway-env (for SAC and DDPG), which work better than the default ones (close to the defaults found in the OpenAI implementation):

SAC:

```python
n_sampled_goal = 4
model = HER('MlpPolicy', env, SAC, n_sampled_goal=n_sampled_goal, goal_selection_strategy='future',
            verbose=1, buffer_size=int(1e6),
            learning_rate=1e-3,
            gamma=0.95, batch_size=256, policy_kwargs=dict(layers=[256, 256, 256]))
```

DDPG:

```python
n_actions = env.action_space.shape[0]
noise_std = 0.2
action_noise = NormalActionNoise(mean=np.zeros(n_actions), sigma=noise_std * np.ones(n_actions))
n_sampled_goal = 4
model = HER('MlpPolicy', env, DDPG, n_sampled_goal=n_sampled_goal, goal_selection_strategy='future',
            verbose=1, buffer_size=int(1e6),
            actor_lr=1e-3, critic_lr=1e-3, action_noise=action_noise,
            gamma=0.95, batch_size=256, policy_kwargs=dict(layers=[256, 256, 256]))
```

Note:

EDIT: the network architecture seems to have a great impact here... (I updated the SAC hyperparams)
Also related for the tricks: rail-berkeley/rlkit#35
Update: working version with HER + DDPG on FetchPush. Still a bug with VecEnvs, but it should be ready to merge soon (see PR)
I saw that the algorithms can perform well with DDPG+HER and SAC+HER on the FetchPush environment. How about Pick and Place? I saw the issue mentioned by fisherxue is still open.
You can take a look at the trained agent (and hyperparameters) in the zoo ;)
I'd like to implement Hindsight Experience Replay (HER). This can be based on any goal-parameterized off-policy RL algorithm.
Goal-parameterized architectures: they require a variable for the current goal and one for the current outcome. By outcome, I mean anything that is required to compute the reward in the process of targeting the goal. E.g., the RL task is to reach a 3D target (the goal) with a robotic hand: the position of the target is the goal, the position of the hand is the outcome, and the reward is a function of the distance between the two. Goal and outcome are usually subparts of the state space.
How Gym handles this: In Gym, there is a class called GoalEnv to deal with such environments.
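To make the mapping explicit, the dict keys below are the standard gym GoalEnv ones, and FetchReach is just one example of such an env (it needs MuJoCo):

```python
import gym

env = gym.make('FetchReach-v1')
obs = env.reset()

goal = obs['desired_goal']      # the goal, e.g. the 3D target position
outcome = obs['achieved_goal']  # the outcome, e.g. the current hand position
state = obs['observation']      # the rest of the observation

# The reward can be recomputed for any (outcome, goal) pair,
# which is exactly what HER needs when substituting goals:
reward = env.compute_reward(outcome, goal, info=None)
```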
Stable-baselines does not consider this so far. The replay buffer and the BasePolicy, BaseRLModel and OffPolicyRLModel classes only consider observations, and are not made to include a notion of goal or outcome. Two solutions:
I think the second is clearer as it separates observations from goals and outcomes, but it would probably make the code less easy to follow and would require more changes than the first option. So let's go for the first, as Ashley started.
First thoughts on how it could be done.
We need (as Ashley started to do) a wrapper around the gym environment. GoalEnvs are different from usual envs because they return a dict in place of the former observation vector. This wrapper would unpack the observation into obs, goal, and outcome from GoalEnv.step, and return a concatenation of all of those. Ashley assumed that the goal was in the observation space, so that the concatenation was twice as long as the observation. This is generally not true, so we would need to keep the sizes of the goal and outcome spaces as attributes. The wrapper would keep the different spaces as attributes, the function to sample goals, and the reward function.
A multi-goal replay buffer to implement HER replay. It takes the observation from the buffer and re-decomposes it into obs, goal, and outcome before performing replay (see the sketch below).
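A minimal sketch of HER replay with the 'future' strategy, assuming the episode transitions have already been decomposed back into obs, goal, and outcome as described above (function and key names are illustrative, not the final implementation):

```python
import numpy as np

def her_sample_transitions(episode, compute_reward, n_sampled_goal=4):
    """episode: list of dicts with keys 'obs', 'action', 'next_obs',
    'achieved_goal', 'next_achieved_goal', 'desired_goal'."""
    transitions = []
    for t, transition in enumerate(episode):
        # Keep the original transition
        transitions.append(transition)
        # Add copies where the desired goal is replaced by a goal achieved
        # later in the same episode ('future' strategy)
        for _ in range(n_sampled_goal):
            future_t = np.random.randint(t, len(episode))
            new_goal = episode[future_t]['next_achieved_goal']
            relabeled = dict(transition)
            relabeled['desired_goal'] = new_goal
            # Recompute the reward with the substituted goal
            relabeled['reward'] = compute_reward(
                transition['next_achieved_goal'], new_goal, None)
            transitions.append(relabeled)
    return transitions
```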
I think it does not require too much work after what Ashley started. It would take a few modifications to integrate the GoalEnv of gym, as it is a standard way to use multi-goal environments, and then correcting the assumption he made about the dimension of the goal.
If you're all ok with it, I will start in that direction and test it on the Fetch environments. In the baselines, their performance is achieved with 19 processes in parallel; they basically average the updates of the 19 actors. I'll try first without parallelization.