SAC Hyperparameters MountainCarContinuous-v0 - Env with deceptive reward #76
Comments
Hey @araffin, thanks for opening this issue! We've actually observed very similar reward-related problems with SAC recently. I don't remember ever running […]

Here's an example of a simple experiment with the (simulated) screw manipulation environment that we used in [1], where a different constant added to the reward results in extremely different performance (in the figure, lower is better). Another thing we have noticed is that there's a noticeable difference in the sparse-reward setting between setting the non-success/success reward to -1/0 vs. 0/1. I've tried to alleviate the problems with the obvious solutions, such as simply normalizing the rewards […]

One thing I briefly looked into that seemed promising is POP-ART normalization [2]. I have a simple prototype of it implemented at hartikainen/softlearning@master...hartikainen:experiment/claw-costs-test-pop-art; however, there seems to be something wrong in the implementation, because it completely breaks the algorithm even in simple cases. I probably won't have too much time to look into this in the next few weeks at least, but if you are (or anyone else is) interested in testing this out, I'd be happy to help, e.g. with reproducing the problem.

PS. Awesome job implementing SAC in baselines! I was planning to do it myself at some point last year, but you got it out much faster 😄

[1] Haarnoja, Tuomas, et al. Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905, 2018. (https://arxiv.org/abs/1812.05905)

[2] van Hasselt, Hado, et al. Learning values across many orders of magnitude. Advances in Neural Information Processing Systems, 2016. (https://arxiv.org/abs/1602.07714)

cc @avisingh599
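Since the linked prototype seems to misbehave, here is a rough sketch of the core POP-ART update as described in [2], for reference. This is my own illustration, not taken from the softlearning branch above; the `PopArt` class name and the NumPy layout are assumptions.

```python
import numpy as np

class PopArt:
    """Rough POP-ART sketch (van Hasselt et al., 2016), for illustration only.

    ART: adaptively rescale targets using running mean/std of the returns.
    POP: preserve outputs precisely by rescaling the final linear layer so the
    unnormalized predictions do not change when the statistics move.
    """

    def __init__(self, beta=1e-3, eps=1e-8):
        self.mu, self.nu, self.beta, self.eps = 0.0, 1.0, beta, eps  # nu ~ E[y^2]

    @property
    def sigma(self):
        return np.sqrt(max(self.nu - self.mu ** 2, self.eps))

    def update(self, targets, last_w, last_b):
        old_mu, old_sigma = self.mu, self.sigma
        # ART step: update running first and second moments of the targets.
        self.mu = (1 - self.beta) * self.mu + self.beta * np.mean(targets)
        self.nu = (1 - self.beta) * self.nu + self.beta * np.mean(targets ** 2)
        # POP step: rescale the last layer so unnormalized outputs are preserved.
        last_w *= old_sigma / self.sigma
        last_b[:] = (old_sigma * last_b + old_mu - self.mu) / self.sigma
        # Train the critic against normalized targets.
        return (targets - self.mu) / self.sigma
```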
Hi @hartikainen, I finally managed to make it work on MountainCarContinuous by adding additional noise to the actions of the behavior policy, in the same fashion as DDPG does. However, this did not solve my other problems ^^
Interesting, I tested it and saw the same behavior you described. With sparse rewards, SAC + HER manages to find the optimum in the 0/1 setting, whereas it fails in the -1/0 setting. Do you have an idea of where this issue with the reward offset may come from?
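One way to see why a constant offset is not innocuous: with discount gamma, a constant c added to every reward shifts the infinite-horizon return by c / (1 - gamma), which is huge compared to a critic initialized near zero; with episode termination the shift also depends on the remaining horizon, so it is not a uniform shift of all Q-values. A quick back-of-the-envelope check (illustrative only, not from the thread):

```python
# Illustrative only: how much a constant per-step reward offset moves returns.
gamma = 0.99
for c in (-1.0, 0.0, 1.0):
    shift = c / (1 - gamma)  # geometric series sum of c * gamma^t over an infinite horizon
    print(f"offset {c:+.1f} per step -> return shifted by about {shift:+.1f}")
# offset -1.0 per step -> return shifted by about -100.0
# offset +1.0 per step -> return shifted by about +100.0
```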
How did you try to normalize them exactly? (I also tried that using running averages, but it did not help.)
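For concreteness, a running-average reward normalization like the one mentioned above could look roughly like this. It is a generic gym wrapper sketch under my own assumptions, not the actual code either of us used:

```python
import gym
import numpy as np

class RunningRewardNorm(gym.RewardWrapper):
    """Standardize rewards with running mean/std (Welford-style updates).

    Illustrative sketch only; note that subtracting the running mean itself
    introduces a reward offset, which is part of what is being debated here.
    """

    def __init__(self, env, eps=1e-8):
        super().__init__(env)
        self.count, self.mean, self.var, self.eps = 0, 0.0, 1.0, eps

    def reward(self, reward):
        # Update running mean and (population) variance, then standardize.
        self.count += 1
        delta = reward - self.mean
        self.mean += delta / self.count
        self.var += (delta * (reward - self.mean) - self.var) / self.count
        return (reward - self.mean) / np.sqrt(self.var + self.eps)
```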
I think I have to take a look at that paper; it's not the first time I've seen it mentioned. I may try that later (I'm focusing on finishing and testing the HER re-implementation for stable-baselines right now).
You're welcome =) In fact, the release of Spinning Up and this repo accelerated the implementation.
Update: DDPG seems to suffer from the same issue with sparse rewards, but the other way around: it works in the -1/0 setting and fails in the 0/1 one. Using return normalization / POP-ART did not help :/
Hi,
I'm not sure if it's related to my implementation, but I do see that my critic values go negative very fast: since the actions chosen at the start always give negative rewards, the critic estimates increasingly negative values, and because the discounted next-state value is a term in the critic's target, this feeds back into further negative estimates, and so on. It seems like this runaway happens before the action space is sufficiently explored, so the agent never finds good actions. This won't be the case in a 0/1 setting, since the feedback loop then works in the "good" direction, reinforcing good behaviour.
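To make the feedback loop above concrete, the soft Bellman target that SAC regresses its critics towards looks roughly like this (a generic sketch, not tied to any of the implementations in this thread):

```python
import torch

def sac_critic_target(reward, done, next_q1, next_q2, next_log_prob,
                      gamma=0.99, alpha=0.2):
    """Soft Bellman backup used as the critic regression target in SAC.

    If rewards are always <= 0 early in training, the bootstrap term
    gamma * next_v keeps pulling the target further negative, which is the
    runaway effect described above.
    """
    # Clipped double-Q soft state value of the next state.
    next_v = torch.min(next_q1, next_q2) - alpha * next_log_prob
    return reward + gamma * (1.0 - done) * next_v
```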
Same issue here. Any thoughts? Since it seems like an exploration problem, I'm currently trying to tune the temperature parameter.
Can you remember the scale of the additional noise? Or any ideas? thx
Hello, you can find working hyperparameters in the rl zoo; the noise standard deviation is quite high (0.5, compared to the "classic" values of 0.1-0.2 normally used).
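For anyone landing here later, wiring that up looks roughly like the following. This uses the current Stable-Baselines3 API rather than the stable-baselines version discussed in this thread, and the exact values (noise std 0.5, 50k steps) are taken from the discussion above or chosen for illustration, not guaranteed to match the zoo config:

```python
import numpy as np
from stable_baselines3 import SAC
from stable_baselines3.common.noise import OrnsteinUhlenbeckActionNoise

# MountainCarContinuous-v0 has a 1-dimensional action space.
n_actions = 1
action_noise = OrnsteinUhlenbeckActionNoise(
    mean=np.zeros(n_actions), sigma=0.5 * np.ones(n_actions)  # high std, as suggested above
)

model = SAC(
    "MlpPolicy",
    "MountainCarContinuous-v0",
    action_noise=action_noise,  # extra exploration noise on top of the stochastic policy
    ent_coef="auto",
    verbose=1,
)
model.learn(total_timesteps=50_000)
```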
Hi @araffin, nice work there! I noticed you're using an automatic ent_coef with an OU noise of 0.5 to improve exploration.
Did you mean MountainCarContinuous-v0 could be solved by SAC + HER? (Or does it just solve other sparse-reward envs, and we have to use action noise to explore here?)
This is just for convenience; the external noise scale is what makes things work.
Ah no, I was talking about environments tailored for HER.
That's too bad. I just briefly read about HER and was hoping it would solve this. But if we know the goal, like HER does, we could implement reward shaping to lead the agent. In my limited number of experiments, an extra reward of 0.1 * abs(goal - position) made SAC explore better. However, reward shaping changes the objective and could hinder the agent from exploring in other directions, so I guess the improvement I saw was a coincidence... Another thought is that we could increase the weight of less-observed state-action pairs during training (something like a continuous Monte Carlo tree search). I'll search for related papers and hopefully try this idea when I finish the project I have in hand.
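As a concrete version of the shaping experiment described above, a wrapper like the following reproduces the 0.1 * abs(goal - position) term. This is a hypothetical reconstruction; the class name is mine, the original poster's code is not shown in the thread, and the goal-position constant should be checked against your gym version:

```python
import gym

class ShapedMountainCar(gym.Wrapper):
    """Adds the 0.1 * abs(goal - position) bonus described in the comment above.

    Illustrative only; 0.45 is the goal x-position used by
    MountainCarContinuous-v0 in classic gym releases.
    """

    GOAL_POSITION = 0.45

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        position = obs[0]  # first observation component is the car's x-position
        reward += 0.1 * abs(self.GOAL_POSITION - position)
        return obs, reward, done, info
```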
Hello,
I've tried in vain to find suitable hyperparameters for SAC in order to solve MountainCarContinuous-v0.
Even with hyperparameter tuning (see the "add-trpo" branch of the rl baselines zoo), I was not able to solve it consistently (if it finds the goal during random exploration, then it will work; otherwise, it gets stuck in a local minimum).
I also encountered that issue when trying SAC on another environment with a deceptive reward (the bit-flipping env, while trying to apply HER + SAC, see here).
Did you manage to solve that problem? If so, what hyperparameters did you use?
Note: I am using the SAC implementation from stable-baselines, which works pretty well on all other problems (where the reward is dense, though).
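For the bit-flipping + HER + SAC combination mentioned above, the current Stable-Baselines3 API (not the 2019 stable-baselines one used in this thread) lets you sketch it roughly like this; the specific kwargs are illustrative defaults, not tuned values:

```python
from stable_baselines3 import SAC, HerReplayBuffer
from stable_baselines3.common.envs import BitFlippingEnv

# Continuous-action bit-flipping env so it can be used with SAC.
env = BitFlippingEnv(n_bits=15, continuous=True, max_steps=15)

model = SAC(
    "MultiInputPolicy",  # Dict observation: observation / achieved_goal / desired_goal
    env,
    replay_buffer_class=HerReplayBuffer,
    replay_buffer_kwargs=dict(n_sampled_goal=4, goal_selection_strategy="future"),
    verbose=1,
)
model.learn(total_timesteps=20_000)
```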