SAC Hyperparameters MountainCarContinuous-v0 - Env with deceptive reward #76

araffin opened this issue Apr 20, 2019 · 10 comments
araffin commented Apr 20, 2019

Hello,

I've tried in vain to find suitable hyperparameters for SAC in order to solve MountainCarContinuous-v0.

Even with hyperparameter tuning (see the "add-trpo" branch of rl baselines zoo), I was not able to solve it consistently: if it finds the goal during random exploration, then it will work; otherwise, it gets stuck in a local minimum.
I also encountered that issue when trying SAC on another environment with a deceptive reward (a bit-flipping env, trying to apply HER + SAC, see here).

Did you manage to solve that problem? If so, what hyperparameters did you use?

Note: I am using the SAC implementation from stable-baselines, which works pretty well on all other problems (but there the reward is dense).


hartikainen commented Apr 20, 2019

Hey @araffin, thanks for opening this issue! We've actually observed very similar reward-related problems with SAC recently. I don't remember ever running MountainCarContinuous-v0 myself, so I can't say whether I would expect that particular task to work out of the box or not, but I can pretty consistently reproduce a similar issue where adding a constant scalar to the rewards makes SAC learn much slower and, in some special cases, get stuck in a local minimum and fail to solve the task at all.

Here's an example of a simple experiment with the (simulated) screw manipulation environment that we used in [1], where different constants added to the reward result in extremely different performance (in the figure, lower is better):
[Figure: learning curves on the simulated screw manipulation task with different constant reward offsets; lower is better.]

Another thing we have noticed is that, in the sparse-reward setting, there's a noticeable difference between setting the non-success/success rewards to -1/0 vs. 0/1.

I've tried to alleviate the problems with the obvious solutions, such as simply normalizing the rewards in the environments, but none of the simple solutions seem to have the desired effect in general. For example, normalizing returns seemed to help in some cases but then failed in others.
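
For concreteness, "simply normalizing the rewards" here means something along the lines of keeping running statistics of the rewards and standardizing them before they feed into the critic targets. A minimal sketch (illustrative only, not the softlearning code):

```python
import numpy as np

class RunningRewardNormalizer:
    """Running mean/std normalizer for rewards (illustrative sketch only)."""

    def __init__(self, epsilon=1e-8):
        self.mean, self.var, self.count = 0.0, 1.0, epsilon

    def update(self, rewards):
        # Parallel-variance update, same idea as the RunningMeanStd helpers in baselines.
        rewards = np.asarray(rewards, dtype=np.float64)
        batch_mean, batch_var, batch_count = rewards.mean(), rewards.var(), rewards.size
        delta = batch_mean - self.mean
        total = self.count + batch_count
        new_mean = self.mean + delta * batch_count / total
        m2 = (
            self.var * self.count
            + batch_var * batch_count
            + delta ** 2 * self.count * batch_count / total
        )
        self.mean, self.var, self.count = new_mean, m2 / total, total

    def normalize(self, rewards):
        return (np.asarray(rewards) - self.mean) / (np.sqrt(self.var) + 1e-8)

# Usage: call update() on each new batch of environment rewards,
# then use normalize(rewards) when computing the critic targets.
```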

One thing I briefly looked into that seemed promising is POP-ART normalization [2]. I have a simple prototype of it implemented at hartikainen/softlearning@master...hartikainen:experiment/claw-costs-test-pop-art; however, there seems to be something wrong in the implementation, because it completely breaks the algorithm even in simple cases. I probably won't have much time to look into this in the next few weeks, but if you are (or anyone else is) interested in testing this out, I'd be happy to help, e.g. with reproducing the problem.
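
For reference, the core of POP-ART [2] is to normalize the value targets with running statistics and, whenever those statistics change, to rescale the last linear layer so the unnormalized predictions are preserved. A minimal numpy sketch of that mechanism (illustrative only, not the linked prototype):

```python
import numpy as np

class PopArtOutput:
    """Linear output layer with POP-ART target normalization (sketch only)."""

    def __init__(self, in_dim, beta=1e-3):
        self.w, self.b = np.zeros(in_dim), 0.0
        self.mu, self.nu = 0.0, 1.0   # running 1st/2nd moments of the targets
        self.beta = beta

    @property
    def sigma(self):
        return np.sqrt(max(self.nu - self.mu ** 2, 1e-8))

    def update_stats(self, targets):
        old_mu, old_sigma = self.mu, self.sigma
        self.mu = (1 - self.beta) * self.mu + self.beta * np.mean(targets)
        self.nu = (1 - self.beta) * self.nu + self.beta * np.mean(np.square(targets))
        # Rescale so sigma_new * (w.x + b_new) + mu_new equals the old unnormalized output.
        self.w *= old_sigma / self.sigma
        self.b = (old_sigma * self.b + old_mu - self.mu) / self.sigma

    def normalized_targets(self, targets):
        # Targets the critic is actually regressed onto.
        return (np.asarray(targets) - self.mu) / self.sigma

    def predict(self, features):
        # Unnormalized prediction used by the rest of the algorithm.
        return self.sigma * (features @ self.w + self.b) + self.mu
```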

PS. Awesome job implementing SAC in baselines! I was planning to do it myself at some point last year, but you got it out much faster 😄

[1] Haarnoja, Tuomas, et al. "Soft Actor-Critic Algorithms and Applications." arXiv preprint arXiv:1812.05905, 2018. (https://arxiv.org/abs/1812.05905)
[2] van Hasselt, Hado P., et al. "Learning Values Across Many Orders of Magnitude." Advances in Neural Information Processing Systems, 2016, pp. 4287-4295. (http://papers.nips.cc/paper/6076-learning-values-across-many-orders-of-magnitude.pdf)

cc @avisingh599


araffin commented Apr 21, 2019

Hi @hartikainen ,

I finally managed to make it work on MountainCarContinuous by adding additional noise to the actions of the behavior policy, in the same fashion as DDPG does. However, this did not solve my other problems ^^
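
For reference, the "additional noise" here is an Ornstein-Uhlenbeck process added to the actions sampled from the SAC policy during rollouts, the same kind of noise DDPG uses for exploration. A minimal sketch of the idea (illustrative only; `policy.sample` below is a placeholder, not a stable-baselines call):

```python
import numpy as np

class OUNoise:
    """Ornstein-Uhlenbeck exploration noise (sketch of the idea, not library code)."""

    def __init__(self, n_actions, sigma=0.5, theta=0.15, dt=1e-2):
        self.theta, self.sigma, self.dt = theta, sigma, dt
        self.state = np.zeros(n_actions)

    def reset(self):
        self.state[:] = 0.0

    def __call__(self):
        self.state = (
            self.state
            - self.theta * self.state * self.dt
            + self.sigma * np.sqrt(self.dt) * np.random.randn(self.state.size)
        )
        return self.state

# During rollouts, perturb the sampled SAC action and clip it back to the action space:
# noise = OUNoise(env.action_space.shape[0], sigma=0.5)  # 0.5 as discussed later in this thread
# action = np.clip(policy.sample(obs) + noise(),
#                  env.action_space.low, env.action_space.high)
```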

Another thing we have noticed is that, in the sparse-reward setting, there's a noticeable difference between setting the non-success/success rewards to -1/0 vs. 0/1.

Interesting, I tested and saw the same behavior you described. With sparse rewards, SAC + HER manages to find the optimum in the 0/1 setting whereas it fails in the -1/0 setting.

Do you have an idea of where this issue with the reward offset may come from?

such as simply normalizing the rewards in the environments

How did you try to normalize them exactly? (I also tried that using running averages, but it did not help.)

One thing I briefly looked into that seemed promising is POP-ART normalization

I think I need to take a look at that paper; this is not the first time I've seen it mentioned. I might try it later (I'm focusing on finishing and testing the HER re-implementation for stable-baselines right now).

PS. Awesome job implementing SAC in baselines! I was planning to do it myself at some point last year, but you got it out much faster

You're welcome =) In fact, the releases of Spinning Up and this repo accelerated the implementation.


araffin commented Apr 21, 2019

Update: DDPG seems to suffer from the same issue with sparse rewards, but the other way around: it works in the -1/0 setting and fails in the 0/1 one. Using return normalization / POP-ART did not help :/

yujia21 commented May 27, 2019

sac.zip

Hi,
I've been trying to implement this for keras-rl but have not managed to get it to work. I'm not sure if there's an error in my code or if it's the environments/rewards that I am testing on; so far I have only tested on Pendulum, MountainCar, LunarLander, and BipedalWalker.

With sparse rewards, SAC + HER manages to find the optimum in the 0/1 setting whereas it fails in the -1/0 setting.

I'm not sure if it's related to my implementation, but I do see that my critic values go negative very fast: since the actions chosen at the start always give negative rewards, the critic's estimates drift further and further into the negative, and because the discounted next-state value is a term in the critic's target, this feeds back into even more negative estimates, and so on. It seems like this blow-up happens before the action space is sufficiently explored, so the agent never finds good actions. This won't be the case in the 0/1 setting, since the feedback is in the "good" direction, reinforcing good behaviour.
Let me know your thoughts on this.
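
For intuition (an added back-of-the-envelope note, not part of the original comment): with a constant per-step reward of $-1$ and discount factor $\gamma$, the bootstrapped value of any trajectory that never reaches the goal converges towards

$Q(s, a) \approx \sum_{t=0}^{\infty} \gamma^t \cdot (-1) = -\frac{1}{1-\gamma},$

i.e. about $-100$ for $\gamma = 0.99$, whereas in the 0/1 setting the value of those same trajectories stays at 0.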

ritou11 commented Mar 25, 2020

Same issue here. Any thoughts?

Since it seems like an exploration problem, I'm currently trying to tune the temperature parameter $\alpha$. A larger $\alpha$ leads to better exploration in some cases, but not always; however, it does lead to a worse final reward in the cases that do converge.

ritou11 commented Mar 25, 2020

I finally managed to make it work on MountainCarContinuous by adding additional noise to the actions of the behavior policy, in the same fashion as DDPG does.

Can you remember the scale of the additional noise? Or do you have any other ideas? Thanks!


araffin commented Mar 25, 2020

Hello,

You can find working hyperparameters in the rl zoo; the noise standard deviation is quite high (0.5, compared to the "classic" values of 0.1-0.2 normally used).
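
For concreteness, here is a rough sketch of such a setup using the stable-baselines3 API (the 0.5 noise scale is the value mentioned above; the other settings are plain defaults/placeholders, not the tuned zoo hyperparameters):

```python
import gym
import numpy as np
from stable_baselines3 import SAC
from stable_baselines3.common.noise import OrnsteinUhlenbeckActionNoise

env = gym.make("MountainCarContinuous-v0")
n_actions = env.action_space.shape[0]

# Large OU noise (sigma=0.5) to force exploration, as discussed above.
action_noise = OrnsteinUhlenbeckActionNoise(
    mean=np.zeros(n_actions), sigma=0.5 * np.ones(n_actions)
)

model = SAC("MlpPolicy", env, action_noise=action_noise, ent_coef="auto", verbose=1)
model.learn(total_timesteps=50_000)  # placeholder training budget
```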

ritou11 commented Mar 25, 2020

Hi @araffin ,

Nice work there! I noticed you're using an automatic ent_coef with an OU noise of 0.5 to improve exploration.
After a day of trying, I finally understand the sparse-reward difficulty here and why you even called the reward "deceptive". In your previous reply:

Interesting, I tested and saw the same behavior you described. With sparse rewards, SAC + HER manages to find the optimum in the 0/1 setting whereas it fails in the -1/0 setting.

Did you mean MountainCarContinuous-v0 could be solved by SAC + HER? (Or does it just solve other sparse-reward envs, and we have to use action noise to explore here?)


araffin commented Mar 25, 2020

automatic ent_coef

This is just for convenience; the external noise scale is what makes things work.

Did you mean MountainCarContinuous-v0 could be solved by SAC + HER?

Ah no, I was talking about environments tailored for HER.

ritou11 commented Mar 25, 2020

Ah no, I was talking about environments tailored for HER.

That's too bad. I had just briefly read about HER and was hoping it would solve this.

But if we know the goal, like HER does, we could implement reward shaping to lead the agent. In my limited number of experiments, an extra reward of 0.1 * abs(goal - position) made SAC explore better. However, reward shaping changes the objective and could hinder the agent from exploring in other directions, so I guess the improvement I saw was a coincidence...
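
A sketch of what such a shaping wrapper could look like with the gym API of the time (the 0.1 coefficient is the one from the comment above; `goal_position` is read from gym's MountainCar implementation, which exposes it as an attribute):

```python
import gym

class DistanceShapingWrapper(gym.Wrapper):
    """Adds the shaping term mentioned above: coef * |goal - position| (sketch only)."""

    def __init__(self, env, coef=0.1):
        super().__init__(env)
        self.coef = coef
        # gym's Continuous_MountainCarEnv exposes the goal position (0.45).
        self.goal_position = env.unwrapped.goal_position

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        position = obs[0]  # first observation dimension is the car position
        reward += self.coef * abs(self.goal_position - position)
        return obs, reward, done, info

env = DistanceShapingWrapper(gym.make("MountainCarContinuous-v0"))
```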

Another thought is that we could increase the weight of less-observed state-action pairs during training (something like a continuous Monte Carlo tree search). I'll search for related papers and hopefully try this idea when I finish the project at hand.
