Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Bigger, Regularized, Optimistic (BRO) #60

Open
wants to merge 25 commits into
base: master
Choose a base branch
from

Conversation

naumix
Copy link

@naumix naumix commented Nov 10, 2024

Addition of the BRO algorithm (https://arxiv.org/abs/2405.16158)

Description

Motivation and Context

  • I have raised an issue to propose this change (required for new features and bug fixes)

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (update in the documentation)

Checklist:

  • I've read the CONTRIBUTION guide (required)
  • I have updated the changelog accordingly (required).
  • My change requires a change to the documentation.
  • I have updated the tests accordingly (required for a bug fix or a new feature).
  • I have updated the documentation accordingly.
  • I have reformatted the code using make format (required)
  • I have checked the codestyle using make check-codestyle and make lint (required)
  • I have ensured make pytest and make type both pass. (required)
  • I have checked that the documentation builds using make doc (required)

Note: You can run most of the checks using make commit-checks.

Note: we are using a maximum length of 127 characters per line

@naumix naumix closed this Nov 10, 2024
@araffin araffin reopened this Nov 10, 2024
@araffin
Copy link
Owner

araffin commented Nov 10, 2024

@naumix thanks for the PR, but you don't have to close it to update it, just push on the same branch.

@araffin araffin mentioned this pull request Nov 10, 2024
Closed
14 tasks
@naumix
Copy link
Author

naumix commented Nov 12, 2024

Thanks, first time merging from a fork

@araffin
Copy link
Owner

araffin commented Nov 25, 2024

Hello,

I've been playing with BRO on different environments (comparing it to Simba and variants) and so far I have the feeling that BRO vs Simba is similar to REDQ vs Droq, in the sense that BRO paved the way for Simba (the same way REDQ paved the way for DroQ), but Simba is more practical/simpler (just replace the network architecture).

In all the small tests I've done (different envs, differents simulators), TQC + Simba + RR=10 + policy_delay=10 consistently performs on-par or better than BRO while having a similar runtime (I used TQC to also have a distributional RL algo).
So my question would be, why should one choose BRO over Simba?

(from a practical point of view, from a research point of view, BRO was a helpful steps that enabled Simba)

Side note: it also seems that resets are currently not implemented?

@naumix
Copy link
Author

naumix commented Dec 3, 2024

Hey! I'm happy you found the scaled architectures working well on your problems. Times of bigger RL models are coming 😄

BRO paved the way for Simba (the same way REDQ paved the way for DroQ), but Simba is more practical/simpler (just replace the network architecture)

Indeed, Simba is based on small changes to the Bronet architecture which is a part of the BRO algorithm. Whereas BRO can be described as SAC + Quantile Q-learning + Bronet + RR = 2 or 10, the Bronet architecture can be used with any other algorithm just like Simba.

So my question would be, why should one choose BRO over Simba?

In terms of applied RL, I think that practitioner would try out a suite of SOTA methods to find which one fits their problems best. Since BRO was shown to perform very competitively in three independent papers, it might be a go-to for at least a few people - since it is not common recorded knowledge that TQC + Simba + RR=10 + policy_delay=10 is a good proposition, it may not be an off-the-shelf choice.

In terms of academic RL, I think that researchers might be interested in using BRO to benchmark future improvements in RL. Since the contemporary reviewing process expects the researchers to use established algorithms as baselines, it might be easier to ask researchers to run BRO than it is to ask to run variations of modern algorithms that are not common (e.g. TQC + Simba + RR=10 + policy_delay=10). What also makes BRO results easier to contextualize is that we ran all environments with single hyperparameters, whereas Simba did some tiny but impactful changes. For example, Simba authors use a single critic network for DMC and MyoSuite, and two critic networks for HumanoidBench. While this seems to be an irrelevant choice, it implies whether the critic will be updated with pessimistic Clipped Double Q-learning targets or optimistic mean of a single target network - and this design choice was shown to have a great impact on DMC performance (https://arxiv.org/abs/2403.00514).

In terms of raw code features, BRO offers the implementation of Bronet, as well as a more general implementation of the pessimistic updates used in algorithms like TD3/SAC/TQC. Our implementation allows users to experiment with different levels of pessimism in Q-value updates, a feature which was shown to greatly impact the performance of actor-critic algorithms (e.g. https://arxiv.org/abs/2102.03765). The implementation of pessimism that we used in BRO is particularly fit for SBX since it is robust with respect to the ensemble size of critics (https://arxiv.org/abs/2110.03375).

Side note: it also seems that resets are currently not implemented?

Yes, I did not implement resets in the pull since I wanted maximal simplicity and it resulted in a semi-elegant solution. However, I am happy to update the pull and add resets - this will have a big performance impact for RR>2 (btw, when comparing BRO to TQC + Simba + RR=10 + policy_delay=10, did you increase BRO replay ratio to 10?)

@araffin
Copy link
Owner

araffin commented Dec 11, 2024

comparing BRO to TQC + Simba + RR=10 + policy_delay=10, did you increase BRO replay ratio to 10

yes I did ;)

Yes, I did not implement resets in the pull

I implemented them for SAC and others here: 9387b64

SOTA methods to find which one fits their problems best.

I would agree if BRO was solving a problem not solved by others.
For instance, TQC and SAC have usually similar performance but on some harder env (like bipedalwalkerhardcore), TQC gives a huge performance boost.
So far, BRO is on par or worse than Simba on the different locomotion envs I tried, and same goes for envs where BRO was tuned on (according to Simba paper).

the Bronet architecture can be used with any other algorithm just like Simba.

That's true. But again, Bronet was shown (Simba paper section 5 I think) to have similar or worse performance than Simba net. Having both doesn't seem to make much sense.

While this seems to be an irrelevant choice,

Having small changes between benchmark suite don't seem unreasonable, although I agree this one tend to have a great impact.
You yourself talked about adapting BRO for MuJoCo envs in our previous email exchanges.

I think that researchers might be interested in using BRO to benchmark future improvements in RL
In terms of raw code features, BRO offers the implementation of Bronet

This is also possible without integrating BRO into SBX.
We can at least put a link in the readme to your SBX implementation.
My other current worry is that the SBX implementation was not benchmarked against the original BRO implementation (I don't have the resources for that).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants