[feature request] LSTM policies with custom feature extractors #160
Comments
Hello,
How would recurrent policies be approached here? Any recommendations?
I think the implementation style of SB2 would be a good starting point, and there is already some support for that in SB3 (e.g. `predict` returns the recurrent state along with the action). Edit: One thing we might want to consider is updating buffers to always store trajectory information and/or even keep different trajectories separate to make this easier.
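For illustration, a minimal prediction loop that carries the state returned by `predict` between steps. This is a sketch, not part of SB3 itself; the model and environment are just examples, and for a non-recurrent policy the returned state simply stays `None`:

```python
import gym
from stable_baselines3 import PPO

env = gym.make("CartPole-v1")
model = PPO("MlpPolicy", env).learn(1_000)

obs = env.reset()
state = None  # recurrent hidden state; stays None for feed-forward policies
done = False
while not done:
    # predict() returns the action and the (possibly updated) recurrent state
    action, state = model.predict(obs, state=state, deterministic=True)
    obs, reward, done, info = env.step(action)
```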
@Miffyli ... or ...?
I do not dare to give any guides just yet. PyTorch makes things much easier when it comes to RNN-type networks, and we can save ourselves a ton of headache compared to TF if we do things right. On a quick ponder, I think we might be able to implement RNN support at a very low level in the policies (e.g. as arguments to ...). My point being: the first step for RNN support would be to think of clean and functionally satisfactory ("RNN agents that work") solutions before starting to think of code-level things.
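One way to read "low level in the policies" is hidden states passed into and out of the policy's forward pass. A purely hypothetical PyTorch sketch (not SB3 code; all names are made up):

```python
import torch as th
import torch.nn as nn

class RecurrentActor(nn.Module):
    """Hypothetical policy head whose forward() takes and returns the LSTM state."""

    def __init__(self, feature_dim: int, hidden_dim: int, action_dim: int):
        super().__init__()
        self.lstm = nn.LSTM(feature_dim, hidden_dim, batch_first=True)
        self.action_net = nn.Linear(hidden_dim, action_dim)

    def forward(self, features: th.Tensor, lstm_state=None):
        # features: (batch, seq_len, feature_dim); lstm_state: (h, c) tuple or None
        out, new_state = self.lstm(features, lstm_state)
        return self.action_net(out), new_state

# Usage: thread the state through successive calls.
actor = RecurrentActor(feature_dim=8, hidden_dim=32, action_dim=2)
logits, state = actor(th.zeros(1, 1, 8))          # start of episode: state defaults to zeros
logits, state = actor(th.zeros(1, 1, 8), state)   # later steps reuse the returned state
```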
Shouldn't this be part of the breaking changes in the migration guide? At least in the meantime.
@juancroldan do you mean something like this: "..."?
How does a recurrent policy differ from implementing a custom policy with an LSTM/GRU network and just preparing your state in a way that it already contains the previous experience/timeline?
@glebarez Storing hidden states during training, and the whole backprop-through-time, is the largest headache here. We need to update the code to store the hidden states between rollouts, load them up correctly, and then do the training. Some of this infrastructure is there, so it is not that big of a deal (you can get plenty of hints from SB2), but implementing it only at the policy level is not enough (training needs to be updated as well).
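To make the "store hidden states between rollouts" point concrete, here is an illustrative-only sketch (not SB3 code; the toy LSTM "policy" and the plain list standing in for a rollout buffer are made up) of collecting a rollout while recording the per-step hidden state so training can later re-initialize backprop-through-time from it:

```python
import gym
import torch as th
import torch.nn as nn

# Toy recurrent "policy": an LSTM over observations plus a linear action head.
env = gym.make("CartPole-v1")
lstm = nn.LSTM(input_size=env.observation_space.shape[0], hidden_size=32)
action_head = nn.Linear(32, env.action_space.n)

buffer = []            # stands in for a rollout buffer that also stores hidden states
lstm_state = None      # (h, c); None means "zero state", i.e. start of an episode
obs = env.reset()

for step in range(128):
    obs_t = th.as_tensor(obs, dtype=th.float32).view(1, 1, -1)  # (seq, batch, features)
    initial_state = lstm_state                     # state the policy saw before acting
    with th.no_grad():
        out, lstm_state = lstm(obs_t, lstm_state)
        action = action_head(out).argmax().item()
    next_obs, reward, done, info = env.step(action)
    # Store the per-step hidden state so training can re-initialize BPTT from it.
    buffer.append((obs, action, reward, done, initial_state))
    if done:
        lstm_state = None                          # reset hidden state between episodes
        next_obs = env.reset()
    obs = next_obs
```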
@Miffyli thank you for taking the time to explain.
Yup, the plan is to have a much simpler/clearer RNN implementation this time around, using the things we learned during SB2 :). PS: If you feel like working towards anything like this, PRs are always welcome!
* Remove recurrent policies from A2C docs: recurrent policies are not supported yet as of #160 (comment), but the docs say that A2C supports them. Changing it to avoid misleading users.
* Update changelog
Co-authored-by: benjaminjsteenhoek@gmail.com <benjis@iastate.edu>
Any updates on recurrent policies?
No active development as of yet. You could try using the original stable-baselines algorithms if those are suitable for you.
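For reference, a minimal sketch of that SB2 route (assumes the TF1-based stable-baselines package is installed; the environment and hyperparameters are illustrative only):

```python
# Recurrent policy in the original stable-baselines (SB2, TensorFlow 1).
# nminibatches=1 because the LSTM implementation requires the number of
# environments to be divisible by the number of minibatches.
from stable_baselines import PPO2

model = PPO2("MlpLstmPolicy", "CartPole-v1", nminibatches=1, verbose=1)
model.learn(total_timesteps=10_000)
```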
I have a very experimental version of recurrent PPO in an SB3 contrib branch, based on the SB2/CleanRL implementations: https://github.com/Stable-Baselines-Team/stable-baselines3-contrib/compare/feat/ppo-lstm Use it at your own risk :p Closing in favor of #18.
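If you try that branch, usage would presumably look like the sketch below (assuming it exposes a `RecurrentPPO` class with an `"MlpLstmPolicy"`, as the released sb3-contrib package does):

```python
from sb3_contrib import RecurrentPPO

# Train a PPO agent with an LSTM policy; the environment is just an example.
model = RecurrentPPO("MlpLstmPolicy", "CartPole-v1", verbose=1)
model.learn(total_timesteps=10_000)
```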
Hi! It would be awesome to be able to implement LSTM policies in this library, like in the former version. Is there a straightforward way to accomplish this with the current version?
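For the custom-feature-extractor half of the request, SB3 already supports plugging in your own extractor via `policy_kwargs`; only the recurrent half is missing, per the discussion above. A minimal sketch (the extractor architecture, dimensions, and environment here are illustrative only):

```python
import torch.nn as nn
from stable_baselines3 import PPO
from stable_baselines3.common.torch_layers import BaseFeaturesExtractor

class CustomExtractor(BaseFeaturesExtractor):
    """Toy MLP extractor; the architecture is purely illustrative."""

    def __init__(self, observation_space, features_dim: int = 64):
        super().__init__(observation_space, features_dim)
        self.net = nn.Sequential(
            nn.Linear(observation_space.shape[0], features_dim),
            nn.ReLU(),
        )

    def forward(self, observations):
        return self.net(observations)

model = PPO(
    "MlpPolicy",
    "CartPole-v1",
    policy_kwargs=dict(features_extractor_class=CustomExtractor),
    verbose=1,
)
model.learn(total_timesteps=10_000)
```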