-
Notifications
You must be signed in to change notification settings - Fork 177
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PPORecurrent mini batch size inconsistent #113
Comments
Hello,
my guess is that it is because of padding/masking.
if you feed all sequences, then the mini batch size is not constant, which makes it harder to tune hyperparameters. If you feed sequences, you also don't know in advance how much memory you need for the storage, which makes it less efficient (related to DLR-RM/stable-baselines3#1059) as you will need to use list and not fixed size numpy array. |
Thanks for the answers! I checked and you are right, the masking always yields the correct sized obs/act. |
What you want to feed is (batch_size, obs_shape) sequentially:
So, if array could be of any sizes, then you would feed This shape is needed here: stable-baselines3-contrib/sb3_contrib/common/recurrent/policies.py Lines 177 to 179 in a9735b9
not really... but you can compare that implementation to SB2 implementation: https://github.com/hill-a/stable-baselines/blob/14630fbac70aaa633f8a331c8efac253d3ed6115/stable_baselines/ppo2/ppo2.py#L368-L380 In SB2, there is no padding but there is a constraint on the number of steps and number of minibatches, so that the sequence length is always the same. |
Thanks for your elaborate reply. It makes much more sense now. Although my models are training very well with SB3, I was still running into problems where it was taking way too long to train. PPORecurrent with an LSTM was running at only 104 fps on a 24 core machine, against normal PPO with an MLP at around 1000 fps on the same machine. Some profiling showed that the majority of comp time was spent on the backward pass. I was only able to speed it up to 181 fps by moving to a GPU machine (12 core + v100), which is still too slow for my project. So in an effort to try and speed up training I forked sb3-contrib and implemented a version that trains on batches of whole sequences, by feeding a (n_seq, seq_length, obs_shape) batch directly to the Torch api, instead of using the Python for loop in This made the backward pass 4 times faster. I tested this on my laptop GPU, the backward pass went from 20s to 5s. I then ran this in a few configurations on our cluster to find out what the real speedup is in terms of fps, and to confirm that the models actually are able to train well. It achieved an fps of around 500, on the same 12 core + v100 machine. About 2.8 times faster than the 181 fps before. Also the models achieved the same rewards with similar trajectory. Results of two 24h runs are shown below, green is standard sb3-contrib with 512 batch size, blue is my fork of sb3-contrib with batches of 8 whole sequences.
I was able to do this without resorting to lists, keeping the same setup for the rolloutbuffer as sb3. It is just a matter of indexing. The issue regarding variable sequence lengths in a batch is solved by padding sequences so the tensor becomes a cube a again. Below is the code I used to generate the batches:
If you are interested to see more about this, here is my fork. The main changes to look at are in Its a crude implementation just to try and test how well it works so I only implemented functionality I need for my project. If you're interested in this method I could do a proper implementation and submit a pull request for it. From what I am seeing it is not only a lot faster, but the code is also much simpler. |
This is expected.
you mean using: stable-baselines3-contrib/sb3_contrib/common/recurrent/policies.py Lines 182 to 187 in c75ad7d
I guess. If you do not take the done signal into account, you are not treating sequences properly (you are treating n_steps from one env as a big sequence, not matter how many episodes where in there).
good to hear, did you try on envs from https://wandb.ai/sb3/no-vel-envs/reports/PPO-vs-RecurrentPPO-aka-PPO-LSTM-on-environments-with-masked-velocity--VmlldzoxOTI4NjE4 ? |
I have indeed read that, however I am expanding on prior research so I am limited to LSTMs.
Yes you are right, that would probably also work very well in terms of speed, however that code will never be called when seq_length < batch_size.
I am using the done signal to cut the rollout buffer data into the original sequences, which are then batched.
Yes, I tried on gym BipedalWalker v3, which went from 752 fps to 1992 fps, a 2.6 times speedup: Orange is standard PPORecurrent and red is the 3d batching version. Results look quite similar but I didnt spend time on tuning hyperparams for the 3d batching version. Code for running this test can be found here It looks like 3d batching is between 2.5 - 3 times faster than what is currently implemented in sb3_contrib, while keeping same learning behaviour. If you're interested I can run more tests with different params/different envs. |
I meant on the envs from the benchmark (https://wandb.ai/sb3/no-vel-envs/reports/PPO-vs-RecurrentPPO-aka-PPO-LSTM-on-environments-with-masked-velocity--VmlldzoxOTI4NjE4) that actually require recurrence to work (because speed was removed from the observation). They are defined in the RL Zoo: https://github.com/DLR-RM/rl-baselines3-zoo/blob/master/rl_zoo3/import_envs.py#L47-L62
mmh, BipedalWalker v3 doesn't require recurrence to be solve normally, and -90 is far from the optimal performance (around 300).
I need to take a look at that in more details. |
Fair points. I reran the test on BipedalWalker-v3 with the proper hyperparams (from sb3 zoo), and also ran a test on PendulumNoVel-v1. Here are the results: BipedalWalker-v3:
PendulumNoVel-v1:
I can run the other envs in the link too if you want to see those, but these results are pretty convincing to me and match what I am observing in the Mujoco env I am working on. Also you could run them yourself by adding them to this script. You can find my fork here if you want to take a closer look. So far I implemented it for both Box and Dict observation space and Box action space, other spaces probably don't work yet. |
thanks for the additional results =), could you open a draft PR? (that would make it easier for me to review the code) |
Of course! Here it is: #118 |
Thanks for the PR, I think I finally understand why it works.
|
Cool! Glad to hear that.
Good catch. I fixed it so that it also samples the last sequence in the rollout buffer: It should now sample all sequences in the rollout buffer. However, it will drop the final batch in an epoch if n_batches mod batch_size != 0 to prevent very small batches from causing updates. However the next epoch the batch indexes are randomized again, so it is unlikely the same sequences will be in that final batch again.
Agreed. It would be more correct to adjust the learning rate to whatever the current batch size is. In a sense it is kind of like a crude form of learning rate decay now, because the effect single timesteps have on weight updates decreases as a function of episode length(roughly). Would love to help out integrating if you decide on making this part of the package. |
I'm not sure to understand why the last batch would be much smaller than the rest (because you sample whole sequences every time).
I think I'm interested in integrating it (but not replacing current sampling). |
For example if it tries to sample batches of 8 sequences out of a total of 81 sequences, it will result in 10 full batches of 8 sequences, and 1 small batch with only 1 sequence. This is especially a problem when that particular sequence is very short, like only 1 or a few timesteps. Nice example here in the PyTorch docs Also, although it is true that the batching algorithm never cuts up sequences and keeps them whole, some sequences are cut up anyway because they started during a previous rollout so the first part is just not present in the rolloutbuffer.
Cool! In that case, would you want to make it optional to enable this batching method? Or make it a separate algorithm? When we decide on a general architecture I can start working on it. |
yes but sequences are of variable length (so you could sample 1 long sequence vs 3 short sequences).
make it optional (at least try, we will make separate algorithms if it becomes too complex). |
I integrated the code better with the existing code and removed the duplicates. Now about 120 lines smaller. Enabling it is still optional through the Curious what you think regarding the flatten layer. It also seems to me that we could simplify the |
No further work was done here? I'm curious to see here progress on the main sb3 contrib integration. |
Please have a look at #118 Help and further testing is needed. |
I am using
PPORecurrent
with theRecurrentDictRolloutBuffer
. In #103 it's mentioned that batch size is intended to be constant size. However this seems not to be the case.I did some experiments where I print the action batch sizes coming from the
_get_samples()
function.Exp 1: batch_size < sequence length
Initialize PPORecurrent with batch_size = 10. I am sampling only 2000 steps for debugging purposes.
The mean sequence length is 32.3
It looks like whenever n_seq = 2 and it appends 2 different sequences the batch size is higher than the specified 10.
Exp 2: batch_size > sequence length
Initialize PPORecurrent with batch_size = 100
The mean sequence length this time is 31.7
The batch size is now always higher than the specified 100, and different every time.
Is this the intended behavior? It seems wrong to me since it is stated explicitly in the aforementioned issue that batch size is intended to be constant:
Also, now that we are at it, what is the reason for implementing mini batching, instead of feeding batches of whole sequences to the model?
System Info
Describe the characteristic of your environment:
The text was updated successfully, but these errors were encountered: