Prevent _s3_client from being serialized #847
Hey @wouterzwerink, thanks for the contribution. The changes look fine to me and I'll kick off CI. Related: in some past cases we've seen StreamingDataset not work with multiprocessing via the fork() method, so we usually recommend spawn(). Is there a particular reason you're using fork()? And can you verify that multiprocess dataloading (nonzero number of workers) works fine for you even with fork()? @ethantang-db, mind also taking a look?
I'm using a private fork of VISSL with DDP. It uses forkserver by default (probably because it works nicely with SLURM?). Spawn should also work, but I had not tried it, since forkserver had been working fine until 0.10.0. I'll give 0 workers a try.
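For readers comparing the start methods mentioned above, here is a minimal standard-library sketch of choosing one explicitly. The helper name `pick_context` and its preference order are assumptions made for illustration; they are not part of this library or of VISSL.

```python
import multiprocessing as mp


def pick_context(preferred=("spawn", "forkserver", "fork")):
    """Return a multiprocessing context for the first available start method.

    Hypothetical helper: the preference order is an assumption, not a
    recommendation taken from this project.
    """
    available = mp.get_all_start_methods()
    for method in preferred:
        if method in available:
            return mp.get_context(method)
    raise RuntimeError(f"none of {preferred} is available on this platform")


if __name__ == "__main__":
    ctx = pick_context()
    print("using start method:", ctx.get_start_method())
```

With spawn and forkserver, objects handed to a worker process are pickled, which is where an unpicklable attribute becomes a problem. PyTorch's `DataLoader` accepts a `multiprocessing_context` argument, so a context chosen this way could be passed along there.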
Looks good to me! Thanks for the update on this. Will approve after all tests pass.
Also, in the future we should possibly do this for all the other boto3 clients. My suspicion as to why fork doesn't work well is that those clients probably establish some pybind backend that uses C, which is not picklable.
linting
lgtm
Description of changes:
The latest version (0.10.0) of this library crashes when using S3 as the data source with forking as the multiprocessing start method of the DataLoader, since the S3 session object cannot be pickled.
This PR fixes it by excluding the session from being serialized.
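The general technique can be sketched as follows. This is an illustration under stated assumptions, not the code in this PR: `S3Downloader`, `FakeS3Client`, and `_get_client` are hypothetical names, and the fake client holds a `threading.Lock` only to stand in for an object that cannot be pickled.

```python
import pickle
import threading


class FakeS3Client:
    """Stand-in for a real S3 client (assumption): the lock makes it unpicklable."""

    def __init__(self):
        self._lock = threading.Lock()


class S3Downloader:
    """Hypothetical downloader that creates its client lazily."""

    def __init__(self, bucket):
        self.bucket = bucket
        self._s3_client = None

    def _get_client(self):
        # Create the client on first use, including after unpickling.
        if self._s3_client is None:
            self._s3_client = FakeS3Client()
        return self._s3_client

    def __getstate__(self):
        # Copy the instance state and drop the client so pickle never sees it.
        state = self.__dict__.copy()
        state["_s3_client"] = None
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)


downloader = S3Downloader("example-bucket")
downloader._get_client()  # the client now exists in the parent process

clone = pickle.loads(pickle.dumps(downloader))
print(clone.bucket, clone._s3_client)
```

The copy received by a worker starts without a client and builds its own on first use, so nothing unpicklable has to cross the process boundary.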
Issue #, if available:
Merge Checklist:
Put an x without space in the boxes that apply. If you are unsure about any checklist, please don't hesitate to ask. We are here to help! This is simply a reminder of what we are going to look for before merging your pull request.
General
Tests
pre-commit on my change. (check out the pre-commit section of prerequisites)