Prevent _s3_client from being serialized #847

Merged
merged 5 commits into from
Dec 9, 2024

wouterzwerink
Contributor

Description of changes:

The latest version (0.10.0) of this library crashes when S3 is the data source and the DataLoader uses forking as its worker start method, because the S3 session object cannot be pickled.

This PR fixes it by excluding the session from being serialized.
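
As a rough illustration of the general technique (not the actual diff in this PR), an unpicklable attribute can be left out of an object's pickled state with `__getstate__` and recreated lazily afterwards. The `FakeS3Client` and `Downloader` classes below are hypothetical stand-ins; only the `_s3_client` attribute name is taken from the PR title.

```python
import pickle
import threading


class FakeS3Client:
    """Hypothetical stand-in for a real S3 client.

    It holds a lock, which makes it unpicklable.
    """

    def __init__(self):
        self._lock = threading.Lock()


class Downloader:
    """Hypothetical downloader that creates its client lazily."""

    def __init__(self, bucket):
        self.bucket = bucket
        self._s3_client = None

    def client(self):
        # Create the client on first use, so a freshly unpickled copy
        # in a worker process builds its own client.
        if self._s3_client is None:
            self._s3_client = FakeS3Client()
        return self._s3_client

    def __getstate__(self):
        # Copy the instance state and drop the unpicklable client.
        state = self.__dict__.copy()
        state["_s3_client"] = None
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)


downloader = Downloader("my-bucket")
downloader.client()  # the client now exists in this process

# Pickling succeeds because the client is excluded from the state.
restored = pickle.loads(pickle.dumps(downloader))
```

The restored copy keeps its configuration (`bucket`) but starts without a client, so each worker process creates its own on first use.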

Issue #, if available:

Merge Checklist:

Put an x without space in the boxes that apply. If you are unsure about any checklist, please don't hesitate to ask. We are here to help! This is simply a reminder of what we are going to look for before merging your pull request.

General

  • I have read the contributor guidelines
  • This is a documentation change or typo fix. If so, skip the rest of this checklist.
  • I certify that the changes I am introducing will be backward compatible, and I have discussed concerns about this, if any, with the MosaicML team.
  • I have updated any necessary documentation, including README and API docs (if appropriate).

Tests

  • I ran pre-commit on my change. (check out the pre-commit section of prerequisites)
  • I have added tests that prove my fix is effective or that my feature works (if appropriate).
  • I ran the tests locally to make sure they pass. (check out testing)
  • I have added unit and/or integration tests as appropriate to ensure backward compatibility of the changes.

@snarayan21
Collaborator

Hey @wouterzwerink, thanks for the contributions. Changes look fine to me and I'll kick off CI.

Related: In the past, we've seen cases where StreamingDataset does not work with multiprocessing via the fork() method, so we usually recommend using spawn(). Is there a particular reason you're using fork()? And can you verify that multiprocess dataloading (nonzero number of workers) is working fine for you even with fork()?

@ethantang-db mind also taking a look?

@wouterzwerink
Contributor Author

I'm using a private fork of VISSL with DDP. It uses forkserver by default (probably because it works nicely with SLURM?). Spawn should also work, but I have not tried it, since forkserver had been working fine until 0.10.0.

I'll give 0 workers a try

ethantang-db
ethantang-db previously approved these changes Dec 9, 2024
Contributor

@ethantang-db ethantang-db left a comment

Looks good to me! Thanks for the update on this. Will approve after all tests pass.

@ethantang-db
Contributor

Also, in the future we should possibly do this for all the other boto3 clients. My suspicion as to why fork doesn't work well is that those clients probably set up some pybind backend that uses C, which is not pickle-able.

@ethantang-db ethantang-db self-requested a review December 9, 2024 17:26
@ethantang-db ethantang-db dismissed their stale review December 9, 2024 17:34

Blocking until all tests pass

Collaborator

@snarayan21 snarayan21 left a comment

linting

streaming/base/storage/download.py (two review comments, outdated and resolved)
Collaborator

@snarayan21 snarayan21 left a comment

lgtm

@snarayan21 snarayan21 merged commit 69304c5 into mosaicml:main Dec 9, 2024
7 checks passed