Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make CheckpointManager friendlier to custom StorageWriter/StorageReader #789

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

dimdi-y
Copy link

@dimdi-y dimdi-y commented Jan 12, 2025

The goal of this PR is to make using custom torch.distributed.checkpoint.storage.StorageReader/StorageWriters with torchtitan easier.
For example, using S3StorageWriter from s3-connector-for-pytorch would allow storing checkpoints in AWS S3.

Two changes needed to be made to allow this:

  1. Allow passing storage_reader and storage_writer into the CheckpointManager.
  2. Isolate the usage of os.path module under overridable methods (as this logic is specific to the default torch.distributed.checkpoint.filesystem.FileSystemReader/FileSystemWriter).

I also added basic tests for CheckpointManager to make sure I didn't break the original behavior

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Jan 12, 2025
@dimdi-y dimdi-y changed the title Dimdi y/extendable checkpointing Make CheckpointManager friendlier to custom StorageWriter/StorageReader Jan 13, 2025
@tianyu-l tianyu-l requested review from fegin and tianyu-l January 13, 2025 22:14
@dimdi-y
Copy link
Author

dimdi-y commented Jan 13, 2025

I see that CI fails on my new tests as they use CUDA and CI does not support that.
But the test for async mode fails on CPU (I reported the underlying bug in this issue).

Should I mark the test for async mode as skipped and move to CPU?

@@ -402,24 +438,56 @@ def sync_func():
sync_func()
self.staging = False

def _check_checkpoint_exitsts(self, step: int) -> bool:
Copy link
Contributor

@lessw2020 lessw2020 Jan 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this PR!
I wanted to note that this should be "exists" rather than the current spelling, "exitsts"...thus _check_checkpoint_exists.
In a way, it's a nit but it's also important as misspelled apis can create confusion (have personally hit such things where type in correct spelling but api has mis-spelling and not obvious failures result.
There are 2 calls here to this function that would need to also be update.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor point - '_verify_checkpoint_exists' would be smoother name (vs check_checkpoint).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch, thanks! Will surely fix this.

@fegin
Copy link
Contributor

fegin commented Jan 13, 2025

Isn't fsspec enough for S3 usage? If fsspec is installed, DCP should also accept S3 path.

@dimdi-y
Copy link
Author

dimdi-y commented Jan 14, 2025

@fegin

Thank you for taking a look!

Let me provide some more motivation here to answer your question:

Isn't fsspec enough for S3 usage? If fsspec is installed, DCP should also accept S3 path.

I haven't checked this myself yet, but it should work with DCP itself, yes.

But:

  1. A user might still want to pass a custom StorageWriter/StorageReader implementation (e.g. adding retries for S3 uploads) or a custom-configured version of the default (e.g. changing the default parameters of FileSystemWriter)
  2. Some of the functionality in the CheckpointManager assumes a local filesystem (e.g. os.listdir is used to locate the last checkpoint to load), which needs to be overridden if a user wants to use S3 for checkpointing.

addressing the above points is the main focus of this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Meta Open Source bot.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants