Support for TorchSnapshot for efficient checkpoint saving and loading #2752

Open
ananthsub opened this issue Oct 24, 2022 · 2 comments

@ananthsub

ananthsub commented Oct 24, 2022

🚀 Feature

TorchSnapshot is a performant, memory-efficient checkpointing library for PyTorch applications, designed with large, complex distributed workloads in mind. Compared with torch.save/torch.load, it includes many optimizations that control memory usage and speed up checkpoint writing for DDP-style workloads. For more information, please check out the readme: https://github.com/pytorch/torchsnapshot#why-torchsnapshot

This could be a nice addition to Ignite, similar to the existing Checkpoint handler.

cc @yifuwang

@ananthsub ananthsub changed the title Support TorchSnapshot for efficient checkpoint saving and loading Support for TorchSnapshot for efficient checkpoint saving and loading Oct 24, 2022
@vfdev-5
Collaborator

vfdev-5 commented Oct 27, 2022

@ananthsub thanks for suggesting this feature! Let us get a bit familiar with TorchSnapshot and see how it can be integrated into Ignite.

One question about usage: in DDP, should Snapshot.take be called by all ranks? And what about the path passed as an argument: where should it live, on node 0, rank 0?

@ananthsub
Author

One question about usage: in DDP, should Snapshot.take be called by all ranks?

Yes, Snapshot.take should always be called on all ranks in a distributed setting. It acts as a collective.

And what about the path passed as an argument: where should it live, on node 0, rank 0?

The path should be a directory, and it should be the same across all ranks. In a multi-node setting, this assumes you have a storage system visible to all nodes (e.g. a cloud object store).
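
To make the two answers above concrete, here is a minimal sketch of how a training script might call Snapshot.take. The helper names (snapshot_dir, save_on_all_ranks) and the base_dir/step parameters are made up for illustration and are not part of TorchSnapshot or Ignite; only Snapshot.take(path=..., app_state=...) comes from the library.

```python
def snapshot_dir(base_dir: str, step: int) -> str:
    """Build the checkpoint directory from values every rank agrees on.

    Nothing rank-specific (rank, hostname, PID) goes into the path, so all
    ranks end up passing the identical directory to Snapshot.take.
    Hypothetical helper, not a TorchSnapshot API.
    """
    return f"{base_dir.rstrip('/')}/step_{step}"


def save_on_all_ranks(base_dir: str, step: int, app_state: dict):
    """Call this on every rank, not only rank 0: Snapshot.take is a collective.

    Hypothetical wrapper; app_state maps names to stateful objects
    (e.g. the model and the optimizer).
    """
    # Imported lazily so the path helper above works without torchsnapshot installed.
    from torchsnapshot import Snapshot

    return Snapshot.take(path=snapshot_dir(base_dir, step), app_state=app_state)
```

In a DDP training loop, each process would call something like save_on_all_ranks(base_dir, step, {"model": model, "optim": optimizer}) without an `if rank == 0` guard, where base_dir points at storage that every node can reach.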
