
Releases: NVIDIA/nvidia-resiliency-ext

NVIDIA Resiliency Extension v0.2.0

17 Dec 03:42

Release Notes

NVIDIA Resiliency Extension is a Python package that helps framework developers and users implement fault-tolerant features. It improves effective training time by minimizing downtime due to failures and interruptions.

NVIDIA Resiliency Extension v0.2

Highlights

We are excited to introduce many new features in NVIDIA Resiliency Extension v0.2.

  • In-process restart – Provides a mechanism to restart training without killing the running process, via a Python function wrapper (see the first sketch after this list). Compared to a traditional scheduler-level restart, restarting within the same process removes the overheads of launching a new scheduler job, starting a container, initializing a new Python interpreter, loading dependencies, and creating a new CUDA context.

  • Asynchronous checkpoint – Provides core utilities to run checkpointing routines in the background. It uses torch.multiprocessing to fork a temporary process that initiates the asynchronous checkpointing routine. The application can poll the status of an asynchronous checkpoint save in a non-blocking manner and specify a user-defined finalization step to run once all ranks finish their background checkpoint saving (see the second sketch after this list).

  • Local checkpoint – Provides a mechanism to create a checkpoint in local host memory. The local checkpointing mechanism is implemented via the Python LocalCheckpointManager class, which operates on a TensorAwareStateDict wrapper. This wrapper encapsulates the operations necessary for efficient replication and data transfers (see the third sketch after this list).
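
In-process restart is exposed as a function wrapper. Below is a minimal sketch; the decorator name (inprocess.Wrapper) and the injected call_wrapper argument reflect the v0.2 API as we understand it and should be checked against the package documentation.

```python
import torch

import nvidia_resiliency_ext.inprocess as inprocess


@inprocess.Wrapper()
def train(call_wrapper=None):
    # On a detected fault, this body is invoked again inside the same
    # process: no new scheduler job, container, Python interpreter,
    # or CUDA context is created.
    torch.distributed.init_process_group(backend="nccl")
    ...  # build the model, restore the latest checkpoint, run the loop


if __name__ == "__main__":
    train()
```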
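
For asynchronous checkpointing, a save routine is scheduled on a queue and later finalized in a non-blocking manner. The class and method names below (AsyncCallsQueue, AsyncRequest, schedule_async_request, maybe_finalize_async_calls) are assumptions about the core utilities mentioned above; verify them against the package documentation.

```python
import torch

from nvidia_resiliency_ext.checkpointing.async_ckpt.core import (
    AsyncCallsQueue,
    AsyncRequest,
)


def save_fn(state_dict, path):
    # Runs in a forked temporary process, off the training critical path.
    torch.save(state_dict, path)


def finalize_fn():
    # User-defined finalization, run once all ranks finish their saves.
    print("checkpoint finalized")


queue = AsyncCallsQueue()
state_dict = {"step": 100}  # stand-in for real model/optimizer state
request = AsyncRequest(
    async_fn=save_fn,
    async_fn_args=(state_dict, "/tmp/ckpt.pt"),
    finalize_fns=[finalize_fn],
)
queue.schedule_async_request(request)

# The training loop continues; poll for completion without blocking.
queue.maybe_finalize_async_calls(blocking=False)
```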
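
Local checkpointing revolves around the LocalCheckpointManager and TensorAwareStateDict classes named above. The module paths and the BasicTensorAwareStateDict helper in this sketch are assumptions; the save/find_latest/load flow is illustrative, not definitive.

```python
from nvidia_resiliency_ext.checkpointing.local.basic_state_dict import (
    BasicTensorAwareStateDict,
)
from nvidia_resiliency_ext.checkpointing.local.ckpt_managers.local_manager import (
    LocalCheckpointManager,
)

manager = LocalCheckpointManager("/tmp/local_ckpt")

# Wrap a plain state dict so it supports efficient replication/transfer.
ta_state_dict = BasicTensorAwareStateDict({"step": 100})
manager.save(ta_state_dict, iteration=100)

# On restart: locate the newest local checkpoint, then load it.
if manager.find_latest() != -1:
    ta_state_dict, ckpt_id = manager.load()
```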

Known Issues & Limitations

  • For in-process restart – If a hang occurs, the presence of SHARP raises an exception, which in turn triggers the in-job and in-process restarts. To use in-process restart with the current version, SHARP must be disabled by setting the following environment variables (see the snippet after this list):
    NCCL_NVLS_ENABLE=0 to disable SHARP.
    NCCL_NET_PLUGIN="none" if NCCL version < 2.24.1, to avoid duplicate NCCL net plugin initialization.

  • In-process and in-job restart work with PyTorch NGC container versions 24.07, 24.08, 24.09, and 24.10, but not with 24.11, due to a known NCCL issue.
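
The two settings above can also be applied programmatically, before any NCCL communicator is created:

```python
import os

os.environ["NCCL_NVLS_ENABLE"] = "0"    # disable SHARP
os.environ["NCCL_NET_PLUGIN"] = "none"  # only if NCCL version < 2.24.1
```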

Contributors

@grzegorz-k-karch @jbieniusiewi @j-szulc @mikolajblaz @sbak @skierat @srogawski-nvidia @szmigacz @yzhautouskay

NVIDIA Resiliency Extension v0.1.3

15 Oct 02:47

Release Notes

NVIDIA Resiliency Extension is a Python package that helps framework developers and users implement fault-tolerant features. It improves effective training time by minimizing downtime due to failures and interruptions.

NVIDIA Resiliency Extension v0.1.3

Highlights

We are excited to announce the first release of NVIDIA Resiliency Extension v0.1.3!

The Straggler Detection API provides tools for users to mark sections of code and configure thresholds for detecting slow-running GPUs; a sketch follows.
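
A minimal sketch of marking a section and generating a straggler report; the straggler.Detector API names and the reporting interval are assumptions based on the v0.1 documentation.

```python
from nvidia_resiliency_ext import straggler


def run_fwd_bwd():
    pass  # hypothetical stand-in for one forward/backward step


straggler.Detector.initialize(gather_on_rank0=True)

for step in range(1000):
    # Time the wrapped section; per-rank performance scores are
    # computed from these measurements to flag slow-running GPUs.
    with straggler.Detector.detection_section("fwd_bwd"):
        run_fwd_bwd()

    if step % 100 == 0:
        report = straggler.Detector.generate_report()

straggler.Detector.shutdown()
```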

The Fault Tolerance API provides a rank monitor server and client, along with a modified torchrun launcher based on TorchElastic, to automatically detect hangs and restart training in-job; a sketch follows.
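
On the client side, ranks report liveness to the monitor so hangs can be detected; jobs are started with the package's TorchElastic-based launcher instead of plain torchrun. The names below (RankMonitorClient and its methods) are assumptions based on the v0.1 documentation.

```python
from nvidia_resiliency_ext.fault_tolerance import RankMonitorClient


def train_step():
    pass  # hypothetical stand-in for one training iteration


ft_client = RankMonitorClient()
ft_client.init_workload_monitoring()

for step in range(1000):
    train_step()
    ft_client.send_heartbeat()  # missing heartbeats indicate a hang

ft_client.shutdown_workload_monitoring()
```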

Contributors

@jbieniusiewi @srogawski-nvidia @yzhautouskay