CheckFreq pipelines checkpointing with computation for automated, frequent, fine-grained checkpointing in DNN training.
- Introduction
- Background
- The Current State of Checkpointing
- Checkpointing is Incorrect
- Checkpointing is Inefficient
- Summary
- CheckFreq: Design and Implementation
- Goals
- CheckFreq Recovery Guarantees
- Design
- Checkpointing Mechanism
- Checkpointing Policy
- Implementation
- Evaluation
- Experimental Setup
- Accuracy Implications
- Performance of Checkpointing Mechanism
- Checkpoint Stalls
- Breakdown of Benefits
- Checkpointing Policy
- Recovery Time
- End-to-End Training
- Discussion
- Related Work
- Conclusion
During DNN training, checkpointing provides fault tolerance. Current checkpointing schemes are synchronous, which leads to large checkpoint stalls. At the same time, bigger models and larger datasets keep increasing epoch times, yet checkpointing is typically performed only at epoch boundaries, and the checkpointing frequency has to be set manually. → We need fine-grained, iteration-level checkpointing.
CheckFreq is an automated checkpointing framework for DNN training.
CheckFreq decouples traditional checkpointing into two phases: `snapshot()` and `persist()`. `snapshot()` serializes the training state and copies it from GPU memory to a user-space buffer in CPU memory. `persist()` writes the serialized content to disk. These two phases are pipelined with the DNN training computation.
Because the model weights are only modified in the last phase of an iteration (the weight update), `snapshot()` can in the best case be pipelined with the forward and backward passes of the next iteration, minimizing the checkpoint stall.
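The two-phase pipeline described above can be sketched in plain Python. This is a toy illustration under stated assumptions, not CheckFreq's implementation: the class `TwoPhaseCheckpointer`, its methods, and the `train` loop are hypothetical names, the training state is an ordinary dict rather than GPU tensors, and the in-memory copy is a `deepcopy` with serialization deferred to `persist()`. The point it shows is the ordering constraint: the snapshot overlaps with the next iteration's forward and backward pass, and the training loop only blocks right before the weight update if the snapshot has not finished.

```python
import copy
import os
import pickle
import tempfile
import threading


class TwoPhaseCheckpointer:
    """Toy checkpointer: snapshot() copies state in memory, persist() writes it to disk."""

    def __init__(self, path):
        self.path = path
        self._worker = None                      # thread of the in-flight checkpoint, if any
        self._snapshot_done = threading.Event()
        self._snapshot_done.set()                # nothing in flight yet

    def snapshot(self, state):
        # Phase 1: consistent in-memory copy (stand-in for a GPU or CPU tensor copy).
        return copy.deepcopy(state)

    def persist(self, snap):
        # Phase 2: write to a temp file, fsync, then atomically rename over the target.
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "wb") as f:
            pickle.dump(snap, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.path)

    def _run(self, state):
        try:
            snap = self.snapshot(state)
        finally:
            self._snapshot_done.set()            # weights may be modified again
        self.persist(snap)

    def checkpoint_async(self, state):
        self.wait()                              # at most one checkpoint in flight
        self._snapshot_done.clear()
        self._worker = threading.Thread(target=self._run, args=(state,))
        self._worker.start()

    def wait_for_snapshot(self):
        self._snapshot_done.wait()

    def wait(self):
        if self._worker is not None:
            self._worker.join()
            self._worker = None


def train(num_iters, k, ckpt):
    state = {"iteration": 0, "weights": [0.0, 0.0]}
    for it in range(1, num_iters + 1):
        grads = [1.0, 1.0]                       # stand-in for forward + backward pass
        ckpt.wait_for_snapshot()                 # stall only if the copy is still running
        state["weights"] = [w - 0.1 * g for w, g in zip(state["weights"], grads)]
        state["iteration"] = it
        if it % k == 0:
            ckpt.checkpoint_async(state)         # checkpoint every k iterations
    ckpt.wait()                                  # make sure the last checkpoint is on disk
    return state
```

Calling `train(10, 5, TwoPhaseCheckpointer("ckpt.pkl"))` checkpoints after iterations 5 and 10. Allowing only one checkpoint in flight keeps the sketch simple and bounds the extra memory to a single snapshot.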
The authors also found that taking the snapshot in GPU memory is orders of magnitude cheaper than taking it in CPU memory, because the latter involves a memory copy from GPU to CPU. Therefore, if enough spare GPU memory is available, the snapshot is taken in GPU memory.
Current data iterators do not guarantee the order of data items after resuming. CheckFreq resolves this by using a seed that is a function of the epoch number to reconstruct the shuffle order after resuming.
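The seed-based reconstruction can be illustrated with a short sketch. It assumes a map-style dataset addressed by integer indices; the function names `epoch_order` and `resume_batches`, the `base_seed` parameter, and the way the seed is derived from the epoch number are all hypothetical choices for illustration, not CheckFreq's actual data iterator.

```python
import random


def epoch_order(num_items, epoch, base_seed=0):
    """Return the shuffled index order for an epoch; depends only on (base_seed, epoch)."""
    rng = random.Random(base_seed * 1_000_003 + epoch)  # seed is a function of the epoch
    order = list(range(num_items))
    rng.shuffle(order)
    return order


def resume_batches(num_items, batch_size, epoch, iters_done, base_seed=0):
    """Yield the remaining batches of an epoch after `iters_done` iterations were completed."""
    order = epoch_order(num_items, epoch, base_seed)    # same permutation as before the crash
    start = iters_done * batch_size                     # skip the items already consumed
    for i in range(start, num_items, batch_size):
        yield order[i:i + batch_size]
```

Because the permutation is recomputed from the epoch number alone, a checkpoint only needs to record the epoch and the number of completed iterations to resume with exactly the batches that an uninterrupted run would have seen next.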
The key idea is to choose a checkpointing frequency (one checkpoint every k iterations) such that:
- The cost of 1 checkpoint can be amortized over k iterations
- The runtime overhead of checkpointing is within a user-defined threshold of the actual compute time (say 5%)
To accomplish this, CheckFreq profiles: the iteration time (Ti), time to perform weight update (Tw), time to create an in-memory GPU copy (Tg), time to create an in-memory CPU copy (Tc), time to write to storage (Ts), size of checkpoint (m), peak GPU memory utilization (M), and total GPU memory (Mmax). The checkpointing frequency is then derived from these profiled values.
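One way to turn the profiled quantities into a frequency is sketched below. This is a simplified model written for illustration and should not be read as the paper's exact algorithm. Its assumptions: the snapshot goes to GPU memory when the checkpoint fits next to the peak working set (M + m ≤ Mmax), the snapshot overlaps with the non-update part of the next iteration (Ti − Tw), any remainder is a stall, a new checkpoint cannot start before the previous one has been fully written, and `p` is the user-defined overhead threshold. The function name `choose_frequency` is hypothetical.

```python
import math


def choose_frequency(Ti, Tw, Tg, Tc, Ts, m, M, Mmax, p=0.05):
    """Return (k, snapshot_on_gpu) under the simplified model described in the lead-in."""
    on_gpu = (M + m) <= Mmax                 # enough spare GPU memory for the copy?
    t_snap = Tg if on_gpu else Tc            # cost of the in-memory snapshot

    # Part of the snapshot that cannot hide behind the next forward/backward pass.
    stall = max(0.0, t_snap - (Ti - Tw))

    # Constraint 1: the previous checkpoint must be finished before the next starts.
    k_pipeline = math.ceil((t_snap + Ts) / Ti)

    # Constraint 2: stall amortized over k iterations stays within the budget p.
    k_budget = math.ceil(stall / (p * Ti))

    return max(1, k_pipeline, k_budget), on_gpu
```

For instance, with Ti = 1.0 s, Tw = 0.25 s, Tg = 0.5 s, Ts = 10 s and enough spare GPU memory, the snapshot hides entirely behind the next iteration, so the storage write time alone determines k under this model.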
Consider the following example, which compares three scenarios:
- Isolated: When a job runs alone, the checkpointing overhead is kept at 5% as specified by the user
- Static: When another job space-shares the same GPU, checkpointing at the previous frequency results in a 35% overhead
- Adaptive: CheckFreq's adaptive policy reduces the checkpoint frequency and keeps the overhead at 5%
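The adaptive behavior in the example can be approximated with a small feedback rule. The sketch below is an assumption-laden illustration, not CheckFreq's tuning logic: it supposes the runtime measures the compute-only time and the actual wall-clock time over a window of iterations, and that checkpointing overhead is roughly inversely proportional to k. The function name `adapt_frequency` and its parameters are hypothetical.

```python
import math


def adapt_frequency(k, compute_time, total_time, p=0.05):
    """Increase k when the measured checkpointing overhead exceeds the budget p."""
    overhead = (total_time - compute_time) / compute_time  # observed relative overhead
    if overhead <= p:
        return k                                           # within budget: keep the frequency
    # Assumed model: overhead scales as 1/k, so scale k by overhead / p.
    return max(k, math.ceil(k * overhead / p))
```

Under the same assumption, bringing a 35% overhead back to a 5% budget would mean checkpointing roughly 7 times less often.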