Hi, sometimes the throughput without using DeepSpeed seems to be higher than with DeepSpeed (using 8 GPUs to train). I would like to know the scenarios in which DeepSpeed doesn't help much.
Replies: 1 comment
The ZeRO optimizations in DeepSpeed are most helpful when their memory savings are actually needed, e.g., when the model or its optimizer states are too large to fit in GPU memory otherwise. The memory saving of ZeRO is a trade-off for increased communication: both the memory saving and the communication overhead grow with the ZeRO stage. The communication overhead can hurt the throughput of smaller models like t5-base, which don't benefit much from the memory savings. In such cases, it is probably better to disable ZeRO by setting the stage to 0. You might find the Flops Profiler or the Autotuner helpful for your investigation.
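For a concrete illustration, here is a minimal sketch of a DeepSpeed config with ZeRO disabled and the Flops Profiler enabled for a throughput investigation. The batch size and profiler settings are illustrative assumptions, not recommendations for your setup:

```python
# Sketch of a DeepSpeed config dict: ZeRO disabled (stage 0) plus the
# Flops Profiler for measuring per-module FLOPs and throughput.
ds_config = {
    "train_micro_batch_size_per_gpu": 8,  # illustrative value, tune for your setup
    "zero_optimization": {
        "stage": 0,  # 0 disables ZeRO parameter/optimizer partitioning
    },
    "flops_profiler": {
        "enabled": True,
        "profile_step": 5,   # profile one step after a few warmup steps
        "module_depth": -1,  # profile modules at all depths
        "top_modules": 1,    # report the top module by FLOPs
        "detailed": True,
    },
}
```

The dict can be passed directly to deepspeed.initialize(model=..., config=ds_config), or serialized to a JSON file if you launch with the deepspeed CLI.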