Hello, I am trying to train a T5 model using PyTorch Lightning with a source length of 4000 and a target length of 1000, but training fails with a CUDA out-of-memory error. I am using a SageMaker ml.g4dn.4xlarge instance with a single GPU and 64 GB of instance memory. I have tried gradient accumulation and gradient checkpointing, but they do not help, and mixed precision causes the training loss to become NaN. Can you please guide me on what I should do?

Error message:
OutOfMemoryError: CUDA out of memory. Tried to allocate 136.00 MiB (GPU 0; 14.62 GiB total capacity; 13.11 GiB already allocated; 4.00 MiB free; 13.76 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
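A few things worth noting for anyone hitting this: the error shows only 14.62 GiB of GPU capacity, which matches the single T4 GPU on a g4dn.4xlarge (the 64 GB figure is host RAM, not GPU memory). Self-attention memory grows quadratically with sequence length, so 4000-token sources are the dominant cost, and fp16 NaNs are a known issue with T5, which was pretrained in bfloat16 (the T4 has no native bf16 support). Below is a minimal configuration sketch of the usual mitigations, assuming the Hugging Face transformers T5 implementation and a recent PyTorch Lightning API; `MyT5Module` and the hyperparameter values are hypothetical placeholders, not a tested recipe:

```python
import os

# Reduce allocator fragmentation, as the error message itself suggests.
# Must be set before the first CUDA allocation.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import pytorch_lightning as pl
from transformers import Adafactor, T5ForConditionalGeneration


class MyT5Module(pl.LightningModule):  # hypothetical module name
    def __init__(self, model_name: str = "t5-base"):
        super().__init__()
        self.model = T5ForConditionalGeneration.from_pretrained(model_name)
        # Activation checkpointing must be enabled on the transformers
        # model itself; the Lightning Trainer cannot do this for you.
        self.model.gradient_checkpointing_enable()
        self.model.config.use_cache = False  # incompatible with checkpointing

    def training_step(self, batch, batch_idx):
        return self.model(**batch).loss

    def configure_optimizers(self):
        # Adafactor keeps optimizer state far smaller than Adam and is
        # the optimizer T5 was originally trained with.
        return Adafactor(
            self.model.parameters(),
            scale_parameter=False,
            relative_step=False,
            lr=1e-3,
        )


trainer = pl.Trainer(
    accelerator="gpu",
    devices=1,
    precision=32,               # fp16 tends to overflow to NaN with T5
    accumulate_grad_batches=8,  # e.g. per-step batch 1, effective batch 8
    max_epochs=1,
)
```

If this still does not fit in 16 GiB, the remaining levers are reducing the source/target lengths (truncating or chunking inputs), switching to a long-input variant such as LongT5, or moving to an instance with a larger GPU.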