Hello, I am trying to train a T5 model using PyTorch Lightning with a source length of 4000 and a target length of 1000, but training fails with a CUDA out-of-memory error. I am using a SageMaker ml.g4dn.4xlarge instance with a single GPU and 64 GB of instance memory. I have tried gradient accumulation and gradient checkpointing, but they do not help, and mixed precision causes the training loss to become NaN. Can you please guide me on what I should do?

Error message:
OutOfMemoryError: CUDA out of memory. Tried to allocate 136.00 MiB (GPU 0; 14.62 GiB total capacity; 13.11 GiB already allocated; 4.00 MiB free; 13.76 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
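A few things worth noting for anyone hitting this: the error shows only 14.62 GiB of GPU capacity, which matches the single T4 GPU on a g4dn.4xlarge (the 64 GB figure is host RAM, not GPU memory). Self-attention memory grows quadratically with sequence length, so 4000-token sources are the dominant cost, and fp16 NaNs are a known issue with T5, which was pretrained in bfloat16 (the T4 has no native bf16 support). Below is a minimal configuration sketch of the usual mitigations, assuming the Hugging Face transformers T5 implementation and a recent PyTorch Lightning API; `MyT5Module` and the hyperparameter values are hypothetical placeholders, not a tested recipe:

```python
import os

# Reduce allocator fragmentation, as the error message itself suggests.
# Must be set before the first CUDA allocation.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import pytorch_lightning as pl
from transformers import Adafactor, T5ForConditionalGeneration


class MyT5Module(pl.LightningModule):  # hypothetical module name
    def __init__(self, model_name: str = "t5-base"):
        super().__init__()
        self.model = T5ForConditionalGeneration.from_pretrained(model_name)
        # Activation checkpointing must be enabled on the transformers
        # model itself; the Lightning Trainer cannot do this for you.
        self.model.gradient_checkpointing_enable()
        self.model.config.use_cache = False  # incompatible with checkpointing

    def training_step(self, batch, batch_idx):
        return self.model(**batch).loss

    def configure_optimizers(self):
        # Adafactor keeps optimizer state far smaller than Adam and is
        # the optimizer T5 was originally trained with.
        return Adafactor(
            self.model.parameters(),
            scale_parameter=False,
            relative_step=False,
            lr=1e-3,
        )


trainer = pl.Trainer(
    accelerator="gpu",
    devices=1,
    precision=32,               # fp16 tends to overflow to NaN with T5
    accumulate_grad_batches=8,  # e.g. per-step batch 1, effective batch 8
    max_epochs=1,
)
```

If this still does not fit in 16 GiB, the remaining levers are reducing the source/target lengths (truncating or chunking inputs), switching to a long-input variant such as LongT5, or moving to an instance with a larger GPU.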