Memory issue with training on Monai with large datasets #59
This is indeed strange. Can you also post the exact arguments that you're using with |
Pretty much no arguments, just the correct paths to the data and the default UNet, 100 epochs, and your hard-coded values. |
@louisfb01 can you please list in this issue thread the various discussions on this topic in MONAI GH, slack, forums, etc. |
The source seems to be a memory leak when storing too much data from a dataloader, from this. But in his case it happens when running the test set, whereas here it happens during training. Solutions tried:
New temporary fix:
|
@louisfb01 can you please
|
Here is more information about the environment (Python version 3.9.17) and the output I get from running main.py (training with PyTorch, PyTorch Lightning, MONAI). STDOUT
Environment details (pip list)
|
Updated the answer above with a new temporary fix. Adding this line to the training script did work. It is not super clean, but it at least allows us to train normally for now. |
Why is that? |
Never mind on that. I didn't like the idea of having to add this line and thought it was a "hard-coded" fix to a PyTorch issue, but it seems to be normal behaviour after further research. This issue can be closed with the solution of adding the |
I just realized this magic syntax also fixed an issue for me in the past 😅 |
Is there any deeper explanation anywhere as to why this fix is working? |
This is what I found: torch.multiprocessing is a wrapper around the native multiprocessing module. It registers custom reducers, that use shared memory to provide shared views on the same data in different processes. Once the tensor/storage is moved to shared_memory (see share_memory_()), it will be possible to send it to other processes without making any copies. (from the torch documentation) From what I understand, this memory issue comes from using the CacheDataset and has to do with PyTorch's sharing strategy. The function It still seems to be a temporary fix to "having a high enough limit" in our system. This can be done by increasing Quote from the PyTorch doc:
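For reference, a minimal sketch of what such a fix typically looks like; the exact line added to the training script is not shown in this thread, so both the sharing-strategy switch and the file-descriptor limit increase below are assumptions based on the explanation above:

```python
import resource
import torch.multiprocessing

# Assumed fix (not confirmed by the thread): switch PyTorch's sharing strategy
# from the default "file_descriptor" to "file_system", so worker processes no
# longer keep one open file descriptor per shared tensor and cannot run out
# of descriptors when many cached samples are shared between processes.
torch.multiprocessing.set_sharing_strategy("file_system")

# Alternative hinted at above: keep the default strategy but raise the
# open-file limit of the current process (the Python equivalent of `ulimit -n`).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
```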
This one refers to using the |
Ah, this is a good point! MONAI also provides PyTorch's native Dataset class. Could you please try using that once and remove the multiprocessing fix to see if it is really CacheDataset that's the culprit? (you just have to switch to Dataset and use the right arguments, just a line of code) |
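For context, the swap being suggested is roughly the following; the file list and transform pipeline are hypothetical placeholders, not the actual training script:

```python
from monai.data import CacheDataset, Dataset
from monai.transforms import Compose, LoadImaged

# Hypothetical placeholders for the real file list and transform pipeline.
train_files = [{"image": "sub-01_T2w.nii.gz", "label": "sub-01_lesion.nii.gz"}]
train_transforms = Compose([LoadImaged(keys=["image", "label"])])

# Current setup: CacheDataset pre-computes the transforms and keeps the
# results cached in RAM.
train_ds = CacheDataset(data=train_files, transform=train_transforms,
                        cache_rate=1.0, num_workers=4)

# Suggested one-line swap: the plain Dataset applies transforms lazily and
# caches nothing, which isolates whether CacheDataset is the culprit.
train_ds = Dataset(data=train_files, transform=train_transforms)
```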
I tried using MONAI's Dataset class instead of CacheDataset. Using Dataset does not allow us to remove |
New error even with the fix: it happens both when using MONAI's Dataset class and CacheDataset. The error occurs with all the aggregated datasets (approx. 7k total images from train/val/test) and it only happens at 73% (with CacheDataset) or 75% (with Dataset) of the first epoch:
Plus, each epoch takes over an hour and a half to run. I think the next step is to investigate using Compute Canada and train with more GPUs and memory. Anyway, as @naga-karthik and I saw, Romane is getting pretty crowded nowadays and it is hard to train when you want. |
👍 |
Regarding the issue above, I looked further into it, and it seems nobody can explain it (?). Some threads are also still open, with the cause of the issue not found and other fixes suggested (will update the comment once I've tried them):
Another potential solution is to implement our own version of the dataset class with this modification to transform the data into torch tensors -> did not work. Now working on implementing the code for Compute Canada to train with multiple GPUs; will update if we have the same problem. |
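As an illustration of the "transform the data into torch tensors" idea, a hypothetical sketch follows; the actual modification that was tried is only linked above, not shown, so the class and names here are placeholders:

```python
import torch
from torch.utils.data import Dataset

class TensorBackedDataset(Dataset):
    """Hypothetical sketch: return torch tensors from __getitem__ instead of
    plain Python/NumPy objects, so the DataLoader can move them through
    shared tensor storage rather than pickling Python objects."""

    def __init__(self, samples):
        # Placeholder input: a list of (image_array, label_array) pairs.
        self.samples = samples

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        image, label = self.samples[idx]
        return torch.as_tensor(image), torch.as_tensor(label)
```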
Closing this as it is not relevant anymore -- I was able to train the model on 11 datasets for now. |
Hi @naga-karthik! I am re-opening this issue as I am facing the same problem as @louisfb01 when training using MONAI and large datasets. My training crashes during the validation step even though I am using The error I get is the following:
Error message:
Traceback (most recent call last):
File "/home/plbenveniste/miniconda3/envs/venv_monai/lib/python3.9/threading.py", line 980, in _bootstrap_inner
self.run()
File "/home/plbenveniste/miniconda3/envs/venv_monai/lib/python3.9/threading.py", line 917, in run
self._target(*self._args, **self._kwargs)
File "/home/plbenveniste/miniconda3/envs/venv_monai/lib/python3.9/site-packages/torch/utils/data/_utils/pin_memory.py", line 51, in _pin_memory_loop
do_one_step()
File "/home/plbenveniste/miniconda3/envs/venv_monai/lib/python3.9/site-packages/torch/utils/data/_utils/pin_memory.py", line 28, in do_one_step
r = in_queue.get(timeout=MP_STATUS_CHECK_INTERVAL)
File "/home/plbenveniste/miniconda3/envs/venv_monai/lib/python3.9/multiprocessing/queues.py", line 122, in get
return _ForkingPickler.loads(res)
File "/home/plbenveniste/miniconda3/envs/venv_monai/lib/python3.9/site-packages/torch/multiprocessing/reductions.py", line 324, in rebuild_storage_filename
storage = torch.UntypedStorage._new_shared_filename_cpu(manager, handle, size)
RuntimeError: unable to mmap 68 bytes from file </torch_1405374_144476348_9956>: Cannot allocate memory (12)
Traceback (most recent call last):
File "/home/plbenveniste/ms_lesion_agnostic/ms-lesion-agnostic/monai/train_monai_unet_lightning.py", line 823, in <module>
main()
File "/home/plbenveniste/ms_lesion_agnostic/ms-lesion-agnostic/monai/train_monai_unet_lightning.py", line 815, in main
trainer.fit(pl_model)
File "/home/plbenveniste/miniconda3/envs/venv_monai/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 544, in fit
call._call_and_handle_interrupt(
File "/home/plbenveniste/miniconda3/envs/venv_monai/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 44, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home/plbenveniste/miniconda3/envs/venv_monai/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 580, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/home/plbenveniste/miniconda3/envs/venv_monai/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 987, in _run
results = self._run_stage()
File "/home/plbenveniste/miniconda3/envs/venv_monai/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1033, in _run_stage
self.fit_loop.run()
File "/home/plbenveniste/miniconda3/envs/venv_monai/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 205, in run
self.advance()
File "/home/plbenveniste/miniconda3/envs/venv_monai/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 363, in advance
self.epoch_loop.run(self._data_fetcher)
File "/home/plbenveniste/miniconda3/envs/venv_monai/lib/python3.9/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 141, in run
self.on_advance_end(data_fetcher)
File "/home/plbenveniste/miniconda3/envs/venv_monai/lib/python3.9/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 295, in on_advance_end
self.val_loop.run()
File "/home/plbenveniste/miniconda3/envs/venv_monai/lib/python3.9/site-packages/pytorch_lightning/loops/utilities.py", line 182, in _decorator
return loop_run(self, *args, **kwargs)
File "/home/plbenveniste/miniconda3/envs/venv_monai/lib/python3.9/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 128, in run
batch, batch_idx, dataloader_idx = next(data_fetcher)
File "/home/plbenveniste/miniconda3/envs/venv_monai/lib/python3.9/site-packages/pytorch_lightning/loops/fetchers.py", line 133, in __next__
batch = super().__next__()
File "/home/plbenveniste/miniconda3/envs/venv_monai/lib/python3.9/site-packages/pytorch_lightning/loops/fetchers.py", line 60, in __next__
batch = next(self.iterator)
File "/home/plbenveniste/miniconda3/envs/venv_monai/lib/python3.9/site-packages/pytorch_lightning/utilities/combined_loader.py", line 341, in __next__
out = next(self._iterator)
File "/home/plbenveniste/miniconda3/envs/venv_monai/lib/python3.9/site-packages/pytorch_lightning/utilities/combined_loader.py", line 142, in __next__
out = next(self.iterators[0])
File "/home/plbenveniste/miniconda3/envs/venv_monai/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 633, in __next__
data = self._next_data()
File "/home/plbenveniste/miniconda3/envs/venv_monai/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1328, in _next_data
idx, data = self._get_data()
File "/home/plbenveniste/miniconda3/envs/venv_monai/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1289, in _get_data
raise RuntimeError('Pin memory thread exited unexpectedly')
RuntimeError: Pin memory thread exited unexpectedly
I had to set num_workers to zero in the validation to get the code to work. Is there any cleaner way around this? (Btw I am running this on this server: |
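For reference, the workaround described here amounts to something like the following in the LightningModule's validation loader (a sketch with placeholder attribute names and batch size, not the actual train_monai_unet_lightning.py):

```python
import pytorch_lightning as pl
from torch.utils.data import DataLoader

class LitModel(pl.LightningModule):
    # ... model definition and training logic omitted ...

    def val_dataloader(self):
        # Workaround sketch: num_workers=0 loads validation batches in the
        # main process, which avoids the pin-memory/shared-memory crash above
        # at the cost of slower validation.
        return DataLoader(
            self.val_dataset,   # placeholder: validation dataset attribute
            batch_size=1,       # placeholder batch size
            shuffle=False,
            num_workers=0,
            pin_memory=True,
        )
```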
@plbenveniste Have you had any luck with finding a solution? |
Hey All! So as discussed with @naga-karthik I've had some issues with the "aggregated training".
I am running into memory issues where no fixes seem to work.
I am using Naga's training script and tried pretty much all the solutions I could find, but I always get a RuntimeError('Pin memory thread exited unexpectedly'). This can be fixed by using 0 workers in these lines, but it makes the training super slow (15+ mins per epoch). I tried with the exact same configuration as Naga too. The only difference is the number of images. Plus, it works when removing most images from the dataset.json file (so when working with a much smaller set).
I am still investigating this issue...