
Memory issue with training on Monai with large datasets #59

Open
louisfb01 opened this issue Jul 12, 2023 · 20 comments

@louisfb01
Collaborator

Hey all! As discussed with @naga-karthik, I've had some issues with the "aggregated training".

I am running into memory issues for which no fix seems to work.

I am using Naga's training script and tried pretty much every solution I could find, but I always get a RuntimeError('Pin memory thread exited unexpectedly'). This can be avoided by using 0 workers in these lines, but that makes training very slow (15+ min per epoch).

I tried with the exact same configuration as Naga too. The only difference is the number of images. Also, it works when removing most images from the dataset.json file (i.e., when working with a much smaller set).

I am still investigating this issue...
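
For reference, a minimal sketch of the trade-off being discussed, assuming a plain PyTorch DataLoader (the dataset, shapes, and batch size below are placeholders, not the actual training setup):

```python
# Minimal sketch of the num_workers / pin_memory trade-off described above.
# The dataset and values are placeholders, not the real training data.
import torch
from torch.utils.data import DataLoader, TensorDataset

train_dataset = TensorDataset(torch.randn(64, 1, 32, 32), torch.randint(0, 2, (64,)))

# num_workers > 0 with pin_memory=True is the configuration that crashes on the
# large dataset ("Pin memory thread exited unexpectedly"). With num_workers=0,
# batches are loaded in the main process, so no worker-to-main shared-memory
# transfer (and no pin-memory thread) is involved, at the cost of much slower epochs.
train_loader = DataLoader(
    train_dataset,
    batch_size=4,
    shuffle=True,
    num_workers=0,   # workaround: no worker processes
    pin_memory=True,
)
```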

@louisfb01 louisfb01 added the bug Something isn't working label Jul 12, 2023
@louisfb01 louisfb01 self-assigned this Jul 12, 2023
@naga-karthik
Collaborator

This is indeed strange. Can you also post the exact arguments that you're using with main.py?

@louisfb01
Collaborator Author

Pretty much no arguments, just the correct paths to the data, the default UNet, 100 epochs, and your hard-coded values.

@jcohenadad jcohenadad changed the title Issue with training on Monai with large datasets Memory issue with training on Monai with large datasets Jul 14, 2023
@jcohenadad
Member

@louisfb01 can you please list in this issue thread the various discussions on this topic (MONAI GitHub, Slack, forums, etc.)?

@louisfb01
Collaborator Author

louisfb01 commented Jul 16, 2023

The source seems to be a memory leak when storing too much data from a dataloader, from this. But in that case it happens when running the test set, whereas here it happens during training.

Solutions tried:

  • Adding the line torch.multiprocessing.set_sharing_strategy('file_system') to the training script did not work, as in this issue.
  • Increasing ulimit was another solution, but it is limited to 2000 per user on Romane. Increasing it from 1024 to 2000 did not solve the problem.
  • Changing pin_memory to False in those lines did not solve the issue; rather, it caused another error, received 0 items of ancdata, similar to here and this thread.
  • Changing the training batch size to 1, 2, or 4 did not help.
  • Changing the number of workers to 1, 2, 4, or 10 did not help.
  • Changing CacheDataset's number of workers did not help either.
  • Changing CacheDataset's cache_rate to 0.1 instead of 0.25 did not help.

The current workaround is to use num_workers=0, as in this issue and this one, and more.

New temporary fix:

  • Adding the line torch.multiprocessing.set_sharing_strategy('file_system') to the training script did work, as in this issue. I believe I had previously added it in the wrong section of the code by mistake. It now works as expected! (See the sketch below.)
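
A minimal sketch of where the fix goes, assuming a typical script entry point (everything except the set_sharing_strategy call is an illustrative placeholder, not the actual main.py):

```python
# Set the sharing strategy once, at import time, before any DataLoader workers
# are spawned. The main() body below is a placeholder, not the actual training code.
import torch.multiprocessing

# 'file_system' shares tensors between processes via files in shared memory
# instead of keeping one file descriptor open per shared tensor.
torch.multiprocessing.set_sharing_strategy('file_system')


def main():
    # ... build datasets, dataloaders, model, and trainer here ...
    pass


if __name__ == "__main__":
    main()
```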

@jcohenadad
Member

jcohenadad commented Jul 17, 2023

@louisfb01 can you please

  • show us the full STDOUT
  • open an issue on PyTorch and describe your problem
  • cross-reference it with this issue so we can follow up

@louisfb01
Collaborator Author

Here is more information about the environment (Python 3.9.17) and the output I get from running main.py (training with PyTorch, PyTorch Lightning, and MONAI).

STDOUT
(monai_training) lobouz@romane:~/github/contrast-agnostic-softseg-spinalcord/monai$ CUDA_VISIBLE_DEVICES=3 python main.py -m unet -nspv 4 -ncv 1 -initf 8 -bs 4 -lr 1e-3 -cve 4 -stp -epb 
Global seed set to 42
2023-07-18 10:01:13.007 | INFO     | __main__:main:472 -  Training on fold 1 out of 1 folds! 
/home/GRAMES.POLYMTL.CA/lobouz/miniconda3/envs/monai_training/lib/python3.9/site-packages/pytorch_lightning/utilities/parsing.py:196: UserWarning: Attribute 'loss_function' is an instance of `nn.Module` and is already saved during checkpointing. It is recommended to ignore them using `self.save_hyperparameters(ignore=['loss_function'])`.
  rank_zero_warn(
wandb: W&B API key is configured. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.15.5
wandb: Run data is saved locally in /home/GRAMES.POLYMTL.CA/lobouz/contrast-agnostic/saved_models/wandb/run-20230718_100119-9xlhu2uf
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run unet_nf=8_nrs=2_lr=0.001_20230718-1001
wandb: ⭐️ View project at https://wandb.ai/whats_ai/contrast-agnostic
wandb: 🚀 View run at https://wandb.ai/whats_ai/contrast-agnostic/runs/9xlhu2uf
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/home/GRAMES.POLYMTL.CA/lobouz/miniconda3/envs/monai_training/lib/python3.9/site-packages/monai/utils/deprecate_utils.py:321: FutureWarning: monai.transforms.io.dictionary LoadImaged.__init__:image_only: Current default value of argument `image_only=False` has been deprecated since version 1.1. It will be changed to `image_only=True` in version 1.3.
  warn_deprecated(argname, msg, warning_category)
Loading dataset: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 476/476 [01:45<00:00,  4.53it/s]
Loading dataset: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 119/119 [00:37<00:00,  3.20it/s]
Loading dataset: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 29/29 [00:10<00:00,  2.78it/s]
You are using a CUDA device ('NVIDIA RTX A6000') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [3]

  | Name          | Type         | Params
-----------------------------------------------
0 | net           | UNet         | 1.2 M 
1 | loss_function | SoftDiceLoss | 0     
-----------------------------------------------
1.2 M     Trainable params
0         Non-trainable params
1.2 M     Total params
4.809     Total estimated model params size (MB)
Sanity Checking DataLoader 0: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:02<00:00,  1.34s/it]Current epoch: 0
Average Soft Dice (VAL): 0.0071
Average Hard Dice (VAL): 0.0020
Best Average Soft Dice: 0.0071 at Epoch: 0
----------------------------------------------------
Epoch 0:   3%|████▌                                                                                                                                                      | 14/477 [00:23<12:59,  1.68s/it, v_num=u2uf]Exception in thread Thread-28:
Traceback (most recent call last):
  File "/home/GRAMES.POLYMTL.CA/lobouz/miniconda3/envs/monai_training/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/home/GRAMES.POLYMTL.CA/lobouz/miniconda3/envs/monai_training/lib/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/home/GRAMES.POLYMTL.CA/lobouz/miniconda3/envs/monai_training/lib/python3.9/site-packages/torch/utils/data/_utils/pin_memory.py", line 51, in _pin_memory_loop
    do_one_step()
  File "/home/GRAMES.POLYMTL.CA/lobouz/miniconda3/envs/monai_training/lib/python3.9/site-packages/torch/utils/data/_utils/pin_memory.py", line 28, in do_one_step
    r = in_queue.get(timeout=MP_STATUS_CHECK_INTERVAL)
  File "/home/GRAMES.POLYMTL.CA/lobouz/miniconda3/envs/monai_training/lib/python3.9/multiprocessing/queues.py", line 122, in get
    return _ForkingPickler.loads(res)
  File "/home/GRAMES.POLYMTL.CA/lobouz/miniconda3/envs/monai_training/lib/python3.9/site-packages/torch/multiprocessing/reductions.py", line 307, in rebuild_storage_fd
    fd = df.detach()
  File "/home/GRAMES.POLYMTL.CA/lobouz/miniconda3/envs/monai_training/lib/python3.9/multiprocessing/resource_sharer.py", line 58, in detach
    return reduction.recv_handle(conn)
  File "/home/GRAMES.POLYMTL.CA/lobouz/miniconda3/envs/monai_training/lib/python3.9/multiprocessing/reduction.py", line 189, in recv_handle
    return recvfds(s, 1)[0]
  File "/home/GRAMES.POLYMTL.CA/lobouz/miniconda3/envs/monai_training/lib/python3.9/multiprocessing/reduction.py", line 164, in recvfds
    raise RuntimeError('received %d items of ancdata' %
RuntimeError: received 0 items of ancdata
Traceback (most recent call last):
  File "/home/GRAMES.POLYMTL.CA/lobouz/github/contrast-agnostic-softseg-spinalcord/monai/main.py", line 605, in <module>
    main(args)
  File "/home/GRAMES.POLYMTL.CA/lobouz/github/contrast-agnostic-softseg-spinalcord/monai/main.py", line 516, in main
    trainer.fit(pl_model)
  File "/home/GRAMES.POLYMTL.CA/lobouz/miniconda3/envs/monai_training/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 531, in fit
    call._call_and_handle_interrupt(
  File "/home/GRAMES.POLYMTL.CA/lobouz/miniconda3/envs/monai_training/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 42, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/GRAMES.POLYMTL.CA/lobouz/miniconda3/envs/monai_training/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 570, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/GRAMES.POLYMTL.CA/lobouz/miniconda3/envs/monai_training/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 975, in _run
    results = self._run_stage()
  File "/home/GRAMES.POLYMTL.CA/lobouz/miniconda3/envs/monai_training/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1018, in _run_stage
    self.fit_loop.run()
  File "/home/GRAMES.POLYMTL.CA/lobouz/miniconda3/envs/monai_training/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 201, in run
    self.advance()
  File "/home/GRAMES.POLYMTL.CA/lobouz/miniconda3/envs/monai_training/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 354, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "/home/GRAMES.POLYMTL.CA/lobouz/miniconda3/envs/monai_training/lib/python3.9/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 133, in run
    self.advance(data_fetcher)
  File "/home/GRAMES.POLYMTL.CA/lobouz/miniconda3/envs/monai_training/lib/python3.9/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 189, in advance
    batch = next(data_fetcher)
  File "/home/GRAMES.POLYMTL.CA/lobouz/miniconda3/envs/monai_training/lib/python3.9/site-packages/pytorch_lightning/loops/fetchers.py", line 136, in __next__
    self._fetch_next_batch(self.dataloader_iter)
  File "/home/GRAMES.POLYMTL.CA/lobouz/miniconda3/envs/monai_training/lib/python3.9/site-packages/pytorch_lightning/loops/fetchers.py", line 150, in _fetch_next_batch
    batch = next(iterator)
  File "/home/GRAMES.POLYMTL.CA/lobouz/miniconda3/envs/monai_training/lib/python3.9/site-packages/pytorch_lightning/utilities/combined_loader.py", line 284, in __next__
    out = next(self._iterator)
  File "/home/GRAMES.POLYMTL.CA/lobouz/miniconda3/envs/monai_training/lib/python3.9/site-packages/pytorch_lightning/utilities/combined_loader.py", line 65, in __next__
    out[i] = next(self.iterators[i])
  File "/home/GRAMES.POLYMTL.CA/lobouz/miniconda3/envs/monai_training/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 634, in __next__
    data = self._next_data()
  File "/home/GRAMES.POLYMTL.CA/lobouz/miniconda3/envs/monai_training/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1329, in _next_data
    idx, data = self._get_data()
  File "/home/GRAMES.POLYMTL.CA/lobouz/miniconda3/envs/monai_training/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1290, in _get_data
    raise RuntimeError('Pin memory thread exited unexpectedly')
RuntimeError: Pin memory thread exited unexpectedly

Environment details (pip list)
(monai_training) lobouz@romane:~$ pip list
Package                       Version
----------------------------- --------------
absl-py                       1.1.0
aiohttp                       3.8.4
aiosignal                     1.3.1
appdirs                       1.4.4
astor                         0.8.1
asttokens                     2.2.1
astunparse                    1.6.3
async-timeout                 4.0.2
attrs                         21.2.0
awscli                        1.22.34
backcall                      0.2.0
backports.functools-lru-cache 1.6.5
beautifulsoup4                4.11.2
beniget                       0.4.1
bids-validator                1.9.9
blinker                       1.4
botocore                      1.23.34
Brotli                        1.0.9
brz-etckeeper                 0.0.0
cachetools                    5.2.0
certifi                       2023.5.7
cffi                          1.15.1
chardet                       4.0.0
charset-normalizer            3.1.0
click                         8.1.3
cmake                         3.26.4
colorama                      0.4.6
coloredlogs                   15.0.1
comm                          0.1.3
command-not-found             0.3
commonmark                    0.9.1
contourpy                     1.1.0
cryptography                  3.4.8
csv-diff                      1.1
cycler                        0.11.0
dbus-python                   1.2.18
debugpy                       1.6.7
decorator                     5.1.1
Deprecated                    1.2.13
dictdiffer                    0.9.0
dill                          0.3.5.1
distlib                       0.3.4
distro                        1.7.0
distro-info                   1.1build1
dnspython                     2.1.0
docker                        5.0.3
docker-compose                1.29.2
docker-pycreds                0.4.0
dockerpty                     0.4.1
docopt                        0.6.2
docutils                      0.17.1
entrypoints                   0.4
executing                     1.2.0
filelock                      3.12.2
flatbuffers                   2.0.7
fonttools                     4.40.0
formulaic                     0.3.4
frozenlist                    1.3.3
fsleyes                       1.7.0
fsleyes-props                 1.8.2
fsleyes-widgets               0.14.2
fslpy                         3.13.0
fsspec                        2023.6.0
gast                          0.4.0
gdown                         4.6.4
gitdb                         4.0.10
GitPython                     3.1.31
gmpy2                         2.1.2
google-auth                   2.19.0
google-auth-oauthlib          1.0.0
google-pasta                  0.2.0
gpg                           1.16.0-unknown
grpcio                        1.54.2
h5py                          3.7.0
httplib2                      0.20.2
humanfriendly                 10.0
humanize                      4.4.0
idna                          3.4
imageio                       2.22.4
imgaug                        0.2.5
importlib-metadata            6.8.0
importlib-resources           6.0.0
interface-meta                1.3.0
iotop                         0.6
ipykernel                     6.24.0
ipython                       8.14.0
ivadomed                      2.9.7
jax                           0.4.11
jedi                          0.18.2
jeepney                       0.7.1
Jinja2                        3.1.2
jmespath                      0.10.0
joblib                        1.3.0
jsonschema                    3.2.0
jupyter_client                8.3.0
jupyter_core                  5.3.1
keras                         2.12.0
Keras-Preprocessing           1.1.2
keyring                       23.5.0
kiwisolver                    1.4.4
launchpadlib                  1.10.16
lazr.restfulclient            0.14.4
lazr.uri                      1.0.6
libclang                      14.0.1
lightning-utilities           0.9.0
lit                           16.0.6
loguru                        0.7.0
Markdown                      3.3.6
MarkupSafe                    2.1.3
matplotlib                    3.7.2
matplotlib-inline             0.1.6
ml-dtypes                     0.1.0
monai                         1.2.0
monai-weekly                  1.2.dev2311
more-itertools                8.10.0
mpmath                        1.3.0
multidict                     6.0.4
nest-asyncio                  1.5.6
netifaces                     0.11.0
networkx                      3.1
nibabel                       5.1.0
num2words                     0.5.12
numpy                         1.25.0
nvidia-cublas-cu11            11.10.3.66
nvidia-cuda-cupti-cu11        11.7.101
nvidia-cuda-nvrtc-cu11        11.7.99
nvidia-cuda-runtime-cu11      11.7.99
nvidia-cudnn-cu11             8.5.0.96
nvidia-cufft-cu11             10.9.0.58
nvidia-curand-cu11            10.2.10.91
nvidia-cusolver-cu11          11.4.0.1
nvidia-cusparse-cu11          11.7.4.91
nvidia-nccl-cu11              2.14.3
nvidia-nvtx-cu11              11.7.91
oauthlib                      3.2.0
onnxruntime                   1.13.1
opt-einsum                    3.3.0
osfclient                     0.0.5
packaging                     23.1
pandas                        2.0.3
parso                         0.8.3
pathtools                     0.1.2
pexpect                       4.8.0
pickleshare                   0.7.5
Pillow                        10.0.0
pip                           23.1.2
platformdirs                  3.8.0
ply                           3.11
pooch                         1.7.0
promise                       2.3
prompt-toolkit                3.0.39
protobuf                      3.20.3
psutil                        5.9.5
ptyprocess                    0.7.0
pure-eval                     0.2.2
pyasn1                        0.4.8
pyasn1-modules                0.2.8
pybids                        0.15.5
pycparser                     2.21
Pygments                      2.15.1
PyGObject                     3.42.1
PyJWT                         2.3.0
pymacaroons                   0.13.0
PyNaCl                        1.5.0
PyOpenGL                      3.1.6
pyparsing                     3.0.9
pyrsistent                    0.18.1
PySocks                       1.7.1
python-apt                    2.4.0+ubuntu1
python-dateutil               2.8.2
python-dotenv                 0.19.2
python-magic                  0.4.24
pythran                       0.10.0
pytorch-ignite                0.4.11
pytorch-lightning             2.0.4
pytz                          2023.3
PyWavelets                    1.4.1
PyYAML                        6.0
pyzmq                         25.1.0
requests                      2.31.0
requests-oauthlib             1.3.1
requests-toolbelt             0.9.1
rich                          12.6.0
roman                         3.3
rsa                           4.8
s3transfer                    0.5.0
scikit-image                  0.19.3
scikit-learn                  1.3.0
scipy                         1.11.1
screen-resolution-extra       0.0.0
seaborn                       0.12.1
SecretStorage                 3.3.1
sentry-sdk                    1.21.1
setproctitle                  1.3.2
setuptools                    68.0.0
shellingham                   1.5.0
shortuuid                     1.0.11
SimpleITK                     2.2.1
six                           1.16.0
smmap                         3.0.5
sos                           4.4
soupsieve                     2.4
SQLAlchemy                    1.3.24
ssh-import-id                 5.11
stack-data                    0.6.2
sympy                         1.12
systemd-python                234
tensorboard                   2.12.3
tensorboard-data-server       0.7.0
tensorboard-plugin-wit        1.8.1
tensorflow                    2.12.0
tensorflow-estimator          2.12.0
tensorflow-io-gcs-filesystem  0.26.0
termcolor                     1.1.0
texttable                     1.6.4
threadpoolctl                 3.1.0
tifffile                      2022.10.10
torch                         2.0.0
torchaudio                    2.0.1
torchio                       0.18.86
torchmetrics                  0.11.4
torchvision                   0.15.1
tornado                       6.3.2
tqdm                          4.65.0
traitlets                     5.9.0
triton                        2.0.0
typer                         0.7.0
typing_extensions             4.7.1
tzdata                        2023.3
ufw                           0.36.1
unattended-upgrades           0.1
urllib3                       2.0.3
virtualenv                    20.13.0+ds
wadllib                       1.3.6
wandb                         0.15.5
wcwidth                       0.2.6
websocket-client              1.2.3
Werkzeug                      2.1.2
wheel                         0.40.0
wrapt                         1.14.1
wxPython                      4.0.7
xkit                          0.0.0
yarl                          1.9.2
zipp                          3.15.0

@louisfb01
Collaborator Author

louisfb01 commented Jul 18, 2023

Updated the answer above with a new temporary fix.

Adding the line torch.multiprocessing.set_sharing_strategy('file_system') to the training script did work, as in this issue and here. I believe I had previously added it in the wrong section of the code by mistake. It now works as expected!

It is not super clean, but it at least allows us to train normally for now.

@jcohenadad
Member

It is not super clean

Why is that?

@louisfb01
Collaborator Author

It is not super clean

Why is that?

Never mind on that. I didn't like the idea of having to add this line and thought it was a "hard-coded" fix to a PyTorch issue, but after further research it seems to be normal behaviour. This issue can be closed with the solution of adding the torch.multiprocessing.set_sharing_strategy('file_system') line to the beginning of your training script.

@jcohenadad
Member

I just realized this magic syntax also fixed an issue for me in the past 😅
jcohenadad/model-seg-ms-mp2rage-monai@450f72e

@naga-karthik
Collaborator

Is there any deeper explanation anywhere as to why this fix is working?

@naga-karthik naga-karthik reopened this Jul 18, 2023
@louisfb01
Collaborator Author

louisfb01 commented Jul 18, 2023

Is there any deeper explanation anywhere as to why this fix is working?

This is what I found:

torch.multiprocessing is a wrapper around the native multiprocessing module. It registers custom reducers, that use shared memory to provide shared views on the same data in different processes. Once the tensor/storage is moved to shared_memory (see share_memory_()), it will be possible to send it to other processes without making any copies. (from torch documentation)

And from what I understand, this memory issue comes from using CacheDataset and is related to PyTorch's sharing strategy. The function torch.multiprocessing.set_sharing_strategy('file_system') sets a flag so that extra file descriptors are not created; more information here.

It still seems to be a temporary fix for not "having a high enough limit" on our system. This could be addressed by increasing ulimit, but not in our case, since we are limited as users on Romane.

Quote from the PyTorch docs:

Still, if your system has high enough limits, and file_descriptor is a supported strategy, we do not recommend switching to this one.

"This one" refers to the file_system strategy.
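
For reference, a short sketch of how one can inspect the available sharing strategies and the per-process open-file limit that ulimit controls (standard PyTorch and stdlib calls only; nothing project-specific is assumed):

```python
# Inspect PyTorch's multiprocessing sharing strategies and the per-process
# open-file limit that `ulimit -n` reports.
import resource

import torch.multiprocessing as mp

print(mp.get_all_sharing_strategies())  # on Linux: {'file_descriptor', 'file_system'}
print(mp.get_sharing_strategy())        # default on Linux: 'file_descriptor'

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-file limit: soft={soft}, hard={hard}")

# With 'file_descriptor', every tensor shared between loader workers and the main
# process keeps a file descriptor open, so a low soft limit (e.g. 1024-2000) can be
# exhausted on large cached datasets; 'file_system' avoids holding those descriptors.
mp.set_sharing_strategy('file_system')
```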

@naga-karthik
Collaborator

And from what I understand, this memory issue comes from using the CacheDataset and it has to do with pytorch's sharing strategy.

Ah, this is a good point! MONAI also provides PyTorch's native Dataset class. Could you please try using that once and remove the multiprocessing fix, to see if it is really CacheDataset that's the culprit? (You just have to switch to Dataset and use the right arguments; it's just a line of code. See the sketch below.)
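
A minimal sketch of the suggested swap, assuming the usual MONAI pattern of a list of data dictionaries and a composed transform (the file paths and transforms below are placeholders, not the actual pipeline):

```python
# Sketch of swapping CacheDataset for the plain Dataset class.
# The data dicts and transforms are placeholders, not the actual training pipeline.
from monai.data import CacheDataset, Dataset
from monai.transforms import Compose, EnsureChannelFirstd, LoadImaged

data_dicts = [{"image": "sub-01_T2w.nii.gz", "label": "sub-01_seg.nii.gz"}]  # hypothetical paths
transforms = Compose([
    LoadImaged(keys=["image", "label"]),
    EnsureChannelFirstd(keys=["image", "label"]),
])

# Current setup: transformed samples are cached in RAM up front.
cached_ds = CacheDataset(data=data_dicts, transform=transforms,
                         cache_rate=0.25, num_workers=4)

# Suggested test: no caching, transforms applied on the fly in __getitem__.
plain_ds = Dataset(data=data_dicts, transform=transforms)
```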

@louisfb01
Collaborator Author

louisfb01 commented Jul 27, 2023

I tried using MONAI's Dataset class instead of CacheDataset.

Using Dataset still does not allow us to remove torch.multiprocessing.set_sharing_strategy('file_system').

@louisfb01
Collaborator Author

louisfb01 commented Jul 27, 2023

New error even with torch.multiprocessing.set_sharing_strategy('file_system').

It happens both when using MONAI's Dataset class and CacheDataset.

The error occurs with all the aggregated datasets (approx. 7k total images across train/val/test), and it only happens at 73% (with CacheDataset) or 75% (with Dataset) of the first epoch:

Epoch 0:  75%|███████████████████████████████████████████████████████████████████████████████████████████████████████▍                                  | 1069/1426 [1:01:50<20:39,  3.47s/it, v_num=ymin]
Exception in thread Thread-7:
Traceback (most recent call last):
  File "/home/GRAMES.POLYMTL.CA/lobouz/miniconda3/envs/monai_training/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/home/GRAMES.POLYMTL.CA/lobouz/miniconda3/envs/monai_training/lib/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/home/GRAMES.POLYMTL.CA/lobouz/miniconda3/envs/monai_training/lib/python3.9/site-packages/torch/utils/data/_utils/pin_memory.py", line 51, in _pin_memory_loop
    do_one_step()
  File "/home/GRAMES.POLYMTL.CA/lobouz/miniconda3/envs/monai_training/lib/python3.9/site-packages/torch/utils/data/_utils/pin_memory.py", line 28, in do_one_step
    r = in_queue.get(timeout=MP_STATUS_CHECK_INTERVAL)
  File "/home/GRAMES.POLYMTL.CA/lobouz/miniconda3/envs/monai_training/lib/python3.9/multiprocessing/queues.py", line 122, in get
    return _ForkingPickler.loads(res)
  File "/home/GRAMES.POLYMTL.CA/lobouz/miniconda3/envs/monai_training/lib/python3.9/site-packages/torch/multiprocessing/reductions.py", line 324, in rebuild_storage_filename
    storage = torch.UntypedStorage._new_shared_filename_cpu(manager, handle, size)
RuntimeError: unable to mmap 160 bytes from file </torch_2814340_3051998398_49246>: Cannot allocate memory (12)
Traceback (most recent call last):
  File "/home/GRAMES.POLYMTL.CA/lobouz/github/contrast-agnostic-softseg-spinalcord/monai/main.py", line 636, in <module>
    main(args)
  File "/home/GRAMES.POLYMTL.CA/lobouz/github/contrast-agnostic-softseg-spinalcord/monai/main.py", line 536, in main
    trainer.fit(pl_model)
  File "/home/GRAMES.POLYMTL.CA/lobouz/miniconda3/envs/monai_training/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 531, in fit
    call._call_and_handle_interrupt(
  File "/home/GRAMES.POLYMTL.CA/lobouz/miniconda3/envs/monai_training/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 42, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/GRAMES.POLYMTL.CA/lobouz/miniconda3/envs/monai_training/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 570, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/GRAMES.POLYMTL.CA/lobouz/miniconda3/envs/monai_training/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 975, in _run
    results = self._run_stage()
  File "/home/GRAMES.POLYMTL.CA/lobouz/miniconda3/envs/monai_training/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1018, in _run_stage
    self.fit_loop.run()
  File "/home/GRAMES.POLYMTL.CA/lobouz/miniconda3/envs/monai_training/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 201, in run

Plus, one epoch takes over 1.5 hours to run. I think the next step is to investigate using Compute Canada and train with more GPUs and memory. Anyway, as @naga-karthik and I saw, Romane is getting pretty crowded nowadays and it is hard to train when you want to.

@jcohenadad
Member

I think the next step is to investigate using compute Canada and train with more GPUs and memory.

👍

@louisfb01
Collaborator Author

louisfb01 commented Aug 2, 2023

Regarding the issue above, I looked further into it, and it seems nobody can explain it (?).
For now, one solution is to set num_workers=0 once again, based on this thread, with no better solution available. Reminder that using num_workers=0 is not viable in our case: one epoch takes close to 5 hours instead of 1:40 with this setting.

Some threads are also still open, without finding the cause of the issue or another fix (I will update this comment once I've tried them):

We have no root cause but this issue disappeared now after we fixed a host mem leak issue in our project.
If you also met this problem, you can set the OS mmap limitation much higher as a work around. Or set the worker = 0 in dataloader also helps. (From this issue.)
-> This requires sudo, so I don't think it is pertinent.

Another potential solution is to implement our own version of the dataset class and apply this modification to transform the data into torch tensors (see the sketch below). -> did not work.
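
For illustration only, a rough sketch of what such a dataset wrapper could look like (this is hypothetical code, not the actual modification that was tried):

```python
# Hypothetical illustration of a dataset wrapper that returns torch tensors instead
# of numpy arrays, so samples cross the worker/main process boundary as tensors.
import numpy as np
import torch
from torch.utils.data import Dataset


class TensorisingDataset(Dataset):
    def __init__(self, samples):
        # `samples` is a list of dicts whose values may be numpy arrays.
        self.samples = samples

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        sample = self.samples[idx]
        # Convert numpy arrays to torch tensors; leave other values untouched.
        return {k: torch.as_tensor(v) if isinstance(v, np.ndarray) else v
                for k, v in sample.items()}
```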

Now working on implementing the code for Compute Canada to train with multiple GPUs and will update if we have the same problem.

@naga-karthik
Collaborator

Closing this as it is not relevant anymore -- at the moment, I am able to train the model on 11 datasets.

@plbenveniste
Collaborator

Hi @naga-karthik! I am re-opening this issue as I am facing the same problem as @louisfb01 when training with MONAI on large datasets. My training crashes during the validation step even though I am using torch.multiprocessing.set_sharing_strategy('file_system'). Here is my code.

The error I get is the following:

Error message
Traceback (most recent call last):
  File "/home/plbenveniste/miniconda3/envs/venv_monai/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/home/plbenveniste/miniconda3/envs/venv_monai/lib/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/home/plbenveniste/miniconda3/envs/venv_monai/lib/python3.9/site-packages/torch/utils/data/_utils/pin_memory.py", line 51, in _pin_memory_loop
    do_one_step()
  File "/home/plbenveniste/miniconda3/envs/venv_monai/lib/python3.9/site-packages/torch/utils/data/_utils/pin_memory.py", line 28, in do_one_step
    r = in_queue.get(timeout=MP_STATUS_CHECK_INTERVAL)
  File "/home/plbenveniste/miniconda3/envs/venv_monai/lib/python3.9/multiprocessing/queues.py", line 122, in get
    return _ForkingPickler.loads(res)
  File "/home/plbenveniste/miniconda3/envs/venv_monai/lib/python3.9/site-packages/torch/multiprocessing/reductions.py", line 324, in rebuild_storage_filename
    storage = torch.UntypedStorage._new_shared_filename_cpu(manager, handle, size)
RuntimeError: unable to mmap 68 bytes from file </torch_1405374_144476348_9956>: Cannot allocate memory (12)
Traceback (most recent call last):
  File "/home/plbenveniste/ms_lesion_agnostic/ms-lesion-agnostic/monai/train_monai_unet_lightning.py", line 823, in <module>
    main()
  File "/home/plbenveniste/ms_lesion_agnostic/ms-lesion-agnostic/monai/train_monai_unet_lightning.py", line 815, in main
    trainer.fit(pl_model)
  File "/home/plbenveniste/miniconda3/envs/venv_monai/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "/home/plbenveniste/miniconda3/envs/venv_monai/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 44, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/plbenveniste/miniconda3/envs/venv_monai/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/plbenveniste/miniconda3/envs/venv_monai/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 987, in _run
    results = self._run_stage()
  File "/home/plbenveniste/miniconda3/envs/venv_monai/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1033, in _run_stage
    self.fit_loop.run()
  File "/home/plbenveniste/miniconda3/envs/venv_monai/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 205, in run
    self.advance()
  File "/home/plbenveniste/miniconda3/envs/venv_monai/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 363, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "/home/plbenveniste/miniconda3/envs/venv_monai/lib/python3.9/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 141, in run
    self.on_advance_end(data_fetcher)
  File "/home/plbenveniste/miniconda3/envs/venv_monai/lib/python3.9/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 295, in on_advance_end
    self.val_loop.run()
  File "/home/plbenveniste/miniconda3/envs/venv_monai/lib/python3.9/site-packages/pytorch_lightning/loops/utilities.py", line 182, in _decorator
    return loop_run(self, *args, **kwargs)
  File "/home/plbenveniste/miniconda3/envs/venv_monai/lib/python3.9/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 128, in run
    batch, batch_idx, dataloader_idx = next(data_fetcher)
  File "/home/plbenveniste/miniconda3/envs/venv_monai/lib/python3.9/site-packages/pytorch_lightning/loops/fetchers.py", line 133, in __next__
    batch = super().__next__()
  File "/home/plbenveniste/miniconda3/envs/venv_monai/lib/python3.9/site-packages/pytorch_lightning/loops/fetchers.py", line 60, in __next__
    batch = next(self.iterator)
  File "/home/plbenveniste/miniconda3/envs/venv_monai/lib/python3.9/site-packages/pytorch_lightning/utilities/combined_loader.py", line 341, in __next__
    out = next(self._iterator)
  File "/home/plbenveniste/miniconda3/envs/venv_monai/lib/python3.9/site-packages/pytorch_lightning/utilities/combined_loader.py", line 142, in __next__
    out = next(self.iterators[0])
  File "/home/plbenveniste/miniconda3/envs/venv_monai/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 633, in __next__
    data = self._next_data()
  File "/home/plbenveniste/miniconda3/envs/venv_monai/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1328, in _next_data
    idx, data = self._get_data()
  File "/home/plbenveniste/miniconda3/envs/venv_monai/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1289, in _get_data
    raise RuntimeError('Pin memory thread exited unexpectedly')
RuntimeError: Pin memory thread exited unexpectedly

I had to set num_workers to zero for the validation dataloader to get the code to work. Is there any cleaner way around this? (By the way, I am running this on this server: GPU/CPU server, 512 cores/threads, Epyc 3.0 GHz, 128 GB RAM, 4 GPUs (4x NVIDIA A6000), and the memory is not reaching its limit.)
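
For context, a minimal sketch of that workaround in a LightningModule (all names and shapes below are placeholders, not the actual train_monai_unet_lightning.py):

```python
# Sketch of the workaround: keep worker processes for training but disable them
# for validation, which is where the crash occurs. Names and shapes are placeholders.
import pytorch_lightning as pl
import torch
from torch.utils.data import DataLoader, TensorDataset


class SegModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.train_ds = TensorDataset(torch.randn(8, 1, 16, 16, 16))
        self.val_ds = TensorDataset(torch.randn(4, 1, 16, 16, 16))

    def train_dataloader(self):
        return DataLoader(self.train_ds, batch_size=2, shuffle=True,
                          num_workers=4, pin_memory=True)

    def val_dataloader(self):
        # num_workers=0: validation batches are loaded in the main process, so no
        # shared-memory / mmap transfer between worker processes is involved.
        return DataLoader(self.val_ds, batch_size=2, num_workers=0)
```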

@plbenveniste plbenveniste reopened this Jul 16, 2024
@falqa

falqa commented Aug 20, 2024

@plbenveniste Have you had any luck with finding a solution?
