
Memory issue with training on Monai with large datasets #59

Open
louisfb01 opened this issue Jul 12, 2023 · 20 comments

@louisfb01
Collaborator

Hey all! As discussed with @naga-karthik, I've had some issues with the "aggregated training".

I am running into memory issues for which no fix seems to work.

I am using Naga's training script and tried pretty much every solution I could find, but I always get a RuntimeError('Pin memory thread exited unexpectedly'). This can be avoided by using 0 workers in these lines, but that makes training very slow (15+ min per epoch).

I tried with the exact same configuration as Naga too. The only difference is the number of images. Also, it works when removing most images from the dataset.json file (i.e., when working with a much smaller set).

I am still investigating this issue...
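
For reference, a minimal sketch of the trade-off being discussed, assuming a plain PyTorch DataLoader (the dataset, shapes, and batch size below are placeholders, not the actual training setup):

```python
# Minimal sketch of the num_workers / pin_memory trade-off described above.
# The dataset and values are placeholders, not the real training data.
import torch
from torch.utils.data import DataLoader, TensorDataset

train_dataset = TensorDataset(torch.randn(64, 1, 32, 32), torch.randint(0, 2, (64,)))

# num_workers > 0 with pin_memory=True is the configuration that crashes on the
# large dataset ("Pin memory thread exited unexpectedly"). With num_workers=0,
# batches are loaded in the main process, so no worker-to-main shared-memory
# transfer (and no pin-memory thread) is involved, at the cost of much slower epochs.
train_loader = DataLoader(
    train_dataset,
    batch_size=4,
    shuffle=True,
    num_workers=0,   # workaround: no worker processes
    pin_memory=True,
)
```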

@louisfb01 louisfb01 added the bug Something isn't working label Jul 12, 2023
@louisfb01 louisfb01 self-assigned this Jul 12, 2023
@naga-karthik
Collaborator

This is indeed strange. Can you also post the exact arguments that you're using with main.py?

@louisfb01
Collaborator Author

Pretty much no arguments, just the correct paths to the data, the default UNet, 100 epochs, and your hard-coded values.

@jcohenadad jcohenadad changed the title Issue with training on Monai with large datasets Memory issue with training on Monai with large datasets Jul 14, 2023
@jcohenadad
Member

@louisfb01 can you please list in this issue thread the various discussions on this topic (MONAI GitHub, Slack, forums, etc.)?

@louisfb01
Collaborator Author

louisfb01 commented Jul 16, 2023

The source seems to be a memory leak when storing too much data from a dataloader, from this. But in that case it happens when running the test set, whereas here it happens during training.

Solutions tried:

  • Adding the line torch.multiprocessing.set_sharing_strategy('file_system') to the training script did not work, as in this issue.
  • Increasing ulimit was another solution, but it is limited to 2000 per user on Romane. Increasing it from 1024 to 2000 did not solve the problem.
  • Changing pin_memory to False in those lines did not solve the issue; rather, it caused another error, received 0 items of ancdata, similar to here and this thread.
  • Changing the training batch size to 1, 2, or 4 did not help.
  • Changing the number of workers to 1, 2, 4, or 10 did not help.
  • Changing CacheDataset's number of workers did not help either.
  • Changing CacheDataset's cache_rate to 0.1 instead of 0.25 did not help.

The current workaround is to use num_workers=0, as in this issue and this one, and more.

New temporary fix:

  • Adding the line torch.multiprocessing.set_sharing_strategy('file_system') to the training script did work, as in this issue. I believe I had previously added it in the wrong section of the code by mistake. It now works as expected! (See the sketch below.)
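
A minimal sketch of where the fix goes, assuming a typical script entry point (everything except the set_sharing_strategy call is an illustrative placeholder, not the actual main.py):

```python
# Set the sharing strategy once, at import time, before any DataLoader workers
# are spawned. The main() body below is a placeholder, not the actual training code.
import torch.multiprocessing

# 'file_system' shares tensors between processes via files in shared memory
# instead of keeping one file descriptor open per shared tensor.
torch.multiprocessing.set_sharing_strategy('file_system')


def main():
    # ... build datasets, dataloaders, model, and trainer here ...
    pass


if __name__ == "__main__":
    main()
```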

@jcohenadad
Member

jcohenadad commented Jul 17, 2023

@louisfb01 can you please

  • show us the full STDOUT
  • open an issue on PyTorch and describe your problem
  • cross-reference it with this issue so we can follow up

@louisfb01
Collaborator Author

Here is more information about the environment (Python 3.9.17) and the output I get from running main.py (training with PyTorch, PyTorch Lightning, and MONAI).

STDOUT
(monai_training) lobouz@romane:~/github/contrast-agnostic-softseg-spinalcord/monai$ CUDA_VISIBLE_DEVICES=3 python main.py -m unet -nspv 4 -ncv 1 -initf 8 -bs 4 -lr 1e-3 -cve 4 -stp -epb 
Global seed set to 42
2023-07-18 10:01:13.007 | INFO     | __main__:main:472 -  Training on fold 1 out of 1 folds! 
/home/GRAMES.POLYMTL.CA/lobouz/miniconda3/envs/monai_training/lib/python3.9/site-packages/pytorch_lightning/utilities/parsing.py:196: UserWarning: Attribute 'loss_function' is an instance of `nn.Module` and is already saved during checkpointing. It is recommended to ignore them using `self.save_hyperparameters(ignore=['loss_function'])`.
  rank_zero_warn(
wandb: W&B API key is configured. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.15.5
wandb: Run data is saved locally in /home/GRAMES.POLYMTL.CA/lobouz/contrast-agnostic/saved_models/wandb/run-20230718_100119-9xlhu2uf
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run unet_nf=8_nrs=2_lr=0.001_20230718-1001
wandb: ⭐️ View project at https://wandb.ai/whats_ai/contrast-agnostic
wandb: 🚀 View run at https://wandb.ai/whats_ai/contrast-agnostic/runs/9xlhu2uf
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/home/GRAMES.POLYMTL.CA/lobouz/miniconda3/envs/monai_training/lib/python3.9/site-packages/monai/utils/deprecate_utils.py:321: FutureWarning: monai.transforms.io.dictionary LoadImaged.__init__:image_only: Current default value of argument `image_only=False` has been deprecated since version 1.1. It will be changed to `image_only=True` in version 1.3.
  warn_deprecated(argname, msg, warning_category)
Loading dataset: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 476/476 [01:45<00:00,  4.53it/s]
Loading dataset: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 119/119 [00:37<00:00,  3.20it/s]
Loading dataset: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 29/29 [00:10<00:00,  2.78it/s]
You are using a CUDA device ('NVIDIA RTX A6000') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [3]

  | Name          | Type         | Params
-----------------------------------------------
0 | net           | UNet         | 1.2 M 
1 | loss_function | SoftDiceLoss | 0     
-----------------------------------------------
1.2 M     Trainable params
0         Non-trainable params
1.2 M     Total params
4.809     Total estimated model params size (MB)
Sanity Checking DataLoader 0: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:02<00:00,  1.34s/it]Current epoch: 0
Average Soft Dice (VAL): 0.0071
Average Hard Dice (VAL): 0.0020
Best Average Soft Dice: 0.0071 at Epoch: 0
----------------------------------------------------
Epoch 0:   3%|████▌                                                                                                                                                      | 14/477 [00:23<12:59,  1.68s/it, v_num=u2uf]Exception in thread Thread-28:
Traceback (most recent call last):
  File "/home/GRAMES.POLYMTL.CA/lobouz/miniconda3/envs/monai_training/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/home/GRAMES.POLYMTL.CA/lobouz/miniconda3/envs/monai_training/lib/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/home/GRAMES.POLYMTL.CA/lobouz/miniconda3/envs/monai_training/lib/python3.9/site-packages/torch/utils/data/_utils/pin_memory.py", line 51, in _pin_memory_loop
    do_one_step()
  File "/home/GRAMES.POLYMTL.CA/lobouz/miniconda3/envs/monai_training/lib/python3.9/site-packages/torch/utils/data/_utils/pin_memory.py", line 28, in do_one_step
    r = in_queue.get(timeout=MP_STATUS_CHECK_INTERVAL)
  File "/home/GRAMES.POLYMTL.CA/lobouz/miniconda3/envs/monai_training/lib/python3.9/multiprocessing/queues.py", line 122, in get
    return _ForkingPickler.loads(res)
  File "/home/GRAMES.POLYMTL.CA/lobouz/miniconda3/envs/monai_training/lib/python3.9/site-packages/torch/multiprocessing/reductions.py", line 307, in rebuild_storage_fd
    fd = df.detach()
  File "/home/GRAMES.POLYMTL.CA/lobouz/miniconda3/envs/monai_training/lib/python3.9/multiprocessing/resource_sharer.py", line 58, in detach
    return reduction.recv_handle(conn)
  File "/home/GRAMES.POLYMTL.CA/lobouz/miniconda3/envs/monai_training/lib/python3.9/multiprocessing/reduction.py", line 189, in recv_handle
    return recvfds(s, 1)[0]
  File "/home/GRAMES.POLYMTL.CA/lobouz/miniconda3/envs/monai_training/lib/python3.9/multiprocessing/reduction.py", line 164, in recvfds
    raise RuntimeError('received %d items of ancdata' %
RuntimeError: received 0 items of ancdata
Traceback (most recent call last):
  File "/home/GRAMES.POLYMTL.CA/lobouz/github/contrast-agnostic-softseg-spinalcord/monai/main.py", line 605, in <module>
    main(args)
  File "/home/GRAMES.POLYMTL.CA/lobouz/github/contrast-agnostic-softseg-spinalcord/monai/main.py", line 516, in main
    trainer.fit(pl_model)
  File "/home/GRAMES.POLYMTL.CA/lobouz/miniconda3/envs/monai_training/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 531, in fit
    call._call_and_handle_interrupt(
  File "/home/GRAMES.POLYMTL.CA/lobouz/miniconda3/envs/monai_training/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 42, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/GRAMES.POLYMTL.CA/lobouz/miniconda3/envs/monai_training/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 570, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/GRAMES.POLYMTL.CA/lobouz/miniconda3/envs/monai_training/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 975, in _run
    results = self._run_stage()
  File "/home/GRAMES.POLYMTL.CA/lobouz/miniconda3/envs/monai_training/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1018, in _run_stage
    self.fit_loop.run()
  File "/home/GRAMES.POLYMTL.CA/lobouz/miniconda3/envs/monai_training/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 201, in run
    self.advance()
  File "/home/GRAMES.POLYMTL.CA/lobouz/miniconda3/envs/monai_training/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 354, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "/home/GRAMES.POLYMTL.CA/lobouz/miniconda3/envs/monai_training/lib/python3.9/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 133, in run
    self.advance(data_fetcher)
  File "/home/GRAMES.POLYMTL.CA/lobouz/miniconda3/envs/monai_training/lib/python3.9/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 189, in advance
    batch = next(data_fetcher)
  File "/home/GRAMES.POLYMTL.CA/lobouz/miniconda3/envs/monai_training/lib/python3.9/site-packages/pytorch_lightning/loops/fetchers.py", line 136, in __next__
    self._fetch_next_batch(self.dataloader_iter)
  File "/home/GRAMES.POLYMTL.CA/lobouz/miniconda3/envs/monai_training/lib/python3.9/site-packages/pytorch_lightning/loops/fetchers.py", line 150, in _fetch_next_batch
    batch = next(iterator)
  File "/home/GRAMES.POLYMTL.CA/lobouz/miniconda3/envs/monai_training/lib/python3.9/site-packages/pytorch_lightning/utilities/combined_loader.py", line 284, in __next__
    out = next(self._iterator)
  File "/home/GRAMES.POLYMTL.CA/lobouz/miniconda3/envs/monai_training/lib/python3.9/site-packages/pytorch_lightning/utilities/combined_loader.py", line 65, in __next__
    out[i] = next(self.iterators[i])
  File "/home/GRAMES.POLYMTL.CA/lobouz/miniconda3/envs/monai_training/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 634, in __next__
    data = self._next_data()
  File "/home/GRAMES.POLYMTL.CA/lobouz/miniconda3/envs/monai_training/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1329, in _next_data
    idx, data = self._get_data()
  File "/home/GRAMES.POLYMTL.CA/lobouz/miniconda3/envs/monai_training/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1290, in _get_data
    raise RuntimeError('Pin memory thread exited unexpectedly')
RuntimeError: Pin memory thread exited unexpectedly

Environment details (pip list)
(monai_training) lobouz@romane:~$ pip list
Package                       Version
----------------------------- --------------
absl-py                       1.1.0
aiohttp                       3.8.4
aiosignal                     1.3.1
appdirs                       1.4.4
astor                         0.8.1
asttokens                     2.2.1
astunparse                    1.6.3
async-timeout                 4.0.2
attrs                         21.2.0
awscli                        1.22.34
backcall                      0.2.0
backports.functools-lru-cache 1.6.5
beautifulsoup4                4.11.2
beniget                       0.4.1
bids-validator                1.9.9
blinker                       1.4
botocore                      1.23.34
Brotli                        1.0.9
brz-etckeeper                 0.0.0
cachetools                    5.2.0
certifi                       2023.5.7
cffi                          1.15.1
chardet                       4.0.0
charset-normalizer            3.1.0
click                         8.1.3
cmake                         3.26.4
colorama                      0.4.6
coloredlogs                   15.0.1
comm                          0.1.3
command-not-found             0.3
commonmark                    0.9.1
contourpy                     1.1.0
cryptography                  3.4.8
csv-diff                      1.1
cycler                        0.11.0
dbus-python                   1.2.18
debugpy                       1.6.7
decorator                     5.1.1
Deprecated                    1.2.13
dictdiffer                    0.9.0
dill                          0.3.5.1
distlib                       0.3.4
distro                        1.7.0
distro-info                   1.1build1
dnspython                     2.1.0
docker                        5.0.3
docker-compose                1.29.2
docker-pycreds                0.4.0
dockerpty                     0.4.1
docopt                        0.6.2
docutils                      0.17.1
entrypoints                   0.4
executing                     1.2.0
filelock                      3.12.2
flatbuffers                   2.0.7
fonttools                     4.40.0
formulaic                     0.3.4
frozenlist                    1.3.3
fsleyes                       1.7.0
fsleyes-props                 1.8.2
fsleyes-widgets               0.14.2
fslpy                         3.13.0
fsspec                        2023.6.0
gast                          0.4.0
gdown                         4.6.4
gitdb                         4.0.10
GitPython                     3.1.31
gmpy2                         2.1.2
google-auth                   2.19.0
google-auth-oauthlib          1.0.0
google-pasta                  0.2.0
gpg                           1.16.0-unknown
grpcio                        1.54.2
h5py                          3.7.0
httplib2                      0.20.2
humanfriendly                 10.0
humanize                      4.4.0
idna                          3.4
imageio                       2.22.4
imgaug                        0.2.5
importlib-metadata            6.8.0
importlib-resources           6.0.0
interface-meta                1.3.0
iotop                         0.6
ipykernel                     6.24.0
ipython                       8.14.0
ivadomed                      2.9.7
jax                           0.4.11
jedi                          0.18.2
jeepney                       0.7.1
Jinja2                        3.1.2
jmespath                      0.10.0
joblib                        1.3.0
jsonschema                    3.2.0
jupyter_client                8.3.0
jupyter_core                  5.3.1
keras                         2.12.0
Keras-Preprocessing           1.1.2
keyring                       23.5.0
kiwisolver                    1.4.4
launchpadlib                  1.10.16
lazr.restfulclient            0.14.4
lazr.uri                      1.0.6
libclang                      14.0.1
lightning-utilities           0.9.0
lit                           16.0.6
loguru                        0.7.0
Markdown                      3.3.6
MarkupSafe                    2.1.3
matplotlib                    3.7.2
matplotlib-inline             0.1.6
ml-dtypes                     0.1.0
monai                         1.2.0
monai-weekly                  1.2.dev2311
more-itertools                8.10.0
mpmath                        1.3.0
multidict                     6.0.4
nest-asyncio                  1.5.6
netifaces                     0.11.0
networkx                      3.1
nibabel                       5.1.0
num2words                     0.5.12
numpy                         1.25.0
nvidia-cublas-cu11            11.10.3.66
nvidia-cuda-cupti-cu11        11.7.101
nvidia-cuda-nvrtc-cu11        11.7.99
nvidia-cuda-runtime-cu11      11.7.99
nvidia-cudnn-cu11             8.5.0.96
nvidia-cufft-cu11             10.9.0.58
nvidia-curand-cu11            10.2.10.91
nvidia-cusolver-cu11          11.4.0.1
nvidia-cusparse-cu11          11.7.4.91
nvidia-nccl-cu11              2.14.3
nvidia-nvtx-cu11              11.7.91
oauthlib                      3.2.0
onnxruntime                   1.13.1
opt-einsum                    3.3.0
osfclient                     0.0.5
packaging                     23.1
pandas                        2.0.3
parso                         0.8.3
pathtools                     0.1.2
pexpect                       4.8.0
pickleshare                   0.7.5
Pillow                        10.0.0
pip                           23.1.2
platformdirs                  3.8.0
ply                           3.11
pooch                         1.7.0
promise                       2.3
prompt-toolkit                3.0.39
protobuf                      3.20.3
psutil                        5.9.5
ptyprocess                    0.7.0
pure-eval                     0.2.2
pyasn1                        0.4.8
pyasn1-modules                0.2.8
pybids                        0.15.5
pycparser                     2.21
Pygments                      2.15.1
PyGObject                     3.42.1
PyJWT                         2.3.0
pymacaroons                   0.13.0
PyNaCl                        1.5.0
PyOpenGL                      3.1.6
pyparsing                     3.0.9
pyrsistent                    0.18.1
PySocks                       1.7.1
python-apt                    2.4.0+ubuntu1
python-dateutil               2.8.2
python-dotenv                 0.19.2
python-magic                  0.4.24
pythran                       0.10.0
pytorch-ignite                0.4.11
pytorch-lightning             2.0.4
pytz                          2023.3
PyWavelets                    1.4.1
PyYAML                        6.0
pyzmq                         25.1.0
requests                      2.31.0
requests-oauthlib             1.3.1
requests-toolbelt             0.9.1
rich                          12.6.0
roman                         3.3
rsa                           4.8
s3transfer                    0.5.0
scikit-image                  0.19.3
scikit-learn                  1.3.0
scipy                         1.11.1
screen-resolution-extra       0.0.0
seaborn                       0.12.1
SecretStorage                 3.3.1
sentry-sdk                    1.21.1
setproctitle                  1.3.2
setuptools                    68.0.0
shellingham                   1.5.0
shortuuid                     1.0.11
SimpleITK                     2.2.1
six                           1.16.0
smmap                         3.0.5
sos                           4.4
soupsieve                     2.4
SQLAlchemy                    1.3.24
ssh-import-id                 5.11
stack-data                    0.6.2
sympy                         1.12
systemd-python                234
tensorboard                   2.12.3
tensorboard-data-server       0.7.0
tensorboard-plugin-wit        1.8.1
tensorflow                    2.12.0
tensorflow-estimator          2.12.0
tensorflow-io-gcs-filesystem  0.26.0
termcolor                     1.1.0
texttable                     1.6.4
threadpoolctl                 3.1.0
tifffile                      2022.10.10
torch                         2.0.0
torchaudio                    2.0.1
torchio                       0.18.86
torchmetrics                  0.11.4
torchvision                   0.15.1
tornado                       6.3.2
tqdm                          4.65.0
traitlets                     5.9.0
triton                        2.0.0
typer                         0.7.0
typing_extensions             4.7.1
tzdata                        2023.3
ufw                           0.36.1
unattended-upgrades           0.1
urllib3                       2.0.3
virtualenv                    20.13.0+ds
wadllib                       1.3.6
wandb                         0.15.5
wcwidth                       0.2.6
websocket-client              1.2.3
Werkzeug                      2.1.2
wheel                         0.40.0
wrapt                         1.14.1
wxPython                      4.0.7
xkit                          0.0.0
yarl                          1.9.2
zipp                          3.15.0

@louisfb01
Collaborator Author

louisfb01 commented Jul 18, 2023

Updated the answer above with a new temporary fix.

Adding the line torch.multiprocessing.set_sharing_strategy('file_system') to the training script did work, as in this issue and here. I believe I had previously added it in the wrong section of the code by mistake. It now works as expected!

It is not super clean, but it at least allows us to train normally for now.

@jcohenadad
Member

It is not super clean

Why is that?

@louisfb01
Collaborator Author

It is not super clean

Why is that?

Never mind on that. I didn't like the idea of having to add this line and thought it was a "hard-coded" fix to a PyTorch issue, but after further research it seems to be normal behaviour. This issue can be closed with the solution of adding the torch.multiprocessing.set_sharing_strategy('file_system') line to the beginning of your training script.

@jcohenadad
Member

I just realized this magic syntax also fixed an issue for me in the past 😅
jcohenadad/model-seg-ms-mp2rage-monai@450f72e

@naga-karthik
Collaborator

Is there any deeper explanation anywhere as to why this fix is working?

@naga-karthik naga-karthik reopened this Jul 18, 2023
@louisfb01
Collaborator Author

louisfb01 commented Jul 18, 2023

Is there any deeper explanation anywhere as to why this fix is working?

This is what I found:

torch.multiprocessing is a wrapper around the native multiprocessing module. It registers custom reducers, that use shared memory to provide shared views on the same data in different processes. Once the tensor/storage is moved to shared_memory (see share_memory_()), it will be possible to send it to other processes without making any copies. (from torch documentation)

And from what I understand, this memory issue comes from using CacheDataset and is related to PyTorch's sharing strategy. The function torch.multiprocessing.set_sharing_strategy('file_system') sets a flag so that extra file descriptors are not created; more information here.

It still seems to be a temporary fix for not "having a high enough limit" on our system. This could be addressed by increasing ulimit, but not in our case, since we are limited as users on Romane.

Quote from the PyTorch docs:

Still, if your system has high enough limits, and file_descriptor is a supported strategy, we do not recommend switching to this one.

"This one" refers to the file_system strategy.
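
For reference, a short sketch of how one can inspect the available sharing strategies and the per-process open-file limit that ulimit controls (standard PyTorch and stdlib calls only; nothing project-specific is assumed):

```python
# Inspect PyTorch's multiprocessing sharing strategies and the per-process
# open-file limit that `ulimit -n` reports.
import resource

import torch.multiprocessing as mp

print(mp.get_all_sharing_strategies())  # on Linux: {'file_descriptor', 'file_system'}
print(mp.get_sharing_strategy())        # default on Linux: 'file_descriptor'

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-file limit: soft={soft}, hard={hard}")

# With 'file_descriptor', every tensor shared between loader workers and the main
# process keeps a file descriptor open, so a low soft limit (e.g. 1024-2000) can be
# exhausted on large cached datasets; 'file_system' avoids holding those descriptors.
mp.set_sharing_strategy('file_system')
```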

@naga-karthik
Collaborator

And from what I understand, this memory issue comes from using the CacheDataset and it has to do with pytorch's sharing strategy.

Ah, this is a good point! MONAI also provides PyTorch's native Dataset class. Could you please try using that once and remove the multiprocessing fix, to see if it is really CacheDataset that's the culprit? (You just have to switch to Dataset and use the right arguments; it's just a line of code. See the sketch below.)
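
A minimal sketch of the suggested swap, assuming the usual MONAI pattern of a list of data dictionaries and a composed transform (the file paths and transforms below are placeholders, not the actual pipeline):

```python
# Sketch of swapping CacheDataset for the plain Dataset class.
# The data dicts and transforms are placeholders, not the actual training pipeline.
from monai.data import CacheDataset, Dataset
from monai.transforms import Compose, EnsureChannelFirstd, LoadImaged

data_dicts = [{"image": "sub-01_T2w.nii.gz", "label": "sub-01_seg.nii.gz"}]  # hypothetical paths
transforms = Compose([
    LoadImaged(keys=["image", "label"]),
    EnsureChannelFirstd(keys=["image", "label"]),
])

# Current setup: transformed samples are cached in RAM up front.
cached_ds = CacheDataset(data=data_dicts, transform=transforms,
                         cache_rate=0.25, num_workers=4)

# Suggested test: no caching, transforms applied on the fly in __getitem__.
plain_ds = Dataset(data=data_dicts, transform=transforms)
```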

@louisfb01
Collaborator Author

louisfb01 commented Jul 27, 2023

I tried using MONAI's Dataset class instead of CacheDataset.

Using Dataset still does not allow us to remove torch.multiprocessing.set_sharing_strategy('file_system').

@louisfb01
Collaborator Author

louisfb01 commented Jul 27, 2023

New error even with torch.multiprocessing.set_sharing_strategy('file_system').

It happens both when using MONAI's Dataset class and CacheDataset.

The error occurs with all the aggregated datasets (approx. 7k total images across train/val/test), and it only happens at 73% (with CacheDataset) or 75% (with Dataset) of the first epoch:

Epoch 0:  75%|███████████████████████████████████████████████████████████████████████████████████████████████████████▍                                  | 1069/1426 [1:01:50<20:39,  3.47s/it, v_num=ymin]
Exception in thread Thread-7:
Traceback (most recent call last):
  File "/home/GRAMES.POLYMTL.CA/lobouz/miniconda3/envs/monai_training/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/home/GRAMES.POLYMTL.CA/lobouz/miniconda3/envs/monai_training/lib/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/home/GRAMES.POLYMTL.CA/lobouz/miniconda3/envs/monai_training/lib/python3.9/site-packages/torch/utils/data/_utils/pin_memory.py", line 51, in _pin_memory_loop
    do_one_step()
  File "/home/GRAMES.POLYMTL.CA/lobouz/miniconda3/envs/monai_training/lib/python3.9/site-packages/torch/utils/data/_utils/pin_memory.py", line 28, in do_one_step
    r = in_queue.get(timeout=MP_STATUS_CHECK_INTERVAL)
  File "/home/GRAMES.POLYMTL.CA/lobouz/miniconda3/envs/monai_training/lib/python3.9/multiprocessing/queues.py", line 122, in get
    return _ForkingPickler.loads(res)
  File "/home/GRAMES.POLYMTL.CA/lobouz/miniconda3/envs/monai_training/lib/python3.9/site-packages/torch/multiprocessing/reductions.py", line 324, in rebuild_storage_filename
    storage = torch.UntypedStorage._new_shared_filename_cpu(manager, handle, size)
RuntimeError: unable to mmap 160 bytes from file </torch_2814340_3051998398_49246>: Cannot allocate memory (12)
Traceback (most recent call last):
  File "/home/GRAMES.POLYMTL.CA/lobouz/github/contrast-agnostic-softseg-spinalcord/monai/main.py", line 636, in <module>
    main(args)
  File "/home/GRAMES.POLYMTL.CA/lobouz/github/contrast-agnostic-softseg-spinalcord/monai/main.py", line 536, in main
    trainer.fit(pl_model)
  File "/home/GRAMES.POLYMTL.CA/lobouz/miniconda3/envs/monai_training/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 531, in fit
    call._call_and_handle_interrupt(
  File "/home/GRAMES.POLYMTL.CA/lobouz/miniconda3/envs/monai_training/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 42, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/GRAMES.POLYMTL.CA/lobouz/miniconda3/envs/monai_training/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 570, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/GRAMES.POLYMTL.CA/lobouz/miniconda3/envs/monai_training/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 975, in _run
    results = self._run_stage()
  File "/home/GRAMES.POLYMTL.CA/lobouz/miniconda3/envs/monai_training/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1018, in _run_stage
    self.fit_loop.run()
  File "/home/GRAMES.POLYMTL.CA/lobouz/miniconda3/envs/monai_training/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 201, in run

Plus, one epoch takes over 1.5 hours to run. I think the next step is to investigate using Compute Canada and train with more GPUs and memory. Anyway, as @naga-karthik and I saw, Romane is getting pretty crowded nowadays and it is hard to train when you want to.

@jcohenadad
Member

I think the next step is to investigate using compute Canada and train with more GPUs and memory.

👍

@louisfb01
Collaborator Author

louisfb01 commented Aug 2, 2023

Regarding the issue above, I looked further into it, and it seems nobody can explain it (?).
For now, one solution is to set num_workers=0 once again, based on this thread, with no better solution available. Reminder that using num_workers=0 is not viable in our case: one epoch takes close to 5 hours instead of 1:40 with this setting.

Some threads are also still open, without finding the cause of the issue or another fix (I will update this comment once I've tried them):

We have no root cause but this issue disappeared now after we fixed a host mem leak issue in our project.
If you also met this problem, you can set the OS mmap limitation much higher as a work around. Or set the worker = 0 in dataloader also helps. (From this issue.)
-> This requires sudo, so I don't think it is pertinent.

Another potential solution is to implement our own version of the dataset class and apply this modification to transform the data into torch tensors (see the sketch below). -> did not work.
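
For illustration only, a rough sketch of what such a dataset wrapper could look like (this is hypothetical code, not the actual modification that was tried):

```python
# Hypothetical illustration of a dataset wrapper that returns torch tensors instead
# of numpy arrays, so samples cross the worker/main process boundary as tensors.
import numpy as np
import torch
from torch.utils.data import Dataset


class TensorisingDataset(Dataset):
    def __init__(self, samples):
        # `samples` is a list of dicts whose values may be numpy arrays.
        self.samples = samples

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        sample = self.samples[idx]
        # Convert numpy arrays to torch tensors; leave other values untouched.
        return {k: torch.as_tensor(v) if isinstance(v, np.ndarray) else v
                for k, v in sample.items()}
```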

Now working on implementing the code for Compute Canada to train with multiple GPUs and will update if we have the same problem.

@naga-karthik
Collaborator

Closing this as it is not relevant anymore -- at the moment, I am able to train the model on 11 datasets.

@plbenveniste
Collaborator

Hi @naga-karthik! I am re-opening this issue as I am facing the same problem as @louisfb01 when training with MONAI on large datasets. My training crashes during the validation step even though I am using torch.multiprocessing.set_sharing_strategy('file_system'). Here is my code.

The error I get is the following:

Error message
Traceback (most recent call last):
  File "/home/plbenveniste/miniconda3/envs/venv_monai/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/home/plbenveniste/miniconda3/envs/venv_monai/lib/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/home/plbenveniste/miniconda3/envs/venv_monai/lib/python3.9/site-packages/torch/utils/data/_utils/pin_memory.py", line 51, in _pin_memory_loop
    do_one_step()
  File "/home/plbenveniste/miniconda3/envs/venv_monai/lib/python3.9/site-packages/torch/utils/data/_utils/pin_memory.py", line 28, in do_one_step
    r = in_queue.get(timeout=MP_STATUS_CHECK_INTERVAL)
  File "/home/plbenveniste/miniconda3/envs/venv_monai/lib/python3.9/multiprocessing/queues.py", line 122, in get
    return _ForkingPickler.loads(res)
  File "/home/plbenveniste/miniconda3/envs/venv_monai/lib/python3.9/site-packages/torch/multiprocessing/reductions.py", line 324, in rebuild_storage_filename
    storage = torch.UntypedStorage._new_shared_filename_cpu(manager, handle, size)
RuntimeError: unable to mmap 68 bytes from file </torch_1405374_144476348_9956>: Cannot allocate memory (12)
Traceback (most recent call last):
  File "/home/plbenveniste/ms_lesion_agnostic/ms-lesion-agnostic/monai/train_monai_unet_lightning.py", line 823, in <module>
    main()
  File "/home/plbenveniste/ms_lesion_agnostic/ms-lesion-agnostic/monai/train_monai_unet_lightning.py", line 815, in main
    trainer.fit(pl_model)
  File "/home/plbenveniste/miniconda3/envs/venv_monai/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "/home/plbenveniste/miniconda3/envs/venv_monai/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 44, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/plbenveniste/miniconda3/envs/venv_monai/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/plbenveniste/miniconda3/envs/venv_monai/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 987, in _run
    results = self._run_stage()
  File "/home/plbenveniste/miniconda3/envs/venv_monai/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1033, in _run_stage
    self.fit_loop.run()
  File "/home/plbenveniste/miniconda3/envs/venv_monai/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 205, in run
    self.advance()
  File "/home/plbenveniste/miniconda3/envs/venv_monai/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 363, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "/home/plbenveniste/miniconda3/envs/venv_monai/lib/python3.9/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 141, in run
    self.on_advance_end(data_fetcher)
  File "/home/plbenveniste/miniconda3/envs/venv_monai/lib/python3.9/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 295, in on_advance_end
    self.val_loop.run()
  File "/home/plbenveniste/miniconda3/envs/venv_monai/lib/python3.9/site-packages/pytorch_lightning/loops/utilities.py", line 182, in _decorator
    return loop_run(self, *args, **kwargs)
  File "/home/plbenveniste/miniconda3/envs/venv_monai/lib/python3.9/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 128, in run
    batch, batch_idx, dataloader_idx = next(data_fetcher)
  File "/home/plbenveniste/miniconda3/envs/venv_monai/lib/python3.9/site-packages/pytorch_lightning/loops/fetchers.py", line 133, in __next__
    batch = super().__next__()
  File "/home/plbenveniste/miniconda3/envs/venv_monai/lib/python3.9/site-packages/pytorch_lightning/loops/fetchers.py", line 60, in __next__
    batch = next(self.iterator)
  File "/home/plbenveniste/miniconda3/envs/venv_monai/lib/python3.9/site-packages/pytorch_lightning/utilities/combined_loader.py", line 341, in __next__
    out = next(self._iterator)
  File "/home/plbenveniste/miniconda3/envs/venv_monai/lib/python3.9/site-packages/pytorch_lightning/utilities/combined_loader.py", line 142, in __next__
    out = next(self.iterators[0])
  File "/home/plbenveniste/miniconda3/envs/venv_monai/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 633, in __next__
    data = self._next_data()
  File "/home/plbenveniste/miniconda3/envs/venv_monai/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1328, in _next_data
    idx, data = self._get_data()
  File "/home/plbenveniste/miniconda3/envs/venv_monai/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1289, in _get_data
    raise RuntimeError('Pin memory thread exited unexpectedly')
RuntimeError: Pin memory thread exited unexpectedly

I had to set num_workers to zero for the validation dataloader to get the code to work. Is there any cleaner way around this? (By the way, I am running this on this server: GPU/CPU server, 512 cores/threads, Epyc 3.0 GHz, 128 GB RAM, 4 GPUs (4x NVIDIA A6000), and the memory is not reaching its limit.)
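
For context, a minimal sketch of that workaround in a LightningModule (all names and shapes below are placeholders, not the actual train_monai_unet_lightning.py):

```python
# Sketch of the workaround: keep worker processes for training but disable them
# for validation, which is where the crash occurs. Names and shapes are placeholders.
import pytorch_lightning as pl
import torch
from torch.utils.data import DataLoader, TensorDataset


class SegModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.train_ds = TensorDataset(torch.randn(8, 1, 16, 16, 16))
        self.val_ds = TensorDataset(torch.randn(4, 1, 16, 16, 16))

    def train_dataloader(self):
        return DataLoader(self.train_ds, batch_size=2, shuffle=True,
                          num_workers=4, pin_memory=True)

    def val_dataloader(self):
        # num_workers=0: validation batches are loaded in the main process, so no
        # shared-memory / mmap transfer between worker processes is involved.
        return DataLoader(self.val_ds, batch_size=2, num_workers=0)
```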

@plbenveniste plbenveniste reopened this Jul 16, 2024
@falqa

falqa commented Aug 20, 2024

@plbenveniste Have you had any luck with finding a solution?
