Hi, @kwea123. I am conducting some experiments with this MVSNet implementation because of its clear and simple PyTorch Lightning wrapper.
To speed up training, I am training the model with 3 GPUs on my server, but an error comes up when the hyperparameter --gpu_num is simply set to 3. PyTorch Lightning raised the following warning:
"You seem to have configured a sampler in your DataLoader. This will be replaced "
" by `DistributedSampler` since `replace_sampler_ddp` is True and you are using"
" distributed training. Either remove the sampler from your DataLoader or set"
" `replace_sampler_ddp=False` if you want to use your custom sampler."
To solve this problem, I modified train.py by setting a parameter in the PL Trainer:
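Roughly like the sketch below (the exact arguments in my train.py may differ; this assumes the standard PL 0.9.x-style Trainer options):

```python
from pytorch_lightning import Trainer

# Sketch only, assuming PL 0.9.x-style Trainer arguments; the actual train.py may differ.
trainer = Trainer(
    gpus=3,                     # train on 3 GPUs
    distributed_backend='ddp',  # DistributedDataParallel backend
    replace_sampler_ddp=False,  # keep the DataLoader's custom sampler, as the warning suggests
)
```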
The model can be trained after this parameter is configured.
Is this the correct way to enable multi-GPU training?
For some reason, I cannot install NVIDIA Apex on my current server.
Should I use SyncBatchNorm for this model implementation, and if so, how?
Does training without SyncBN affect performance?
If I should use it, please tell me whether to use nn.SyncBatchNorm.convert_sync_batchnorm() or PyTorch Lightning's sync_bn option in the Trainer configuration.
Thanks a lot. 😊
Hi, @geovsion.
Yes, I solved multi-GPU training by specifying the num_gpus property for the PL Trainer and adding SyncBatchNorm support. For this, I updated the main packages: PL to 0.9.0 and PyTorch to 1.6.0.
As the author didn't reply quickly, I forked the original repo to sleeplessai/mvsnet2_pl to maintain it in the future. The code has been tested on a 3-GPU cluster node and works well.
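For reference, here is a minimal sketch of the two ways to enable SyncBatchNorm under DDP (assuming PL 0.9.0 and PyTorch 1.6.0; this is an illustration, not the exact code in my fork):

```python
import torch
from pytorch_lightning import Trainer

# Option 1: let Lightning convert BatchNorm layers via the sync_batchnorm flag (PL >= 0.9.0).
trainer = Trainer(
    gpus=3,
    distributed_backend='ddp',
    sync_batchnorm=True,  # converts BatchNorm layers to SyncBatchNorm under DDP
)

# Option 2: convert the model explicitly with native PyTorch (>= 1.1) before training.
# `model` stands for the LightningModule wrapping MVSNet; the name is only illustrative.
# model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
```

Without SyncBN, each GPU normalizes with its own local batch statistics, which can hurt results when the per-GPU batch size is small, so syncing is usually the safer choice for multi-GPU training.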