Hi, @kwea123. I am conducting some experiments with this MVSNet implementation because of its clear and simple PyTorch Lightning wrapper.
To speed up training, I am training the model with 3 GPUs on my server, but an error comes up when the hyperparameter --gpu_num is simply set to 3. PyTorch Lightning raised the following warning:
"You seem to have configured a sampler in your DataLoader. This will be replaced "
" by `DistributedSampler` since `replace_sampler_ddp` is True and you are using"
" distributed training. Either remove the sampler from your DataLoader or set"
" `replace_sampler_ddp=False` if you want to use your custom sampler."
To solve this problem, I modified train.py by setting a parameter in the PL Trainer:
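Roughly like the sketch below (the exact arguments in my train.py may differ; this assumes the standard PL 0.9.x-style Trainer options):

```python
from pytorch_lightning import Trainer

# Sketch only, assuming PL 0.9.x-style Trainer arguments; the actual train.py may differ.
trainer = Trainer(
    gpus=3,                     # train on 3 GPUs
    distributed_backend='ddp',  # DistributedDataParallel backend
    replace_sampler_ddp=False,  # keep the DataLoader's custom sampler, as the warning suggests
)
```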
The model can be trained after this parameter is configured.
Is this the correct way to enable multi-GPU training?
For some reason, I cannot install NVIDIA Apex on my current server.
Should I use SyncBatchNorm for this model implementation, and if so, how?
Does training without SyncBN affect performance?
If I should use it, please tell me whether to use nn.SyncBatchNorm.convert_sync_batchnorm() or PyTorch Lightning's sync_bn option in the Trainer configuration.
Thanks a lot. 😊
Hi, @geovsion.
Yes, I solved multi-GPU training by specifying the num_gpus property for the PL Trainer and adding SyncBatchNorm support. For this, I updated the main packages: PL to 0.9.0 and PyTorch to 1.6.0.
As the author didn't reply quickly, I forked the original repo to sleeplessai/mvsnet2_pl to maintain it in the future. The code has been tested on a 3-GPU cluster node and works well.
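For reference, here is a minimal sketch of the two ways to enable SyncBatchNorm under DDP (assuming PL 0.9.0 and PyTorch 1.6.0; this is an illustration, not the exact code in my fork):

```python
import torch
from pytorch_lightning import Trainer

# Option 1: let Lightning convert BatchNorm layers via the sync_batchnorm flag (PL >= 0.9.0).
trainer = Trainer(
    gpus=3,
    distributed_backend='ddp',
    sync_batchnorm=True,  # converts BatchNorm layers to SyncBatchNorm under DDP
)

# Option 2: convert the model explicitly with native PyTorch (>= 1.1) before training.
# `model` stands for the LightningModule wrapping MVSNet; the name is only illustrative.
# model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
```

Without SyncBN, each GPU normalizes with its own local batch statistics, which can hurt results when the per-GPU batch size is small, so syncing is usually the safer choice for multi-GPU training.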