Hey everyone,
I am trying to get acquainted with UniAD and followed the instructions, but when I ran the evaluation example:
./tools/uniad_dist_eval.sh ./projects/configs/stage1_track_map/base_track_map.py ./ckpts/uniad_base_track_map.pth 4
I received the following error:
Traceback (most recent call last):
File "./tools/test.py", line 261, in
main()
File "./tools/test.py", line 227, in main
model = MMDistributedDataParallel(
File "/home/hblab/miniconda3/envs/uniad/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 496, in init
dist._verify_model_across_ranks(self.process_group, parameters)
RuntimeError: replicas[0][0] in this process with sizes [200, 128] appears not to match sizes of the same param in process 0.
[E ProcessGroupNCCL.cpp:325] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1801258 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:566] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1801745 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:566] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1801743 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:325] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1801745 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:325] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1801743 milliseconds before timing out.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 1 (pid: 232910) of binary: /home/hammar/miniconda3/envs/uniad/bin/python
Traceback (most recent call last):
File "/home/hblab/miniconda3/envs/uniad/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/hblab/miniconda3/envs/uniad/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/hblab/miniconda3/envs/uniad/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in
main()
File "/home/hblab/miniconda3/envs/uniad/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/home/hblab/miniconda3/envs/uniad/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/home/hblab/miniconda3/envs/uniad/lib/python3.8/site-packages/torch/distributed/run.py", line 689, in run
elastic_launch(
File "/home/hblab/miniconda3/envs/uniad/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 116, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/hblab/miniconda3/envs/uniad/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 244, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
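From the RuntimeError it looks like DDP's cross-rank verification finds that the first registered parameter has shape [200, 128] on the failing ranks but something different on rank 0. As a rough diagnostic (my own sketch, not part of the UniAD tools; `dump_first_params` is just a name I made up), something like this could be called on every rank right before the `MMDistributedDataParallel(...)` line in tools/test.py to see which rank builds the model differently:

```python
# Per-rank sanity check (sketch only, not from the UniAD codebase).
# Prints the first few parameter names and shapes on every rank so the
# rank whose first parameter does not match [200, 128] can be identified.
import torch.distributed as dist


def dump_first_params(model, n=5):
    rank = dist.get_rank() if dist.is_initialized() else 0
    for i, (name, p) in enumerate(model.named_parameters()):
        if i >= n:
            break
        print(f"[rank {rank}] {name}: {tuple(p.shape)}", flush=True)
```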
I then tried to run the training example (I can only use 4 GPUs):
./tools/uniad_dist_train.sh ./projects/configs/stage1_track_map/base_track_map.py 4
but got the same error:
Traceback (most recent call last):
File "./tools/train.py", line 256, in
Traceback (most recent call last):
File "./tools/train.py", line 256, in
[E ProcessGroupNCCL.cpp:325] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1800494 milliseconds before timing out.
main()
File "./tools/train.py", line 245, in main
custom_train_model(
File "/home/hblab/UniAD/projects/mmdet3d_plugin/uniad/apis/train.py", line 21, in custom_train_model
main()
File "./tools/train.py", line 245, in main
custom_train_detector(
File "/home/hblab/UniAD/projects/mmdet3d_plugin/uniad/apis/mmdet_train.py", line 70, in custom_train_detector
custom_train_model(
File "/home/hblab/UniAD/projects/mmdet3d_plugin/uniad/apis/train.py", line 21, in custom_train_model
model = MMDistributedDataParallel(
File "/home/hblab/miniconda3/envs/uniad/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 496, in init
custom_train_detector(
File "/home/hblab/UniAD/projects/mmdet3d_plugin/uniad/apis/mmdet_train.py", line 70, in custom_train_detector
model = MMDistributedDataParallel(
File "/home/hblab/miniconda3/envs/uniad/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 496, in init
dist._verify_model_across_ranks(self.process_group, parameters)
RuntimeError: replicas[0][0] in this process with sizes [200, 128] appears not to match sizes of the same param in process 0.
dist._verify_model_across_ranks(self.process_group, parameters)
RuntimeError: replicas[0][0] in this process with sizes [200, 128] appears not to match sizes of the same param in process 0.
[E ProcessGroupNCCL.cpp:325] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1800524 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:325] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1800471 milliseconds before timing out.
Loading NuScenes tables for version v1.0-trainval...
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 1 (pid: 346625) of binary: /home/hblab/miniconda3/envs/uniad/bin/python
Traceback (most recent call last):
File "/home/hblab/miniconda3/envs/uniad/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/hblab/miniconda3/envs/uniad/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/hblab/miniconda3/envs/uniad/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in
main()
File "/home/hblab/miniconda3/envs/uniad/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/home/hblab/miniconda3/envs/uniad/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/home/hblab/miniconda3/envs/uniad/lib/python3.8/site-packages/torch/distributed/run.py", line 689, in run
elastic_launch(
File "/home/hblab/miniconda3/envs/uniad/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 116, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/hblab/miniconda3/envs/uniad/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 244, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
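One more thing that makes this painful to debug: the surviving ranks sit in the broadcast for the full default timeout (the `Timeout(ms)=1800000` in the logs) before the watchdog kills the job. Just as a sketch from my side (not UniAD code; as far as I can tell, mmcv's `init_dist` ends up calling `torch.distributed.init_process_group` under the hood), passing a shorter process-group timeout should make the remaining ranks give up much sooner while debugging:

```python
# Sketch only: pass a shorter timeout where the process group is created,
# so the other ranks fail fast instead of waiting 30 minutes for the
# NCCL watchdog after one rank hits the shape-mismatch RuntimeError.
from datetime import timedelta
import torch.distributed as dist

dist.init_process_group(backend="nccl", timeout=timedelta(minutes=5))
```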
Has anyone encountered this before?
Thanks!