Hi Team,

I am able to run `python dlrm_s_pytorch.py --mini-batch-size=2 --data-size=6 --use-gpu`. But when I try to run dlrm_s_pytorch.py on a single node with multiple GPUs using the nccl backend, it fails. Here is the command I used:
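It was the single-node multi-GPU launch described in the README, along these lines (--nproc_per_node=2 matches the two ranks in the output below; the dlrm_s_pytorch.py flags shown are just the ones from my single-GPU run, so the exact values may differ):

```
python -m torch.distributed.launch --nproc_per_node=2 dlrm_s_pytorch.py --mini-batch-size=2 --data-size=6 --use-gpu --dist-backend=nccl
```

I got tons of errors: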
pytorch2.0.0/torch/distributed/launch.py:181: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects `--local-rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
warnings.warn(
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Unable to import onnx. No module named 'onnx'
usage: dlrm_s_pytorch.py [-h] [--arch-sparse-feature-size ARCH_SPARSE_FEATURE_SIZE] [--arch-embedding-size ARCH_EMBEDDING_SIZE] [--arch-mlp-bot ARCH_MLP_BOT] [--arch-mlp-top ARCH_MLP_TOP] [--arch-interaction-op {dot,cat}]
[--arch-interaction-itself] [--weighted-pooling WEIGHTED_POOLING] [--md-flag] [--md-threshold MD_THRESHOLD] [--md-temperature MD_TEMPERATURE] [--md-round-dims] [--qr-flag] [--qr-threshold QR_THRESHOLD]
[--qr-operation QR_OPERATION] [--qr-collisions QR_COLLISIONS] [--activation-function ACTIVATION_FUNCTION] [--loss-function LOSS_FUNCTION] [--loss-weights LOSS_WEIGHTS] [--loss-threshold LOSS_THRESHOLD]
[--round-targets ROUND_TARGETS] [--data-size DATA_SIZE] [--num-batches NUM_BATCHES] [--data-generation DATA_GENERATION] [--rand-data-dist RAND_DATA_DIST] [--rand-data-min RAND_DATA_MIN]
[--rand-data-max RAND_DATA_MAX] [--rand-data-mu RAND_DATA_MU] [--rand-data-sigma RAND_DATA_SIGMA] [--data-trace-file DATA_TRACE_FILE] [--data-set DATA_SET] [--raw-data-file RAW_DATA_FILE]
[--processed-data-file PROCESSED_DATA_FILE] [--data-randomize DATA_RANDOMIZE] [--data-trace-enable-padding DATA_TRACE_ENABLE_PADDING] [--max-ind-range MAX_IND_RANGE]
[--data-sub-sample-rate DATA_SUB_SAMPLE_RATE] [--num-indices-per-lookup NUM_INDICES_PER_LOOKUP] [--num-indices-per-lookup-fixed NUM_INDICES_PER_LOOKUP_FIXED] [--num-workers NUM_WORKERS] [--memory-map]
[--mini-batch-size MINI_BATCH_SIZE] [--nepochs NEPOCHS] [--learning-rate LEARNING_RATE] [--print-precision PRINT_PRECISION] [--numpy-rand-seed NUMPY_RAND_SEED] [--sync-dense-params SYNC_DENSE_PARAMS]
[--optimizer OPTIMIZER] [--dataset-multiprocessing] [--inference-only] [--quantize-mlp-with-bit QUANTIZE_MLP_WITH_BIT] [--quantize-emb-with-bit QUANTIZE_EMB_WITH_BIT] [--save-onnx] [--use-gpu]
[--local_rank LOCAL_RANK] [--dist-backend DIST_BACKEND] [--print-freq PRINT_FREQ] [--test-freq TEST_FREQ] [--test-mini-batch-size TEST_MINI_BATCH_SIZE] [--test-num-workers TEST_NUM_WORKERS] [--print-time]
[--print-wall-time] [--debug-mode] [--enable-profiling] [--plot-compute-graph] [--tensor-board-filename TENSOR_BOARD_FILENAME] [--save-model SAVE_MODEL] [--load-model LOAD_MODEL] [--mlperf-logging]
[--mlperf-acc-threshold MLPERF_ACC_THRESHOLD] [--mlperf-auc-threshold MLPERF_AUC_THRESHOLD] [--mlperf-bin-loader] [--mlperf-bin-shuffle] [--mlperf-grad-accum-iter MLPERF_GRAD_ACCUM_ITER]
[--lr-num-warmup-steps LR_NUM_WARMUP_STEPS] [--lr-decay-start-step LR_DECAY_START_STEP] [--lr-num-decay-steps LR_NUM_DECAY_STEPS]
dlrm_s_pytorch.py: error: unrecognized arguments: --local-rank=1
Unable to import onnx. No module named 'onnx'
usage: dlrm_s_pytorch.py [-h] [--arch-sparse-feature-size ARCH_SPARSE_FEATURE_SIZE] ... (same usage message as above, printed by the second worker process)
dlrm_s_pytorch.py: error: unrecognized arguments: --local-rank=0
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 375622) of binary: /home/xxx/.conda/envs/torch2.0/bin/python
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/home/xxxpkg/pytorch2.0.0/torch/distributed/launch.py", line 196, in <module>
main()
File "/home/xxx/pkg/pytorch2.0.0/torch/distributed/launch.py", line 192, in main
launch(args)
File "/home/xxx/pkg/pytorch2.0.0/torch/distributed/launch.py", line 177, in launch
run(args)
File "/home/xxx/pkg/pytorch2.0.0/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/home/xxx/pkg/pytorch2.0.0/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/xxx/pkg/pytorch2.0.0/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
dlrm_s_pytorch.py FAILED
I used the command listed in the README.md. I am wondering if that is no longer the correct command to run (and if so, what the right command is), or could you tell me more about what I did wrong?
Thanks in advance!
Best,
Yuxin
Can you try the workaround suggested in the error message? In other words, rather than using args.local_rank here, try printing and passing along os.environ['LOCAL_RANK'].
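For context: PyTorch 2.0's torch.distributed.launch passes --local-rank (hyphenated) to the script, while dlrm_s_pytorch.py only defines --local_rank (underscore), which is why argparse reports it as unrecognized. A minimal sketch of the workaround (a simplified standalone argparse setup, not the actual DLRM patch):

```python
import argparse
import os

parser = argparse.ArgumentParser()
# Accept both spellings: PyTorch 2.0's launcher passes --local-rank
# (hyphen), while older versions passed --local_rank (underscore).
parser.add_argument("--local_rank", "--local-rank", type=int, default=0)
args = parser.parse_args()

# Prefer the environment variable, which torchrun and the legacy launcher
# both set; fall back to the parsed flag for older launch scripts.
local_rank = int(os.environ.get("LOCAL_RANK", args.local_rank))
print(f"local_rank = {local_rank}")
```

Under `torchrun --nproc_per_node=2 ...` each process should then print its own rank, since torchrun sets LOCAL_RANK in the environment and does not pass a --local-rank flag at all.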