
Fails to run dlrm_s_pytorch.py on a single node with multiple GPUs with nccl #359

Open
YuxinxinChen opened this issue Oct 4, 2023 · 1 comment


@YuxinxinChen

Hi Team,

I am able to run `python dlrm_s_pytorch.py --mini-batch-size=2 --data-size=6 --use-gpu`. However, when I try to run dlrm_s_pytorch.py on a single node with multiple GPUs using nccl, it fails.
Here is the command I used:

python -m torch.distributed.launch --nproc_per_node=2 dlrm_s_pytorch.py --arch-embedding-size="80000-80000-80000-80000-80000-80000-80000-80000" --arch-sparse-feature-size=64 --arch-mlp-bot="128-128-128-128" --arch-mlp-top="512-512-512-256-1" --max-ind-range=40000000 --data-generation=random --loss-function=bce --round-targets=True --learning-rate=1.0 --mini-batch-size=2048 --print-freq=2 --print-time --test-freq=2 --test-mini-batch-size=2048 --memory-map --use-gpu --num-batches=100 --dist-backend=nccl

I got tons of errors:

pytorch2.0.0/torch/distributed/launch.py:181: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects `--local-rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  warnings.warn(
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
Unable to import onnx.  No module named 'onnx'
usage: dlrm_s_pytorch.py [-h] [--arch-sparse-feature-size ARCH_SPARSE_FEATURE_SIZE] [--arch-embedding-size ARCH_EMBEDDING_SIZE] [--arch-mlp-bot ARCH_MLP_BOT] [--arch-mlp-top ARCH_MLP_TOP] [--arch-interaction-op {dot,cat}]
                         [--arch-interaction-itself] [--weighted-pooling WEIGHTED_POOLING] [--md-flag] [--md-threshold MD_THRESHOLD] [--md-temperature MD_TEMPERATURE] [--md-round-dims] [--qr-flag] [--qr-threshold QR_THRESHOLD]
                         [--qr-operation QR_OPERATION] [--qr-collisions QR_COLLISIONS] [--activation-function ACTIVATION_FUNCTION] [--loss-function LOSS_FUNCTION] [--loss-weights LOSS_WEIGHTS] [--loss-threshold LOSS_THRESHOLD]
                         [--round-targets ROUND_TARGETS] [--data-size DATA_SIZE] [--num-batches NUM_BATCHES] [--data-generation DATA_GENERATION] [--rand-data-dist RAND_DATA_DIST] [--rand-data-min RAND_DATA_MIN]
                         [--rand-data-max RAND_DATA_MAX] [--rand-data-mu RAND_DATA_MU] [--rand-data-sigma RAND_DATA_SIGMA] [--data-trace-file DATA_TRACE_FILE] [--data-set DATA_SET] [--raw-data-file RAW_DATA_FILE]
                         [--processed-data-file PROCESSED_DATA_FILE] [--data-randomize DATA_RANDOMIZE] [--data-trace-enable-padding DATA_TRACE_ENABLE_PADDING] [--max-ind-range MAX_IND_RANGE]
                         [--data-sub-sample-rate DATA_SUB_SAMPLE_RATE] [--num-indices-per-lookup NUM_INDICES_PER_LOOKUP] [--num-indices-per-lookup-fixed NUM_INDICES_PER_LOOKUP_FIXED] [--num-workers NUM_WORKERS] [--memory-map]
                         [--mini-batch-size MINI_BATCH_SIZE] [--nepochs NEPOCHS] [--learning-rate LEARNING_RATE] [--print-precision PRINT_PRECISION] [--numpy-rand-seed NUMPY_RAND_SEED] [--sync-dense-params SYNC_DENSE_PARAMS]
                         [--optimizer OPTIMIZER] [--dataset-multiprocessing] [--inference-only] [--quantize-mlp-with-bit QUANTIZE_MLP_WITH_BIT] [--quantize-emb-with-bit QUANTIZE_EMB_WITH_BIT] [--save-onnx] [--use-gpu]
                         [--local_rank LOCAL_RANK] [--dist-backend DIST_BACKEND] [--print-freq PRINT_FREQ] [--test-freq TEST_FREQ] [--test-mini-batch-size TEST_MINI_BATCH_SIZE] [--test-num-workers TEST_NUM_WORKERS] [--print-time]
                         [--print-wall-time] [--debug-mode] [--enable-profiling] [--plot-compute-graph] [--tensor-board-filename TENSOR_BOARD_FILENAME] [--save-model SAVE_MODEL] [--load-model LOAD_MODEL] [--mlperf-logging]
                         [--mlperf-acc-threshold MLPERF_ACC_THRESHOLD] [--mlperf-auc-threshold MLPERF_AUC_THRESHOLD] [--mlperf-bin-loader] [--mlperf-bin-shuffle] [--mlperf-grad-accum-iter MLPERF_GRAD_ACCUM_ITER]
                         [--lr-num-warmup-steps LR_NUM_WARMUP_STEPS] [--lr-decay-start-step LR_DECAY_START_STEP] [--lr-num-decay-steps LR_NUM_DECAY_STEPS]
dlrm_s_pytorch.py: error: unrecognized arguments: --local-rank=1
Unable to import onnx.  No module named 'onnx'
[same usage message as above, printed by the second worker]
dlrm_s_pytorch.py: error: unrecognized arguments: --local-rank=0
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 375622) of binary: /home/xxx/.conda/envs/torch2.0/bin/python
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/xxxpkg/pytorch2.0.0/torch/distributed/launch.py", line 196, in <module>
    main()
  File "/home/xxx/pkg/pytorch2.0.0/torch/distributed/launch.py", line 192, in main
    launch(args)
  File "/home/xxx/pkg/pytorch2.0.0/torch/distributed/launch.py", line 177, in launch
    run(args)
  File "/home/xxx/pkg/pytorch2.0.0/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/xxx/pkg/pytorch2.0.0/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xxx/pkg/pytorch2.0.0/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
dlrm_s_pytorch.py FAILED

I used the command listed in the README.md. I am wondering whether that is no longer the correct command to run (and if so, what the right one is), or whether you could tell me more about what I did wrong.
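If torchrun is now the expected launcher, as the FutureWarning above suggests, would the equivalent invocation be something like the following? This is just my guess, untested:

torchrun --nproc_per_node=2 dlrm_s_pytorch.py --arch-embedding-size="80000-80000-80000-80000-80000-80000-80000-80000" --arch-sparse-feature-size=64 --arch-mlp-bot="128-128-128-128" --arch-mlp-top="512-512-512-256-1" --max-ind-range=40000000 --data-generation=random --loss-function=bce --round-targets=True --learning-rate=1.0 --mini-batch-size=2048 --print-freq=2 --print-time --test-freq=2 --test-mini-batch-size=2048 --memory-map --use-gpu --num-batches=100 --dist-backend=nccl

My understanding is that torchrun sets LOCAL_RANK in the environment rather than passing --local_rank, so the script may also need a change to read it from os.environ.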

Thanks in advance!
Best,
Yuxin

@mnaumovfb
Copy link
Contributor

Can you try the workaround suggested in the error message? In other words, rather than using args.local_rank here, try printing and passing along os.environ['LOCAL_RANK'].
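For context: in PyTorch 2.0 the launcher passes the hyphenated `--local-rank`, which the script's argparse setup does not recognize, hence the "unrecognized arguments: --local-rank=0" in your log. A minimal sketch of the workaround (variable names are illustrative, not the exact ones in dlrm_s_pytorch.py):

import argparse
import os

parser = argparse.ArgumentParser()
# Keep the old underscore flag so older launchers that pass --local_rank still work.
parser.add_argument("--local_rank", type=int, default=0)
args, _ = parser.parse_known_args()

# torchrun (and torch.distributed.launch with --use-env) exports LOCAL_RANK
# instead of passing --local_rank, so prefer the environment variable when present.
local_rank = int(os.environ.get("LOCAL_RANK", args.local_rank))
print("local_rank =", local_rank)

Launching with torchrun together with a change along these lines should avoid the unrecognized-argument error.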
