how to train dlrm with multi-gpu #354

DONGDILLON · 2023-08-11T09:03:49Z

Hi, I have recently been trying to train DLRM on 8 GPUs. The command I use is:

python3 -m torch.distributed.launch --nproc_per_node 4 python3 dlrm_s_pytorch.py --arch-sparse-feature-size=64 --arch-mlp-bot="13-512-256-64" --arch-mlp-top="512-512-256-1" --max-ind-range=10000000 --data-generation=dataset --data-set=terabyte --raw-data-file=./input/day --processed-data-file=./input/terabyte_processed.npz --loss-function=bce --round-targets=True --learning-rate=0.1 --mini-batch-size=2048 --print-freq=1024 --print-time --test-mini-batch-size=16384 --test-num-workers=16 --use-gpu--dist-backend='nccl

However, it cannot establish a connection across the GPUs. Please help.

mnaumovfb · 2023-11-27T00:59:59Z

How many GPUs do you have on the machine? Can you try the command from the README (Benchmarking, Section 5, "The code now supports synchronous distributed training ...") and share the error message?
# for single node 8 gpus and nccl as backend on randomly generated dataset:
python -m torch.distributed.launch --nproc_per_node=8 dlrm_s_pytorch.py --arch-embedding-size="80000-80000-80000-80000-80000-80000-80000-80000" --arch-sparse-feature-size=64 --arch-mlp-bot="128-128-128-128" --arch-mlp-top="512-512-512-256-1" --max-ind-range=40000000 --data-generation=random --loss-function=bce --round-targets=True --learning-rate=1.0 --mini-batch-size=2048 --print-freq=2 --print-time --test-freq=2 --test-mini-batch-size=2048 --memory-map --use-gpu --num-batches=100 --dist-backend=nccl
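
Comparing the two commands in this thread by eye is error-prone, so here is a minimal sketch of a sanity check that can be run on a launch command before starting a job. It is not part of DLRM or PyTorch: `check_launch_command` and its `expected_gpus` parameter are hypothetical names made up for illustration, and the three checks (process count versus GPU count, a second interpreter after the launcher, two flags fused without a space) only cover differences visible between the command in the question and the README command. Whether any of them is the actual cause of the connection failure is not established here.

```python
import shlex
from typing import List


def check_launch_command(command: str, expected_gpus: int) -> List[str]:
    """Return warnings for a `python -m torch.distributed.launch ...` command.

    Hypothetical helper for illustration only; it inspects the command text
    and does not import or call PyTorch.
    """
    try:
        tokens = shlex.split(command)
    except ValueError as exc:
        # shlex raises ValueError on an unterminated quote, e.g. 'nccl
        return [f"cannot parse command: {exc}"]

    warnings = []

    # 1. Compare --nproc_per_node with the number of GPUs the user expects to use.
    nproc = None
    for i, tok in enumerate(tokens):
        if tok.startswith("--nproc_per_node="):
            nproc = int(tok.split("=", 1)[1])
        elif tok == "--nproc_per_node" and i + 1 < len(tokens):
            nproc = int(tokens[i + 1])
    if nproc is None:
        warnings.append("--nproc_per_node is not set")
    elif nproc != expected_gpus:
        warnings.append(
            f"--nproc_per_node={nproc} but {expected_gpus} GPUs are expected"
        )

    # 2. The launcher is normally given the script path directly, so a second
    #    interpreter name after the launcher module is suspicious.
    if "torch.distributed.launch" in tokens:
        after = tokens[tokens.index("torch.distributed.launch") + 1:]
        if any(tok in ("python", "python3") for tok in after):
            warnings.append("interpreter name appears after the launcher module")

    # 3. A flag name that itself contains "--" usually means a missing space.
    for tok in tokens:
        if tok.startswith("--") and "--" in tok[2:].split("=", 1)[0]:
            warnings.append(f"possible missing space in {tok!r}")

    return warnings
```

Calling `check_launch_command(cmd, expected_gpus=8)` on a shortened form of the command from the question returns one warning per check, while the README command returns an empty list. Passing the question's command unchanged returns a single parse warning, because the quote before `nccl` is never closed.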
