Issue with Activating UVM Function in torchrec_dlrm #368

Open
JhengLu opened this issue Dec 3, 2023 · 0 comments

JhengLu commented Dec 3, 2023

Hi,

I ran into an issue while running torchrec_dlrm/dlrm_main.py with the parameters below: UVM does not appear to be activated. I have included the command and the different behaviors I observed when changing the reservation rate.

Command:

CUDA_VISIBLE_DEVICES=0 numactl --cpunodebind=0 bash -c \
    'export PREPROCESSED_DATASET=./criteo-research-kaggle-output && \
    export GLOBAL_BATCH_SIZE=262144 && \
    export WORLD_SIZE=2 && \
    torchx run -s local_cwd dist.ddp -j 1x1 --script dlrm_main.py -- \
        --in_memory_binary_criteo_path $PREPROCESSED_DATASET \
        --pin_memory \
        --batch_size $((GLOBAL_BATCH_SIZE / WORLD_SIZE)) \
        --learning_rate 1.0 \
        --dataset_name criteo_kaggle \
        --embedding_dim 1024 \
        --dense_arch_layer_sizes 10240,10240,1024 \
        --over_arch_layer_sizes 4096,4096,4096,1 \
        --print_sharding_plan \
        --num_embeddings_per_feature 163840,163840,163840,163840,163840,163840,163840,163840,163840,163840,163840,163840,163840,163840,163840,163840,163840,163840,163840,163840,163840,163840,163840,163840,163840,163840 2>&1 | tee log/0G_500G_16384feature_numactl.log'
  • Issue Details:

    • UVM does not appear to be activated.
    • With the reservation rate set to 0.49, an error is raised.
      [Screenshot: MicrosoftTeams-image]
    • With the reservation rate set to 0.45, a different error is raised.
      [Screenshot: MicrosoftTeams-image (1)]
  • Request for Guidance:

    • Can you provide guidance on how to properly activate the UVM function?
    • Specifically, how to address the errors mentioned above?
    • My understanding is that if GPU memory is insufficient, the UVM mechanism should move part of the embeddings to CPU memory, preventing such errors.
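
For reference, here is a rough estimate of the embedding weight size the command above implies. It is only a minimal sketch: the fp32 weight size is an assumption (the issue does not state the dtype), and optimizer state, sharding overhead, and the dense and over-arch layers are ignored. The names `embedding_bytes`, `total_bytes`, and `total_gib` are illustrative and do not come from dlrm_main.py.

```python
# Rough size of the embedding weights implied by the command above.
# Assumptions (not stated in the issue): fp32 weights (4 bytes each),
# no optimizer state, no sharding or caching overhead.

NUM_TABLES = 26            # number of entries in --num_embeddings_per_feature
ROWS_PER_TABLE = 163_840   # value of each entry in --num_embeddings_per_feature
EMBEDDING_DIM = 1024       # --embedding_dim
BYTES_PER_WEIGHT = 4       # assumed fp32


def embedding_bytes(num_tables, rows, dim, bytes_per_weight):
    """Return the total bytes needed to store the embedding weights."""
    return num_tables * rows * dim * bytes_per_weight


total_bytes = embedding_bytes(
    NUM_TABLES, ROWS_PER_TABLE, EMBEDDING_DIM, BYTES_PER_WEIGHT
)
total_gib = total_bytes / 2**30
print(f"embedding weights: {total_gib:.2f} GiB")
```

Under these assumptions the weights alone come to 16.25 GiB, so whether they fit on the device depends on the GPU's memory and on how much of it the reservation rate sets aside.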

Thank you for your assistance.
