Skip to content
forked from Flasew/dlrm

An implementation of a deep learning recommendation model (DLRM)

License

Notifications You must be signed in to change notification settings

hipersys-team/dlrm

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

DLRM for Personalization and Recommandation systems

This submodule contains the DLRM model used to generate TopoOpt's testbed DLRM result. It was adopted from Meta's implementaiton of DLRM with model parallelism. Please check the original README file at here for a more detailed description.

Instructions

The scripts are designed to run distributed model parallel training of DLRM on MIT's 12 node cluster testbed. Please check this document for how to setup RDMA forwarding and using the hacked version of NCCL.

The scripts uses the pytorch version of this repository.

run_a100_fattree.sh will run DLRM training on the Mellanox ConnectX5 NIC, which are connected to a single switch.

run_a100_topoopt.sh will run DLRM training on the HPE nics, which are connected to the patch panel. Please be sure the forwarding is setup properly before running this test. Check the document above on how to setup RDMA fowarding.

To execute the program, adjust the parameters in the scripts and run the script on ALL of the workers. The script will automatically pick the master and log the training output on the master machine.

About

An implementation of a deep learning recommendation model (DLRM)

Resources

License

Code of conduct

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 94.6%
  • Shell 4.1%
  • Jupyter Notebook 1.2%
  • Dockerfile 0.1%