Distributed Training Made Easy with PyTorch-Ignite: How to install and run

Installation

pip install -r requirements.txt

For the Horovod backend, to avoid the installation procedure, it is recommended to pull one of PyTorch-Ignite's Docker images with pre-installed Horovod. These images include Horovod with the Gloo controller and NCCL support.

docker run --gpus all -it -v $PWD:/workspace/project --network=host --shm-size 16G pytorchignite/hvd-vision:latest /bin/bash
cd project
# run horovod code snippets ...

For XLA/TPUs, one can run the scripts inside a Colab notebook.

Firstly, install the dependencies:

import os
assert os.environ['COLAB_TPU_ADDR'], 'Make sure to select TPU from Edit > Notebook settings > Hardware accelerator'
!pip install cloud-tpu-client==0.10 https://storage.googleapis.com/tpu-pytorch/wheels/torch_xla-1.8.1-cp37-cp37m-linux_x86_64.whl
!pip install -q --upgrade pytorch-ignite
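
As an optional sanity check (not part of the original snippets; a minimal sketch assuming the wheel above installed correctly), one can query the TPU device before downloading the scripts:

import torch_xla.core.xla_model as xm
print(xm.xla_device())  # should print an XLA device such as xla:1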

Secondly, download the scripts:

!rm -rf torch_xla_native.py && wget https://raw.githubusercontent.com/pytorch-ignite/idist-snippets/master/torch_xla_native.py
!rm -rf ignite_idist.py && wget https://raw.githubusercontent.com/pytorch-ignite/idist-snippets/master/ignite_idist.py

Running the code snippets:

The code snippets highlight the API specifics of each distributed backend on the same use case, compared with the idist API. Native PyTorch code is available for DDP, Horovod, and XLA/TPU devices.

One can run each of the code snippets independently.

With torch.multiprocessing.spawn

In this case, idist.Parallel uses the native torch.multiprocessing.spawn method under the hood to run the distributed configuration; nproc_per_node is passed as a spawn argument. A minimal sketch of this pattern is shown after the commands below.

  • Running multiple distributed configurations with the same code. Source: ignite_idist.py:
# Running with gloo
python -u ignite_idist.py --nproc_per_node 2 --backend gloo

# Running with nccl
python -u ignite_idist.py --nproc_per_node 2 --backend nccl

# Running with horovod with gloo controller (gloo or nccl support)
python -u ignite_idist.py --backend horovod --nproc_per_node 2

# Running on xla/tpu
python -u ignite_idist.py --backend xla-tpu --nproc_per_node 8 --batch_size 32
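
The sketch below shows the idist.Parallel pattern these commands rely on. The training function name and the empty config dict are illustrative, not the exact contents of ignite_idist.py:

import ignite.distributed as idist

def training(local_rank, config):
    # each spawned process lands here with its own local_rank
    print(f"rank {idist.get_rank()} / {idist.get_world_size()} on {idist.device()}")

if __name__ == "__main__":
    # nproc_per_node is forwarded to torch.multiprocessing.spawn under the hood
    with idist.Parallel(backend="gloo", nproc_per_node=2) as parallel:
        parallel.run(training, config={})

Switching the backend to nccl, horovod, or xla-tpu keeps the training function unchanged; only the Parallel arguments differ.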

With Distributed launchers

PyTorch-Ignite's idist.Parallel context manager is also compatible with multiple distributed launchers.

With torch.distributed.launch

Here we use the torch.distributed.launch script to spawn the processes:

python -m torch.distributed.launch --nproc_per_node 2 --use_env ignite_idist.py --backend gloo
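
With an external launcher the worker processes already exist, so nproc_per_node is not passed. A minimal sketch (same assumptions as the spawn sketch above): idist.Parallel is built with the backend alone and reads rank and world size from the launcher's environment variables:

import ignite.distributed as idist

def training(local_rank, config):
    print(f"rank {idist.get_rank()} / {idist.get_world_size()} on {idist.device()}")

if __name__ == "__main__":
    # nproc_per_node is omitted: torch.distributed.launch already spawned the workers
    with idist.Parallel(backend="gloo") as parallel:
        parallel.run(training, config={})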

With horovodrun

horovodrun -np 16 -H hostname1:8,hostname2:8 python ignite_idist.py --backend horovod

To run this example and avoid the installation procedure, you can pull one of PyTorch-Ignite's Docker images with pre-installed Horovod. These images include Horovod with the Gloo controller and NCCL support.

docker run --gpus all -it -v $PWD:/workspace/project --network=host --shm-size 16G pytorchignite/hvd-vision:latest /bin/bash
cd project
...

With slurm

The same result can be achieved with Slurm, without any modification to the code:

srun --nodes=2 \
     --ntasks-per-node=2 \
     --job-name=pytorch-ignite \
     --time=00:01:00 \
     --partition=gpgpu \
     --gres=gpu:2 \
     --mem=10G \
     python ignite_idist.py --backend nccl

or by running sbatch script.bash with the following script.bash file:

#!/bin/bash
#SBATCH --job-name=pytorch-ignite
#SBATCH --output=slurm_%j.out
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --time=00:01:00
#SBATCH --partition=gpgpu
#SBATCH --gres=gpu:2
#SBATCH --mem=10G

srun python ignite_idist.py --backend nccl
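
In both cases idist derives rank, local rank, and world size from the variables Slurm exports for each task (e.g. SLURM_PROCID). As an optional sanity check, a hypothetical print inside the training function (a sketch, not part of ignite_idist.py) can compare the two:

import os
import ignite.distributed as idist

def training(local_rank, config):
    # compare what Slurm exported with what idist derived from it
    print("SLURM_PROCID =", os.environ.get("SLURM_PROCID"),
          "-> idist rank", idist.get_rank(), "of", idist.get_world_size(),
          "on", idist.device())

if __name__ == "__main__":
    with idist.Parallel(backend="nccl") as parallel:
        parallel.run(training, config={})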

Running Torch native methods

To run the same training loop on different backends without idist, you would have to use a different native PyTorch snippet and a specific launch method for each backend; the sketches after each command block below illustrate the differences.

Torch native DDP

  • Run the torch native snippet with different backends:
# Running with gloo 
python -u torch_native.py --nproc_per_node 2 --backend gloo

# Running with nccl
python -u torch_native.py --nproc_per_node 2 --backend nccl
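
For comparison, a native DDP script has to set up the process group and spawn the workers itself. A rough sketch (names are illustrative, not the exact contents of torch_native.py):

import os
import torch.distributed as dist
import torch.multiprocessing as mp

def training(local_rank, world_size, backend):
    # every process has to join the process group explicitly
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group(backend, rank=local_rank, world_size=world_size)
    print(f"rank {dist.get_rank()} / {dist.get_world_size()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(training, args=(2, "gloo"), nprocs=2)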

Horovod

  • Run the Horovod native snippet with the Gloo controller and NCCL/Gloo support:
# Running with horovod with gloo controller (gloo or nccl support)
python -u torch_horovod.py --nproc_per_node 2
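
The Horovod snippet relies on hvd.init() instead of a process group. A rough sketch of the worker side (not the exact contents of torch_horovod.py, which also spawns its own processes based on --nproc_per_node):

import torch
import horovod.torch as hvd

def training():
    hvd.init()  # Horovod discovers rank and size from its controller
    if torch.cuda.is_available():
        torch.cuda.set_device(hvd.local_rank())
    print(f"rank {hvd.rank()} / {hvd.size()}, local rank {hvd.local_rank()}")

if __name__ == "__main__":
    training()

The same worker function can also be launched externally with horovodrun, once per requested process.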

XLA/TPU devices

  • Run the torch XLA native snippet on XLA/TPU devices with:
# Run with a default of 8 processes 
python -u torch_xla_native.py
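
A rough sketch of the xmp.spawn pattern such a snippet uses (names are illustrative, not the exact contents of torch_xla_native.py):

import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

def training(index):
    # each TPU process gets its own XLA device
    print(f"process {index}, ordinal {xm.get_ordinal()}, on {xm.xla_device()}")

if __name__ == "__main__":
    # 8 processes match the 8 cores of a Colab TPU
    xmp.spawn(training, nprocs=8)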
