pip install -r requirements.txt
For the Horovod backend, to avoid the installation procedure it is recommended to pull one of PyTorch-Ignite's Docker images with pre-installed Horovod. These images include Horovod with the gloo controller and nccl support.
docker run --gpus all -it -v $PWD:/workspace/project --network=host --shm-size 16G pytorchignite/hvd-vision:latest /bin/bash
cd project
# run horovod code snippets ...
For XLA/TPUs, one can run the scripts inside a Colab notebook.
Firstly, install the dependencies:
import os
assert os.environ['COLAB_TPU_ADDR'], 'Make sure to select TPU from Edit > Notebook settings > Hardware accelerator'
!pip install cloud-tpu-client==0.10 https://storage.googleapis.com/tpu-pytorch/wheels/torch_xla-1.8.1-cp37-cp37m-linux_x86_64.whl
!pip install -q --upgrade pytorch-ignite
Secondly, download the scripts:
!rm -rf torch_xla_native.py && wget https://raw.githubusercontent.com/pytorch-ignite/idist-snippets/master/torch_xla_native.py
!rm -rf ignite_idist.py && wget https://raw.githubusercontent.com/pytorch-ignite/idist-snippets/master/ignite_idist.py
The code snippets highlight the API specifics of each distributed backend on the same use case, as compared to the idist API. Native torch code is available for DDP, Horovod, and XLA/TPU devices. Each code snippet can be run independently.
In this case, idist Parallel uses the native torch.multiprocessing.spawn method under the hood to run the distributed configuration. Here, nproc_per_node is passed as a spawn argument.
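As a reference, a minimal sketch of this pattern is shown below; the training function and config are illustrative placeholders, not the actual ignite_idist.py code:

import ignite.distributed as idist

def training(local_rank, config):
    # idist gives a backend-agnostic view of the distributed setup
    print(idist.get_rank(), "- backend:", idist.backend(), "- device:", idist.device())

if __name__ == "__main__":
    # With nproc_per_node set, Parallel spawns the worker processes itself
    # (via torch.multiprocessing.spawn for the gloo/nccl backends).
    with idist.Parallel(backend="gloo", nproc_per_node=2) as parallel:
        parallel.run(training, {})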
- Running multiple distributed configurations with the same code. Source: ignite_idist.py:
# Running with gloo
python -u ignite_idist.py --nproc_per_node 2 --backend gloo
# Running with nccl
python -u ignite_idist.py --nproc_per_node 2 --backend nccl
# Running with horovod with the gloo controller (gloo or nccl support)
python -u ignite_idist.py --backend horovod --nproc_per_node 2
# Running on xla/tpu
python -u ignite_idist.py --backend xla-tpu --nproc_per_node 8 --batch_size 32
PyTorch-Ignite's idist Parallel context manager is also compatible with multiple distributed launchers. Here we use the torch.distributed.launch script to spawn the processes:
python -m torch.distributed.launch --nproc_per_node 2 --use_env ignite_idist.py --backend gloo
or, similarly, with horovodrun across multiple hosts:
horovodrun -np 16 -H hostname1:8,hostname2:8 python ignite_idist.py --backend horovod
To run this example without going through the installation procedure, you can pull one of PyTorch-Ignite's Docker images with pre-installed Horovod. These images include Horovod with the gloo controller and nccl support.
docker run --gpus all -it -v $PWD:/workspace/project --network=host --shm-size 16G pytorchignite/hvd-vision:latest /bin/bash
cd project
...
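As a rough sketch of why the same script works under all of these launchers: when nproc_per_node is not given, idist Parallel does not spawn anything itself and instead picks up the process configuration prepared by the external launcher from the environment. The snippet below is only an illustration under that assumption, not the actual ignite_idist.py code:

import ignite.distributed as idist

def training(local_rank, config):
    print("rank", idist.get_rank(), "of", idist.get_world_size(), "on", idist.device())

if __name__ == "__main__":
    # The backend is still declared, but the rank, world size and master address
    # come from the environment variables set by the external launcher.
    with idist.Parallel(backend="gloo") as parallel:
        parallel.run(training, {})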
The same result can be achieved by using Slurm without any modification to the code:
srun --nodes=2 \
     --ntasks-per-node=2 \
     --job-name=pytorch-ignite \
     --time=00:01:00 \
     --partition=gpgpu \
     --gres=gpu:2 \
     --mem=10G \
     python ignite_idist.py --backend nccl
or by using sbatch script.bash with the script file script.bash:
#!/bin/bash
#SBATCH --job-name=pytorch-ignite
#SBATCH --output=slurm_%j.out
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --time=00:01:00
#SBATCH --partition=gpgpu
#SBATCH --gres=gpu:2
#SBATCH --mem=10G
srun python ignite_idist.py --backend nccl
To run the same training loop on different backends without idist, you would have to use the different native torch snippets and associate a specific launch method with each of them.
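To give an idea of the boilerplate this implies, here is a hedged sketch of the kind of code a native torch DDP snippet has to carry; the function name, master address and port below are illustrative choices, not the actual torch_native.py code:

import os
import torch.distributed as dist
import torch.multiprocessing as mp

def run(local_rank, backend, world_size):
    # Each worker has to set up and tear down the process group itself
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group(backend, rank=local_rank, world_size=world_size)
    print("rank", dist.get_rank(), "of", dist.get_world_size())
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2
    mp.spawn(run, args=("gloo", world_size), nprocs=world_size)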
- Run the torch native snippet with different backends:
# Running with gloo
python -u torch_native.py --nproc_per_node 2 --backend gloo
# Running with nccl
python -u torch_native.py --nproc_per_node 2 --backend nccl
- Run the horovod native snippet with the gloo controller and nccl/gloo support:
# Running with horovod with the gloo controller (gloo or nccl support)
python -u torch_horovod.py --nproc_per_node 2
- Run the torch xla native snippet on XLA/TPU devices (see the sketch after this list):
# Run with a default of 8 processes
python -u torch_xla_native.py
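As referenced above, here is a hedged sketch of the XLA-native pattern, where torch_xla's own multiprocessing spawner replaces torch.multiprocessing.spawn; the function below is illustrative and not the actual torch_xla_native.py code:

import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

def run(index):
    # Each process gets its own XLA device (TPU core)
    device = xm.xla_device()
    print("process", index, "ordinal", xm.get_ordinal(), "device", device)

if __name__ == "__main__":
    # By default this spawns one process per available TPU core (8 on a v2/v3 board)
    xmp.spawn(run, args=(), nprocs=8)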