pip install -r requirements.txt
For the Horovod backend, to avoid the installation procedure it is recommended to pull one of PyTorch-Ignite's Docker images with pre-installed Horovod. These images include Horovod with the gloo controller and nccl support.
docker run --gpus all -it -v $PWD:/workspace/project --network=host --shm-size 16G pytorchignite/hvd-vision:latest /bin/bash
cd project
# run horovod code snippets ...
For XLA/TPUs, one can run the scripts inside a Colab notebook.
Firstly, install the dependencies:
import os
assert os.environ['COLAB_TPU_ADDR'], 'Make sure to select TPU from Edit > Notebook settings > Hardware accelerator'
!pip install cloud-tpu-client==0.10 https://storage.googleapis.com/tpu-pytorch/wheels/torch_xla-1.8.1-cp37-cp37m-linux_x86_64.whl
!pip install -q --upgrade pytorch-ignite
Secondly, download the scripts:
!rm -rf torch_xla_native.py && wget https://raw.githubusercontent.com/pytorch-ignite/idist-snippets/master/torch_xla_native.py
!rm -rf ignite_idist.py && wget https://raw.githubusercontent.com/pytorch-ignite/idist-snippets/master/ignite_idist.py
The code snippets highlight the API specifics of each distributed backend on the same use case, as compared to the idist API. Native torch code is available for DDP, Horovod, and XLA/TPU devices. Each code snippet can be run independently.
In this case, idist Parallel uses the native torch.multiprocessing.spawn method under the hood to run the distributed configuration. Here, nproc_per_node is passed as a spawn argument.
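As a reference, a minimal sketch of this pattern is shown below; the training function and config are illustrative placeholders, not the actual ignite_idist.py code:

import ignite.distributed as idist

def training(local_rank, config):
    # idist gives a backend-agnostic view of the distributed setup
    print(idist.get_rank(), "- backend:", idist.backend(), "- device:", idist.device())

if __name__ == "__main__":
    # With nproc_per_node set, Parallel spawns the worker processes itself
    # (via torch.multiprocessing.spawn for the gloo/nccl backends).
    with idist.Parallel(backend="gloo", nproc_per_node=2) as parallel:
        parallel.run(training, {})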
- Running multiple distributed configurations with the same code. Source: ignite_idist.py:
# Running with gloo
python -u ignite_idist.py --nproc_per_node 2 --backend gloo
# Running with nccl
python -u ignite_idist.py --nproc_per_node 2 --backend nccl
# Running with horovod with the gloo controller (gloo or nccl support)
python -u ignite_idist.py --backend horovod --nproc_per_node 2
# Running on xla/tpu
python -u ignite_idist.py --backend xla-tpu --nproc_per_node 8 --batch_size 32
PyTorch-Ignite's idist Parallel context manager is also compatible with multiple distributed launchers. Here we use the torch.distributed.launch script to spawn the processes:
python -m torch.distributed.launch --nproc_per_node 2 --use_env ignite_idist.py --backend gloo
or, similarly, with horovodrun across multiple hosts:
horovodrun -np 16 -H hostname1:8,hostname2:8 python ignite_idist.py --backend horovod
To run this example without going through the installation procedure, you can pull one of PyTorch-Ignite's Docker images with pre-installed Horovod. These images include Horovod with the gloo controller and nccl support.
docker run --gpus all -it -v $PWD:/workspace/project --network=host --shm-size 16G pytorchignite/hvd-vision:latest /bin/bash
cd project
...
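As a rough sketch of why the same script works under all of these launchers: when nproc_per_node is not given, idist Parallel does not spawn anything itself and instead picks up the process configuration prepared by the external launcher from the environment. The snippet below is only an illustration under that assumption, not the actual ignite_idist.py code:

import ignite.distributed as idist

def training(local_rank, config):
    print("rank", idist.get_rank(), "of", idist.get_world_size(), "on", idist.device())

if __name__ == "__main__":
    # The backend is still declared, but the rank, world size and master address
    # come from the environment variables set by the external launcher.
    with idist.Parallel(backend="gloo") as parallel:
        parallel.run(training, {})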
The same result can be achieved by using Slurm without any modification to the code:
srun --nodes=2 \
     --ntasks-per-node=2 \
     --job-name=pytorch-ignite \
     --time=00:01:00 \
     --partition=gpgpu \
     --gres=gpu:2 \
     --mem=10G \
     python ignite_idist.py --backend nccl
or by using sbatch script.bash with the script file script.bash:
#!/bin/bash
#SBATCH --job-name=pytorch-ignite
#SBATCH --output=slurm_%j.out
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --time=00:01:00
#SBATCH --partition=gpgpu
#SBATCH --gres=gpu:2
#SBATCH --mem=10G
srun python ignite_idist.py --backend nccl
To run the same training loop on different backends without idist, you would have to use the different native torch snippets and associate a specific launch method with each of them.
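To give an idea of the boilerplate this implies, here is a hedged sketch of the kind of code a native torch DDP snippet has to carry; the function name, master address and port below are illustrative choices, not the actual torch_native.py code:

import os
import torch.distributed as dist
import torch.multiprocessing as mp

def run(local_rank, backend, world_size):
    # Each worker has to set up and tear down the process group itself
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group(backend, rank=local_rank, world_size=world_size)
    print("rank", dist.get_rank(), "of", dist.get_world_size())
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2
    mp.spawn(run, args=("gloo", world_size), nprocs=world_size)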
- Run the torch native snippet with different backends:
# Running with gloo
python -u torch_native.py --nproc_per_node 2 --backend gloo
# Running with nccl
python -u torch_native.py --nproc_per_node 2 --backend nccl
- Run the horovod native snippet with the gloo controller and nccl/gloo support:
# Running with horovod with the gloo controller (gloo or nccl support)
python -u torch_horovod.py --nproc_per_node 2
- Run the torch xla native snippet on XLA/TPU devices (see the sketch after this list):
# Run with a default of 8 processes
python -u torch_xla_native.py
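As referenced above, here is a hedged sketch of the XLA-native pattern, where torch_xla's own multiprocessing spawner replaces torch.multiprocessing.spawn; the function below is illustrative and not the actual torch_xla_native.py code:

import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

def run(index):
    # Each process gets its own XLA device (TPU core)
    device = xm.xla_device()
    print("process", index, "ordinal", xm.get_ordinal(), "device", device)

if __name__ == "__main__":
    # By default this spawns one process per available TPU core (8 on a v2/v3 board)
    xmp.spawn(run, args=(), nprocs=8)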