In this example, we provide a script and tools for running reproducible experiments on training neural networks on the PASCAL VOC2012 dataset.
Features:
- Distributed training with native automatic mixed precision
- Experiments tracking with ClearML
Experiment | Model | Dataset | Val Avg IoU | ClearML Link |
---|---|---|---|---|
configs/baseline_dplv3_resnet101.py | DeepLabV3 Resnet101 | VOC Only | 0.659161 | link |
configs/baseline_dplv3_resnet101_sbd.py | DeepLabV3 Resnet101 | VOC+SBD | 0.6853087 | link |
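For readers unfamiliar with the metric in the "Val Avg IoU" column, the sketch below shows one common way to compute a class-averaged intersection-over-union from a confusion matrix. It is an illustration only: the function name mean_iou is hypothetical, and details such as skipping classes that never occur are assumptions, not a description of how the numbers in the table were produced.

```python
def mean_iou(confusion):
    """Class-averaged IoU from a square confusion matrix.

    confusion[i][j] is the number of pixels whose true class is i
    and whose predicted class is j.
    """
    num_classes = len(confusion)
    ious = []
    for c in range(num_classes):
        true_pos = confusion[c][c]
        false_neg = sum(confusion[c]) - true_pos
        false_pos = sum(confusion[r][c] for r in range(num_classes)) - true_pos
        union = true_pos + false_pos + false_neg
        # Assumption: classes absent from both labels and predictions are skipped.
        if union > 0:
            ious.append(true_pos / union)
    return sum(ious) / len(ious) if ious else 0.0
```

Each per-class IoU is the count of correctly labelled pixels divided by the count of pixels that are either labelled or predicted as that class.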
pip install -r requirements.txt
Docker users can run the example with one of the following images:
docker pull pytorchignite/vision:latest
or
docker pull pytorchignite/hvd-vision:latest
Then install the remaining requirements as described above.
We do not add horovod as a requirement in requirements.txt. Please install it manually following the official guides, or use the pytorchignite/hvd-vision:latest Docker image.
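Because horovod is not pulled in by requirements.txt, it can help to check that it is importable before launching with the horovod backend. The sketch below uses only the Python standard library; the helper name is_installed is hypothetical and is not part of this example's scripts.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found in the current environment."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    if not is_installed("horovod"):
        print("horovod is not installed: install it manually or use the hvd-vision image")
```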
Download and extract the datasets:
python main.py download /path/to/datasets
This script downloads and extracts the following datasets into /path/to/datasets:
- The Pascal VOC2012 dataset
- Optionally, the SBD evaluation dataset
Export the DATASET_PATH environment variable to point to the Pascal VOC2012 dataset:
export DATASET_PATH=/path/to/pascal_voc2012
# e.g. export DATASET_PATH=/data/ where VOCdevkit is located
Optionally, if using the SBD dataset, export the SBD_DATASET_PATH environment variable:
export SBD_DATASET_PATH=/path/to/SBD/
# e.g. export SBD_DATASET_PATH=/data/SBD/ where "cls img inst train.txt train_noval.txt val.txt" are located
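A quick sanity check of the two environment variables can catch a wrong path before a long run starts. The sketch below assumes the layout described in the comments above (a VOCdevkit directory under DATASET_PATH, and the listed entries under SBD_DATASET_PATH); check_dataset_paths is a hypothetical helper, not part of main.py.

```python
import os
from pathlib import Path

# Entries expected under SBD_DATASET_PATH, as listed in the comment above.
SBD_EXPECTED = ("cls", "img", "inst", "train.txt", "train_noval.txt", "val.txt")


def check_dataset_paths(env=None):
    """Return a list of problems found; an empty list means the layout looks right."""
    env = os.environ if env is None else env
    problems = []

    voc_root = env.get("DATASET_PATH")
    if not voc_root:
        problems.append("DATASET_PATH is not set")
    elif not (Path(voc_root) / "VOCdevkit").is_dir():
        problems.append(f"no VOCdevkit directory under {voc_root}")

    # SBD is optional, so it is only checked when the variable is set.
    sbd_root = env.get("SBD_DATASET_PATH")
    if sbd_root:
        missing = [name for name in SBD_EXPECTED if not (Path(sbd_root) / name).exists()]
        if missing:
            problems.append(f"missing under {sbd_root}: {', '.join(missing)}")

    return problems


if __name__ == "__main__":
    for problem in check_dataset_paths():
        print(problem)
```

Passing a plain dictionary instead of the real environment makes the check easy to try out without exporting anything.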
- Adjust the batch size for your GPU type in the configuration file: configs/baseline_dplv3_resnet101_sbd.py or configs/baseline_dplv3_resnet101.py
Run the following command:
CUDA_VISIBLE_DEVICES=0 python -u main.py training configs/baseline_dplv3_resnet101_sbd.py
# or without SBD
# CUDA_VISIBLE_DEVICES=0 python -u main.py training configs/baseline_dplv3_resnet101.py
- Adjust the total batch size for your GPUs in the configuration file: configs/baseline_dplv3_resnet101_sbd.py or configs/baseline_dplv3_resnet101.py
torchrun --nproc_per_node=2 main.py training configs/baseline_dplv3_resnet101_sbd.py
# or without SBD
# torchrun --nproc_per_node=2 main.py training configs/baseline_dplv3_resnet101.py
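With torchrun --nproc_per_node=2, two processes share the work, so a total batch size has to be split between them. The sketch below shows that arithmetic under the assumption that the configured value is the total across all processes and is divided evenly; per_process_batch_size is a hypothetical helper, not something taken from main.py or the config files.

```python
def per_process_batch_size(total_batch_size, world_size):
    """Split a total batch size evenly across `world_size` processes."""
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    if total_batch_size % world_size != 0:
        # Assumption for this sketch: uneven splits are rejected rather than rounded.
        raise ValueError(
            f"total batch size {total_batch_size} is not divisible by {world_size} processes"
        )
    return total_batch_size // world_size
```

For example, a total batch size of 8 on 2 processes gives each process 4 samples per step, so the value to tune is whatever fits in the memory of a single GPU multiplied by the number of processes.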
- Adjust the total batch size for your GPUs in the configuration file: configs/baseline_dplv3_resnet101_sbd.py or configs/baseline_dplv3_resnet101.py
horovodrun -np=2 python -u main.py training configs/baseline_dplv3_resnet101_sbd.py --backend="horovod"
# or without SBD
# horovodrun -np=2 python -u main.py training configs/baseline_dplv3_resnet101.py --backend="horovod"
CUDA_VISIBLE_DEVICES=0 python -u main.py eval configs/eval_baseline_dplv3_resnet101_sbd.py
torchrun --nproc_per_node=2 main.py eval configs/eval_baseline_dplv3_resnet101_sbd.py
horovodrun -np=2 python -u main.py eval configs/eval_baseline_dplv3_resnet101_sbd.py --backend="horovod"
Training runs were done using credits provided by AWS for open-source development via NumFOCUS and using the trainml.ai platform.