ResNet50 v1.5 BFloat16 training

This document has instructions for running ResNet50 v1.5 BFloat16 training using Intel-optimized TensorFlow.

Datasets

These ResNet50 v1.5 examples use the ImageNet dataset. Download and preprocess the ImageNet dataset using the instructions here. After running the conversion script, you should have a directory containing the ImageNet dataset in TFRecord format.

Set the DATASET_DIR environment variable to point to this directory when running ResNet50 v1.5.
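
Before starting a long run, it can help to confirm that DATASET_DIR actually contains TFRecord shards. The following is a minimal bash sketch, not part of the Model Zoo. The function names `count_shards` and `check_dataset_dir` are hypothetical, and the shard naming pattern (`train-XXXXX-of-NNNNN` and `validation-XXXXX-of-NNNNN`) is an assumption about the conversion script's output, so adjust it if your files are named differently.

```shell
#!/usr/bin/env bash

# Count files in a directory whose names look like "<prefix>-XXXXX-of-NNNNN".
# ASSUMPTION: this shard naming scheme is not confirmed by this document.
count_shards() {
  local dir="$1" prefix="$2"
  # find prints one line per matching file; wc -l counts the lines.
  find "$dir" -maxdepth 1 -type f -name "${prefix}-*-of-*" | wc -l | tr -d ' '
}

# Print the shard counts and succeed only if at least one training shard exists.
check_dataset_dir() {
  local dir="$1"
  if [ ! -d "$dir" ]; then
    echo "missing directory: $dir" >&2
    return 1
  fi
  local train validation
  train=$(count_shards "$dir" train)
  validation=$(count_shards "$dir" validation)
  echo "train=${train} validation=${validation}"
  [ "$train" -gt 0 ]
}

# Example (hypothetical usage):
#   check_dataset_dir "$DATASET_DIR" || echo "DATASET_DIR does not look right"
```

The check only inspects file names. It does not open the records, so it catches a wrong path but not a corrupted conversion.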

Quick Start Scripts

| Script name | Description |
|-------------|-------------|
| bfloat16_training_demo.sh | Executes a short run using small batch sizes and a limited number of steps to demonstrate the training flow. |
| bfloat16_training_1_epoch.sh | Executes a test run that trains the model for 1 epoch and saves checkpoint files to an output directory. |
| bfloat16_training_full.sh | Trains the model on the full dataset until convergence (90 epochs) and saves checkpoint files to an output directory. Note that this will take a considerable amount of time. |

Run the model

Set up your environment using the instructions below, depending on whether you are using AI Kit:


To run using AI Kit you will need:

  • numactl
  • openmpi-bin (only required for multi-instance)
  • openmpi-common (only required for multi-instance)
  • openssh-client (only required for multi-instance)
  • openssh-server (only required for multi-instance)
  • libopenmpi-dev (only required for multi-instance)
  • horovod==0.21.0 (only required for multi-instance)
  • Activate the tensorflow conda environment
    conda activate tensorflow

To run without AI Kit you will need:

  • Python 3
  • [intel-tensorflow>=2.5.0](https://pypi.org/project/intel-tensorflow/)
  • git
  • numactl
  • openmpi-bin (only required for multi-instance)
  • openmpi-common (only required for multi-instance)
  • openssh-client (only required for multi-instance)
  • openssh-server (only required for multi-instance)
  • libopenmpi-dev (only required for multi-instance)
  • horovod==0.21.0 (only required for multi-instance)
  • A clone of the Model Zoo repo
    git clone https://github.com/IntelAI/models.git
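
For either setup, a quick way to see which command-line prerequisites are still missing is to probe PATH. This is a minimal bash sketch under stated assumptions: `missing_tools` is a hypothetical helper, and the mapping from package to command (`mpirun` for openmpi-bin, `ssh` for openssh-client) reflects the usual packaging and may differ on your distribution.

```shell
#!/usr/bin/env bash

# Print the subset of the given command names that cannot be found on PATH,
# separated by single spaces. Prints an empty line when nothing is missing.
missing_tools() {
  local tool missing=""
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      # Append with a separating space only when the list is non-empty.
      missing="${missing}${missing:+ }${tool}"
    fi
  done
  echo "$missing"
}

# Example (hypothetical usage):
#   missing_tools python3 git numactl    # single-instance prerequisites
#   missing_tools mpirun ssh             # ASSUMED commands for multi-instance
```

This only checks executables. Python packages such as intel-tensorflow and horovod still need to be verified from within the Python environment you plan to run in.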

After finishing the setup above, set the DATASET_DIR environment variable to the path of your ImageNet TF records and OUTPUT_DIR to the directory where log files and checkpoints will be written. Then navigate to your Model Zoo directory and run a quickstart script.

# cd to your model zoo directory
cd models

export DATASET_DIR=<path to the ImageNet TF records>
export OUTPUT_DIR=<path to the directory where log files and checkpoints will be written>
# Optional: set `BATCH_SIZE` to use a custom batch size; if it is unset, the script runs with its default value.
export BATCH_SIZE=<customized batch size value>

./quickstart/image_recognition/tensorflow/resnet50v1_5/training/cpu/bfloat16/<script name>.sh
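
The steps above can be wrapped in a small pre-flight check so that a missing variable fails fast instead of partway into training. This is a minimal bash sketch, not part of the Model Zoo: `prepare_run` is a hypothetical helper, and creating OUTPUT_DIR up front is a precaution of this sketch rather than a documented requirement of the quickstart scripts.

```shell
#!/usr/bin/env bash

# Validate DATASET_DIR and OUTPUT_DIR, then print the quickstart script path
# for the given script name (without the .sh suffix).
prepare_run() {
  local script_name="$1"
  if [ -z "${DATASET_DIR:-}" ] || [ ! -d "$DATASET_DIR" ]; then
    echo "DATASET_DIR must point to an existing directory" >&2
    return 1
  fi
  if [ -z "${OUTPUT_DIR:-}" ]; then
    echo "OUTPUT_DIR must be set" >&2
    return 1
  fi
  # ASSUMPTION: creating the output directory in advance is harmless.
  mkdir -p "$OUTPUT_DIR" || return 1
  echo "./quickstart/image_recognition/tensorflow/resnet50v1_5/training/cpu/bfloat16/${script_name}.sh"
}

# Example (hypothetical usage, run from the Model Zoo directory):
#   script=$(prepare_run bfloat16_training_demo) && "$script"
```

Starting with bfloat16_training_demo is a cheap way to confirm the whole flow works before committing to the 1-epoch or full run.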

Additional Resources