If you have completed the training section of our tutorial on pixel-level classification, you have trained an image segmentation model for one epoch using two GPUs and eight training image pairs in under 20 minutes. For our work with the Chesapeake Conservancy, we trained a similar model for 250 epochs using 148 GPUs and 740 training image pairs -- all in a little over two hours. This section describes how that scaling was achieved and offers considerations you can apply to your own projects.
Doubling the number of workers would ideally halve the time required to train the model. In practice, the actual speed-up is lower because of communication overhead between workers and our use of a synchronous training method. We found that we achieved near-linear speed-ups between 1 and 64 workers, with improvements in training time becoming more marginal beyond that. (We report the average epoch duration after the training data have been loaded into memory, a time-intensive step that does not scale with the number of workers.)
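As a quick way to reason about scaling, the speed-up for N workers is the single-worker epoch time divided by the N-worker epoch time, and the parallel efficiency is that speed-up divided by N. The snippet below is a minimal illustration of this bookkeeping; the epoch timings in it are hypothetical placeholders, not our measured results.

```python
# Illustrative bookkeeping only: the epoch timings below are hypothetical
# placeholders, not measurements from our Chesapeake Conservancy runs.
epoch_seconds = {1: 1000.0, 2: 520.0, 4: 270.0, 8: 145.0}

baseline = epoch_seconds[1]
for workers in sorted(epoch_seconds):
    speedup = baseline / epoch_seconds[workers]   # ideal value: `workers`
    efficiency = speedup / workers                # ideal value: 1.0 (100%)
    print(f"{workers:2d} workers: speed-up {speedup:4.2f}x, efficiency {efficiency:.0%}")
```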
We expect that the following modifications would further improve training time on large clusters, though we did not pursue them for this use case (a rough CNTK sketch of several of these options follows the list):
- Using workers connected by InfiniBand (e.g. the NC24r Azure VM SKU) to speed up communication between workers
- Using 1-bit Stochastic Gradient Descent to decrease the size of messages passed between workers
- Using blockwise model update and filtering (also known as block momentum) during training to decrease the frequency of communication between workers
- Tuning minibatch size to decrease frequency of communication between workers
- Pursuing an asynchronous training approach
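For reference, the sketch below shows in rough form how 1-bit SGD and block momentum are selected in CNTK's Python API by wrapping an ordinary learner in a distributed learner. The tiny placeholder network, learning-rate values, and block size are illustrative assumptions, not the settings we used, and both options may require a CNTK build with the 1-bit SGD/block momentum components enabled.

```python
import cntk as C
from cntk.train.distributed import (Communicator,
                                    data_parallel_distributed_learner,
                                    block_momentum_distributed_learner)

# Tiny placeholder network standing in for the tutorial's segmentation model.
features = C.input_variable((3, 64, 64))
labels = C.input_variable(2)
z = C.layers.Dense(2)(C.layers.GlobalAveragePooling()(features))
loss = C.cross_entropy_with_softmax(z, labels)

def make_local_learner():
    # Ordinary single-worker learner; the values here are illustrative only.
    return C.momentum_sgd(z.parameters,
                          lr=C.learning_rate_schedule(0.01, C.UnitType.minibatch),
                          momentum=C.momentum_schedule(0.9))

# Option 1: data-parallel SGD with gradients quantized to one bit per value,
# shrinking the messages exchanged between workers.
one_bit_learner = data_parallel_distributed_learner(make_local_learner(),
                                                    num_quantization_bits=1)

# Option 2: block momentum, which aggregates model updates only once per
# "block" of samples, reducing how often workers must communicate.
block_learner = block_momentum_distributed_learner(make_local_learner(),
                                                   block_size=3200)

# Hand whichever distributed learner you chose to the Trainer as usual.
trainer = C.Trainer(z, (loss, None), [one_bit_learner])

# ... run trainer.train_minibatch(...) over your data, then shut down MPI cleanly:
Communicator.finalize()
```

Distributed training is then launched under MPI (e.g. with mpiexec) so that each GPU runs one worker process.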
Increasing worker count is also beneficial when it permits the dataset to be stored entirely in memory. Accessing data from a remote store, or even from disk, can be rate-limiting for training, so it is ideal for each worker to perform an initial data load and then access data from memory in subsequent rounds of training. This becomes achievable using data-parallel training when the number of workers is sufficiently large.
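One minimal pattern for this is to memoize each tile the first time it is read so that later epochs are served from memory. The sketch below assumes, purely for illustration, that each training tile has already been converted to a NumPy .npy file; the real tutorial data are TIF images.

```python
import numpy as np

# Hypothetical per-worker cache: each tile is read from disk (or the NFS/blob
# mount) once, then every later epoch samples patches from the in-memory copy.
_tile_cache = {}

def load_tile(path):
    if path not in _tile_cache:
        _tile_cache[path] = np.load(path)   # one-time, I/O-bound read
    return _tile_cache[path]                # subsequent accesses hit memory
```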
To increase the number of worker nodes in your cluster during deployment, simply modify the `targetNodeCount` and `vmSize` values in the `cluster.json` file. We recommend that you use a VM SKU with a larger number of GPUs where possible, e.g. create a cluster with four NC24 VMs rather than a cluster with sixteen NC6 VMs. This option will reduce the average communication time between workers and will not impact the average memory/CPU/storage per worker.
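If you prefer to script that change, the sketch below loads `cluster.json`, sets the two values wherever they appear (the exact nesting of the template may differ between copies, so it searches recursively), and writes the file back. The node count and VM size shown are examples, not recommendations for your workload.

```python
import json

def set_key(node, key, value):
    """Set `key` to `value` wherever it appears in a nested JSON structure."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                node[k] = value
            else:
                set_key(v, key, value)
    elif isinstance(node, list):
        for item in node:
            set_key(item, key, value)

with open('cluster.json') as f:
    cluster = json.load(f)

# Example values: four NC24 nodes (four GPUs each) rather than sixteen NC6 nodes.
set_key(cluster, 'targetNodeCount', 4)
set_key(cluster, 'vmSize', 'STANDARD_NC24')

with open('cluster.json', 'w') as f:
    json.dump(cluster, f, indent=2)
```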
At the time that we performed full model training for the Chesapeake Conservancy (9/2017), Azure Batch AI did not yet offer data access from blob storage via blobfuse. Instead, we provisioned a Network File System (NFS) to host our data for concurrent access by many workers. This option is preferable to storing data on an Azure File Share, but we believe that accessing data from blob storage (as demonstrated in this tutorial) will now be preferable for most users.
If you would like to try using an NFS as your data store, you may modify the setup steps in setup.md to create a file server and mount it on a new cluster. You may wish to change the username and password for both file server and Batch AI cluster to credentials of your choosing.
```
az batchai file-server create -n batchaidemo -u yourusername -p yourpassword --vm-size Standard_D2_V2 --disk-count 1 --disk-size 1000 --storage-sku Standard_LRS

for /f "delims=" %a in ('az batchai file-server list -g %AZURE_RESOURCE_GROUP% --query "[?name == 'batchaidemo'].mountSettings.fileServerPublicIp | [0]"') do @set AZURE_BATCH_AI_TRAINING_NFS_IP=%a
echo %AZURE_BATCH_AI_TRAINING_NFS_IP%

az batchai cluster create -n batchaidemo -u lcuser -p lcpassword --afs-name batchai --nfs batchaidemo --image UbuntuDSVM --vm-size STANDARD_NC6 --max 2 --min 2 --storage-account-name %STORAGE_ACCOUNT_NAME% --container-name blobfuse --container-mount-path blobfuse -c cluster.json
```
You can then use your favorite SSH or SCP agent to upload your data files under the `/mnt/data` directory of the file server, so that they will be accessible from your cluster. The IP address of your NFS will have been printed to your command prompt by the `echo` command above.
You may also wish to use a premium storage SKU (learn more from the output of `az batchai file-server create -h`), choose another VM SKU, or increase the disk count/size to improve the performance of your file server.
As you adapt this example to your own needs, you may find the following resources helpful:
- Demonstrates how to use other deep learning frameworks (e.g. TensorFlow, Caffe, Keras, Chainer) with Batch AI, and includes examples of interacting with Batch AI through its SDKs.
- Guide to distributed training with CNTK
- Example single-GPU vs. distributed training scripts for CNTK -- see also other examples in this repository
- Guide to developing a UserMinibatchSource for custom data types (a minimal skeleton appears after this resource list)
- CNTK Python API
- National Agriculture Imagery Program (NAIP): more information on the aerial imagery used for this project.
- A source from which you can obtain NAIP data for your region/year of interest.
- LizardTech's free GeoExpress Command Line applications: can be used to convert MrSID-formatted NAIP data to TIF format.
- Divide large TIF files so that they can be loaded into memory during training (e.g. using `gdal_translate`).
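To accompany the UserMinibatchSource guide linked above, here is a minimal skeleton of that pattern, following the structure described in the CNTK documentation. The stream names, patch shapes, and the `_sample_patches` helper are hypothetical placeholders; your own source would pull patches from your imagery and label rasters, sharding work by `worker_rank` when training is distributed.

```python
import numpy as np
import cntk as C
from cntk.io import UserMinibatchSource, StreamInformation, MinibatchData

class PatchSource(UserMinibatchSource):
    """Skeleton minibatch source serving (image, label) training patches."""
    def __init__(self, patch_shape=(4, 256, 256), num_classes=2):
        # Stream metadata: name, stream id, storage format, dtype, shape.
        self.f_info = StreamInformation('features', 0, 'dense', np.float32, patch_shape)
        self.l_info = StreamInformation('labels', 1, 'dense', np.float32, (num_classes,))
        self.patch_shape, self.num_classes = patch_shape, num_classes
        super(PatchSource, self).__init__()

    def stream_infos(self):
        return [self.f_info, self.l_info]

    def next_minibatch(self, num_samples, number_of_workers, worker_rank, device=None):
        # A real distributed source should use number_of_workers/worker_rank to
        # decide which patches this worker serves.
        f_data, l_data = self._sample_patches(num_samples)
        f_value = C.Value(batch=f_data, device=device)
        l_value = C.Value(batch=l_data, device=device)
        return {self.f_info: MinibatchData(f_value, num_samples, num_samples, False),
                self.l_info: MinibatchData(l_value, num_samples, num_samples, False)}

    def _sample_patches(self, num_samples):
        # Placeholder implementation: random data so the skeleton runs end to end.
        f = np.random.rand(num_samples, *self.patch_shape).astype(np.float32)
        l = np.eye(self.num_classes, dtype=np.float32)[
            np.random.randint(self.num_classes, size=num_samples)]
        return f, l
```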
We are not able to redistribute the full 1-meter resolution land use classification dataset created by the Chesapeake Conservancy for the Chesapeake Bay watershed. A few alternative sources of land use data are provided below (though most are not available at 1-meter resolution).
- National Land Cover Database (30-meter resolution, country-wide)
- USGS Gap Analysis Project (30-meter resolution, country-wide)
- NOAA C-CAP FTP site (30-meter resolution with 1 to 5-meter resolution available in some areas; limited to coastal regions)
Click here to return to the main page of this repository for more information.