A GBDX task that takes GeoTiff image chips (output from the chip-from-vrt task) and produces a trained Convolutional Neural Network (CNN) classifier. The network architecture is VGG Net, which was developed as part of the 2014 ImageNet challenge.
Here we execute an example in which a classifier is trained to find buildings in Nigeria from image chips stored on S3. Note that there is also a geojson 'ref.geojson' in this location with geometries, feature ids, and class names ('No Buildings' or 'Buildings') for each chip, which will be used to train the network. The chips and reference geojson are all outputs of the chip-from-vrt task.
-
In a Python terminal create a GBDX interface and specify the task input location:
from gbdxtools import Interface from os.path import join import uuid gbdx = Interface() input_location = 's3://gbd-customer-data/32cbab7a-4307-40c8-bb31-e2de32f940c2/platform-stories/train-cnn-chip-classifier/'
-
Create a task instance and set the required inputs:
train_task = gbdx.Task('train-cnn-chip-classifier') train_task.inputs.chips = join(input_location, 'chips') train_task.inputs.classes = 'No Buildings, Buildings' train_task.inputs.max_side_dim = '270' train_task.inputs.resize_dim = '150'
-
Set any optional hyper-parameters if necessary. With the following parameters training should take about four hours to complete:
train_task.inputs.two_rounds = 'False' train_task.inputs.nb_epoch = '50' train_task.inputs.train_size = '2500' train_task.inputs.test_size = '2000'
-
Initialize a workflow and specify where to save the output:
train_wf = gbdx.Workflow([train_task]) random_str = str(uuid.uuid4()) output_location = join('platform-stories/trial-runs', random_str) train_wf.savedata(train_task.outputs.trained_model, join(output_location, 'trained_model'))
-
Execute the workflow:
train_wf.execute()
-
Track the status of the workflow as follows:
train_wf.status
The task input ports. Note that booleans, integers and floats must be passed to the task as strings, e.g., 'True', '10', '0.001'.
Name | Type | Description: | Required |
---|---|---|---|
chips | Directory | Contains GeoTiff image chips (feature_id.tif) produced from chips-s3-imagery. Should also contain a geojson 'ref.geojson' with class names for each chip (referenced by feature id). | True |
classes | String | Classes to train network on, each separated by a comma (e.g- 'No Buildings, Buildings'). Must be exactly as they appear in the class_name property of the geojson used to create the chips in the chip-from-vrt task. | True |
min_side_dim | String | The minimum acceptable side dimension (in pixels) for training chips. Defaults to 0. | False |
max_side_dim | String | The maximum acceptable side dimension (in pixels) for training chips. Defaults to 150. | False |
resize_dim | String | Dimension to resize the chip side dimensions to. This should be smaller than the original dimension. Defaults to None | False |
two_rounds | String | If True, train the network in two rounds- first on balanced classes then on the original distribution of classes. In the second training round only the weights of the final layer of the model will be updated. Recommended if there is class imbalance in the dataset. Defaults to False | False |
train_size | String | Number of chips to train the network on for the first round of training. Defaults to 10000. | False |
train_size_2 | String | Number of chips to train on in the second round of training. Only relevant if two_rounds is True. Defaults to half of train_size | False |
batch_size | String | Number of chips to train on per batch. Defaults to 32 | False |
nb_epoch | String | Number of training epochs to perform for the first round of training. Defaults to 35 | False |
nb_epoch_2 | String | Number of training epochs to perform for the second round of training. Only relevant if two_rounds is True. Defatults to 8. | False |
use_lowest_val_loss | String | After the first round of training use the model weights that yielded the lowest validation loss as the final model (recommended). Otherwise the model weights after the final epoch will be used. Defaults to True | False |
test | String | If True testing will be completed on a subset of chips. A test report with accuracy metrics will be saved as a text file in the 'model' output directory. Defaults to True. | False |
test_size | String | Number of chips to test on. Only relevant if test is True. Defaults to 1000. | False |
learning_rate | String | Learning rate for the first round of training. Defaults to 0.001 | False |
learning_rate_2 | String | Learning rate for the second round of training (if applicable). Defaults to 0.01. | False |
max_pixel_intensity | String | Maximum pixel intensity in input chips. Defaults to 255. | False |
kernel_size | String | Side dimension (in pixels) of the kernels at each convolutional layer in the network. Defaults to 3. | False |
small model | String | Use a model with 8 layers instead of 16. Useful for large input images (>250 pixels). Defaults to False. | False |
train-cnn-chip-classifier has one output directory port, trained_model, which is described below:
Name | Type | Description: |
---|---|---|
trained_model | Directory | The fully trained model saved as an h5 file. Models after all epochs will be saved in a subdirectory called model_weights, with the format epoch{n}_{validation_loss}.h5, where n is the epoch that produced the weights. This directory will also contain a test report, which will only have accuracy metrics if train is True. |
This section contains additional information that provides further insight into training parameters and suggestions for training an effective model.
The two_rounds flag is a method for dealing with data that has natural class imbalance (unequal representation of image classes). Training a CNN on data with unbalanced classes often results in the network classifying all target data under the majority class. two_rounds avoids this by training in the following steps:
-
Train the network on balanced data (the task will take care of creating a balanced training dataset).
-
Retrain the network on the original class distribution to account for the probability of encountering a given image class. This round will only update the weights to the output layer of the network.
This two-round training process allows the network to learn to distinguish between classes based on distinct features (round one) and then learn the probability of encountering each class (round two). This is highly recommended for data that is not balanced.
Constraints on the minimum and maximum size (in pixels) of a chip. Side dimensions are based on the size of chips bounding box (white line below).
A sample input chip is displayed above. Note that CNNs require all train/test inputs to have identical dimensions. Thus, all chips are zero-padded to the following dimensions: (num_bands, max_side_dim, max_side_dim). If resize_dim is set, the chips will then be resized to (num_bands, resize_dim, resize_dim). This means that any chips that have a side dimension larger than max_side_dim will not be used during training.
Number of chips to train on. If not enough chips are provided the task will throw an error.
Notice that if training takes place in two_rounds, the maximum train size will be as follows: size of smallest class * number of classes, assuming that all chips are of valid size (as defined by min_side_dim and max_side_dim). Additionally if testing is performed the test data will be subtracted from the available training chips.
Number of chips to train on per batch. The model weights will be updated following each batch. Smaller batch sizes can help avoid local minima by increasing the amount of noise in the gradient.
Number of training epochs to complete. The validation loss tends to decrease with each successive epoch until a minimum loss is reached. At this point any additional training epochs may result in overfitting.
While the validation loss of the model tends to decrease with successive training epochs, it rarely does so monotonically. Furthermore, if too many training epochs are completed the model may begin to overfit, causing the validation loss to increase with successive epochs. The loss may therefore not be at a minimum at the end of training. Use this flag to ensure that the initial round returns a model with the lowest possible validation loss.
Note that all model weights will be returned in the model_weights folder of the output directory.
Testing should only be done on two-class classifications. The following explanations assume the classes are input to the task as follows: 'Negative class, Positive class'.
If the test flag is set to True, the task will put aside a set of chips to get accuracy metrics for the trained model. Set this to False if you would like to complete testing manually.
The following metrics are provided in test results:
- Precision: True Positives / (True Positives + False Positives)
- Recall: True Positives / (True Positives + False Negatives)
- F1 score: (2 * Precision * Recall) / (Precision + Recall)
Each convolutional layer of a CNN uses kernels to extract features from the input image to create an output feature map that is passed to the next layer. This parameter specifies the side dimension of these kernels to use to train the network. Note that increasing the kernel size will slow down training dramatically. Finding the ideal kernel size for a specific use case is often a matter of trial and error.
There may be memory errors when the input chips are too large (over 250px). This argument will downsample the side dimension of input chips to the specified number of pixels.
You need to install Docker.
Clone the repository:
git clone https://github.com/platformstories/train-cnn-chip-classifier
Then:
cd train-cnn-chip-classifier
docker build -t train-cnn-chip-classifier .
Create a container in interactive mode and mount the sample input under /mnt/work/input/
:
docker run -v full/path/to/sample-input:/mnt/work/input -it train-cnn-chip-classifier
Then, within the container:
python /train-cnn-chip-classifier.py
Watch the stdout to confirm that the model is being trained.
Login to Docker Hub:
docker login
Tag your image using your username and push it to DockerHub:
docker tag train-cnn-chip-classifier yourusername/train-cnn-chip-classifier
docker push yourusername/train-cnn-chip-classifier
The image name should be the same as the image name under containerDescriptors in train-cnn-chip-classifier.json.
Alternatively, you can link this repository to a Docker automated build. Every time you push a change to the repository, the Docker image gets automatically updated.
In a Python terminal:
from gbdxtools import Interface
gbdx = Interface()
gbdx.task_registry.register(json_filename='train-cnn-chip-classifier.json')
Note: If you change the task image, you need to reregister the task with a higher version number in order for the new image to take effect. Keep this in mind especially if you use Docker automated build.