Multinode GoogLeNet
This is part of the Multi-node guide. It is assumed that you have completed the cluster configuration and Caffe build tutorials.
This tutorial explains how to train GoogLeNet. It extends the CIFAR10 tutorial, so please complete that one first if you haven't done so yet.
You can use LMDB, compressed LMDB, or raw images as the ImageNet data source.
If you have chosen to use images, configure the image data layer in the train_val proto configuration (i.e. models/bvlc_googlenet/train_val_client.prototxt). You need to replace the default data layers in the GoogLeNet model:
name: "GoogleNet"
layer {
name: "data"
type: "ImageData"
top: "data"
top: "label"
include {
phase: TRAIN
}
transform_param {
mirror: true
crop_size: 224
mean_value: 104
mean_value: 117
mean_value: 123
}
image_data_param {
source: "data/ilsvrc12/train.txt"
batch_size: 512
shuffle: true
}
}
layer {
name: "data"
type: "ImageData"
top: "data"
top: "label"
include {
phase: TEST
}
transform_param {
mirror: false
crop_size: 224
mean_value: 104
mean_value: 117
mean_value: 123
}
image_data_param {
source: "data/ilsvrc12/val.txt"
batch_size: 50
new_width: 256
new_height: 256
}
}
The layer type is ImageData and its options are set in the image_data_param block. The file data/ilsvrc12/train.txt contains the list of all images used for training, each together with its class id. The shuffle parameter ensures that each node shuffles the full training set with a different seed. You can train GoogLeNet with a total batch size of 1024 and a learning rate of 0.06, so set the batch size of an individual node to B/K, where B is the expected total batch size (1024 here) and K is the number of nodes you want to train with. With the 2 nodes assumed here, that gives a per-node batch_size of 1024/2 = 512. An example of the list file format is shown below.
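For reference, each line of train.txt is an image path (relative to the image root folder) followed by a numeric class id. The file names below are only illustrative:

n01440764/n01440764_10026.JPEG 0
n01440764/n01440764_10027.JPEG 0
n01484850/n01484850_10048.JPEG 1

If you don't have these lists yet, the data/ilsvrc12/get_ilsvrc_aux.sh script shipped with Caffe downloads train.txt and val.txt along with the other auxiliary files.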
The solver definition in models/bvlc_googlenet/solver_client.prototxt should have the updated learning rate and maximum number of iterations:
net: "models/bvlc_googlenet/train_val_client.prototxt"
test_interval: 1000
test_iter: 1000
test_initialization: false
display: 40
average_loss: 40
base_lr: 0.06
lr_policy: "poly"
power: 0.5
max_iter: 91000
momentum: 0.9
weight_decay: 0.0002
solver_mode: CPU
snapshot: 10000
snapshot_prefix: "multinode_googlenet"
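With the poly policy, the learning rate decays smoothly to zero over the course of training; Caffe computes it as:

lr = base_lr * (1 - iter / max_iter) ^ power

so with base_lr 0.06, power 0.5 and max_iter 91000, the rate starts at 0.06 and reaches 0 at the final iteration. Note also that test_iter 1000 with a test batch_size of 50 covers the full 50,000-image ILSVRC12 validation set.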
All you have to do now is run the training:
mpirun --hostfile path/to/hostfile -n 2 ./build/tools/caffe train \
    --solver=models/bvlc_googlenet/solver_client.prototxt
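The hostfile is a standard MPI machine file listing the nodes to launch on, one per line; the host names below are placeholders for your own cluster:

node01
node02

With -n 2, mpirun starts two Caffe processes, one per listed host, and each process uses the per-node batch_size of 512 from the prototxt.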
To run with LMDB, change the data layers in train_val_client.prototxt as follows:
layer {
  name: "data"
  type: "Data"
  top: "data"
  top: "label"
  include {
    phase: TRAIN
  }
  transform_param {
    mirror: true
    crop_size: 224
    mean_value: 104
    mean_value: 117
    mean_value: 123
  }
  data_param {
    source: "/home/data/lmdb_compressed/ilsvrc12_train_lmdb"
    shuffle: true
    batch_size: 512
    backend: LMDB
  }
}
layer {
  name: "data"
  type: "Data"
  top: "data"
  top: "label"
  include {
    phase: TEST
  }
  transform_param {
    mirror: false
    crop_size: 224
    mean_value: 104
    mean_value: 117
    mean_value: 123
  }
  data_param {
    source: "/home/data/lmdb_compressed/ilsvrc12_val_lmdb"
    batch_size: 50
    backend: LMDB
  }
}
Uncompressed LMDB works the same way as compressed LMDB, although the compressed variant should be faster because less data is read from disk.
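If you still need to build the databases, Caffe's convert_imageset tool can create them from the same image lists; passing --encoded produces a compressed (JPEG-encoded) LMDB. The image root and output paths below are assumptions matching the source entries in the prototxt above:

./build/tools/convert_imageset --backend=lmdb --encoded --encode_type=jpg --shuffle \
    /path/to/imagenet/train/ data/ilsvrc12/train.txt \
    /home/data/lmdb_compressed/ilsvrc12_train_lmdb

Omit --encoded (and typically add --resize_height=256 --resize_width=256) to build an uncompressed LMDB instead.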