This repository provides an implementation of the 1cycle learning rate policy as originally described in the paper *Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates* [1]. In addition, it includes a reproduction of the published results on MNIST and new experiments on CIFAR10.
What's in the box?
- Implementation of the 1cycle learning rate policy (a minimal sketch of the idea follows this list).
- Port of the LeNet model that ships with Caffe to Keras.
- Implementation of another simple 3-layer net.
- Experiments which reproduce the published results on MNIST and new experiments on CIFAR10.
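The repo's own schedule callback isn't reproduced here, but for orientation, here is a minimal sketch of the 1cycle idea as a Keras callback: the learning rate ramps from a minimum to a maximum and back over one cycle while SGD momentum is cycled in the opposite direction. The class name `OneCycleLR`, the per-epoch granularity and the default values are illustrative, not the repo's API.

```python
import keras.backend as K
from keras.callbacks import Callback

class OneCycleLR(Callback):
    """Sketch of a 1cycle schedule: LR ramps min -> max -> min over one cycle
    while SGD momentum is cycled in the opposite direction. The paper's final
    annihilation phase (dropping the LR well below its starting value at the
    very end of training) is omitted for brevity."""

    def __init__(self, min_lr=0.01, max_lr=0.1, min_mom=0.8, max_mom=0.95, step_size=5):
        super(OneCycleLR, self).__init__()
        self.min_lr, self.max_lr = min_lr, max_lr
        self.min_mom, self.max_mom = min_mom, max_mom
        self.step_size = step_size  # half-cycle length in epochs

    def on_epoch_begin(self, epoch, logs=None):
        # frac goes 0 -> 1 over the first step and 1 -> 0 over the second.
        frac = 1.0 - abs(1.0 - (epoch % (2 * self.step_size)) / float(self.step_size))
        lr = self.min_lr + (self.max_lr - self.min_lr) * frac
        momentum = self.max_mom - (self.max_mom - self.min_mom) * frac
        K.set_value(self.model.optimizer.lr, lr)
        K.set_value(self.model.optimizer.momentum, momentum)
```

Attached to `model.fit(...)` alongside `SGD(lr=0.01, momentum=0.95)`, this traces the same shape as the `0.01-0.1/5` row of Table 1 below, though the repo's own callback may differ in granularity (per-iteration vs. per-epoch) and in how it handles the tail of training.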
The experiments in this repository were conducted on an Ubuntu 18.04 Paperspace instance with an NVIDIA Quadro P4000 GPU, NVIDIA driver 410.48 and CUDA 10.0.130-1.
- `git clone git@github.com:coxy1989/superconv.git`
- `cd superconv`
- `conda env create -f environment.yml`
- `source activate superconv`
- `jupyter notebook`
If you'd like to run the CIFAR10 experiments, you can download the tfrecord files used in training from my website by running the `get_data.sh` script in the `/data` folder.
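The feature layout of those tfrecord files isn't documented in this README. A minimal sketch of reading them with `tf.data` (TF 1.x style, matching the repo's era), assuming each record holds a raw-byte CIFAR10 image and an integer label, might look like the following; the filename and feature keys are hypothetical.

```python
import tensorflow as tf

# Hypothetical feature spec - the actual keys/shapes in the repo's tfrecords may differ.
FEATURES = {
    'image': tf.FixedLenFeature([], tf.string),
    'label': tf.FixedLenFeature([], tf.int64),
}

def parse_example(serialized):
    example = tf.parse_single_example(serialized, FEATURES)
    image = tf.decode_raw(example['image'], tf.uint8)
    image = tf.reshape(image, [32, 32, 3])        # CIFAR10 image shape
    image = tf.cast(image, tf.float32) / 255.0    # scale to [0, 1]
    return image, example['label']

dataset = (tf.data.TFRecordDataset(['data/cifar10_train.tfrecord'])  # hypothetical path
           .map(parse_example)
           .shuffle(10000)
           .batch(128))
```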
- Experiments - reproduce raw results.
- Results - calculate run averages and plot figures.
The results below confirm that super-convergence can be observed with a standard configuration and the simple LeNet network architecture.
LR/SS/PL | CM/SS | Epochs | Accuracy (%) |
---|---|---|---|
0.01/inv | 0.9 | 85 | 98.92 |
0.01/rop | 0.9 | 85 | 98.85 |
0.01-0.1/5 | 0.95-0.8/5 | 12 | 99.05 |
0.01-0.1/12 | 0.95-0.8/12 | 25 | 99.01 |
0.01-0.1/23 | 0.95-0.8/23 | 50 | 99.02 |
0.02-0.2/40 | 0.95-0.8/40 | 85 | 99.07 |
Table 1: Final accuracy on the MNIST dataset using the LeNet architecture with a weight decay of 0.0005 and a batch size of 512. Reported final accuracy is an average of 5 runs. LR = learning rate, SS = stepsize in epochs (two steps comprise a cycle), PL = learning rate policy, CM = cyclical momentum; 'inv' is the Caffe inv policy, 'rop' is the Keras reduce-on-plateau policy.
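For comparison, the two baseline schedules in the table can be expressed with standard Keras machinery: the built-in `ReduceLROnPlateau` callback for 'rop', and a `LearningRateScheduler` implementing Caffe's inv policy, lr = base_lr * (1 + gamma * iter)^(-power). The sketch below is illustrative: the `factor`/`patience` settings, the gamma/power values (Caffe's LeNet solver defaults) and the per-epoch approximation of the per-iteration inv schedule are assumptions, not necessarily the settings used for the runs above.

```python
from keras.callbacks import ReduceLROnPlateau, LearningRateScheduler

# 'rop': Keras' built-in reduce-on-plateau schedule (settings illustrative).
rop = ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=5)

# 'inv': Caffe's policy lr = base_lr * (1 + gamma * iter) ** (-power).
# Caffe applies it per iteration; here it is approximated per epoch.
def make_inv_policy(base_lr=0.01, gamma=1e-4, power=0.75, iters_per_epoch=118):
    def schedule(epoch):
        iteration = epoch * iters_per_epoch  # MNIST: ~60000 / 512 ≈ 118 batches per epoch
        return base_lr * (1.0 + gamma * iteration) ** (-power)
    return schedule

inv = LearningRateScheduler(make_inv_policy())
```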
Plot 1: Accuracy vs. epoch for the CLR(12), CLR(85), INV and ROP results in the preceding table.
Results on CIFAR10 were not included in the original paper. The results below demonstrate that super-convergence is not observed with a standard configuration and a simple 3-layer network. I suspect that tuning of the other hyperparameters is required, since rapid convergence on this dataset can be demonstrated with the related CLR policy and a similar network architecture. More experimentation is required here; feel free to send a pull request if you perform further experiments.
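The repo's 3-layer model isn't reproduced here, but as a rough indication of the scale of network involved, a 3-layer (2 conv + 1 dense) Keras convnet for CIFAR10 might look like the sketch below. Only the 0.003 weight decay comes from the caption of Table 2 below; the layer widths and kernel sizes are illustrative.

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
from keras.regularizers import l2

def simple_3_layer_net(weight_decay=0.003):
    """Illustrative 3-layer convnet for CIFAR10; not necessarily the
    architecture used in this repo."""
    reg = l2(weight_decay)
    return Sequential([
        Conv2D(32, (3, 3), activation='relu', padding='same',
               kernel_regularizer=reg, input_shape=(32, 32, 3)),
        MaxPooling2D((2, 2)),
        Conv2D(64, (3, 3), activation='relu', padding='same',
               kernel_regularizer=reg),
        MaxPooling2D((2, 2)),
        Flatten(),
        Dense(10, activation='softmax', kernel_regularizer=reg),
    ])
```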
LR/SS/PL | CM/SS | Epochs | Accuracy (%) |
---|---|---|---|
0.01/inv | 0.9 | 85 | 79.00 |
0.01/rop | 0.9 | 85 | 80.11 |
0.01-0.1/5 | 0.95-0.8/5 | 12 | 78.65 |
0.01-0.1/12 | 0.95-0.8/12 | 25 | 78.38 |
0.01-0.1/23 | 0.95-0.8/23 | 50 | 78.15 |
0.02-0.2/40 | 0.95-0.8/40 | 85 | 78.05 |
Table 2: Final accuracy on the CIFAR10 dataset with a simple 3-layer architecture, a weight decay of 0.003 and a batch size of 128. Reported final accuracy is an average of 5 runs. LR = learning rate, SS = stepsize in epochs (two steps comprise a cycle), PL = learning rate policy, CM = cyclical momentum; 'inv' is the Caffe inv policy, 'rop' is the Keras reduce-on-plateau policy.
Plot 2: Accuracy vs. epoch for the CLR(12), CLR(85), INV and ROP results in the preceding table.
[1] Leslie N. Smith, Nicholay Topin. Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates. arXiv:1708.07120, 2017.