An online Google Colab notebook is available here.
Me: https://github.com/tisu19021997
- In short, the Cyclical LR method lets the learning rate vary between two boundaries during training, which yields substantial performance improvements across different architectures. Cyclical LR divides the training phase into cycles, and each cycle consists of 2 steps.
- The 1-Cycle policy uses the cyclical LR method but with only 1 cycle for the whole training run. Moreover, this policy suggests that we "always use one cycle that is smaller than the total number of iterations/epochs and allow the learning rate to decrease several orders of magnitude less than the initial learning rate for the remaining iterations".
- There are 2 variations of the 1-Cycle policy that I found while doing my research:
- In the first variation, the learning rate varies in 3 stages:
  - from `base_lr` to `max_lr`
  - from `max_lr` to `base_lr`
  - from `base_lr` to `min_lr` (where `min_lr = base_lr / some_factor`)
- In the second variation (which I am using here), the learning rate varies in 2 stages:
  - from `base_lr` to `max_lr`
  - from `max_lr` to `min_lr`
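The second variation above can be sketched as a small schedule function: a linear warm-up from `base_lr` to `max_lr` in the first half of training, then a linear decay from `max_lr` down to `min_lr`. This is a minimal plain-Python illustration, not the notebook's actual implementation; the parameter names and default values are illustrative.

```python
def one_cycle_lr(step, total_steps, base_lr=1e-3, max_lr=1e-2, min_lr=1e-5):
    """Learning rate at `step` (0-indexed) under the 2-stage 1-Cycle policy.

    Stage 1 (first half):  linear ramp  base_lr -> max_lr
    Stage 2 (second half): linear decay max_lr  -> min_lr,
    ending several orders of magnitude below base_lr.
    """
    mid = total_steps // 2
    if step <= mid:
        # warm-up stage: climb towards the peak learning rate
        return base_lr + (max_lr - base_lr) * step / mid
    # decay stage: anneal from the peak down to min_lr
    return max_lr - (max_lr - min_lr) * (step - mid) / (total_steps - mid)

# Evaluate the schedule over a hypothetical 100-step training run
schedule = [one_cycle_lr(s, 100) for s in range(101)]
```

The schedule starts at `base_lr`, peaks at `max_lr` exactly halfway through, and finishes at `min_lr`.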
- Leslie N. Smith, "Cyclical Learning Rates for Training Neural Networks" (2015)
- Leslie N. Smith and Nicholay Topin, "Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates" (2017)
- Leslie N. Smith, "A disciplined approach to neural network hyper-parameters: Part 1 -- learning rate, batch size, momentum, and weight decay" (2018)
- Ilya Loshchilov and Frank Hutter, "SGDR: Stochastic Gradient Descent with Warm Restarts" (2017)