
Optimizers


List of Optimizers

Evolution Strategies, as in: Salimans, T., Ho, J., Chen, X. & Sutskever, I. Evolution Strategies as a Scalable Alternative to Reinforcement Learning. arXiv:1703.03864 [cs, stat] (2017).

In pseudocode, the algorithm does the following:

For n iterations do:
  • Perturb the current individual by adding noise with zero mean and noise_std standard deviation

  • If mirrored sampling is enabled, also perturb the current individual by subtracting the same values that were added in the previous step

  • Evaluate the individuals and obtain their fitness

  • Update the current individual as

    theta_{t+1} <- theta_t + alpha * sum_i{F_i * e_i} / (n * sigma^2)

    where F_i is the fitness of the i-th perturbed individual, e_i is its perturbation and sigma is noise_std

  • If fitness shaping is enabled, F_i in the previous step is replaced with the utility u_i, calculated as:

    u_i = max(0, log(n/2 + 1) - log(i)) / sum_{k=1}^{n}{max(0, log(n/2 + 1) - log(k))} - 1/n

    where i is the rank of the individual when sorted in descending order of fitness F_i

Fitness shaping as in the paper: Wierstra, D. et al. Natural Evolution Strategies. Journal of Machine Learning Research 15, 949–980 (2014).
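
A minimal NumPy sketch of one such update is given below; the Gaussian perturbations and all names (es_step, fitness_fn, ...) are illustrative assumptions, not the actual optimizer API.

```python
import numpy as np

def es_step(theta, fitness_fn, pop_size=20, noise_std=0.1, alpha=0.01,
            mirrored_sampling=True, fitness_shaping=True, rng=None):
    """One evolution-strategies update of the parameter vector theta (illustrative sketch)."""
    rng = rng or np.random.default_rng()
    # Perturbations with zero mean and noise_std standard deviation
    eps = rng.normal(0.0, noise_std, size=(pop_size, theta.size))
    if mirrored_sampling:
        eps = np.concatenate([eps, -eps])                    # mirrored perturbations
    fitness = np.array([fitness_fn(theta + e) for e in eps])

    if fitness_shaping:
        n = len(fitness)
        ranks = np.empty(n, dtype=int)
        ranks[np.argsort(-fitness)] = np.arange(1, n + 1)    # rank 1 = highest fitness
        raw = np.maximum(0.0, np.log(n / 2 + 1) - np.log(ranks))
        fitness = raw / raw.sum() - 1.0 / n                  # utilities u_i

    # theta_{t+1} <- theta_t + alpha * sum_i(F_i * e_i) / (n * sigma^2)
    step = (fitness[:, None] * eps).sum(axis=0) / (len(eps) * noise_std ** 2)
    return theta + alpha * step
```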

In pseudocode, the algorithm does the following:

For n iterations do:
  1. Sample individuals from the current distribution
  2. Evaluate the individuals and obtain their fitness
  3. Pick the top rho * pop_size individuals as the elite set
  4. From the remaining non-elite individuals, select additional ones using a simulated-annealing-style criterion based on the difference between their fitness and the (1 - rho)-quantile fitness (gamma), and the current temperature
  5. Fit the distribution family to the new elite individuals by minimizing cross entropy. The fitting is smoothed to prevent premature convergence to local minima: when smoothing, the previous parameters are weighted by the smoothing parameter.
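
A minimal NumPy sketch of one such iteration with a Gaussian distribution family; the simulated-annealing-style selection of non-elite individuals (step 4) is omitted for brevity, and all names (ce_step, fitness_fn, smoothing) are illustrative, not the actual optimizer API.

```python
import numpy as np

def ce_step(mean, cov, fitness_fn, pop_size=50, rho=0.2, smoothing=0.7, rng=None):
    """One cross-entropy update of a Gaussian search distribution (illustrative sketch)."""
    rng = rng or np.random.default_rng()
    # 1. Sample individuals from the current distribution
    pop = rng.multivariate_normal(mean, cov, size=pop_size)
    # 2. Evaluate the individuals and obtain their fitness
    fitness = np.array([fitness_fn(ind) for ind in pop])
    # 3. Keep the top rho * pop_size individuals as the elite set
    n_elite = max(2, int(rho * pop_size))
    elite = pop[np.argsort(-fitness)[:n_elite]]
    # 5. Re-fit the Gaussian to the elite set, smoothed with the previous parameters
    new_mean = (1 - smoothing) * elite.mean(axis=0) + smoothing * mean
    new_cov = (1 - smoothing) * np.cov(elite, rowvar=False) + smoothing * cov
    return new_mean, new_cov
```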

In pseudocode, the algorithm does the following:

For n iterations do:
  1. Sample individuals from the current distribution
  2. Evaluate the individuals and obtain their fitness
  3. Check whether gamma or the best individual's fitness has increased
  4. If neither has increased, grow the population size by n_expand and sample again (step 1), unless max_pop_size has already been reached, in which case stop; if one of them has increased, set pop_size = min_pop_size and proceed
  5. Pick the n_elite individuals with the highest fitness
  6. From the remaining non-elite individuals, select additional ones using a simulated-annealing-style criterion based on the difference between their fitness and the (1 - rho)-quantile fitness (gamma), and the current temperature
  7. Fit the distribution family to the new elite individuals by minimizing cross entropy. The fitting is smoothed to prevent premature convergence to local minima: when smoothing, the previous parameters are weighted by the smoothing parameter.
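
A small sketch of the population-size adaptation in steps 3-4; the function name and the (new_pop_size, stop) return convention are illustrative assumptions, not the actual implementation.

```python
def adapt_pop_size(pop_size, best_fitness, gamma, prev_best, prev_gamma,
                   n_expand, min_pop_size, max_pop_size):
    """Decide the next population size (sketch of steps 3-4; all names are illustrative)."""
    if best_fitness > prev_best or gamma > prev_gamma:
        return min_pop_size, False        # progress was made: shrink back and proceed
    if pop_size + n_expand > max_pop_size:
        return pop_size, True             # cannot expand any further: signal stop
    return pop_size + n_expand, False     # no progress: expand and sample again
```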

In pseudocode, the algorithm does the following:

For n iterations do:
  1. Evaluate the fitness of individuals in the close vicinity of the current individual
  2. Calculate the gradient based on these fitnesses
  3. Create the new 'current individual' by taking a step in parameter space along the direction of steepest ascent

The gradient step can follow one of these update rules:
  • Classic Gradient Descent
  • Stochastic Gradient Descent
  • ADAM
  • RMSProp
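
A rough NumPy sketch of the exploration-based gradient estimate followed by an ADAM-style ascent step; the least-squares local fit and all names are illustrative assumptions, not the package API.

```python
import numpy as np

def estimate_gradient(theta, fitness_fn, n_samples=16, exploration_std=0.01, rng=None):
    """Estimate the fitness gradient from points in the close vicinity of theta (sketch)."""
    rng = rng or np.random.default_rng()
    eps = rng.normal(0.0, exploration_std, size=(n_samples, theta.size))
    f0 = fitness_fn(theta)
    df = np.array([fitness_fn(theta + e) - f0 for e in eps])
    # Least-squares fit of a local linear model  df ~ eps @ grad
    grad, *_ = np.linalg.lstsq(eps, df, rcond=None)
    return grad

def adam_step(theta, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One ADAM-style ascent step on the fitness (illustrative, not the package API)."""
    m, v, t = state
    t += 1
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta = theta + lr * m_hat / (np.sqrt(v_hat) + eps)   # '+' because fitness is maximised
    return theta, (m, v, t)
```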

Parallel Tempering is a search algorithm that runs multiple simulated annealing instances at the same time and gives any two of them a certain chance of swapping temperatures. Each of the annealing instances can have a different cooling schedule with its own decay parameter or starting/ending temperatures. This has a functional effect similar to a single simulated annealing run with multiple coolings and reheatings, but needs fewer parameters (such as when to reheat and how often). For details on simulated annealing, please read the documentation on it.

Note: For simplicity's sake, it is not the positions that are swapped but the temperature and the schedule, which amounts to exactly the same thing. The temperatures and the schedules are each stored in lists, both indexed by 'compare_indices'. If the swap criterion between two schedules is met, the respective entries of 'compare_indices' are swapped. To obtain the parallel runs, 'n_parallel_runs' is used - each individual is one of the parallel runs.

The algorithm does:

For n iterations and each cooling schedule do:
  1. Take a step of size noisy_step in a random direction
  2. If it reduces the cost, keep the solution
  3. Otherwise keep it with probability exp(-(f_new - f) / T)
  4. Swap positions between two randomly chosen schedules with probability exp(-(f_1 - f_2) * (1 / (k * T_1) - 1 / (k * T_2))), with k being a constant

Additional details on the Simulated Annealing and Parallel Tempering page.
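
A small NumPy sketch of the temperature-swap step between two chains, using the acceptance probability from step 4 above; the function and argument names are illustrative assumptions, not the actual implementation.

```python
import numpy as np

def maybe_swap_temperatures(f, temps, i, j, k=1.0, rng=None):
    """Attempt a temperature swap between annealing chains i and j (illustrative sketch).

    f     -- current cost of each chain
    temps -- current temperature of each chain
    """
    rng = rng or np.random.default_rng()
    p_swap = np.exp(-(f[i] - f[j]) * (1.0 / (k * temps[i]) - 1.0 / (k * temps[j])))
    if rng.random() < min(1.0, p_swap):
        temps[i], temps[j] = temps[j], temps[i]   # swap temperatures (and schedules), not positions
    return temps
```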

Multiplicative Monotonic Cooling

This schedule type multiplies the starting temperature by a factor that decreases over time (with the number k of performed iteration steps). It requires a decay parameter (alpha) but no ending temperature, as the progression of the temperature is well defined by the decay parameter alone. The Multiplicative Monotonic Cooling schedules are: Exponential multiplicative cooling, Logarithmic multiplicative cooling, Linear multiplicative cooling and Quadratic multiplicative cooling. Source: Kirkpatrick, Gelatt and Vecchi (1983).

Exponential multiplicative cooling

Default cooling schedule for typical applications of simulated annealing. At each step the temperature is multiplied by the factor alpha (which has to be between 0 and 1); in other words, T_k is the starting temperature T_0 multiplied by alpha raised to the power of k: T_k = T_0 * alpha^k

Logarithmic multiplicative cooling

The factor by which the temperature decreases is inversely proportional to the log of k, so the cooling slows down as the schedule progresses. Alpha has to be larger than one. T_k = T_0 / (1 + alpha * log(1 + k))

Linear multiplicative cooling

Behaves similarly to Logarithmic multiplicative cooling in that the decrease slows over time, but less pronouncedly. The decrease is inversely proportional to alpha times k, and alpha has to be larger than zero: T_k = T_0 / (1 + alpha * k)

Quadratic multiplicative cooling

This schedule stays at high temperatures longer than the other schedules and cools more steeply later in the process. Alpha has to be larger than zero. T_k = T_0 / (1 + alpha * k^2)
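
The four multiplicative schedules written out as a small Python helper; the schedule keys are illustrative, not the package's configuration names.

```python
import numpy as np

def multiplicative_temperature(schedule, T0, alpha, k):
    """Temperature at step k for the multiplicative monotonic schedules (illustrative keys)."""
    if schedule == 'exponential':        # alpha in (0, 1)
        return T0 * alpha ** k
    if schedule == 'logarithmic':        # alpha > 1
        return T0 / (1 + alpha * np.log(1 + k))
    if schedule == 'linear':             # alpha > 0
        return T0 / (1 + alpha * k)
    if schedule == 'quadratic':          # alpha > 0
        return T0 / (1 + alpha * k ** 2)
    raise ValueError(f"unknown schedule: {schedule}")
```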

Additive Monotonic Cooling

The difference from Multiplicative Monotonic Cooling is that the final temperature T_n and the number of iterations n are also required. This means it cannot be used as intended if the stop criterion is anything other than a fixed number of iteration steps. A decay parameter is not needed. Each temperature is computed by adding a term to the final temperature. The Additive Monotonic Cooling schedules are: Linear additive cooling, Quadratic additive cooling, Exponential additive cooling and Trigonometric additive cooling. Source: B. T. Luke, Additive monotonic cooling (2005).

Linear additive cooling

This schedule adds a term to the final temperature which decreases linearly with the progression of the schedule. T_k = T_n + (T_0 - T_n) * ((n - k) / n)

Quadratic additive cooling

This schedule adds a term to the final temperature which decreases quadratically with the progression of the schedule. T_k = T_n + (T_0 - T_n) * ((n - k) / n)^2

Exponential additive cooling

Uses a more involved formula to produce a schedule with a slow start, a steep decrease in temperature in the middle and a slow decrease at the end of the process. T_k = T_n + (T_0 - T_n) * (1 / (1 + exp(2 * ln(T_0 - T_n) / n * (k - n/2))))

Trigonometric additive cooling

This schedule behaves similarly to Exponential additive cooling, but less pronouncedly. T_k = T_n + (T_0 - T_n) / 2 * (1 + cos(k * pi / n))
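
The four additive schedules as a small Python helper; again the schedule keys are illustrative, not the package's configuration names.

```python
import numpy as np

def additive_temperature(schedule, T0, Tn, n, k):
    """Temperature at step k of n for the additive monotonic schedules (illustrative keys)."""
    if schedule == 'linear':
        return Tn + (T0 - Tn) * ((n - k) / n)
    if schedule == 'quadratic':
        return Tn + (T0 - Tn) * ((n - k) / n) ** 2
    if schedule == 'exponential':
        return Tn + (T0 - Tn) / (1 + np.exp(2 * np.log(T0 - Tn) / n * (k - n / 2)))
    if schedule == 'trigonometric':
        return Tn + (T0 - Tn) / 2 * (1 + np.cos(k * np.pi / n))
    raise ValueError(f"unknown schedule: {schedule}")
```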

In pseudocode, the algorithm does the following:

For n iterations do:
  1. Take a step of size noisy_step in a random direction
  2. If it reduces the cost, keep the solution
  3. Otherwise keep it with probability exp(-(f_new - f) / T)

Additional details on the Simulated Annealing and Parallel Tempering page.

If the n_parallel_runs parameter is set to a value larger than one in this case, it simply runs multiple independent simulated annealing runs in parallel with different initial points (provided your create_individual function does not always return the same individual).
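
A minimal NumPy sketch of a single annealing step following the three steps above; names such as sa_step, cost_fn and the exact form of the random direction are illustrative assumptions, not the package API.

```python
import numpy as np

def sa_step(x, f_x, cost_fn, T, noisy_step=0.1, rng=None):
    """One simulated-annealing step (illustrative sketch, not the package API)."""
    rng = rng or np.random.default_rng()
    # 1. Take a step of size noisy_step in a random direction
    direction = rng.normal(size=x.size)
    direction /= np.linalg.norm(direction)
    x_new = x + noisy_step * direction
    f_new = cost_fn(x_new)
    # 2. Keep the new solution if it reduces the cost,
    # 3. otherwise keep it with probability exp(-(f_new - f) / T)
    if f_new < f_x or rng.random() < np.exp(-(f_new - f_x) / T):
        return x_new, f_new
    return x, f_x
```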

