
Optimizers


List of Optimizers

Evolution Strategies, as in: Salimans, T., Ho, J., Chen, X. & Sutskever, I. Evolution Strategies as a Scalable Alternative to Reinforcement Learning. arXiv:1703.03864 [cs, stat] (2017).

In pseudocode, the algorithm does the following:

For n iterations do:
  • Perturb the current individual by adding noise with zero mean and noise_std standard deviation

  • If mirrored sampling is enabled, also perturb the current individual by subtracting the same values that were added in the previous step

  • Evaluate the individuals and obtain their fitness

  • Update the current individual as

    theta_{t+1} <- theta_t + alpha * sum_i{F_i * e_i} / (n * sigma^2)

    where F_i is the fitness of the i-th perturbed individual, e_i is its perturbation and sigma is noise_std

  • If fitness shaping is enabled, F_i in the previous step is replaced with the utility u_i, calculated as:

    u_i = max(0, log(n/2 + 1) - log(i)) / sum_{k=1}^{n}{max(0, log(n/2 + 1) - log(k))} - 1/n

    where i is the rank of the individual when sorted in descending order of fitness F_i

Fitness shaping as in the paper: Wierstra, D. et al. Natural Evolution Strategies. Journal of Machine Learning Research 15, 949–980 (2014).
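
A minimal NumPy sketch of one such update is given below; the Gaussian perturbations and all names (es_step, fitness_fn, ...) are illustrative assumptions, not the actual optimizer API.

```python
import numpy as np

def es_step(theta, fitness_fn, pop_size=20, noise_std=0.1, alpha=0.01,
            mirrored_sampling=True, fitness_shaping=True, rng=None):
    """One evolution-strategies update of the parameter vector theta (illustrative sketch)."""
    rng = rng or np.random.default_rng()
    # Perturbations with zero mean and noise_std standard deviation
    eps = rng.normal(0.0, noise_std, size=(pop_size, theta.size))
    if mirrored_sampling:
        eps = np.concatenate([eps, -eps])                    # mirrored perturbations
    fitness = np.array([fitness_fn(theta + e) for e in eps])

    if fitness_shaping:
        n = len(fitness)
        ranks = np.empty(n, dtype=int)
        ranks[np.argsort(-fitness)] = np.arange(1, n + 1)    # rank 1 = highest fitness
        raw = np.maximum(0.0, np.log(n / 2 + 1) - np.log(ranks))
        fitness = raw / raw.sum() - 1.0 / n                  # utilities u_i

    # theta_{t+1} <- theta_t + alpha * sum_i(F_i * e_i) / (n * sigma^2)
    step = (fitness[:, None] * eps).sum(axis=0) / (len(eps) * noise_std ** 2)
    return theta + alpha * step
```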

In pseudocode, the algorithm does the following:

For n iterations do:
  1. Sample individuals from the current distribution
  2. Evaluate the individuals and obtain their fitness
  3. Pick the top rho * pop_size individuals as the elite set
  4. From the remaining non-elite individuals, select additional ones using a simulated-annealing-style criterion based on the difference between their fitness and the (1 - rho)-quantile fitness (gamma), and the current temperature
  5. Fit the distribution family to the new elite individuals by minimizing cross entropy. The fitting is smoothed to prevent premature convergence to local minima: when smoothing, the previous parameters are weighted by the smoothing parameter.
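
A minimal NumPy sketch of one such iteration with a Gaussian distribution family; the simulated-annealing-style selection of non-elite individuals (step 4) is omitted for brevity, and all names (ce_step, fitness_fn, smoothing) are illustrative, not the actual optimizer API.

```python
import numpy as np

def ce_step(mean, cov, fitness_fn, pop_size=50, rho=0.2, smoothing=0.7, rng=None):
    """One cross-entropy update of a Gaussian search distribution (illustrative sketch)."""
    rng = rng or np.random.default_rng()
    # 1. Sample individuals from the current distribution
    pop = rng.multivariate_normal(mean, cov, size=pop_size)
    # 2. Evaluate the individuals and obtain their fitness
    fitness = np.array([fitness_fn(ind) for ind in pop])
    # 3. Keep the top rho * pop_size individuals as the elite set
    n_elite = max(2, int(rho * pop_size))
    elite = pop[np.argsort(-fitness)[:n_elite]]
    # 5. Re-fit the Gaussian to the elite set, smoothed with the previous parameters
    new_mean = (1 - smoothing) * elite.mean(axis=0) + smoothing * mean
    new_cov = (1 - smoothing) * np.cov(elite, rowvar=False) + smoothing * cov
    return new_mean, new_cov
```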

In pseudocode, the algorithm does the following:

For n iterations do:
  1. Sample individuals from the current distribution
  2. Evaluate the individuals and obtain their fitness
  3. Check whether gamma or the best individual's fitness has increased
  4. If neither has increased, grow the population size by n_expand and sample again (step 1), unless max_pop_size has already been reached, in which case stop; if one of them has increased, set pop_size = min_pop_size and proceed
  5. Pick the n_elite individuals with the highest fitness
  6. From the remaining non-elite individuals, select additional ones using a simulated-annealing-style criterion based on the difference between their fitness and the (1 - rho)-quantile fitness (gamma), and the current temperature
  7. Fit the distribution family to the new elite individuals by minimizing cross entropy. The fitting is smoothed to prevent premature convergence to local minima: when smoothing, the previous parameters are weighted by the smoothing parameter.
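
A small sketch of the population-size adaptation in steps 3-4; the function name and the (new_pop_size, stop) return convention are illustrative assumptions, not the actual implementation.

```python
def adapt_pop_size(pop_size, best_fitness, gamma, prev_best, prev_gamma,
                   n_expand, min_pop_size, max_pop_size):
    """Decide the next population size (sketch of steps 3-4; all names are illustrative)."""
    if best_fitness > prev_best or gamma > prev_gamma:
        return min_pop_size, False        # progress was made: shrink back and proceed
    if pop_size + n_expand > max_pop_size:
        return pop_size, True             # cannot expand any further: signal stop
    return pop_size + n_expand, False     # no progress: expand and sample again
```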

In pseudocode, the algorithm does the following:

For n iterations do:
  1. Evaluate the fitness of individuals in the close vicinity of the current individual
  2. Calculate the gradient based on these fitnesses
  3. Create the new 'current individual' by taking a step in parameter space along the direction of steepest ascent

The gradient step can follow one of these update rules:
  • Classic Gradient Descent
  • Stochastic Gradient Descent
  • ADAM
  • RMSProp
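
A rough NumPy sketch of the exploration-based gradient estimate followed by an ADAM-style ascent step; the least-squares local fit and all names are illustrative assumptions, not the package API.

```python
import numpy as np

def estimate_gradient(theta, fitness_fn, n_samples=16, exploration_std=0.01, rng=None):
    """Estimate the fitness gradient from points in the close vicinity of theta (sketch)."""
    rng = rng or np.random.default_rng()
    eps = rng.normal(0.0, exploration_std, size=(n_samples, theta.size))
    f0 = fitness_fn(theta)
    df = np.array([fitness_fn(theta + e) - f0 for e in eps])
    # Least-squares fit of a local linear model  df ~ eps @ grad
    grad, *_ = np.linalg.lstsq(eps, df, rcond=None)
    return grad

def adam_step(theta, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One ADAM-style ascent step on the fitness (illustrative, not the package API)."""
    m, v, t = state
    t += 1
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta = theta + lr * m_hat / (np.sqrt(v_hat) + eps)   # '+' because fitness is maximised
    return theta, (m, v, t)
```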

Parallel Tempering is a search algorithm that runs multiple simulated annealing instances at the same time and gives any two of them a certain chance of swapping temperatures. Each of the annealing instances can have a different cooling schedule with its own decay parameter or starting/ending temperatures. This has a functional effect similar to a single simulated annealing run with multiple coolings and reheatings, but needs fewer parameters (such as when to reheat and how often). For details on simulated annealing, please read the documentation on it.

Note: For simplicity's sake, it is not the positions that are swapped but the temperature and the schedule, which amounts to exactly the same thing. The temperatures and the schedules are each stored in lists, both indexed by 'compare_indices'. If the swap criterion between two schedules is met, the respective entries of 'compare_indices' are swapped. To obtain the parallel runs, 'n_parallel_runs' is used - each individual is one of the parallel runs.

The algorithm does:

For n iterations and each cooling schedule do:
  1. Take a step of size noisy_step in a random direction
  2. If it reduces the cost, keep the solution
  3. Otherwise keep it with probability exp(-(f_new - f) / T)
  4. Swap positions between two randomly chosen schedules with probability exp(-(f_1 - f_2) * (1 / (k * T_1) - 1 / (k * T_2))), with k being a constant

Additional details on the Simulated Annealing and Parallel Tempering page.
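
A small NumPy sketch of the temperature-swap step between two chains, using the acceptance probability from step 4 above; the function and argument names are illustrative assumptions, not the actual implementation.

```python
import numpy as np

def maybe_swap_temperatures(f, temps, i, j, k=1.0, rng=None):
    """Attempt a temperature swap between annealing chains i and j (illustrative sketch).

    f     -- current cost of each chain
    temps -- current temperature of each chain
    """
    rng = rng or np.random.default_rng()
    p_swap = np.exp(-(f[i] - f[j]) * (1.0 / (k * temps[i]) - 1.0 / (k * temps[j])))
    if rng.random() < min(1.0, p_swap):
        temps[i], temps[j] = temps[j], temps[i]   # swap temperatures (and schedules), not positions
    return temps
```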

Multiplicative Monotonic Cooling

This schedule type multiplies the starting temperature by a factor that decreases over time (with the number k of performed iteration steps). It requires a decay parameter (alpha) but no ending temperature, as the progression of the temperature is well defined by the decay parameter alone. The Multiplicative Monotonic Cooling schedules are: Exponential multiplicative cooling, Logarithmic multiplicative cooling, Linear multiplicative cooling and Quadratic multiplicative cooling. Source: Kirkpatrick, Gelatt and Vecchi (1983).

Exponential multiplicative cooling

Default cooling schedule for typical applications of simulated annealing. At each step the temperature is multiplied by the factor alpha (which has to be between 0 and 1); in other words, T_k is the starting temperature T_0 multiplied by alpha raised to the power of k: T_k = T_0 * alpha^k

Logarithmic multiplicative cooling

The factor by which the temperature decreases is inversely proportional to the log of k, so the cooling slows down as the schedule progresses. Alpha has to be larger than one. T_k = T_0 / (1 + alpha * log(1 + k))

Linear multiplicative cooling

Behaves similarly to Logarithmic multiplicative cooling in that the decrease slows over time, but less pronouncedly. The decrease is inversely proportional to alpha times k, and alpha has to be larger than zero: T_k = T_0 / (1 + alpha * k)

Quadratic multiplicative cooling

This schedule stays at high temperatures longer than the other schedules and cools more steeply later in the process. Alpha has to be larger than zero. T_k = T_0 / (1 + alpha * k^2)
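
The four multiplicative schedules written out as a small Python helper; the schedule keys are illustrative, not the package's configuration names.

```python
import numpy as np

def multiplicative_temperature(schedule, T0, alpha, k):
    """Temperature at step k for the multiplicative monotonic schedules (illustrative keys)."""
    if schedule == 'exponential':        # alpha in (0, 1)
        return T0 * alpha ** k
    if schedule == 'logarithmic':        # alpha > 1
        return T0 / (1 + alpha * np.log(1 + k))
    if schedule == 'linear':             # alpha > 0
        return T0 / (1 + alpha * k)
    if schedule == 'quadratic':          # alpha > 0
        return T0 / (1 + alpha * k ** 2)
    raise ValueError(f"unknown schedule: {schedule}")
```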

Additive Monotonic Cooling

The difference from Multiplicative Monotonic Cooling is that the final temperature T_n and the number of iterations n are also required. This means it cannot be used as intended if the stop criterion is anything other than a fixed number of iteration steps. A decay parameter is not needed. Each temperature is computed by adding a term to the final temperature. The Additive Monotonic Cooling schedules are: Linear additive cooling, Quadratic additive cooling, Exponential additive cooling and Trigonometric additive cooling. Source: B. T. Luke, Additive monotonic cooling (2005).

Linear additive cooling

This schedule adds a term to the final temperature which decreases linearly with the progression of the schedule. T_k = T_n + (T_0 - T_n) * ((n - k) / n)

Quadratic additive cooling

This schedule adds a term to the final temperature which decreases quadratically with the progression of the schedule. T_k = T_n + (T_0 - T_n) * ((n - k) / n)^2

Exponential additive cooling

Uses a more involved formula to produce a schedule with a slow start, a steep decrease in temperature in the middle and a slow decrease at the end of the process. T_k = T_n + (T_0 - T_n) * (1 / (1 + exp(2 * ln(T_0 - T_n) / n * (k - n/2))))

Trigonometric additive cooling

This schedule behaves similarly to Exponential additive cooling, but less pronouncedly. T_k = T_n + (T_0 - T_n) / 2 * (1 + cos(k * pi / n))
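
The four additive schedules as a small Python helper; again the schedule keys are illustrative, not the package's configuration names.

```python
import numpy as np

def additive_temperature(schedule, T0, Tn, n, k):
    """Temperature at step k of n for the additive monotonic schedules (illustrative keys)."""
    if schedule == 'linear':
        return Tn + (T0 - Tn) * ((n - k) / n)
    if schedule == 'quadratic':
        return Tn + (T0 - Tn) * ((n - k) / n) ** 2
    if schedule == 'exponential':
        return Tn + (T0 - Tn) / (1 + np.exp(2 * np.log(T0 - Tn) / n * (k - n / 2)))
    if schedule == 'trigonometric':
        return Tn + (T0 - Tn) / 2 * (1 + np.cos(k * np.pi / n))
    raise ValueError(f"unknown schedule: {schedule}")
```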

In pseudocode, the algorithm does the following:

For n iterations do:
  1. Take a step of size noisy_step in a random direction
  2. If it reduces the cost, keep the solution
  3. Otherwise keep it with probability exp(-(f_new - f) / T)

Additional details on the Simulated Annealing and Parallel Tempering page.

If the n_parallel_runs parameter is set to a value larger than one in this case, it simply runs multiple independent simulated annealing runs in parallel with different initial points (provided your create_individual function does not always return the same individual).
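
A minimal NumPy sketch of a single annealing step following the three steps above; names such as sa_step, cost_fn and the exact form of the random direction are illustrative assumptions, not the package API.

```python
import numpy as np

def sa_step(x, f_x, cost_fn, T, noisy_step=0.1, rng=None):
    """One simulated-annealing step (illustrative sketch, not the package API)."""
    rng = rng or np.random.default_rng()
    # 1. Take a step of size noisy_step in a random direction
    direction = rng.normal(size=x.size)
    direction /= np.linalg.norm(direction)
    x_new = x + noisy_step * direction
    f_new = cost_fn(x_new)
    # 2. Keep the new solution if it reduces the cost,
    # 3. otherwise keep it with probability exp(-(f_new - f) / T)
    if f_new < f_x or rng.random() < np.exp(-(f_new - f_x) / T):
        return x_new, f_new
    return x, f_x
```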

