# 11.6 Momentum

- In mini-batch gradient descent, at each time step we compute the mean gradient over the training examples in a mini-batch and update the network parameters by stepping in the negative direction of that gradient, as sketched below.
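
A minimal sketch of this plain mini-batch update; the notation (weights $w$, learning rate $\eta$, mini-batch $\mathcal{B}_t$, per-example loss $\ell_i$) is an assumption for illustration, not fixed by the source:

$$
g_t = \frac{1}{|\mathcal{B}_t|} \sum_{i \in \mathcal{B}_t} \nabla_w \ell_i(w_{t-1}),
\qquad
w_t = w_{t-1} - \eta \, g_t
$$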
        

## Mini-batch stochastic gradient descent with momentum

- Instead of using only the current step's gradient $g_t$ to determine the update direction, we can replace $g_t$ with $v_t$ to better guide the parameter search, as sketched below. Here $v_t$ is called the momentum.
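
A sketch of the corresponding update with momentum, in the same assumed notation; the exact scaling of $g_t$ is not specified in the source, so this uses the common leaky-average form that reduces to plain gradient descent when β = 0:

$$
v_t = \beta \, v_{t-1} + g_t,
\qquad
w_t = w_{t-1} - \eta \, v_t,
\qquad
0 \le \beta < 1
$$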
        

## Momentum ($v_t$): a leaky average over past gradients

- Momentum accumulates the gradients of the past and current steps to determine the direction of the next update, as shown by the expansion below.
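
Unrolling the momentum update above makes the leaky average explicit: the most recent gradient has weight 1, and each older gradient is down-weighted by another factor of β:

$$
v_t = \beta \, v_{t-1} + g_t = \sum_{\tau = 0}^{t-1} \beta^{\tau} \, g_{t-\tau}
\qquad (\text{assuming } v_0 = 0)
$$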

        

- When β = 0, we recover regular gradient descent.

- With a smaller β, we only apply a slight correction to the regular gradient method; the learning curve adapts more quickly to recent changes in the gradient.

- With a higher β, more of the contribution comes from past steps; the learning curve is smoother because it averages over a larger window of the past gradient history.

- β controls how much history (momentum) is included in the update. Because past gradients are exponentially down-weighted, we are approximately averaging over the last 1/(1-β) gradients; see the worked example below.
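
A short worked example of this effective window size: the weights of past gradients form a geometric series with total mass $\sum_{\tau \ge 0} \beta^{\tau} = 1/(1-\beta)$, so roughly that many recent gradients dominate the average:

$$
\beta = 0.5 \;\Rightarrow\; \tfrac{1}{1-\beta} = 2,
\qquad
\beta = 0.9 \;\Rightarrow\; \tfrac{1}{1-\beta} = 10,
\qquad
\beta = 0.99 \;\Rightarrow\; \tfrac{1}{1-\beta} = 100 .
$$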

- In practice, stochastic gradient descent with momentum usually works better and converges faster than plain stochastic gradient descent.

- Implementation in PyTorch: `torch.optim.SGD(params, lr=lr, momentum=momentum)`; see the sketch below.
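
A minimal runnable sketch of SGD with momentum in PyTorch; the toy regression data, model, and hyperparameter values below are illustrative assumptions, not taken from the source:

```python
import torch
from torch import nn

# Toy data for y = 2x + 1 with a little noise (illustrative only).
torch.manual_seed(0)
X = torch.randn(256, 1)
y = 2.0 * X + 1.0 + 0.01 * torch.randn(256, 1)

model = nn.Linear(1, 1)
loss_fn = nn.MSELoss()

# momentum plays the role of β above; momentum=0.0 recovers plain SGD.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

for epoch in range(100):
    optimizer.zero_grad()        # clear gradients from the previous step
    loss = loss_fn(model(X), y)  # loss on the batch (full batch here for simplicity)
    loss.backward()              # compute g_t
    optimizer.step()             # v_t = β·v_{t-1} + g_t;  w_t = w_{t-1} - lr·v_t

print(model.weight.item(), model.bias.item())  # should approach 2.0 and 1.0
```

With `momentum=0.9`, the optimizer keeps a velocity buffer per parameter and updates it as in the leaky-average rule above (with PyTorch's default `dampening=0`).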