# 11.6 Momentum

- In mini-batch gradient descent, at each time step we compute the mean gradient over the training examples in a mini-batch and update the network parameters by stepping in the negative direction of that gradient, as sketched below.
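
A minimal sketch of this plain mini-batch update; the notation (weights $w$, learning rate $\eta$, mini-batch $\mathcal{B}_t$, per-example loss $\ell_i$) is an assumption for illustration, not fixed by the source:

$$
g_t = \frac{1}{|\mathcal{B}_t|} \sum_{i \in \mathcal{B}_t} \nabla_w \ell_i(w_{t-1}),
\qquad
w_t = w_{t-1} - \eta \, g_t
$$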
        

## Mini-batch stochastic gradient descent with momentum

- Instead of using only the current step's gradient $g_t$ to determine the update direction, we can replace $g_t$ with $v_t$ to better guide the parameter search, as sketched below. Here $v_t$ is called the momentum.
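
A sketch of the corresponding update with momentum, in the same assumed notation; the exact scaling of $g_t$ is not specified in the source, so this uses the common leaky-average form that reduces to plain gradient descent when β = 0:

$$
v_t = \beta \, v_{t-1} + g_t,
\qquad
w_t = w_{t-1} - \eta \, v_t,
\qquad
0 \le \beta < 1
$$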
        

## Momentum ($v_t$): a leaky average over past gradients

- Momentum accumulates the gradients of the past and current steps to determine the direction of the next update, as shown by the expansion below.
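
Unrolling the momentum update above makes the leaky average explicit: the most recent gradient has weight 1, and each older gradient is down-weighted by another factor of β:

$$
v_t = \beta \, v_{t-1} + g_t = \sum_{\tau = 0}^{t-1} \beta^{\tau} \, g_{t-\tau}
\qquad (\text{assuming } v_0 = 0)
$$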

        

- When β = 0, we recover regular gradient descent.

- With a smaller β, we only apply a slight correction to the regular gradient method; the learning curve adapts more quickly to recent changes in the gradient.

- With a higher β, more of the contribution comes from past steps; the learning curve is smoother because it averages over a larger window of the past gradient history.

- β controls how much history (momentum) is included in the update. Because past gradients are exponentially down-weighted, we are approximately averaging over the last 1/(1-β) gradients; see the worked example below.
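
A short worked example of this effective window size: the weights of past gradients form a geometric series with total mass $\sum_{\tau \ge 0} \beta^{\tau} = 1/(1-\beta)$, so roughly that many recent gradients dominate the average:

$$
\beta = 0.5 \;\Rightarrow\; \tfrac{1}{1-\beta} = 2,
\qquad
\beta = 0.9 \;\Rightarrow\; \tfrac{1}{1-\beta} = 10,
\qquad
\beta = 0.99 \;\Rightarrow\; \tfrac{1}{1-\beta} = 100 .
$$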

- In practice, stochastic gradient descent with momentum usually works better and converges faster than plain stochastic gradient descent.

- Implementation in PyTorch: `torch.optim.SGD(params, lr=lr, momentum=momentum)`; see the sketch below.
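
A minimal runnable sketch of SGD with momentum in PyTorch; the toy regression data, model, and hyperparameter values below are illustrative assumptions, not taken from the source:

```python
import torch
from torch import nn

# Toy data for y = 2x + 1 with a little noise (illustrative only).
torch.manual_seed(0)
X = torch.randn(256, 1)
y = 2.0 * X + 1.0 + 0.01 * torch.randn(256, 1)

model = nn.Linear(1, 1)
loss_fn = nn.MSELoss()

# momentum plays the role of β above; momentum=0.0 recovers plain SGD.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

for epoch in range(100):
    optimizer.zero_grad()        # clear gradients from the previous step
    loss = loss_fn(model(X), y)  # loss on the batch (full batch here for simplicity)
    loss.backward()              # compute g_t
    optimizer.step()             # v_t = β·v_{t-1} + g_t;  w_t = w_{t-1} - lr·v_t

print(model.weight.item(), model.bias.item())  # should approach 2.0 and 1.0
```

With `momentum=0.9`, the optimizer keeps a velocity buffer per parameter and updates it as in the leaky-average rule above (with PyTorch's default `dampening=0`).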