Exp 1 looks at the effect of sequential layers whose Jacobians are constant across the batch, as is the case for sequences of Dense and Conv layers. The technique multiplies intermediate Jacobians together to increase parallelism. The use case for this is very niche.
The best way to explain this is with an example:
Our model is a basic MLP and it's split across 2 devices via model parallelism:

$$\hat{y} = f_n(f_{n-1}(\dots f_2(f_1(x))))$$

where each $f_i$ is a Dense layer, layers $f_1, \dots, f_k$ sit on rank 0 and layers $f_{k+1}, \dots, f_n$ sit on rank 1.
The training process is made up of 3 steps:

1. Forward pass
2. Backward pass
3. Parameter update
The forward pass is the process of passing an input through the model and then passing the model's output to the loss function.

The backward pass is the process used to calculate the gradient of the loss with respect to every parameter in the model, working backwards from the loss through each layer.

Finally, once all the gradients have been calculated, each parameter is updated with an SGD-based algorithm.
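As a reference point, a plain SGD update with learning rate $\eta$ (symbol assumed here) moves each parameter $\theta$ against its gradient:

$$\theta \leftarrow \theta - \eta \, \frac{\partial L}{\partial \theta}$$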
As we backward pass through each layer, each layer gets passed the gradient of the loss with respect to its output. From this it computes the gradient with respect to its own parameters, and the gradient with respect to its input, which is passed on to the layer before it.
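Writing $x_i = f_i(x_{i-1})$ for the output of layer $i$ and $W_i$ for its parameters (the same assumed notation as above), this is just the chain rule:

$$\frac{\partial L}{\partial x_{i-1}} = \frac{\partial L}{\partial x_i}\,\frac{\partial x_i}{\partial x_{i-1}}, \qquad \frac{\partial L}{\partial W_i} = \frac{\partial L}{\partial x_i}\,\frac{\partial x_i}{\partial W_i}$$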
As our model is split across 2 devices, rank 0 computes from the input through its layers $f_1, \dots, f_k$ and sends the boundary activation $x_k$ to rank 1, which computes through $f_{k+1}, \dots, f_n$ and then the loss.

For the backward pass, rank 1 computes the gradients for its layers first and then sends $\frac{\partial L}{\partial x_k}$ back to rank 0, which backward passes through $f_k, \dots, f_1$.

Then each rank can update its local parameters in parallel.
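A minimal sketch of this baseline schedule, assuming a 2-process `torch.distributed` launch with the gloo backend (e.g. via `torchrun --nproc_per_node=2`); the helper name `baseline_step` and the toy layer sizes are made up for illustration, not taken from the actual implementation:

```python
import torch
import torch.distributed as dist

def baseline_step(rank, layers, loss_fn=None, x=None, target=None, boundary_shape=None):
    if rank == 0:
        # Forward through f_1 ... f_k, then hand the boundary activation to rank 1.
        h = x
        for layer in layers:
            h = layer(h)
        dist.send(h.detach(), dst=1)
        # Rank 0 now sits idle until rank 1 has finished its whole backward pass.
        grad_h = torch.empty_like(h)
        dist.recv(grad_h, src=1)
        h.backward(grad_h)                 # only now does rank 0's backward start
    else:
        # Receive the boundary activation and treat it as a leaf that needs a gradient.
        h = torch.empty(boundary_shape)
        dist.recv(h, src=0)
        h.requires_grad_(True)
        out = h
        for layer in layers:
            out = layer(out)
        loss_fn(out, target).backward()    # rank 1's backward pass
        dist.send(h.grad, dst=0)           # release rank 0

if __name__ == "__main__":
    dist.init_process_group("gloo")        # launched e.g. via torchrun --nproc_per_node=2
    rank = dist.get_rank()
    torch.manual_seed(0)
    if rank == 0:
        layers = [torch.nn.Linear(8, 16), torch.nn.Linear(16, 16)]
        baseline_step(rank, layers, x=torch.randn(4, 8))
    else:
        layers = [torch.nn.Linear(16, 4)]
        baseline_step(rank, layers, loss_fn=torch.nn.MSELoss(),
                      target=torch.randn(4, 4), boundary_shape=(4, 16))
    dist.destroy_process_group()
```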
Visually this looks like this:
As you can see, there is no compute overlap. The goal of our technique is to start the backward-pass compute on rank 0 as early as possible.
We do this by not computing the gradients for each layer sequentially as the upstream gradient arrives. Instead, each device's backward pass is split into two stages.

During stage 1, the device multiplies the Jacobians of its own layers together, starting from its last layer and working backwards, so that it ends up with the accumulated Jacobian from each layer's output to the device's final output. Because the Jacobians of Dense and Conv layers are constant for the batch, this product is a single matrix rather than a per-sample tensor, and it doesn't depend on the gradient coming from the loss, so rank 0 can run this stage while rank 1 is still busy.

Stage 2 starts once stage 1 has finished on that device and the loss gradient for the device's final output is available. For each layer, in parallel, the loss gradient is multiplied by that layer's accumulated Jacobian and then used to compute the layer's parameter gradients.
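In the notation assumed above, with $J_i = \frac{\partial x_k}{\partial x_i}$ the accumulated Jacobian from layer $i$'s output to the device's last output $x_k$, the two stages on rank 0 would look like:

$$\text{Stage 1:}\quad J_k = I, \qquad J_{i-1} = J_i \,\frac{\partial x_i}{\partial x_{i-1}} \quad \text{for } i = k, \dots, 1$$

$$\text{Stage 2 (each layer in parallel):}\quad \frac{\partial L}{\partial x_i} = \frac{\partial L}{\partial x_k}\, J_i, \qquad \frac{\partial L}{\partial W_i} = \frac{\partial L}{\partial x_i}\,\frac{\partial x_i}{\partial W_i}$$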
Visually this new technique looks like this:
This new technique can be less memory efficient, depending on the operation in each layer: instead of only storing activation-sized gradient tensors, each device now stores an accumulated Jacobian per layer. If the number of elements in a layer's accumulated Jacobian is larger than the number of elements in the gradient tensor it replaces, memory usage goes up.
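As a purely illustrative comparison (the sizes are assumed, not taken from the experiments), consider a Dense layer with 1024 outputs on a device whose final output also has 1024 features, and a batch of 32:

$$\underbrace{1024 \times 1024}_{\text{accumulated Jacobian } J_i} = 1{,}048{,}576 \quad \text{vs.} \quad \underbrace{32 \times 1024}_{\text{activation gradient } \partial L/\partial x_i} = 32{,}768 \ \text{elements.}$$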
Unfortunately, PyTorch autograd doesn't give us an easy way to implement this, so we must implement backpropagation from scratch.
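A minimal single-process sketch of what that from-scratch backward pass could look like for a stack of bias-free Dense layers; the layer sizes and variable names are illustrative assumptions, and the Stage-2 loop, written sequentially here, is what would be parallelised in practice:

```python
import torch

torch.manual_seed(0)

# Illustrative stack of Dense layers y = x @ W (no bias or activation); the
# Jacobian of each layer w.r.t. its input is W^T, i.e. constant for the batch.
dims = [8, 16, 16, 4]                                      # assumed sizes
Ws = [torch.randn(d_in, d_out) for d_in, d_out in zip(dims, dims[1:])]

x = torch.randn(32, dims[0])                               # batch of 32
acts = [x]
for W in Ws:
    acts.append(acts[-1] @ W)                              # forward pass

# Gradient of the loss w.r.t. the final output; in the 2-device setup this is
# what would arrive from the other rank.
grad_out = torch.randn_like(acts[-1])

# Stage 1: accumulate Jacobians J_i = d(out)/d(x_i) backwards through the
# layers. This only touches the weights, so it never needs grad_out.
Js = [torch.eye(dims[-1])]                                 # Jacobian of out w.r.t. itself
for W in reversed(Ws):
    Js.insert(0, Js[0] @ W.T)                              # J_{i-1} = J_i @ W_i^T

# Stage 2: each layer's gradients depend only on grad_out and its own J_i,
# so every iteration of this loop is independent and could run in parallel.
grad_Ws = []
for i, W in enumerate(Ws):
    grad_xi = grad_out @ Js[i + 1]                         # dL/d(output of layer i)
    grad_Ws.append(acts[i].T @ grad_xi)                    # dL/dW_i for y = x @ W

# Sanity check against PyTorch autograd.
Ws_ref = [W.clone().requires_grad_(True) for W in Ws]
out_ref = x
for W in Ws_ref:
    out_ref = out_ref @ W
out_ref.backward(grad_out)
for W_ref, g in zip(Ws_ref, grad_Ws):
    assert torch.allclose(W_ref.grad, g, rtol=1e-4, atol=1e-4)
```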
Exp 3 looks into the technique introduced in Exp 2, but on top of pipeline parallelism, to further increase parallelism and provide a more realistic use case.
Exp 4 goes a step further than Exp 3 by combining pipeline parallelism with tensor parallelism and data parallelism to replicate an even more realistic use case.