how to use all_gather in training loop? #2504

kkarrancsu · 2022-03-07T19:39:21Z

I have defined my train_step in the exact same way as in the cifar10 example. Is it possible to gather all of the predictions before computing the loss? I haven't seen examples of this pattern in the ignite examples (maybe I'm missing it?), but for my application, it is more optimal to compute the loss after aggregating the forward passes and targets run on multiple GPU's. This only matters when using DistributedDataParallel, since DataParallel automatically aggregates the outputs.

I see the idist.all_gather() function, but am unclear how to use it in a training loop.

The text was updated successfully, but these errors were encountered:

sdesrozis · 2022-03-07T22:00:18Z

@kkarrancsu Thanks for your answer.

In general, idist.all_gather() can be used as long as the call is made collectively by all the processes. Therefore, you can use this method to gather the predictions in your training loop.

I can provide an example asap and maybe update the doc accordingly.

However, I'm not completely sure about your question. In fact, if you want to compute predictions in ddp, gather and back propagate from one proc, it won't work. You can check the internal design https://pytorch.org/docs/stable/notes/ddp.html#internal-design

kkarrancsu · 2022-03-07T22:05:33Z

@sdesrozis Thanks for your quick reply! Sorry if my initial question was unclear. As an example:

m = model()
m_dp = nn.DataParallel(m)
m_ddp = nn.DistributedDataParallel(m)

x = input # [batch_size, ...]
y_dp = m_dp(x)  # [batch_size, ...]
y_ddp = m_ddp(x) # [batch_size/ngpu, ...]

I'd like to gather all the y_ddp from all gpu's before computing a loss. I hope that makes the question clear?

sdesrozis · 2022-03-07T22:40:08Z

Thanks for the clarification. Would you like to use the loss as a metric ? Or would you want to call loss.backward() ?

kkarrancsu · 2022-03-07T22:45:49Z

I'd like to call loss.backward()

sdesrozis · 2022-03-08T06:39:30Z

Ok so I think it won't work even if you gather the predictions. The gathering operation is not an autodiff function so it will cut the graph computation. The forward pass creates some internal states that won't be gathered too.

Although I'm pretty sure that is answered in the PyTorch forum. Maybe I'm wrong though and I would be interested by a few discussions about this topic.

EDIT see here https://amsword.medium.com/gradient-backpropagation-with-torch-distributed-all-gather-9f3941a381f8

kkarrancsu · 2022-03-08T14:16:16Z

@sdesrozis Thanks - I will investigate based on your link and report back.

sdesrozis · 2022-03-08T14:22:09Z

@sdesrozis Thanks - I will investigate based on your link and report back.

Good ! Although I’m doubtful about the link… Interesting by your feedback.

vfdev-5 · 2022-03-08T14:27:19Z

@kkarrancsu can you provide a bit more details on what exactly you would like to do ?
In DDP, data is distributed to N processes and model is cloned. When we do the forward pass each process obtains predictions y_preds = m_ddp(x) on its data chunk and using a loss function and loss.backward() we can compute gradients that are finally sum up and applied to the model internally by pytorch DDP model wrapper.

As for distributed autograd, you can check as well : https://pytorch.org/docs/stable/rpc.html#distributed-autograd-framework

kkarrancsu · 2022-03-08T15:24:13Z

Hi @vfdev-5, sure.

We are using the Supervised Contrastive loss to train an embedding. In Eq. 2 of the paper, we see that the loss depends on the number of samples used to compute it (positive and negative).

My colleague suggested to me that it is more optimal to compute the loss considering all examples (the entire batch), rather than considering batch/ngpu samples (which is what would happen when using DDP and computing loss locally to each GPU). This is because the denominator in SupConLoss is computing the loss of negative samples, and by first aggregating all of the negative samples across gpus, you would get a more accurate loss.

sdesrozis · 2022-03-08T19:47:21Z

Ok I understand. You should have a look to a distributed implementation of SimCLR. See for instance

https://github.com/Spijkervet/SimCLR/blob/cd85c4366d2e6ac1b0a16798b76ac0a2c8a94e58/simclr/modules/nt_xent.py#L7

This might give you some inspiration.

lxysl · 2023-09-02T14:55:30Z

Ok I understand. You should have a look to a distributed implementation of SimCLR. See for instance

https://github.com/Spijkervet/SimCLR/blob/cd85c4366d2e6ac1b0a16798b76ac0a2c8a94e58/simclr/modules/nt_xent.py#L7

This might give you some inspiration.

This code is not so correct. Please check this issue: Spijkervet/SimCLR#30 and my pr: Spijkervet/SimCLR#46.

kkarrancsu added the question label Mar 7, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

how to use all_gather in training loop? #2504

how to use all_gather in training loop? #2504

kkarrancsu commented Mar 7, 2022 •

edited

Loading

sdesrozis commented Mar 7, 2022

kkarrancsu commented Mar 7, 2022 •

edited

Loading

sdesrozis commented Mar 7, 2022

kkarrancsu commented Mar 7, 2022

sdesrozis commented Mar 8, 2022 •

edited

Loading

kkarrancsu commented Mar 8, 2022

sdesrozis commented Mar 8, 2022

vfdev-5 commented Mar 8, 2022 •

edited

Loading

kkarrancsu commented Mar 8, 2022

sdesrozis commented Mar 8, 2022

lxysl commented Sep 2, 2023

how to use all_gather in training loop? #2504

how to use all_gather in training loop? #2504

Comments

kkarrancsu commented Mar 7, 2022 • edited Loading

sdesrozis commented Mar 7, 2022

kkarrancsu commented Mar 7, 2022 • edited Loading

sdesrozis commented Mar 7, 2022

kkarrancsu commented Mar 7, 2022

sdesrozis commented Mar 8, 2022 • edited Loading

kkarrancsu commented Mar 8, 2022

sdesrozis commented Mar 8, 2022

vfdev-5 commented Mar 8, 2022 • edited Loading

kkarrancsu commented Mar 8, 2022

sdesrozis commented Mar 8, 2022

lxysl commented Sep 2, 2023

kkarrancsu commented Mar 7, 2022 •

edited

Loading

kkarrancsu commented Mar 7, 2022 •

edited

Loading

sdesrozis commented Mar 8, 2022 •

edited

Loading

vfdev-5 commented Mar 8, 2022 •

edited

Loading