Tensor parallel distributed strategy without using deepspeed #280
Conversation
This is a big patch and I am reviewing it further.
@@ -1013,6 +1161,7 @@ def forward(
global has_fused_rope
has_fused_rope = False
minor: please remove this
Done
    [GaudiLlamaDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
)
layers = []
for i in range(config.num_hidden_layers):
minor: use layer_idx in place of 'i'
Done
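For reference, a minimal sketch of how the renamed loop might read after this fix; the nn.ModuleList wrapping and the append call are assumptions about the surrounding code, not quoted from the diff:

    layers = []
    for layer_idx in range(config.num_hidden_layers):
        # Assumed body: construct one decoder layer per index, mirroring the
        # original list comprehension.
        layers.append(GaudiLlamaDecoderLayer(config, layer_idx))
    self.layers = nn.ModuleList(layers)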
import torch.distributed
from torch import nn

#from optimum.habana.distributed import tp_wrapping
minor: please remove the commented code
Done
pass


class NotDistributed(DistributedStrategy):
Why is the derived class named NotDistributed when the base class is DistributedStrategy? It creates some confusion in readability; maybe it needs a different name?
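For context, a minimal sketch of the strategy hierarchy this comment is about. The class names and the distribute_layer signature appear in the diff; the method bodies and the no-op behavior of NotDistributed are assumptions:

    from torch import nn

    class DistributedStrategy:
        # Base interface: decides how each decoder layer is placed or sharded.
        def distribute_layer(self, block: nn.Module, layer: int) -> nn.Module:
            return block

    class NotDistributed(DistributedStrategy):
        # Single-device fallback: leaves every layer untouched.
        pass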
def distribute_layer(self, block: nn.Module, layer: int) -> nn.Module:
    device = self.layer_to_device[layer]
    if self.from_meta:
        # https://github.com/pytorch/pytorch/pull/113647
This PR is closed, so we can possibly remove references to such comments carried over from the foundation repo. #Comment
Done
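As a companion to the snippet above, a hedged sketch of per-layer device placement with meta-device materialization; layer_to_device and from_meta come from the diff, while the to_empty/to calls and the return are assumptions about the omitted lines:

    from torch import nn

    def distribute_layer(self, block: nn.Module, layer: int) -> nn.Module:
        device = self.layer_to_device[layer]
        if self.from_meta:
            # Parameters were created on the meta device; allocate real storage
            # on the target device before weights are loaded.
            block.to_empty(device=device)
        else:
            block.to(device)
        return block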
)
if par_mod.bias is not None:
    par_mod.bias.copy_(torch.split(mod.bias, output_size_per_partition)[rank])
# print(f"For rank {rank}, we have the following weights: Base weight {mod.weight} bias {mod.bias}; Par weight {par_mod.weight}, bias {par_mod.bias}")
#Comment: commented-out debug print; please remove.
    par_mod.bias.copy_(mod.bias)
else:
    par_mod.bias.zero_()
# print(f"For rank {rank}, we have the following weights: Base weight {mod.weight}, bias {mod.bias}; Par weight {par_mod.weight}, bias {par_mod.bias}")
#Comment: commented-out debug print; please remove.
par_mod.weight.copy_(
    torch.split(mod.weight, output_size_per_partition, dim=1)[rank]
)
# print(f"For rank {rank}, we have the following weights: Base weight {mod.weight} bias {mod.bias}; Par weight {par_mod.weight}, bias {par_mod.bias}")
#Comment: commented-out debug print; please remove.
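For readers following the thread, a self-contained sketch of the per-rank weight sharding these snippets perform for tensor parallelism. The helper name, the use of a plain nn.Linear, and the column-wise split are illustrative assumptions rather than the PR's actual classes:

    import torch
    from torch import nn

    def shard_linear_colwise(mod: nn.Linear, world_size: int, rank: int) -> nn.Linear:
        # Column parallelism: split the output features across ranks, so each
        # rank holds out_features // world_size rows of the weight matrix.
        out_per_rank = mod.out_features // world_size
        par_mod = nn.Linear(mod.in_features, out_per_rank, bias=mod.bias is not None)
        with torch.no_grad():
            par_mod.weight.copy_(torch.split(mod.weight, out_per_rank, dim=0)[rank])
            if mod.bias is not None:
                par_mod.bias.copy_(torch.split(mod.bias, out_per_rank)[rank])
        return par_mod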
# The transposes here are to avoid excessive recompilation due to split()
# specializing the dimension where the all_gather is happening
last_dim = input_.dim() - 1
# Starting PT 2.3, we can go back to funcol.all_gather_tensor
#Comment: note the TODO above about moving back to funcol.all_gather_tensor once PT 2.3 is available.
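A hedged sketch of the transpose-then-all-gather workaround that comment describes; the function name and the use of all_gather_into_tensor are assumptions, and the PR note above suggests switching back to funcol.all_gather_tensor once PyTorch 2.3 is the baseline:

    import torch
    import torch.distributed as dist

    def all_gather_last_dim(input_: torch.Tensor, world_size: int) -> torch.Tensor:
        # Gather on dim 0 instead of the last dim so the collective always sees
        # a fixed dimension, avoiding recompilation from dimension specialization.
        last_dim = input_.dim() - 1
        x = input_.transpose(0, last_dim).contiguous()
        out = torch.empty((world_size * x.shape[0], *x.shape[1:]),
                          dtype=x.dtype, device=x.device)
        dist.all_gather_into_tensor(out, x)
        # Transpose back so the gathered data ends up on the original last dim.
        return out.transpose(0, last_dim).contiguous()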
Compare 576860d to 1e82fac
…I#280) * TP reference - ibm foundation-model-stack * Code cleanup -removed unused code --------- Co-authored-by: Kalyan <kkumar@habana.ai>
…abanaAI#280) (HabanaAI#299)" This reverts commit 32c86d3.
…abanaAI#280)" This reverts commit c6e5f9c.
* Revert "Tensor parallel distributed strategy without using deepspeed (#280) (#299)" This reverts commit 32c86d3. * Tensor parallel distributed strategy without using deepspeed (huggingface#1121) Co-authored-by: Kalyan <kkumar@habana.ai> --------- Co-authored-by: Kalyan <kkumar@habana.ai>
* Revert "Tensor parallel distributed strategy without using deepspeed (#280)" This reverts commit c6e5f9c. * Tensor parallel distributed strategy without using deepspeed (huggingface#1121) Co-authored-by: Kalyan <kkumar@habana.ai> --------- Co-authored-by: Kalyan <kkumar@habana.ai>
* Revert "Tensor parallel distributed strategy without using deepspeed (#280)" This reverts commit c6e5f9c. * Tensor parallel distributed strategy without using deepspeed (huggingface#1121) Co-authored-by: Kalyan <kkumar@habana.ai> --------- Change-Id: Ic30c85e697dbd6a51767e21e1c06c9a20120d9f6 Co-authored-by: Kalyan <kkumar@habana.ai>
* Revert "Tensor parallel distributed strategy without using deepspeed (#280)" This reverts commit c6e5f9c. * Tensor parallel distributed strategy without using deepspeed (huggingface#1121) Co-authored-by: Kalyan <kkumar@habana.ai> --------- Change-Id: Ic30c85e697dbd6a51767e21e1c06c9a20120d9f6 Co-authored-by: Kalyan <kkumar@habana.ai>
Tensor parallelism is implemented by extending GaudiLlamaAttention -> TPGaudiLlamaAttention and GaudiLlamaMLP -> TPGaudiLlamaMLP.
Use the parameter --distributed_strategy="tp" to invoke this code path.
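As a usage sketch only: the --distributed_strategy flag is the one added in this PR, while the spawn helper, script name, model, and other flags below are assumptions modeled on the optimum-habana text-generation example:

    python ../gaudi_spawn.py --world_size 8 run_generation.py \
        --model_name_or_path meta-llama/Llama-2-70b-hf \
        --bf16 \
        --distributed_strategy "tp"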