support qwen2 hf<->mcore ckpt converter #1290
base: main
Hi @wenyujin333, could you please rebase the MR onto the main branch to resolve the conflicts? Thanks.
Thanks for the great work and contribution to Megatron-LM.
I think a few changes would help get this MR merged into MCore:
- A documentation section introducing the workflow for using the HF<->MCore converter with Qwen models would be very helpful for users, similar to https://github.com/NVIDIA/Megatron-LM/tree/main/examples/mixtral
- From an MCore developer's point of view, some of the code is difficult to maintain because it is complex and hard to follow, for example in saver_qwen2_hf.py. Perhaps we could restructure some sections to make the logic flow more apparent.
tools/checkpoint/loader_mcore.py
    # Dense modules
    for tp_rank, model in enumerate(models[0]):
        layer = get_transformer_block(model).layers[layer_num]
        qkv_weight.append(layer.self_attention.linear_qkv.weight.data)
        dense_weight.append(layer.self_attention.linear_proj.weight.data)
        if md.linear_bias:
            qkv_bias.append(layer.self_attention.linear_qkv.bias.data)
        elif md.add_qkv_bias:
            qkv_bias.append(layer.self_attention.linear_qkv.bias.data)
        shared_expert_mlp_l0_weight.append(layer.mlp.shared_experts.linear_fc1.weight.data)
        shared_expert_mlp_l1_weight.append(layer.mlp.shared_experts.linear_fc2.weight.data)

    layer = get_transformer_block(models[0][0]).layers[layer_num]
    router_weight = layer.mlp.router.weight.data
    shared_expert_gate_weight = layer.mlp.shared_experts.gate_weight.data

    # MoE modules
    num_experts_per_rank = margs.num_experts // ep_size
    for ep_rank, tp_models in enumerate(models):
        for tp_rank, model in enumerate(tp_models):
            layer = get_transformer_block(model).layers[layer_num]
            for local_expert_idx in range(num_experts_per_rank):
                expert_idx = int(ep_rank * num_experts_per_rank + local_expert_idx)
                mlp_l0_weight_list[expert_idx].append(layer.mlp.experts.local_experts[local_expert_idx].linear_fc1.weight.data)
                mlp_l1_weight_list[expert_idx].append(layer.mlp.experts.local_experts[local_expert_idx].linear_fc2.weight.data)
                if md.linear_bias:
                    mlp_l0_bias_list[expert_idx].append(layer.mlp.experts.local_experts[local_expert_idx].linear_fc1.bias.data)

        if md.linear_bias:
            # Get non-parallel tensors from tp_rank 0
            layer = get_transformer_block(tp_models[0])
            for local_expert_idx in range(num_experts_per_rank):
                expert_idx = int(ep_rank * num_experts_per_rank + local_expert_idx)
                mlp_l1_bias_list[expert_idx].append(layer.mlp.experts.local_experts[local_expert_idx].linear_fc2.bias.data)

    mlp_l0_weight_w_list = [[] for _ in range(margs.num_experts)]
    mlp_l0_weight_v_list = [[] for _ in range(margs.num_experts)]
    # Concat along the tensor parallel dimension
    for expert_idx in range(margs.num_experts):
        mlp_l0_weight = mlp_l0_weight_list[expert_idx]
        if md.swiglu:
            for tp_rank in range(tp_size):
                mlp_l0_weight[tp_rank] = torch.chunk(mlp_l0_weight[tp_rank], 2, dim=0)
            mlp_l0_weight_w_list[expert_idx] = torch.cat([w[0] for w in mlp_l0_weight], dim=0)
            mlp_l0_weight_v_list[expert_idx] = torch.cat([w[1] for w in mlp_l0_weight], dim=0)
        else:
            mlp_l0_weight_list[expert_idx] = torch.cat(mlp_l0_weight, dim=0)
        mlp_l1_weight_list[expert_idx] = torch.cat(mlp_l1_weight_list[expert_idx], dim=1)

    # Stack along the expert parallel dimension
    if md.swiglu:
        message["mlp l0 weight W"] = torch.stack(mlp_l0_weight_w_list)
        message["mlp l0 weight V"] = torch.stack(mlp_l0_weight_v_list)
        for tp_rank in range(tp_size):
            shared_expert_mlp_l0_weight[tp_rank] = torch.chunk(shared_expert_mlp_l0_weight[tp_rank], 2, dim=0)
        message["shared mlp l0 weight W"] = torch.cat([w[0] for w in shared_expert_mlp_l0_weight], dim=0)
        message["shared mlp l0 weight V"] = torch.cat([w[1] for w in shared_expert_mlp_l0_weight], dim=0)
    else:
        message["mlp l0 weight"] = torch.stack(mlp_l0_weight_list)
        message["shared mlp l0 weight"] = torch.cat(shared_expert_mlp_l0_weight, dim=0)
    message["shared mlp l1 weight"] = torch.cat(shared_expert_mlp_l1_weight, dim=1)
    message["mlp l1 weight"] = torch.stack(mlp_l1_weight_list)

    # Concat along TP and stack along EP to biases
    if md.linear_bias:
        mlp_l0_bias_w_list = [[] for _ in range(margs.num_experts)]
        mlp_l0_bias_v_list = [[] for _ in range(margs.num_experts)]
        # Concat along the tensor parallel dimension
        for expert_idx in range(margs.num_experts):
            mlp_l0_bias = mlp_l0_bias_list[expert_idx]
            if md.swiglu:
                for tp_rank in range(tp_size):
                    mlp_l0_bias[tp_rank] = torch.chunk(mlp_l0_bias[tp_rank], 2, dim=0)
                mlp_l0_bias_w_list[expert_idx] = torch.cat([w[0] for w in mlp_l0_bias], dim=0)
                mlp_l0_bias_v_list[expert_idx] = torch.cat([w[1] for w in mlp_l0_bias], dim=0)
            else:
                mlp_l0_bias_list[expert_idx] = torch.cat(mlp_l0_bias, dim=0)
            assert len(mlp_l1_bias_list[expert_idx]) == 1
            mlp_l1_bias_list[expert_idx] = mlp_l1_bias_list[expert_idx][0]

        # Stack along the expert parallel dimension
        if md.swiglu:
            message["mlp l0 bias W"] = torch.stack(mlp_l0_bias_w_list)
            message["mlp l0 bias V"] = torch.stack(mlp_l0_bias_v_list)
        else:
            message["mlp l0 bias"] = torch.stack(mlp_l0_bias_list)
        message["mlp l1 bias"] = torch.stack(mlp_l1_bias_list)

    # Simple concat of the rest
    message["qkv weight"] = torch.cat(qkv_weight, dim=0)
    message["dense weight"] = torch.cat(dense_weight, dim=1)
    if md.linear_bias:
        message["qkv bias"] = torch.cat(qkv_bias, dim=0)
    elif md.add_qkv_bias:
        message["qkv bias"] = torch.cat(qkv_bias, dim=0)

    # Do nothing to router
    message["router weight"] = router_weight
    message["shared gate weight"] = shared_expert_gate_weight
Could you refactor this block of code with better structure and reuse the duplicated code for better maintainability?
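As a rough illustration of the kind of reuse being asked for here, the SwiGLU chunk-then-concat pattern that appears separately for the expert weights, the shared-expert weights, and the biases could be folded into one helper. The sketch below is only an assumption about how that might look: `merge_fc1` and `merge_expert_fc1` are hypothetical names that do not exist in Megatron-LM, and the suffix keys simply mirror the `message[...]` key names used in the hunk above.

```python
import torch


def merge_fc1(shards, swiglu):
    """Merge the per-TP-rank fc1 shards (weight or bias) of one module.

    Hypothetical helper, not part of Megatron-LM. Returns a dict keyed by the
    suffix to append to the message key: "" without SwiGLU, " W" and " V" with it.
    """
    if not swiglu:
        return {"": torch.cat(shards, dim=0)}
    # With SwiGLU each shard holds its W half followed by its V half along
    # dim 0, so split every shard first and concatenate the halves separately.
    halves = [torch.chunk(shard, 2, dim=0) for shard in shards]
    return {
        " W": torch.cat([h[0] for h in halves], dim=0),
        " V": torch.cat([h[1] for h in halves], dim=0),
    }


def merge_expert_fc1(per_expert_shards, swiglu):
    """Merge each expert across TP ranks, then stack along a new expert dim."""
    merged = [merge_fc1(shards, swiglu) for shards in per_expert_shards]
    return {suffix: torch.stack([m[suffix] for m in merged]) for suffix in merged[0]}


# Toy check: shard a known W and V over two TP ranks, then merge them back.
tp_size, hidden, ffn = 2, 3, 4
w_full = torch.arange(ffn * hidden, dtype=torch.float32).reshape(ffn, hidden)
v_full = -w_full - 1
shards = [
    torch.cat([w_part, v_part], dim=0)
    for w_part, v_part in zip(
        torch.chunk(w_full, tp_size, dim=0), torch.chunk(v_full, tp_size, dim=0)
    )
]
merged = merge_fc1(shards, swiglu=True)
assert torch.equal(merged[" W"], w_full)
assert torch.equal(merged[" V"], v_full)
```

With helpers along these lines, each call site could shrink to a loop such as `for suffix, tensor in merge_expert_fc1(mlp_l0_weight_list, md.swiglu).items(): message["mlp l0 weight" + suffix] = tensor`, and the same two functions would cover the bias and shared-expert cases.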
done
0886d56 to 87dd51b
87dd51b to 2a07758
Hi. Thanks for your contribution. We actually already have HF->mcore conversion for non-MoE Qwen 2 and 2.5, but it's a little hidden, as it lives in the Llama/Mistral loader: https://github.com/NVIDIA/Megatron-LM/blob/main/tools/checkpoint/loader_llama_mistral.py
Its usage is already documented. Perhaps you could add MoE support to what we already have, and then we can look to merge your contribution.
usage example: examples/qwen/README.md