Implementing MoE Sparse Upcycling #9

Open · adumans opened this issue Sep 9, 2024 · 13 comments
@adumans commented Sep 9, 2024

Hello OLMoE Authors:

I have read the updates on the sparse upcycling method in the README and tried to implement it. I want to reproduce the sparse upcycling results in your paper, which load OLMo-1B (0724) at 2T tokens.

I downloaded the corresponding checkpoint from Hugging Face, but the HF version OLMo-1B-0724-hf (revision="step954000-tokens2000B") has two safetensors files (model-00001-of-00002.safetensors, model-00002-of-00002.safetensors), and it seems that the safetensors.torch.load_file call used in sparsify_ckpt_unsharded.py can't load two safetensors files. So I downloaded OLMo-1B instead, but that version has no "tokens2000B" revision; only "step477000-tokens2001B" is available.

Could you please tell me:

  1. Can OLMo-1B (revision=step477000-tokens2001B) reproduce the conclusions in Section 4.1.5 (Sparse Upcycling)? Is it the same as OLMo-1B-0724-hf (revision=step954000-tokens2000B)?
  2. Or is there other code that can load both safetensors files (model-00001-of-00002.safetensors and model-00002-of-00002.safetensors in OLMo-1B-0724-hf) and run the conversion from a dense model to an MoE? (See the sketch below for what I have in mind.)
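
For reference, a minimal sketch of merging the two HF shards into a single state dict before running sparsify_ckpt_unsharded.py might look like the following (this is just an assumption about a possible workaround, not the authors' script; file names are the ones listed above):

```python
# Hedged sketch: merge the two Hugging Face shards into a single state dict so
# that sparsify_ckpt_unsharded.py only has to deal with one file.
from safetensors.torch import load_file, save_file

shards = [
    "model-00001-of-00002.safetensors",
    "model-00002-of-00002.safetensors",
]

state_dict = {}
for shard in shards:
    # Keys are disjoint across shards, so a plain dict update is enough.
    state_dict.update(load_file(shard))

# Write a single-file checkpoint that the sparsify script can load as before.
save_file(state_dict, "model.safetensors")
```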

Thanks!

By the way, when I loaded OLMo-1B (revision=step477000-tokens2001B) using sparsify_ckpt_unsharded.py, the names in the state_dict look like "model.transformer.blocks.4.ff_proj.weight", so the block index sits at position 3 rather than 2, but lines 29 and 51 of the script use block_num = int(key.split(".")[2]).
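
A small sketch of a more robust way to pull out the block index (again just an assumption about how the script could be adapted, not the authors' code):

```python
# Hedged sketch: derive the block index from the position of "blocks" in the key
# instead of a hard-coded offset, so keys with or without the leading "model."
# prefix both work (e.g. "model.transformer.blocks.4.ff_proj.weight").
def block_index(key: str) -> int:
    parts = key.split(".")
    return int(parts[parts.index("blocks") + 1])

assert block_index("model.transformer.blocks.4.ff_proj.weight") == 4
assert block_index("transformer.blocks.4.ff_proj.weight") == 4
```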

@adumans changed the title to Implementing MoE Sparse Upcycling on Sep 9, 2024
@Muennighoff (Collaborator)

Thanks for the comment! I've added more details to https://github.com/allenai/OLMoE/blob/main/README.md#other-design-choices ; lmk if you still run into problems!

@adumans (Author) commented Sep 12, 2024

> Thanks for the comment! I've added more details to https://github.com/allenai/OLMoE/blob/main/README.md#other-design-choices ; lmk if you still run into problems!

I ran a small demo with a small portion of the data, and the contents of the output directory look like this:
config.yaml  data-indices  latest  step1000  step2000  step3000  step4000  step4229  train_data

The structure inside a stepxxx directory (for example, step4000) looks like this:

```
[ 84] step4000
├── [3.7K] config.yaml
├── [ 275] model
│   ├── [ 90K] metadata.json
│   ├── [3.2G] rank_0.safetensors
│   ├── [3.2G] rank_1.safetensors
│   ├── [3.2G] rank_2.safetensors
│   ├── [3.2G] rank_3.safetensors
│   ├── [3.2G] rank_4.safetensors
│   ├── [3.2G] rank_5.safetensors
│   ├── [3.2G] rank_6.safetensors
│   └── [3.2G] rank_7.safetensors
├── [ 275] optim
│   ├── [207K] metadata.json
│   ├── [6.4G] rank_0.safetensors
│   ├── [6.4G] rank_1.safetensors
│   ├── [6.4G] rank_2.safetensors
│   ├── [6.4G] rank_3.safetensors
│   ├── [6.4G] rank_4.safetensors
│   ├── [6.4G] rank_5.safetensors
│   ├── [6.4G] rank_6.safetensors
│   └── [6.4G] rank_7.safetensors
└── [ 134] train
    ├── [ 14K] rank0.pt
    ├── [ 14K] rank1.pt
    ├── [ 14K] rank2.pt
    ├── [ 14K] rank3.pt
    ├── [ 14K] rank4.pt
    ├── [ 14K] rank5.pt
    ├── [ 14K] rank6.pt
    └── [ 14K] rank7.pt
```

Is there a script available that can convert the output checkpoint (stepxxx) into a Hugging Face (HF) format model?

@Muennighoff (Collaborator)

I just added that as 7. here: https://github.com/allenai/OLMoE/tree/main?tab=readme-ov-file#pretraining ; Lmk if still unclear!

@adumans (Author) commented Sep 13, 2024

> I just added that as 7. here: https://github.com/allenai/OLMoE/tree/main?tab=readme-ov-file#pretraining ; Lmk if still unclear!

I tried running the model conversion to HF format, but I got: "KeyError: transformer.blocks.0.q_norm.weight".

I traced the error back and found that the checkpoint you provided (https://huggingface.co/allenai/OLMo-1B-0724-954000steps-unsharded) doesn't contain the parameters related to self-attention (q_norm, k_norm, v_norm, o_proj, etc.); it only includes the parameters related to the experts (FFN, such as ffn.experts.mlp.w1, etc.).

Do I need to run another script to merge these parameters, or could you provide a checkpoint that contains all parameters? (Also, an MoE checkpoint upcycled at 2T tokens, as in Figure 8 of the paper.)

@Muennighoff (Collaborator)

For the upcycling ablation we do not use QK Norm, so just deactivate that. You can take a look at this config: https://wandb.ai/ai2-llm/olmoe/runs/1w3srbb3/overview
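
A minimal sketch of what deactivating QK Norm could look like on the conversion side, assuming the converter builds the HF state dict key by key (the helper map_to_hf_name below is hypothetical):

```python
# Hedged sketch: copy QK-norm weights only when the dense checkpoint actually
# contains them; OLMo 1B is trained without QK Norm, so these keys are absent
# and should simply be skipped instead of raising a KeyError.
def copy_qk_norms(state_dict, new_state_dict, block_num, map_to_hf_name):
    # map_to_hf_name is a hypothetical callable translating OLMo key names
    # into the corresponding Hugging Face key names.
    for name in ("q_norm.weight", "k_norm.weight"):
        key = f"transformer.blocks.{block_num}.{name}"
        if key in state_dict:
            new_state_dict[map_to_hf_name(key)] = state_dict[key]
```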

@adumans (Author) commented Sep 14, 2024

> doesn't contain the parameters related to self-attention

The configuration I used for the demo above is similar to this one, yet the output safetensors do not contain the parameters related to self-attention.

Do you mean that the parameters related to self-attention (q_norm, k_norm, v_norm, o_proj, etc.) were kept frozen throughout the continued pretraining of the upcycling ablation? In other words, are these parameters identical to those of the dense model?

@Muennighoff (Collaborator)

They were not used in the upcycling because OLMo 1B does not have q_norm, k_norm, or v_norm.

@adumans (Author) commented Sep 14, 2024

> They were not used in the upcycling because OLMo 1B does not have q_norm, k_norm, or v_norm.

But the OLMoE model has q_norm, k_norm, and v_norm parameters; where did they come from? (Since OLMoE is upcycled from OLMo.)

@Muennighoff (Collaborator)

OLMoE is not upcycled from OLMo, sorry for the confusion. Is that not clear from the paper https://arxiv.org/abs/2409.02060 ?

@adumans (Author) commented Sep 14, 2024

> OLMoE is not upcycled from OLMo, sorry for the confusion. Is that not clear from the paper https://arxiv.org/abs/2409.02060 ?

Sorry, I think I misunderstood this part. Neither the upcycled MoE nor the "training from scratch" MoE in Figure 8 has the same structure as the final released OLMoE version.

@Muennighoff (Collaborator)

Yes, they have slightly different hyperparameters.

@adumans (Author) commented Sep 20, 2024

> Yes, they have slightly different hyperparameters.

Thanks! And in the upcycling experiment (Figure 8), was any other data strategy applied to the 610 billion tokens (such as sampling, data mixing, etc.)? I noticed a new class (IterableDataset) was created to solve the problem of deterministic shuffling.

@Muennighoff (Collaborator)

It is the same dataset as used for OLMo 1B, fast-forwarded to start from the batch where OLMo 1B finished (via --fast_forward_batches=136153); see wandb.
