MoEs in src/ and proper load balancing losses #8
base: main
Conversation
…cing loss and adapt logging
…tion includes moes
… logs bc loss forwarded through layers
@TJ-Solergibert suggested that we could make the MoE models a fully new model file (just like llama.py) with its own config, in order to keep things clean and separate; this also wouldn't break the current configs/conversion etc. The downside is that we would add a lot of copied code, since the model definition is essentially the same apart from which MLP definition to use and the extended forward that carries the aux_losses. Any thoughts on this?
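Not the PR's actual implementation, just a minimal sketch of the pattern under discussion, assuming a top-k router and a Switch-style load-balancing loss; all names here (MoEMLP, aux_loss, shapes) are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoEMLP(nn.Module):
    """Top-k routed MLP that also returns a load-balancing aux loss."""

    def __init__(self, hidden_size: int, intermediate_size: int,
                 num_experts: int, top_k: int = 2):
        super().__init__()
        self.num_experts = num_experts
        self.top_k = top_k
        self.router = nn.Linear(hidden_size, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden_size, intermediate_size),
                nn.SiLU(),
                nn.Linear(intermediate_size, hidden_size),
            )
            for _ in range(num_experts)
        )

    def forward(self, hidden_states: torch.Tensor):
        # hidden_states: [batch, seq, hidden]
        router_logits = self.router(hidden_states)              # [b, s, E]
        router_probs = F.softmax(router_logits, dim=-1)
        top_p, top_idx = router_probs.topk(self.top_k, dim=-1)  # [b, s, k]
        top_p = top_p / top_p.sum(dim=-1, keepdim=True)         # renormalize over the k chosen experts

        # Dense combination over experts (inefficient but simple, for illustration only).
        output = torch.zeros_like(hidden_states)
        for e, expert in enumerate(self.experts):
            weight = (top_p * (top_idx == e)).sum(dim=-1, keepdim=True)  # [b, s, 1]
            output = output + weight * expert(hidden_states)

        # Switch-Transformer-style load-balancing loss:
        # fraction of assignments per expert * mean router probability per expert.
        assignment = F.one_hot(top_idx, num_classes=self.num_experts).float()  # [b, s, k, E]
        tokens_frac = assignment.mean(dim=(0, 1, 2))             # [E]
        probs_mean = router_probs.mean(dim=(0, 1))               # [E]
        aux_loss = self.num_experts * torch.sum(tokens_frac * probs_mean)

        return output, aux_loss
```

Under this sketch, each decoder layer would return its hidden states together with the aux loss, the model would sum the per-layer losses, and the trainer would scale that sum and add it to the LM loss before logging, which is roughly what the commit messages above hint at.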
Hi! Yes, I suggest keeping the model separate, with its own model definition and config. This is a common practice in HuggingFace. For the …
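As a rough illustration of the "separate model + config" option (the class and field names below are made up, not the repo's actual ones), the MoE config could simply extend the dense one, so existing configs and the conversion path stay untouched:

```python
from dataclasses import dataclass


@dataclass
class LlamaConfig:
    """Stand-in for the existing dense model config."""
    hidden_size: int = 4096
    intermediate_size: int = 11008
    num_hidden_layers: int = 32


@dataclass
class LlaMoEConfig(LlamaConfig):
    """MoE-specific fields live only in the new config/model file."""
    num_experts: int = 8
    num_experts_per_token: int = 2
    aux_loss_coef: float = 0.01
```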
Hey @AleHD, I believe you discussed it with @TJ-Solergibert; is this ready to merge?
Copied from PR huggingface#192, issue huggingface#159.
For Swiss AI: we need to create a conversion script for MoEs. It probably makes sense to add it in another PR, @ischlag?