MoEs are here! 🎉
How to use nanotron's MoEs
To use nanotron's 3D-parallel implementation of MoEs, simply add a `dMoE` layer to your modeling code, as follows:
```python
self.block_sparse_moe = dMoE(
    config,
    expert_parallel_group=parallel_context.expert_pg,
    tp_pg=parallel_context.tp_pg,
    parallel_config=parallel_config,
)
```
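The block is then applied to the hidden states in the layer's forward pass. The line below is only a hedged sketch: the exact `dMoE` forward signature and return format are assumptions on my part, so defer to the linked example for the authoritative usage.

```python
# Hypothetical call site in the decoder layer's forward pass; the exact dMoE
# forward signature and return format may differ (see examples/moe/llamoe.py).
hidden_states = self.block_sparse_moe(hidden_states=hidden_states)["hidden_states"]
```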
See the full example in `examples/moe/llamoe.py`.
You can control the expert parallelism degree by setting `parallelism.expert_parallel_size`; the weight parallelism degree is the same as the tensor parallel degree.
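For illustration, here is a minimal sketch of these parallelism settings expressed programmatically via `ParallelismArgs` (the same kind of object passed as `parallel_config` above). The exact set of fields, including whether `expert_parallel_size` is exposed here rather than only in the YAML config, may vary across versions, so treat the field names as assumptions.

```python
from nanotron.config import ParallelismArgs

# A sketch of the parallelism settings; field names other than
# expert_parallel_size mirror the usual nanotron config and are assumed here.
parallel_config = ParallelismArgs(
    dp=2,                    # data parallel degree
    pp=1,                    # pipeline parallel degree
    tp=4,                    # tensor parallel degree (also the expert weight parallel degree)
    expert_parallel_size=2,  # expert parallel degree
)
```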
What's Changed
- Make tests pass by @NouamaneTazi in #52
- Refactoring tying mechanism + small fixes by @NouamaneTazi in #62
- [
Docs
] Fix typos by @StandardAI in #63 - quick fix train steps assertion by @NouamaneTazi in #66
- fix configs by @NouamaneTazi in #67
- [FP8 Training] A single forward and backward pass for a linear in FP8 by @xrsrke in #56
- Update bench script by @NouamaneTazi in #64
- Add CI/CD for unit tests by @xrsrke in #41
- Refactor `ParallelContext` and some process groups creation by @NouamaneTazi in #69
- Support Expert Parallelism by @NouamaneTazi in #72
- Add MoEs support by @NouamaneTazi in #73
- Implement pipeline parallel size-agnostic optimizer state loading by @nopperl in #71
New Contributors
- @StandardAI made their first contribution in #63
- @nopperl made their first contribution in #71
Full Changelog: v0.1...v0.2