MoEs are here! 🎉
How to use nanotron's MoEs
To use nanotron's 3D-parallel implementation of MoEs, simply add a `dMoE` layer to your modeling code, as follows:
```python
self.block_sparse_moe = dMoE(
    config,
    expert_parallel_group=parallel_context.expert_pg,
    tp_pg=parallel_context.tp_pg,
    parallel_config=parallel_config,
)
```
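The block is then applied to the hidden states in the layer's forward pass. The line below is only a hedged sketch: the exact `dMoE` forward signature and return format are assumptions on my part, so defer to the linked example for the authoritative usage.

```python
# Hypothetical call site in the decoder layer's forward pass; the exact dMoE
# forward signature and return format may differ (see examples/moe/llamoe.py).
hidden_states = self.block_sparse_moe(hidden_states=hidden_states)["hidden_states"]
```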
See the full example in `examples/moe/llamoe.py`.
You can control the expert parallelism degree by setting `parallelism.expert_parallel_size`; the weight parallelism degree is the same as the tensor parallel degree.
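For illustration, here is a minimal sketch of these parallelism settings expressed programmatically via `ParallelismArgs` (the same kind of object passed as `parallel_config` above). The exact set of fields, including whether `expert_parallel_size` is exposed here rather than only in the YAML config, may vary across versions, so treat the field names as assumptions.

```python
from nanotron.config import ParallelismArgs

# A sketch of the parallelism settings; field names other than
# expert_parallel_size mirror the usual nanotron config and are assumed here.
parallel_config = ParallelismArgs(
    dp=2,                    # data parallel degree
    pp=1,                    # pipeline parallel degree
    tp=4,                    # tensor parallel degree (also the expert weight parallel degree)
    expert_parallel_size=2,  # expert parallel degree
)
```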
What's Changed
- Make tests pass by @NouamaneTazi in #52
- Refactoring tying mechanism + small fixes by @NouamaneTazi in #62
- [
Docs
] Fix typos by @StandardAI in #63 - quick fix train steps assertion by @NouamaneTazi in #66
- fix configs by @NouamaneTazi in #67
- [FP8 Training] A single forward and backward pass for a linear in FP8 by @xrsrke in #56
- Update bench script by @NouamaneTazi in #64
- Add CI/CD for unit tests by @xrsrke in #41
- Refactor `ParallelContext` and some process groups creation by @NouamaneTazi in #69
- Support Expert Parallelism by @NouamaneTazi in #72
- Add MoEs support by @NouamaneTazi in #73
- Implement pipeline parallel size-agnostic optimizer state loading by @nopperl in #71
New Contributors
- @StandardAI made their first contribution in #63
- @nopperl made their first contribution in #71
Full Changelog: v0.1...v0.2