Update theoretical memory footprint formula #1345
Open
number of parameters
Because the whole sum is already multiplied by `2 * args.num_layers * args.hidden_size * args.hidden_size`, the coefficient of the per-layer layernorm term should be 1, not 2.
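As a rough illustration of why the coefficient works out to 1 once the common prefactor is pulled out (a minimal sketch, not the actual code touched by this PR; the attention and MLP terms are simplified to plain multi-head attention and a non-gated MLP):

```python
from types import SimpleNamespace

def transformer_param_count(args):
    """Sketch of the transformer-layer parameter count, factored the same way
    as the formula: everything is scaled by 2 * num_layers * hidden_size^2."""
    return (
        2
        * args.num_layers
        * args.hidden_size
        * args.hidden_size
        * (
            # QKV + output projections (multi-head attention, no GQA): 4 * h^2 per layer.
            2
            # MLP up + down projections: 2 * h * ffn per layer.
            + (args.ffn_hidden_size / args.hidden_size)
            # Two layernorms per layer, hidden_size weights each: 2 * h per layer,
            # which is 1 / h after dividing by the 2 * h^2 prefactor -- hence coefficient 1.
            + (1 / args.hidden_size)
        )
    )

args = SimpleNamespace(num_layers=32, hidden_size=4096, ffn_hidden_size=16384)
print(f"{transformer_param_count(args):,.0f}")
```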
final layernorm
Since the final layernorm only exists on the last pipeline stage, its parameters have to be subtracted with `- args.hidden_size` for every other stage.
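A sketch of that adjustment (the function name and `is_last_pp_stage` flag are illustrative, not Megatron-LM's API):

```python
from types import SimpleNamespace

def adjust_for_pipeline_stage(num_parameters, args, is_last_pp_stage):
    # The final layernorm (hidden_size weights) only exists on the last
    # pipeline stage, so every other stage subtracts it from its count.
    if not is_last_pp_stage:
        num_parameters -= args.hidden_size
    return num_parameters

args = SimpleNamespace(hidden_size=4096)
print(adjust_for_pipeline_stage(1_000_000, args, is_last_pp_stage=False))  # 995904
```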
gradients' type
When `args.accumulate_allreduce_grads_in_fp32` is True, the coefficient can be 6 (bf16 parameters + fp32 gradients), but when it is False it becomes 4 (bf16 parameters + bf16 gradients), so the two cases need to be handled separately.
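A sketch of the case separation, expressed as bytes per parameter for weights plus gradients under bf16 training:

```python
def weight_and_grad_bytes_per_param(accumulate_allreduce_grads_in_fp32: bool) -> int:
    # bf16 parameters take 2 bytes each; gradients take 4 bytes when they are
    # accumulated / all-reduced in fp32, otherwise 2 bytes in bf16.
    if accumulate_allreduce_grads_in_fp32:
        return 2 + 4  # parameters (bf16) + gradients (fp32) = 6
    return 2 + 2      # parameters (bf16) + gradients (bf16) = 4

print(weight_and_grad_bytes_per_param(True))   # 6
print(weight_and_grad_bytes_per_param(False))  # 4
```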
flash attention
Since the activation memory can be estimated not only when `args.recompute_granularity` is "selective" but also when `args.use_flash_attn` is True, that flag is added to the condition.
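A sketch of the broadened guard (the function name is illustrative):

```python
from types import SimpleNamespace

def can_estimate_activation_memory(args) -> bool:
    # Neither selective recomputation nor flash attention materializes the
    # full attention-score matrix, so the same activation estimate applies
    # in both cases.
    return args.recompute_granularity == "selective" or args.use_flash_attn

args = SimpleNamespace(recompute_granularity="full", use_flash_attn=True)
print(can_estimate_activation_memory(args))  # True
```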
SwiGLU
Added a conditional branch so that both GPT-style (plain MLP) and Llama-style (SwiGLU MLP) architectures are covered.
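A sketch of the branch: SwiGLU uses three MLP weight matrices (gate, up, down) instead of two, so the MLP contribution is scaled by 3/2 when `args.swiglu` is set (the helper below is illustrative):

```python
from types import SimpleNamespace

def mlp_term(args):
    # SwiGLU (Llama-style) has gate, up, and down projections instead of the
    # two projections of a plain GELU MLP, i.e. 3/2 times the MLP weights.
    gated_linear_multiplier = 3 / 2 if args.swiglu else 1
    return (args.ffn_hidden_size / args.hidden_size) * gated_linear_multiplier

gpt = SimpleNamespace(ffn_hidden_size=16384, hidden_size=4096, swiglu=False)
llama = SimpleNamespace(ffn_hidden_size=11008, hidden_size=4096, swiglu=True)
print(mlp_term(gpt), mlp_term(llama))  # 4.0 4.03125
```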
other changes