Update theoretical memory footprint formula #1345

okoge-kaz · 2025-01-03T12:27:27Z

number of parameters

- + (2 / args.hidden_size)
+ + (1 / args.hidden_size)

The whole expression is already multiplied by 2 * args.num_layers * args.hidden_size * args.hidden_size, so the layernorm coefficient should be 1, not 2.
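The arithmetic behind this change can be checked with a minimal sketch. The function name `layernorm_param_count` and the `coefficient` argument are hypothetical and only isolate the layernorm term; the other terms of the upstream formula are left out.

```python
def layernorm_param_count(num_layers: int, hidden_size: int, coefficient: int) -> float:
    """Layernorm contribution to the transformer-layer parameter count.

    Hypothetical helper: it isolates the `coefficient / hidden_size` term
    after multiplication by the shared prefactor
    2 * num_layers * hidden_size * hidden_size.
    """
    prefactor = 2 * num_layers * hidden_size * hidden_size
    return prefactor * (coefficient / hidden_size)


num_layers, hidden_size = 32, 4096  # illustrative values, not from the PR

old = layernorm_param_count(num_layers, hidden_size, coefficient=2)
new = layernorm_param_count(num_layers, hidden_size, coefficient=1)

# The old coefficient counts 4 * hidden_size per layer,
# the new one counts 2 * hidden_size per layer.
print(old / num_layers, new / num_layers)
```

With the coefficient set to 1, each layer contributes 2 * hidden_size layernorm parameters, which would match two norms with one weight vector each. That reading is an assumption, not something the diff states.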

final layernorm

- (num_parameters_in_transformer_layers / args.pipeline_model_parallel_size)
+ (num_parameters_in_transformer_layers - args.hidden_size) / args.pipeline_model_parallel_size

The final layernorm exists only on the last pipeline stage, so for the other stages its parameters must be subtracted from the count (- args.hidden_size).
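A minimal sketch of the before and after expressions, assuming the final layernorm holds a single weight vector of size hidden_size. The function names and the input numbers are hypothetical.

```python
def per_stage_params_old(num_params_in_layers: int, pp_size: int) -> float:
    # Old expression: the final layernorm is spread over every stage.
    return num_params_in_layers / pp_size


def per_stage_params_new(num_params_in_layers: int, hidden_size: int, pp_size: int) -> float:
    # New expression: drop the final layernorm (hidden_size parameters,
    # by assumption) before dividing across pipeline stages.
    return (num_params_in_layers - hidden_size) / pp_size


# Illustrative numbers only.
diff = per_stage_params_old(1_000_000, 4) - per_stage_params_new(1_000_000, 4096, 4)
print(diff)  # hidden_size / pp_size parameters fewer per stage
```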

gradients' type

- 6 + (12 / args.data_parallel_size)
+ (2 + gradient_accumulation_factor) + (12 / args.data_parallel_size)

When args.accumulate_allreduce_grads_in_fp32 is True, the coefficient is 6: 2 bytes for the parameters (bf16) plus 4 bytes for the gradients (fp32). When it is False, the coefficient is 4, so the two cases must be handled separately.
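The case separation can be sketched as follows. The function name `bytes_per_parameter` is hypothetical, and the values 4 and 2 for `gradient_accumulation_factor` are inferred from the coefficients 6 and 4 given above. The `12 / data_parallel_size` term is assumed to be optimizer state sharded across data-parallel ranks.

```python
def bytes_per_parameter(accumulate_allreduce_grads_in_fp32: bool,
                        data_parallel_size: int) -> float:
    """Bytes of weight, gradient, and optimizer memory per parameter.

    Hypothetical helper mirroring the new expression in the diff.
    """
    # Gradients take 4 bytes in fp32, 2 bytes when kept in bf16.
    gradient_accumulation_factor = 4 if accumulate_allreduce_grads_in_fp32 else 2
    # 2 bytes for the bf16 parameters themselves.
    return (2 + gradient_accumulation_factor) + (12 / data_parallel_size)


# Illustrative data-parallel size of 4.
print(bytes_per_parameter(True, 4))   # fp32 gradients
print(bytes_per_parameter(False, 4))  # bf16 gradients
```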

flash attention

- if not args.sequence_parallel or args.recompute_granularity != 'selective':
+ if not args.sequence_parallel or not (
+      args.recompute_granularity == 'selective' or args.use_flash_attn is True
+  ):

The memory footprint can be calculated not only when args.recompute_granularity is 'selective' but also when args.use_flash_attn is True, so that case is added to the condition.
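To see where the old and new conditions differ, here is a small sketch that evaluates both on plain values. The function names are hypothetical, and what the branch does when the condition holds is not shown in the diff, so it is left out here.

```python
def old_condition(sequence_parallel: bool, recompute_granularity, use_flash_attn: bool) -> bool:
    # Condition before the change; use_flash_attn is ignored.
    return not sequence_parallel or recompute_granularity != 'selective'


def new_condition(sequence_parallel: bool, recompute_granularity, use_flash_attn: bool) -> bool:
    # Condition after the change; flash attention now counts as well.
    return not sequence_parallel or not (
        recompute_granularity == 'selective' or use_flash_attn is True
    )


# The two conditions differ only with sequence parallelism on,
# non-selective recomputation, and flash attention enabled.
case = (True, None, True)
print(old_condition(*case), new_condition(*case))
```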

SwiGLU

            (
                # SwiGLU
                2 * b * s * h  # input
                + 2 * b * s * args.ffn_hidden_size  # up_proj
                + 2 * b * s * args.ffn_hidden_size  # gate_proj
                + 2 * b * s * args.ffn_hidden_size  # act_fn
                + 2 * b * s * args.ffn_hidden_size  # down_proj
            ) if args.swiglu else (
                2 * b * s * h  # h -> ffn_h
                + 2 * b * s * args.ffn_hidden_size  # act
                + 2 * b * s * args.ffn_hidden_size  # ffn_h  -> h
            )

Added conditional branching to support both GPT and Llama architectures.
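The two branches can be compared with a minimal standalone sketch. The function name `mlp_activation_bytes` is hypothetical. It assumes b is the micro-batch size, s the sequence length, h the hidden size, and that the leading factor 2 is the byte width of a bf16 or fp16 element.

```python
def mlp_activation_bytes(b: int, s: int, h: int, ffn_hidden_size: int, swiglu: bool) -> int:
    """Activation bytes for one MLP block, following the terms in the diff."""
    input_bytes = 2 * b * s * h
    ffn_bytes = 2 * b * s * ffn_hidden_size
    if swiglu:
        # up_proj, gate_proj, act_fn, down_proj: four ffn-sized tensors.
        return input_bytes + 4 * ffn_bytes
    # Non-gated MLP: activation plus the ffn_h -> h input, two ffn-sized tensors.
    return input_bytes + 2 * ffn_bytes


# Illustrative shapes only.
print(mlp_activation_bytes(1, 2, 4, 8, swiglu=True))
print(mlp_activation_bytes(1, 2, 4, 8, swiglu=False))
```

For the same ffn_hidden_size, the SwiGLU branch stores two more ffn-sized tensors than the non-gated branch.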

other changes

The theoretical memory footprint is easier to read in GB, so I changed the unit from MB to GB.
The theoretical memory footprint per GPU is now calculated correctly when CP (Context Parallelism) is enabled.
GQA (Grouped Query Attention) is now supported.

okoge-kaz added 6 commits January 3, 2025 16:55

Fix number of parameters calculation logic

3f7c06a

Fix calc logic for num of param on first pipeline stage

03d1e9d

Fix calc logic for number of param when accumulate_allreduce_grads_in_fp32 is False

fa2cafb

Change the display unit for memory footprints from MB to GB

1c11f75

Change the condition so that memory footprint is output not only when selective recomputation is used, but also when flash attention is used

1242f36

Update to an activation memory footprint formula that takes into account swiglu, GQA, and CP

07df205

okoge-kaz marked this pull request as ready for review January 3, 2025 12:44
