FA3 Tracking #11

TJ-Solergibert · 2024-07-12T20:20:30Z

In this branch, we will track the evolution of FA3. The current state is:

No MQA/GQA
No BF16
Requires contiguous inputs
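
The three limitations above can be turned into a guard that runs before a layer is switched to FA3. This is a minimal sketch, not code from this branch: the helper name fa3_unsupported_reasons and its arguments are hypothetical, and it only encodes the restrictions listed above as they stand today.

```python
def fa3_unsupported_reasons(num_attention_heads, num_key_value_heads, dtype, is_contiguous):
    """Return the reasons FA3 cannot be used; an empty list means it can.

    Hypothetical helper: it only encodes the three limitations listed above.
    """
    reasons = []
    if num_key_value_heads != num_attention_heads:
        reasons.append("MQA/GQA is not supported yet")
    if dtype == "bf16":
        reasons.append("BF16 is not supported yet")
    if not is_contiguous:
        reasons.append("inputs must be contiguous")
    return reasons


# The setup used in the performance table: 32 query heads, 32 KV heads, fp16.
print(fa3_unsupported_reasons(32, 32, "fp16", True))
# A GQA + BF16 setup, which would have to stay on FA2 for now.
print(fa3_unsupported_reasons(32, 8, "bf16", True))
```

With real tensors, the is_contiguous argument would typically come from torch.Tensor.is_contiguous() on q, k and v.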

From the flash-attention repo:

Coming soon in the next couple of days / next week:
BF16
Variable length (FP16, BF16)
FP8 forward

Installation

Refer to the official FA repo.
This installs a separate package, flashattn-hopper, so you can keep flash_attn for the LayerNorm and RoPE embeddings.
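
To confirm that both packages are visible in the same environment, a quick check like the one below may help. It is only a sketch: flash_attn is the FA2 import name, while flash_attn_interface is an assumption about the module name exposed by the FA3 beta and may differ in your build.

```python
import importlib.util


def installed(module_names):
    """Map each top-level module name to True if it can be found, without importing it."""
    return {name: importlib.util.find_spec(name) is not None for name in module_names}


# "flash_attn_interface" is an assumed module name for the FA3 beta; adjust if needed.
print(installed(["flash_attn", "flash_attn_interface"]))
```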

Run experiments

In this branch, I have added a configuration option, model.model_config.use_fa3, to choose between FA2 and FA3 so that the performance of both is easier to compare. You can use the configuration examples/config_llama3_fa3.yaml, which builds a Llama3-8B model with fewer decoder layers so that the 8192 sequence length fits on 1 GPU.

On GH200 nodes, the extra VRAM allows num_hidden_layers = 11. On systems with H100, use num_hidden_layers = 8. Don't forget to edit the dataset_folder and tokenizer_name_or_path fields if necessary.
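
When comparing several variants (FA2 vs FA3, 8 vs 11 layers), it can be convenient to derive each one from a single base configuration. The sketch below applies dotted-key overrides to a nested dict standing in for the parsed YAML. The helper with_overrides is hypothetical, and placing num_hidden_layers under model.model_config next to use_fa3 is an assumption; check the actual layout of examples/config_llama3_fa3.yaml.

```python
import copy


def with_overrides(config, overrides):
    """Return a deep copy of a nested dict with dotted-key overrides applied."""
    result = copy.deepcopy(config)
    for dotted_key, value in overrides.items():
        *parents, leaf = dotted_key.split(".")
        node = result
        for key in parents:
            # Create intermediate levels if the base config lacks them.
            node = node.setdefault(key, {})
        node[leaf] = value
    return result


# Stand-in for the parsed YAML; the real file has many more fields.
base = {"model": {"model_config": {"use_fa3": False, "num_hidden_layers": 8}}}

# GH200 variant with FA3 enabled (key paths are assumptions, see above).
gh200_fa3 = with_overrides(base, {
    "model.model_config.use_fa3": True,
    "model.model_config.num_hidden_layers": 11,
})
print(gh200_fa3)
```

The base dict is left untouched, so the FA2 and FA3 runs can be generated from the same starting point.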

Performance

I will keep updating this table as new features are incorporated; as the FA authors mentioned, FA3 is currently in a beta release. The MFU reported is the one computed by nanotron.

Date	FA	Precision	num_attention_heads	num_key_value_heads	num_hidden_layers	Batch Size	Sequence Length	MFU (TFLOPs)	VRAM (GB)
12/7	2	bf16	32	32	11	1	8192	348	94
12/7	2	fp16	32	32	11	1	8192	381	94
12/7	3	fp16	32	32	11	1	8192	446	94
12/7	2	bf16	32	32	8	1	8192	342	77
12/7	2	fp16	32	32	8	1	8192	370	77
12/7	3	fp16	32	32	8	1	8192	433	77
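
To read the table as relative gains, the throughput values can be divided directly. The snippet below is a small sketch that copies the numbers from the table above; the dictionary layout and the function name fa3_speedup are only for illustration.

```python
# (FA version, precision, num_hidden_layers) -> throughput in TFLOPs, copied from the table.
results = {
    (2, "bf16", 11): 348, (2, "fp16", 11): 381, (3, "fp16", 11): 446,
    (2, "bf16", 8): 342, (2, "fp16", 8): 370, (3, "fp16", 8): 433,
}


def fa3_speedup(layers, baseline_precision):
    """Ratio of FA3 fp16 throughput to an FA2 baseline with the same layer count."""
    return results[(3, "fp16", layers)] / results[(2, baseline_precision, layers)]


for layers in (11, 8):
    for precision in ("fp16", "bf16"):
        ratio = fa3_speedup(layers, precision)
        print(f"{layers} layers, FA3 fp16 vs FA2 {precision}: {ratio:.2f}x")
```

Against FA2 in fp16, FA3 comes out at roughly 1.17x in both the 11-layer and the 8-layer setup. The comparison against FA2 in bf16 mixes precisions, so treat it as indicative until BF16 lands in FA3.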

Adding FA3 support

3cfc795
