[Work in progress] Add FP8 support in fwd_prefill #115
base: main_perf
Conversation
- feat: added fp32 output to input_helper
- feat: fp8 tests passing, small amount of error
- added fp8e5m2 type
- note: RuntimeError: "abs_cuda" not implemented for 'Float8_e4m3fnuz'
- enabled fp8 GEMMs
- fix: error down to < 0.1
- added another fp8 dtype
- best accuracy is with no scaling
- improved accuracy to within < 0.02; issue related to torch-side casting
- fix: passes if we allow v to be fp16 instead of fp8, otherwise we have error < 0.1
- all error is < 0.07
- feat: added per-head scaling tensors
- progress towards implementing scaling tensors in kernel
- save
- issue: error caused by acc += tl.dot(p.to(v.type.element_ty), v)
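The casting errors listed above can be reproduced outside the kernel. Here is a minimal sketch (not code from this PR) of the round-trip error introduced by plain fp8 casting, using torch.float8_e4m3fn as a stand-in for the Float8_e4m3fnuz type targeted on ROCm:

```python
import torch

# Illustrative only: quantize to fp8 and back to see how much error casting alone adds.
x = torch.randn(1024, 64)
x_fp8 = x.to(torch.float8_e4m3fn)   # stand-in dtype; the PR targets Float8_e4m3fnuz on ROCm
x_back = x_fp8.to(torch.float32)
# The worst-case round-trip error is a few percent of the largest values, which is
# already comparable to the < 0.1 differences reported in the commit messages above.
print((x - x_back).abs().max())
```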
Error: UnboundLocalError: local variable 'q_scale_stride_z' referenced before assignment. Fix: initialize 'q_scale_stride_z' and 'kv_scale_stride_z' with default values before the conditional assignment.
Warning: I don't know if this is the correct thing to do.
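For reference, a minimal sketch of the fix described above, assuming the strides are only set when scaling tensors are provided (the variable names come from the error message; the real launcher code differs):

```python
import torch

def scale_strides(q_scale=None, k_scale=None):
    # Give both variables a default so they are always bound, avoiding the
    # UnboundLocalError when no scaling tensors are passed in.
    q_scale_stride_z = 0
    kv_scale_stride_z = 0
    if q_scale is not None and k_scale is not None:
        q_scale_stride_z = q_scale.stride(0)
        kv_scale_stride_z = k_scale.stride(0)
    return q_scale_stride_z, kv_scale_stride_z

print(scale_strides())                                    # (0, 0) without scaling tensors
print(scale_strides(torch.ones(4, 8), torch.ones(4, 8)))  # (8, 8) with per-head scales
```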
Warning - 2 test cases are failing due to this change (AssertionError: Tensor-likes are not close!):

FAILED test.py::test_op_prefill_fwd_impl[False-dtype1-True-bshd-0.0-False-4-6-6-1024-1023-32]
Mismatched elements: 1 / 786432 (0.0%)
Greatest absolute difference: 0.14855387806892395 at index (0, 309, 2, 18) (up to 0.1009 allowed)
Greatest relative difference: 0.28865116834640503 at index (0, 309, 2, 18) (up to 0.09128 allowed)

FAILED test.py::test_op_prefill_fwd_impl[False-dtype1-False-bshd-0.0-False-4-6-6-1024-1023-32]
Mismatched elements: 1 / 786432 (0.0%)
Greatest absolute difference: 0.14855387806892395 at index (0, 309, 2, 18) (up to 0.1009 allowed)
Greatest relative difference: 0.28865116834640503 at index (0, 309, 2, 18) (up to 0.09128 allowed)
Two tests are still failing.
* Do not track gradients for scale factors.
* Handle the case where the maximum absolute value equals zero in the per-batch / per-head scaling method.
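A sketch of both points under assumed conventions (scale = amax / fp8_max per batch and head, a bhsd-style (Z, H, N_CTX, D_HEAD) layout, and torch.float8_e4m3fn as a stand-in dtype); this is illustrative, not the PR's actual implementation:

```python
import torch

def per_batch_head_scale(x: torch.Tensor, fp8_dtype=torch.float8_e4m3fn) -> torch.Tensor:
    """Compute a (Z, H) scale tensor for an input of shape (Z, H, N_CTX, D_HEAD)."""
    fp8_max = torch.finfo(fp8_dtype).max
    with torch.no_grad():                    # do not track gradients for scale factors
        amax = x.abs().amax(dim=(-2, -1))    # max |x| per (batch, head)
        scale = amax / fp8_max
        # If a (batch, head) slice is all zeros, fall back to a scale of 1.0
        # so later divisions by the scale stay well defined.
        scale = torch.where(amax == 0, torch.ones_like(scale), scale)
    return scale

q = torch.randn(2, 4, 128, 64)
q[0, 0] = 0.0                                # an all-zero slice still gets a usable scale
print(per_batch_head_scale(q))
```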
q and k were just converted to fp32 5 lines before.
The intention is to document the fp8 module for other devs.
Now the function accepts multiple tensors as input arguments.
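A hypothetical illustration of that change (the helper name and signature below are made up for this sketch; the real input_helper differs):

```python
import torch

def cast_to_fp32(*tensors):
    """Cast any number of tensors to fp32, returning them in the same order."""
    return tuple(t.to(torch.float32) for t in tensors)

q, k, v = (torch.randn(2, 4, 128, 64, dtype=torch.float16) for _ in range(3))
q32, k32, v32 = cast_to_fp32(q, k, v)
print(q32.dtype, k32.dtype, v32.dtype)  # torch.float32 torch.float32 torch.float32
```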
Warning:
* The "thd" varlen layout is still not supported.
* The "bshd" and "bhsd" layouts only work when HQ == HK.
Add support for MQA and GQA (see the head-mapping sketch below).
Add support for the "thd" varlen layout.
The intent is to make code review easier.
This commit also reduces whitespace changes to facilitate code review.
This commit reduces code review overhead and keeps things as they were before the introduction of fp8.
This is a temporary commit and it should be reverted before merging. This data will be studied and then deleted from the commit history.
The error tolerance is greater than the one used when comparing the fp8 Triton kernel against the fp8 PyTorch reference implementation.
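For illustration, this is roughly what a looser-tolerance comparison looks like; the atol/rtol values and tensors below are stand-ins, not the ones used in test.py:

```python
import torch

torch.manual_seed(0)
triton_out = torch.randn(4, 1024, 6, 32)                      # stand-in for the Triton fp8 output
torch_ref = triton_out + 0.01 * torch.randn_like(triton_out)  # stand-in for the PyTorch reference
# Tolerances are looser here than for the fp8-Triton vs fp8-PyTorch comparison.
torch.testing.assert_close(triton_out, torch_ref, atol=1e-1, rtol=1e-1)
print("within tolerance")
```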
benchmarks/benchmark_fp8.py
@@ -0,0 +1,143 @@
# Install the newest triton version with
Let us remove the benchmarking code for now. Let us just deal with functionality and minimize the diff to main_perf
Sure, I'll remove this file.
Resolved by c0dd573.
@@ -0,0 +1,545 @@
Z,HQ,HK,N_CTX_Q,N_CTX_K,D_HEAD,causal,dropout_p,layout,use_exp2,scale_per_head,mismatched_elems,total_elems,mismatched_percentage,greatest_abs_diff,greatest_rel_diff
We should not be committing the csv files, right?
My intention is not to commit this file at all. I just used GitHub as a shortcut to get data out of the Citrix environment. I'll remove this and the other related shell script file.
Resolved by 0494786.
This reverts commit 9de6785.