Use DotOp layout for UpcastMXFPOp Lowering #3057
Conversation
// CHECK: [[CVT_ARG0:%.*]] = ttg.convert_layout [[ARG0]] : tensor<128x32xi8, [[BLOCKED]]> -> tensor<128x32xi8, #ttg.dot_op<{opIdx = 0, parent = [[DPAS]], kWidth = 2}>>
// CHECK: [[CVT_ARG1:%.*]] = ttg.convert_layout [[ARG1]] : tensor<128x2xi8, [[BLOCKED1]]> -> tensor<128x2xi8, [[BLOCKED3]]>
// CHECK: [[UPCAST:%.*]] = ttg.upcast_mxfp [[CVT_ARG0]], [[CVT_ARG1]] fp_type = e2m1 : tensor<128x32xi8, #ttg.dot_op<{opIdx = 0, parent = [[DPAS]], kWidth = 2}>>, tensor<128x2xi8, [[BLOCKED3]]> -> tensor<128x64xbf16, #ttg.dot_op<{opIdx = 0, parent = [[DPAS1]], kWidth = 4}>>
// CHECK: [[A:%.*]] = ttg.convert_layout [[UPCAST]] : tensor<128x64xbf16, #ttg.dot_op<{opIdx = 0, parent = [[DPAS1]], kWidth = 4}>> -> tensor<128x64xbf16, #ttg.dot_op<{opIdx = 0, parent = [[DPAS]], kWidth = 2}>>
For E2M1, 2 fp4 elements are packed into 1 int8, so when upcasting e.g. <32x32xi8> to <32x64xbf16>, the output dot layout must give each thread twice the contiguous element access of the input dot layout.
The common code does this by doubling kWidth, e.g. 4 -> 8 for e2m1 to bf16. For the Intel GPU DPAS layout, we can instead change opsPerChan from 2 to 4 to meet the requirement and convert back with a ConvertLayoutOp. (A plain-Python sketch of the fp4 unpacking follows the layout diagrams below.)
Input tensor layout:
#ttg.dot_op<{opIdx = 0, parent = #triton_intel_gpu.dpas<{repeatCount = 8, systolicDepth = 8, executionSize = 16, opsPerChan = 2, threadsPerWarp = 16, warpsPerCTA = [4, 1], repCluster = [1, 1], A = [8, 16], B = [16, 16], C = [8, 16]}>, kWidth = 2}>
[ T0:0, T1:0, ... T14:0, T15:0, T0:8, T1:8, ... T14:8, T15:8]
[ T0:1, T1:1, ... T14:1, T15:1, T0:9, T1:9, ... T14:9, T15:9]
...
[ T0:7, T1:7, ... T14:7, T15:7, T0:15, T1:15, ... T14:15, T15:15]
[ T16:0, T17:0, ... T30:0, T31:0, T16:8, T17:8, ... T30:8, T31:8]
...
[ T16:7, T17:7, ... T30:7, T31:7, T16:15, T17:15, ... T30:15, T31:15]
[ T32:0, T33:0, ... T46:0, T47:0, T32:8, T33:8, ... T46:8, T47:8]
...
[ T48:0, T49:0, ... T62:0, T63:0, T48:8, T49:8, ... T62:8, T63:8]
...
Output tensor layout:
#ttg.dot_op<{opIdx = 0, parent = #triton_intel_gpu.dpas<{repeatCount = 8, systolicDepth = 8, executionSize = 16, opsPerChan = 4, threadsPerWarp = 16, warpsPerCTA = [4, 1], repCluster = [1, 1], A = [8, 32], B = [32, 16], C = [8, 16]}>, kWidth = 4}>
[ T0:0, T0:1, ... T15:0, T15:1, T0:16, T0:17, ... T15:16, T15:17]
[ T0:2, T0:3, ... T15:2, T15:3, T0:18, T0:19, ... T15:18, T15:19]
...
[ T0:14, T0:15, ... T15:14, T15:15, T0:30, T0:31, ... T15:30, T15:31]
[ T16:0, T16:1, ... T31:0, T31:1, T16:16, T16:17, ... T31:16, T31:17]
...
[ T16:14, T16:15, ... T31:14, T31:15, T16:30, T16:31, ... T31:30, T31:31]
[ T32:0, T32:1, ... T47:0, T47:1, T32:16, T32:17, ... T47:16, T47:17]
...
[ T48:0, T48:1, ... T63:0, T63:1, T48:16, T48:17, ... T63:16, T63:17]
...
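A minimal plain-Python sketch (no Triton APIs; the low-nibble-first packing order is an assumption for illustration) of the E2M1 unpacking that motivates the doubled per-thread access above:

```python
# Sketch only: each int8 packs two e2m1 (fp4) codes, so a <128x32xi8> operand
# expands to <128x64xbf16>, and each thread must hold twice as many contiguous
# elements (kWidth 2 -> 4 / opsPerChan 2 -> 4 in the layouts above).
# The low-nibble-first ordering below is assumed for illustration.

E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # the 8 non-negative e2m1 values

def decode_e2m1(code: int) -> float:
    """Decode one 4-bit e2m1 code: 1 sign bit, 2 exponent bits, 1 mantissa bit."""
    sign = -1.0 if code & 0x8 else 1.0
    return sign * E2M1_MAGNITUDES[code & 0x7]

def unpack_byte(byte: int) -> tuple[float, float]:
    """One packed int8 yields two upcast elements."""
    return decode_e2m1(byte & 0xF), decode_e2m1((byte >> 4) & 0xF)

print(unpack_byte(0x19))  # (-0.5, 0.5): one i8 column becomes two bf16 columns
```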
// CHECK: [[B:%.*]] = tt.fp_to_fp [[CVT_ARG0]] : tensor<64x128xf8E4M3FN, #ttg.dot_op<{opIdx = 1, parent = [[DPAS]], kWidth = 2}>> -> tensor<64x128xbf16, #ttg.dot_op<{opIdx = 1, parent = [[DPAS]], kWidth = 2}>>
// CHECK: [[D:%.*]] = tt.dot [[A]], [[B]], [[C]] : tensor<32x64xbf16, #ttg.dot_op<{opIdx = 0, parent = [[DPAS]], kWidth = 2}>> * tensor<64x128xbf16, #ttg.dot_op<{opIdx = 1, parent = [[DPAS]], kWidth = 2}>> -> tensor<32x128xf32, [[DPAS]]>
// CHECK: [[RES:%.*]] = ttg.convert_layout [[D]] : tensor<32x128xf32, [[DPAS]]> -> tensor<32x128xf32, [[BLOCKED4]]>
// CHECK: scf.yield [[RES]] : tensor<32x128xf32, [[BLOCKED4]]>
Here we transpose the dot_scaled operands before lowering, converting the RHS UpcastMXFP into an LHS UpcastMXFP instead of implementing RHS UpcastMXFP directly.
With RHS scaling under a dot layout, each thread accesses elements along a column, which would require scale values from threads in other warps, but we can only shuffle values between threads within the same warp. So I kept the same logic as upstream.
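For reference, a toy numpy check (arbitrary shapes and scales, not the actual rewrite in AccelerateMatmul.cpp) of the transpose identity being relied on: scaling the RHS along K is equivalent to applying the scales to the transposed operand as an LHS and transposing the result.

```python
# Toy check of the transpose trick: an RHS-scaled dot equals the transposed dot
# with the scales applied to an LHS operand, where the K direction of the scaled
# operand runs along each row and stays within one thread's/warp's data.
import numpy as np

rng = np.random.default_rng(0)
M, K, N = 8, 64, 16
a = rng.standard_normal((M, K))
b = rng.standard_normal((K, N))
scale_b = 2.0 ** rng.integers(-2, 3, size=(K // 32, N))  # one scale per 32 K-elements (MX-style)

rhs_scaled = a @ (np.repeat(scale_b, 32, axis=0) * b)                # what dot_scaled computes
lhs_scaled = ((np.repeat(scale_b.T, 32, axis=1) * b.T) @ a.T).T      # scales applied on an LHS operand
assert np.allclose(rhs_scaled, lhs_scaled)
```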
This pull request supports DotOp layout codegen for the upcast_mxfp operation, which can be more efficient than the previous blocked layout implementation.
The 2 skipped tests fail with an L0 runtime error; they will be addressed in a separate PR #2968.