Use DotOp layout for UpcastMXFPOp Lowering #3057

Merged: 8 commits merged into main from liyang/upcast_dot_layout on Jan 8, 2025

Conversation

@LiyangLingIntel (Contributor) commented Dec 20, 2024

This pull request adds dot-layout codegen for the upcast_mxfp operation, which can be more efficient than the previous blocked-layout implementation.

The two skipped tests fail with an L0 runtime error; they will be addressed in a separate PR, #2968.

@LiyangLingIntel linked an issue Dec 20, 2024 that may be closed by this pull request
@LiyangLingIntel self-assigned this Dec 20, 2024
@LiyangLingIntel force-pushed the liyang/upcast_dot_layout branch 2 times, most recently from 151eee4 to 8b0f018 on January 3, 2025 05:56
@LiyangLingIntel changed the title from "[WIP] Use DotOp layout for UpcastMXFPOp Lowering" to "Use DotOp layout for UpcastMXFPOp Lowering" on Jan 3, 2025
@LiyangLingIntel marked this pull request as ready for review January 3, 2025 05:56
// CHECK: [[CVT_ARG0:%.*]] = ttg.convert_layout [[ARG0]] : tensor<128x32xi8, [[BLOCKED]]> -> tensor<128x32xi8, #ttg.dot_op<{opIdx = 0, parent = [[DPAS]], kWidth = 2}>>
// CHECK: [[CVT_ARG1:%.*]] = ttg.convert_layout [[ARG1]] : tensor<128x2xi8, [[BLOCKED1]]> -> tensor<128x2xi8, [[BLOCKED3]]>
// CHECK: [[UPCAST:%.*]] = ttg.upcast_mxfp [[CVT_ARG0]], [[CVT_ARG1]] fp_type = e2m1 : tensor<128x32xi8, #ttg.dot_op<{opIdx = 0, parent = [[DPAS]], kWidth = 2}>>, tensor<128x2xi8, [[BLOCKED3]]> -> tensor<128x64xbf16, #ttg.dot_op<{opIdx = 0, parent = [[DPAS1]], kWidth = 4}>>
// CHECK: [[A:%.*]] = ttg.convert_layout [[UPCAST]] : tensor<128x64xbf16, #ttg.dot_op<{opIdx = 0, parent = [[DPAS1]], kWidth = 4}>> -> tensor<128x64xbf16, #ttg.dot_op<{opIdx = 0, parent = [[DPAS]], kWidth = 2}>>
LiyangLingIntel (Contributor, Author) commented:

For E2M1, two fp4 elements are packed into one int8, so when upcasting, e.g. <32x32xi8> to <32x64xbf16>, each thread in the output dot layout must access twice as many contiguous elements as in the input dot layout.

The common code uses kWidth for this, e.g. changing it from 4 to 8 for e2m1 -> bf16. For the Intel GPU DPAS layout, we can instead change OpsPerChannel from 2 to 4 to meet the requirement and then convert back with ConvertLayoutOp.

Input tensor layout:

#ttg.dot_op<{opIdx = 0, parent = #triton_intel_gpu.dpas<{repeatCount = 8, systolicDepth = 8, executionSize = 16, opsPerChan = 2, threadsPerWarp = 16, warpsPerCTA = [4, 1], repCluster = [1, 1], A = [8, 16], B = [16, 16], C = [8, 16]}>, kWidth = 2}>
[ T0:0, T1:0, ... T14:0, T15:0, T0:8, T1:8, ... T14:8, T15:8]
[ T0:1, T1:1, ... T14:1, T15:1, T0:9, T1:9, ... T14:9, T15:9]
...
[ T0:7, T1:7, ... T14:7, T15:7, T0:15, T1:15, ... T14:15, T15:15]
[ T16:0, T17:0, ... T30:0, T31:0, T16:8, T17:8, ... T30:8, T31:8]
...
[ T16:7, T17:7, ... T30:7, T31:7, T16:15, T17:15, ... T30:15, T31:15]
[ T32:0, T33:0, ... T46:0, T47:0, T32:8, T33:8, ... T46:8, T47:8]
...
[ T48:0, T49:0, ... T62:0, T63:0, T48:8, T49:8, ... T62:8, T63:8]
...

Output tensor layout:

#ttg.dot_op<{opIdx = 0, parent = #triton_intel_gpu.dpas<{repeatCount = 8, systolicDepth = 8, executionSize = 16, opsPerChan = 4, threadsPerWarp = 16, warpsPerCTA = [4, 1], repCluster = [1, 1], A = [8, 32], B = [32, 16], C = [8, 16]}>, kWidth = 4}>
[ T0:0, T0:1, ... T15:0, T15:1, T0:16, T0:17, ... T15:16, T15:17]
[ T0:2, T0:3, ... T15:2, T15:3, T0:18, T0:19, ... T15:18, T15:19]
...
[ T0:14, T0:15, ... T15:14, T15:15, T0:30, T0:31, ... T15:30, T15:31]
[ T16:0, T16:1, ... T31:0, T31:1, T16:16, T16:17, ... T31:16, T31:17]
...
[ T16:14, T16:15, ... T31:14, T31:15, T16:30, T16:31, ... T31:30, T31:31]
[ T32:0, T32:1, ... T47:0, T47:1, T32:16, T32:17, ... T47:16, T47:17]
...
[ T48:0, T48:1, ... T63:0, T63:1, T48:16, T48:17, ... T63:16, T63:17]
...
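
To make the packing concrete, here is a minimal NumPy sketch of an e2m1 -> float decode (illustration only, not the PR's MLIR lowering); the low-nibble-first ordering and the function name are assumptions.

```python
import numpy as np

# Magnitude of each e2m1 bit pattern (1 sign, 2 exponent, 1 mantissa bits); sign handled below.
E2M1_TABLE = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)

def upcast_e2m1(packed: np.ndarray) -> np.ndarray:
    """Unpack an i8 tensor of shape (..., K) into floats of shape (..., 2*K)."""
    lo = packed & 0x0F            # first fp4 element of each byte (assumed low nibble first)
    hi = (packed >> 4) & 0x0F     # second fp4 element
    decode = lambda n: np.where(n & 0x8, -1.0, 1.0) * E2M1_TABLE[n & 0x7]
    out = np.empty(packed.shape[:-1] + (2 * packed.shape[-1],), dtype=np.float32)
    out[..., 0::2] = decode(lo)   # interleave: the output is twice as wide as the input,
    out[..., 1::2] = decode(hi)   # matching e.g. <32x32xi8> -> <32x64xbf16>
    return out

x = np.arange(0, 256, dtype=np.uint8).reshape(8, 32)  # stand-in packed tile
y = upcast_e2m1(x)                                    # shape (8, 64)
```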

// CHECK: [[B:%.*]] = tt.fp_to_fp [[CVT_ARG0]] : tensor<64x128xf8E4M3FN, #ttg.dot_op<{opIdx = 1, parent = [[DPAS]], kWidth = 2}>> -> tensor<64x128xbf16, #ttg.dot_op<{opIdx = 1, parent = [[DPAS]], kWidth = 2}>>
// CHECK: [[D:%.*]] = tt.dot [[A]], [[B]], [[C]] : tensor<32x64xbf16, #ttg.dot_op<{opIdx = 0, parent = [[DPAS]], kWidth = 2}>> * tensor<64x128xbf16, #ttg.dot_op<{opIdx = 1, parent = [[DPAS]], kWidth = 2}>> -> tensor<32x128xf32, [[DPAS]]>
// CHECK: [[RES:%.*]] = ttg.convert_layout [[D]] : tensor<32x128xf32, [[DPAS]]> -> tensor<32x128xf32, [[BLOCKED4]]>
// CHECK: scf.yield [[RES]] : tensor<32x128xf32, [[BLOCKED4]]>
LiyangLingIntel (Contributor, Author) commented:

Here we transpose the dot_scaled operands before lowering, converting an RHS UpcastMXFP into an LHS UpcastMXFP, instead of implementing RHS UpcastMXFP directly.

In the case of RHS scaling with the dot layout, each thread accesses elements column-wise, which requires scale values from threads in other warps; however, we can only shuffle values between threads within the same warp. So I kept the same logic as upstream.
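
For reference, a small NumPy sketch of the identity the transpose approach relies on: scaling the RHS of a dot is equivalent to scaling the LHS of the transposed dot. The per-element scale tensor here is just a stand-in for the MXFP block scales, and the shapes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((32, 64)).astype(np.float32)   # LHS operand
B = rng.standard_normal((64, 128)).astype(np.float32)  # RHS operand to be scaled
s = rng.standard_normal((64, 128)).astype(np.float32)  # stand-in for the upcast/scale of B

rhs_scaled = A @ (B * s)            # dot with a scaled RHS
lhs_scaled = ((B * s).T @ A.T).T    # same result: the scaled operand is now the LHS

assert np.allclose(rhs_scaled, lhs_scaled, atol=1e-3)
```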

@LiyangLingIntel force-pushed the liyang/upcast_dot_layout branch 2 times, most recently from 585f4e6 to 0fd0510 on January 3, 2025 14:09
@LiyangLingIntel force-pushed the liyang/upcast_dot_layout branch from d3072e5 to 385f456 on January 8, 2025 05:16
@LiyangLingIntel force-pushed the liyang/upcast_dot_layout branch from 385f456 to 85e1a04 on January 8, 2025 05:56
@whitneywhtsang requested a review from etiotto January 8, 2025
@whitneywhtsang merged commit b9da9cc into main Jan 8, 2025
5 checks passed
@whitneywhtsang deleted the liyang/upcast_dot_layout branch January 8, 2025
Successfully merging this pull request may close these issues.

[Performance] Enhance tritongpu.upcast_mxfp with dot layout