Use DotOp layout for UpcastMXFPOp Lowering #3057
Conversation
// CHECK: [[CVT_ARG0:%.*]] = ttg.convert_layout [[ARG0]] : tensor<128x32xi8, [[BLOCKED]]> -> tensor<128x32xi8, #ttg.dot_op<{opIdx = 0, parent = [[DPAS]], kWidth = 2}>>
// CHECK: [[CVT_ARG1:%.*]] = ttg.convert_layout [[ARG1]] : tensor<128x2xi8, [[BLOCKED1]]> -> tensor<128x2xi8, [[BLOCKED3]]>
// CHECK: [[UPCAST:%.*]] = ttg.upcast_mxfp [[CVT_ARG0]], [[CVT_ARG1]] fp_type = e2m1 : tensor<128x32xi8, #ttg.dot_op<{opIdx = 0, parent = [[DPAS]], kWidth = 2}>>, tensor<128x2xi8, [[BLOCKED3]]> -> tensor<128x64xbf16, #ttg.dot_op<{opIdx = 0, parent = [[DPAS1]], kWidth = 4}>>
// CHECK: [[A:%.*]] = ttg.convert_layout [[UPCAST]] : tensor<128x64xbf16, #ttg.dot_op<{opIdx = 0, parent = [[DPAS1]], kWidth = 4}>> -> tensor<128x64xbf16, #ttg.dot_op<{opIdx = 0, parent = [[DPAS]], kWidth = 2}>>
For E2M1, 2 fp4 elements are packed into 1 int8, so when upcasting e.g. <32x32xi8> to <32x64xbf16>, the output dot layout must give each thread twice the contiguous element access of the input dot layout.
The common code does this by doubling kWidth, e.g. 4 -> 8 for e2m1 to bf16. For the Intel GPU DPAS layout, we can instead change opsPerChan from 2 to 4 to meet the requirement and convert back with a ConvertLayoutOp. (A plain-Python sketch of the fp4 unpacking follows the layout diagrams below.)
Input tensor layout:
#ttg.dot_op<{opIdx = 0, parent = #triton_intel_gpu.dpas<{repeatCount = 8, systolicDepth = 8, executionSize = 16, opsPerChan = 2, threadsPerWarp = 16, warpsPerCTA = [4, 1], repCluster = [1, 1], A = [8, 16], B = [16, 16], C = [8, 16]}>, kWidth = 2}>
[ T0:0, T1:0, ... T14:0, T15:0, T0:8, T1:8, ... T14:8, T15:8]
[ T0:1, T1:1, ... T14:1, T15:1, T0:9, T1:9, ... T14:9, T15:9]
...
[ T0:7, T1:7, ... T14:7, T15:7, T0:15, T1:15, ... T14:15, T15:15]
[ T16:0, T17:0, ... T30:0, T31:0, T16:8, T17:8, ... T30:8, T31:8]
...
[ T16:7, T17:7, ... T30:7, T31:7, T16:15, T17:15, ... T30:15, T31:15]
[ T32:0, T33:0, ... T46:0, T47:0, T32:8, T33:8, ... T46:8, T47:8]
...
[ T48:0, T49:0, ... T62:0, T63:0, T48:8, T49:8, ... T62:8, T63:8]
...
Output tensor layout:
#ttg.dot_op<{opIdx = 0, parent = #triton_intel_gpu.dpas<{repeatCount = 8, systolicDepth = 8, executionSize = 16, opsPerChan = 4, threadsPerWarp = 16, warpsPerCTA = [4, 1], repCluster = [1, 1], A = [8, 32], B = [32, 16], C = [8, 16]}>, kWidth = 4}>
[ T0:0, T0:1, ... T15:0, T15:1, T0:16, T0:17, ... T15:16, T15:17]
[ T0:2, T0:3, ... T15:2, T15:3, T0:18, T0:19, ... T15:18, T15:19]
...
[ T0:14, T0:15, ... T15:14, T15:15, T0:30, T0:31, ... T15:30, T15:31]
[ T16:0, T16:1, ... T31:0, T31:1, T16:16, T16:17, ... T31:16, T31:17]
...
[ T16:14, T16:15, ... T31:14, T31:15, T16:30, T16:31, ... T31:30, T31:31]
[ T32:0, T32:1, ... T47:0, T47:1, T32:16, T32:17, ... T47:16, T47:17]
...
[ T48:0, T48:1, ... T63:0, T63:1, T48:16, T48:17, ... T63:16, T63:17]
...
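A minimal plain-Python sketch (no Triton APIs; the low-nibble-first packing order is an assumption for illustration) of the E2M1 unpacking that motivates the doubled per-thread access above:

```python
# Sketch only: each int8 packs two e2m1 (fp4) codes, so a <128x32xi8> operand
# expands to <128x64xbf16>, and each thread must hold twice as many contiguous
# elements (kWidth 2 -> 4 / opsPerChan 2 -> 4 in the layouts above).
# The low-nibble-first ordering below is assumed for illustration.

E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # the 8 non-negative e2m1 values

def decode_e2m1(code: int) -> float:
    """Decode one 4-bit e2m1 code: 1 sign bit, 2 exponent bits, 1 mantissa bit."""
    sign = -1.0 if code & 0x8 else 1.0
    return sign * E2M1_MAGNITUDES[code & 0x7]

def unpack_byte(byte: int) -> tuple[float, float]:
    """One packed int8 yields two upcast elements."""
    return decode_e2m1(byte & 0xF), decode_e2m1((byte >> 4) & 0xF)

print(unpack_byte(0x19))  # (-0.5, 0.5): one i8 column becomes two bf16 columns
```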
// CHECK: [[B:%.*]] = tt.fp_to_fp [[CVT_ARG0]] : tensor<64x128xf8E4M3FN, #ttg.dot_op<{opIdx = 1, parent = [[DPAS]], kWidth = 2}>> -> tensor<64x128xbf16, #ttg.dot_op<{opIdx = 1, parent = [[DPAS]], kWidth = 2}>>
// CHECK: [[D:%.*]] = tt.dot [[A]], [[B]], [[C]] : tensor<32x64xbf16, #ttg.dot_op<{opIdx = 0, parent = [[DPAS]], kWidth = 2}>> * tensor<64x128xbf16, #ttg.dot_op<{opIdx = 1, parent = [[DPAS]], kWidth = 2}>> -> tensor<32x128xf32, [[DPAS]]>
// CHECK: [[RES:%.*]] = ttg.convert_layout [[D]] : tensor<32x128xf32, [[DPAS]]> -> tensor<32x128xf32, [[BLOCKED4]]>
// CHECK: scf.yield [[RES]] : tensor<32x128xf32, [[BLOCKED4]]>
Here we transpose the dot_scaled operands before lowering, converting the RHS UpcastMXFP into an LHS UpcastMXFP instead of implementing RHS UpcastMXFP directly.
With RHS scaling under a dot layout, each thread accesses elements along a column, which would require scale values from threads in other warps, but we can only shuffle values between threads within the same warp. So I kept the same logic as upstream.
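For reference, a toy numpy check (arbitrary shapes and scales, not the actual rewrite in AccelerateMatmul.cpp) of the transpose identity being relied on: scaling the RHS along K is equivalent to applying the scales to the transposed operand as an LHS and transposing the result.

```python
# Toy check of the transpose trick: an RHS-scaled dot equals the transposed dot
# with the scales applied to an LHS operand, where the K direction of the scaled
# operand runs along each row and stays within one thread's/warp's data.
import numpy as np

rng = np.random.default_rng(0)
M, K, N = 8, 64, 16
a = rng.standard_normal((M, K))
b = rng.standard_normal((K, N))
scale_b = 2.0 ** rng.integers(-2, 3, size=(K // 32, N))  # one scale per 32 K-elements (MX-style)

rhs_scaled = a @ (np.repeat(scale_b, 32, axis=0) * b)                # what dot_scaled computes
lhs_scaled = ((np.repeat(scale_b.T, 32, axis=1) * b.T) @ a.T).T      # scales applied on an LHS operand
assert np.allclose(rhs_scaled, lhs_scaled)
```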
This pull request supports DotOp layout codegen for the upcast_mxfp operation, which can be more efficient than the previous blocked layout implementation.
The 2 skipped tests fail with an L0 runtime error; they will be addressed in a separate PR #2968.