[DOCUMENTS]Update the DPAS encoding documents. #2746
base: main
third_party/intel/include/Dialect/TritonIntelGPU/IR/TritonIntelGPUAttrDefs.td
Thanks for updating this. The ASCII art is excellent!
third_party/intel/include/Dialect/TritonIntelGPU/IR/TritonIntelGPUAttrDefs.td
- `subGroupSize` Currently only sub group size 16 is supported.

The values of the matrix are distributed across the threads in the subgroup in row-major order.
- If the column size of the matrix is equal to the number of threads in the subgroup, a single value name represents a single row of the matrix.
What do you mean by "value name" here?
I'd like to explain the relationship between the column size and threadsPerWarp in SIMT.
A value of vector type refers to the matrix held in registers. Each scalar of the vector refers to one or more rows.
// The name %a, of vector type, refers to a matrix in registers.
%a <4 x float>
// The new names row0 to row3, defined by extractelement, each refer to one or more rows.
// At most 4 names can be defined from the vector, and each name may logically refer to one or more rows.
%row0 = extractelement <4 x float> %a, i64 0
%row1 = extractelement <4 x float> %a, i64 1
%row2 = extractelement <4 x float> %a, i64 2
%row3 = extractelement <4 x float> %a, i64 3
OK, I get what you want to say, but would it make sense to just talk about registers? You don't refer to "value names" again later, if I see correctly.
"one scalar represents one row of the matrix in register" doesn't make it clearer for me, sorry. What you meant by value name, and now refer to as scalar, is one of the t-entries (e.g. t2) in the layout visualisation below, right? Do you want to say something like "covers" or "represents" elements from a single row, multiple rows, or a partial row?
I also might be misunderstanding something. @etiotto can you clear that up?
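For readers following this thread, the three cases in the quoted text can be made concrete with a small model. This is only a sketch under a stated assumption: elements are handed out in plain row-major order, so the flat element index modulo the subgroup size picks the thread and the integer quotient picks the per-thread value index. The function name is hypothetical, operand packing via opsPerChan is ignored, and this is not the backend's implementation.

```python
def rows_covered_by_value(num_cols, threads_per_warp, value_idx):
    """Rows of a row-major matrix touched by one per-thread value index.

    Assumed model (illustrative only): element (row, col) has flat index
    row * num_cols + col; the owning thread is flat % threads_per_warp and
    the per-thread value index is flat // threads_per_warp.
    """
    first_flat = value_idx * threads_per_warp
    last_flat = first_flat + threads_per_warp - 1
    return list(range(first_flat // num_cols, last_flat // num_cols + 1))


# Column size == subgroup size: one value covers exactly one row.
print(rows_covered_by_value(16, 16, 2))  # [2]
# Column size < subgroup size: one value covers several rows.
print(rows_covered_by_value(8, 16, 2))   # [4, 5]
# Column size > subgroup size: one value covers only part of a row,
# so a full row needs several values (here values 2 and 3 both hit row 1).
print(rows_covered_by_value(32, 16, 2))  # [1]
print(rows_covered_by_value(32, 16, 3))  # [1]
```

Under this model, each t-entry in the layout visualisation corresponds to one (thread, value index) pair, which is why a single value index can span one row, several rows, or part of a row depending on the column size.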
third_party/intel/include/Dialect/TritonIntelGPU/IR/TritonIntelGPUAttrDefs.td
- If the column size of the matrix is larger than the number of threads in the subgroup, a single row of the matrix requires multiple value names.

Example 1, the column size of the matrix is 16 and the number of threads in the subgroup is 16.
The DPAS encoding of repeatCount=8, systolicDepth=8, executionSize=16, opsPerChannel=2 and subGroupSize=16.
In these examples, it would be helpful to fully declare the matrix. Here you say opsPerChannel==2, so the element type of the matrices would have to be 16 bits wide. So we would have:
A: tensor<8x16xfp16>
B: tensor<16x16xbf16>
D: tensor<8x16xbf16>
And the DPAS encoding would be:
DpasEncoding: triton_intel_gpu.dpas<{repeatCount = 8, systolicDepth = 8, executionSize = 16, opsPerChannel = 2, threadsPerWarp = 16, warpsPerCTA = [1,1] , repCluster = [1,1]}>
Please confirm that the above is correct. In particular, is repCluster = [1,1] correct?
The layout is independent of the scalar type of the tensor. It only decorates the tensor type to describe how the values are distributed across registers.
Each operation should verify the legality of its input operand types.
For example:
#dpas0 = #triton_intel_gpu.dpas<{repeatCount = 8, systolicDepth = 8, executionSize = 16, opsPerChan = 2, threadsPerWarp = 16, warpsPerCTA = [1, 1], repCluster = [1, 1]}>
#dot_operand_a = #triton_gpu.dot_op<{opIdx=0, parent=#dpas0, kWidth=2}>
// This is a legal operation: an 8x16xfp32 tensor can be loaded from memory with the dot layout,
// and `opsPerChan=2` is accepted for fp32 as well.
%1 = tt.load %0 -> tensor<8x16xfp32, #dot_operand_a>
// This is an illegal operation and would fail in the `tt.dot` operation verifier,
// because the DPAS instruction cannot be used on fp32 scalars with an `opsPerChan=2` layout.
// %2 = tt.dot %1, xxx, xxx : tensor<8x16xfp32, #dot_operand_a> * xxx -> xxx
// We need to downcast the scalar type or change the layout before the `tt.dot`.
%2 = arith.truncf %1 -> tensor<8x16xfp16, #dot_operand_a>
%3 = tt.dot %2, xxx, xxx : tensor<8x16xfp16, #dot_operand_a> * xxx -> xxx
I will double-check the correctness of the example.
third_party/intel/include/Dialect/TritonIntelGPU/IR/TritonIntelGPUAttrDefs.td
t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13 t14 t15 |
t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13 t14 t15 v

Example 2, the column size of the matrix is 8 and the number of threads in the subgroup is 16.
Same as my previous comment. I think this fits but please confirm:
A: tensor<8x8xf32>
B: tensor<8x16xf32>
D: tensor<8x8xf32>
dpasEncoding: triton_intel_gpu.dpas<{repeatCount = 8, systolicDepth = 8, executionSize = 16, opsPerChannel = 1, threadsPerWarp = 16, warpsPerCTA = [1,1] , repCluster = [1,1]}>
I updated the example for the tt.dot operation. Please check.
third_party/intel/include/Dialect/TritonIntelGPU/IR/TritonIntelGPUAttrDefs.td
along the row (resp. col) dimension. The repetitions are clustered in groups of size repCluster to optimize memory access.

Suppose we have a `tt.dot` operation of the block size [64, 128] += [64, 32] * [32, 128] of fp16/bf16.
The `warpsPerCTA` is set to [2, 2]. The number of repetitions of the DPAS tile per warp is: A=8, B=8, C,D=16.
I think this would be easier to follow with a concrete example. Add the total size of the A, B, and D matrices that tt.dot would have. Show how, given that size and an arbitrary warpsPerCTA, we choose the dpasLayout, including repCluster.
I updated the concrete example. Is it better now?
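The repetition counts quoted in the hunk above (A=8, B=8, C,D=16) can be reproduced with a short calculation. The sketch below rests on assumptions that are not spelled out in this thread: the DPAS tile shapes are taken to be A = [repeatCount, systolicDepth * opsPerChan], B = [systolicDepth * opsPerChan, executionSize] and C/D = [repeatCount, executionSize], and the warps split only the M and N dimensions. The function name is hypothetical.

```python
def dpas_repetitions(block_m, block_n, block_k, warps_per_cta,
                     repeat_count=8, systolic_depth=8,
                     execution_size=16, ops_per_chan=2):
    """Number of DPAS tiles each warp handles for operands A, B and C/D.

    Assumed tile shapes (illustrative, not taken from the backend source):
      A:   [repeat_count, systolic_depth * ops_per_chan]
      B:   [systolic_depth * ops_per_chan, execution_size]
      C/D: [repeat_count, execution_size]
    """
    tile_k = systolic_depth * ops_per_chan
    # Each warp owns a [warp_m, warp_n] slice of the result; K is not split.
    warp_m = block_m // warps_per_cta[0]
    warp_n = block_n // warps_per_cta[1]
    reps_a = (warp_m // repeat_count) * (block_k // tile_k)
    reps_b = (block_k // tile_k) * (warp_n // execution_size)
    reps_c = (warp_m // repeat_count) * (warp_n // execution_size)
    return reps_a, reps_b, reps_c


# [64, 128] += [64, 32] * [32, 128] with warpsPerCTA = [2, 2]
print(dpas_repetitions(64, 128, 32, [2, 2]))  # (8, 8, 16)
```

With warpsPerCTA = [2, 2], each warp covers a 32x64 slice of the result, which gives 4x2 tiles for A, 2x4 for B and 4x4 for C/D under the assumed tile shapes.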
The layouts for the C and D operands are the same as those for opsPerChan=2.

Example 3, the column size of the matrix is 32 and the number of threads in the subgroup is 16.
The DPAS encoding of repeatCount=8, systolicDepth=8, executionSize=16, opsPerChannel=4 and subGroupSize=16.
Why is opsPerChannel equal to 4? It depends on the type of the matrix element, not on the width of the column, right?
opsPerChan=4 is for int8 matrix multiply-accumulate.
Example 3 illustrates the layout of the DPAS operand A matrix for an 8-bit type.
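The pairings mentioned across this conversation (opsPerChannel=2 with 16-bit elements, 4 with int8, 1 with 32-bit elements) suggest a simple rule of thumb: the elements packed into one channel fill 32 bits. The sketch below encodes that rule as an assumption; both function names are hypothetical, and the real `tt.dot` verifier may check more than this.

```python
def ops_per_chan_for(element_bits):
    """opsPerChan implied by an element width, assuming a 32-bit channel."""
    if element_bits not in (8, 16, 32):
        raise ValueError(f"unsupported element width: {element_bits}")
    return 32 // element_bits


def dot_operand_is_legal(element_bits, layout_ops_per_chan):
    """Hypothetical verifier-style check: element width must match the layout."""
    return ops_per_chan_for(element_bits) == layout_ops_per_chan


print(ops_per_chan_for(8))           # 4  (int8, as in Example 3)
print(dot_operand_is_legal(16, 2))   # True  (fp16/bf16 with opsPerChan=2)
print(dot_operand_is_legal(32, 2))   # False (fp32 with opsPerChan=2 is rejected by tt.dot)
```

This matches the earlier point in the thread that the layout itself does not depend on the scalar type: a tensor of any element type can carry the layout, and only the operation that consumes it decides whether the combination is legal.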
bfccedd to 2de236e
third_party/intel/include/Dialect/TritonIntelGPU/IR/TritonIntelGPUAttrDefs.td
@@ -168,7 +168,7 @@ emitOffsetForDpasLayoutPerCTA(const DpasEncodingAttr &dpasLayout,
       sizePerThreads[rank - 2] / repCluster[rank - 2],
       sizePerThreads[rank - 1] / repCluster[rank - 1]};

-  unsigned rowsPerElem = dpasLayout.getSubGroupSize() / instShapeC[1];
+  unsigned rowsPerElem = dpasLayout.getThreadsPerWarp_() / instShapeC[1];
Why the trailing underscore?
There is an interface method named getThreadsPerWarp in DistributedEncodingTrait, but it has different semantics.
I can change the uses of getSubGroupSize to getThreadsPerWarp and confine the uses of getThreadsPerWarp_ to Dialect.cpp.
e7c373e to 7fc674a
Update the DPAS encoding documents based on the OCL interface requirements.