[DOCUMENTS]Update the DPAS encoding documents. #2746

Open
wants to merge 2 commits into
base: main
from

Conversation

chengjunlu
Contributor

Update the DPAS encoding documents based on the OCL interface requirements.

Contributor

@jopperm jopperm left a comment

Thanks for updating this. The ASCII art is excellent!

- `subGroupSize` Currently only a subgroup size of 16 is supported.

The values of the matrix are distributed across the threads in the subgroup in row-major order.
- If the column size of the matrix is equal to the number of threads in the subgroup, a single value name represents a single row of the matrix.
Contributor

What do you mean by "value name" here?

Contributor Author

@chengjunlu chengjunlu Nov 21, 2024

I'd like to explain the relationship between the column size and threadsPerWarp in SIMT.
A value of vector type refers to the matrix in registers. Each scalar of the vector refers to one or more rows.

// The name %a, of vector type, refers to a matrix held in registers.
%a <4 x float>
// The new names %row0 and %row1, defined by extractelement, each refer to one or more rows.
// We can define at most 4 names from the vector, and each name may logically refer to one or more rows.
%row0 = extractelement <4 x float> %a, i64 0
%row1 = extractelement <4 x float> %a, i64 1
%row2 = extractelement <4 x float> %a, i64 2
%row3 = extractelement <4 x float> %a, i64 3

Contributor

OK, I get what you want to say, but would it make sense to just talk about registers? You don't refer to "value names" again later, if I see correctly.

Contributor

"one scalar represents one row of the matrix in register" doesn't make it clearer for me, sorry. What you meant by value name, and now refer to as scalar, is one of the t-entries (e.g. t2) in the layout visualisation below, right? Do you want to say something like "covers" or "represents" elements from a single row, multiple rows, or a partial row?

I also might be misunderstanding something. @etiotto can you clear that up?

- If the column size of the matrix is larger than the number of threads in the subgroup, a single row of the matrix requires multiple value names.
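
As an illustration of the two cases above, here is a minimal Python sketch of a plain row-major split of a matrix across the threads of a subgroup. The helper names `owner` and `values_per_row` are hypothetical and are not part of the Triton code base; the sketch assumes nothing beyond the row-major rule stated in the text.

```python
def owner(row, col, num_cols, threads_per_warp=16):
    """Map a matrix element to (thread id, value index) under a row-major split.

    Hypothetical helper for illustration; not an API of the Triton backend.
    """
    linear = row * num_cols + col  # row-major linear index of the element
    return linear % threads_per_warp, linear // threads_per_warp


def values_per_row(num_cols, threads_per_warp=16):
    """Number of values one row occupies; below 1 means several rows share a value."""
    return num_cols / threads_per_warp


# 16 columns, 16 threads: value i holds exactly row i.
print(owner(3, 5, num_cols=16))   # (5, 3)
# 32 columns, 16 threads: one row spans two values.
print(owner(1, 20, num_cols=32))  # (4, 3)
# 8 columns, 16 threads: one value covers two rows.
print(owner(1, 2, num_cols=8))    # (10, 0)
```

With 16 columns the value index equals the row index, which is the "one value per row" case; with 32 columns `values_per_row` returns 2.0, the "multiple values per row" case.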

Example 1, the column size of the matrix is 16 and the number of threads in the subgroup is 16.
The DPAS encoding has repeatCount=8, systolicDepth=8, executionSize=16, opsPerChannel=2 and subGroupSize=16.
Contributor

In these examples, it would be helpful to fully declare the matrix. Here you say opsPerChannel==2, so the element type of the matrices would have to be 16 bits wide. So we would have:

A: tensor<8x16xfp16>
B: tensor<16x16xbf16>
D: tensor<8x16xbf16>

And the DPAS encoding would be:

DpasEncoding: triton_intel_gpu.dpas<{repeatCount = 8, systolicDepth = 8, executionSize = 16, opsPerChannel = 2, threadsPerWarp = 16,  warpsPerCTA = [1,1] , repCluster = [1,1]}>

Please confirm the correctness of the above. In particular, is repCluster [1,1] correct?

Contributor Author

@chengjunlu chengjunlu Nov 21, 2024

The layout is independent of the scalar type of the tensor. It just decorates the tensor type to represent the distribution of the values in registers.
Each operation should verify the legality of its input operand types.

For example:

#dpas0 = #triton_intel_gpu.dpas<{repeatCount = 8, systolicDepth = 8, executionSize = 16, opsPerChan = 2, threadsPerWarp = 16, warpsPerCTA = [1, 1], repCluster = [1, 1]}>
#dot_operand_a = #triton_gpu.dot_op<{opIdx=0, parent=#dpas0, kWidth=2}>

// This is a legal operation because we can load the 8x16xfp32 tensor from memory with the dot layout;
// `opsPerChan=2` is allowed for fp32 as well.
%1 = tt.load %0 -> tensor<8x16xfp32, #dot_operand_a>

// This is an illegal operation and would fail in the `tt.dot` operation verifier
// because we cannot use the DPAS instruction on fp32 scalars with the `opsPerChan=2` layout.
// %2 = tt.dot %1, xxx, xxx : tensor<8x16xfp32, #dot_operand_a> * xxx -> xxx

// We need to downcast the scalar type or change the layout before the `tt.dot`.
%2 = arith.truncf %1 -> tensor<8x16xfp16, #dot_operand_a>
%3 = tt.dot %2, xxx, xxx : tensor<8x16xfp16, #dot_operand_a> * xxx -> xxx

I will double-check the correctness of the example.
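
The legality rule discussed in this comment can be sketched in Python as follows. This is only an illustration: the function names are hypothetical, and the 32-bit channel width is an assumption inferred from the pairings mentioned in this thread (16-bit elements with opsPerChan=2, int8 with opsPerChan=4, f32 with opsPerChannel=1). It is not the actual `tt.dot` verifier.

```python
# Assumption inferred from this thread: one DPAS channel packs 32 bits,
# so opsPerChan is the number of elements that fit in a channel.
CHANNEL_BITS = 32
SUPPORTED_BITS = (8, 16, 32)


def ops_per_chan(elem_bits):
    """Expected opsPerChan for an element width (hypothetical helper)."""
    if elem_bits not in SUPPORTED_BITS:
        raise ValueError(f"unsupported element width: {elem_bits}")
    return CHANNEL_BITS // elem_bits


def dot_operand_is_legal(elem_bits, layout_ops_per_chan):
    """Sketch of a dot-operand check: element width must match the layout's opsPerChan."""
    return elem_bits in SUPPORTED_BITS and ops_per_chan(elem_bits) == layout_ops_per_chan


print(dot_operand_is_legal(32, 2))  # False: fp32 with an opsPerChan=2 layout
print(dot_operand_is_legal(16, 2))  # True: fp16 after the downcast
```

A load with the same layout is not rejected by this check, which matches the point above that the layout only decorates the tensor type and each operation verifies its own operands.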

t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13 t14 t15 |
t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13 t14 t15 v

Example 2, the column size of the matrix is 8 and the number of threads in the subgroup is 16.
Contributor

Same as my previous comment. I think this fits but please confirm:

A: tensor<8x8xf32>
B: tensor<8x16xf32>
D: tensor<8x8xf32>
dpasEncoding:  triton_intel_gpu.dpas<{repeatCount = 8, systolicDepth = 8, executionSize = 16, opsPerChannel = 1, threadsPerWarp = 16,  warpsPerCTA = [1,1] , repCluster = [1,1]}>

Contributor Author

I updated the example for the `tt.dot` operation. Please check.

along the row (resp. col) dimension. The repetitions are clustered in groups of size repCluster to optimize memory access.

Suppose we have a `tt.dot` operation with block size [64, 128] += [64, 32] * [32, 128] on fp16/bf16.
The `warpsPerCTA` is set to [2, 2]. The number of repetitions of the DPAS tile per warp is: A=8, B=8, C,D=16.
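
The repetition counts above can be reproduced with a short Python sketch. It assumes the DPAS tile shapes are A = repeatCount x (systolicDepth * opsPerChan), B = (systolicDepth * opsPerChan) x executionSize and C/D = repeatCount x executionSize, with the M and N dimensions split across the warps and the full K dimension kept in every warp. The function name `dpas_repetitions` is hypothetical, and repCluster is not modelled.

```python
def dpas_repetitions(m, n, k, warps_per_cta,
                     repeat_count=8, systolic_depth=8,
                     execution_size=16, ops_per_chan=2):
    """Count DPAS tile repetitions per warp for D[m, n] += A[m, k] * B[k, n].

    Hypothetical helper; the tile shapes are assumptions stated in the lead-in.
    """
    warp_m = m // warps_per_cta[0]            # rows of C handled by one warp
    warp_n = n // warps_per_cta[1]            # columns of C handled by one warp
    k_tile = systolic_depth * ops_per_chan    # K extent of one DPAS tile

    reps_m = warp_m // repeat_count
    reps_n = warp_n // execution_size
    reps_k = k // k_tile
    return {"A": reps_m * reps_k, "B": reps_k * reps_n, "C": reps_m * reps_n}


print(dpas_repetitions(64, 128, 32, (2, 2)))  # {'A': 8, 'B': 8, 'C': 16}
```

For the block sizes in the text, each warp owns a 32x64 slice of C, so it needs 4 x 4 = 16 C/D tiles, 4 x 2 = 8 A tiles and 2 x 4 = 8 B tiles.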
Contributor

I think this would be easier to follow with a concrete example. Add the total size of the A, B, and D matrices that `tt.dot` would have. Show how, given that size and an arbitrary warpsPerCTA, we choose the dpasLayout, including repCluster.

Contributor Author

I updated the concrete example. Is it better now?

The layouts for the C and D operands are the same as those for opsPerChan=2.

Example 3, the column size of the matrix is 32 and the number of threads in the subgroup is 16.
The DPAS encoding has repeatCount=8, systolicDepth=8, executionSize=16, opsPerChannel=4 and subGroupSize=16.
Contributor

Why is opsPerChannel equal to 4? It depends on the type of the matrix element, not on the width of the column, right?

Contributor Author

opsPerChan=4 is for int8 matrix multiply-accumulate.
Example 3 illustrates the layout of the DPAS operand A matrix for an 8-bit type.

@chengjunlu chengjunlu changed the title Update the DPAS encoding documents. [DOCUMENTS]Update the DPAS encoding documents. Nov 21, 2024
@chengjunlu chengjunlu force-pushed the chengjun/update_doc branch 4 times, most recently from bfccedd to 2de236e Compare November 21, 2024 05:04
@@ -168,7 +168,7 @@ emitOffsetForDpasLayoutPerCTA(const DpasEncodingAttr &dpasLayout,
sizePerThreads[rank - 2] / repCluster[rank - 2],
sizePerThreads[rank - 1] / repCluster[rank - 1]};

-  unsigned rowsPerElem = dpasLayout.getSubGroupSize() / instShapeC[1];
+  unsigned rowsPerElem = dpasLayout.getThreadsPerWarp_() / instShapeC[1];
Contributor

Why the trailing underscore?

Contributor Author

There is an interface in DistributedEncodingTrait named getThreadsPerWarp, but it has different semantics.
I can change the uses of getSubGroupSize to getThreadsPerWarp and confine the uses of getThreadsPerWarp_ to Dialect.cpp.

@chengjunlu chengjunlu force-pushed the chengjun/update_doc branch 2 times, most recently from e7c373e to 7fc674a Compare November 22, 2024 01:56
@chengjunlu
Contributor Author

@etiotto @jopperm I updated the documents based on the comments. Please help to review the new doc.

Development

Successfully merging this pull request may close these issues.

Update the DPAS layout documents and code for the latest OCL DPAS interface.
4 participants