[DOCUMENTS]Update the DPAS encoding documents. #2746

chengjunlu · 2024-11-19T04:12:57Z

Update the DPAS encoding documents based on the OCL interface requirements.

third_party/intel/include/Dialect/TritonIntelGPU/IR/TritonIntelGPUAttrDefs.td

jopperm

Thanks for updating this. The ASCII art is excellent!

jopperm · 2024-11-20T09:07:30Z

third_party/intel/include/Dialect/TritonIntelGPU/IR/TritonIntelGPUAttrDefs.td

+        - `sugGroupSize` Currently only sub group size 16 is supported.
+
+The values of the matrix is distributed across the threads in the subgroup as row-major order.
+  - If the column size of the matrix is equal to the number of threads in the subgroup, a single value name represents a single rows of the matrix.


What do you mean by "value name" here?

I'd like to explain the relationship between the column size and threadsPerWarp in SIMT.
A value of vector type refers to the matrix in registers. Each scalar of the vector refers to one or more rows.

// The name %a, of vector type, refers to a matrix in registers.
%a <4 x float>
// The new names %row0 and %row1, defined by extractelement, refer to one or more rows.
// We can define at most 4 names from the vector, and each name may logically refer to one or more rows.
%row0 = extractelement <4 x float> %a, i64 0
%row1 = extractelement <4 x float> %a, i64 1
%row2 = extractelement <4 x float> %a, i64 2
%row3 = extractelement <4 x float> %a, i64 3

Ok, I get what you want to say, but would it make sense to just talk about registers? You don't refer to "value names" again later, if I see correctly.

"one scalar represents one row of the matrix in register" doesn't make it clearer for me, sorry. What you meant by value name, and now refer to as scalar, is one of the t-entries (e.g. t2) in the layout visualisation below, right? Do you want to say something like "covers" or "represents" elements from a single row/multiple rows/a partial row?

I also might be misunderstanding something. @etiotto can you clear that up?
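
To make the row-major distribution discussed in this thread concrete, here is a minimal sketch in Python. It is a simplified model, not the backend's implementation: it assumes elements are numbered in row-major order and dealt out round-robin to the threads of the subgroup, and it ignores any packing implied by opsPerChannel. The names `owner_of` and `rows_per_value` are hypothetical.

```python
def owner_of(row, col, num_cols, threads_per_warp=16):
    """Return (thread, value_index) for the matrix element at (row, col).

    Assumption: elements are numbered in row-major order and assigned
    round-robin to the threads of the subgroup.
    """
    linear = row * num_cols + col
    return linear % threads_per_warp, linear // threads_per_warp


def rows_per_value(num_cols, threads_per_warp=16):
    """Number of matrix rows covered by one per-thread value (may be < 1)."""
    return threads_per_warp / num_cols


# 16 columns, 16 threads: value i of every thread holds row i.
print(owner_of(3, 5, 16))   # (5, 3)
# 8 columns, 16 threads: row 1 lands in threads 8..15 of value 0,
# so one value covers two rows.
print(owner_of(1, 0, 8))    # (8, 0)
# 32 columns, 16 threads: one row needs two values per thread.
print(owner_of(0, 16, 32))  # (0, 1)
```

Under this model, `rows_per_value` is the same quantity as `threadsPerWarp / instShapeC[1]` used for `rowsPerElem` in Utility.h.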

etiotto · 2024-11-20T19:42:52Z

third_party/intel/include/Dialect/TritonIntelGPU/IR/TritonIntelGPUAttrDefs.td

+  - If the column size of the matrix is larger than the number of the threads in the subgroup, a single row of the matrix requires multiple value name.
+
+Example 1, the column size of the matrix is 16 and the number of threads in the subgroup is 16.
+The DPAS encoding of repeatCount=8, systolicDepth=8, executionSize=16, opsPerChannel=2 and sugGroupSize=16.


In these examples, it would be helpful to fully declare the matrix. Here you say opsPerChannel==2, so the element type of the matrices would have to be 16 bits wide. So we would have:

A: tensor<8x16xfp16>
B: tensor<16x16xbf16>
D: tensor<8x16xbf16>

And the DPAS encoding would be:

DpasEncoding: triton_intel_gpu.dpas<{repeatCount = 8, systolicDepth = 8, executionSize = 16, opsPerChannel = 2, threadsPerWarp = 16, warpsPerCTA = [1,1] , repCluster = [1,1]}>

Please confirm the correctness of the above. In particular, is repCluster [1,1] correct?

The layout is independent of the scalar type of the tensor. It only decorates the tensor type to describe how the values are distributed in registers.
Each operation should verify the legality of its input operand types.

For example:

#dpas0 = #triton_intel_gpu.dpas<{repeatCount = 8, systolicDepth = 8, executionSize = 16, opsPerChan = 2, threadsPerWarp = 16, warpsPerCTA = [1, 1], repCluster = [1, 1]}>
#dot_operand_a = #triton_gpu.dot_op<{opIdx=0, parent=#dpas0, kWidth=2}>
// This is a legal operation because we can load the 8x16xfp32 tensor from memory with the dot layout
// of `opsPerChan=2` for fp32 as well.
%1 = tt.load %0 -> tensor<8x16xfp32, #dot_operand_a>
// This is an illegal operation and would fail in the `tt.dot` operation verifier
// because we cannot use the DPAS instruction on fp32 scalars with the `opsPerChan=2` layout.
// %2 = tt.dot %1, xxx, xxx : tensor<8x16xfp32, #dot_operand_a> * xxx -> xxx
// We need to downcast the scalar type or change the layout before the `tt.dot`.
%2 = arthi.fptofp %1 -> tensor<8x16xfp16, #dot_operand_a>
%3 = tt.dot %2, xxx, xxx : tensor<8x16xfp16, #dot_operand_a> * xxx -> xxx

I will double-check the correctness of the example.
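
The split between "the layout can decorate any element type" and "the operation verifies its operands" can be sketched as follows. This is only an illustration, not the actual `tt.dot` verifier: it assumes a DPAS channel is 32 bits wide, so an operand is usable only when the element bit width times opsPerChan equals 32. The function name `dot_operand_is_legal` is hypothetical.

```python
CHANNEL_BITS = 32  # assumption: one DPAS channel packs 32 bits of operand data


def dot_operand_is_legal(element_bits, ops_per_chan):
    """Return True if elements of this width exactly fill a channel.

    Sketch of the check a dot verifier could perform; the layout itself
    places no restriction on the element type.
    """
    return element_bits * ops_per_chan == CHANNEL_BITS


# fp16 with opsPerChan=2 fills a channel, so the dot is accepted.
print(dot_operand_is_legal(16, 2))  # True
# fp32 with opsPerChan=2 can be loaded with the layout but not fed to the dot.
print(dot_operand_is_legal(32, 2))  # False
```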

etiotto · 2024-11-20T19:58:45Z

third_party/intel/include/Dialect/TritonIntelGPU/IR/TritonIntelGPUAttrDefs.td

+t0   t1   t2   t3   t4   t5   t6   t7   t8   t9   t10  t11  t12  t13  t14  t15   |
+t0   t1   t2   t3   t4   t5   t6   t7   t8   t9   t10  t11  t12  t13  t14  t15   v
+
+Example 2, the column size of the matrix is 8 and the number of threads in the subgroup is 16.


Same as my previous comment. I think this fits but please confirm:

A: tensor<8x8xf32>
B: tensor<8x16xf32>
D: tensor<8x8xf32>
dpasEncoding: triton_intel_gpu.dpas<{repeatCount = 8, systolicDepth = 8, executionSize = 16, opsPerChannel = 1, threadsPerWarp = 16, warpsPerCTA = [1,1] , repCluster = [1,1]}>

I updated the example of the tt.dot operation. Please check.

etiotto · 2024-11-20T20:04:45Z

third_party/intel/include/Dialect/TritonIntelGPU/IR/TritonIntelGPUAttrDefs.td

+along the row (resp. col) dimension.  And the repetitions are clustered of the size of repCluster to optimize the memory accessing.
+
+Suppose we have a `tt.dot` operation of the block size [64, 128] += [64, 32] * [32, 128] of hf16/bf16.
+The `warpsPerCTA` set to [2, 2]. The number of repetitions of the DPAS tile per warp is: A=8, B=8, C,D=16.


I think a concrete example would make this easier to follow. Add the total size of the A, B and D matrices that tt.dot would have. Show how, given that size and an arbitrary warpsPerCTA, we choose the dpasLayout, including repCluster.

I updated the concrete example. Is it better now?
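
The repetition counts quoted in the diff above (A=8, B=8, C,D=16) can be reproduced with a short calculation. The sketch below assumes the DPAS tile shapes are A = repeatCount x (systolicDepth * opsPerChan), B = (systolicDepth * opsPerChan) x executionSize and C/D = repeatCount x executionSize, that warps split only the M and N dimensions, and that all sizes divide evenly. The helper names are hypothetical, and repCluster is not modelled.

```python
def dpas_tile_shapes(repeat_count, systolic_depth, execution_size, ops_per_chan):
    """Per-instruction tile shapes for A, B and C/D (assumed formulas)."""
    k = systolic_depth * ops_per_chan
    return {
        "A": (repeat_count, k),
        "B": (k, execution_size),
        "C": (repeat_count, execution_size),
    }


def reps_per_warp(m, n, k, warps_per_cta, tiles):
    """Number of DPAS tiles each warp needs for the A, B and C/D operands."""
    warp_m = m // warps_per_cta[0]  # rows of C handled by one warp
    warp_n = n // warps_per_cta[1]  # columns of C handled by one warp

    def count(shape, tile):
        return (shape[0] // tile[0]) * (shape[1] // tile[1])

    return {
        "A": count((warp_m, k), tiles["A"]),
        "B": count((k, warp_n), tiles["B"]),
        "C": count((warp_m, warp_n), tiles["C"]),
    }


# [64, 128] += [64, 32] * [32, 128] in fp16/bf16, warpsPerCTA = [2, 2]
tiles = dpas_tile_shapes(repeat_count=8, systolic_depth=8,
                         execution_size=16, ops_per_chan=2)
print(reps_per_warp(64, 128, 32, (2, 2), tiles))  # {'A': 8, 'B': 8, 'C': 16}
```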

etiotto · 2024-11-20T20:16:14Z

third_party/intel/include/Dialect/TritonIntelGPU/IR/TritonIntelGPUAttrDefs.td

+The layouts for C and D operands are same as the one of opsPerChan=2.
+
+Example 3, the column size of the matrix is 32 and the number of threads in the subgroup is 16.
+The DPAS encoding of repeatCount=8, systolicDepth=8, executionSize=16, opsPerChannel=4 and sugGroupSize=16.


Why is opsPerChannel equal to 4? It depends on the type of the matrix element, not on the width of the column, right?

opsPerChan=4 is for int8 matrix multiply-accumulate.
Example 3 illustrates the layout of the DPAS operand A matrix for an 8-bit element type.
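
The link between element type and column size in the three examples can be sketched numerically. The snippet assumes that opsPerChan is 32 divided by the element bit width and that the A tile has systolicDepth * opsPerChan columns; both are assumptions consistent with the examples in this thread rather than statements taken from the backend. The function name `a_tile_columns` is hypothetical.

```python
def a_tile_columns(element_bits, systolic_depth=8, channel_bits=32):
    """Return (ops_per_chan, number of columns of the A tile).

    Assumes a 32-bit channel, so narrower elements pack more operations
    per channel and widen the A tile accordingly.
    """
    ops_per_chan = channel_bits // element_bits
    return ops_per_chan, systolic_depth * ops_per_chan


for bits in (32, 16, 8):
    print(bits, a_tile_columns(bits))
# 32 (1, 8)    matches Example 2 (column size 8)
# 16 (2, 16)   matches Example 1 (column size 16)
# 8 (4, 32)    matches Example 3 (column size 32)
```

So the column size follows from the element type through opsPerChan, which matches the answer given above.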

third_party/intel/include/Dialect/TritonIntelGPU/IR/TritonIntelGPUAttrDefs.td

etiotto · 2024-11-21T15:20:48Z

third_party/intel/lib/TritonIntelGPUToLLVM/Utility.h

@@ -168,7 +168,7 @@ emitOffsetForDpasLayoutPerCTA(const DpasEncodingAttr &dpasLayout,
      sizePerThreads[rank - 2] / repCluster[rank - 2],
      sizePerThreads[rank - 1] / repCluster[rank - 1]};

-  unsigned rowsPerElem = dpasLayout.getSubGroupSize() / instShapeC[1];
+  unsigned rowsPerElem = dpasLayout.getThreadsPerWarp_() / instShapeC[1];


Why the trailing underscore?

There is an interface in DistributedEncodingTrait named getThreadsPerWarp, but it has different semantics.
I can change the uses of getSubGroupSize to getThreadsPerWarp and confine the uses of getThreadsPerWarp_ to Dialect.cpp.

chengjunlu · 2024-12-10T01:10:20Z

@etiotto @jopperm I updated the documents based on the comments. Please review the new doc.

chengjunlu requested review from whitneywhtsang, etiotto, sommerlukas, LiyangLingIntel and mfrancepillois November 19, 2024 04:16

chengjunlu linked an issue Nov 19, 2024 that may be closed by this pull request

Update the DPAS layout documents and code for the latest OCL DPAS interface. #2599

Closed

LiyangLingIntel approved these changes Nov 20, 2024

jopperm reviewed Nov 20, 2024

etiotto reviewed Nov 20, 2024

chengjunlu changed the title ~~Update the DPAS encoding documents.~~ [DOCUMENTS]Update the DPAS encoding documents. Nov 21, 2024

chengjunlu force-pushed the chengjun/update_doc branch 4 times, most recently from bfccedd to 2de236e Compare November 21, 2024 05:04

etiotto reviewed Nov 21, 2024

chengjunlu force-pushed the chengjun/update_doc branch 2 times, most recently from e7c373e to 7fc674a Compare November 22, 2024 01:56

chengjunlu added 2 commits November 22, 2024 09:09

Update the DPAS encoding documents.

eda4c40

Update the documents based on review comments.

7fc674a
