2D block load lowering for tt.dot operands with no intermediate op #941
Conversation
Force-pushed from b8ba9b9 to bbc1576
The RewriteTensorPointer.cpp file is entirely new in this PR, which makes it hard to review which changes we actually made to it. Please first create a commit that copies the common TTIR RewriteTensorPointer pass into the TT Intel GPU IR repo. Then, based on that, we can review the Intel-specific customizations clearly.
Refine Intel RewriteTensorPtr pass
Except store in rewrite tensorptr
Force-pushed from 358743e to 03df366
Force-pushed from 03df366 to 9865fee
Rebased and adjusted the commit history. We can view commit c01f7ac#diff-43a0aeab44c0c355cbd24ce57853a07b38d96b98d4d5b10a8a2e3dfbf121fdc4 to see the changes relative to the Triton common pass.
We should have a PR that copies the RewriteTensorPointer.cpp file over to the Intel directory (an NFC PR), and then rebase this one on top of it.
Based on offline discussion, I split this pull request into two for easier review:
I have marked this as a draft for now and will close it once all conversations under this PR are resolved.
This is the first PR separated from #941. It focuses on rewriting the `RewriteTensorPointer` pass so that `tt.load` ops with the tensor-pointer pattern are allowed in our compilation pipeline rather than being rewritten to legacy loads (a sketch of this pattern follows the two PR descriptions below).
---------
Signed-off-by: Tiotto, Ettore <ettore.tiotto@intel.com>
Co-authored-by: Whitney Tsang <whitney.tsang@intel.com>
Co-authored-by: Tiotto, Ettore <ettore.tiotto@intel.com>
This is the second PR separated from #941. It focuses on lowering `tt.load` with a tensor pointer to `Triton::Matrix2DBlockLoad`.
---------
Co-authored-by: Whitney Tsang <whitney.tsang@intel.com>
Co-authored-by: Tiotto, Ettore <ettore.tiotto@intel.com>
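For context, here is a minimal sketch of the tensor-pointer load pattern these two PRs are about, written as a Triton Python kernel. The kernel name, shapes, and block sizes are illustrative and not taken from this PR; the sketch assumes the standard Triton block-pointer API (`tl.make_block_ptr`, `tl.advance`, `tl.load` with `boundary_check`) and `tl.dot`:

```python
import triton
import triton.language as tl


@triton.jit
def dot_tile_kernel(a_ptr, b_ptr, c_ptr,
                    M, N, K,
                    stride_am, stride_ak,
                    stride_bk, stride_bn,
                    stride_cm, stride_cn,
                    BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
                    BLOCK_K: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)

    # Block (tensor) pointers: these become tt.make_tensor_ptr in TTIR.
    a_block = tl.make_block_ptr(base=a_ptr, shape=(M, K),
                                strides=(stride_am, stride_ak),
                                offsets=(pid_m * BLOCK_M, 0),
                                block_shape=(BLOCK_M, BLOCK_K), order=(1, 0))
    b_block = tl.make_block_ptr(base=b_ptr, shape=(K, N),
                                strides=(stride_bk, stride_bn),
                                offsets=(0, pid_n * BLOCK_N),
                                block_shape=(BLOCK_K, BLOCK_N), order=(1, 0))

    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for _ in range(0, K, BLOCK_K):
        # tt.load with a tensor-pointer operand feeding tt.dot: the pattern
        # the rewritten pass keeps intact instead of decomposing it into a
        # legacy gather-style load.
        a = tl.load(a_block, boundary_check=(0, 1))
        b = tl.load(b_block, boundary_check=(0, 1))
        acc += tl.dot(a, b)
        a_block = tl.advance(a_block, (0, BLOCK_K))
        b_block = tl.advance(b_block, (BLOCK_K, 0))

    # Store through plain pointers; per the "Except store in rewrite
    # tensorptr" commit, tensor-pointer stores appear to still be rewritten.
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    mask = (offs_m[:, None] < M) & (offs_n[None, :] < N)
    tl.store(c_ptrs, acc, mask=mask)
```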
This PR is for #146.
Based on the discussion in #865, this is an alternative implementation that does not rely on the intermediate op `triton_intel_gpu.load_2d`. Instead, we make changes to the `RewriteTensorPointer` pass. The strategy is to copy the common `RewriteTensorPointer` pass into the Intel GPU passes and not rewrite `tt.load` ops that take a TensorPointer, as the previous NVIDIA pass did. This keeps the op available at a later stage so it can be lowered to `llvm.genx.GenISA.LSC2DBlockRead`.
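As a rough way to observe the effect of this strategy, one can dump the IR while running a block-pointer kernel and check that the tensor-pointer `tt.load` survives the Intel `RewriteTensorPointer` pass and eventually becomes a 2D block read. This is a sketch, not part of the PR; it assumes Triton's `MLIR_ENABLE_DUMP` environment variable, a PyTorch build with an Intel `xpu` device, and the hypothetical `dot_tile_kernel` sketched above:

```python
import os

# Assumption: MLIR_ENABLE_DUMP=1 makes Triton print the IR before each MLIR
# pass, so the dump can be searched for tt.load still taking a tensor
# pointer after RewriteTensorPointer and for the 2D block read later on.
os.environ["MLIR_ENABLE_DUMP"] = "1"

import torch
import triton

M = N = K = 256
BLOCK_M = BLOCK_N = 64
BLOCK_K = 32
# Assumption: an Intel GPU exposed to PyTorch as the "xpu" device.
a = torch.randn((M, K), device="xpu", dtype=torch.float16)
b = torch.randn((K, N), device="xpu", dtype=torch.float16)
c = torch.empty((M, N), device="xpu", dtype=torch.float32)

grid = (triton.cdiv(M, BLOCK_M), triton.cdiv(N, BLOCK_N))
dot_tile_kernel[grid](a, b, c, M, N, K,
                      a.stride(0), a.stride(1),
                      b.stride(0), b.stride(1),
                      c.stride(0), c.stride(1),
                      BLOCK_M=BLOCK_M, BLOCK_N=BLOCK_N, BLOCK_K=BLOCK_K)
```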