[DPAS]: Use 2d-loads instruction to load the operand of `tt.dot` #146

etiotto · 2023-12-18T20:05:08Z

The operands of Triton's `tt.dot` operation should be loaded using a specialized instruction that loads 2D blocks of the matrices.
Loading the operands in blocks is more efficient than loading them with regular loads (`@llvm.genx.GenISA.LSCPrefetch`).

We might need to leverage the semantic information associated with Triton's block pointers (https://triton-lang.org/main/getting-started/tutorials/08-experimental-block-pointer.html) in order to generate 2D block loads.


tdeng5 · 2024-01-23T02:22:44Z

Liyang, please check how CUDA handles stride_xx.

LiyangLingIntel · 2024-01-23T09:15:27Z

> Liyang, please check how CUDA handles stride_xx.

The Triton CUDA pipeline lowers 2D loads to TMALoadTiledOp, which has no limitation on the last-dim stride.
For the Triton XPU pipeline, if we want to leverage GenISA_LSC2DBlockRead, the last dim of each block pointer must be contiguous (stride[-1]=1).
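
To illustrate why the last-dim stride matters, here is a minimal sketch in plain Python (not Triton or GenISA code). The helpers `block_addresses` and `rows_are_contiguous` are hypothetical names introduced only for this example; they compute the element offsets a 2D block would touch and check whether each row is a contiguous run in memory, which is the property assumed above for a 2D block read.

```python
def block_addresses(base, strides, offsets, block_shape):
    """Element offsets (in elements, not bytes) touched by a 2D block, row by row.

    Hypothetical helper for illustration only; it mirrors the shape/stride/offset
    information a block pointer carries.
    """
    rows, cols = block_shape
    row0, col0 = offsets
    return [
        [base + (row0 + r) * strides[0] + (col0 + c) * strides[1] for c in range(cols)]
        for r in range(rows)
    ]


def rows_are_contiguous(addresses):
    """True when consecutive elements of every row are adjacent in memory."""
    return all(b - a == 1 for row in addresses for a, b in zip(row, row[1:]))


# Row-major 2D tensor with 8 columns: strides = (8, 1), so stride[-1] == 1.
row_major = block_addresses(base=0, strides=(8, 1), offsets=(1, 2), block_shape=(2, 4))

# Same block on a transposed (column-major) view: strides = (1, 8).
col_major = block_addresses(base=0, strides=(1, 8), offsets=(1, 2), block_shape=(2, 4))

print(rows_are_contiguous(row_major))  # each row is one contiguous run
print(rows_are_contiguous(col_major))  # each row is scattered with a gap of 8
```

With `stride[-1] == 1`, every row of the block is a single contiguous run; with any other last-dim stride, the elements of a row are scattered across memory.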

For the first stage, in the 2D load conversion lowering pass, we will check the stride attr type. If the stride is a constant and meets the 2D block load requirements, we can leverage genx.matrix.2Dblockload. Otherwise, the lowering will fall back to regular loads. This is the fastest way to enable 2DblockLoad in our pipeline, with limited functionality.

For the second stage, we will consider the dynamic stride case. If the stride attr type is a variable, the lowering strategy is to use a conditional branch that checks at kernel runtime whether the last dim stride is 1, and then picks either block loads or regular loads.
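
The two stages can be sketched as a small decision function. This is an illustrative model in plain Python, not the actual lowering pass: `DynamicStride`, `choose_lowering`, and `resolve_at_runtime` are hypothetical names, and only the last-dim stride check is modeled (any other requirements of the 2D block load are assumed to be satisfied).

```python
from dataclasses import dataclass
from typing import Sequence, Union


@dataclass(frozen=True)
class DynamicStride:
    """Hypothetical stand-in for a stride only known at kernel runtime."""
    name: str


Stride = Union[int, DynamicStride]


def choose_lowering(strides: Sequence[Stride]) -> str:
    """Pick a lowering strategy from the last-dim stride (compile-time view)."""
    last = strides[-1]
    if isinstance(last, int):
        # Stage 1: constant stride, decided once at compile time.
        return "block_load" if last == 1 else "regular_load"
    # Stage 2: dynamic stride, emit both paths guarded by a runtime check.
    return "runtime_branch"


def resolve_at_runtime(choice: str, runtime_last_stride: int) -> str:
    """Model of the conditional branch taken inside the kernel for stage 2."""
    if choice != "runtime_branch":
        return choice
    return "block_load" if runtime_last_stride == 1 else "regular_load"


print(choose_lowering([1024, 1]))                       # constant, contiguous
print(choose_lowering([1, 1024]))                       # constant, not contiguous
print(choose_lowering([1024, DynamicStride("stride_xx")]))  # decided at runtime
```

In this model the first stage never generates a branch, while the second stage defers the same check to kernel runtime.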

vlad-penkin · 2024-01-31T01:04:04Z

@LiyangLingIntel, as per our discussion, could you please split this ticket by stage?

LiyangLingIntel · 2024-01-31T06:54:14Z

> @LiyangLingIntel, as per our discussion, could you please split this ticket by stage?

Sure, I have added two issues (#413 and #415) to split this ticket into two stages.

etiotto · 2024-04-29T13:11:41Z

Helping with refactoring and code review.

etiotto added the LLVM Codegen, enhancement (New feature or request), and codegen: gemm labels Dec 18, 2023

vlad-penkin mentioned this issue Jan 5, 2024

GEMM performance is lower than XeTLA #140

Closed

vlad-penkin assigned chengjunlu Jan 10, 2024

tdeng5 assigned LiyangLingIntel Jan 11, 2024

whitneywhtsang removed the LLVM Codegen label Jan 24, 2024

vlad-penkin unassigned chengjunlu Jan 29, 2024

This was referenced Jan 31, 2024

Add a Conversion from tt.load to genx.matrix.2Dblockload for 2D blocked tensor pointer cases #413

Closed

[DPAS] 2D load conversion needs support input stride as dynamic value #415

Closed

LiyangLingIntel removed their assignment Feb 7, 2024

vlad-penkin added the performance and codegen: dpas labels Feb 9, 2024

vlad-penkin added this to the Core performance milestone Feb 9, 2024

vlad-penkin assigned LiyangLingIntel Feb 20, 2024

vlad-penkin modified the milestones: 04. Core performance, 04.1 Core performance - DPAS Mar 6, 2024

LiyangLingIntel mentioned this issue Apr 12, 2024

Lower tt.load to TritonGEN::Matrix2DBlockLoadOp for tt.dot operands #865

Closed

LiyangLingIntel linked a pull request Apr 12, 2024 that will close this issue

Lower tt.load to TritonGEN::Matrix2DBlockLoadOp for tt.dot operands #865

Closed

chengjunlu mentioned this issue Apr 16, 2024

[Performance] Enhance the Triton GEMM/Flash attention kernel performance for the default Triton passes pipeline #878

Closed

LiyangLingIntel mentioned this issue Apr 20, 2024

2D block load lowering for tt.dot operands with no intermediate op #941

Closed

This was linked to pull requests Apr 20, 2024

2D block load lowering for tt.dot operands with no intermediate op #941

Closed

Lower block pointer tt.load to 2DBlockRead #959

Merged

Rewrite RewriteTensorPointer pass to support 2D block load #958

Merged

This was unlinked from pull requests Apr 24, 2024

Lower tt.load to TritonGEN::Matrix2DBlockLoadOp for tt.dot operands #865

Closed

2D block load lowering for tt.dot operands with no intermediate op #941

Closed

etiotto self-assigned this Apr 29, 2024

etiotto closed this as completed in #958 Apr 29, 2024

chengjunlu mentioned this issue Jun 24, 2024

[Productize GEMM Performance] Features #1450

Closed
