Add a Conversion from tt.load to genx.matrix.2Dblockload for 2D blocked tensor pointer cases #413
Comments
Besides the stride limitation, there are still restrictions due to HW limitations:
The cases in 08-experimental-block-pointer and test_block_pointer.py produce IR with a constant last-dim stride.
Based on the previous discussion, I submitted #941. There are two ways of implementing this; the idea is similar, and the differences would be
Split #941 into 2 PRs for better review:
Helping with refactoring & code review.
This issue is the first stage of task [DPAS]: Use 2d-loads instruction to load the operand of tt.dot #146
The Triton CUDA pipeline lowers 2D loads to TMALoadTiledOp, which has no limitation on the last-dim strides.
For the Triton XPU pipeline, if we want to leverage GenISA_LSC2DBlockRead, the last dimension must be contiguous (stride[-1] = 1) for each block pointer.
So, in the first stage, the 2D-load conversion lowering pass will check the stride attribute: if the last-dim stride is a compile-time constant and the access meets the 2D block load requirements, we can leverage genx.matrix.2Dblockload; otherwise, it falls back to regular loads.
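
As a rough illustration of that check, here is a minimal sketch of the predicate such a pass might apply, assuming the block pointer is created by triton::MakeTensorPtrOp and relying on standard MLIR constant matchers. The helper name canUse2DBlockLoad is hypothetical, not the actual pass code:

```cpp
// Hypothetical sketch: decide whether a tt.load on a block pointer can be
// lowered to genx.matrix.2Dblockload, or must fall back to regular loads.
#include "mlir/IR/Matchers.h"
#include "triton/Dialect/Triton/IR/Dialect.h"

using namespace mlir;

static bool canUse2DBlockLoad(triton::LoadOp loadOp) {
  // The load must come from a block pointer built by tt.make_tensor_ptr.
  auto makePtr = loadOp.getPtr().getDefiningOp<triton::MakeTensorPtrOp>();
  if (!makePtr)
    return false;

  ValueRange strides = makePtr.getStrides();
  if (strides.empty())
    return false;

  // The HW 2D block read requires the last dimension to be contiguous,
  // i.e. stride[-1] == 1, and this must be provable at compile time.
  llvm::APInt lastStride;
  if (!matchPattern(strides.back(), m_ConstantInt(&lastStride)))
    return false; // stride is not a compile-time constant

  return lastStride == 1;
}
```

A predicate along these lines matches the staged plan above: only loads with a provably unit last-dim stride take the genx.matrix.2Dblockload path, and everything else keeps going through the existing regular-load lowering.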