[Draft] Get reasonable performance from the default Triton pass pipeline. #897

Closed
chengjunlu wants to merge 19 commits from the chengjun/llvm-target-dev branch

Conversation

@chengjunlu (Contributor) commented Apr 17, 2024

Draft opened only to run CI.

Added:

  • Intel rewrite tensor pointer pass.
  • Intel remove layout pass.
  • Intel accelerate matmul pass.
  • Intel materialize 2D load pass.
  • Intel loop pipelining with prefetching (see the pipelining sketch after the memos below).
  • Use the Intel-specific pass pipeline.
  • Use large 2D loads (see the kernel sketch below).
  • Lower the prefetch op to the 2D prefetch op.
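
Not part of this PR's diff, but for context: a minimal sketch of the kind of Triton kernel these passes target, using the public block-pointer API (`tl.make_block_ptr` / `tl.advance` / `tl.load` with `boundary_check`). The rewrite-tensor-pointer and materialize-2D-load passes consume loads of this form and can map them to Intel 2D block loads instead of scattered accesses; the kernel itself (names, tile sizes, fp32 output) is illustrative only.

```python
import triton
import triton.language as tl

@triton.jit
def matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                  stride_am, stride_ak, stride_bk, stride_bn,
                  stride_cm, stride_cn,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    # Block pointers describe a 2D tile (base, shape, strides, offsets,
    # block shape, order); loads through them are what the Intel passes
    # can lower to hardware 2D block-load instructions.
    a_block_ptr = tl.make_block_ptr(base=a_ptr, shape=(M, K),
                                    strides=(stride_am, stride_ak),
                                    offsets=(pid_m * BLOCK_M, 0),
                                    block_shape=(BLOCK_M, BLOCK_K), order=(1, 0))
    b_block_ptr = tl.make_block_ptr(base=b_ptr, shape=(K, N),
                                    strides=(stride_bk, stride_bn),
                                    offsets=(0, pid_n * BLOCK_N),
                                    block_shape=(BLOCK_K, BLOCK_N), order=(1, 0))
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for _ in range(0, K, BLOCK_K):
        a = tl.load(a_block_ptr, boundary_check=(0, 1))
        b = tl.load(b_block_ptr, boundary_check=(0, 1))
        acc += tl.dot(a, b)  # candidate for DPAS via the accelerate-matmul pass
        a_block_ptr = tl.advance(a_block_ptr, (0, BLOCK_K))
        b_block_ptr = tl.advance(b_block_ptr, (BLOCK_K, 0))
    # Assumes a float32 C tensor for simplicity.
    c_block_ptr = tl.make_block_ptr(base=c_ptr, shape=(M, N),
                                    strides=(stride_cm, stride_cn),
                                    offsets=(pid_m * BLOCK_M, pid_n * BLOCK_N),
                                    block_shape=(BLOCK_M, BLOCK_N), order=(1, 0))
    tl.store(c_block_ptr, acc, boundary_check=(0, 1))
```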

TODO:

  • Benchmark.
  • Use the sub-group-size=32.
  • Use the 2D store.
  • Add double GRF as a compile configuration.
  • Change the convert-layout and emit-index logic to use dense strides for the dot operand and DPAS layouts.

Memo: sub-group-size=32 causes some DPAS unit tests to fail.
Memo: Need to upstream the nested layouts of the dot operand layout (slice-of-dot layout and dot-of-dot layout).
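
Also for context, a conceptual sketch of the loop structure the pipelining-with-prefetching pass produces: the tile for iteration k+1 is prefetched while iteration k computes, overlapping memory latency with DPAS work. This is not the pass implementation; `prefetch_tile`, `load_tile`, and `compute` are hypothetical placeholders.

```python
# Conceptual sketch of software pipelining with prefetching. The real
# pass performs this transformation on Triton IR; the helper names here
# (prefetch_tile, load_tile, compute) are hypothetical.

def pipelined_loop(load_tile, prefetch_tile, compute, num_tiles):
    prefetch_tile(0)                  # prologue: issue the first prefetch
    for k in range(num_tiles):
        if k + 1 < num_tiles:
            prefetch_tile(k + 1)      # next tile; lowered to 2D prefetch ops
        a, b = load_tile(k)           # current tile; should hit prefetched data
        compute(a, b)                 # DPAS dot on the current tile
```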

@chengjunlu force-pushed the chengjun/llvm-target-dev branch 13 times, most recently from 759e4e1 to d3c482b on Apr 24, 2024
@chengjunlu force-pushed the chengjun/llvm-target-dev branch 4 times, most recently from 75ed1be to ec24ea5 on Apr 29, 2024
@chengjunlu force-pushed the chengjun/llvm-target-dev branch 4 times, most recently from 7b22c45 to 0c25d6a on May 8, 2024
@whitneywhtsang marked this pull request as draft on May 11, 2024
@chengjunlu force-pushed the chengjun/llvm-target-dev branch 2 times, most recently from 22daa3f to bd400b0 on May 15, 2024
@chengjunlu closed this on May 15, 2024
… could be supported by Intel GPU hardware 2D memory accesses, to protect the block pointer from being rewritten in the RewriteTensorPointer pass.