Improve GEMM perf when one matrix is transposed #2347
Conversation
@alexbaden:
Follows the Triton LLVM debug syntax. Allows you to dump various parameters when running `triton-opt` with `-debug`. cc #2347
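For context, this refers to the standard LLVM debug idiom; a minimal sketch, assuming the usual `LLVM_DEBUG`/`llvm::dbgs()` macros (the `DEBUG_TYPE` string and function here are illustrative, not the PR's code):

```cpp
#include "llvm/Support/Debug.h"
#include "mlir/IR/Operation.h"

#define DEBUG_TYPE "tritonintelgpu-materialize-block-pointer" // illustrative name

// Emits output only when triton-opt runs with -debug (or
// -debug-only=<DEBUG_TYPE>); the call compiles out of release builds.
static void logCandidate(mlir::Operation *op) {
  LLVM_DEBUG(llvm::dbgs() << "[" DEBUG_TYPE "] considering: " << *op << "\n");
}
```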
Per the Triton Slack, `order` is unused on architectures below Hopper. But more importantly, `order` provides information that stride already has. In fact, `order` can be completely different from the strides (i.e. wrong) and we still generate correct code. I think it is better to use the stride, assuming the logic I added here makes sense. Note this depends on #2348; I'd like to land the debug logging separately, so we have it even if we decide to modify this approach. It was very useful in debugging this problem. cc #2347
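A hedged sketch of the kind of stride-based check this describes (my reading of the approach, not the PR's actual code): for a 2D block pointer, the dimension with unit stride is the contiguous one, regardless of what `order` claims.

```cpp
#include <cstdint>

// Which layout do the strides of a 2D block imply? Row-major means the
// column index has unit stride; column-major means the row index does.
// The names and the Unknown fallback are illustrative assumptions.
enum class Layout { RowMajor, ColumnMajor, Unknown };

static Layout classifyFromStrides(int64_t rowStride, int64_t colStride) {
  if (colStride == 1 && rowStride != 1)
    return Layout::RowMajor;    // consecutive columns adjacent in memory
  if (rowStride == 1 && colStride != 1)
    return Layout::ColumnMajor; // consecutive rows adjacent in memory
  return Layout::Unknown;       // ambiguous, or strided in both dimensions
}
```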
Force-pushed from 63cb029 to 835e8a0.
My preference is to remove the RewriteTensorPointer pass after #2181 lands. Would that work for your test case?
Yes, this change is only to keep RewriteTensorPointer from removing the blocked load.
Force-pushed from 835e8a0 to 4705bde.
Merging #2181 resolved the fused attention issue, because the transposed load is now lowered to a gather load instead of failing to lower to a 2D blocked load.
I think that removing the … If we land PR #2359, this PR can presumably be rebased and the LIT test kept.
@chengjunlu @whitneywhtsang I modified the RewriteTensorPointer pass to remove …
main: matrix performance is unchanged.
I'm thinking we could check the `block_io` attribute of the users of the `MakeTensorPtrOp`; that way we don't need to maintain a copy of the same condition checks in RewriteTensorPointer. WDYT?
Interesting - I can try it. But let's make it a separate PR. I want to show some progress on the original transpose issue, and I am not 100% confident that I understand the logic well enough to handle all cases, so it will need some testing/review.
Sure, let's do that in a separate PR.
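For illustration, a rough sketch of what that suggestion might look like; the attribute spelling and helper name are assumptions, not the repository's actual code:

```cpp
#include "mlir/IR/Operation.h"
#include "mlir/IR/Value.h"

// Check whether any user of a tt.make_tensor_ptr result already carries a
// block-IO marker attribute; RewriteTensorPointer could key off this instead
// of re-deriving the same layout conditions itself.
static bool anyUserHasBlockIO(mlir::Value tensorPtr) {
  for (mlir::Operation *user : tensorPtr.getUsers())
    if (user->hasAttr("block_io")) // placeholder attribute spelling
      return true;
  return false;
}
```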
The 2D block load/store does not work when one of the input matrices to a `tt.dot` is transposed inside the Triton kernel using the `stride` parameter. In the user example, the block pointer is transposed using stride, but the `order` parameter is left unchanged. This results in `materialize-block-pointer` being unable to detect that a `column-major` `block_io` attribute should be added to the matrix. Even if this attribute were added, `rewrite-tensor-pointer` would remove the block pointer, because column major was not supported.

This PR adds support for detecting `column-major` based on `stride` instead of `order`, and also brings the same logic to `rewrite-tensor-pointer` so that the column-major load is preserved and eventually lowered to a 2D block load. With this, transposed matrix performance is more in line with the non-transposed version.

I know we have plans to remove `rewrite-tensor-pointer` eventually, but column major support did not appear difficult, and the performance is nearly 3x better.

I am planning to PR each commit individually, starting with the debug logging, but wanted to open this umbrella PR to show how the entire pipeline fits together.
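To illustrate the mechanism described above (illustrative C++, not kernel code): the linear offset of an element depends only on the strides, so swapping strides is what actually transposes the access pattern; a stale `order` hint changes nothing about which element is read, which is why stride is the trustworthy signal.

```cpp
#include <cstdint>

// Offset of element (row, col) in a 2D buffer described by strides alone.
static int64_t offset2D(int64_t row, int64_t col,
                        int64_t rowStride, int64_t colStride) {
  return row * rowStride + col * colStride;
}

// Row-major K x N matrix:       strides {N, 1} -> offset(k, n) = k*N + n.
// Same buffer, strides swapped: strides {1, K} -> offset(k, n) = k + n*K,
// i.e. a column-major (transposed) traversal, regardless of `order`.
```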
Close #1795