Add a Conversion from tt.load to genx.matrix.2Dblockload for 2D blocked tensor pointer cases #413
Comments
Besides the stride limitation, there are still restrictions due to HW limitations:
The cases in 08-experimental-block-pointer and test_block_pointer.py produce IR with a constant last-dim stride.
Based on the previous discussion, I submitted #941. There are two ways of implementing this; the idea is similar, and the differences would be
Split #941 into 2 PRs for better review:
Helping with refactoring & code review.
This issue is the first stage of task [DPAS]: Use 2d-loads instruction to load the operand of tt.dot #146
The Triton CUDA pipeline lowers 2D loads to TMALoadTiledOp, which has no limitation on the last-dim strides.
For the Triton XPU pipeline, if we want to leverage GenISA_LSC2DBlockRead, the last dimension must be contiguous (stride[-1] = 1) for each block pointer.
So, in the first stage, the 2D-load conversion lowering pass will check the stride attribute: if the last-dim stride is a compile-time constant and the access meets the 2D block load requirements, we can leverage genx.matrix.2Dblockload; otherwise, it falls back to regular loads.
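
As a rough illustration of that check, here is a minimal sketch of the predicate such a pass might apply, assuming the block pointer is created by triton::MakeTensorPtrOp and relying on standard MLIR constant matchers. The helper name canUse2DBlockLoad is hypothetical, not the actual pass code:

```cpp
// Hypothetical sketch: decide whether a tt.load on a block pointer can be
// lowered to genx.matrix.2Dblockload, or must fall back to regular loads.
#include "mlir/IR/Matchers.h"
#include "triton/Dialect/Triton/IR/Dialect.h"

using namespace mlir;

static bool canUse2DBlockLoad(triton::LoadOp loadOp) {
  // The load must come from a block pointer built by tt.make_tensor_ptr.
  auto makePtr = loadOp.getPtr().getDefiningOp<triton::MakeTensorPtrOp>();
  if (!makePtr)
    return false;

  ValueRange strides = makePtr.getStrides();
  if (strides.empty())
    return false;

  // The HW 2D block read requires the last dimension to be contiguous,
  // i.e. stride[-1] == 1, and this must be provable at compile time.
  llvm::APInt lastStride;
  if (!matchPattern(strides.back(), m_ConstantInt(&lastStride)))
    return false; // stride is not a compile-time constant

  return lastStride == 1;
}
```

A predicate along these lines matches the staged plan above: only loads with a provably unit last-dim stride take the genx.matrix.2Dblockload path, and everything else keeps going through the existing regular-load lowering.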