Add a better performing config in the matmul example #1139

Closed
wants to merge 2 commits

Conversation


@fcharras fcharras commented May 16, 2024

It has been reported in #1122 that the performance of the matmul tutorial is well below torch.matmul performance.

After playing with the parameters, I found that the current grid search does not seem well suited to the Max Series GPU.

Adding this set of parameters to the grid search (essentially taking the config that is currently selected as the best one and raising num_warps from 2 to 16) gives a large (roughly 3x) speedup on the 512 x 512 matmul:

In [4]: %time matmul(a, b)[0,0].cpu()
CPU times: user 343 µs, sys: 674 µs, total: 1.02 ms
Wall time: 965 µs
Out[4]: tensor(12.5781, dtype=torch.float16)

In [5]: %time torch.matmul(a, b)[0,0].cpu()
CPU times: user 262 µs, sys: 513 µs, total: 775 µs
Wall time: 715 µs
Out[5]: tensor(12.5781, dtype=torch.float16)

For comparison, these are the timings I get from the tutorial on the current main branch:

In [41]: %time matmul(a, b)[0,0].cpu()
CPU times: user 2.58 ms, sys: 0 ns, total: 2.58 ms
Wall time: 2.51 ms
Out[41]: tensor(12.5781, dtype=torch.float16)

In [42]: %time torch.matmul(a, b)[0,0].cpu()
CPU times: user 0 ns, sys: 900 µs, total: 900 µs
Wall time: 826 µs
Out[42]: tensor(12.5781, dtype=torch.float16)
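
For reference, a minimal sketch of a setup that reproduces this kind of comparison; the exact tensor creation is not shown above, so the sizes, dtype and device are assumptions, and matmul is assumed to be the wrapper already defined in the session from the tutorial:

import torch

# Assumed setup: 512 x 512 float16 inputs on the xpu device.
a = torch.randn((512, 512), device='xpu', dtype=torch.float16)
b = torch.randn((512, 512), device='xpu', dtype=torch.float16)

matmul(a, b)        # warm-up run so autotuning/compilation is excluded from the timing
torch.matmul(a, b)  # warm-up for the reference path as well

%time matmul(a, b)[0, 0].cpu()        # .cpu() forces a device sync
%time torch.matmul(a, b)[0, 0].cpu()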

I didn't go further in trying to tune the grid search; this is just a single change that I noticed greatly improves performance on Max Series. The grid search could probably be tweaked further for additional speedups.

@vlad-penkin vlad-penkin linked an issue May 16, 2024 that may be closed by this pull request
@@ -189,6 +189,8 @@ def get_cuda_autotune_config():
                       num_warps=2),
         triton.Config({'BLOCK_SIZE_M': 32, 'BLOCK_SIZE_N': 64, 'BLOCK_SIZE_K': 32, 'GROUP_SIZE_M': 8}, num_stages=5,
                       num_warps=2),
+        triton.Config({'BLOCK_SIZE_M': 32, 'BLOCK_SIZE_N': 64, 'BLOCK_SIZE_K': 32, 'GROUP_SIZE_M': 8}, num_stages=5,
+                      num_warps=16),
Contributor

Added get_xpu_autotune_config in #1147; we can add this config for the XPU backend in that function.
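
A minimal sketch of what that could look like, assuming get_xpu_autotune_config from #1147 returns a list of triton.Config entries (the existing entries are elided here):

def get_xpu_autotune_config():
    return [
        # ... existing XPU configs ...
        # Proposed addition: same block sizes as the config currently selected as
        # best, but with num_warps raised from 2 to 16.
        triton.Config({'BLOCK_SIZE_M': 32, 'BLOCK_SIZE_N': 64, 'BLOCK_SIZE_K': 32, 'GROUP_SIZE_M': 8},
                      num_stages=5, num_warps=16),
    ]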

@whitneywhtsang whitneywhtsang requested a review from chengjunlu May 17, 2024 04:37
@whitneywhtsang
Contributor

We can copy the autotune configs from https://github.com/intel/intel-xpu-backend-for-triton/blob/llvm-target/python/tutorials/09-experimental-block-pointer.py#L102.

@whitneywhtsang
Contributor

@fcharras Please reopen this PR or create a new PR if there are any additional configs you would like added. Thanks for your contribution.

Successfully merging this pull request may close these issues.

Tutorial example 03 performance issue