Add a better performing config in the matmul example #1139

Closed
wants to merge 2 commits

Conversation


@fcharras fcharras commented May 16, 2024

It has been reported in #1122 that the performance of the matmul tutorial is well below torch.matmul performance.

After playing with the parameters, I found that the current grid search does not seem well suited to the Max Series GPU.

Adding this set of parameters to the grid search (essentially taking the config that is currently selected as the best one and raising num_warps from 2 to 16) gives a large (roughly 3x) speedup on the 512 x 512 matmul:

In [4]: %time matmul(a, b)[0,0].cpu()
CPU times: user 343 µs, sys: 674 µs, total: 1.02 ms
Wall time: 965 µs
Out[4]: tensor(12.5781, dtype=torch.float16)

In [5]: %time torch.matmul(a, b)[0,0].cpu()
CPU times: user 262 µs, sys: 513 µs, total: 775 µs
Wall time: 715 µs
Out[5]: tensor(12.5781, dtype=torch.float16)

For comparison, these are the timings I get from the tutorial on the current main branch:

In [41]: %time matmul(a, b)[0,0].cpu()
CPU times: user 2.58 ms, sys: 0 ns, total: 2.58 ms
Wall time: 2.51 ms
Out[41]: tensor(12.5781, dtype=torch.float16)

In [42]: %time torch.matmul(a, b)[0,0].cpu()
CPU times: user 0 ns, sys: 900 µs, total: 900 µs
Wall time: 826 µs
Out[42]: tensor(12.5781, dtype=torch.float16)
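
For reference, a minimal sketch of a setup that reproduces this kind of comparison; the exact tensor creation is not shown above, so the sizes, dtype and device are assumptions, and matmul is assumed to be the wrapper already defined in the session from the tutorial:

import torch

# Assumed setup: 512 x 512 float16 inputs on the xpu device.
a = torch.randn((512, 512), device='xpu', dtype=torch.float16)
b = torch.randn((512, 512), device='xpu', dtype=torch.float16)

matmul(a, b)        # warm-up run so autotuning/compilation is excluded from the timing
torch.matmul(a, b)  # warm-up for the reference path as well

%time matmul(a, b)[0, 0].cpu()        # .cpu() forces a device sync
%time torch.matmul(a, b)[0, 0].cpu()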

I didn't go further in trying to tune the grid search; this is just a single change that I noticed greatly improves performance on Max Series. The grid search could probably be tweaked further for additional speedups.

@vlad-penkin vlad-penkin linked an issue May 16, 2024 that may be closed by this pull request
@@ -189,6 +189,8 @@ def get_cuda_autotune_config():
                       num_warps=2),
         triton.Config({'BLOCK_SIZE_M': 32, 'BLOCK_SIZE_N': 64, 'BLOCK_SIZE_K': 32, 'GROUP_SIZE_M': 8}, num_stages=5,
                       num_warps=2),
+        triton.Config({'BLOCK_SIZE_M': 32, 'BLOCK_SIZE_N': 64, 'BLOCK_SIZE_K': 32, 'GROUP_SIZE_M': 8}, num_stages=5,
+                      num_warps=16),
Contributor

Added get_xpu_autotune_config in #1147; we can add this config for the XPU backend in that function.
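
A minimal sketch of what that could look like, assuming get_xpu_autotune_config from #1147 returns a list of triton.Config entries (the existing entries are elided here):

def get_xpu_autotune_config():
    return [
        # ... existing XPU configs ...
        # Proposed addition: same block sizes as the config currently selected as
        # best, but with num_warps raised from 2 to 16.
        triton.Config({'BLOCK_SIZE_M': 32, 'BLOCK_SIZE_N': 64, 'BLOCK_SIZE_K': 32, 'GROUP_SIZE_M': 8},
                      num_stages=5, num_warps=16),
    ]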

@whitneywhtsang whitneywhtsang requested a review from chengjunlu May 17, 2024 04:37
@whitneywhtsang
Contributor

We can copy the autotune configs from https://github.com/intel/intel-xpu-backend-for-triton/blob/llvm-target/python/tutorials/09-experimental-block-pointer.py#L102.

@whitneywhtsang
Contributor

@fcharras Please reopen this PR or create a new PR if there are any additional configs you would like added. Thanks for your contribution.

Successfully merging this pull request may close these issues.

Tutorial example 03 performance issue