[Performance] `A@B.T` - The GEMM performance with the column major B matrix is not as good as row major B matrix. #2354

chengjunlu · 2024-09-26T00:09:55Z

The performance gap was found in #2347.

We need to investigate the root cause of the performance drop in the column-major B matrix case, which is roughly 1.5x slower than the row-major B matrix case.

(I): Detected 7680 spills, recompiling the kernel using large GRF mode
(I): Kernel has now 0 spills
✅ Triton and Torch match
Time for torch: 0.31633758544921875 ms
Time for triton: 0.44517597556114197 ms
Compute A x B.T
OpenCL API not available for this operation
OpenCL API not available for this operation
OpenCL API not available for this operation
OpenCL API not available for this operation
(I): Detected 7680 spills, recompiling the kernel using large GRF mode
(I): Kernel has now 0 spills
✅ Triton and Torch match
Time for torch: 0.3375360071659088 ms
Time for triton: 0.6348815560340881 ms
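For reference, the "roughly 1.5x" figure can be checked directly from the timings in the log. This is only a small arithmetic sketch; it assumes the first (unlabeled) run is the row-major B case, and the dictionary keys and the `slowdown` helper are names made up for illustration.

```python
# Timings in milliseconds, copied from the log above.
# Assumption: the first (unlabeled) run is the row-major B case, A @ B.
timings_ms = {
    "A @ B": {"torch": 0.31633758544921875, "triton": 0.44517597556114197},
    "A @ B.T": {"torch": 0.3375360071659088, "triton": 0.6348815560340881},
}


def slowdown(backend):
    """How many times slower the A @ B.T run is than the A @ B run."""
    return timings_ms["A @ B.T"][backend] / timings_ms["A @ B"][backend]


triton_slowdown = slowdown("triton")
torch_slowdown = slowdown("torch")

print(f"triton: A @ B.T is {triton_slowdown:.2f}x slower than A @ B")
print(f"torch:  A @ B.T is {torch_slowdown:.2f}x slower than A @ B")
```

With these numbers the Triton slowdown comes out at about 1.43x, while the Torch time changes by under 10%.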

Egor-Krivov · 2024-10-04T08:44:10Z

I think this issue is essential for GEMM perf. Weights are very often stored with the K dimension last. Even the PyTorch linear layer does that: `weight` (torch.Tensor): the learnable weights of the module, of shape (out_features, in_features).

https://pytorch.org/docs/stable/generated/torch.nn.Linear.html
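To make the layout concrete, here is a pure-Python sketch (no PyTorch needed) of why a weight of shape (out_features, in_features) leads to a column-major B operand. The helper names `row_major_strides` and `transpose_view` and the sizes are hypothetical, chosen only for illustration; strides are counted in elements.

```python
def row_major_strides(shape):
    """Element strides of a contiguous row-major array with the given shape."""
    strides = []
    step = 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


def transpose_view(shape, strides):
    """A transpose is a view: it swaps shape and strides without moving data."""
    return tuple(reversed(shape)), tuple(reversed(strides))


# Hypothetical sizes: N = out_features, K = in_features.
out_features, in_features = 8, 4

# The weight is stored as (N, K) row-major, so K is the fastest-varying dimension.
w_shape = (out_features, in_features)
w_strides = row_major_strides(w_shape)

# The B operand of x @ W.T is the (K, N) transposed view of the same memory.
bt_shape, bt_strides = transpose_view(w_shape, w_strides)

print("W   shape/strides:", w_shape, w_strides)
print("W.T shape/strides:", bt_shape, bt_strides)
```

The transposed view has stride 1 along its first (K) dimension, which is exactly the column-major B case this issue is about.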

alexbaden · 2024-10-11T02:16:16Z

Adding to this, if the A matrix is column-major we have similar problems.

Egor-Krivov · 2024-10-11T13:33:52Z

We now have microbenchmarks to track this performance. Currently the GeoMean for oneDNN is ~90-100 TFLOPs for both the A.T@B and A@B.T cases.

A@B.T for Triton currently stands at ~60 TFLOPs (dashboard: gemm-bt).
A.T@B for Triton currently stands at ~30 TFLOPs; it has improved significantly, up from ~15 TFLOPs recently (dashboard: gemm-at).

So oneDNN is about 1.5 times faster for B.T and about 3 times faster for A.T.
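As a rough sketch of the arithmetic behind these figures: a GEMM with dimensions M, N, K performs about 2·M·N·K floating-point operations, so throughput follows from the measured time. The `gemm_tflops` helper and the variable names below are hypothetical, and the throughput values are the approximate GeoMeans quoted above, not new measurements.

```python
def gemm_tflops(m, n, k, time_ms):
    """Throughput in TFLOPs for an (m x k) @ (k x n) GEMM that took time_ms."""
    flops = 2 * m * n * k  # one multiply and one add per inner-product term
    return flops / (time_ms * 1e-3) / 1e12


# Approximate GeoMean figures quoted in this thread (TFLOPs).
onednn_low, onednn_high = 90.0, 100.0
triton_bt = 60.0  # A@B.T
triton_at = 30.0  # A.T@B

bt_gap = (onednn_low / triton_bt, onednn_high / triton_bt)
at_gap = (onednn_low / triton_at, onednn_high / triton_at)

print(f"oneDNN vs Triton, A@B.T: {bt_gap[0]:.2f}x to {bt_gap[1]:.2f}x")
print(f"oneDNN vs Triton, A.T@B: {at_gap[0]:.2f}x to {at_gap[1]:.2f}x")
```

The lower end of the oneDNN range gives the 1.5x and 3x gaps mentioned above; the upper end makes both gaps slightly larger.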

Egor-Krivov · 2024-10-11T13:35:04Z

@alexbaden Should we change the title to reflect the issue with A.T as well, or create a separate issue for that case?

alexbaden · 2024-11-16T13:50:09Z

Current Triton tiling strategy for DPAS for AxBT:

oneDNN tiling strategy mapped to Triton (thanks to @Jianhui-Li and @chengjunlu):

I plan to try to implement the oneDNN strategy in Triton.

…2956) Required for #2834. Two reasons to do this: one, it properly tags the layouts with their memory order very early in the TTGIR pipeline; two, it moves our TTGIR pipeline closer to upstream. I am splitting the change to isolate any regressions or undesired behavior caused by this change from those caused by changing the DPAS layouts in #2834. cc #2354

chengjunlu mentioned this issue Sep 26, 2024

Improve GEMM perf when one matrix is transposed #2347

Merged

vlad-penkin added the performance and enhancement labels Sep 27, 2024

vlad-penkin added this to the 4.0 [Performance] Core milestone Sep 27, 2024

Egor-Krivov mentioned this issue Oct 4, 2024

[Benchmarks] Add microbenchmark with A@B^t #2414

Closed

alexbaden mentioned this issue Oct 11, 2024

[GEMM-perf] matmul is slower when one input needs to be transposed #1795

Closed

vlad-penkin assigned alexbaden Oct 15, 2024

vlad-penkin changed the title ~~[Performance] The GEMM performance with the column major B matrix is not as good as row major B matrix.~~ [Performance] A@B.T - The GEMM performance with the column major B matrix is not as good as row major B matrix. Oct 21, 2024

alexbaden linked a pull request Nov 26, 2024 that will close this issue

Use order from A matrix when determining DPAS layout #2834

Draft

alexbaden mentioned this issue Dec 6, 2024

Permute the pass pipeline to coalesce before setting up the matmul #2956

Merged

vlad-penkin added the umbrella label Dec 16, 2024
