Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Matmul nbits to optimize memory layout for avx instructions #22203

Draft
wants to merge 23 commits into
base: main
Choose a base branch
from

Conversation

liqunfu
Copy link
Contributor

@liqunfu liqunfu commented Sep 24, 2024

The main purpose of this PR is to remove sqnbit's dependency on sgemm in x86/x64 cases. The benefit is a cleaner memory layout not requiring memory alignment, no need for the Rows by 16-bytes memory layout required by sgemm. It also offers slight performance improvement.
A second improvement in the PR is to reduce memory footprint by fully packing zero point and scales. There is no need for these inputs after they are packed with weights.

The following performance data is to show that new code does not downgrade performance (if not improve):

Avx2 M=1, Asymmetric:

blklen baseline time (ns) updated time (ns)
32 35514 21530
64 36188 18863
128 30303 21186
256 32058 17880

Avx2 M=1, Symmetric:

blklen baseline time (ns) updated time (ns)
32 25863 19933
64 25610 22487
128 27239 20008
256 24154 21795

Avx2 M=128, Asymmetric:

blklen baseline time (ns) updated time (ns)
32 1903033 1858414
64 1786323 1819076
128 1884952 1790135
256 1906534 1706993

Avx2 M=128, Symmetric:

blklen baseline time (ns) updated time (ns)
32 1777207 1897442
64 1833315 1805860
128 1689521 1735043
256 1685658 1652083

Avx512vnni M=1, Asymmetric:

blklen baseline time (ns) updated time (ns)
32 22733 23498
64 22144 23345
128 19368 17810
256 19318 18823

Avx512vnni M=1, Symmetric:

blklen baseline time (ns) updated time (ns)
32 22410 28872
64 24994 23917
128 65785 65160
256 20412 20629

Avx512vnni M=128, Asymmetric:

blklen baseline time (ns) updated time (ns)
32 1616597 1355684
64 1453165 1464413
128 1116153 1093754
256 959254 989052

Avx512vnni M=128, Symmetric:

blklen baseline time (ns) updated time (ns)
32 1603280 1387044
64 1421595 1459699
128 1110027 1061157
256 933319 965465

Avx512 M=1, Asymmetric:

blklen baseline time (ns) updated time (ns)
32 23598 24242
64 22564 22820
128 21043 26688
256 22333 21199

Avx512 M=1, Symmetric:

blklen baseline time (ns) updated time (ns)
32 23520 25145
64 52621 23752
128 30848 21809
256 20594 21390

Avx512 M=128, Asymmetric:

blklen baseline time (ns) updated time (ns)
32 1653963 1598588
64 1635840 1579680
128 1633040 1595919
256 1461328 1464798

Avx512 M=128, Symmetric:

blklen baseline time (ns) updated time (ns)
32 1755299 1633517
64 1608648 1569993
128 1648288 1688076
256 1454290 1482201

Avx2vnni M=1, Asymmetric:

blklen baseline time (ns) updated time (ns)
32 21642 12166
64 19835 10350
128 20185 9565
256 19356 10586

Avx2vnni M=1, Symmetric:

blklen baseline time (ns) updated time (ns)
32 15515 12744
64 15347 10068
128 16598 9409
256 17510 9833

Avx2vnni M=128, Asymmetric:

blklen baseline time (ns) updated time (ns)
32 1040664 1105827
64 832389 859634
128 815307 819965
256 809460 823504

Avx2vnni M=128, Symmetric:

blklen baseline time (ns) updated time (ns)
32 1039106 1066090
64 874908 860423
128 815173 818668
256 819842 809170

Signed-off-by: liqunfu <liqun.fu@microsoft.com>
@liqunfu liqunfu requested a review from a team as a code owner September 24, 2024 15:50
@liqunfu liqunfu marked this pull request as draft September 24, 2024 15:50
Signed-off-by: liqunfu <liqun.fu@microsoft.com>
Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can commit the suggested changes from lintrunner.

onnxruntime/test/contrib_ops/matmul_4bits_test.cc Outdated Show resolved Hide resolved
}

TEST(MatMulNBits, LongTestFloat32) {
// onnxruntime::profiling::Profiler::Profiler::Instance().StartProfiling<char>("profile.json");

Check notice

Code scanning / CodeQL

Commented-out code Note test

This comment appears to contain commented-out code.
Signed-off-by: liqunfu <liqun.fu@microsoft.com>
…hus not to implement avx512

Signed-off-by: liqunfu <liqun.fu@microsoft.com>
… to be in a separate loop. defer this work later

Signed-off-by: liqunfu <liqun.fu@microsoft.com>
Signed-off-by: liqunfu <liqun.fu@microsoft.com>
Signed-off-by: liqunfu <liqun.fu@microsoft.com>
Signed-off-by: liqunfu <liqun.fu@microsoft.com>
liqunfu and others added 8 commits December 13, 2024 10:09
Signed-off-by: Liqun Fu <liqun.fu@microsoft.com>
Signed-off-by: Liqun Fu <liqun_fu@hotmail.com>
Signed-off-by: Liqun Fu <liqun.fu@microsoft.com>
Signed-off-by: Liqun Fu <liqun.fu@microsoft.com>
Signed-off-by: Liqun Fu <liqun.fu@microsoft.com>
Signed-off-by: Liqun Fu <liqun.fu@microsoft.com>
@@ -55,6 +55,7 @@
__m512i sum_16_epi32 = _mm512_madd_epi16(one_32_epi16, sum_32_epi16);
__m512 sum_16_ps = _mm512_cvtepi32_ps(sum_16_epi32);
acc = _mm512_fmadd_ps(sum_16_ps, _mm512_set1_ps(combined_scale), acc);
// acc = _mm512_fmadd_ps(sum_16_ps, load_broadcast_512(combined_scale), acc);

Check notice

Code scanning / CodeQL

Commented-out code Note

This comment appears to contain commented-out code.
Comment on lines +52 to +56
// folowing 2 lines do the same with close perf (more latency count).
// it requires CPUID Flags: AVX512DQ which is more restricted
// const __m256 scale_b_ps = _mm256_castpd_ps(_mm256_broadcast_sd(combined_scale));
// const __m512 scale_b_16_ps = _mm512_broadcast_f32x8(scale_b_ps);
// return;

Check notice

Code scanning / CodeQL

Commented-out code Note

This comment appears to contain commented-out code.
Signed-off-by: Liqun Fu <liqun.fu@microsoft.com>
Signed-off-by: liqunfu <liqun.fu@microsoft.com>
…will not compile on Cuda CI

Signed-off-by: liqunfu <liqun.fu@microsoft.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant