Matmul nbits to optimize memory layout for avx instructions #22203

liqunfu · 2024-09-24T15:50:37Z

The main purpose of this PR is to remove sqnbit's dependency on sgemm on x86/x64. This gives a cleaner memory layout: it needs no memory alignment and avoids the rows-by-16-bytes layout that sgemm requires. It also gives a slight performance improvement.
A second improvement is a smaller memory footprint from fully packing the zero points and scales. Once they are packed with the weights, the original inputs are no longer needed.
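
As a rough illustration of why the zero points and scales can be dropped after packing, here is a minimal scalar sketch. It assumes the zero point is folded into a per-block term `scaled_zp = -scale * zero_point` at pack time; the commit messages mention `scaled_zp` and `blksum`, but the struct, the function names, and the one-weight-per-byte storage below are hypothetical and are not the MLAS layout.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical packed block, for illustration only. Real 4-bit weights are
// stored two per byte; one per byte keeps the sketch short.
struct PackedBlock {
    std::vector<uint8_t> q;  // quantized weights in [0, 15]
    float scale;             // per-block scale
    float scaled_zp;         // precomputed -scale * zero_point
};

// Fold the zero point into scaled_zp once, at pack time.
PackedBlock PackBlock(const std::vector<uint8_t>& q, float scale, uint8_t zero_point) {
    return PackedBlock{q, scale, -scale * static_cast<float>(zero_point)};
}

// Reference: dequantize every weight with the original scale and zero point.
float DotReference(const std::vector<float>& a, const std::vector<uint8_t>& q,
                   float scale, uint8_t zero_point) {
    assert(a.size() == q.size());
    float acc = 0.0f;
    for (std::size_t k = 0; k < a.size(); ++k) {
        acc += a[k] * scale * (static_cast<float>(q[k]) - static_cast<float>(zero_point));
    }
    return acc;
}

// Packed: sum(a*q) * scale + sum(a) * scaled_zp. The zero point is never read.
float DotPacked(const std::vector<float>& a, const PackedBlock& b) {
    assert(a.size() == b.q.size());
    float dot = 0.0f;
    float a_sum = 0.0f;
    for (std::size_t k = 0; k < a.size(); ++k) {
        dot += a[k] * static_cast<float>(b.q[k]);
        a_sum += a[k];
    }
    return dot * b.scale + a_sum * b.scaled_zp;
}
```

Both functions compute the same block dot product up to floating-point rounding, since `scale * (q - zp)` expands to `scale * q - scale * zp`. The packed form only needs the per-block sum of the activations, which can be computed in a separate loop.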

The performance data below is intended to show that the new code does not regress performance and in some cases improves it:

Avx2 M=1, Asymmetric:

blklen	baseline time (ns)	updated time (ns)
32	35514	21530
64	36188	18863
128	30303	21186
256	32058	17880

Avx2 M=1, Symmetric:

blklen	baseline time (ns)	updated time (ns)
32	25863	19933
64	25610	22487
128	27239	20008
256	24154	21795

Avx2 M=128, Asymmetric:

blklen	baseline time (ns)	updated time (ns)
32	1903033	1858414
64	1786323	1819076
128	1884952	1790135
256	1906534	1706993

Avx2 M=128, Symmetric:

blklen	baseline time (ns)	updated time (ns)
32	1777207	1897442
64	1833315	1805860
128	1689521	1735043
256	1685658	1652083

Avx512vnni M=1, Asymmetric:

blklen	baseline time (ns)	updated time (ns)
32	22733	23498
64	22144	23345
128	19368	17810
256	19318	18823

Avx512vnni M=1, Symmetric:

blklen	baseline time (ns)	updated time (ns)
32	22410	28872
64	24994	23917
128	65785	65160
256	20412	20629

Avx512vnni M=128, Asymmetric:

blklen	baseline time (ns)	updated time (ns)
32	1616597	1355684
64	1453165	1464413
128	1116153	1093754
256	959254	989052

Avx512vnni M=128, Symmetric:

blklen	baseline time (ns)	updated time (ns)
32	1603280	1387044
64	1421595	1459699
128	1110027	1061157
256	933319	965465

Avx512 M=1, Asymmetric:

blklen	baseline time (ns)	updated time (ns)
32	23598	24242
64	22564	22820
128	21043	26688
256	22333	21199

Avx512 M=1, Symmetric:

blklen	baseline time (ns)	updated time (ns)
32	23520	25145
64	52621	23752
128	30848	21809
256	20594	21390

Avx512 M=128, Asymmetric:

blklen	baseline time (ns)	updated time (ns)
32	1653963	1598588
64	1635840	1579680
128	1633040	1595919
256	1461328	1464798

Avx512 M=128, Symmetric:

blklen	baseline time (ns)	updated time (ns)
32	1755299	1633517
64	1608648	1569993
128	1648288	1688076
256	1454290	1482201

Avx2vnni M=1, Asymmetric:

blklen	baseline time (ns)	updated time (ns)
32	21642	12166
64	19835	10350
128	20185	9565
256	19356	10586

Avx2vnni M=1, Symmetric:

blklen	baseline time (ns)	updated time (ns)
32	15515	12744
64	15347	10068
128	16598	9409
256	17510	9833

Avx2vnni M=128, Asymmetric:

blklen	baseline time (ns)	updated time (ns)
32	1040664	1105827
64	832389	859634
128	815307	819965
256	809460	823504

Avx2vnni M=128, Symmetric:

blklen	baseline time (ns)	updated time (ns)
32	1039106	1066090
64	874908	860423
128	815173	818668
256	819842	809170

Signed-off-by: liqunfu <liqun.fu@microsoft.com>

github-actions

You can commit the suggested changes from lintrunner.

onnxruntime/test/contrib_ops/matmul_4bits_test.cc

+}
+
+TEST(MatMulNBits, LongTestFloat32) {
+  // onnxruntime::profiling::Profiler::Profiler::Instance().StartProfiling<char>("profile.json");


onnxruntime/core/mlas/lib/sqnbitgemm_kernel_avx512_int8_blklen128.h

@@ -55,6 +55,7 @@
    __m512i sum_16_epi32 = _mm512_madd_epi16(one_32_epi16, sum_32_epi16);
    __m512 sum_16_ps = _mm512_cvtepi32_ps(sum_16_epi32);
    acc = _mm512_fmadd_ps(sum_16_ps, _mm512_set1_ps(combined_scale), acc);
+    // acc = _mm512_fmadd_ps(sum_16_ps, load_broadcast_512(combined_scale), acc);
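
For readers unfamiliar with the intrinsics in this hunk, the following is a scalar model of what the three lines compute, not the kernel itself. `_mm512_madd_epi16` against a vector of ones adds adjacent 16-bit pairs into 32-bit sums, and `_mm512_fmadd_ps` then accumulates those sums times the broadcast scale. The function names are hypothetical, and `std::vector` stands in for the 512-bit registers.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar model of _mm512_madd_epi16(ones, v): each output lane is the sum of
// two adjacent 16-bit inputs, widened to 32 bits.
std::vector<int32_t> MaddWithOnes(const std::vector<int16_t>& v) {
    assert(v.size() % 2 == 0);
    std::vector<int32_t> out(v.size() / 2);
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] = static_cast<int32_t>(v[2 * i]) + static_cast<int32_t>(v[2 * i + 1]);
    }
    return out;
}

// Scalar model of the convert-and-fmadd step:
// acc[i] = float(sums[i]) * combined_scale + acc[i], with the scale broadcast
// to every lane as _mm512_set1_ps does.
void ScaleAccumulate(const std::vector<int32_t>& sums, float combined_scale,
                     std::vector<float>& acc) {
    assert(sums.size() == acc.size());
    for (std::size_t i = 0; i < sums.size(); ++i) {
        acc[i] = std::fma(static_cast<float>(sums[i]), combined_scale, acc[i]);
    }
}
```

With 32 inputs this yields 16 accumulator lanes, matching the `__m512i` to `__m512` shapes in the hunk.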


onnxruntime/core/mlas/lib/sqnbitgemm_kernel_avx512_int8_blklen64.h

+    // the following 2 lines do the same with close perf (more latency count).
+    // they require CPUID flag AVX512DQ, which is more restrictive
+    // const __m256 scale_b_ps = _mm256_castpd_ps(_mm256_broadcast_sd(combined_scale));
+    // const __m512 scale_b_16_ps = _mm512_broadcast_f32x8(scale_b_ps);
+    // return;


matmul nbits to optimize memory layout for avx instructions

555e951

Signed-off-by: liqunfu <liqun.fu@microsoft.com>

liqunfu requested a review from a team as a code owner September 24, 2024 15:50

liqunfu marked this pull request as draft September 24, 2024 15:50

liqunfu added 2 commits November 7, 2024 21:28

Merge branch 'main' into liqun/avx-layout

076998c

intermediate push

99aec95

Signed-off-by: liqunfu <liqun.fu@microsoft.com>

github-actions bot reviewed Nov 18, 2024

onnxruntime/test/contrib_ops/matmul_4bits_test.cc (outdated)

liqunfu added 2 commits November 27, 2024 02:38

pass mlas and utest for blklen32 avx512

8ce1a2a

Signed-off-by: liqunfu <liqun.fu@microsoft.com>

Merge branch 'main' into liqun/avx-layout

f016555

github-advanced-security bot found potential problems Nov 27, 2024

onnxruntime/core/mlas/lib/sqnbitgemm_kernel_avx512.cpp (two alerts, both fixed)

github-advanced-security bot found potential problems Nov 27, 2024

liqunfu added 7 commits November 29, 2024 00:43

pass avx512/vnni-blklen32

d371c59

Signed-off-by: liqunfu <liqun.fu@microsoft.com>

pass avx512vnni-blklen128. plan to compute blksum in different loop t…

790b03f

…hus not to implement avx512 Signed-off-by: liqunfu <liqun.fu@microsoft.com>

attmpt to make blklen256 work. failed because blksum computation need…

557fbb0

… to be in a separate loop. defer this work later Signed-off-by: liqunfu <liqun.fu@microsoft.com>

avx512 blklen64 to compute blksum in a separate loop

6b28657

Signed-off-by: liqunfu <liqun.fu@microsoft.com>

avx512 scaled_zp compute in a separate loop except blklen16

0b867f8

Signed-off-by: liqunfu <liqun.fu@microsoft.com>

avx512, all blklens, scaled_zp compute in a separate loop

2e74f56

Signed-off-by: liqunfu <liqun.fu@microsoft.com>

Merge branch 'main' into liqun/avx-layout

0bf47f7

github-advanced-security bot found potential problems Dec 12, 2024

liqunfu and others added 8 commits December 13, 2024 10:09

avx2 passes

c19ae9e

Signed-off-by: Liqun Fu <liqun.fu@microsoft.com>

avxvnni, matmul_nbit kernel

b26b075

Signed-off-by: Liqun Fu <liqun_fu@hotmail.com>

mlas nbit print correct compType

7e99d50

Signed-off-by: Liqun Fu <liqun.fu@microsoft.com>

clean up a bit

f36ec96

Signed-off-by: Liqun Fu <liqun.fu@microsoft.com>

Merge branch 'main' into liqun/avx-layout

6d0404f

lint

5901b52

Signed-off-by: Liqun Fu <liqun.fu@microsoft.com>

remove unused __m512 load_1blksum_512(const float* BlksumPtr)

eba1908

Signed-off-by: Liqun Fu <liqun.fu@microsoft.com>

Merge branch 'main' into liqun/avx-layout

e8484eb

github-advanced-security bot found potential problems Jan 10, 2025

liqunfu added 3 commits January 10, 2025 15:08

sqnbitgemm_kernel_avx512.cpp to apply -mavx512f

6dac6ad

Signed-off-by: Liqun Fu <liqun.fu@microsoft.com>

undo sqnbitgemm_kernel_avx512.cpp to apply -mavx512f

429054a

Signed-off-by: liqunfu <liqun.fu@microsoft.com>

restore avx512 blklen32 from use special layout because related code …

b1d7474

…will not compile on Cuda CI Signed-off-by: liqunfu <liqun.fu@microsoft.com>
