Matmul nbits to optimize memory layout for avx instructions #22203
base: main
Conversation
Signed-off-by: liqunfu <liqun.fu@microsoft.com>
…hus not to implement avx512 Signed-off-by: liqunfu <liqun.fu@microsoft.com>
… to be in a separate loop. defer this work later Signed-off-by: liqunfu <liqun.fu@microsoft.com>
@@ -55,6 +55,7 @@
__m512i sum_16_epi32 = _mm512_madd_epi16(one_32_epi16, sum_32_epi16);
__m512 sum_16_ps = _mm512_cvtepi32_ps(sum_16_epi32);
acc = _mm512_fmadd_ps(sum_16_ps, _mm512_set1_ps(combined_scale), acc);
// acc = _mm512_fmadd_ps(sum_16_ps, load_broadcast_512(combined_scale), acc);
// The following 2 lines do the same with close performance (higher latency count),
// but they require the AVX512DQ CPUID flag, which is more restrictive:
// const __m256 scale_b_ps = _mm256_castpd_ps(_mm256_broadcast_sd(combined_scale));
// const __m512 scale_b_16_ps = _mm512_broadcast_f32x8(scale_b_ps);
// return;
…will not compile on Cuda CI Signed-off-by: liqunfu <liqun.fu@microsoft.com>
The main purpose of this PR is to remove sqnbit's dependency on sgemm in the x86/x64 kernels. The benefit is a cleaner memory layout: the packed buffer no longer needs the memory alignment, or the rows-by-16-bytes layout, that sgemm requires. It also yields a slight performance improvement.
A second improvement in this PR is a reduced memory footprint: zero points and scales are fully packed together with the weights, so these inputs are no longer needed once prepacking is done.
The following performance data shows that the new code does not degrade performance (and improves it in some cases):
[Benchmark charts attached for each configuration: Avx2, Avx2vnni, Avx512, and Avx512vnni; M=1 and M=128; Symmetric and Asymmetric quantization.]