Optimized matrix multiplications for i-quants on __aarch64__ #464

Merged
13 commits merged into Mozilla-Ocho:main on Jun 8, 2024

Conversation

ikawrakow
Contributor

i-quants offer better quantization quality than k-quants in the 2- and 3-bpw range, but are notoriously slow on the CPU. This PR brings a significant speedup on Arm CPUs, particularly for prompt processing. Performance is still lower than k-quants, but the performance gap is now substantially smaller.

The following table compares performance between the main branch and this PR for a 7B LLaMA model on an M2 Max CPU.

|               cpu_info | model_filename |     size | threads |  test | t/s (main) | t/s (PR) | Speedup |
| ---------------------: | -------------: | -------: | ------: | ----: | ---------: | -------: | ------: |
| M2 Max (+fp16+dotprod) |         iq2xxs | 1.73 GiB |       8 | pp512 |      16.50 |    61.16 |   3.707 |
| M2 Max (+fp16+dotprod) |          iq2xs | 1.89 GiB |       8 | pp512 |      19.09 |    57.42 |   3.008 |
| M2 Max (+fp16+dotprod) |           iq2m | 2.20 GiB |       8 | pp512 |      13.32 |    46.37 |   3.481 |
| M2 Max (+fp16+dotprod) |         iq3xxs | 2.41 GiB |       8 | pp512 |      12.30 |    48.60 |   3.951 |
| M2 Max (+fp16+dotprod) |           iq3m | 2.90 GiB |       8 | pp512 |      12.11 |    49.70 |   4.104 |
| M2 Max (+fp16+dotprod) |         iq2xxs | 1.73 GiB |       4 | tg128 |       7.73 |    11.03 |   1.427 |
| M2 Max (+fp16+dotprod) |         iq2xxs | 1.73 GiB |       8 | tg128 |      14.64 |    20.09 |   1.372 |
| M2 Max (+fp16+dotprod) |          iq2xs | 1.89 GiB |       4 | tg128 |       8.56 |    10.72 |   1.252 |
| M2 Max (+fp16+dotprod) |          iq2xs | 1.89 GiB |       8 | tg128 |      16.17 |    19.91 |   1.231 |
| M2 Max (+fp16+dotprod) |           iq2m | 2.20 GiB |       4 | tg128 |       6.34 |     7.44 |   1.174 |
| M2 Max (+fp16+dotprod) |           iq2m | 2.20 GiB |       8 | tg128 |      12.03 |    13.60 |   1.106 |
| M2 Max (+fp16+dotprod) |         iq3xxs | 2.41 GiB |       4 | tg128 |       5.98 |     6.78 |   1.134 |
| M2 Max (+fp16+dotprod) |         iq3xxs | 2.41 GiB |       8 | tg128 |      10.93 |    11.94 |   1.092 |
| M2 Max (+fp16+dotprod) |           iq3m | 2.90 GiB |       4 | tg128 |       5.62 |     5.95 |   1.059 |
| M2 Max (+fp16+dotprod) |           iq3m | 2.90 GiB |       8 | tg128 |      10.39 |    10.71 |   1.031 |
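The "+dotprod" tag in the cpu_info column refers to the Armv8.2 dot-product extension, which kernels like these can take advantage of. For readers unfamiliar with it, here is a minimal, self-contained illustration of what `vdotq_s32` does; this is not code from this PR, and compiling it requires something like `-march=armv8.2-a+dotprod`.

```cpp
// Illustration only (not the PR's kernel): vdotq_s32 multiplies four int8
// pairs per 32-bit lane and accumulates the sums, so 16 int8 products are
// folded into 4 int32 accumulators per instruction.
#include <arm_neon.h>
#include <cstdint>

// Hypothetical helper: dot product of two int8 vectors of length n (n % 16 == 0).
static int32_t dot_i8(const int8_t* x, const int8_t* y, int n) {
    int32x4_t acc = vdupq_n_s32(0);
    for (int i = 0; i < n; i += 16) {
        int8x16_t vx = vld1q_s8(x + i);
        int8x16_t vy = vld1q_s8(y + i);
        acc = vdotq_s32(acc, vx, vy);   // 16 int8 products -> 4 int32 lanes
    }
    return vaddvq_s32(acc);             // horizontal sum of the 4 lanes
}
```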

This carries over what I had done within llama.cpp. In llamafile we have nice performance gains for PP, but we get a performance regression for TG. For now, just adjusted iq2_xxs to also outperform in TG (~10% better @ 4 and 8 threads). Will tackle the other quants next.
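The PP/TG asymmetry comes from the shapes involved: prompt processing multiplies the quantized weights by many activation columns at once, while token generation is effectively a matrix-vector product (Ny = 1). Below is a rough, hypothetical sketch of why a multi-column kernel helps PP much more than TG; the names and structure are illustrative, not taken from the PR.

```cpp
#include <cstddef>

// Hypothetical sketch, not the PR's code: the expensive part for i-quants is
// decoding a block of packed weights; decoding once and reusing the result
// for every right-hand-side column amortizes that cost. With Ny = 1 (token
// generation) there is nothing to amortize over, so the gains are smaller.
constexpr int kBlockSize = 32;                        // illustrative block length

// Stand-in for an i-quant block decoder; the real decoding is far more involved.
using UnpackFn = void (*)(const void* packed, int iblock, float* out);

template <int Ny>
void mul_mat_row(UnpackFn unpack, const void* packed, int nblocks,
                 const float* const* y,               // Ny activation columns
                 float* out) {                        // Ny accumulated results
    for (int ib = 0; ib < nblocks; ++ib) {
        float w[kBlockSize];
        unpack(packed, ib, w);                        // paid once per block ...
        for (int iy = 0; iy < Ny; ++iy) {             // ... reused for all Ny columns
            float sum = 0.f;
            for (int k = 0; k < kBlockSize; ++k)
                sum += w[k] * y[iy][ib * kBlockSize + k];
            out[iy] += sum;
        }
    }
}
```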
So, improving TG speed results in a drop of performance for PP.
Before I had PP-512 = 56.78 t/s, TG-128 = 12.42 t/s @ 8 threads.
Now we have  PP-512 = 52.77 t/s, TG-128 = 15.97 t/s @ 8 threads.
Improved TG from 4.96 t/s to 5.43 t/s. Still ~3.5% slower
than mainline.
PP-512 became slightly better (47.9 vs 46.8 t/s).
This is 3.9X mainline (!)
PP stays the same - 3.67X mainline.
TG improves slightly to 5.05 t/s from 4.74 t/s @ 4 threads.
This is still 15% slower than mainline.
We get 3.32X mainline for PP.
TG is, sadly, 0.92X @ 4 threads
We get 2.87X mainline for PP.
TG is, sadly, 0.95X @ 4 threads
Turns out we can improve quite a bit by explicitly
asking the compiler to never inline some functions, and
to always inline some others.
With that, PP performance gains are > 3X for all i-quants,
reaching 4.3X for iq3_s. TG is also always better, except
for iq3_xxs, where it is 0.99X, so re-enabled iqk_mul_mat
for Ny = 1.
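The "never inline / always inline" remark maps onto compiler attributes. Here is a minimal illustration using GCC/Clang syntax; the macro names, helper, and kernel below are hypothetical, not the functions actually touched in this PR.

```cpp
// Illustrative only: pin inlining decisions instead of leaving them to the
// compiler's heuristics. Small per-block helpers are forced into their
// callers, while the big templated kernels stay out-of-line.
#if defined(__GNUC__) || defined(__clang__)
#define ALWAYS_INLINE inline __attribute__((always_inline))
#define NOINLINE      __attribute__((noinline))
#else
#define ALWAYS_INLINE inline
#define NOINLINE
#endif

// Hypothetical small helper that should disappear into its caller:
ALWAYS_INLINE static int decode_scale(unsigned bits) { return (bits >> 28) & 0xf; }

// Hypothetical large kernel kept as a real out-of-line function, so the
// compiler does not try to fold every template instantiation into one
// enormous body:
template <int Ny>
NOINLINE void mul_mat_iq2_xxs(int n, const void* x, const float* y, float* out) {
    // (real kernel body omitted; this sketch is only about attribute placement)
    (void)n; (void)x; (void)y; (void)out;
}
```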
Turns out changing one method of a quant affects the
performance of other quant(s). Is the compiler somehow
trying to optimize all template instantiations together?
Anyway, with this version I have this:
|                     cpu_info | model_filename |       size |    test |     t/s |
| ---------------------------: | -------------: | ---------: | ------: | ------: |
| Apple M2 Max (+fp16+dotprod) |         iq2xxs |   1.73 GiB |   tg128 |    9.02 |
| Apple M2 Max (+fp16+dotprod) |         iq2xxs |   1.73 GiB |   pp512 |   61.31 |
| Apple M2 Max (+fp16+dotprod) |          iq2xs |   1.89 GiB |   tg128 |   10.58 |
| Apple M2 Max (+fp16+dotprod) |          iq2xs |   1.89 GiB |   pp512 |   56.11 |
| Apple M2 Max (+fp16+dotprod) |           iq2m |   2.20 GiB |   tg128 |    7.07 |
| Apple M2 Max (+fp16+dotprod) |           iq2m |   2.20 GiB |   pp512 |   45.78 |
| Apple M2 Max (+fp16+dotprod) |         iq3xxs |   2.41 GiB |   tg128 |    6.40 |
| Apple M2 Max (+fp16+dotprod) |         iq3xxs |   2.41 GiB |   pp512 |   47.51 |
| Apple M2 Max (+fp16+dotprod) |           iq3m |   2.90 GiB |   tg128 |    5.97 |
| Apple M2 Max (+fp16+dotprod) |           iq3m |   2.90 GiB |   pp512 |   47.98 |

TG is with 4 threads, PP with 8.
With this version we get
|                     cpu_info | model_filename |       size |   test |     t/s |
| ---------------------------: | -------------: | ---------: | -----: | ------: |
| Apple M2 Max (+fp16+dotprod) |         iq2xxs |   1.73 GiB |  tg128 |   10.83 |
| Apple M2 Max (+fp16+dotprod) |         iq2xxs |   1.73 GiB |  pp512 |   60.82 |
| Apple M2 Max (+fp16+dotprod) |          iq2xs |   1.89 GiB |  tg128 |   10.79 |
| Apple M2 Max (+fp16+dotprod) |          iq2xs |   1.89 GiB |  pp512 |   57.10 |
| Apple M2 Max (+fp16+dotprod) |           iq2m |   2.20 GiB |  tg128 |    7.45 |
| Apple M2 Max (+fp16+dotprod) |           iq2m |   2.20 GiB |  pp512 |   46.39 |
| Apple M2 Max (+fp16+dotprod) |         iq3xxs |   2.41 GiB |  tg128 |    6.77 |
| Apple M2 Max (+fp16+dotprod) |         iq3xxs |   2.41 GiB |  pp512 |   48.74 |
| Apple M2 Max (+fp16+dotprod) |           iq3m |   2.90 GiB |  tg128 |    5.97 |
| Apple M2 Max (+fp16+dotprod) |           iq3m |   2.90 GiB |  pp512 |   48.59 |
@jart (Collaborator) left a comment:
Nice! I'm happy to see more ARM improvements. To support your work, I've been focusing on getting llamafile to run on Android these past few days. ARM just said 70% of inference on Android happens on CPU, so it's potentially the most impactful audience for your work: https://www.theregister.com/2024/05/30/arm_cortex_x925_ai_cores/?td=rt-3a

@jart jart merged commit c38feb4 into Mozilla-Ocho:main Jun 8, 2024
2 checks passed