Optimized matrix multiplications for i-quants on __aarch64__ (#464)
* Arm for i-quants

  This carries over what I had done within llama.cpp. In llamafile we get nice performance gains for PP, but a performance regression for TG. For now, just adjusted iq2_xxs to also outperform in TG (~10% better @ 4 and 8 threads). Will tackle the other quants next.

* Arm for i-quants: iq2_xxs

  So, improving TG speed results in a drop of performance for PP. Before I had PP-512 = 56.78 t/s, TG-128 = 12.42 t/s @ 8 threads. Now we have PP-512 = 52.77 t/s, TG-128 = 15.97 t/s @ 8 threads.

* Arm for i-quants: iq3_s

  Improved TG from 4.96 t/s to 5.43 t/s. Still ~3.5% slower than mainline. PP-512 became slightly better (47.9 vs 46.8 t/s). This is 3.9X mainline (!)

* Arm for i-quants: iq3_xxs

  PP stays the same - 3.67X mainline. TG improves slightly to 5.05 t/s from 4.74 t/s @ 4 threads. This is still 15% slower than mainline.

* Arm for i-quants: iq2_s

  We get 3.32X mainline for PP. TG is, sadly, 0.92X @ 4 threads.

* Arm for i-quants: iq2_xs

  We get 2.87X mainline for PP. TG is, sadly, 0.95X @ 4 threads.

* Arm for i-quants: abandoning special-casing Ny = 1

* Arm for i-quants: cleanup and disable iqk_mul_mat for Ny = 1

* Arm for i-quants: holding the compiler's hand

  Turns out we can improve quite a bit by explicitly asking the compiler to never inline some functions, and to always inline some others. With that, PP performance gains are > 3X for all i-quants, reaching 4.3X for iq3_s. TG is also always better, except for iq3_xxs, where it is 0.99X, so re-enabled iqk_mul_mat for Ny = 1.

* Arm for i-quants: iterating

  Turns out changing one method of a quant affects the performance of other quant(s). Is the compiler somehow trying to optimize all template instantiations together?
  Anyway, with this version I have this:

  | cpu_info | model_filename | size | test | t/s |
  | ---------------------------: | -------------: | ---------: | ------: | ------: |
  | Apple M2 Max (+fp16+dotprod) | iq2xxs | 1.73 GiB | tg128 | 9.02 |
  | Apple M2 Max (+fp16+dotprod) | iq2xxs | 1.73 GiB | pp512 | 61.31 |
  | Apple M2 Max (+fp16+dotprod) | iq2xs | 1.89 GiB | tg128 | 10.58 |
  | Apple M2 Max (+fp16+dotprod) | iq2xs | 1.89 GiB | pp512 | 56.11 |
  | Apple M2 Max (+fp16+dotprod) | iq2m | 2.20 GiB | tg128 | 7.07 |
  | Apple M2 Max (+fp16+dotprod) | iq2m | 2.20 GiB | pp512 | 45.78 |
  | Apple M2 Max (+fp16+dotprod) | iq3xxs | 2.41 GiB | tg128 | 6.40 |
  | Apple M2 Max (+fp16+dotprod) | iq3xxs | 2.41 GiB | pp512 | 47.51 |
  | Apple M2 Max (+fp16+dotprod) | iq3m | 2.90 GiB | tg128 | 5.97 |
  | Apple M2 Max (+fp16+dotprod) | iq3m | 2.90 GiB | pp512 | 47.98 |

  TG is with 4 threads, PP with 8.

* Arm for i-quants: iterating

  With this version we get

  | cpu_info | model_filename | size | test | t/s |
  | ---------------------------: | -------------: | ---------: | -----: | ------: |
  | Apple M2 Max (+fp16+dotprod) | iq2xxs | 1.73 GiB | tg128 | 10.83 |
  | Apple M2 Max (+fp16+dotprod) | iq2xxs | 1.73 GiB | pp512 | 60.82 |
  | Apple M2 Max (+fp16+dotprod) | iq2xs | 1.89 GiB | tg128 | 10.79 |
  | Apple M2 Max (+fp16+dotprod) | iq2xs | 1.89 GiB | pp512 | 57.10 |
  | Apple M2 Max (+fp16+dotprod) | iq2m | 2.20 GiB | tg128 | 7.45 |
  | Apple M2 Max (+fp16+dotprod) | iq2m | 2.20 GiB | pp512 | 46.39 |
  | Apple M2 Max (+fp16+dotprod) | iq3xxs | 2.41 GiB | tg128 | 6.77 |
  | Apple M2 Max (+fp16+dotprod) | iq3xxs | 2.41 GiB | pp512 | 48.74 |
  | Apple M2 Max (+fp16+dotprod) | iq3m | 2.90 GiB | tg128 | 5.97 |
  | Apple M2 Max (+fp16+dotprod) | iq3m | 2.90 GiB | pp512 | 48.59 |

* Arm for i-quants: cleanup and comments

* Remove forgotten experimental change in q3_K implementation