Optimized matrix multiplications for i-quants on __aarch64__ (#464)
* Arm for i-quants

  This carries over what I had done within llama.cpp. In llamafile we get nice performance gains for PP, but a performance regression for TG. For now, just adjusted iq2_xxs to also outperform in TG (~10% better @ 4 and 8 threads). Will tackle the other quants next.

* Arm for i-quants: iq2_xxs

  So, improving TG speed results in a drop of performance for PP. Before I had PP-512 = 56.78 t/s, TG-128 = 12.42 t/s @ 8 threads. Now we have PP-512 = 52.77 t/s, TG-128 = 15.97 t/s @ 8 threads.

* Arm for i-quants: iq3_s

  Improved TG from 4.96 t/s to 5.43 t/s. Still ~3.5% slower than mainline. PP-512 became slightly better (47.9 vs 46.8 t/s). This is 3.9X mainline (!)

* Arm for i-quants: iq3_xxs

  PP stays the same - 3.67X mainline. TG improves slightly to 5.05 t/s from 4.74 t/s @ 4 threads. This is still 15% slower than mainline.

* Arm for i-quants: iq2_s

  We get 3.32X mainline for PP. TG is, sadly, 0.92X @ 4 threads.

* Arm for i-quants: iq2_xs

  We get 2.87X mainline for PP. TG is, sadly, 0.95X @ 4 threads.

* Arm for i-quants: abandoning special-casing Ny = 1

* Arm for i-quants: cleanup and disable iqk_mul_mat for Ny = 1

* Arm for i-quants: holding the compiler's hand

  Turns out we can improve quite a bit by explicitly asking the compiler to never inline some functions, and to always inline some others. With that, PP performance gains are > 3X for all i-quants, reaching 4.3X for iq3_s. TG is also always better, except for iq3_xxs, where it is 0.99X, so re-enabled iqk_mul_mat for Ny = 1.

* Arm for i-quants: iterating

  Turns out changing one method of a quant affects the performance of other quant(s). Is the compiler somehow trying to optimize all template instantiations together?
  Anyway, with this version I have this:

  | cpu_info | model_filename | size | test | t/s |
  | ---------------------------: | -------------: | ---------: | ------: | ------: |
  | Apple M2 Max (+fp16+dotprod) | iq2xxs | 1.73 GiB | tg128 | 9.02 |
  | Apple M2 Max (+fp16+dotprod) | iq2xxs | 1.73 GiB | pp512 | 61.31 |
  | Apple M2 Max (+fp16+dotprod) | iq2xs | 1.89 GiB | tg128 | 10.58 |
  | Apple M2 Max (+fp16+dotprod) | iq2xs | 1.89 GiB | pp512 | 56.11 |
  | Apple M2 Max (+fp16+dotprod) | iq2m | 2.20 GiB | tg128 | 7.07 |
  | Apple M2 Max (+fp16+dotprod) | iq2m | 2.20 GiB | pp512 | 45.78 |
  | Apple M2 Max (+fp16+dotprod) | iq3xxs | 2.41 GiB | tg128 | 6.40 |
  | Apple M2 Max (+fp16+dotprod) | iq3xxs | 2.41 GiB | pp512 | 47.51 |
  | Apple M2 Max (+fp16+dotprod) | iq3m | 2.90 GiB | tg128 | 5.97 |
  | Apple M2 Max (+fp16+dotprod) | iq3m | 2.90 GiB | pp512 | 47.98 |

  TG is with 4 threads, PP with 8.

* Arm for i-quants: iterating

  With this version we get

  | cpu_info | model_filename | size | test | t/s |
  | ---------------------------: | -------------: | ---------: | -----: | ------: |
  | Apple M2 Max (+fp16+dotprod) | iq2xxs | 1.73 GiB | tg128 | 10.83 |
  | Apple M2 Max (+fp16+dotprod) | iq2xxs | 1.73 GiB | pp512 | 60.82 |
  | Apple M2 Max (+fp16+dotprod) | iq2xs | 1.89 GiB | tg128 | 10.79 |
  | Apple M2 Max (+fp16+dotprod) | iq2xs | 1.89 GiB | pp512 | 57.10 |
  | Apple M2 Max (+fp16+dotprod) | iq2m | 2.20 GiB | tg128 | 7.45 |
  | Apple M2 Max (+fp16+dotprod) | iq2m | 2.20 GiB | pp512 | 46.39 |
  | Apple M2 Max (+fp16+dotprod) | iq3xxs | 2.41 GiB | tg128 | 6.77 |
  | Apple M2 Max (+fp16+dotprod) | iq3xxs | 2.41 GiB | pp512 | 48.74 |
  | Apple M2 Max (+fp16+dotprod) | iq3m | 2.90 GiB | tg128 | 5.97 |
  | Apple M2 Max (+fp16+dotprod) | iq3m | 2.90 GiB | pp512 | 48.59 |

* Arm for i-quants: cleanup and comments

* Remove forgotten experimental change in q3_K implementation