Add llama.cpp backend #231

Merged: 15 commits merged into main from llama_cpp on Jul 30, 2024

Conversation

@baptistecolle (Collaborator) commented Jul 19, 2024

Add llama.cpp as Backend for Optimum Benchmark

Overview

This PR introduces llama.cpp as a backend for Optimum Benchmark (see issue #117).

Changes

  • Added an example in the examples folder demonstrating how to run the llama.cpp backend (a Python-API sketch follows the config below):
defaults:
  - benchmark
  - scenario: inference
  - launcher: inline
  - backend: llama_cpp
  - _base_
  - _self_

name: llama_cpp_llama

backend:
  device: mps
  model: TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF
  task: text-generation
  filename: tinyllama-1.1b-chat-v1.0.Q4_0.gguf

scenario:
  input_shapes:
    batch_size: 1
    sequence_length: 256
    vocab_size: 32000
  generate_kwargs:
    max_new_tokens: 100
    min_new_tokens: 100
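
For reference, the same benchmark could in principle also be launched from the Python API instead of the Hydra config above. The following is only a rough sketch: the top-level exports (in particular LlamaCppConfig) and the report's log() helper are assumptions based on how the other backends are exposed, not something verified against this PR.

# Rough Python-API equivalent of the example config above (sketch only).
from optimum_benchmark import (
    Benchmark,
    BenchmarkConfig,
    InferenceConfig,
    InlineConfig,
    LlamaCppConfig,  # assumed to be exported at the top level by this PR
)

benchmark_config = BenchmarkConfig(
    name="llama_cpp_llama",
    launcher=InlineConfig(),
    scenario=InferenceConfig(
        input_shapes={"batch_size": 1, "sequence_length": 256, "vocab_size": 32000},
        generate_kwargs={"max_new_tokens": 100, "min_new_tokens": 100},
    ),
    backend=LlamaCppConfig(
        device="mps",
        model="TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF",
        task="text-generation",
        filename="tinyllama-1.1b-chat-v1.0.Q4_0.gguf",
    ),
)

benchmark_report = Benchmark.launch(benchmark_config)
benchmark_report.log()  # assumed helper: prints the same latency/throughput tables as the CLI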

Current limitations:

  • Benchmarking with the llama.cpp backend currently only supports a batch size of 1 (see the review discussion below).

Performance

The metrics were validated by comparing the benchmark results of the PyTorch backend and the llama.cpp backend. The numbers are close; I can provide the full .json reports if needed, but they are not included here because they are hard to read inline.

CLI output (tested on an M3 Pro CPU):

Performance with the llama.cpp backend:
[screenshot: benchmark CLI output, 2024-07-22 14:46]

Performance with the PyTorch backend:
[screenshot: benchmark CLI output, 2024-07-22 14:49]

The performance difference might be due to the significant amount of copying between devices with PyTorch, as shown below:
[screenshot: device-to-device copies with the PyTorch backend, 2024-07-22 14:50]
Furthermore, llama.cpp is optimized for Mac, which could explain the higher performance. Let me know if you want me to investigate the performance difference further.

added llama.cpp backend
@baptistecolle baptistecolle marked this pull request as ready for review July 22, 2024 13:00
@baptistecolle baptistecolle changed the title WIP: Add llama.cpp backend Add llama.cpp backend Jul 22, 2024
@baptistecolle baptistecolle changed the title Add llama.cpp backend WIP: Add llama.cpp backend Jul 22, 2024
@baptistecolle baptistecolle changed the title WIP: Add llama.cpp backend Add llama.cpp backend Jul 22, 2024
@IlyasMoutawwakil (Member) left a comment

Thanks a lot! Very awesome work!
Only missing test configs and GitHub workflows 🤗
Hopefully I'll fix the process launcher with MPS by the time the MPS workflows start running.

Review comments (now outdated and resolved) were left on:
  • optimum_benchmark/backends/base.py
  • optimum_benchmark/backends/llama_cpp/backend.py (2 comments)
  • optimum_benchmark/backends/llama_cpp/config.py (2 comments)
  • optimum_benchmark/import_utils.py
  • optimum_benchmark/task_utils.py
@baptistecolle baptistecolle changed the title Add llama.cpp backend WIP: Add llama.cpp backend Jul 23, 2024
@regisss (Collaborator) left a comment

Very clean PR!
I left a couple of comments.

Additionally, since benchmarking with llama.cpp is limited to a batch size of 1, I think it would be good to add a comment right above the batch size field in the 3 example configs, unless an error/warning is already raised somewhere in the code that I missed.

Review comments (now outdated and resolved) were left on:
  • optimum_benchmark/backends/llama_cpp/config.py
  • optimum_benchmark/backends/llama_cpp/backend.py
@baptistecolle (Collaborator, Author) commented Jul 25, 2024

Very clean PR! I left a couple of comments.

Additionally, since benchmarking with llama.cpp is limited to a batch size of 1, I think it would be good to add a comment right above the batch size field in the 3 example configs, unless an error/warning is already raised somewhere in the code that I missed.

It is done here
https://github.com/huggingface/optimum-benchmark/pull/231/files#r1690885402

@baptistecolle (Collaborator, Author) commented Jul 25, 2024

Thanks for the review. I implemented the necessary changes.

I added:

  • fixes for various formatting issues (code structure, absolute imports, a forgotten debug print statement, ...)
  • support for embedding models with llama.cpp (see the sketch at the end of this comment)
  • tests and a GitHub workflow for the llama.cpp backend

(Also, two of the runners are currently offline, so I am unable to run the CI on them:)

  • API ROCm Tests / build_image_and_run_api_rocm_tests (pull_request)
  • CLI ROCm Pytorch Single-GPU Tests / run_cli_rocm_pytorch_single_gpu_tests (pull_request)

Let me know if you have more remarks.
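
As a side note, here is a minimal sketch of what the embedding-model support could look like from the Python side; the top-level LlamaCppConfig export, the feature-extraction task name, and the model/file names are illustrative assumptions, not taken from this PR.

# Sketch: configuring the llama.cpp backend for an embedding model (illustrative only).
from optimum_benchmark import LlamaCppConfig  # assumed top-level export

embedding_backend = LlamaCppConfig(
    device="mps",
    task="feature-extraction",                    # assumed task name for embedding models
    model="nomic-ai/nomic-embed-text-v1.5-GGUF",  # illustrative GGUF embedding repository
    filename="nomic-embed-text-v1.5.Q4_0.gguf",   # illustrative quantized weights file
)
# This backend config can then be plugged into a BenchmarkConfig exactly as in
# the text-generation sketch earlier in this thread.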

@baptistecolle baptistecolle changed the title WIP: Add llama.cpp backend Add llama.cpp backend Jul 25, 2024
@IlyasMoutawwakil (Member)

One or two examples are enough; there's a lot of repetition there.

@baptistecolle (Collaborator, Author)

Indeed, I created multiple configs during development and forgot to remove them from the PR. This is fixed now.

@IlyasMoutawwakil (Member)

Great PR @baptistecolle 🤗

@IlyasMoutawwakil IlyasMoutawwakil merged commit 0aac010 into main Jul 30, 2024
25 of 27 checks passed
@baptistecolle baptistecolle deleted the llama_cpp branch July 31, 2024 10:11