feat: Add load tests #2217
Conversation
Force-pushed from 4d398a0 to 96a3e86
Force-pushed from 96a3e86 to f8bce45
Force-pushed from f8bce45 to 8d358d9
Force-pushed from 7c2014c to b4db6e0
Force-pushed from b4db6e0 to 4642fd2
Nice 🚀
One small comment is that this is still quite gnarly for others to use and run on their own machines. And tbh that's also because some things (like k6-sse) don't have very good DX.
I think that's cool for now, but just good to be aware of 👍
```python
class TGIDockerRunner(InferenceEngineRunner):
    def __init__(self,
                 model: str,
                 image: str = "ghcr.io/huggingface/text-generation-inference:latest",
```
Are we sure that this runs on the correct Docker image? Since it runs on `main`, there shouldn't be any contention on concurrent builds 🤔
You are right, for PR runs that wouldn't work, as we need to pin a specific tag (that has to be done in `load_test.py`). For nightly runs, the current build process ensures that `latest` is a build from `main`.
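For illustration, a minimal sketch of pinning a PR-specific image when constructing the runner shown above (the import path, environment variable, tag scheme, and model name are assumptions, not taken from the PR):

```python
# Sketch only: pin the TGI image to a PR-specific tag instead of `latest`.
# The module path, tag scheme, and model are hypothetical placeholders.
import os

from load_tests.benchmarks.engine import TGIDockerRunner  # hypothetical import path

sha = os.environ.get("GITHUB_SHA", "")
tag = f"sha-{sha[:7]}" if sha else "latest"

runner = TGIDockerRunner(
    model="meta-llama/Llama-2-7b-chat-hf",  # example model, not from the PR
    image=f"ghcr.io/huggingface/text-generation-inference:{tag}",
)
```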
Yeah, I agree. It would be easier to have everything containerized. But in that case we would need to mount the Docker socket in the container to be able to spawn the TGI Docker container from there, or do some kind of Docker-in-Docker.
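A minimal sketch of the socket-mounting approach mentioned above (assuming the `docker` Python SDK and a hypothetical harness image), which lets a containerized harness spawn sibling TGI containers without full Docker-in-Docker:

```python
# Sketch only: mount the host Docker socket into the harness container so it
# can spawn sibling TGI containers. Image name and command are placeholders.
import docker

client = docker.from_env()
client.containers.run(
    "load-test-harness:latest",          # hypothetical harness image
    command=["python", "load_test.py"],
    volumes={"/var/run/docker.sock": {"bind": "/var/run/docker.sock", "mode": "rw"}},
    detach=True,
)
```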
Co-authored-by: Erik Kaunismäki <erik.kaum@gmail.com>
load_tests/benchmarks/k6.py (outdated)
```python
def run(self):
    self.k6_config.executor.render()
    args = f"/tmp/k6-sse run --out json=results.json {self.k6_config.executor.rendered_file}"
```
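For illustration, a minimal sketch of executing a command line like the one above and capturing its output (an assumption about how it could be launched, not necessarily how the PR's runner does it; paths are placeholders):

```python
# Sketch only: run a k6-sse command line via subprocess and print its output.
import shlex
import subprocess

args = "/tmp/k6-sse run --out json=results.json /tmp/rendered_scenario.js"
result = subprocess.run(shlex.split(args), capture_output=True, text=True, check=True)
print(result.stdout)
```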
Hey there, awesome work @Hugoch !
Following up on @ErikKaum comment, why not use vanilla k6 and gather Time To First Token, Time Per Output Token, etc... from response headers?
This is a snippet from my own TGI load testing tool, maybe it is useful:
```javascript
// Extract per-request stats
const end_to_end_time = res.timings.duration;
const generated_tokens = parseInt(res.headers['X-Generated-Tokens']);
const inference_time = parseInt(res.headers['X-Inference-Time']);
const prompt_tokens = parseInt(res.headers['X-Prompt-Tokens']);
const queue_time = parseInt(res.headers['X-Queue-Time']);
const time_per_output_token = parseInt(res.headers['X-Time-Per-Token']);
const total_time = parseInt(res.headers['X-Total-Time']);
const validation_time = parseInt(res.headers['X-Validation-Time']);
// Compute overhead time from I/O
const overhead_time = end_to_end_time - total_time;
// Compute time to first token
const time_to_first_token = inference_time - generated_tokens * time_per_output_token;
```
Hey @martinigoyanes! Thx for the feedback!
Headers can be used, but they do not precisely account for the whole e2e time including the HTTP layer. Time to first token also cannot be measured precisely this way: `const time_to_first_token = inference_time - generated_tokens * time_per_output_token` assumes the time per token is the same for all tokens. That is true on average for all n>1 tokens, but it misses the prefill phase, which delays the first token.
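For illustration, a minimal sketch in Python (not the PR's k6-based approach; the endpoint path, payload shape, and use of the `requests` library are assumptions about a typical TGI SSE setup) of measuring time to first token directly on the stream, so the prefill phase is naturally included:

```python
# Sketch only: measure time-to-first-token on the SSE stream itself, so the
# measurement includes the prefill phase that header arithmetic misses.
import time

import requests


def measure_ttft(base_url: str, prompt: str) -> float:
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": 64}}
    start = time.perf_counter()
    with requests.post(f"{base_url}/generate_stream", json=payload, stream=True) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if line.startswith(b"data:"):
                # First token event received: elapsed time includes prefill.
                return time.perf_counter() - start
    raise RuntimeError("stream ended before any token was received")


# Example usage (hypothetical local TGI instance):
# print(measure_ttft("http://localhost:8080", "What is Deep Learning?"))
```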
Force-pushed from 4e4baae to 690d631
My reviews were not sent for at least 1 month, nice :(
Force-pushed from a4ed18f to e6f0daf
Force-pushed from e6f0daf to a258e8f
Closing as stale (more coming with new benchmarking tools!)
What does this PR do?
This PR adds automated load tests in CI using Grafana k6.
Two tests are performed:
Both tests were run for 2 different kinds of inputs:
llama-tokenizer
The tests compute the following metrics:
Successful requests: The number of requests the system was able to honor in the benchmark timeframe
At the end of the test, it produces charts with, on the same plot:
Results are added to #2235
It relies on workflow run artifacts to gather previous results (90-day TTL).
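For context, a minimal sketch (assuming k6's NDJSON output produced by `--out json=results.json`; the PR's actual aggregation code is not shown here) of how the "Successful requests" count could be derived from that output:

```python
# Sketch only: count requests whose `http_req_failed` sample is 0 in the
# NDJSON file that k6 writes with `--out json=results.json`.
import json

successful = 0
with open("results.json") as f:
    for line in f:
        point = json.loads(line)
        if point.get("type") == "Point" and point.get("metric") == "http_req_failed":
            if point["data"]["value"] == 0:
                successful += 1

print(f"Successful requests: {successful}")
```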
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.