feat: Add load tests #2217
Conversation
Force-pushed from 4d398a0 to 96a3e86
Force-pushed from 96a3e86 to f8bce45
Force-pushed from f8bce45 to 8d358d9
Force-pushed from 7c2014c to b4db6e0
Force-pushed from b4db6e0 to 4642fd2
Nice 🚀
One small comment is that this is still quite gnarly for others to use and run on their own machines. And tbh that's also because some things (like k6-sse) don't have very good DX.
I think that's cool for now, but just good to be aware of 👍
```python
class TGIDockerRunner(InferenceEngineRunner):
    def __init__(self,
                 model: str,
                 image: str = "ghcr.io/huggingface/text-generation-inference:latest",
```
Are we sure that this runs on the correct Docker image? Since it runs on `main`, there shouldn't be any contention on concurrent builds 🤔
You are right, for PR runs that wouldn't work, as we need to pin a specific tag (that has to be done in `load_test.py`). For nightly runs, the current build process ensures that `latest` is a build from `main`.
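For illustration, a minimal sketch of pinning a PR-specific image when constructing the runner shown above (the import path, environment variable, tag scheme, and model name are assumptions, not taken from the PR):

```python
# Sketch only: pin the TGI image to a PR-specific tag instead of `latest`.
# The module path, tag scheme, and model are hypothetical placeholders.
import os

from load_tests.benchmarks.engine import TGIDockerRunner  # hypothetical import path

sha = os.environ.get("GITHUB_SHA", "")
tag = f"sha-{sha[:7]}" if sha else "latest"

runner = TGIDockerRunner(
    model="meta-llama/Llama-2-7b-chat-hf",  # example model, not from the PR
    image=f"ghcr.io/huggingface/text-generation-inference:{tag}",
)
```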
Yeah, I agree. It would be easier to have everything containerized. But in that case we would need to mount the Docker socket in the container to be able to spawn the TGI Docker container from there, or do some kind of Docker-in-Docker.
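A minimal sketch of the socket-mounting approach mentioned above (assuming the `docker` Python SDK and a hypothetical harness image), which lets a containerized harness spawn sibling TGI containers without full Docker-in-Docker:

```python
# Sketch only: mount the host Docker socket into the harness container so it
# can spawn sibling TGI containers. Image name and command are placeholders.
import docker

client = docker.from_env()
client.containers.run(
    "load-test-harness:latest",          # hypothetical harness image
    command=["python", "load_test.py"],
    volumes={"/var/run/docker.sock": {"bind": "/var/run/docker.sock", "mode": "rw"}},
    detach=True,
)
```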
Co-authored-by: Erik Kaunismäki <erik.kaum@gmail.com>
load_tests/benchmarks/k6.py (outdated)
```python
def run(self):
    self.k6_config.executor.render()
    args = f"/tmp/k6-sse run --out json=results.json {self.k6_config.executor.rendered_file}"
```
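For illustration, a minimal sketch of executing a command line like the one above and capturing its output (an assumption about how it could be launched, not necessarily how the PR's runner does it; paths are placeholders):

```python
# Sketch only: run a k6-sse command line via subprocess and print its output.
import shlex
import subprocess

args = "/tmp/k6-sse run --out json=results.json /tmp/rendered_scenario.js"
result = subprocess.run(shlex.split(args), capture_output=True, text=True, check=True)
print(result.stdout)
```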
Hey there, awesome work @Hugoch !
Following up on @ErikKaum comment, why not use vanilla k6 and gather Time To First Token, Time Per Output Token, etc... from response headers?
This is a snippet from my own TGI load testing tool, maybe it is useful:
```javascript
// Extract per-request stats
const end_to_end_time = res.timings.duration;
const generated_tokens = parseInt(res.headers['X-Generated-Tokens']);
const inference_time = parseInt(res.headers['X-Inference-Time']);
const prompt_tokens = parseInt(res.headers['X-Prompt-Tokens']);
const queue_time = parseInt(res.headers['X-Queue-Time']);
const time_per_output_token = parseInt(res.headers['X-Time-Per-Token']);
const total_time = parseInt(res.headers['X-Total-Time']);
const validation_time = parseInt(res.headers['X-Validation-Time']);
// Compute overhead time from I/O
const overhead_time = end_to_end_time - total_time;
// Compute time to first token
const time_to_first_token = inference_time - generated_tokens * time_per_output_token;
```
Hey @martinigoyanes! Thx for the feedback!
Headers can be used, but they do not precisely account for the whole e2e time including the HTTP layer. Time to first token also cannot be measured precisely this way: `const time_to_first_token = inference_time - generated_tokens * time_per_output_token` assumes the time per token is the same for all tokens. That is true on average for all n>1 tokens, but it misses the prefill phase, which delays the first token.
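For illustration, a minimal sketch in Python (not the PR's k6-based approach; the endpoint path, payload shape, and use of the `requests` library are assumptions about a typical TGI SSE setup) of measuring time to first token directly on the stream, so the prefill phase is naturally included:

```python
# Sketch only: measure time-to-first-token on the SSE stream itself, so the
# measurement includes the prefill phase that header arithmetic misses.
import time

import requests


def measure_ttft(base_url: str, prompt: str) -> float:
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": 64}}
    start = time.perf_counter()
    with requests.post(f"{base_url}/generate_stream", json=payload, stream=True) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if line.startswith(b"data:"):
                # First token event received: elapsed time includes prefill.
                return time.perf_counter() - start
    raise RuntimeError("stream ended before any token was received")


# Example usage (hypothetical local TGI instance):
# print(measure_ttft("http://localhost:8080", "What is Deep Learning?"))
```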
Force-pushed from 4e4baae to 690d631
My reviews were not sent for at least 1 month, nice :(
Force-pushed from a4ed18f to e6f0daf
Force-pushed from e6f0daf to a258e8f
Closing as stale (more coming with new benchmarking tools!)
What does this PR do?
This PR adds automated load tests in CI using Grafana k6.
Two tests are performed:
Both tests were run for 2 different kinds of inputs:
llama-tokenizer
The tests compute the following metrics:
Successful requests: The number of requests the system was able to honor in the benchmark timeframe
At the end of the test, it produces charts with, on the same plot:
Results are added to #2235
It relies on workflow run artifacts to gather previous results (90-day TTL).
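For context, a minimal sketch (assuming k6's NDJSON output produced by `--out json=results.json`; the PR's actual aggregation code is not shown here) of how the "Successful requests" count could be derived from that output:

```python
# Sketch only: count requests whose `http_req_failed` sample is 0 in the
# NDJSON file that k6 writes with `--out json=results.json`.
import json

successful = 0
with open("results.json") as f:
    for line in f:
        point = json.loads(line)
        if point.get("type") == "Point" and point.get("metric") == "http_req_failed":
            if point["data"]["value"] == 0:
                successful += 1

print(f"Successful requests: {successful}")
```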
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.