
feat: Add load tests #2217

Closed. Hugoch wants to merge 12 commits from the feat/add-load-test branch.
Conversation

@Hugoch (Member) commented Jul 11, 2024

What does this PR do?

This PR adds automated load tests in CI using Grafana k6.

Two tests are performed:

  • Constant virtual users (VUs) load test: It simulates a fixed pool of users trying to make as many requests as possible against the API for 60 seconds.
  • Constant arrival rate load test: It simulates a constant rate of incoming user requests, independent of the system’s response rate, for 60 seconds. (A minimal k6 sketch of both executors is shown below.)
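
A minimal sketch of how the two executors described above can be configured in k6. This is not the PR's actual script: the endpoint, payload, VU count and request rate are illustrative assumptions.

import http from 'k6/http';

export const options = {
    scenarios: {
        // Fixed pool of virtual users issuing requests as fast as possible for 60s
        constant_vus: {
            executor: 'constant-vus',
            vus: 50,
            duration: '60s',
        },
        // Fixed arrival rate of new requests, independent of how fast the system responds
        constant_arrival_rate: {
            executor: 'constant-arrival-rate',
            rate: 10,              // 10 new requests started per second
            timeUnit: '1s',
            duration: '60s',
            preAllocatedVUs: 100,  // VUs reserved to sustain the rate
            startTime: '60s',      // run after the constant-vus scenario
        },
    },
};

export default function () {
    // Assumes a TGI instance listening on localhost:8080
    const payload = JSON.stringify({
        inputs: 'Hello, how are you?',
        parameters: { max_new_tokens: 50 },
    });
    http.post('http://localhost:8080/generate', payload, {
        headers: { 'Content-Type': 'application/json' },
    });
}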

Both tests were run for 2 different kinds of inputs:

  • 5000 ShareGPT prompts randomly selected (variable token length)
  • 5000 ShareGPT prompts truncated to 500 tokens (constant token length). Token counts use llama-tokenizer.

The tests compute the following metrics (a sketch of how such metrics can be recorded in k6 follows the list):

  • Inter token latency: Time to generate a new output token for each user querying the system. It translates to the “speed” perceived by the end user. We aim for at least 300 words per minute (average reading speed), so ITL < 150 ms.
  • Time to First Token: Time the user has to wait before seeing the first token of their answer. Low waiting times are essential for real-time interactions, less so for offline workloads.
  • End to End latency: The overall time the system took to generate the full response to the user.
  • Throughput: The number of tokens per second the system can generate across all requests.
  • Successful requests: The number of requests the system was able to honor in the benchmark timeframe.
  • Error rate: The percentage of requests that ended up in error, as the system could not process them in time or failed to process them.
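
A hedged sketch of how such per-request metrics can be recorded as k6 custom metrics. The metric and helper names here are illustrative, not necessarily those used by the PR's script.

import { Trend, Counter, Rate } from 'k6/metrics';

// Trends record distributions (avg/p90/p95...), Counters record totals, Rates record percentages
const timeToFirstToken = new Trend('time_to_first_token', true);   // ms
const interTokenLatency = new Trend('inter_token_latency', true);  // ms
const endToEndLatency = new Trend('end_to_end_latency', true);     // ms
const tokensGenerated = new Counter('tokens_generated');           // used to derive throughput
const errorRate = new Rate('error_rate');

// Hypothetical helper, called once per request with the measured values
export function recordRequest(ttftMs, itlMs, e2eMs, nTokens, failed) {
    timeToFirstToken.add(ttftMs);
    interTokenLatency.add(itlMs);
    endToEndLatency.add(e2eMs);
    tokensGenerated.add(nTokens);
    errorRate.add(failed);
}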

At the end of the test, it produces charts showing, on the same plot:

  • Results for TGI at current commit
  • Results for TGI at previous commit (if any)
  • Results for TGI at last release tag (if any)

Results are added to #2235

It relies on workflow run artifacts to gather previous results (90-day TTL).

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

@Hugoch Hugoch self-assigned this Jul 11, 2024
🚀 Load test results are in:

Variable length prompts (chart)

Constant length prompts (chart)

@Hugoch Hugoch force-pushed the feat/add-load-test branch 8 times, most recently from 7c2014c to b4db6e0 Compare July 16, 2024 13:06
@huggingface huggingface deleted a comment from github-actions bot Jul 16, 2024
@Hugoch Hugoch marked this pull request as ready for review July 16, 2024 15:09
ErikKaum previously approved these changes Jul 17, 2024

@ErikKaum (Member) left a comment:

Nice 🚀

One small comment is that this is still quite gnarly for others to use and run on their own machines. And tbh that's also because some things (like k6-sse) don't have very good DX.

I think that's cool for now, but just good to be aware of 👍

load_tests/benchmarks/templates/main.js.j2 (review thread, resolved)
class TGIDockerRunner(InferenceEngineRunner):
    def __init__(self,
                 model: str,
                 image: str = "ghcr.io/huggingface/text-generation-inference:latest",
Member commented on this code:
are we sure that this runs on the correct docker image? Since it runs on main there shouldn't be any contention on concurrent builds 🤔

@Hugoch (Member, Author) replied:

You are right, for PR runs that wouldn't work, as we need to pin a specific tag (that has to be done in load_test.py). For nightly runs, the current build process ensures that latest is a build from main.

@Hugoch (Member, Author) commented Jul 17, 2024

One small comment is that this is still quite gnarly for others to use and run on their own machines. And tbh that's also because some things (like k6-sse) don't have very good DX.

I think that's cool for now, but just good to be aware of 👍

Yeah, I agree. It would be easier to have everything containerized. But in that case we would need to mount the Docker socket in the container to be able to spawn the TGI Docker container from there, or do some kind of Docker-in-Docker.

Co-authored-by: Erik Kaunismäki <erik.kaum@gmail.com>

def run(self):
    self.k6_config.executor.render()
    args = f"/tmp/k6-sse run --out json=results.json {self.k6_config.executor.rendered_file}"
@martinigoyanes (Contributor) commented Jul 26, 2024:

Hey there, awesome work @Hugoch !

Following up on @ErikKaum's comment, why not use vanilla k6 and gather Time To First Token, Time Per Output Token, etc. from response headers?

This is a snippet from my own TGI load testing tool, maybe it is useful:

// Extract per request stats
const end_to_end_time = res.timings.duration;
const generated_tokens = parseInt(res.headers['X-Generated-Tokens']);
const inference_time = parseInt(res.headers['X-Inference-Time']);
const prompt_tokens = parseInt(res.headers['X-Prompt-Tokens']);
const queue_time = parseInt(res.headers['X-Queue-Time']);
const time_per_output_token = parseInt(res.headers['X-Time-Per-Token']);
const total_time = parseInt(res.headers['X-Total-Time']);
const validation_time = parseInt(res.headers['X-Validation-Time']);

// Compute overhead time from I/O
const overhead_time = end_to_end_time - total_time;
// Compute time to first token
const time_to_first_token = inference_time - generated_tokens * time_per_output_token;

@Hugoch (Member, Author) replied:

Hey @martinigoyanes ! Thx for the feedback!
Headers can be used, but they do not precisely account for the whole e2e time including the HTTP layer. Time to first token also cannot be precisely measured with const time_to_first_token = inference_time - generated_tokens * time_per_output_token, as it assumes that time per token is the same for all tokens. That is true on average for the n > 1 decode tokens, but it misses the prefill phase, which delays the first token.
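
To make the difference concrete, here is a hedged sketch (plain JavaScript, independent of any particular SSE client) of how the metrics can be derived when the arrival time of each streamed token is captured, so the prefill phase shows up in the time to first token instead of being averaged away:

// requestStart: timestamp (ms) at which the request was sent
// tokenTimes: arrival timestamps (ms) of each streamed token, in order
function streamingMetrics(requestStart, tokenTimes) {
    const n = tokenTimes.length;
    // TTFT includes queueing, validation and the whole prefill phase
    const timeToFirstToken = tokenTimes[0] - requestStart;
    // ITL is averaged over the decode phase only (tokens 2..n)
    const interTokenLatency = n > 1
        ? (tokenTimes[n - 1] - tokenTimes[0]) / (n - 1)
        : 0;
    const endToEndLatency = tokenTimes[n - 1] - requestStart;
    return { timeToFirstToken, interTokenLatency, endToEndLatency };
}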

Further review threads on load_tests/benchmarks/k6.py and load_tests/Makefile were resolved.
@Narsil (Collaborator) commented Aug 29, 2024

My reviews were not sent for 1 month at least, nice :(

@Hugoch Hugoch force-pushed the feat/add-load-test branch 5 times, most recently from a4ed18f to e6f0daf Compare August 30, 2024 15:35
@Narsil (Collaborator) commented Oct 1, 2024

Closing as stale (more coming with new benchmarking tools!)

@Narsil Narsil closed this Oct 1, 2024