
Add model_load_time metric #2311

Open · wants to merge 3 commits into main
Conversation

Edwinhr716

What does this PR do?

Adds a metric that measures the time spent downloading the model, loading it into GPU memory, and waiting for the server to be ready to receive requests.

Because the router is the component that emits metrics, but the launcher is the one that downloads the model (and is therefore where the download time is tracked), the duration must be passed from the launcher to the router. I considered passing it as a CLI argument, but to minimize the number of changes required I opted to use an environment variable instead. Open to suggestions on a better way to pass the value to the router.
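The launcher-to-router handoff could be sketched as follows. This is an illustrative sketch, not the PR's actual code: the variable name `MODEL_DOWNLOAD_TIME_MS` and both function names are hypothetical, and the encoding (milliseconds as a decimal string) is one reasonable choice among several.

```rust
use std::time::Duration;

// Hypothetical variable name; the actual PR may use a different one.
const DOWNLOAD_TIME_VAR: &str = "MODEL_DOWNLOAD_TIME_MS";

// Launcher side: serialize the measured download duration as an
// environment variable (key, value) pair for the router process.
fn encode_download_time(d: Duration) -> (String, String) {
    (DOWNLOAD_TIME_VAR.to_string(), d.as_millis().to_string())
}

// Router side: parse the value back, defaulting to zero when the
// variable is unset or unparsable (e.g. router started standalone).
fn decode_download_time(raw: Option<&str>) -> Duration {
    raw.and_then(|v| v.parse::<u64>().ok())
        .map(Duration::from_millis)
        .unwrap_or(Duration::ZERO)
}

fn main() {
    let (key, value) = encode_download_time(Duration::from_millis(12_500));
    println!("{key}={value}");
    assert_eq!(decode_download_time(Some(&value)), Duration::from_millis(12_500));
}
```

Defaulting to zero on the router side keeps the router usable without the launcher, at the cost of silently under-reporting the metric in that configuration.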

To make it easier to work with Rust's Instant type, I opted to measure two separate values and add them together: the time from starting the model download to launching the router, and the time from launching the router to the router being ready to receive requests.
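The two-phase sum described above could look like this minimal sketch (identifiers are illustrative, not the PR's actual names):

```rust
use std::time::{Duration, Instant};

// Phase 1 (download + launch) is received as a Duration, since it was
// measured in a different process; phase 2 is measured locally with
// an Instant captured when the router started.
fn total_load_time(download_and_launch: Duration, router_start: Instant) -> Duration {
    // Phase 2: from launching the router until it is ready to serve.
    let until_ready = router_start.elapsed();
    // model_load_time = phase 1 + phase 2
    download_and_launch + until_ready
}

fn main() {
    let router_start = Instant::now();
    // ... router initialization would happen here ...
    let total = total_load_time(Duration::from_secs(30), router_start);
    println!("model_load_time is at least {}s", total.as_secs());
}
```

Summing two Durations sidesteps the fact that an Instant is process-local and cannot meaningfully cross the launcher/router boundary.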

This PR is part of the metrics standardization effort.

Fixes # (issue)

Before submitting

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@Narsil
Collaborator

Narsil commented Aug 9, 2024

I'm still confused about the value of measuring this one huge metric that encompasses so many different things:

  • Model download time (and from where the model was downloaded)
  • Potential model conversion
  • The actual model load time, which can come from disk or from CPU RAM

As a model runner, ready times can be interesting, but without more context this is quite useless, no?
What are users supposed to do with that metric? It is also never modified during the server's lifetime, so it's not really probing the system for monitoring purposes, no?
In our monitoring systems, we only care about the logs showing insight into what's happening; they contain every step, why it occurs, and how long it takes.

I have nothing against the PR itself, but adding code without clear reasons for clear benefits is always a bit strange.

@Edwinhr716
Author

What are users supposed to do with that metric ?

Because it is essentially the startup latency of the model, it is useful for determining the pod autoscaling threshold and frequency. For example, if model_load_time is above, say, 40 seconds, the user might not want to scale down the number of pods too often, since it takes a long time to create a new pod once demand rises again.
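The autoscaling use case above could be sketched as a simple heuristic. This is purely illustrative and not part of the PR: the function, the 10x multiplier, and the 60-second floor are all hypothetical choices an operator might make.

```rust
use std::time::Duration;

// Illustrative heuristic: choose a scale-down cooldown proportional to
// how long a replica takes to become ready, so that slow-loading models
// are scaled down less eagerly. Floor and multiplier are arbitrary.
fn scale_down_cooldown(model_load_time: Duration) -> Duration {
    const MIN_COOLDOWN: Duration = Duration::from_secs(60);
    // Wait at least 10x the startup latency before removing a pod.
    std::cmp::max(MIN_COOLDOWN, model_load_time * 10)
}

fn main() {
    // A model that takes 40s to become ready gets a 400s cooldown.
    let cooldown = scale_down_cooldown(Duration::from_secs(40));
    println!("scale-down cooldown: {}s", cooldown.as_secs());
}
```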

cc @achandrasekar
