In a production environment we receive many simultaneous requests and want to respond to them as quickly as possible. However, it seems that a model instance can only serve one request at a time (until it is told to stop generating tokens, or it finishes on its own), and loading multiple instances into memory is slow and impractical (the operating system scheduler would go crazy!). I haven't read the complete source code, but I suspect that many threads are spawned within just one process (or a few). If that's the case, the model at least saves resources compared to multiprocessing, but we will still have difficulty handling multiple concurrent requests. One solution is a software design pattern like the Worker or Producer-Consumer pattern with a fixed number of instances (meaning we'd have x threads handling the various simultaneous requests).
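To illustrate the pattern, here is a minimal producer-consumer sketch in Go with a fixed worker pool. The `Model` and `Request` types and the `Predict` method are placeholders of mine, not the binding's actual API; I'm assuming the real inference call is not safe for concurrent use, hence the mutex:

```go
package main

import (
	"fmt"
	"sync"
)

// Request is a placeholder for an incoming user prompt.
type Request struct {
	ID     int
	Prompt string
}

// Model stands in for a loaded llama.cpp instance; assume its
// inference call is NOT safe for concurrent use, so serialize it.
type Model struct{ mu sync.Mutex }

func (m *Model) Predict(prompt string) string {
	m.mu.Lock()
	defer m.mu.Unlock()
	return "echo: " + prompt // stand-in for real inference
}

func main() {
	const numWorkers = 4 // fixed pool: x workers, not one instance per request
	jobs := make(chan Request, 300)
	model := &Model{} // one shared instance

	var wg sync.WaitGroup
	for w := 0; w < numWorkers; w++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for req := range jobs { // consumers drain the queue
				fmt.Printf("worker %d -> request %d: %s\n",
					id, req.ID, model.Predict(req.Prompt))
			}
		}(w)
	}

	// Producer: enqueue simultaneous requests.
	for i := 0; i < 10; i++ {
		jobs <- Request{ID: i, Prompt: fmt.Sprintf("question %d", i)}
	}
	close(jobs)
	wg.Wait()
}
```

With a buffered channel as the queue, the x workers are the consumers, and in a real server the HTTP handlers would be the producers.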
Worker Design:
1. The Manager receives 300 simultaneous user requests and assigns them to Workers.
2. Each Worker processes one token, stores some conversation state in Redis (I'm still not sure how exactly to collect that state from llama.cpp), and then yields to the next request (a context switch).
3. While some workers accept new requests after processing a token, their sibling workers keep processing pending requests and return to step 2. (I sketch this loop in code below.)
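To make step 2 concrete, here is a minimal sketch in Go of the one-token-per-turn loop I have in mind. Everything in it is hypothetical: the `Conversation` type and `stepOneToken` are stand-ins for restore-state / decode-one-token / snapshot-state, which is exactly the part I don't know how to do with llama.cpp:

```go
package main

import "fmt"

// Conversation is a placeholder for per-request decoding state.
type Conversation struct {
	ID    int
	Done  bool
	Toks  int
	State []byte // opaque snapshot (e.g. kept in Redis)
}

// stepOneToken stands in for: restore state, decode one token, snapshot state.
func stepOneToken(c *Conversation) {
	c.Toks++
	if c.Toks >= 3 { // pretend generation finishes after 3 tokens
		c.Done = true
	}
	c.State = []byte(fmt.Sprintf("state-after-%d-tokens", c.Toks))
}

func main() {
	queue := make(chan *Conversation, 300)
	for i := 0; i < 3; i++ {
		queue <- &Conversation{ID: i}
	}

	// A worker round-robins: one token per turn, then "context-switches"
	// by re-enqueueing the conversation so its siblings get a turn.
	for len(queue) > 0 {
		c := <-queue
		stepOneToken(c)
		fmt.Printf("conv %d: %d tokens (%s)\n", c.ID, c.Toks, c.State)
		if !c.Done {
			queue <- c
		}
	}
}
```

The point is that a single worker (or a small fixed pool) can interleave many conversations, provided per-conversation state can be cheaply saved and restored.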
This software model describes how I envision handling multiple requests at the same time without creating an instance for each request. However, I'm still unsure about how to do the following:
1. How do I obtain the conversation state, i.e. exactly where generation left off? The worker interrupts processing to context-switch to another request, so the state has to be stored somewhere.
2. Is there a more sophisticated approach than mine? I'm not aware of one.
3. I'm familiar with the functions EnablePromptCacheAll and SaveState/LoadState, but what exactly do they do? Are these the functions needed to save the state? In any case, I want to store it in Redis rather than on disk, because disk is slow. (See the sketch after this list for one way to bridge the two.)
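If SaveState/LoadState do work the way their names suggest (writing and reading a state file on disk; I haven't verified this in the source), one workaround would be to stage the file in a temp directory and shuttle the bytes to and from Redis. A sketch using go-redis, where `saveState`/`loadState` are closures wrapping whatever the binding actually exposes:

```go
package statecache

import (
	"context"
	"os"
	"path/filepath"

	"github.com/redis/go-redis/v9"
)

// SaveStateToRedis snapshots the context to a temp file, then moves the
// bytes into Redis. saveState stands in for the binding's SaveState(path);
// adjust to the real signature.
func SaveStateToRedis(ctx context.Context, rdb *redis.Client, key string,
	saveState func(path string) error) error {

	path := filepath.Join(os.TempDir(), key+".state")
	defer os.Remove(path) // disk is only a staging area

	if err := saveState(path); err != nil {
		return err
	}
	data, err := os.ReadFile(path)
	if err != nil {
		return err
	}
	return rdb.Set(ctx, key, data, 0).Err()
}

// LoadStateFromRedis does the reverse: Redis -> temp file -> LoadState(path).
func LoadStateFromRedis(ctx context.Context, rdb *redis.Client, key string,
	loadState func(path string) error) error {

	data, err := rdb.Get(ctx, key).Bytes()
	if err != nil {
		return err
	}
	path := filepath.Join(os.TempDir(), key+".state")
	defer os.Remove(path)

	if err := os.WriteFile(path, data, 0o600); err != nil {
		return err
	}
	return loadState(path)
}
```

On Linux the temp directory could point at a tmpfs mount such as /dev/shm, so the staging file never touches a physical disk.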
Here's an additional question:
How do I disable logging? `llama.133123.log` is filling up my disk, and I haven't figured out how to turn this annoying thing off! In production I'd rather route logs through a proper, scalable log-management setup.