In a production environment we receive many simultaneous requests and want to respond to them as quickly as possible. However, it seems that a model instance can only serve one request at a time (until it is told to stop generating tokens, or it finishes on its own), and loading multiple instances into memory is slow and impractical (the operating system scheduler would go crazy!). I haven't read the complete source code, but I suspect that many threads are spawned within just one process (or a few). If that's the case, the model at least saves resources compared to multiprocessing, but we will still have difficulty handling multiple concurrent requests. One solution is a software design pattern like the Worker or Producer-Consumer pattern with a fixed number of instances (meaning we'd have x threads handling the various simultaneous requests).
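To illustrate the pattern, here is a minimal producer-consumer sketch in Go with a fixed worker pool. The `Model` and `Request` types and the `Predict` method are placeholders of mine, not the binding's actual API; I'm assuming the real inference call is not safe for concurrent use, hence the mutex:

```go
package main

import (
	"fmt"
	"sync"
)

// Request is a placeholder for an incoming user prompt.
type Request struct {
	ID     int
	Prompt string
}

// Model stands in for a loaded llama.cpp instance; assume its
// inference call is NOT safe for concurrent use, so serialize it.
type Model struct{ mu sync.Mutex }

func (m *Model) Predict(prompt string) string {
	m.mu.Lock()
	defer m.mu.Unlock()
	return "echo: " + prompt // stand-in for real inference
}

func main() {
	const numWorkers = 4 // fixed pool: x workers, not one instance per request
	jobs := make(chan Request, 300)
	model := &Model{} // one shared instance

	var wg sync.WaitGroup
	for w := 0; w < numWorkers; w++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for req := range jobs { // consumers drain the queue
				fmt.Printf("worker %d -> request %d: %s\n",
					id, req.ID, model.Predict(req.Prompt))
			}
		}(w)
	}

	// Producer: enqueue simultaneous requests.
	for i := 0; i < 10; i++ {
		jobs <- Request{ID: i, Prompt: fmt.Sprintf("question %d", i)}
	}
	close(jobs)
	wg.Wait()
}
```

With a buffered channel as the queue, the x workers are the consumers, and in a real server the HTTP handlers would be the producers.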
Worker Design:
1. The Manager receives 300 simultaneous user requests and assigns them to Workers.
2. Each Worker processes one token, stores some conversation state in Redis (I'm still not sure how exactly to collect that state from llama.cpp), and then yields to the next request (a context switch).
3. While some workers accept new requests after processing a token, their sibling workers keep processing pending requests and return to step 2. (I sketch this loop in code below.)
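To make step 2 concrete, here is a minimal sketch in Go of the one-token-per-turn loop I have in mind. Everything in it is hypothetical: the `Conversation` type and `stepOneToken` are stand-ins for restore-state / decode-one-token / snapshot-state, which is exactly the part I don't know how to do with llama.cpp:

```go
package main

import "fmt"

// Conversation is a placeholder for per-request decoding state.
type Conversation struct {
	ID    int
	Done  bool
	Toks  int
	State []byte // opaque snapshot (e.g. kept in Redis)
}

// stepOneToken stands in for: restore state, decode one token, snapshot state.
func stepOneToken(c *Conversation) {
	c.Toks++
	if c.Toks >= 3 { // pretend generation finishes after 3 tokens
		c.Done = true
	}
	c.State = []byte(fmt.Sprintf("state-after-%d-tokens", c.Toks))
}

func main() {
	queue := make(chan *Conversation, 300)
	for i := 0; i < 3; i++ {
		queue <- &Conversation{ID: i}
	}

	// A worker round-robins: one token per turn, then "context-switches"
	// by re-enqueueing the conversation so its siblings get a turn.
	for len(queue) > 0 {
		c := <-queue
		stepOneToken(c)
		fmt.Printf("conv %d: %d tokens (%s)\n", c.ID, c.Toks, c.State)
		if !c.Done {
			queue <- c
		}
	}
}
```

The point is that a single worker (or a small fixed pool) can interleave many conversations, provided per-conversation state can be cheaply saved and restored.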
This software model describes how I envision handling multiple requests at the same time without creating an instance for each request. However, I'm still unsure about how to do the following:
1. How do I obtain the conversation state, i.e. exactly where generation left off? The worker interrupts processing to context-switch to another request, so the state has to be stored somewhere.
2. Is there a more sophisticated approach than mine? I'm not aware of one.
3. I'm familiar with the functions EnablePromptCacheAll and SaveState/LoadState, but what exactly do they do? Are these the functions needed to save the state? In any case, I want to store it in Redis rather than on disk, because disk is slow. (See the sketch after this list for one way to bridge the two.)
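If SaveState/LoadState do work the way their names suggest (writing and reading a state file on disk; I haven't verified this in the source), one workaround would be to stage the file in a temp directory and shuttle the bytes to and from Redis. A sketch using go-redis, where `saveState`/`loadState` are closures wrapping whatever the binding actually exposes:

```go
package statecache

import (
	"context"
	"os"
	"path/filepath"

	"github.com/redis/go-redis/v9"
)

// SaveStateToRedis snapshots the context to a temp file, then moves the
// bytes into Redis. saveState stands in for the binding's SaveState(path);
// adjust to the real signature.
func SaveStateToRedis(ctx context.Context, rdb *redis.Client, key string,
	saveState func(path string) error) error {

	path := filepath.Join(os.TempDir(), key+".state")
	defer os.Remove(path) // disk is only a staging area

	if err := saveState(path); err != nil {
		return err
	}
	data, err := os.ReadFile(path)
	if err != nil {
		return err
	}
	return rdb.Set(ctx, key, data, 0).Err()
}

// LoadStateFromRedis does the reverse: Redis -> temp file -> LoadState(path).
func LoadStateFromRedis(ctx context.Context, rdb *redis.Client, key string,
	loadState func(path string) error) error {

	data, err := rdb.Get(ctx, key).Bytes()
	if err != nil {
		return err
	}
	path := filepath.Join(os.TempDir(), key+".state")
	defer os.Remove(path)

	if err := os.WriteFile(path, data, 0o600); err != nil {
		return err
	}
	return loadState(path)
}
```

On Linux the temp directory could point at a tmpfs mount such as /dev/shm, so the staging file never touches a physical disk.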
Here's an additional question:
How do I disable logging? `llama.133123.log` is filling up my disk, and I haven't figured out how to turn this annoying thing off! In production I'd rather route logs through a proper, scalable log-management setup.