feat(scheduler): Match models requirements for servers not replicas & improve status handling #5107
Conversation
This is easier to read in logs than spaces, which are hard to interpret as being part of structured logging.
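For illustration, a minimal sketch assuming logrus-style structured logging (the field name and message here are hypothetical) shows why snake_case keys are easier to pick out:

```go
package main

import log "github.com/sirupsen/logrus"

func main() {
	// A snake_case key reads as a single structured field in the output...
	log.WithField("model_name", "iris").Info("scheduling model")

	// ...whereas a key containing spaces is hard to distinguish from the
	// surrounding message text when scanning logs.
	log.WithField("model name", "iris").Info("scheduling model")
}
```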
This is better handled by clear, contextual log lines and meaningful summary messages for users. Debug trails are lists of strings, which can grow long when there are many servers, many replicas, and/or long messages per filter. These long messages are verbose and difficult for humans to read, reducing their utility vastly. They are also potentially large to store in memory per model, leading to undue memory consumption over purely informational content.
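As a non-authoritative sketch of that direction (the function and parameters below are invented for illustration, not the scheduler's actual API), a single summary error can stand in for the accumulated trail:

```go
package scheduler

import "fmt"

// schedulingSummary builds one concise message for users and status conditions,
// rather than retaining a per-server, per-replica trail of strings for each model.
// The model name is deliberately omitted here; the caller already has it.
func schedulingSummary(serversChecked, serversMatching, replicasNeeded int) error {
	return fmt.Errorf(
		"%d of %d servers matched the model's requirements, but none had %d suitable replicas",
		serversMatching, serversChecked, replicasNeeded,
	)
}
```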
Use summary messages when failing to schedule.
For my understanding, this means that instead of the big debug audit trail, there's the new, deliberate summary message [1]?
From my current understanding, this looks like a good change that makes the end result clear and concise.
force-pushed from 9ba3bc3 to 6c78410
ReplicaConfig: &pba.ReplicaConfig{InferenceSvc: "server1",
	InferenceHttpPort: 1,
	MemoryBytes: 1000,
	Capabilities: []string{"foo"},
ℹ️ 🔧 This test is currently failing because it breaks a key assumption of the source code: that server replicas have identical capabilities. This is a reasonable assumption -- guaranteed in k8s, in fact, because of how a Server resource is defined -- and sensible outside k8s. If server replicas have different capabilities and memory/overcommit capacities, then they should really be different server resources.
I realise that this is hard to enforce outside k8s for the time being, but I'd argue anyone not adhering to the above assumptions is implementing anti-patterns that make server administration difficult.
In a follow-up PR to set capabilities per server, there could be validation logic that ensures this assumption is met, and otherwise rejects the agent attempting to register with the scheduler.
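To make the idea concrete, a hypothetical sketch of such validation (names and types are invented; this is not the actual registration code) could compare the registering replica's capabilities against those already recorded for its server:

```go
package scheduler

import "fmt"

// validateReplicaCapabilities is a hypothetical registration-time check: the
// incoming replica must offer exactly the capabilities already recorded for its
// server, otherwise the agent's registration is rejected.
func validateReplicaCapabilities(serverCaps, replicaCaps []string) error {
	if len(serverCaps) == 0 {
		// First replica seen for this server; nothing to compare against yet.
		return nil
	}

	toSet := func(caps []string) map[string]bool {
		set := make(map[string]bool, len(caps))
		for _, c := range caps {
			set[c] = true
		}
		return set
	}

	serverSet, replicaSet := toSet(serverCaps), toSet(replicaCaps)
	if len(replicaSet) != len(serverSet) {
		return fmt.Errorf("replica capabilities %v do not match server capabilities %v", replicaCaps, serverCaps)
	}
	for c := range serverSet {
		if !replicaSet[c] {
			return fmt.Errorf("replica capabilities %v do not match server capabilities %v", replicaCaps, serverCaps)
		}
	}
	return nil
}
```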
The same applies to scheduler_test.go and its failing NotEnoughReplicas test. This test case is also forcing a failure by setting non-matching capabilities for a replica, even though this breaks an assumption. Really, these tests should either mock the underlying calls to inject failures, or use cases where server replicas are otherwise incompatible, such as a lack of available memory.
That does make sense. I think having validation (in a follow-up) is key. Are there documentation changes we could make too?
Documentation changes are a good suggestion. While I think it's best to enforce any assumptions, e.g. by rejecting the registration of a new agent if it makes claims that diverge from other replicas in the server it implements, clarifying expectations in docs is no bad thing. For the sake of expediency, and to keep this PR from growing any further, I'll make any docs changes in a follow-up PR.
Add blank lines & comments for logical separation and to describe a common testing paradigm
Replicas of a server should (must) have the same capabilities, but their available memory might vary independently of one another. This change makes it so that a server with enough active replicas has one without enough memory, meaning there are not enough suitable replicas for scheduling.
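For illustration only, the resulting fixture might look roughly like the sketch below; the struct is a stand-in mirroring the fields shown in the diff above, not the real pba.ReplicaConfig, and the values are made up:

```go
package scheduler

// replicaConfig mirrors only the fields used in the test diff above; it is a
// stand-in for illustration, not the real pba.ReplicaConfig type.
type replicaConfig struct {
	InferenceSvc      string
	InferenceHttpPort int32
	MemoryBytes       uint64
	Capabilities      []string
}

// Both replicas of "server1" share the same capabilities, but the second has too
// little memory for the model, so the server lacks enough suitable replicas.
var serverReplicas = []replicaConfig{
	{InferenceSvc: "server1", InferenceHttpPort: 1, MemoryBytes: 1000, Capabilities: []string{"foo"}},
	{InferenceSvc: "server1", InferenceHttpPort: 1, MemoryBytes: 10, Capabilities: []string{"foo"}},
}
```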
force-pushed from bde52c2 to 68f6bd7
These are unnecessary as the calling context knows the model name already. These are also duplicative, as the use of format strings to include the model name forces a near-identical string to be created in some contexts.
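A hedged sketch of the pattern (names are illustrative, assuming logrus-style logging): the error itself stays model-agnostic, and the caller attaches the model name through its logging context rather than a format string:

```go
package scheduler

import (
	"errors"

	log "github.com/sirupsen/logrus"
)

// The error no longer embeds the model name...
var errNotEnoughSuitableReplicas = errors.New("no server has enough suitable replicas")

// ...because the caller already knows the model and can attach it as a structured
// field, avoiding near-identical formatted strings at each call site.
func logSchedulingFailure(logger *log.Entry, modelName string, err error) {
	logger.WithField("model_name", modelName).WithError(err).Warn("failed to schedule model")
}
```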
feat(scheduler): Match models requirements for servers not replicas & improve status handling (#5107)

* Add blank lines for logical separation
* Add model name to logging context for scheduling method
* Use logging contexts instead of format strings in log messages
* Formatting & whitespace
* Add log message if server has insufficiently many replicas for a model
* Rewrite replica requirements filter as server filter in scheduler
* Move requirements filter to server filter in scheduler config
* Formatting
* Fix server requirements test
* Rename files to reflect changed structs
* Use underscores in logging context fields for snake case
  This is easier to read in logs than spaces, which are hard to interpret as being part of structured logging.
* Add contextual logging for server & replica filters for model scheduling
* Remove debug trail handling for model scheduling
  This is better handled by clear, contextual log lines and meaningful summary messages for users. Debug trails are lists of strings, which can grow long when there are many servers, many replicas, and/or long messages per filter. These long messages are verbose and difficult for humans to read, reducing their utility vastly. They are also potentially large to store in memory per model, leading to undue memory consumption over purely informational content.
* Remove indentation with a return in model scheduling logic
* Add comment for what was an else block
* Format long line
* Add more logging & return meaningful errors from model scheduling
* Reword log line for clarity
* Update server reqs filter description to clarify when server capabilities are unknown
* Remove erroneous argument to non-format string
* Format long lines in scheduler server test
* Add blank lines & comments for logical separation and to describe a common testing paradigm
* Fix bug setting log level for wrong logger in test
* Reduce default logging verbosity in test
* Change server replica config for test case
  Replicas of a server should (must) have the same capabilities, but their available memory might vary independently of one another. This change makes it so that a server with enough active replicas has one without enough memory, meaning there are not enough suitable replicas for scheduling.
* Change server replica config for another test case
* Remove model names from error returns of method
  These are unnecessary as the calling context knows the model name already. These are also duplicative, as the use of format strings to include the model name forces a near-identical string to be created in some contexts.
* Add server requirements test case for no server replicas
* Format long lines in server requirements test
* Add server requirements test cases for multiple server replicas
* More test formatting
Why
Motivation
Currently, when scheduling a model fails, e.g. due to more replicas being requested than are available on an otherwise compatible server, error/status messages are not very informative. As a result, the Kubernetes status is also not as informative as it could be.
Prior to this PR, scheduling a model used a debugTrail that would accumulate a list of scheduling failures. In itself, this is not very useful for anyone wanting to understand, concisely, why a model failed to be scheduled. In fact, they might see messages about various servers they know aren't compatible and wouldn't expect to be used.

This PR improves the situation for both error/status messages and logs more generally. Furthermore, it changes the requirements filtering from applying to server replicas to applying to servers. All the replicas of a single server share their requirements, so it is more efficient to filter for this earlier rather than later.
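As a loose sketch of that server-level filtering (the types and function below are invented for illustration, not the scheduler's actual filter interface), requirements can be checked once against a server's shared capabilities before any per-replica checks such as available memory:

```go
package scheduler

// server and replica are illustrative stand-ins, not the scheduler's real types.
type server struct {
	Capabilities []string
	Replicas     []replica
}

type replica struct {
	AvailableMemoryBytes uint64
}

// serverMeetsRequirements checks model requirements against the capabilities shared
// by all replicas of a server, so per-replica filtering only needs to consider
// replica-specific properties such as available memory.
func serverMeetsRequirements(s server, requirements []string) bool {
	caps := make(map[string]bool, len(s.Capabilities))
	for _, c := range s.Capabilities {
		caps[c] = true
	}
	for _, r := range requirements {
		if !caps[r] {
			return false
		}
	}
	return true
}
```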
What
Changes
Testing
Running in a simple kind cluster (as provided via the Ansible playbooks), I built and loaded a new scheduler corresponding to the changes on this branch:

I then created a simple iris model with five replicas; this is more than the default MLServer installation supports with its single replica:
Relevant scheduler logs:
Creating a model with a single replica succeeds: