
feat(scheduler): Match models requirements for servers not replicas & improve status handling #5107

Merged

Conversation

Contributor

@agrski agrski commented Aug 23, 2023

Why

Motivation

Currently, when scheduling a model fails, e.g. due to more replicas being requested than are available on an otherwise compatible server, error/status messages are not very informative. As a result, the Kubernetes status is also not as informative as it could be.

Prior to this PR, scheduling a model used a debugTrail that accumulated a list of scheduling failures. In itself, this is not very useful for anyone wanting to understand, concisely, why a model failed to be scheduled. In fact, a user might see messages about various servers they already know are incompatible and would not expect to be used.

This PR improves the situation for both error/status messages and logs more generally. Furthermore, it changes the requirements filtering from applying to server replicas to applying to servers. All the replicas of a single server share their requirements, so it is more efficient to filter for this earlier rather than later.

What

Changes

  • Change replica requirements filter to server requirements filter.
  • Use summary messages when failing to schedule.
  • Log intermediate scheduling information instead of accumulating it for statuses.

Testing

Running in a simple kind cluster (as provided via the Ansible playbooks), I built and loaded a new scheduler corresponding to the changes on this branch:

make -C scheduler/ docker-build-scheduler
make -C scheduler/ kind-image-install-scheduler
kubectl -n seldon rollout restart sts seldon-scheduler

I then created a simple iris model with five replicas; this is more than the default MLServer installation supports with its single replica:

kubectl delete model -n seldon iris
cat samples/models/iris-v1.yaml | sed '/^spec:$/a\  replicas: 5' | kubectl -n seldon apply -f -
kubectl get model -n seldon iris -o json | jq '.status.conditions[] | select(.type == "ModelReady")'
{
  "lastTransitionTime": "2023-08-23T15:54:27Z",
  "message": "ModelFailed",
  "reason": "rpc error: code = FailedPrecondition desc = Failed to schedule model iris as no matching server had enough suitable replicas",
  "status": "False",
  "type": "ModelReady"
}

Relevant scheduler logs:

time="2023-08-23T15:54:27Z" level=debug msg="Schedule model" Name=SimpleScheduler func=scheduleToServer model=iris
time="2023-08-23T15:54:27Z" level=debug msg="Filtering servers for model" Name=SimpleScheduler func=filterServer model=iris num_servers=2
time="2023-08-23T15:54:27Z" level=debug msg="Accepting server for model" Name=SimpleScheduler func=filterServer model=iris server=mlserver
time="2023-08-23T15:54:27Z" level=debug msg="Rejecting server for model" Name=SimpleScheduler filter=ServerRequirementsFilter func=filterServer model=iris reason="model requirements [sklearn] server capabilities [triton dali fil onnx openvino python pytorch tensorflow tensorrt]" server=triton
time="2023-08-23T15:54:27Z" level=debug msg="Identified candidate servers for model" Name=SimpleScheduler candidate_servers="[mlserver]" desired_replicas=5 func=scheduleToServer model=iris
time="2023-08-23T15:54:27Z" level=debug msg="Checking compatibility with candidate server" Name=SimpleScheduler func=scheduleToServer model=iris server=mlserver
time="2023-08-23T15:54:27Z" level=debug msg="Filtering server replicas for model" Name=SimpleScheduler func=filterReplicas model=iris server=mlserver
time="2023-08-23T15:54:27Z" level=debug msg="Accepting server replica for model" Name=SimpleScheduler func=filterReplicas model=iris replica=0 server=mlserver
time="2023-08-23T15:54:27Z" level=debug msg="Skipping server due to insufficient suitable replicas" Name=SimpleScheduler available_replicas=1 desired_replicas=5 func=scheduleToServer model=iris server=mlserver
time="2023-08-23T15:54:27Z" level=debug msg="Failed to schedule model as no matching server had enough suitable replicas" Name=SimpleScheduler func=scheduleToServer model=iris

Creating a model with a single replica succeeds:

kubectl delete model -n seldon iris
kubectl -n seldon apply -f samples/models/iris-v1.yaml
kubectl get model -n seldon iris -o json | jq '.status.conditions[] | select(.type == "ModelReady")'
{
  "lastTransitionTime": "2023-08-25T13:20:39Z",
  "message": "ModelAvailable",
  "status": "True",
  "type": "ModelReady"
}

@agrski agrski self-assigned this Aug 23, 2023
@agrski agrski added the v2 label Aug 23, 2023
@agrski agrski changed the title Update model status when insufficient server replicas feat(scheduler): Match models requirements for servers not replicas & improve status handling Aug 23, 2023
Contributor

@jesse-c jesse-c left a comment


Use summary messages when failing to schedule.

For my understanding, this means that instead of the big debug audit trail, there's the new, deliberate summary message [1]?

[1] https://github.com/SeldonIO/seldon-core/pull/5107/files#diff-b6060c1a371e5e0f7f24614386ef51aace951716ec8c2581d682aaf234354ad4R211

Review threads (resolved) on:
scheduler/pkg/scheduler/filters/serverrequirements.go
scheduler/pkg/scheduler/scheduler.go
Contributor

@jesse-c jesse-c left a comment


From my current understanding, it looks like a good change to make it clear and concise what the end result is.

Review threads (resolved) on:
scheduler/pkg/scheduler/filters/serverrequirements.go
@agrski agrski force-pushed the update-model-status-when-insufficient-server-replicas branch from 9ba3bc3 to 6c78410 Compare August 24, 2023 14:43
ReplicaConfig: &pba.ReplicaConfig{InferenceSvc: "server1",
InferenceHttpPort: 1,
MemoryBytes: 1000,
Capabilities: []string{"foo"},
Contributor Author


ℹ️ 🔧 This test is currently failing because it breaks a key assumption of the source code: that server replicas have identical capabilities. This is a reasonable assumption -- guaranteed in k8s in fact, because of how a Server resource is defined -- and sensible outside k8s. If server replicas have different capabilities and memory/overcommit capacities, then they should really be different server resources.

I realise that this is hard to enforce outside k8s for the time being, but I'd argue anyone not adhering to the above assumptions is implementing anti-patterns that make server administration difficult.

In a follow-up PR to set capabilities per server, there could be validation logic that ensures this assumption is met, and otherwise rejects the agent attempting to register with the scheduler.

Contributor Author


The same applies to scheduler_test.go and its failing NotEnoughReplicas test. This test case is also forcing a failure by setting non-matching capabilities for a replica, even though this breaks an assumption. Really these should either mock the underlying calls to inject failures, or should use cases where server replicas are otherwise incompatible, such as a lack of available memory.

Contributor


That does make sense. I think having validation (in a follow-up) is key. Are there documentation changes we could make too?

Contributor Author


Documentation changes are a good suggestion. While I think it's best to enforce any assumptions, e.g. by rejecting the registration of a new agent if it makes claims that diverge from other replicas in the server it implements, clarifying expectations in docs is no bad thing. For the sake of expediency and this PR not growing any more, I'll make any docs changes in a follow-up PR.

@agrski agrski force-pushed the update-model-status-when-insufficient-server-replicas branch from bde52c2 to 68f6bd7 Compare August 24, 2023 17:06
@agrski agrski marked this pull request as ready for review August 24, 2023 18:31
@agrski agrski merged commit e182d08 into SeldonIO:v2 Aug 25, 2023
5 checks passed
@agrski agrski deleted the update-model-status-when-insufficient-server-replicas branch August 25, 2023 13:22
adriangonz pushed a commit that referenced this pull request Aug 31, 2023
… improve status handling (#5107)

* Add blank lines for logical separation

* Add model name to logging context for scheduling method

* Use logging contexts instead of format strings in log messages

* Formatting & whitespace

* Add log message if server has insufficiently many replicas for a model

* Rewrite replica requirements filter as server filter in scheduler

* Move requirements filter to server filter in scheduler config

* Formatting

* Fix server requirements test

* Rename files to reflect changed structs

* Use underscores in logging context fields for snake case

This is easier to read in logs than spaces, which are hard to interpret as being part of structured logging.

* Add contextual logging for server & replica filters for model scheduling

* Remove debug trail handling for model scheduling

This is better handled by clear, contextual log lines and meaningful summary messages for users.
Debug trails are lists of strings, which can grow long when there are many servers, many replicas, and/or long messages per filter.
These long messages are verbose and difficult for humans to read, reducing their utility vastly.
They are also potentially large to store in memory per model, leading to undue memory consumption over purely informational content.

* Remove indentation with a return in model scheduling logic

* Add comment for what was an else block

* Format long line

* Add more logging & return meaningful errors from model scheduling

* Reword log line for clarity

* Update server reqs filter description to clarify when server capabilities are unknown

* Remove erroneous argument to non-format string

* Format long lines in scheduler server test

* Add blank lines & comments for logical separation and to describe a common testing paradigm

* Fix bug setting log level for wrong logger in test

* Reduce default logging verbosity in test

* Change server replica config for test case

Replicas of a server should (must) have the same capabilities, but their available memory might vary independently of one another.
This change makes it so that a server with enough active replicas has one without enough memory, meaning there are not enough suitable replicas for scheduling.

* Change server replica config for another test case

* Remove model names from error returns of method

These are unnecessary as the calling context knows the model name already.
These are also duplicative, as the use of format strings to include the model name forces a near-identical string to be created in some contexts.

* Add server requirements test case for no server replicas

* Format long lines in server requirements test

* Add server requirements test cases for multiple server replicas

* More test formatting