feat(scheduler): Match models requirements for servers not replicas & improve status handling #5107
Conversation
This is easier to read in logs than spaces, which are hard to interpret as being part of structured logging.
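For illustration, a minimal sketch assuming logrus-style structured logging (the field name and message here are hypothetical) shows why snake_case keys are easier to pick out:

```go
package main

import log "github.com/sirupsen/logrus"

func main() {
	// A snake_case key reads as a single structured field in the output...
	log.WithField("model_name", "iris").Info("scheduling model")

	// ...whereas a key containing spaces is hard to distinguish from the
	// surrounding message text when scanning logs.
	log.WithField("model name", "iris").Info("scheduling model")
}
```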
This is better handled by clear, contextual log lines and meaningful summary messages for users. Debug trails are lists of strings, which can grow long when there are many servers, many replicas, and/or long messages per filter. These long messages are verbose and difficult for humans to read, reducing their utility vastly. They are also potentially large to store in memory per model, leading to undue memory consumption over purely informational content.
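As a non-authoritative sketch of that direction (the function and parameters below are invented for illustration, not the scheduler's actual API), a single summary error can stand in for the accumulated trail:

```go
package scheduler

import "fmt"

// schedulingSummary builds one concise message for users and status conditions,
// rather than retaining a per-server, per-replica trail of strings for each model.
// The model name is deliberately omitted here; the caller already has it.
func schedulingSummary(serversChecked, serversMatching, replicasNeeded int) error {
	return fmt.Errorf(
		"%d of %d servers matched the model's requirements, but none had %d suitable replicas",
		serversMatching, serversChecked, replicasNeeded,
	)
}
```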
Use summary messages when failing to schedule.
For my understanding, this means that instead of the big debug audit trail, there's the new, deliberate summary message [1]?
From my current understanding, this looks like a good change that makes the end result clear and concise.
force-pushed from 9ba3bc3 to 6c78410
ReplicaConfig: &pba.ReplicaConfig{InferenceSvc: "server1",
	InferenceHttpPort: 1,
	MemoryBytes: 1000,
	Capabilities: []string{"foo"},
ℹ️ 🔧 This test is currently failing because it breaks a key assumption of the source code: that server replicas have identical capabilities. This is a reasonable assumption -- guaranteed in k8s, in fact, because of how a Server resource is defined -- and sensible outside k8s. If server replicas have different capabilities and memory/overcommit capacities, then they should really be different server resources.
I realise that this is hard to enforce outside k8s for the time being, but I'd argue anyone not adhering to the above assumptions is implementing anti-patterns that make server administration difficult.
In a follow-up PR to set capabilities per server, there could be validation logic that ensures this assumption is met, and otherwise rejects the agent attempting to register with the scheduler.
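To make the idea concrete, a hypothetical sketch of such validation (names and types are invented; this is not the actual registration code) could compare the registering replica's capabilities against those already recorded for its server:

```go
package scheduler

import "fmt"

// validateReplicaCapabilities is a hypothetical registration-time check: the
// incoming replica must offer exactly the capabilities already recorded for its
// server, otherwise the agent's registration is rejected.
func validateReplicaCapabilities(serverCaps, replicaCaps []string) error {
	if len(serverCaps) == 0 {
		// First replica seen for this server; nothing to compare against yet.
		return nil
	}

	toSet := func(caps []string) map[string]bool {
		set := make(map[string]bool, len(caps))
		for _, c := range caps {
			set[c] = true
		}
		return set
	}

	serverSet, replicaSet := toSet(serverCaps), toSet(replicaCaps)
	if len(replicaSet) != len(serverSet) {
		return fmt.Errorf("replica capabilities %v do not match server capabilities %v", replicaCaps, serverCaps)
	}
	for c := range serverSet {
		if !replicaSet[c] {
			return fmt.Errorf("replica capabilities %v do not match server capabilities %v", replicaCaps, serverCaps)
		}
	}
	return nil
}
```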
The same applies to scheduler_test.go and its failing NotEnoughReplicas test. This test case is also forcing a failure by setting non-matching capabilities for a replica, even though this breaks an assumption. Really, these tests should either mock the underlying calls to inject failures, or use cases where server replicas are otherwise incompatible, such as a lack of available memory.
That does make sense. I think having validation (in a follow-up) is key. Are there documentation changes we could make too?
Documentation changes are a good suggestion. While I think it's best to enforce any assumptions, e.g. by rejecting the registration of a new agent if it makes claims that diverge from other replicas in the server it implements, clarifying expectations in docs is no bad thing. For the sake of expediency, and to keep this PR from growing any further, I'll make any docs changes in a follow-up PR.
Add blank lines & comments for logical separation and to describe a common testing paradigm
Replicas of a server should (must) have the same capabilities, but their available memory might vary independently of one another. This change makes it so that a server with enough active replicas has one without enough memory, meaning there are not enough suitable replicas for scheduling.
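For illustration only, the resulting fixture might look roughly like the sketch below; the struct is a stand-in mirroring the fields shown in the diff above, not the real pba.ReplicaConfig, and the values are made up:

```go
package scheduler

// replicaConfig mirrors only the fields used in the test diff above; it is a
// stand-in for illustration, not the real pba.ReplicaConfig type.
type replicaConfig struct {
	InferenceSvc      string
	InferenceHttpPort int32
	MemoryBytes       uint64
	Capabilities      []string
}

// Both replicas of "server1" share the same capabilities, but the second has too
// little memory for the model, so the server lacks enough suitable replicas.
var serverReplicas = []replicaConfig{
	{InferenceSvc: "server1", InferenceHttpPort: 1, MemoryBytes: 1000, Capabilities: []string{"foo"}},
	{InferenceSvc: "server1", InferenceHttpPort: 1, MemoryBytes: 10, Capabilities: []string{"foo"}},
}
```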
force-pushed from bde52c2 to 68f6bd7
These are unnecessary as the calling context knows the model name already. These are also duplicative, as the use of format strings to include the model name forces a near-identical string to be created in some contexts.
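A hedged sketch of the pattern (names are illustrative, assuming logrus-style logging): the error itself stays model-agnostic, and the caller attaches the model name through its logging context rather than a format string:

```go
package scheduler

import (
	"errors"

	log "github.com/sirupsen/logrus"
)

// The error no longer embeds the model name...
var errNotEnoughSuitableReplicas = errors.New("no server has enough suitable replicas")

// ...because the caller already knows the model and can attach it as a structured
// field, avoiding near-identical formatted strings at each call site.
func logSchedulingFailure(logger *log.Entry, modelName string, err error) {
	logger.WithField("model_name", modelName).WithError(err).Warn("failed to schedule model")
}
```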
feat(scheduler): Match models requirements for servers not replicas & improve status handling (#5107)

* Add blank lines for logical separation
* Add model name to logging context for scheduling method
* Use logging contexts instead of format strings in log messages
* Formatting & whitespace
* Add log message if server has insufficiently many replicas for a model
* Rewrite replica requirements filter as server filter in scheduler
* Move requirements filter to server filter in scheduler config
* Formatting
* Fix server requirements test
* Rename files to reflect changed structs
* Use underscores in logging context fields for snake case
  This is easier to read in logs than spaces, which are hard to interpret as being part of structured logging.
* Add contextual logging for server & replica filters for model scheduling
* Remove debug trail handling for model scheduling
  This is better handled by clear, contextual log lines and meaningful summary messages for users. Debug trails are lists of strings, which can grow long when there are many servers, many replicas, and/or long messages per filter. These long messages are verbose and difficult for humans to read, reducing their utility vastly. They are also potentially large to store in memory per model, leading to undue memory consumption over purely informational content.
* Remove indentation with a return in model scheduling logic
* Add comment for what was an else block
* Format long line
* Add more logging & return meaningful errors from model scheduling
* Reword log line for clarity
* Update server reqs filter description to clarify when server capabilities are unknown
* Remove erroneous argument to non-format string
* Format long lines in scheduler server test
* Add blank lines & comments for logical separation and to describe a common testing paradigm
* Fix bug setting log level for wrong logger in test
* Reduce default logging verbosity in test
* Change server replica config for test case
  Replicas of a server should (must) have the same capabilities, but their available memory might vary independently of one another. This change makes it so that a server with enough active replicas has one without enough memory, meaning there are not enough suitable replicas for scheduling.
* Change server replica config for another test case
* Remove model names from error returns of method
  These are unnecessary as the calling context knows the model name already. These are also duplicative, as the use of format strings to include the model name forces a near-identical string to be created in some contexts.
* Add server requirements test case for no server replicas
* Format long lines in server requirements test
* Add server requirements test cases for multiple server replicas
* More test formatting
Why
Motivation
Currently, when scheduling a model fails, e.g. due to more replicas being requested than are available on an otherwise compatible server, error/status messages are not very informative. As a result, the Kubernetes status is also not as informative as it could be.
Prior to this PR, scheduling a model used a debugTrail that would accumulate a list of scheduling failures. In itself, this is not very useful for anyone wanting to understand, concisely, why a model failed to be scheduled. In fact, they might see messages about various servers they know aren't compatible and wouldn't expect to be used.

This PR improves the situation for both error/status messages and logs more generally. Furthermore, it changes the requirements filtering from applying to server replicas to applying to servers. All the replicas of a single server share their requirements, so it is more efficient to filter for this earlier rather than later.
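As a loose sketch of that server-level filtering (the types and function below are invented for illustration, not the scheduler's actual filter interface), requirements can be checked once against a server's shared capabilities before any per-replica checks such as available memory:

```go
package scheduler

// server and replica are illustrative stand-ins, not the scheduler's real types.
type server struct {
	Capabilities []string
	Replicas     []replica
}

type replica struct {
	AvailableMemoryBytes uint64
}

// serverMeetsRequirements checks model requirements against the capabilities shared
// by all replicas of a server, so per-replica filtering only needs to consider
// replica-specific properties such as available memory.
func serverMeetsRequirements(s server, requirements []string) bool {
	caps := make(map[string]bool, len(s.Capabilities))
	for _, c := range s.Capabilities {
		caps[c] = true
	}
	for _, r := range requirements {
		if !caps[r] {
			return false
		}
	}
	return true
}
```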
What
Changes
Testing
Running in a simple kind cluster (as provided via the Ansible playbooks), I built and loaded a new scheduler corresponding to the changes on this branch:

I then created a simple iris model with five replicas; this is more than the default MLServer installation supports with its single replica:
Relevant scheduler logs:
Creating a model with a single replica succeeds: