Run Connectors and Tasks in Isolated Pods #96

Closed
wants to merge 2 commits

Conversation

gharris1727

Hi all!

This is a sister proposal to KIP-987: Connect Static Assignments

The goal of this proposal is to alleviate the operational headaches that come from shared resource pooling within Kafka Connect.

This is my first proposal to Strimzi, and I don't have a lot of familiarity with it as a user or developer. Please don't hesitate to point out parts that don't make sense, as I may have missed something just by reading the public documentation.

Thanks!

Signed-off-by: Greg Harris <greg.harris@aiven.io>
Member

@scholzj scholzj left a comment


Thanks for the proposal. I had a quick look at it ... I think technically it would be feasible, but the complexity would be quite high. The main issues I see are around these two points:

  • The first is the mixing of the two scheduling types. I think having a Connect cluster with the static assignment might make sense in some cases. Having a cluster with the shared assignment would make sense. But mixing them together raises a lot of edge cases and increases the implementation complexity significantly.
  • The second is the issue that the tasks are being scaled from inside Connect and not from the Kubernetes level. That makes things significantly more complex as well. Things would be much easier if one could use the maximum number of tasks as the number of replicas. But that is of course not how Connect works. If nothing else, I think it would be much easier if the tasks are spawned as PENDING in Connect and Strimzi picks them up from the REST API, starts a new Pod, and assigns them. That seems easier than the approach of having them spawned in some shared pod and then picked up from there and moved.

I also wonder whether you considered having the KafkaConnector represent the Connect cluster. If you want to run the connectors like this, do they really need to share the same distributed Connect cluster? Are there any advantages to it? Are there any risks that with too many small pods (each pod == one Connect node) the scalability of the Connect control plane becomes an issue?


The thing I probably struggle with most is the motivation behind the design. You explained why sharing the JVM between basically random connectors is wrong. And you touched on why the resources for the connector and various tasks will differ. But I think it is missing a comparison with other technologies. Why is Connect worth this investment? Isn't it better to instead shift the focus to some integration projects that are more cloud native by design, such as Apache Camel? The current Connect design clearly has its limitations when running on Kubernetes. But for me personally, it is not clear if this is something worth saving and redesigning. So it would be great if the proposal could elaborate on this a bit as well.

This mode will use the Static Assignments feature to assign connectors and tasks each to their own worker.
This will be opt-in, and able to be performed as a live migration to an existing cluster.

A KafkaConnectorSpec will accept a new optional ConnectorResourceRequirements field, to be used for the Connector instance.
Member


Would the connector resources and task resources be expected to include the JVM overhead? Or will that be something extra configured somewhere else?

Author


I imagined that for clarity it would forward the requirements directly to the pod with no intermediate math, so these resource requirements would all need to include the JVM overhead. Are you familiar with another situation where a k8s operator computes the resource limits in a similar way?

Member


It was more of a question than a suggestion that we should calculate it. I'm fine if the JVM overhead is expected to be included here. Might be worth stating it in the proposal.
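
To make the resource requirements discussion above more concrete, here is a minimal sketch of what a KafkaConnector with per-connector and per-task requirements might look like. The `resources` and `taskResources` field names and all values are illustrative assumptions rather than the proposal's final schema, and the values are taken to include the JVM overhead as discussed above.

```yaml
# Hypothetical sketch: field names `resources` and `taskResources` are
# assumptions for illustration; the proposal does not fix the exact schema.
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnector
metadata:
  name: my-source-connector
  labels:
    strimzi.io/cluster: my-connect-cluster
spec:
  class: org.apache.kafka.connect.file.FileStreamSourceConnector
  tasksMax: 2
  config:
    file: /tmp/input.txt
    topic: my-topic
  # Requirements for the dedicated connector-instance pod, including JVM overhead
  resources:
    requests:
      cpu: 250m
      memory: 512Mi
    limits:
      memory: 512Mi
  # Default requirements for each dedicated task pod, including JVM overhead
  taskResources:
    requests:
      cpu: 500m
      memory: 1Gi
    limits:
      memory: 1Gi
```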

@gharris1727
Author

Hey @scholzj, thanks for the quick first pass! You had some excellent questions that cut right to the core of the feature.

But mixing them together raises a lot of edge cases and increases the implementation complexity significantly.

This is mostly borne out of trying to model the intermediate state when rolling between a fully-shared and a fully-isolated cluster. I figured that a feature which required a stop-the-world migration would be significantly less user-friendly, enough to justify some extra complexity in the scheduling algorithm. And if the Static Assignments feature could express those states, it would also make sense to expose that to Strimzi users.

If you believe that Strimzi should prevent users from using hybrid-scheduling as a matter of being opinionated, I think we could delay emitting static assignments until every KafkaConnector has ResourceRequirements attached, and/or some enable switch is flipped. Modeling the intermediate states then just becomes an implementation detail of the rolling migration rather than a user-facing feature.

Things would be much easier if one could use the maximum number of tasks as the number of replicas. But that is of course not how Connect works.

We have a loose proxy for this with the tasks.max configuration, but it's technically not enforced by the framework. A connector could go over the advised limit, but I don't think I've seen any implementations do that except by mistake.
We could have all of the pod provisioning happen based on the KafkaConnector spec by looking at tasks.max, so the control flow doesn't have to depend on the contents of the REST API.

I also wonder whether you considered having the KafkaConnector represent the Connect cluster. If you want to run the connectors like this, do they really need to share the same distributed Connect cluster? Are there any advantages to it?

A KafkaConnect cluster certainly has less relevance in a fully-isolated cluster. But the workers of a fully-isolated cluster will still share a config topic, REST API, group coordinator, etc. The cost of those resources (however minor) does get amortized. But possibly the more important benefit is inheriting Kafka client configurations and secrets. To eliminate the KafkaConnect abstraction from Strimzi would require adding at least some of the worker config properties to the KafkaConnector abstraction.

Are there any risks that with too many small pods (each pod == one Connect node) the scalability of the Connect control plane becomes an issue?

This is certainly a risk, and I will need to do some practical experiments to evaluate how a connect cluster behaves with 100-1000 lightweight nodes rather than 10-100 heavy nodes. If this did become a problem, it might mean that we need to make other investments into the Connect scalability before this feature will be safe to use for large clusters.

But I think it is missing a comparison with other technologies. Why is Connect worth this investment? Isn't it better to instead shift the focus to some integration projects that are more cloud native by design, such as Apache Camel? The current Connect design clearly has its limitations when running on Kubernetes. But for me personally, it is not clear if this is something worth saving and redesigning. So it would be great if the proposal could elaborate on this a bit as well.

I think that ease of operation is an extremely good criterion for users to consider when evaluating whether to use Connect or Camel. Someone could certainly use Connect, then realize that it's inferior to Camel, then migrate to Camel and find that it solves their problems. I have not heard of an instance myself, but it's probably happened at least once.

But this proposal is not for users evaluating Connect, it is an incremental improvement for operators that have already made the choice or had the choice made for them. In the next draft I'll make sure to advocate more for the operators.

Member

@tombentley tombentley left a comment


Thanks Greg! I left a few comments. I think it's worth elaborating in a bit more detail what the KafkaConnector API would look like to users. An example spec and perhaps status.conditions for some of the unhappy paths would go a long way to making it more concrete to readers of the proposal.

Comment on lines +46 to +50
For a cluster which has ResourceRequirements for all KafkaConnectors, the shared workers will typically be empty, functioning as hot standby instances.
Hot-standby workers are assigned connectors and tasks which do not have a live static worker, either because that pod is offline or Strimzi has not created it yet.
Connectors may dynamically add tasks, and if Strimzi is offline or slow to respond, those tasks will be assigned to hot-standby workers.
If no hot-standby workers are available, tasks will remain unassigned until the associated static worker pod rejoins the group.
If no hot-standby workers are available, connectors and tasks will be disallowed from using shared JVMs.
Member


This would seem to imply the possibility of repeated reassignment of tasks between workers when tasks are more dynamic than the polling interval used in the operator.

  1. Connector reconfigures dynamic tasks
  2. Assigned to shared worker
  3. Operator creates pod for static worker
  4. Reassigned to static worker
  5. Connector reconfigures dynamic tasks...

Author


Yes, if the number of tasks that a connector emits is flapping, so will the number of pods. This wouldn't happen if the set of pods was determined by the tasksMax. The operator could also apply some hysteresis: after a task disappears, delay stopping its worker until some time has passed.

Also note that reconfiguring the connectors is not the same thing as changing the number of tasks. The connector can reconfigure its tasks without waiting for the management layer to provision new pods if the number of tasks doesn't change.

Comment on lines +64 to +73
1. Upgrade the Connect image to a version which supports static assignments feature.
2. Move the desired Connectors to isolated mode individually. For each connector:
1. Estimate the resources needed for the connector and each task, possibly by dividing the KafkaConnect resource requirements by the average tasks-per-worker.
2. Apply the resource requirements to the connector and task defaults, and wait for corresponding pods to be created.
3. Verify the health of the connector & tasks, and compare the resource utilization to the estimates. Increase or decrease requirements as needed.
4. If the Kubernetes cluster cannot schedule the pods due to resource constraints, consider adding a new node or reducing the KafkaConnect replicas.
5. If the KafkaConnect workers are lightly loaded, consider reducing the KafkaConnect replicas.
3. If all Connectors now have resource requirements specified, reduce the KafkaConnect replicas to the number of desired hot-standby workers.
1. The maximum number of recommended hot-standbys is the number of replicas that was previously in-use before applying per-connector/task ResourceRequirements.
2. The minimum number of recommended hot-standbys is 0, which disables hot-standbys entirely and forces the use of isolated JVMs.
Member


This all makes sense, but it would really help if the proposal defined how the status of KafkaConnector and KafkaConnect would look in some of these circumstances. In other words how does the end user know, from looking at their KafkaConnector and KafkaConnect resources, that Kube cannot schedule the dedicated pod(s) required for a connector? Or how do they easily know the number of hot-standbys?

Author


I'm not familiar with the way that Strimzi uses statuses to surface this sort of information, and what would be idiomatic for users. If the current Status system exposes the UNASSIGNED state used by the framework, a lack of dedicated pods or hot-standbys would appear as a job never getting an assignment.

I'll look into this more and come up with some status examples.
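
As a strawman for what such a status might look like, here is a sketch following the shape of existing Strimzi `status.conditions`; the `PodUnschedulable` reason and the message text are invented for illustration and not defined by the proposal.

```yaml
# Hypothetical KafkaConnector status when Kubernetes cannot schedule a dedicated task pod.
# The condition reason and message are assumptions for illustration only.
status:
  observedGeneration: 3
  tasksMax: 2
  conditions:
    - type: NotReady
      status: "True"
      reason: PodUnschedulable
      lastTransitionTime: "2024-01-10T12:00:00Z"
      message: "Pod for task my-source-connector-1 is pending: insufficient memory on all nodes"
```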

@katheris
Contributor

Hi Greg, this is definitely an interesting proposal. On your earlier comment:

We have a loose proxy for this with the tasks.max configuration, but it's technically not enforced by the framework. A connector could go over the advised limit, but I don't think I've seen any implementations do that except by mistake.
We could have all of the pod provisioning happen based on the KafkaConnector spec by looking at tasks.max, so the control flow doesn't have to depend on the contents of the REST API.

I think the tricky thing here is although in general connectors don't go over the tasks.max configuration, there are connectors that start fewer tasks. For example the Confluent JDBC connector only starts a single task when running in query mode, and otherwise it creates the number of tasks equal to either tasks.max or the number of tables in the database it is streaming updates from, whichever is lower.

So if the operator did base the number of pods on the tasks.max configuration then some pods might be left without any workload running on them. I wasn't sure from your KIP what happens if a worker is configured with static.tasks set to a task id that actually doesn't exist.

…on and rejected alternatives

Signed-off-by: Greg Harris <greg.harris@aiven.io>
@gharris1727
Author

Thanks for the review @katheris !

So if the operator did base the number of pods on the tasks.max configuration then some pods might be left without any workload running on them. I wasn't sure from your KIP what happens if a worker is configured with static.tasks set to a task id that actually doesn't exist.

Thanks for pointing this out, I've clarified in the KIP that the values in static.connectors and static.tasks need not exist, in order to allow for over-provisioning task pods and bootstrapping clusters with 0 replicas. This means that if a connector has fewer tasks than tasks.max, those additional pods will just remain idle, but will immediately pick up the task if the connector creates it.

Whether Strimzi uses closed-loop polling to find the actual number of tasks or infers the number of tasks from tasks.max is a design tradeoff. We can either complicate the operator, or allow users to over-provision pods by setting a tasks.max which is too high. I don't feel strongly either way, and I'll make sure that the underlying KIP supports both strategies.
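
To illustrate the over-provisioning point, here is a hedged sketch of the per-worker static assignment configuration an operator might render for two isolated pods. The `static.connectors` / `static.tasks` property names come from KIP-987 as referenced above; showing them as YAML and the `<connector>-<index>` task-id format are assumptions for illustration.

```yaml
# Hypothetical per-worker config injected by the operator (shown as YAML for readability).
---
# Worker pod dedicated to the connector instance
static.connectors: my-source-connector
---
# Worker pod dedicated to task 0; if the task does not exist yet, the pod
# simply stays idle and picks the task up as soon as the connector creates it
static.tasks: my-source-connector-0
```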

@scholzj
Member

scholzj commented Jan 25, 2024

Triaged on the Strimzi Community call on 25.1.2024: @gharris1727 What are the plans for this proposal going forward? Should we keep it open? Should we close it until the KIP related to this moves forward?

@gharris1727
Author

Hi @scholzj

The current KIP-987 design needs another pass, as the UX of the feature is poor for human operators, and is effectively only usable by machine operators. It is not ready for a vote at this time. The next best design that I know of right now is implementing a taint/selector system within Connect, and using a set of unique task-id/connector-id selectors to give Strimzi complete control over scheduling.

As far as keeping this open or reopening this later: it costs nothing to press the button, so I'm fine with either. Since the underlying feature isn't ready I suppose there's no action to take here at the moment, but I am still interested in implementing this proposal at some point in the future.

@ppatierno
Member

Triaged on 17/10/2024: agreed on closing the pull request for now and maybe re-opening it, or a new one, when KIP-987 is approved and this one can move forward.

@ppatierno ppatierno closed this Oct 17, 2024