Run Connectors and Tasks in Isolated Pods #96

Closed
wants to merge 2 commits

Conversation

gharris1727

Hi all!

This is a sister proposal to KIP-987: Connect Static Assignments

The goal of this proposal is to alleviate the operational headaches that come from shared resource pooling within Kafka Connect.

This is my first proposal to Strimzi, and I don't have a lot of familiarity with it as a user or developer. Please don't hesitate to point out parts that don't make sense, as I may have missed something just by reading the public documentation.

Thanks!

Signed-off-by: Greg Harris <greg.harris@aiven.io>
Member

@scholzj scholzj left a comment


Thanks for the proposal. I had a quick look at it ... I think technically it would be feasible, but the complexity would be quite high. The main issues I see are around these two points:

  • The first is the mixing of the two scheduling types. I think having a Connect cluster with the static assignment might make sense in some cases. Having a cluster with the shared assignment would make sense. But mixing them together raises a lot of edge cases and increases the implementation complexity significantly.
  • The second is the issue that the tasks are being scaled from inside Connect and not from the Kubernetes level. That makes things significantly more complex as well. Things would be much easier if one could use the maximum number of tasks as the number of replicas. But that is of course not how Connect works. If nothing else, I think it would be much easier if the tasks are spawned as PENDING in Connect and Strimzi picks them up from the REST API, starts a new Pod, and assigns them. That seems easier than the approach of having them spawned in some shared pod and then picked up from there and moved.

I also wonder whether you considered having the KafkaConnector represent the Connect cluster. If you want to run the connectors like this, do they really need to share the same distributed Connect cluster? Are there any advantages to it? Are there any risks that with too many small pods (each pod == one Connect node) the scalability of the Connect control plane becomes an issue?


The thing I probably struggle with most is the motivation behind the design. You explained why sharing the JVM between basically random connectors is wrong. And you touched on why the resources for the connector and various tasks will differ. But I think it is missing a comparison with other technologies. Why is Connect worth this investment? Isn't it better to instead shift the focus to some integration projects that are more cloud native by design, such as Apache Camel? The current Connect design clearly has its limitations when running on Kubernetes. But for me personally, it is not clear if this is something worth saving and redesigning. So it would be great if the proposal could elaborate on this a bit as well.

This mode will use the Static Assignments feature to assign connectors and tasks each to their own worker.
This will be opt-in, and able to be performed as a live migration to an existing cluster.

A KafkaConnectorSpec will accept a new optional ConnectorResourceRequirements field, to be used for the Connector instance.
Member


Would the connector resources and task resources be expected to include the JVM overhead? Or will that be something extra configured somewhere else?

Author


I imagined that for clarity it would forward the requirements directly to the pod with no intermediate math, so these resource requirements would all need to include the JVM overhead. Are you familiar with another situation where a k8s operator computes the resource limits in a similar way?

Member


It was more of a question than a suggestion that we should calculate it. I'm fine if the JVM overhead is expected to be included here. Might be worth stating it in the proposal.
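
To make the resource requirements discussion above more concrete, here is a minimal sketch of what a KafkaConnector with per-connector and per-task requirements might look like. The `resources` and `taskResources` field names and all values are illustrative assumptions rather than the proposal's final schema, and the values are taken to include the JVM overhead as discussed above.

```yaml
# Hypothetical sketch: field names `resources` and `taskResources` are
# assumptions for illustration; the proposal does not fix the exact schema.
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnector
metadata:
  name: my-source-connector
  labels:
    strimzi.io/cluster: my-connect-cluster
spec:
  class: org.apache.kafka.connect.file.FileStreamSourceConnector
  tasksMax: 2
  config:
    file: /tmp/input.txt
    topic: my-topic
  # Requirements for the dedicated connector-instance pod, including JVM overhead
  resources:
    requests:
      cpu: 250m
      memory: 512Mi
    limits:
      memory: 512Mi
  # Default requirements for each dedicated task pod, including JVM overhead
  taskResources:
    requests:
      cpu: 500m
      memory: 1Gi
    limits:
      memory: 1Gi
```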

@gharris1727
Author

Hey @scholzj, thanks for the quick first pass! You had some excellent questions that cut right to the core of the feature.

But mixing them together raises a lot of edge cases and increases the implementation complexity significantly.

This is mostly borne out of trying to model the intermediate state when rolling between a fully-shared and a fully-isolated cluster. I figured that a feature which required a stop-the-world migration would be significantly less user-friendly, enough to justify some extra complexity in the scheduling algorithm. And if the Static Assignments feature could express those states, it would also make sense to expose that to Strimzi users.

If you believe that Strimzi should prevent users from using hybrid-scheduling as a matter of being opinionated, I think we could delay emitting static assignments until every KafkaConnector has ResourceRequirements attached, and/or some enable switch is flipped. Modeling the intermediate states then just becomes an implementation detail of the rolling migration rather than a user-facing feature.

Things would be much easier if one could use the maximum number of tasks as the number of replicas. But that is of course not how Connect works.

We have a loose proxy for this with the tasks.max configuration, but it's technically not enforced by the framework. A connector could go over the advised limit, but I don't think I've seen any implementations do that except by mistake.
We could have all of the pod provisioning happen based on the KafkaConnector spec by looking at tasks.max, so the control flow doesn't have to depend on the contents of the REST API.

I also wonder whether you considered having the KafkaConnector represent the Connect cluster. If you want to run the connectors like this, do they really need to share the same distributed Connect cluster? Are there any advantages to it?

A KafkaConnect cluster certainly has less relevance in a fully-isolated cluster. But the workers of a fully-isolated cluster will still share a config topic, REST API, group coordinator, etc. The cost of those resources (however minor) does get amortized. But possibly the more important benefit is inheriting Kafka client configurations and secrets. To eliminate the KafkaConnect abstraction from Strimzi would require adding at least some of the worker config properties to the KafkaConnector abstraction.

Are there any risks that with too many small pods (each pod == one Connect node) the scalability of the Connect control plane becomes an issue?

This is certainly a risk, and I will need to do some practical experiments to evaluate how a connect cluster behaves with 100-1000 lightweight nodes rather than 10-100 heavy nodes. If this did become a problem, it might mean that we need to make other investments into the Connect scalability before this feature will be safe to use for large clusters.

But I think it is missing a comparison with other technologies. Why is Connect worth this investment? Isn't it better to instead shift the focus to some integration projects that are more cloud native by design, such as Apache Camel? The current Connect design clearly has its limitations when running on Kubernetes. But for me personally, it is not clear if this is something worth saving and redesigning. So it would be great if the proposal could elaborate on this a bit as well.

I think that ease of operation is an extremely good criterion for users to consider when evaluating whether to use Connect or Camel. Someone could certainly use Connect, then realize that it's inferior to Camel, then migrate to Camel and find that it solves their problems. I have not heard of an instance myself, but it's probably happened at least once.

But this proposal is not for users evaluating Connect, it is an incremental improvement for operators that have already made the choice or had the choice made for them. In the next draft I'll make sure to advocate more for the operators.

Member

@tombentley tombentley left a comment


Thanks Greg! I left a few comments. I think it's worth elaborating in a bit more detail what the KafkaConnector API would look like to users. An example spec and perhaps status.conditions for some of the unhappy paths would go a long way to making it more concrete to readers of the proposal.

Comment on lines +46 to +50
For a cluster which has ResourceRequirements for all KafkaConnectors, the shared workers will typically be empty, functioning as hot standby instances.
Hot-standby workers are assigned connectors and tasks which do not have a live static worker, either because that pod is offline or Strimzi has not created it yet.
Connectors may dynamically add tasks, and if Strimzi is offline or slow to respond, those tasks will be assigned to hot-standby workers.
If no hot-standby workers are available, tasks will remain unassigned until the associated static worker pod rejoins the group.
If no hot-standby workers are available, connectors and tasks will be disallowed from using shared JVMs.
Member


This would seem to imply the possibility of repeated reassignment of tasks between workers when tasks are more dynamic than the polling interval used in the operator.

  1. Connector reconfigures dynamic tasks
  2. Assigned to shared worker
  3. Operator creates pod for static worker
  4. Reassigned to static worker
  5. Connector reconfigures dynamic tasks...

Author


Yes, if the number of tasks that a connector emits is flapping, so will the number of pods. This wouldn't happen if the set of pods was determined by the tasksMax. The operator could also apply some hysteresis: after a task disappears, delay stopping its worker until some time has passed.

Also note that reconfiguring the connectors is not the same thing as changing the number of tasks. The connector can reconfigure its tasks without waiting for the management layer to provision new pods if the number of tasks doesn't change.

Comment on lines +64 to +73
1. Upgrade the Connect image to a version which supports static assignments feature.
2. Move the desired Connectors to isolated mode individually. For each connector:
1. Estimate the resources needed for the connector and each task, possibly by dividing the KafkaConnect resource requirements by the average tasks-per-worker.
2. Apply the resource requirements to the connector and task defaults, and wait for corresponding pods to be created.
3. Verify the health of the connector & tasks, and compare the resource utilization to the estimates. Increase or decrease requirements as needed.
4. If the Kubernetes cluster cannot schedule the pods due to resource constraints, consider adding a new node or reducing the KafkaConnect replicas.
5. If the KafkaConnect workers are lightly loaded, consider reducing the KafkaConnect replicas.
3. If all Connectors now have resource requirements specified, reduce the KafkaConnect replicas to the number of desired hot-standby workers.
1. The maximum number of recommended hot-standbys is the number of replicas that was previously in-use before applying per-connector/task ResourceRequirements.
2. The minimum number of recommended hot-standbys is 0, which disables hot-standbys entirely and forces the use of isolated JVMs.
Member


This all makes sense, but it would really help if the proposal defined how the status of KafkaConnector and KafkaConnect would look in some of these circumstances. In other words how does the end user know, from looking at their KafkaConnector and KafkaConnect resources, that Kube cannot schedule the dedicated pod(s) required for a connector? Or how do they easily know the number of hot-standbys?

Author


I'm not familiar with the way that Strimzi uses statuses to surface this sort of information, and what would be idiomatic for users. If the current Status system exposes the UNASSIGNED state used by the framework, a lack of dedicated pods or hot-standbys would appear as a job never getting an assignment.

I'll look into this more and come up with some status examples.
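
As a strawman for what such a status might look like, here is a sketch following the shape of existing Strimzi `status.conditions`; the `PodUnschedulable` reason and the message text are invented for illustration and not defined by the proposal.

```yaml
# Hypothetical KafkaConnector status when Kubernetes cannot schedule a dedicated task pod.
# The condition reason and message are assumptions for illustration only.
status:
  observedGeneration: 3
  tasksMax: 2
  conditions:
    - type: NotReady
      status: "True"
      reason: PodUnschedulable
      lastTransitionTime: "2024-01-10T12:00:00Z"
      message: "Pod for task my-source-connector-1 is pending: insufficient memory on all nodes"
```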

@katheris
Contributor

Hi Greg, this is definitely an interesting proposal. On your earlier comment:

We have a loose proxy for this with the tasks.max configuration, but it's technically not enforced by the framework. A connector could go over the advised limit, but I don't think I've seen any implementations do that except by mistake.
We could have all of the pod provisioning happen based on the KafkaConnector spec by looking at tasks.max, so the control flow doesn't have to depend on the contents of the REST API.

I think the tricky thing here is although in general connectors don't go over the tasks.max configuration, there are connectors that start fewer tasks. For example the Confluent JDBC connector only starts a single task when running in query mode, and otherwise it creates the number of tasks equal to either tasks.max or the number of tables in the database it is streaming updates from, whichever is lower.

So if the operator did base the number of pods on the tasks.max configuration then some pods might be left without any workload running on them. I wasn't sure from your KIP what happens if a worker is configured with static.tasks set to a task id that actually doesn't exist.

…on and rejected alternatives

Signed-off-by: Greg Harris <greg.harris@aiven.io>
@gharris1727
Author

Thanks for the review @katheris !

So if the operator did base the number of pods on the tasks.max configuration then some pods might be left without any workload running on them. I wasn't sure from your KIP what happens if a worker is configured with static.tasks set to a task id that actually doesn't exist.

Thanks for pointing this out, I've clarified in the KIP that the values in static.connectors and static.tasks need not exist, in order to allow for over-provisioning task pods and bootstrapping clusters with 0 replicas. This means that if a connector has fewer tasks than tasks.max, those additional pods will just remain idle, but will immediately pick up the task if the connector creates it.

Whether Strimzi uses closed-loop polling to find the actual number of tasks or infers the number of tasks from tasks.max is a design tradeoff. We can either complicate the operator, or allow users to over-provision pods by setting a tasks.max which is too high. I don't feel strongly either way, and I'll make sure that the underlying KIP supports both strategies.
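
To illustrate the over-provisioning point, here is a hedged sketch of the per-worker static assignment configuration an operator might render for two isolated pods. The `static.connectors` / `static.tasks` property names come from KIP-987 as referenced above; showing them as YAML and the `<connector>-<index>` task-id format are assumptions for illustration.

```yaml
# Hypothetical per-worker config injected by the operator (shown as YAML for readability).
---
# Worker pod dedicated to the connector instance
static.connectors: my-source-connector
---
# Worker pod dedicated to task 0; if the task does not exist yet, the pod
# simply stays idle and picks the task up as soon as the connector creates it
static.tasks: my-source-connector-0
```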

@scholzj
Member

scholzj commented Jan 25, 2024

Triaged on the Strimzi Community call on 25.1.2024: @gharris1727 What are the plans for this proposal going forward? Should we keep it open? Should we close it until the KIP related to this moves forward?

@gharris1727
Author

Hi @scholzj

The current KIP-987 design needs another pass, as the UX of the feature is poor for human operators, and is effectively only usable by machine operators. It is not ready for a vote at this time. The next best design that I know of right now is implementing a taint/selector system within Connect, and using a set of unique task-id/connector-id selectors to give Strimzi complete control over scheduling.

As far as keeping this open or reopening this later: it costs nothing to press the button, so I'm fine with either. Since the underlying feature isn't ready I suppose there's no action to take here at the moment, but I am still interested in implementing this proposal at some point in the future.

@ppatierno
Member

Triaged on 17/10/2024: agreed on closing the pull request for now and maybe re-opening it, or a new one, when KIP-987 is approved and this one can move forward.

@ppatierno ppatierno closed this Oct 17, 2024