Run Connectors and Tasks in Isolated Pods #96
Conversation
Signed-off-by: Greg Harris <greg.harris@aiven.io>
Thanks for the proposal. I had a quick look at it ... I think technically it would be feasible, but the complexity would be quite high. The main issues I see are around these two points:
- First one is the mixing of the two scheduling types. I think having a Connect cluster with the static assignment might make sense in some cases. Having a cluster with the shared assignment would make sense. But mixing them together raises a lot of edge-cases and increases the implementation complexity significantly.
- Second is the issue that the tasks are being scaled from inside Connect and not from the Kubernetes level. That makes things significantly more complex as well. Things would be much easier if one could use the maximum number of tasks as the number of replicas. But that is of course not how Connect works. But if nothing else, I think it would be much easier if the tasks are spawned as PENDING in Connect and Strimzi picks them up from the REST API, starts a new Pod and assigns them. That seems easier than the approach to have them spawned in some shared pod and then picked up from there and moved.
I also wonder if you considered having the KafkaConnector represent the Connect cluster. If you want to run the connectors like this, do they really need to share the same distributed Connect cluster? Are there any advantages to it? Are there any risks that with too many small pods (each pod == one Connect node) the scalability of the Connect control plane becomes an issue?
The thing I probably struggle with most is the motivation behind the design. You explained why sharing the JVM for basically random connectors is wrong. And you touched on why the resources for the connector and various tasks will differ. But I think it is a bit missing a comparison with other technologies. Why is Connect worth this investment? Isn't it better to instead shift the focus to some integration projects that are more cloud native by design, such as Apache Camel? The current Connect design clearly has its limitations when running on Kubernetes. But for me personally, it is not clear if this is something worth saving and redesigning. So it would be great if the proposal can elaborate on this a bit as well.
This mode will use the Static Assignments feature to assign connectors and tasks each to their own worker.
This will be opt-in, and able to be performed as a live migration to an existing cluster.

A KafkaConnectorSpec will accept a new optional ConnectorResourceRequirements field, to be used for the Connector instance.
Would the connector resources and task resources be expected to include the JVM overhead? Or will that be something extra configured somewhere else?
I imagined that for clarity it would forward the requirements directly to the pod with no intermediate math, so these resource requirements would all need to include the JVM overhead. Are you familiar with another situation where a k8s operator computes the resource limits in a similar way?
It was more of a question than a suggestion we should calculate it. I'm fine if the JVM overhead is expected to be included here. Might be worth stating it in the proposal.
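To make the proposed field more concrete, here is a rough sketch of what a KafkaConnector spec carrying per-connector and per-task resource requirements could look like. The resources and taskResources field names are placeholders invented for illustration; they are not part of the current KafkaConnector CRD or of the proposal text above, and the values (which include JVM overhead, per the thread above) are made up.

```yaml
# Sketch only: `resources` and `taskResources` are hypothetical placeholders
# for the proposed ConnectorResourceRequirements; they do not exist in the
# current KafkaConnector CRD. Values are illustrative and include JVM overhead.
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnector
metadata:
  name: my-source-connector
  labels:
    strimzi.io/cluster: my-connect-cluster
spec:
  class: org.apache.kafka.connect.file.FileStreamSourceConnector
  tasksMax: 2
  config:
    file: /opt/kafka/LICENSE
    topic: my-topic
  resources:            # dedicated pod for the Connector instance
    requests:
      cpu: 250m
      memory: 512Mi
    limits:
      memory: 512Mi
  taskResources:        # default applied to each task's dedicated pod
    requests:
      cpu: 500m
      memory: 1Gi
    limits:
      memory: 1Gi
```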
Hey @scholzj Thanks for the quick first pass! You had some excellent questions that cut right to the core of the feature.
This is mostly borne out of trying to model the intermediate state when rolling between fully-shared and fully-isolated clusters. I figured that a feature which required a stop-the-world migration would be significantly less user-friendly, enough to justify some extra complexity in the scheduling algorithm. I figured that if the Static Assignments feature could express those states, it would also make sense to expose that to Strimzi users. If you believe that Strimzi should prevent users from using hybrid-scheduling as a matter of being opinionated, I think we could delay emitting static assignments until every KafkaConnector has ResourceRequirements attached, and/or some enable switch is flipped. Modeling the intermediate states then just becomes an implementation detail of the rolling migration rather than a user-facing feature.
We have a loose proxy for this with the
A KafkaConnect cluster certainly has less relevance in a fully-isolated cluster. But the workers of a fully-isolated cluster will still share a config topic, REST API, group coordinator, etc. The cost of those resources (however minor) does get amortized. But possibly the more important benefit is inheriting Kafka client configurations and secrets. To eliminate the KafkaConnect abstraction from Strimzi would require adding at least some of the worker config properties to the KafkaConnector abstraction.
This is certainly a risk, and I will need to do some practical experiments to evaluate how a connect cluster behaves with 100-1000 lightweight nodes rather than 10-100 heavy nodes. If this did become a problem, it might mean that we need to make other investments into the Connect scalability before this feature will be safe to use for large clusters.
I think that the ease of operation is an extremely good criterion for users to consider when evaluating whether to use Connect or Camel. Someone could certainly use Connect, and then realize that it's inferior to Camel, and then migrate to Camel and find that it solves their problems. I have not heard of an instance myself, but it's probably happened at least once. But this proposal is not for users evaluating Connect, it is an incremental improvement for operators that have already made the choice or had the choice made for them. In the next draft I'll make sure to advocate more for the operators.
Thanks Greg! I left a few comments. I think it's worth elaborating in a bit more detail what the KafkaConnector API would look like to users. An example spec and perhaps status.conditions for some of the unhappy paths would go a long way to making it more concrete to readers of the proposal.
For a cluster which has ResourceRequirements for all KafkaConnectors, the shared workers will typically be empty, functioning as hot standby instances.
Hot-standby workers are assigned connectors and tasks which do not have a live static worker, either because that pod is offline or Strimzi has not created it yet.
Connectors may dynamically add tasks, and if Strimzi is offline or slow to respond, those tasks will be assigned to hot-standby workers.
If no hot-standby workers are available, tasks will remain unassigned until the associated static worker pod rejoins the group.
If no hot-standby workers are available, connectors and tasks will be disallowed from using shared JVMs.
This would seem to imply the possibility of repeated reassignment of tasks between workers when tasks are more dynamic than the polling interval used in the operator:
- Connector reconfigures dynamic tasks
- Assigned to shared worker
- Operator creates pod for static worker
- Reassigned to static worker
- Connector reconfigures dynamic tasks...
Yes, if the number of tasks that a connector emits is flapping, so will the number of pods. This wouldn't happen if the set of pods were determined by the tasksMax. The operator could also apply some hysteresis: after a task disappears, delay stopping its worker until some time has passed.
Also note that reconfiguring the connectors is not the same thing as changing the number of tasks. The connector can reconfigure its tasks without waiting for the management layer to provision new pods if the number of tasks doesn't change.
1. Upgrade the Connect image to a version which supports the static assignments feature.
2. Move the desired Connectors to isolated mode individually. For each connector:
   1. Estimate the resources needed for the connector and each task, possibly by dividing the KafkaConnect resource requirements by the average tasks-per-worker.
   2. Apply the resource requirements to the connector and task defaults, and wait for the corresponding pods to be created.
   3. Verify the health of the connector & tasks, and compare the resource utilization to the estimates. Increase or decrease requirements as needed.
   4. If the Kubernetes cluster cannot schedule the pods due to resource constraints, consider adding a new node or reducing the KafkaConnect replicas.
   5. If the KafkaConnect workers are lightly loaded, consider reducing the KafkaConnect replicas.
3. If all Connectors now have resource requirements specified, reduce the KafkaConnect replicas to the number of desired hot-standby workers (a sketch follows after this list).
   1. The maximum recommended number of hot-standbys is the number of replicas that was previously in use before applying per-connector/task ResourceRequirements.
   2. The minimum recommended number of hot-standbys is 0, which disables hot-standbys entirely and forces the use of isolated JVMs.
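As a rough illustration of step 3 (and of the estimate in step 2.1: for example, shared workers sized at 4 GiB that averaged 8 tasks each suggest roughly 512 MiB per task as a starting point), a KafkaConnect resource trimmed down to a single hot-standby worker might look like the sketch below. The replica count and bootstrap address are made up for illustration.

```yaml
# Sketch: once every KafkaConnector carries its own resource requirements,
# the remaining shared replicas only act as hot-standby workers.
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnect
metadata:
  name: my-connect-cluster
  annotations:
    strimzi.io/use-connector-resources: "true"
spec:
  replicas: 1          # one hot standby; 0 would disable shared JVMs entirely
  bootstrapServers: my-cluster-kafka-bootstrap:9092
```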
This all makes sense, but it would really help if the proposal defined how the status of KafkaConnector and KafkaConnect would look in some of these circumstances. In other words, how does the end user know, from looking at their KafkaConnector and KafkaConnect resources, that Kube cannot schedule the dedicated pod(s) required for a connector? Or how do they easily know the number of hot-standbys?
I'm not familiar with the way that Strimzi uses statuses to surface this sort of information, and what would be idiomatic for users. If the current Status system exposes the UNASSIGNED state used by the framework, a lack of dedicated pods or hot-standbys would appear as a job never getting an assignment.
I'll look into this more and come up with some status examples.
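Purely as a strawman for those status examples (not something the proposal or Strimzi defines today), an unschedulable dedicated pod might surface on the KafkaConnector status roughly like this; the condition reason and message are invented for illustration:

```yaml
# Hypothetical sketch of a KafkaConnector status for the unhappy path where a
# task's dedicated pod cannot be scheduled; the reason and message below are
# invented and not defined by the proposal or by Strimzi.
status:
  observedGeneration: 4
  tasksMax: 3
  conditions:
    - type: Ready
      status: "False"
      reason: TaskUnassigned
      message: >-
        Task my-source-connector-2 has no live static worker and no hot-standby
        is available; dedicated pod is Pending (insufficient memory)
      lastTransitionTime: "2024-01-25T12:00:00Z"
```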
Hi Greg, this is definitely an interesting proposal. On your earlier comment:
I think the tricky thing here is that although in general connectors don't go over the tasksMax, they may run fewer tasks than that. So if the operator did base the number of pods on the tasksMax, some of those pods could end up sitting idle.
…on and rejected alternatives Signed-off-by: Greg Harris <greg.harris@aiven.io>
Thanks for the review @katheris !
Thanks for pointing this out, I've clarified in the KIP that the values in static.connectors and static.tasks need not exist, in order to allow for over-provisioning task pods and bootstrapping clusters with 0 replicas. This means that if a connector has fewer tasks than tasks.max, those additional pods will just remain idle, but will immediately pick up the task if the connector creates it.
Whether Strimzi uses the closed-loop polling to find the actual number of tasks or infers the number of tasks from tasks.max is a design tradeoff. We can either complicate the operator, or allow users to over-provision pods by setting a tasks.max which is too high. I don't feel strongly either way, and I'll make sure that the underlying KIP supports both strategies.
Triaged on the Strimzi Community call on 25.1.2024: @gharris1727 What are the plans for this proposal going forward? Should we keep it open? Should we close it until the KIP related to this moves forward?
Hi @scholzj The current KIP-987 design needs another pass, as the UX of the feature is poor for human operators, and is effectively only usable by machine operators. It is not ready for a vote at this time. The next best design that I know of right now is implementing a taint/selector system within Connect, and using a set of unique task-id/connector-id selectors to give Strimzi complete control over scheduling.
As far as keeping this open or reopening this later: it costs nothing to press the button so I'm fine with either. Since the underlying feature isn't ready I suppose there's no action to take here at the moment, but I am still interested in implementing this proposal at some point in the future.
Triaged on 17/10/2024: agreed on closing the pull request for now and maybe re-opening it, or a new one, when KIP-987 is approved and this one can move forward.
Hi all!
This is a sister proposal to KIP-987: Connect Static Assignments
The goal of this proposal is to alleviate the operational headaches that come from shared resource pooling within Kafka Connect.
This is my first proposal to Strimzi, and I don't have a lot of familiarity with it as a user or developer. Please don't hesitate to point out parts that don't make sense, as I may have missed something just by reading the public documentation.
Thanks!