
Introducing new KafkaRoller #103

Open · wants to merge 15 commits into base: main from kafka-roller-2
Conversation

@tinaselenge tinaselenge (Contributor) commented Jan 2, 2024

For more implementation details, see the POC implementation in RackRolling.java. All the related classes are in the same package, rolling.

The tests illustrating various cases with different sets of configurations are in RackRollingTest.java.

The logic for switching to the new roller is in the KafkaReconciler.java class.

Made some improvements on the structure

Signed-off-by: Gantigmaa Selenge <tina.selenge@gmail.com>
@fvaleri fvaleri (Contributor) left a comment

Just a first pass, as I need more time to digest this. I think it would be useful to illustrate the new behavior with a couple of examples of the form: with this roller configuration and cluster state, these are the node groups and their restart order. Wdyt?

@tinaselenge tinaselenge force-pushed the kafka-roller-2 branch 2 times, most recently from c060c24 to 433316f Compare March 15, 2024 12:11
Tidy up

Signed-off-by: Gantigmaa Selenge <tina.selenge@gmail.com>
@tinaselenge tinaselenge marked this pull request as ready for review March 15, 2024 12:29
@tinaselenge (Contributor Author)

@fvaleri Thank you for the feedback. I have added an example of a rolling update. Please let me know what you think.

@see-quick see-quick (Member) left a comment

Nice proposal. Thanks for it 👍.

STs POV:

I think we would also need to design multiple tests to cover all the states that KafkaRoller v2 introduces. We have a few tests, but that's certainly not 100% coverage, so we should maybe have a meeting to talk about this...

Side note about performance:

What would be appropriate performance metrics for us to consider when designing performance tests? Are there any critical ones? I can certainly imagine that we would see a significant improvement in rolling updates of multiple nodes when we use the batching mechanism...

Co-authored-by: Maros Orsak <maros.orsak159@gmail.com>
Signed-off-by: Gantigmaa Selenge <39860586+tinaselenge@users.noreply.github.com>
@fvaleri fvaleri (Contributor) left a comment

@tinaselenge thanks for the example, it really helps.

I left some comments, let me know if something is not clear or you want to discuss further.

- Cruise Control sends a `removingReplicas` request to un-assign the partition from broker 2.
- KafkaRoller is performing a rolling update of the cluster. It checks the availability impact for partition foo-0 before rolling broker 1. Since partition foo-0 has ISR [1, 2, 4], KafkaRoller decides that it is safe to restart broker 1. It is unaware of the `removingReplicas` request that is about to be processed.
- The reassignment request is processed and partition foo-0 now has ISR [1, 4].
- KafkaRoller restarts broker 1 and partition foo-0 now has ISR [4], which is below the configured minimum in-sync replicas of 2, so producers with acks=all can no longer produce to this partition.
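The availability check described above can be sketched with the Kafka Admin API; this is illustrative only (the class and method names are not the proposal's actual code) and shows the kind of ISR versus min.insync.replicas comparison the roller performs before restarting a broker:

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;
import org.apache.kafka.common.config.ConfigResource;

import java.util.Map;
import java.util.Set;
import java.util.concurrent.ExecutionException;

/** Illustrative sketch: would restarting brokerId drop any of its partitions below min.insync.replicas? */
class AvailabilityCheckSketch {
    static boolean safeToRestart(Admin admin, int brokerId, Set<String> topicNames)
            throws ExecutionException, InterruptedException {
        Map<String, TopicDescription> topics = admin.describeTopics(topicNames).allTopicNames().get();
        for (TopicDescription topic : topics.values()) {
            ConfigResource resource = new ConfigResource(ConfigResource.Type.TOPIC, topic.name());
            Config config = admin.describeConfigs(Set.of(resource)).all().get().get(resource);
            ConfigEntry minIsrEntry = config.get("min.insync.replicas");
            int minIsr = minIsrEntry == null ? 1 : Integer.parseInt(minIsrEntry.value());
            for (TopicPartitionInfo partition : topic.partitions()) {
                boolean brokerInIsr = partition.isr().stream().anyMatch(node -> node.id() == brokerId);
                // Restarting a broker that is in the ISR shrinks the ISR by one; if that would leave
                // fewer than min.insync.replicas in-sync replicas, acks=all producers could be blocked.
                if (brokerInIsr && partition.isr().size() - 1 < minIsr) {
                    return false;
                }
            }
        }
        return true;
    }
}
```

Note that a snapshot-based check like this is exactly what the race above exploits: the ISR can change between the check and the restart.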
@fvaleri fvaleri (Contributor) commented Apr 25, 2024

In addition to rebalance, we have the same race condition with replication factor change (the new integration between CC and TO); maybe you can mention this.

The roller should be able to call CC's user_tasks endpoint and check if there is any pending task. In that case, the roller has two options: wait for all tasks to complete, or continue as today with the potential issue you describe here. You can't really stop the tasks, because the current batch will still be completed, and the operators will try to submit a new task in the next reconciliation loop.

I think we should let the user decide which policy to apply through a configuration. By default the roller would wait for all CC tasks to complete, logging a warning. If the user sets or switches to a "force" policy, then the roller would behave like today. Wdyt?

@tinaselenge (Contributor Author)

Should this be perhaps included/discussed in a separate proposal or issue? The idea was to mention that there is a race condition we could fix with the new roller in the future, which is not easy to fix with the old roller. How we fix it and other similar problems should be a separate discussion I think.

Contributor

This should have a dedicated proposal IMO, but let's start by logging an issue.
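For context, a minimal sketch of the kind of Cruise Control check suggested above; the user_tasks REST endpoint exists in Cruise Control, but the class name and the way the response would be interpreted here are assumptions, not Strimzi code:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/** Illustrative sketch: ask Cruise Control whether any user tasks are still pending. */
class CruiseControlTaskCheckSketch {
    static String fetchUserTasks(String cruiseControlBaseUrl) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(cruiseControlBaseUrl + "/kafkacruisecontrol/user_tasks?json=true"))
                .GET()
                .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        // The caller would parse the JSON body and look for tasks that are still active or executing;
        // if any exist, the roller would either wait (the proposed default) or proceed under a "force" policy.
        return response.body();
    }
}
```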


Would calling the ListReassigningPartitions API be enough to know this?
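A minimal sketch of that check with the Kafka Admin API (`Admin#listPartitionReassignments`); the class name is illustrative. Note it only observes reassignments that are already in flight, so it would not cover a Cruise Control task that has been accepted but not yet turned into reassignments:

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.PartitionReassignment;
import org.apache.kafka.common.TopicPartition;

import java.util.Map;
import java.util.concurrent.ExecutionException;

/** Illustrative sketch: is any partition reassignment currently in progress? */
class ReassignmentCheckSketch {
    static boolean reassignmentInProgress(Admin admin) throws ExecutionException, InterruptedException {
        Map<TopicPartition, PartitionReassignment> reassignments =
                admin.listPartitionReassignments().reassignments().get();
        // A non-empty map means replicas are still being added or removed somewhere, so an
        // ISR-based availability check may be racing against the reassignment.
        return !reassignments.isEmpty();
    }
}
```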

@katheris katheris (Contributor) left a comment

Overall this looks good to me, but I had a few questions and wording suggestions. I definitely think this will be useful since I've experienced first-hand how tricky it is to debug the existing code.

@tinaselenge tinaselenge force-pushed the kafka-roller-2 branch 3 times, most recently from 931adbd to 1060fee Compare April 30, 2024 13:58
Add possible transitions

Signed-off-by: Gantigmaa Selenge <tina.selenge@gmail.com>
If none of the above is true but the node is not ready, then its state would be `NOT_READY`.

#### Flow diagram describing the overall flow of the states
![The new roller flow](./images/06x-new-roller-flow.png)
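For readers without the proposal document at hand, the node states referred to in the discussion below can be sketched roughly as follows; the names are taken from this conversation and the descriptions are assumptions, so the proposal's own table is authoritative:

```java
/** Illustrative sketch of the per-node states named in this conversation; see the proposal's table for the real definitions. */
enum NodeState {
    UNKNOWN,                // context just created, no information observed yet
    NOT_RUNNING,            // the Kafka process is not running at all
    NOT_READY,              // running but not ready, and none of the other conditions apply
    RECOVERY,               // running but still performing log recovery
    READY,                  // running and ready
    LEADING_ALL_PREFERRED   // ready and leading all of its preferred replicas (the desired end state)
}
```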
Member

Some comments about the diagram of the FSM ...

  1. Each state has to be unique and not duplicated.
  2. If you are grouping states together, does it really mean they are different states? Or are they the same state with different context? It looks weird to me.
  3. Where you use "Action and Transition State", I think the Action should be made clearer, because it represents an output of the FSM. Is there a real output or just a transition? You don't need "Transition State" because it's defined by the arrow you have.

My general gut feeling is that ... you are describing a very complex state machine and its transitions in the "Algorithm" section, but then the visualization of it is really weak.

Contributor Author

I meant the diagram to be more high level, with the details broken down in the Algorithm section. Given that we may have to change various things around the states anyway, I will take this suggestion and recreate the diagram.

Member

Has the diagram changed after my comments? It seems almost the same. Also, what are "Serving", "Waiting", "Restarted" and "Reconfigured"? They are not listed as states in the table above.
I can't really match the table with the diagram, sorry :-(

Contributor Author

@ppatierno I have updated the diagram now. It's still more of a high-level flow showing the transitions. The possible states are listed, but that doesn't mean they are grouped together. Depending on the state, a different action is taken, and the possible actions are also listed. Do you think it's clearer now? Or do you think I should break it down so that each state maps to its possible actions, instead of listing them all together in the same bubble?

Member

Well, an FSM should have single states (but I can understand your grouping of more than one here, so let's leave it for now). What I can't understand is what the hexagon and its content are. They are not states, right? I can't see them in the table, so their presence confuses me.
Also, you are duplicating circles for Not_Ready/Not_Running/Recovery: you have one with them and Ready, and one with them alone. I think you should split them, having Ready on its own, as well as Leading_All_preferred.
Finally, as an FSM state diagram it should not contain other things like "Start" or "Desired State Reached"; you are describing an FSM, and those are not states.

Member

It's starting to take a better shape :-) but I would still improve it by reducing duplication. I think you can have just one "Unknown" (instead of 2) and one "Ready" state (instead of 3), especially because, AFAIU, the final desired state is "LeadingAllPreferred", right? And in the current graph not all "Ready" states end there.
Next we'll talk about the pink/red states, which have some duplication as well.

Contributor Author

Good to hear that it is in the right direction. I've taken the suggestion and made another update :). Thanks @ppatierno

Member

What exactly does "Iterate" mean on the orange arrows?

Contributor Author

That means repeating the process on the node if the desired state is not reached.

Contributor Author

I have updated the diagram with a bit more explanation instead of the "iterate" part.

## Rejected

- Why not use rack information when batching brokers that can be restarted at the same time?
When all replicas of all partitions have been assigned in a rack-aware way, brokers in the same rack trivially share no partitions, and so racks provide a safe partitioning. However, nothing in a broker, controller, or Cruise Control is able to enforce the rack-aware property, so assuming this property is unsafe. Even if CC is being used and rack-aware replicas is a hard goal, we can't be certain that other tooling hasn't reassigned some replicas since the last rebalance, or that no topics have been created in a rack-unaware way.
Member

I am not sure the above is considered a rejected alternative. I mean, this section is for solutions for the same goals which were rejected, while it seems to be used just to highlight that an "idea" within the current proposal was rejected.

Contributor Author

Of course, we would need to agree on whether we are rejecting this idea. Perhaps I should rename this section to "Other ideas considered"?

@tinaselenge tinaselenge force-pushed the kafka-roller-2 branch 2 times, most recently from d3e3f1d to 557eb7a Compare June 6, 2024 10:53
tinaselenge added a commit to tinaselenge/strimzi-kafka-operator that referenced this pull request Jul 2, 2024
tinaselenge added a commit to tinaselenge/strimzi-kafka-operator that referenced this pull request Jul 2, 2024
- Improve the names for categories and states
- Remove restarted/reconfigured states
- Add a configuration for delay between restarts
- Add a configuration for delay between restart and trigger of preferred leader election
- Restart NOT_RUNNING nodes in parallel for quicker recovery
- Improve the overall algorithm section, to make it clearer and more concise

Signed-off-by: Gantigmaa Selenge <tina.selenge@gmail.com>
@tinaselenge (Contributor Author)

Thanks everyone who reviewed the PR! I believe I have now addressed all the review comments except an update of the diagram (I will push that in a follow-up commit). @scholzj @ppatierno @fvaleri @tombentley, could you please take another look when you have time? Thank you very much.


Contexts are recreated in each reconciliation with the above initial data.

2. **Transition Node States:**
Update each node's state based on information from the abstracted sources. If the information cannot be retrieved, the reconciliation fails and restarts from step 1.
Member

I would say that this is about "loading" or "building" the current state of the node. Usually in our FSMs (e.g. rebalancing), this state comes from a custom resource; here it's coming from different sources. Maybe we can phrase it better than "update each node's ...".

Contributor Author

I'm not sure what would be better; does "Load each node's state based on information..." sound better?

We build a context in step 1, which has state UNKNOWN. In this step, we are updating that state based on the information from the sources. So to me, "update each node's state" sounds fine.
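A rough sketch of the per-node context being discussed, limited to fields mentioned in this conversation (the observed state, restart reasons, and the retry counters used later in the algorithm); the class shape is an assumption, not the proposal's actual code, and it reuses the NodeState enum sketched earlier:

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch: the per-node context rebuilt at the start of each reconciliation. */
class NodeContextSketch {
    final int nodeId;
    NodeState state = NodeState.UNKNOWN;   // step 1 creates the context as UNKNOWN; step 2 updates it from the sources
    final List<String> restartReasons = new ArrayList<>();
    int numRetries = 0;                    // incremented when a wait (e.g. for readiness or log recovery) times out
    int numReconfigAttempts = 0;           // incremented on each dynamic reconfiguration attempt

    NodeContextSketch(int nodeId) {
        this.nodeId = nodeId;
    }
}
```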

@fvaleri fvaleri (Contributor) left a comment

Hi @tinaselenge, I had another look at the example and I think it's great. I left a few more comments, but I think this would work.

@tinaselenge tinaselenge force-pushed the kafka-roller-2 branch 3 times, most recently from 4c035fa to bf71ae6 Compare July 19, 2024 08:30
@fvaleri fvaleri (Contributor) left a comment

Thanks for answering all my questions. Good job.

@tinaselenge (Contributor Author)

> Thanks for answering all my questions. Good job.

Thank you @fvaleri , I really appreciate you reviewing the proposal thoroughly.

- Although it is safe and straightforward to restart one broker at a time, this process is slow in large clusters ([related issue](https://github.com/strimzi/strimzi-kafka-operator/issues/8547)).
- It does not account for partition preferred leadership. As a result, there may be more leadership changes than necessary during a rolling restart, consequently impacting tail latency.
- Hard to reason about when things go wrong. The code is complex to understand, and it's not easy to determine from the logs, which tend to be noisy, why a pod was restarted.
- Potential race condition between a Cruise Control rebalance and KafkaRoller that could cause partitions to drop below their configured minimum in-sync replicas. This issue is described in more detail in the `Future Improvements` section.
Member

In general Slack is not really ideal for keeping details of problems in the long term. Better to create an issue, which can be discovered more easily by anyone who faces a similar problem.

Update each node's state based on information from the abstracted sources. If the information cannot be retrieved, the current reconciliation immediately fails. When the next reconciliation is triggered, it will restart from step 1.

3. **Handle `NOT_READY` Nodes:**
Wait for `NOT_READY` nodes to become `READY` within `operationTimeoutMs`.
Member

Judging from the fact that the next step covers NOT_READY, I'm guessing that we just fall through if the node is still NOT_READY after operationTimeoutMs. But you need to say that! And also explain, if we're prepared to fall through to the next step, why this timeout is even necessary.

Contributor Author

I have explained why we do the wait and that it falls through to the next step.

- `NOP`: Nodes needing no operation.

5. **Wait for Log Recovery:**
Wait for `WAIT_FOR_LOG_RECOVERY` nodes to become `READY` within `operationTimeoutMs`. If timeout is reached and `numRetries` exceeds `maxRetries`, throw `UnrestartableNodesException`. Otherwise, increment `numRetries` and repeat from step 2.
Member

For all these steps I think it would be really valuable to explain the why. Here we're willing to wait for brokers in log recovery because the following steps will result in actions, like restarting other brokers, which will be directly visible to clients. We prefer to start from a cluster that's as close to fully functional as possible.

Member

We also need to explain why we are willing to wait for log recovery here, but not willing to wait for all of a broker's replicas to rejoin the ISR.

IIRC the reason we had was that KafkaRoller's job was to restart things, but we didn't want to give it any responsibility for throttling inter-broker replication, and we can't guarantee (for all possible workloads) that brokers will always be able to catch up to the LEO (within a reasonable time).

Contributor Author

Thanks @tombentley. I have added the reasons.
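A sketch of the bounded wait and retry bookkeeping described in this step; `UnrestartableNodesException` is named in the proposal, while the class shape, the polling interval, and the use of a plain `RuntimeException` are assumptions made to keep the sketch self-contained:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

/** Illustrative sketch: wait for a node in log recovery, retrying a bounded number of times. */
class LogRecoveryWaitSketch {
    private int numRetries = 0;

    void waitForLogRecovery(int nodeId, Supplier<Boolean> isReady,
                            Duration operationTimeout, int maxRetries) throws InterruptedException {
        Instant deadline = Instant.now().plus(operationTimeout);
        while (Instant.now().isBefore(deadline)) {
            if (isReady.get()) {
                return; // log recovery finished, continue with the remaining steps
            }
            Thread.sleep(1_000L); // re-observe the node periodically
        }
        if (numRetries >= maxRetries) {
            // The proposal throws UnrestartableNodesException at this point; a RuntimeException
            // stands in for it here so the sketch compiles on its own.
            throw new RuntimeException("Node " + nodeId + " is still performing log recovery after "
                    + maxRetries + " retries");
        }
        numRetries++;
        // Otherwise the algorithm repeats from step 2 and re-observes all node states.
    }
}
```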

Reconfigure nodes in the `RECONFIGURE` group:
- Check if `numReconfigAttempts` exceeds `maxReconfigAttempts`. If exceeded, add a restart reason and repeat from step 2. Otherwise, continue.
- Send `incrementalAlterConfig` request, transition state to `UNKNOWN`, and increment `numReconfigAttempts`.
- Wait for each node's state to transition to `READY` within `operationTimeoutMs`. If timeout is reached, repeat from step 2, otherwise continue.
Member

By definition the node's state will already be READY (otherwise it would have been in the RESTART_NOT_RUNNING group), therefore there is no transition to observe.

It's never been clear to me what would be a good safety check on a dynamic reconfig. Some of the reconfigurable configs could easily result in a borked cluster, so it feels like some kind of check is needed. I think we need to take into account any effects of reconfiguring this node on the other nodes in the cluster. I guess step 10 is intended to achieve this, but it's not clear to me how step 10 differs from just always restarting from step 2 after each reconfiguration.

Contributor Author

The roller will transition the node to the UNKNOWN state after taking an action so that the state can be observed again, but you are right, that would likely return READY immediately. As you said, step 10 will repeat from step 2 if by that point the reconfigured node has gone bad. When repeating from step 2, if the node is not ready but there is no reason to restart or reconfigure it (because it has already been reconfigured), we would end up waiting for it to become ready until the reconciliation fails. Perhaps we could fail the reconciliation with an error indicating that a node is not ready after reconfiguration, so that we notify the human operator to investigate through the log.
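For reference, a minimal sketch of the dynamic reconfiguration call itself, using the Kafka Admin API's `incrementalAlterConfigs`; the class name and the way the changed configs are passed in are illustrative, not the proposal's code:

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.stream.Collectors;

/** Illustrative sketch: apply a set of changed broker configs to one node without a restart. */
class ReconfigureSketch {
    static void reconfigureBroker(Admin admin, int brokerId, Map<String, String> changedConfigs)
            throws ExecutionException, InterruptedException {
        ConfigResource broker = new ConfigResource(ConfigResource.Type.BROKER, Integer.toString(brokerId));
        List<AlterConfigOp> ops = changedConfigs.entrySet().stream()
                .map(e -> new AlterConfigOp(new ConfigEntry(e.getKey(), e.getValue()), AlterConfigOp.OpType.SET))
                .collect(Collectors.toList());
        Map<ConfigResource, Collection<AlterConfigOp>> updates = Map.of(broker, ops);
        // Only dynamically reconfigurable configs can be applied this way; anything else still
        // needs a restart, which is why the algorithm falls back to adding a restart reason.
        admin.incrementalAlterConfigs(updates).all().get();
    }
}
```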

Updated the text on the diagram

Signed-off-by: Gantigmaa Selenge <tina.selenge@gmail.com>
@tinaselenge (Contributor Author)

Hi @tombentley @scholzj @ppatierno, do you have any further comments on this proposal?
