
Introducing new KafkaRoller #103

Open · wants to merge 15 commits into base: main from kafka-roller-2
Conversation

@tinaselenge tinaselenge (Contributor) commented Jan 2, 2024

For more implementation details, see the POC implementation in RackRolling.java. All the related classes are in the same package, rolling.

The tests illustrating various cases with different sets of configurations are in RackRollingTest.java.

The logic for switching to the new roller is in the KafkaReconciler.java class.

Made some improvements on the structure

Signed-off-by: Gantigmaa Selenge <tina.selenge@gmail.com>
@fvaleri fvaleri (Contributor) left a comment

Just a first pass, as I need more time to digest this. I think it would be useful to illustrate the new behavior with a couple of examples of the form: with this roller configuration and cluster state, these are the node groups and their restart order. Wdyt?

@tinaselenge tinaselenge force-pushed the kafka-roller-2 branch 2 times, most recently from c060c24 to 433316f Compare March 15, 2024 12:11
Tidy up

Signed-off-by: Gantigmaa Selenge <tina.selenge@gmail.com>
@tinaselenge tinaselenge marked this pull request as ready for review March 15, 2024 12:29
@tinaselenge (Contributor Author)

@fvaleri Thank you for the feedback. I have added an example of a rolling update. Please let me know what you think.

@see-quick see-quick (Member) left a comment

Nice proposal. Thanks for it 👍.

STs POV:

I think we would also need to design multiple tests to cover all the states that KafkaRoller v2 introduces. We have a few tests, but that's certainly not 100% coverage, so we should maybe have a meeting to talk about this...

Side note about performance:

What would be appropriate performance metrics for us to consider when designing performance tests? Are there any critical ones? I can certainly imagine that we would see a significant improvement in rolling updates of multiple nodes when we use the batching mechanism...

Co-authored-by: Maros Orsak <maros.orsak159@gmail.com>
Signed-off-by: Gantigmaa Selenge <39860586+tinaselenge@users.noreply.github.com>
@fvaleri fvaleri (Contributor) left a comment

@tinaselenge thanks for the example, it really helps.

I left some comments, let me know if something is not clear or you want to discuss further.

- Cruise Control sends a `removingReplicas` request to un-assign the partition from broker 2.
- KafkaRoller is performing a rolling update of the cluster. It checks the availability impact for partition foo-0 before rolling broker 1. Since partition foo-0 has ISR [1, 2, 4], KafkaRoller decides that it is safe to restart broker 1. It is unaware of the `removingReplicas` request that is about to be processed.
- The reassignment request is processed and partition foo-0 now has ISR [1, 4].
- KafkaRoller restarts broker 1 and partition foo-0 now has ISR [4], which is below the configured minimum in-sync replicas of 2, so producers with acks=all can no longer produce to this partition.
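The availability check described above can be sketched with the Kafka Admin API; this is illustrative only (the class and method names are not the proposal's actual code) and shows the kind of ISR versus min.insync.replicas comparison the roller performs before restarting a broker:

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;
import org.apache.kafka.common.config.ConfigResource;

import java.util.Map;
import java.util.Set;
import java.util.concurrent.ExecutionException;

/** Illustrative sketch: would restarting brokerId drop any of its partitions below min.insync.replicas? */
class AvailabilityCheckSketch {
    static boolean safeToRestart(Admin admin, int brokerId, Set<String> topicNames)
            throws ExecutionException, InterruptedException {
        Map<String, TopicDescription> topics = admin.describeTopics(topicNames).allTopicNames().get();
        for (TopicDescription topic : topics.values()) {
            ConfigResource resource = new ConfigResource(ConfigResource.Type.TOPIC, topic.name());
            Config config = admin.describeConfigs(Set.of(resource)).all().get().get(resource);
            ConfigEntry minIsrEntry = config.get("min.insync.replicas");
            int minIsr = minIsrEntry == null ? 1 : Integer.parseInt(minIsrEntry.value());
            for (TopicPartitionInfo partition : topic.partitions()) {
                boolean brokerInIsr = partition.isr().stream().anyMatch(node -> node.id() == brokerId);
                // Restarting a broker that is in the ISR shrinks the ISR by one; if that would leave
                // fewer than min.insync.replicas in-sync replicas, acks=all producers could be blocked.
                if (brokerInIsr && partition.isr().size() - 1 < minIsr) {
                    return false;
                }
            }
        }
        return true;
    }
}
```

Note that a snapshot-based check like this is exactly what the race above exploits: the ISR can change between the check and the restart.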
@fvaleri fvaleri (Contributor) commented Apr 25, 2024

In addition to rebalance, we have the same race condition with replication factor change (the new integration between CC and TO); maybe you can mention this.

The roller should be able to call CC's user_tasks endpoint and check if there is any pending task. In that case, the roller has two options: wait for all tasks to complete, or continue as today with the potential issue you describe here. You can't really stop the tasks, because the current batch will still be completed, and the operators will try to submit a new task in the next reconciliation loop.

I think we should let the user decide which policy to apply through a configuration. By default the roller would wait for all CC tasks to complete, logging a warning. If the user sets or switches to a "force" policy, then the roller would behave like today. Wdyt?

@tinaselenge (Contributor Author)

Should this be perhaps included/discussed in a separate proposal or issue? The idea was to mention that there is a race condition we could fix with the new roller in the future, which is not easy to fix with the old roller. How we fix it and other similar problems should be a separate discussion I think.

Contributor

This should have a dedicated proposal IMO, but let's start by logging an issue.
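For context, a minimal sketch of the kind of Cruise Control check suggested above; the user_tasks REST endpoint exists in Cruise Control, but the class name and the way the response would be interpreted here are assumptions, not Strimzi code:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/** Illustrative sketch: ask Cruise Control whether any user tasks are still pending. */
class CruiseControlTaskCheckSketch {
    static String fetchUserTasks(String cruiseControlBaseUrl) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(cruiseControlBaseUrl + "/kafkacruisecontrol/user_tasks?json=true"))
                .GET()
                .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        // The caller would parse the JSON body and look for tasks that are still active or executing;
        // if any exist, the roller would either wait (the proposed default) or proceed under a "force" policy.
        return response.body();
    }
}
```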


Would calling the ListReassigningPartitions API be enough to know this?
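A minimal sketch of that check with the Kafka Admin API (`Admin#listPartitionReassignments`); the class name is illustrative. Note it only observes reassignments that are already in flight, so it would not cover a Cruise Control task that has been accepted but not yet turned into reassignments:

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.PartitionReassignment;
import org.apache.kafka.common.TopicPartition;

import java.util.Map;
import java.util.concurrent.ExecutionException;

/** Illustrative sketch: is any partition reassignment currently in progress? */
class ReassignmentCheckSketch {
    static boolean reassignmentInProgress(Admin admin) throws ExecutionException, InterruptedException {
        Map<TopicPartition, PartitionReassignment> reassignments =
                admin.listPartitionReassignments().reassignments().get();
        // A non-empty map means replicas are still being added or removed somewhere, so an
        // ISR-based availability check may be racing against the reassignment.
        return !reassignments.isEmpty();
    }
}
```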

@katheris katheris (Contributor) left a comment

Overall this looks good to me, but I had a few questions and wording suggestions. I definitely think this will be useful since I've experienced first-hand how tricky it is to debug the existing code.

@tinaselenge tinaselenge force-pushed the kafka-roller-2 branch 3 times, most recently from 931adbd to 1060fee Compare April 30, 2024 13:58
Add possible transitions

Signed-off-by: Gantigmaa Selenge <tina.selenge@gmail.com>
If none of the above is true but the node is not ready, then its state would be `NOT_READY`.

#### Flow diagram describing the overall flow of the states
![The new roller flow](./images/06x-new-roller-flow.png)
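For readers without the proposal document at hand, the node states referred to in the discussion below can be sketched roughly as follows; the names are taken from this conversation and the descriptions are assumptions, so the proposal's own table is authoritative:

```java
/** Illustrative sketch of the per-node states named in this conversation; see the proposal's table for the real definitions. */
enum NodeState {
    UNKNOWN,                // context just created, no information observed yet
    NOT_RUNNING,            // the Kafka process is not running at all
    NOT_READY,              // running but not ready, and none of the other conditions apply
    RECOVERY,               // running but still performing log recovery
    READY,                  // running and ready
    LEADING_ALL_PREFERRED   // ready and leading all of its preferred replicas (the desired end state)
}
```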
Member

Some comments about the diagram of the FSM ...

  1. Each state has to be unique and not duplicated.
  2. If you are grouping states together, does it really mean they are different states? Or are they the same state with different context? It looks weird to me.
  3. Where you use "Action and Transition State", I think the Action should be made clearer, because it represents an output of the FSM. Is there a real output or just a transition? You don't need "Transition State" because it's defined by the arrow you have.

My general gut feeling is that ... you are describing a very complex state machine and its transitions in the "Algorithm" section, but then the visualization of it is really weak.

Contributor Author

I meant the diagram to be more high level, with the details broken down in the Algorithm section. Given that we may have to change various things around the states anyway, I will take this suggestion and recreate the diagram.

Member

Has the diagram changed after my comments? It seems almost the same. Also, what are "Serving", "Waiting", "Restarted" and "Reconfigured"? They are not listed as states in the table above.
I can't really match the table with the diagram, sorry :-(

Contributor Author

@ppatierno I have updated the diagram now. It's still more of a high-level flow showing the transitions. The possible states are listed, but that doesn't mean they are grouped together. Depending on the state, a different action is taken, and the possible actions are also listed. Do you think it's clearer now? Or do you think I should break it down so that each state maps to its possible actions, instead of listing them all together in the same bubble?

Member

Well, an FSM should have single states (but I can understand your grouping of more than one here, so let's leave it for now). What I can't understand is what the hexagon and its content are. They are not states, right? I can't see them in the table, so their presence confuses me.
Also, you are duplicating circles for Not_Ready/Not_Running/Recovery: you have one with them and Ready, and one with them alone. I think you should split them, having Ready on its own, as well as Leading_All_preferred.
Finally, as an FSM state diagram it should not contain other things like "Start" or "Desired State Reached"; you are describing an FSM, and those are not states.

Member

It's starting to take a better shape :-) but I would still improve it by reducing duplication. I think you can have just one "Unknown" (instead of 2) and one "Ready" state (instead of 3), especially because, AFAIU, the final desired state is "LeadingAllPreferred", right? And in the current graph not all "Ready" states end there.
Next we'll talk about the pink/red states, which have some duplication as well.

Contributor Author

Good to hear that it is in the right direction. I've taken the suggestion and made another update :). Thanks @ppatierno

Member

What exactly does "Iterate" mean on the orange arrows?

Contributor Author

That means repeating the process on the node if the desired state is not reached.

Contributor Author

I have updated the diagram with a bit more explanation instead of the "iterate" part.

## Rejected

- Why not use rack information when batching brokers that can be restarted at the same time?
When all replicas of all partitions have been assigned in a rack-aware way, brokers in the same rack trivially share no partitions, and so racks provide a safe partitioning. However, nothing in a broker, controller, or Cruise Control is able to enforce the rack-aware property, so assuming this property is unsafe. Even if CC is being used and rack-aware replicas is a hard goal, we can't be certain that other tooling hasn't reassigned some replicas since the last rebalance, or that no topics have been created in a rack-unaware way.
Member

I am not sure the above is considered a rejected alternative. I mean, this section is for solutions for the same goals which were rejected, while it seems to be used just to highlight that an "idea" within the current proposal was rejected.

Contributor Author

Of course, we would need to agree on whether we are rejecting this idea. Perhaps I should rename this section to "Other ideas considered"?

@tinaselenge tinaselenge force-pushed the kafka-roller-2 branch 2 times, most recently from d3e3f1d to 557eb7a Compare June 6, 2024 10:53
tinaselenge added a commit to tinaselenge/strimzi-kafka-operator that referenced this pull request Jul 2, 2024
tinaselenge added a commit to tinaselenge/strimzi-kafka-operator that referenced this pull request Jul 2, 2024
- Improve the names for categories and states
- Remove restarted/reconfigured states
- Add a configuration for delay between restarts
- Add a configuration for delay between restart and trigger of preferred leader election
- Restart NOT_RUNNING nodes in parallel for quicker recovery
- Improve the overall algorithm section, to make it clearer and more concise

Signed-off-by: Gantigmaa Selenge <tina.selenge@gmail.com>
@tinaselenge (Contributor Author)

Thanks everyone who reviewed the PR! I believe I have now addressed all the review comments except an update of the diagram (I will push that in a follow-up commit). @scholzj @ppatierno @fvaleri @tombentley, could you please take another look when you have time? Thank you very much.


Contexts are recreated in each reconciliation with the above initial data.

2. **Transition Node States:**
Update each node's state based on information from the abstracted sources. If the information cannot be retrieved, the reconciliation fails and restarts from step 1.
Member

I would say that this is about "loading" or "building" the current state of the node. Usually in our FSMs (e.g. rebalancing), this state comes from a custom resource; here it's coming from different sources. Maybe we can phrase it better than "update each node's ...".

Contributor Author

I'm not sure what would be better; does "Load each node's state based on information..." sound better?

We build a context in step 1, which has state UNKNOWN. In this step, we are updating that state based on the information from the sources. So to me, "update each node's state" sounds fine.
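A rough sketch of the per-node context being discussed, limited to fields mentioned in this conversation (the observed state, restart reasons, and the retry counters used later in the algorithm); the class shape is an assumption, not the proposal's actual code, and it reuses the NodeState enum sketched earlier:

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch: the per-node context rebuilt at the start of each reconciliation. */
class NodeContextSketch {
    final int nodeId;
    NodeState state = NodeState.UNKNOWN;   // step 1 creates the context as UNKNOWN; step 2 updates it from the sources
    final List<String> restartReasons = new ArrayList<>();
    int numRetries = 0;                    // incremented when a wait (e.g. for readiness or log recovery) times out
    int numReconfigAttempts = 0;           // incremented on each dynamic reconfiguration attempt

    NodeContextSketch(int nodeId) {
        this.nodeId = nodeId;
    }
}
```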

@fvaleri fvaleri (Contributor) left a comment

Hi @tinaselenge, I had another look at the example and I think it's great. I left a few more comments, but I think this would work.

@tinaselenge tinaselenge force-pushed the kafka-roller-2 branch 3 times, most recently from 4c035fa to bf71ae6 Compare July 19, 2024 08:30
@fvaleri fvaleri (Contributor) left a comment

Thanks for answering all my questions. Good job.

@tinaselenge (Contributor Author)

> Thanks for answering all my questions. Good job.

Thank you @fvaleri , I really appreciate you reviewing the proposal thoroughly.

- Although it is safe and straightforward to restart one broker at a time, this process is slow in large clusters ([related issue](https://github.com/strimzi/strimzi-kafka-operator/issues/8547)).
- It does not account for partition preferred leadership. As a result, there may be more leadership changes than necessary during a rolling restart, consequently impacting tail latency.
- Hard to reason about when things go wrong. The code is complex to understand, and it's not easy to determine from the logs, which tend to be noisy, why a pod was restarted.
- Potential race condition between a Cruise Control rebalance and KafkaRoller that could cause partitions to drop below their configured minimum in-sync replicas. This issue is described in more detail in the `Future Improvements` section.
Member

In general Slack is not really ideal for keeping details of problems in the long term. Better to create an issue, which can be discovered more easily by anyone who faces a similar problem.

Update each node's state based on information from the abstracted sources. If the information cannot be retrieved, the current reconciliation immediately fails. When the next reconciliation is triggered, it will restart from step 1.

3. **Handle `NOT_READY` Nodes:**
Wait for `NOT_READY` nodes to become `READY` within `operationTimeoutMs`.
Member

Judging from the fact that the next step covers NOT_READY, I'm guessing that we just fall through if the node is still NOT_READY after operationTimeoutMs. But you need to say that! And also explain, if we're prepared to fall through to the next step, why this timeout is even necessary.

Contributor Author

I have explained why we do the wait and that it falls through to the next step.

- `NOP`: Nodes needing no operation.

5. **Wait for Log Recovery:**
Wait for `WAIT_FOR_LOG_RECOVERY` nodes to become `READY` within `operationTimeoutMs`. If timeout is reached and `numRetries` exceeds `maxRetries`, throw `UnrestartableNodesException`. Otherwise, increment `numRetries` and repeat from step 2.
Member

For all these steps I think it would be really valuable to explain the why. Here we're willing to wait for brokers in log recovery because the following steps will result in actions, like restarting other brokers, which will be directly visible to clients. We prefer to start from a cluster that's as close to fully functional as possible.

Member

We also need to explain why we are willing to wait for log recovery here, but not willing to wait for all of a broker's replicas to rejoin the ISR.

IIRC the reason we had was that KafkaRoller's job was to restart things, but we didn't want to give it any responsibility for throttling inter-broker replication, and we can't guarantee (for all possible workloads) that brokers will always be able to catch up to the LEO (within a reasonable time).

Contributor Author

Thanks @tombentley. I have added the reasons.
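A sketch of the bounded wait and retry bookkeeping described in this step; `UnrestartableNodesException` is named in the proposal, while the class shape, the polling interval, and the use of a plain `RuntimeException` are assumptions made to keep the sketch self-contained:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

/** Illustrative sketch: wait for a node in log recovery, retrying a bounded number of times. */
class LogRecoveryWaitSketch {
    private int numRetries = 0;

    void waitForLogRecovery(int nodeId, Supplier<Boolean> isReady,
                            Duration operationTimeout, int maxRetries) throws InterruptedException {
        Instant deadline = Instant.now().plus(operationTimeout);
        while (Instant.now().isBefore(deadline)) {
            if (isReady.get()) {
                return; // log recovery finished, continue with the remaining steps
            }
            Thread.sleep(1_000L); // re-observe the node periodically
        }
        if (numRetries >= maxRetries) {
            // The proposal throws UnrestartableNodesException at this point; a RuntimeException
            // stands in for it here so the sketch compiles on its own.
            throw new RuntimeException("Node " + nodeId + " is still performing log recovery after "
                    + maxRetries + " retries");
        }
        numRetries++;
        // Otherwise the algorithm repeats from step 2 and re-observes all node states.
    }
}
```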

Reconfigure nodes in the `RECONFIGURE` group:
- Check if `numReconfigAttempts` exceeds `maxReconfigAttempts`. If exceeded, add a restart reason and repeat from step 2. Otherwise, continue.
- Send `incrementalAlterConfig` request, transition state to `UNKNOWN`, and increment `numReconfigAttempts`.
- Wait for each node's state to transition to `READY` within `operationTimeoutMs`. If timeout is reached, repeat from step 2, otherwise continue.
Member

By definition the node's state will already be READY (otherwise it would have been in the RESTART_NOT_RUNNING group), therefore there is no transition to observe.

It's never been clear to me what would be a good safety check on a dynamic reconfig. Some of the reconfigurable configs could easily result in a borked cluster, so it feels like some kind of check is needed. I think we need to take into account any effects of reconfiguring this node on the other nodes in the cluster. I guess step 10 is intended to achieve this, but it's not clear to me how step 10 differs from just always restarting from step 2 after each reconfiguration.

Contributor Author

The roller will transition the node to the UNKNOWN state after taking an action so that the state can be observed again, but you are right, that would likely return READY immediately. As you said, step 10 will repeat from step 2 if by that point the reconfigured node has gone bad. When repeating from step 2, if the node is not ready but there is no reason to restart or reconfigure it (because it has already been reconfigured), we would end up waiting for it to become ready until the reconciliation fails. Perhaps we could fail the reconciliation with an error indicating that a node is not ready after reconfiguration, so that we notify the human operator to investigate through the log.
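For reference, a minimal sketch of the dynamic reconfiguration call itself, using the Kafka Admin API's `incrementalAlterConfigs`; the class name and the way the changed configs are passed in are illustrative, not the proposal's code:

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.stream.Collectors;

/** Illustrative sketch: apply a set of changed broker configs to one node without a restart. */
class ReconfigureSketch {
    static void reconfigureBroker(Admin admin, int brokerId, Map<String, String> changedConfigs)
            throws ExecutionException, InterruptedException {
        ConfigResource broker = new ConfigResource(ConfigResource.Type.BROKER, Integer.toString(brokerId));
        List<AlterConfigOp> ops = changedConfigs.entrySet().stream()
                .map(e -> new AlterConfigOp(new ConfigEntry(e.getKey(), e.getValue()), AlterConfigOp.OpType.SET))
                .collect(Collectors.toList());
        Map<ConfigResource, Collection<AlterConfigOp>> updates = Map.of(broker, ops);
        // Only dynamically reconfigurable configs can be applied this way; anything else still
        // needs a restart, which is why the algorithm falls back to adding a restart reason.
        admin.incrementalAlterConfigs(updates).all().get();
    }
}
```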

Updated the text on the diagram

Signed-off-by: Gantigmaa Selenge <tina.selenge@gmail.com>
@tinaselenge (Contributor Author)

Hi @tombentley @scholzj @ppatierno, do you have any further comments on this proposal?
