Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix throttling replica list after a broker replacement #430

Closed
wants to merge 2 commits into from

Conversation

kravisha91
Copy link
Contributor

@kravisha91 kravisha91 commented Sep 28, 2023

During a broker replacement it was noticed that a broker that was previously involved in reassignment when got replaced with a new node (it is common for brokers to fail and get a replacement node) did not throttle all the topic replicas the broker was hosting and that led to increase network usage and eventually more load on the broker affecting clients connected to it causing latency issues. The proposed fix is to refresh the replica throttled list when there is a broker override irrespective of the broker involvement in other reassignments. More details can be found in the notebook - https://ddstaging.datadoghq.com/notebook/6601106/investigating-missing-throttles-on-topics-after-broker-replacement#overview

@kravisha91 kravisha91 force-pushed the kravishankar/STREAMS-2055 branch from 045a22a to 59e0513 Compare October 4, 2023 13:51
@kravisha91 kravisha91 force-pushed the kravishankar/STREAMS-2055 branch 3 times, most recently from ee92666 to fd22615 Compare October 4, 2023 19:27
@kravisha91 kravisha91 marked this pull request as ready for review October 5, 2023 13:32
@kravisha91 kravisha91 changed the title Apply throttle during broker replacement Fix throttling replica list after a broker replacement Oct 5, 2023
cmd/autothrottle/main.go Outdated Show resolved Hide resolved
cmd/autothrottle/main.go Outdated Show resolved Hide resolved
internal/autothrottle/replication/throttles.go Outdated Show resolved Hide resolved
toRemove[override.ID] = struct{}{}
} else {
toAssign[override.ID] = struct{}{}
capacities.storeLeaderAndFollerCapacity(override.ID, rate)
Copy link
Contributor

@mjd95 mjd95 Oct 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like it updates leader.replication.throttled.rate and follower.replication.throttled.rate only. What happens to the value of replication.throttled.replicas?

The desired state is that, for each topic, replication.throttled.replicas is set to be the union of whatever it should be for the reassignment, plus every replica on the broker with the override.

I'm guessing this is what should happen on L222 below, but I would expect that if the replaced broker was part of a reassignment then tm.skipOverrideTopicUpdates is true (so we skip), because of this code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only change I did here was to remove the if condition !override.ReassignmentParticipant since we also want to include brokers that were previously part of reassignment. You are correct L222 is where the replication.throttled.replicas is set based on tm.skipOverrideTopicUpdates but the control never gets here even if skipOverrideTopicUpdates is false if a broker is a reassignment participant because there is a return nil statement on line L210 which checks for any brokers in toAssign map. Hence I had to remove the if condition and allow it happen for all broker overrides.

internal/autothrottle/replication/topics_legacy.go Outdated Show resolved Hide resolved
@kravisha91 kravisha91 force-pushed the kravishankar/STREAMS-2055 branch from fd22615 to c8c1eaa Compare October 10, 2023 17:23
@kravisha91 kravisha91 closed this Feb 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants