-
Notifications
You must be signed in to change notification settings - Fork 85
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[server] Race condition fix for re-subscription of real-time topic partitions. #1192
Merged
haoxu07
merged 4 commits into
linkedin:main
from
haoxu07:raceConditionFixForResubscription
Sep 30, 2024
Merged
[server] Race condition fix for re-subscription of real-time topic partitions. #1192
haoxu07
merged 4 commits into
linkedin:main
from
haoxu07:raceConditionFixForResubscription
Sep 30, 2024
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
haoxu07
force-pushed
the
raceConditionFixForResubscription
branch
from
September 23, 2024 05:12
912157a
to
50f8772
Compare
haoxu07
changed the title
Race condition fix for resubscription.
[server] Race condition fix for re-subscription of real-time topic partitions.
Sep 23, 2024
haoxu07
force-pushed
the
raceConditionFixForResubscription
branch
2 times, most recently
from
September 23, 2024 06:00
489c3e7
to
74763da
Compare
haoxu07
force-pushed
the
raceConditionFixForResubscription
branch
from
September 24, 2024 00:48
74763da
to
41e81d3
Compare
haoxu07
force-pushed
the
raceConditionFixForResubscription
branch
from
September 25, 2024 21:53
41e81d3
to
9aeaaf8
Compare
haoxu07
changed the title
[server] Race condition fix for re-subscription of real-time topic partitions.
[server][WIP for debugging CI] Race condition fix for re-subscription of real-time topic partitions.
Sep 25, 2024
haoxu07
force-pushed
the
raceConditionFixForResubscription
branch
7 times, most recently
from
September 26, 2024 17:24
1a0fa1f
to
28df58b
Compare
haoxu07
changed the title
[server][WIP for debugging CI] Race condition fix for re-subscription of real-time topic partitions.
[server] Race condition fix for re-subscription of real-time topic partitions.
Sep 26, 2024
haoxu07
force-pushed
the
raceConditionFixForResubscription
branch
from
September 26, 2024 20:25
28df58b
to
bb726fc
Compare
haoxu07
force-pushed
the
raceConditionFixForResubscription
branch
from
September 27, 2024 00:19
b3b7174
to
0ad40ea
Compare
lluwm
reviewed
Sep 27, 2024
.../da-vinci-client/src/main/java/com/linkedin/davinci/kafka/consumer/KafkaConsumerService.java
Show resolved
Hide resolved
haoxu07
force-pushed
the
raceConditionFixForResubscription
branch
2 times, most recently
from
September 30, 2024 21:41
2a7466d
to
adade07
Compare
haoxu07
force-pushed
the
raceConditionFixForResubscription
branch
from
September 30, 2024 22:26
adade07
to
3ce8001
Compare
lluwm
approved these changes
Sep 30, 2024
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @haoxu07 for the fix and it looks good to me!
gaojieliu
added a commit
to gaojieliu/venice
that referenced
this pull request
Oct 7, 2024
…topic partitions. (linkedin#1192)" This reverts commit 06cc1fc.
2 tasks
haoxu07
added a commit
that referenced
this pull request
Oct 30, 2024
…pic partitions. (#1263) This improvement addresses a previously reverted #1192. The unit test in the prior fix triggered separate threads resubscribing to the same real-time topic partition, resulting in frequent resubscriptions. These frequent resubscriptions caused excessive metric emissions, leading to GC issues and causing other unit tests to fail in CI. To prevent metric emissions, we use a mocked AggKafkaConsumerServiceStats instead of passing null when instantiating KafkaConsumerService in the unit test, as KafkaConsumerService would otherwise create a real AggKafkaConsumerServiceStats when null is passed. Previous #1192 commit message: Recently we introduced re-subscription feature for different store versions. Leader and follower partitions from different version topics will experience re-subscription triggered by store version role change concurrently. Problem: Two StoreIngestionTask threads for different store versions (store version 1 and store version 2) try to do re-subscription (unsub and sub) for the same real-time topic partition concurrently. During re-subscription triggered by one store version, operation of consumer.unSubscribe to remove assignment of the topic partition and operation of consumerToConsumptionTask.get(consumer).removeDataReceiver(pubSubTopicPartition) are sequentially executed. It is possible that StoreIngestionTask thread for store version 2 got the same ConsumptionTask, but DataReceiver inside ConsumptionTask is still from store version 1). As each ConsumptionTask will only allow one DataReceiver for one particular real-time topic partition, the DataReceiver from store version 2 will not be able to be added to this ConsumptionTask. Solution: Using SharedKafkaConsumer level lock to protect KafkaConsumcerService for unSubscribe, SetDataReceiver, unSubscribeAll to guarantee there will one real-time partition from specific store version doing DataReceiver related assignment change. Co-authored-by: Hao Xu <xhao@xhao-mn3.linkedin.biz>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Summary, imperative, start upper case, don't end with a period
Recently we introduced re-subscription feature for different store versions. Leader and follower partitions from different version topics will experience re-subscription triggered by store version role change concurrently.
Problem:
Two
StoreIngestionTask
threads for different store versions (store version 1 and store version 2) try to do re-subscription (unsub and sub) for the same real-time topic partition concurrently. During re-subscription triggered by one store version, operation ofconsumer.unSubscribe
to remove assignment of the topic partition and operation ofconsumerToConsumptionTask.get(consumer).removeDataReceiver(pubSubTopicPartition)
are sequentially executed. It is possible thatStoreIngestionTask
thread for store version 2 got the sameConsumptionTask
, butDataReceiver
insideConsumptionTask
is still from store version 1). As eachConsumptionTask
will only allow oneDataReceiver
for one particular real-time topic partition, theDataReceiver
from store version 2 will not be able to be added to thisConsumptionTask
.Solution:
Using
SharedKafkaConsumer
level lock to protectKafkaConsumcerService
forunSubscribe
,SetDataReceiver
,unSubscribeAll
to guarantee there will one real-time partition from specific store version doingDataReceiver
related assignment change.How was this PR tested?
Does this PR introduce any user-facing changes?