-
Notifications
You must be signed in to change notification settings - Fork 79
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[da-vinci] Disabled status reporting when the DaVinci app is still bootstrapping #1203
base: main
Are you sure you want to change the base?
[da-vinci] Disabled status reporting when the DaVinci app is still bootstrapping #1203
Conversation
…otstrapping In Prod, we are seeing the bootstrapping nodes are blocking the new push, and this PR disables the status reporting if the app is still bootstrapping. This feature is especially important for DaVinci apps, which are using multiple large stores and bootstrapping from an empty state can take a long time and it can delay the new push significantly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, makes sense, LGTM.
clients/da-vinci-client/src/main/java/com/linkedin/davinci/DaVinciBackend.java
Outdated
Show resolved
Hide resolved
1. Subscription won't return if there are still active current version bootstrapping to avoid serving stale current version. 2. Delete the heartbeat entry for the current instance if the instance is bootstrapping to avoid Controller mark the bootstrapping node as crashed, which can fail the new push jobs.
} | ||
|
||
public static boolean hasCurrentVersionBootstrapping(Map<String, StoreIngestionTask> ingestionTaskMap) { | ||
for (Map.Entry<String, StoreIngestionTask> entry: ingestionTaskMap.entrySet()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a question/concern: Does this work with ingestion isolation feature? Because essentially ingestion will start on forked process
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In theory, it should.
At the startup time, all the version ingestions will happen in the isolated process including current version until it is finished, and once it is finished and handed over to the main process, it won't matter as it is completed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But I am still thinking the case -
Say there is a II DVC, now it is bootstrapping with one current version and a future version.
Regarding future version, it will still ingest in forked process until RTS is reached. But in the middle, it will report STARTED/END_OF_PUSH_RECEIVED and the above change in the reportPushStatus logic does block it from reporting. So ZK/system store will still register such status. And when controller is polling the job status, it will see it as a not-ready-replica and blocked on it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually it will not block, but it may potentially lead to too many dead instances because you also delete HB for the host, and fail fast for the job? (but if these bootstrapping nodes are small enough, then we might be good.....)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
delete HB won't result in dead instances and it is used to avoid dead instances..
Summary
In Prod, we are seeing the bootstrapping nodes are blocking the new push, and this PR will let Controller ignore the status report from the nodes, which are still bootstrapping.
This feature is especially important for DaVinci apps, which are using multiple large stores and bootstrapping from an empty state can take a long time and it can delay the new push significantly.
How was this PR tested?
Does this PR introduce any user-facing changes?