Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[da-vinci] Disabled status reporting when the DaVinci app is still bootstrapping #1203

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

gaojieliu
Copy link
Contributor

@gaojieliu gaojieliu commented Sep 27, 2024

Summary

In Prod, we are seeing the bootstrapping nodes are blocking the new push, and this PR will let Controller ignore the status report from the nodes, which are still bootstrapping.
This feature is especially important for DaVinci apps, which are using multiple large stores and bootstrapping from an empty state can take a long time and it can delay the new push significantly.

How was this PR tested?

Does this PR introduce any user-facing changes?

  • No. You can skip the rest of this section.
  • Yes. Make sure to explain your proposed changes and call out the behavior change.

…otstrapping

In Prod, we are seeing the bootstrapping nodes are blocking the
new push, and this PR disables the status reporting if the app
is still bootstrapping.
This feature is especially important for DaVinci apps, which are
using multiple large stores and bootstrapping from an empty state
can take a long time and it can delay the new push significantly.
FelixGV
FelixGV previously approved these changes Sep 27, 2024
Copy link
Contributor

@FelixGV FelixGV left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, makes sense, LGTM.

1. Subscription won't return if there are still active current version bootstrapping
   to avoid serving stale current version.
2. Delete the heartbeat entry for the current instance if the instance is bootstrapping
   to avoid Controller mark the bootstrapping node as crashed, which can fail the
   new push jobs.
}

public static boolean hasCurrentVersionBootstrapping(Map<String, StoreIngestionTask> ingestionTaskMap) {
for (Map.Entry<String, StoreIngestionTask> entry: ingestionTaskMap.entrySet()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a question/concern: Does this work with ingestion isolation feature? Because essentially ingestion will start on forked process

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In theory, it should.
At the startup time, all the version ingestions will happen in the isolated process including current version until it is finished, and once it is finished and handed over to the main process, it won't matter as it is completed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I am still thinking the case -
Say there is a II DVC, now it is bootstrapping with one current version and a future version.
Regarding future version, it will still ingest in forked process until RTS is reached. But in the middle, it will report STARTED/END_OF_PUSH_RECEIVED and the above change in the reportPushStatus logic does block it from reporting. So ZK/system store will still register such status. And when controller is polling the job status, it will see it as a not-ready-replica and blocked on it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually it will not block, but it may potentially lead to too many dead instances because you also delete HB for the host, and fail fast for the job? (but if these bootstrapping nodes are small enough, then we might be good.....)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

delete HB won't result in dead instances and it is used to avoid dead instances..

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants