forked from tensorflow/tensorflow
Commit
Add cluster register barrier feature - synchronized connects.
This has a few benefits:
* Simplify state changes during init to be an all-or-nothing op: nobody will send heartbeats until everybody in the cluster has acked. That's fewer error flows to reason about during init.
* This helps with consistent in-sync restarts. There are no weird staggering / cascading restart flows regardless of what the scheduler does.

Concretely:
1. Task invokes connect.
2. Block until all tasks connect. In the meantime, tasks may restart multiple times without bringing down the service, because initialization has not completed and there's nothing stateful that is corrupted.
3. All tasks connect together.
4. Start sending heartbeats.

---

Previously, we started sending heartbeats immediately after (1), which meant that tasks restarting before (2) completed would result in service crashes, causing additional unnecessary restarts + scheduling overhead.

PiperOrigin-RevId: 689142223
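The four steps above can be sketched as a small in-process simulation. This is only an illustration of the barrier idea, not the code changed by this commit: the class `RegisterBarrier`, its methods, and the `incarnation` parameter are hypothetical names, and a condition variable stands in for the real connect RPC. Restarts after the barrier has passed are out of scope for the sketch.

```python
import threading


class RegisterBarrier:
    """Hypothetical sketch: connect() blocks until every task has registered."""

    def __init__(self, num_tasks):
        self._num_tasks = num_tasks
        self._incarnations = {}  # task_id -> latest incarnation seen
        self._ready = False      # becomes True once all tasks have connected
        self._cond = threading.Condition()

    def connect(self, task_id, incarnation, timeout=None):
        """Register a task and block until the whole cluster has connected.

        Returns True if the barrier passed for this incarnation, False if the
        call timed out or was superseded by a newer incarnation of the task.
        """
        with self._cond:
            if not self._ready:
                # A restart before the barrier passes just overwrites the old
                # registration; nothing stateful exists yet, so nothing fails.
                self._incarnations[task_id] = incarnation
                if len(self._incarnations) == self._num_tasks:
                    self._ready = True
                    self._cond.notify_all()
            passed = self._cond.wait_for(lambda: self._ready, timeout)
            current = self._incarnations.get(task_id) == incarnation
            if not passed and current:
                # Timed out: withdraw so a stale entry cannot complete the barrier.
                del self._incarnations[task_id]
            return passed and current

    def heartbeats_enabled(self):
        """Heartbeats may start only after every task has connected (step 4)."""
        with self._cond:
            return self._ready
```

In this sketch a task would start its heartbeat loop only after `connect` returns `True`, so no task can be declared dead by a missed heartbeat while the cluster is still assembling.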
1 parent ed9d15b, commit 25e1991
Showing 5 changed files with 217 additions and 34 deletions.