broker: call PMI_Abort() if something goes wrong during PMI bootstrap #6279
Problem: the UPMI abstract PMI interface does not implement the PMI abort function, but this is useful to avoid hangs when something goes wrong on one node during PMI bootstrap. Implement the abort function in the abstract interface and in the simple, libpmi, libpmi2, and singleton implementations.
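As a rough illustration of what adding abort to an abstract interface involves, here is a minimal sketch of an operations table with an optional abort entry. The names `pmi_ops`, `pmi_handle`, `pmi_handle_abort`, and `singleton_ops` are hypothetical and are not taken from the UPMI source; the real structures and signatures may differ.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical per-implementation operations table (not the real UPMI
 * structs).  Each backend (simple, libpmi, libpmi2, singleton) would
 * supply its own table.
 */
struct pmi_ops {
    const char *name;
    int (*barrier_op) (void *ctx);
    int (*abort_op) (void *ctx, const char *msg);   /* may be NULL */
};

struct pmi_handle {
    const struct pmi_ops *ops;
    void *ctx;
};

/* Dispatch abort to the selected implementation.
 * Returns the implementation's result, or -1 with errno set to ENOSYS
 * if the implementation does not provide an abort operation.
 */
int pmi_handle_abort (struct pmi_handle *h, const char *msg)
{
    assert (h != NULL && h->ops != NULL);
    if (!h->ops->abort_op) {
        errno = ENOSYS;
        return -1;
    }
    return h->ops->abort_op (h->ctx, msg ? msg : "");
}

/* Singleton stand-in: there are no peers to notify, so this sketch only
 * records that abort was requested.  A real implementation would
 * presumably terminate the process.
 */
static int singleton_abort (void *ctx, const char *msg)
{
    int *called = ctx;
    (void)msg;
    *called = 1;
    return 0;
}

const struct pmi_ops singleton_ops = {
    .name = "singleton",
    .barrier_op = NULL,
    .abort_op = singleton_abort,
};
```

Making the entry optional lets the generic layer report a missing abort as an error rather than crash on a NULL function pointer.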
Problem: the instance hangs during startup if a bind address cannot be determined. If a fatal error occurs during PMI bootstrap on some but not all ranks, some brokers may block forever in the PMI barrier. Call the PMI abort function when something goes wrong during PMI bootstrap. Fixes flux-framework#6278
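The control flow described above might look roughly like the following sketch. It uses a fake PMI client (`fake_pmi`, `fake_barrier`, `fake_abort`) that only counts calls, and a hypothetical `bootstrap()` function; none of these names come from the broker source, and a NULL bind address is used here to model "bind address cannot be determined".

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for a PMI client: it only counts calls. */
struct fake_pmi {
    int barrier_calls;
    int abort_calls;
    char abort_msg[128];
};

static void fake_barrier (struct fake_pmi *pmi)
{
    pmi->barrier_calls++;
}

static void fake_abort (struct fake_pmi *pmi, const char *msg)
{
    pmi->abort_calls++;
    snprintf (pmi->abort_msg, sizeof (pmi->abort_msg), "%s", msg);
}

/* Returns 0 on success, or -1 if bootstrap failed and abort was called.
 * bind_addr == NULL models a rank that cannot determine its bind address.
 */
int bootstrap (struct fake_pmi *pmi, const char *bind_addr)
{
    assert (pmi != NULL);
    if (!bind_addr) {
        /* Fatal local error: notify the PMI server instead of silently
         * exiting, so that peers are not left waiting in the barrier.
         */
        fake_abort (pmi, "could not determine bind address");
        return -1;
    }
    /* ... a real broker would publish bind_addr for its peers here ... */
    fake_barrier (pmi);
    return 0;
}
```

The point of the sketch is that the failing rank never reaches the barrier, so without the abort call the other ranks would have no way to learn that it is not coming.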
Problem: some flux-start.c code does not conform to project norms. Break long parameter lists to one per line.
Problem: if the flux broker calls the PMI abort function, flux-start (the PMI server) is not notified. Add an abort callback that logs a message and asks all subprocesses to terminate immediately. Update tbon.interface-hint test that was expecting broker to exit 1 on a bad hint. The broker is now terminated with SIGKILL. Update another tbon.interface-hint test that required a ratcheted down timeout. That test now fails immediately.
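For readers unfamiliar with the server side, the following is a minimal POSIX sketch of an abort callback that logs a message and then sends SIGKILL to every child it started. The `launcher` struct and the `launcher_*` functions are hypothetical and do not reflect how flux-start actually tracks its subprocesses; the children here simply sleep in place of brokers.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_CLIENTS 16

/* Hypothetical launcher state: pids of locally started processes. */
struct launcher {
    pid_t pids[MAX_CLIENTS];
    int count;
};

/* Start one child that just sleeps, standing in for a broker.
 * Returns 0 on success, -1 on failure.
 */
int launcher_spawn (struct launcher *l)
{
    pid_t pid;

    assert (l != NULL);
    if (l->count == MAX_CLIENTS)
        return -1;
    if ((pid = fork ()) < 0)
        return -1;
    if (pid == 0) {
        for (;;)
            pause ();           /* child: wait until killed */
    }
    l->pids[l->count++] = pid;
    return 0;
}

/* Abort callback: log which rank aborted, then SIGKILL every child. */
void launcher_abort_cb (struct launcher *l, int rank, const char *msg)
{
    fprintf (stderr, "rank %d called PMI abort: %s\n", rank, msg);
    for (int i = 0; i < l->count; i++)
        kill (l->pids[i], SIGKILL);
}

/* Reap all children; return how many were terminated by SIGKILL. */
int launcher_reap (struct launcher *l)
{
    int killed = 0;

    for (int i = 0; i < l->count; i++) {
        int status;
        if (waitpid (l->pids[i], &status, 0) == l->pids[i]
            && WIFSIGNALED (status)
            && WTERMSIG (status) == SIGKILL)
            killed++;
    }
    return killed;
}
```

Because the children die from a signal rather than exiting on their own, a test that previously checked for exit code 1 would need to expect termination by SIGKILL instead, which matches the test update described above.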
Problem: there is no way to test the UPMI abort function without starting a Flux instance. Add an --abort=RANK option to 'flux pmi barrier'. The specified rank calls the abort function instead of the barrier.
Problem: the shell pmi plugin includes code that does not conform to project norms. Break long parameter lists to one per line. Indent function parameters to the same level.
Nice! LGTM!
@@ -104,7 +104,7 @@ static void shell_pmi_abort (void *arg,
*/
flux_shell_raise ("exec",
0,
"MPI_Abort%s%s",
Commit message typo: "when the broker call" should be "when the broker calls".
Also, in "the shell PMI plugin logs an exception message calls out MPI_Abort()", should that be "that calls out MPI_Abort()"?
Thanks! I pushed that commit message fix and will set MWP.
Problem: when the broker calls the PMI abort function, the shell PMI plugin logs an exception message that calls out "MPI_Abort()" but MPI is not involved. Log "PMI_Abort()" instead. Fix one test that expected the old message.
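The two `%s` conversions in the exception format string suggest an optional suffix after the function name. A minimal sketch of that formatting pattern follows, assuming the suffix is a ": " separator plus an optional message; the actual arguments passed in the shell plugin are not shown in this conversation, and `format_abort_note` is a hypothetical helper.

```c
#include <assert.h>
#include <stdio.h>

/* Format an abort note into buf.  The ": " separator and the optional
 * message are assumptions made for illustration.
 * Returns the snprintf() result (length of the full note).
 */
int format_abort_note (char *buf, size_t size, const char *msg)
{
    assert (buf != NULL && size > 0);
    return snprintf (buf,
                     size,
                     "PMI_Abort%s%s",
                     msg ? ": " : "",
                     msg ? msg : "");
}
```

With a message of "boom" this yields "PMI_Abort: boom", and with no message it yields just "PMI_Abort".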
Problem: there is no coverage for abort in some of the upmi implementations. Add tests.
Problem: the barrier --abort=RANK option has no documentation. Add it to the man page.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #6279 +/- ##
==========================================
- Coverage 83.36% 83.35% -0.02%
==========================================
Files 522 522
Lines 85927 85993 +66
==========================================
+ Hits 71636 71681 +45
- Misses 14291 14312 +21
Problem: brokers can hang if some of them enter the PMI barrier and others encounter a fatal error.
For example, #6278 demonstrates such a hang when tbon.interface-hint suggests an interface that doesn't exist.
With this PR, that example fails immediately like this:
or