Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

broker: call PMI_Abort() if something goes wrong during PMI bootstrap #6279

Merged
merged 9 commits into from
Sep 12, 2024

Conversation

garlick
Copy link
Member

@garlick garlick commented Sep 12, 2024

Problem: brokers can hang if some of them enter the PMI barrier and others encounter a fatal error.

For example, #6278 demonstrates such a hang when tbon.interface-hint suggests an interface that doesn't exist.

With this PR, that example fails immediately like this:

$ flux start -s2 -o,-Stbon.prefertcp=1,-Stbon.interface-hint=192.16.1.1/24
flux-broker: could not find address associated with 192.16.1.1/24
flux-start: 0: PMI_Abort(): fatal bootstrap error
flux-start: 0 (pid 1227875) exited with rc=1
flux-start: 1 (pid 1227876) Killed

or

$ flux start -s2
ƒ(s=2,d=0,builddir) $ flux batch -N2 --broker-opts=-Stbon.interface-hint=192.16.1.1/24 --wrap true
ƒH51ZWG7

ƒ(s=2,d=0,builddir) $ flux job attach ƒH51ZWG7
0.075s: job.exception type=exec severity=0 PMI_Abort: fatal bootstrap error
flux-job: task(s) Killed
0: stdout redirected to flux-ƒH51ZWG7.out
0: stderr redirected to flux-ƒH51ZWG7.out

ƒ(s=2,d=0,builddir) $ cat flux-ƒH51ZWG7.out
flux-broker: could not find address associated with 192.16.1.1/24

src/cmd/builtin/pmi.c Fixed Show fixed Hide fixed
Problem: the UPMI abstract PMI interface does not implement the
PMI abort function, but this is useful to avoid hangs when something
goes wrong on one node during PMI bootstrap.

Implement the abort function in the abstract interface and in the
simple, libpmi, libpmi2, and singleton implementations.
Problem: the instance hangs during startup if a bind address cannot
be determined.

If a fatal error occurs during PMI bootstrap on some but not all ranks,
some brokers may block forever in the PMI barrier.

Call the PMI abort function when something goes wrong during PMI bootstrap.

Fixes flux-framework#6278
Problem: some flux-start.c code does not conform to project norms.

Break long parameter lists to one per line.
Problem: if the flux broker calls the PMI abort function,
flux-start (the PMI server) is not notified.

Add an abort callback that logs a message and asks all subprocesses
to terminate immediately.

Update tbon.interface-hint test that was expecting broker to exit 1
on a bad hint.  The broker is now terminated with SIGKILL.

Update another tbon.interface-hint test that required a ratcheted down
timeout.  That test now fails immediately.
Problem: there is no way to test the UPMI abort function without
starting a Flux instance.

Add an --abort=RANK option to 'flux pmi barrier'.
The specified rank calls the abort function instead of the barrier.
Problem: the shell pmi plugin includes code that does not conform
to project norms.

Break long parameter lists to one per line.
Indent function parameters to the same level.
Copy link
Contributor

@grondo grondo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! LGTM!

@@ -104,7 +104,7 @@ static void shell_pmi_abort (void *arg,
*/
flux_shell_raise ("exec",
0,
"MPI_Abort%s%s",
Copy link
Contributor

@grondo grondo Sep 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commit message typo: "when the broker call" should be "when the broker calls".

Also

the shell PMI plugin logs an exception message calls out MPI_Abort()

"that calls out MPI_Abort()"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I pushed that commit message fix and will set MWP.

Problem: when the broker calls PMI the abort function, the shell PMI
plugin logs an exception message that calls out "MPI_Abort()" but MPI
is not involved.

Log "PMI_Abort()" instead.
Fix one test that expected the old message.
Problem: there is no coverage for abort in some of the upmi
implementations.

Add tests.
Problem: the barrier --abort=RANK option has no documentation.

Add it to the man page.
@mergify mergify bot merged commit 591b9f8 into flux-framework:master Sep 12, 2024
31 checks passed
Copy link

codecov bot commented Sep 12, 2024

Codecov Report

Attention: Patch coverage is 71.23288% with 21 lines in your changes missing coverage. Please review.

Project coverage is 83.35%. Comparing base (95fc972) to head (fba4dee).
Report is 10 commits behind head on master.

Files with missing lines Patch % Lines
src/common/libpmi/upmi.c 36.36% 7 Missing ⚠️
src/broker/boot_pmi.c 25.00% 3 Missing ⚠️
src/common/libpmi/upmi_libpmi.c 72.72% 3 Missing ⚠️
src/common/libpmi/upmi_libpmi2.c 72.72% 3 Missing ⚠️
src/common/libpmi/upmi_simple.c 66.66% 3 Missing ⚠️
src/cmd/builtin/pmi.c 85.71% 1 Missing ⚠️
src/common/libpmi/upmi_single.c 83.33% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6279      +/-   ##
==========================================
- Coverage   83.36%   83.35%   -0.02%     
==========================================
  Files         522      522              
  Lines       85927    85993      +66     
==========================================
+ Hits        71636    71681      +45     
- Misses      14291    14312      +21     
Files with missing lines Coverage Δ
src/cmd/flux-start.c 83.65% <100.00%> (+0.26%) ⬆️
src/shell/pmi/pmi.c 81.42% <100.00%> (ø)
src/cmd/builtin/pmi.c 84.49% <85.71%> (+0.06%) ⬆️
src/common/libpmi/upmi_single.c 79.62% <83.33%> (+0.46%) ⬆️
src/broker/boot_pmi.c 65.39% <25.00%> (-2.44%) ⬇️
src/common/libpmi/upmi_libpmi.c 78.91% <72.72%> (-0.66%) ⬇️
src/common/libpmi/upmi_libpmi2.c 78.46% <72.72%> (-0.35%) ⬇️
src/common/libpmi/upmi_simple.c 80.00% <66.66%> (-1.58%) ⬇️
src/common/libpmi/upmi.c 75.57% <36.36%> (-1.48%) ⬇️

... and 17 files with indirect coverage changes

@garlick garlick deleted the issue#6278 branch September 12, 2024 20:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants