broker: call PMI_Abort() if something goes wrong during PMI bootstrap #6279
Commits on Sep 12, 2024
Problem: the UPMI abstract PMI interface does not implement the PMI abort function, but this is useful to avoid hangs when something goes wrong on one node during PMI bootstrap. Implement the abort function in the abstract interface and in the simple, libpmi, libpmi2, and singleton implementations.
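The pattern this commit describes, an abstract interface that forwards an abort request to whichever backend is in use, can be sketched roughly as follows. This is an illustration only: `struct pmi_ops`, `struct pmi_handle`, `pmi_handle_abort()`, and the toy singleton backend are hypothetical names, not the actual UPMI code. A real abort would normally terminate the process rather than return.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical operations table: each backend fills in what it supports. */
struct pmi_ops {
    const char *name;
    int (*barrier) (void *ctx);
    int (*abort) (void *ctx, const char *msg); /* NULL if unsupported */
};

/* Hypothetical handle pairing a backend with its private state. */
struct pmi_handle {
    const struct pmi_ops *ops;
    void *ctx;
};

/* Abstract abort: dispatch to the backend, or fail with ENOSYS when the
 * backend has no abort operation.
 */
int pmi_handle_abort (struct pmi_handle *h, const char *msg)
{
    if (!h || !h->ops) {
        errno = EINVAL;
        return -1;
    }
    if (!h->ops->abort) {
        errno = ENOSYS;
        return -1;
    }
    return h->ops->abort (h->ctx, msg ? msg : "unknown error");
}

/* Toy singleton backend that only records the abort request. */
struct singleton_ctx {
    int aborted;
    char reason[64];
};

static int singleton_barrier (void *ctx)
{
    (void)ctx;
    return 0; /* one process: nothing to wait for */
}

static int singleton_abort (void *ctx, const char *msg)
{
    struct singleton_ctx *s = ctx;

    assert (s != NULL);
    s->aborted = 1;
    strncpy (s->reason, msg, sizeof (s->reason) - 1);
    s->reason[sizeof (s->reason) - 1] = '\0';
    return 0;
}

const struct pmi_ops singleton_ops = {
    "singleton",
    singleton_barrier,
    singleton_abort,
};
```

Keeping the abort pointer optional lets the wrapper report a missing operation cleanly instead of crashing on a backend that has not implemented it yet.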
Commit f98ae8b
broker: call upmi_abort() on PMI bootstrap error
Problem: the instance hangs during startup if a bind address cannot be determined. If a fatal error occurs during PMI bootstrap on some but not all ranks, some brokers may block forever in the PMI barrier. Call the PMI abort function when something goes wrong during PMI bootstrap. Fixes flux-framework#6278
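The control flow behind this fix can be illustrated with a small sketch: if a rank fails before reaching the barrier, it calls abort so its peers are not left waiting. Everything here is assumed for illustration, including `struct boot_ops`, `bootstrap()`, and the stub functions. The real broker bootstrap has more steps and different signatures.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical hooks used by the bootstrap sketch. */
struct boot_ops {
    int (*get_bind_address) (char *buf, size_t len);
    int (*barrier) (void);
    void (*abort) (const char *msg);
};

enum boot_result { BOOT_OK = 0, BOOT_ABORTED = 1 };

/* Any failure results in an abort call.  Simply returning an error after
 * a pre-barrier failure would leave the other ranks blocked in the
 * barrier forever.
 */
enum boot_result bootstrap (const struct boot_ops *ops)
{
    char addr[128];

    assert (ops != NULL);
    if (ops->get_bind_address (addr, sizeof (addr)) < 0) {
        ops->abort ("could not determine bind address");
        return BOOT_ABORTED;
    }
    if (ops->barrier () < 0) {
        ops->abort ("barrier failed");
        return BOOT_ABORTED;
    }
    return BOOT_OK;
}

/* Stubs for demonstration only. */
int abort_calls;
int barrier_calls;

int stub_bind_ok (char *buf, size_t len)
{
    snprintf (buf, len, "example-address");
    return 0;
}

int stub_bind_fail (char *buf, size_t len)
{
    (void)buf;
    (void)len;
    return -1;
}

int stub_barrier (void)
{
    barrier_calls++;
    return 0;
}

void stub_abort (const char *msg)
{
    abort_calls++;
    fprintf (stderr, "abort: %s\n", msg);
}
```

With `stub_bind_fail` plugged in, `bootstrap()` calls the abort hook once and never enters the barrier.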
Commit 9370400
flux-start: fix whitespace issues
Problem: some flux-start.c code does not conform to project norms. Break long parameter lists to one per line.
Commit cd92cd3
flux-start: implement PMI abort callback
Problem: if the flux broker calls the PMI abort function, flux-start (the PMI server) is not notified. Add an abort callback that logs a message and asks all subprocesses to terminate immediately. Update the tbon.interface-hint test that expected the broker to exit with code 1 on a bad hint; the broker is now terminated with SIGKILL. Update another tbon.interface-hint test that required a ratcheted-down timeout; that test now fails immediately.
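A server-side abort callback of the kind described here might look roughly like the POSIX sketch below: log the request, then send SIGKILL to every child. The names `struct server`, `server_spawn()`, `server_abort_cb()`, and `server_reap()` are invented for illustration and do not reflect the flux-start source. The children are plain sleeping processes standing in for brokers.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_CHILDREN 8

/* Hypothetical server state: the pids of the processes it launched. */
struct server {
    pid_t pids[MAX_CHILDREN];
    int count;
};

/* Start up to n children that sleep, standing in for brokers. */
int server_spawn (struct server *s, int n)
{
    assert (s != NULL);
    s->count = 0;
    for (int i = 0; i < n && i < MAX_CHILDREN; i++) {
        pid_t pid = fork ();
        if (pid < 0)
            return -1;
        if (pid == 0) {
            sleep (30); /* child: wait to be killed */
            _exit (0);
        }
        s->pids[s->count++] = pid;
    }
    return 0;
}

/* Abort callback: log which rank asked, then kill every child. */
void server_abort_cb (struct server *s, int rank, const char *msg)
{
    assert (s != NULL);
    fprintf (stderr, "rank %d called abort: %s\n", rank, msg);
    for (int i = 0; i < s->count; i++)
        kill (s->pids[i], SIGKILL);
}

/* Reap all children and count those terminated by SIGKILL. */
int server_reap (struct server *s)
{
    int killed = 0;

    assert (s != NULL);
    for (int i = 0; i < s->count; i++) {
        int status;
        if (waitpid (s->pids[i], &status, 0) == s->pids[i]
            && WIFSIGNALED (status)
            && WTERMSIG (status) == SIGKILL)
            killed++;
    }
    return killed;
}
```

Because the children die from a signal rather than exiting on their own, a test that previously checked for exit code 1 has to check for termination by SIGKILL instead, which matches the test updates mentioned in the commit message.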
Commit 7254e54
flux-pmi: add barrier --abort option
Problem: there is no way to test the UPMI abort function without starting a Flux instance. Add an --abort=RANK option to 'flux pmi barrier'. The specified rank calls the abort function instead of the barrier.
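The per-rank decision that `--abort=RANK` implies can be sketched as below. This is a standalone illustration, not the flux-pmi implementation: `parse_abort_rank()` and `choose_action()` are hypothetical helpers, and the real command presumably uses the project's own option parser rather than hand-rolled string handling.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

enum action { ACTION_BARRIER, ACTION_ABORT };

/* Parse an argument of the form "--abort=RANK".
 * Returns the rank, or -1 if the argument is not that option or the
 * value is not a non-negative integer.
 */
int parse_abort_rank (const char *arg)
{
    const char *prefix = "--abort=";
    size_t n = strlen (prefix);
    char *end;
    long val;

    assert (arg != NULL);
    if (strncmp (arg, prefix, n) != 0 || arg[n] == '\0')
        return -1;
    val = strtol (arg + n, &end, 10);
    if (*end != '\0' || val < 0 || val > INT_MAX)
        return -1;
    return (int)val;
}

/* Only the selected rank aborts; every other rank enters the barrier. */
enum action choose_action (int my_rank, int abort_rank)
{
    if (abort_rank >= 0 && my_rank == abort_rank)
        return ACTION_ABORT;
    return ACTION_BARRIER;
}
```

For example, with `--abort=1` on a two-rank run, rank 1 would take the abort path while rank 0 enters the barrier, which is the situation the abort function is meant to resolve.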
Commit ae4226b
shell/pmi: fix whitespace issues
Problem: the shell pmi plugin includes code that does not conform to project norms. Break long parameter lists to one per line. Indent function parameters to the same level.
Commit 47cb028
Problem: when the broker calls the PMI abort function, the shell PMI plugin logs an exception message that calls out "MPI_Abort()", but MPI is not involved. Log "PMI_Abort()" instead. Fix one test that expected the old message.
Commit 3f29b85
testsuite: cover flux pmi barrier --abort
Problem: there is no coverage for abort in some of the upmi implementations. Add tests.
Commit 74cc37b
flux-pmi(1): document barrier --abort=RANK
Problem: the barrier --abort=RANK option has no documentation. Add it to the man page.
Commit fba4dee