broker: call PMI_Abort() if something goes wrong during PMI bootstrap #6279

Merged: 9 commits, Sep 12, 2024

Commits on Sep 12, 2024

  1. libpmi: add upmi_abort()

    Problem: the UPMI abstract PMI interface does not implement the
    PMI abort function, but this is useful to avoid hangs when something
    goes wrong on one node during PMI bootstrap.
    
    Implement the abort function in the abstract interface and in the
    simple, libpmi, libpmi2, and singleton implementations.
    garlick committed Sep 12, 2024
    f98ae8b
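One way an abstract PMI layer could route an abort call to whichever implementation is active is a table of function pointers. The sketch below is only an illustration under that assumption: `struct pmi_ops`, `struct pmi_client`, `pmi_client_abort()` and the toy `singleton` backend are hypothetical names, not the libpmi code, and the real `upmi_abort()` signature is not shown in this PR summary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical operations table: one instance per backend. */
struct pmi_ops {
    const char *name;
    int (*do_barrier)(void *ctx);
    int (*do_abort)(void *ctx, const char *msg);  /* NULL if unsupported */
};

/* Hypothetical client handle: the active backend plus its private state. */
struct pmi_client {
    const struct pmi_ops *ops;
    void *ctx;
};

/* Dispatch abort to the active backend.
 * Returns -1 if the backend does not provide an abort operation. */
int pmi_client_abort(struct pmi_client *c, const char *msg)
{
    assert(c != NULL && c->ops != NULL);
    if (!c->ops->do_abort)
        return -1;
    return c->ops->do_abort(c->ctx, msg);
}

/* Toy "singleton" backend: there is only one rank, so this version of
 * abort just records that it was called and keeps the message. */
struct singleton_ctx {
    int aborted;
    char last_msg[128];
};

static int singleton_barrier(void *ctx)
{
    (void)ctx;
    return 0;  /* nothing to wait for with a single rank */
}

static int singleton_abort(void *ctx, const char *msg)
{
    struct singleton_ctx *s = ctx;

    s->aborted = 1;
    snprintf(s->last_msg, sizeof(s->last_msg), "%s", msg ? msg : "");
    return 0;
}

const struct pmi_ops singleton_ops = {
    .name = "singleton",
    .do_barrier = singleton_barrier,
    .do_abort = singleton_abort,
};
```

A caller holds a `struct pmi_client` and never needs to know which backend is behind it; adding abort support to another backend only means filling in its `do_abort` slot.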
  2. broker: call upmi_abort() on PMI bootstrap error

    Problem: the instance hangs during startup if a bind address cannot
    be determined.
    
    If a fatal error occurs during PMI bootstrap on some but not all ranks,
    some brokers may block forever in the PMI barrier.
    
    Call the PMI abort function when something goes wrong during PMI bootstrap.
    
    Fixes flux-framework#6278
    garlick committed Sep 12, 2024
    9370400
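The failure mode described in this commit can be sketched as an error path that calls an abort hook before returning, so that peers are torn down rather than left waiting in the barrier for a rank that never arrives. This is a minimal sketch, not the broker's bootstrap code: `struct boot_hooks`, `bootstrap()` and the `fake_*` helpers are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical hooks a broker-like process would use during bootstrap. */
struct boot_hooks {
    int (*get_bind_address)(void *arg, char *buf, size_t len);
    int (*barrier)(void *arg);
    void (*abort_job)(void *arg, const char *msg);
    void *arg;
};

/* Run bootstrap.  On a fatal error, call the abort hook instead of
 * silently returning, since other ranks may already be in the barrier. */
int bootstrap(const struct boot_hooks *h)
{
    char addr[64];

    assert(h != NULL);
    if (h->get_bind_address(h->arg, addr, sizeof(addr)) < 0) {
        h->abort_job(h->arg, "could not determine bind address");
        return -1;
    }
    if (h->barrier(h->arg) < 0) {
        h->abort_job(h->arg, "barrier failed");
        return -1;
    }
    return 0;
}

/* Recording fake, so both paths can be exercised without a PMI server. */
struct fake_pmi {
    int fail_bind;      /* nonzero: simulate "no bind address" */
    int barrier_calls;
    int abort_calls;
};

int fake_get_bind_address(void *arg, char *buf, size_t len)
{
    struct fake_pmi *f = arg;

    if (f->fail_bind)
        return -1;
    snprintf(buf, len, "%s", "tcp://127.0.0.1:*");
    return 0;
}

int fake_barrier(void *arg)
{
    struct fake_pmi *f = arg;

    f->barrier_calls++;
    return 0;
}

void fake_abort(void *arg, const char *msg)
{
    struct fake_pmi *f = arg;

    (void)msg;
    f->abort_calls++;
}
```

With `fail_bind` set, `bootstrap()` returns -1 after exactly one abort call and never enters the barrier; with it clear, the barrier is entered once and abort is never called.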
  3. flux-start: fix whitespace issues

    Problem: some flux-start.c code does not conform to project norms.
    
    Break long parameter lists to one per line.
    garlick committed Sep 12, 2024
    cd92cd3
  4. flux-start: implement PMI abort callback

    Problem: if the flux broker calls the PMI abort function,
    flux-start (the PMI server) is not notified.
    
    Add an abort callback that logs a message and asks all subprocesses
    to terminate immediately.
    
    Update tbon.interface-hint test that was expecting broker to exit 1
    on a bad hint.  The broker is now terminated with SIGKILL.
    
    Update another tbon.interface-hint test that required a ratcheted down
    timeout.  That test now fails immediately.
    garlick committed Sep 12, 2024
    7254e54
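An abort callback that "logs a message and asks all subprocesses to terminate immediately" could look roughly like the following on a POSIX system. This is a sketch under assumptions, not the flux-start implementation: `spawn_children()` and `on_abort()` are hypothetical helpers, and the children merely sleep in place of real brokers.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Start up to n child processes that just sleep (stand-ins for brokers).
 * Returns the number of children actually started. */
int spawn_children(pid_t *pids, int n)
{
    int started = 0;

    assert(pids != NULL);
    for (int i = 0; i < n; i++) {
        pid_t pid = fork();

        if (pid < 0)
            break;
        if (pid == 0) {
            sleep(60);   /* child: idle until killed */
            _exit(0);
        }
        pids[started++] = pid;
    }
    return started;
}

/* Abort callback: log the message, send SIGKILL to every child, then
 * reap them.  Returns how many children were confirmed killed by SIGKILL. */
int on_abort(const char *msg, const pid_t *pids, int n)
{
    int killed = 0;

    assert(pids != NULL);
    fprintf(stderr, "abort: %s\n", msg ? msg : "(no message)");
    for (int i = 0; i < n; i++)
        kill(pids[i], SIGKILL);
    for (int i = 0; i < n; i++) {
        int status;

        if (waitpid(pids[i], &status, 0) == pids[i]
            && WIFSIGNALED(status)
            && WTERMSIG(status) == SIGKILL)
            killed++;
    }
    return killed;
}
```

Because the children die from a signal rather than by calling `exit(1)`, `WIFSIGNALED()` is true and `WIFEXITED()` is false for them, which is consistent with the test change described above (the broker no longer exits 1 on a bad hint).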
  5. flux-pmi: add barrier --abort option

    Problem: there is no way to test the UPMI abort function without
    starting a Flux instance.
    
    Add an --abort=RANK option to 'flux pmi barrier'.
    The specified rank calls the abort function instead of the barrier.
    garlick committed Sep 12, 2024
    ae4226b
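The behavior of `--abort=RANK` (the named rank aborts, every other rank enters the barrier) can be sketched as a small parse-and-decide step. The helpers `parse_abort_option()` and `choose_action()` below are hypothetical and hand-rolled for illustration; how flux-pmi actually parses its options is not shown in this PR summary.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Parse an argument of the form "--abort=RANK".
 * Returns the rank (>= 0), or -1 if arg is not a valid --abort option. */
int parse_abort_option(const char *arg)
{
    const char *prefix = "--abort=";
    size_t plen = strlen(prefix);
    char *end;
    long val;

    assert(arg != NULL);
    if (strncmp(arg, prefix, plen) != 0)
        return -1;
    errno = 0;
    val = strtol(arg + plen, &end, 10);
    if (errno != 0 || end == arg + plen || *end != '\0')
        return -1;   /* overflow, empty value, or trailing junk */
    if (val < 0 || val > INT_MAX)
        return -1;
    return (int)val;
}

enum action { ACTION_BARRIER, ACTION_ABORT };

/* The selected rank aborts; all other ranks (or all ranks, if no
 * abort rank was given) enter the barrier as usual. */
enum action choose_action(int my_rank, int abort_rank)
{
    if (abort_rank >= 0 && my_rank == abort_rank)
        return ACTION_ABORT;
    return ACTION_BARRIER;
}
```

For example, with `--abort=1` in a two-rank run, rank 1 takes the abort path while rank 0 enters the barrier, which is the situation the abort handling in the PMI server has to resolve.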
  6. shell/pmi: fix whitespace issues

    Problem: the shell pmi plugin includes code that does not conform
    to project norms.
    
    Break long parameter lists to one per line.
    Indent function parameters to the same level.
    garlick committed Sep 12, 2024
    47cb028
  7. shell/pmi: fix abort message

    Problem: when the broker calls the PMI abort function, the shell PMI
    plugin logs an exception message that calls out "MPI_Abort()" but MPI
    is not involved.
    
    Log "PMI_Abort()" instead.
    Fix one test that expected the old message.
    garlick committed Sep 12, 2024
    3f29b85
  8. testsuite: cover flux pmi barrier --abort

    Problem: there is no coverage for abort in some of the upmi
    implementations.
    
    Add tests.
    garlick committed Sep 12, 2024
    74cc37b
  9. flux-pmi(1): document barrier --abort=RANK

    Problem: the barrier --abort=RANK option has no documentation.
    
    Add it to the man page.
    garlick committed Sep 12, 2024
    fba4dee