
WeeklyTelcon_20210413

Geoffrey Paulsen edited this page Apr 13, 2021 · 1 revision

Open MPI Weekly Telecon

  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees (on Webex)

  • Austen Lauria (IBM)
  • Brendan Cunningham (Cornelis Networks)
  • Christoph Niethammer (HLRS)
  • Edgar Gabriel (UH)
  • Geoffrey Paulsen (IBM)
  • Harumi Kuno (HPE)
  • Hessam Mirsadeghi (UCX/NVIDIA)
  • Howard Pritchard (LANL)
  • Jeff Squyres (Cisco)
  • Joseph Schuchart
  • Josh Hursey (IBM)
  • Matthew Dosanjh (Sandia)
  • Michael Heinz (Cornelis Networks)
  • Naughton III, Thomas (ORNL)
  • Raghu Raja (AWS)
  • Ralph Castain (Intel)
  • William Zhang (AWS)

Not there today (I keep this for easy cut-n-paste for future notes)

  • Akshay Venkatesh (NVIDIA)
  • Artem Polyakov (NVIDIA/Mellanox)
  • Aurelien Bouteiller (UTK)
  • Brandon Yates (Intel)
  • Brian Barrett (AWS)
  • Charles Shereda (LLNL)
  • David Bernholdt (ORNL)
  • Erik Zeiske
  • Geoffroy Vallee (ARM)
  • George Bosilca (UTK)
  • Joshua Ladd (NVIDIA/Mellanox)
  • Marisa Roman (Cornelis Networks)
  • Mark Allen (IBM)
  • Matias Cabral (Intel)
  • Nathan Hjelm (Google)
  • Noah Evans (Sandia)
  • Scott Breyer (Sandia?)
  • Shintaro Iwasaki
  • Todd Kordenbrock (Sandia)
  • Tomislav Janjusic
  • Xin Zhao (NVIDIA/Mellanox)

New Items

v4.0.x

  • Blocking on PR 8769 issues (see New topics above)
    • Mark commented 5 days ago that it might still not work.
  • OSU with Dynamic window issue 8774
    • Many folks in Israel are out this week.
    • Not a showstopper: triggering it requires specific options, and presumably some variant of UCX will fix it.
    • Same as issue 8212?
      • No, 8774 is different and older.

v4.1.x

  • Taking small PRs
  • Same waiting-mode as v4.0.x

v5.0.0

  • Pushing back the alpha build for v5.0.0 from this Friday to next Friday.
  • Issue 8776 - libevent confusion if running with external 3rd party tools
    • PR 8792 - Need to move this over to v5.0.x
    • Need to check with Brian whether this is relevant on v4.0 or v4.1.
      • Compile with --disable-dlopen, or slurp in all of the plugins.
      • It is a 3-line change, so it should be a small amount of work.
      • Not a linker error; the job just hangs and fails, so we really might want this on v4.0 and v4.1.
  • PR 8799 - should probably be PRed to v5.0
    • Howard is concerned about how these package-specific config lookups feed into the way mpicc links (for example, on Cray).
      • mpicc --showme shows some long dependency lists.
      • Just let him know on the ticket.
      • Howard will update the ticket.
  • Docs - Man pages will be included in this effort.
    • Likely to include nroff and HTML in the tarball (so users don't need Sphinx and don't need internet access).
    • If this doesn't make v5.0.0, it can go into a later release.
  • Packagers need some advice and need a README; that is a few more weeks away at minimum.

Master

  • 8808 - shared memory backing file.
    • What is the failure profile for this?
    • Rare. If two users share a node and a failed job leaves its backing files behind, another user's attempt to create a backing file can conflict with the leftover file. So we add the user ID to the name to give a little more protection against conflicts.
    • This does mean that there is a cleanup issue for shared memory files.
      • The only reason is that the backing file was moved out of /dev/shm.
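
The naming idea discussed above can be sketched as follows. This is only an illustration, not the code in Open MPI: the file name pattern and the helper names `backing_file_path` and `create_backing_file` are hypothetical, and the sketch assumes a POSIX system where `os.getuid()` is available.

```python
import os


def backing_file_path(base_dir, job_id, rank, uid=None):
    """Build a hypothetical backing file path that embeds the user ID.

    The name pattern is an assumption made for illustration only.
    """
    if uid is None:
        uid = os.getuid()
    name = f"shmem_backing.uid{uid}.job{job_id}.rank{rank}"
    return os.path.join(base_dir, name)


def create_backing_file(path, size):
    """Create the backing file exclusively and size it.

    O_EXCL makes creation fail with FileExistsError if a stale file with
    the same name was left behind by an earlier failed job.
    """
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_RDWR, 0o600)
    try:
        os.ftruncate(fd, size)
    finally:
        os.close(fd)
    return path
```

With the user ID in the name, a stale file left by one user no longer collides with a file created by another user on the same node. A stale file left by the same user with the same job ID and rank still collides, which is the cleanup issue noted above.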

Reformatting master

  • Came to the conclusion that we want this as soon as possible.
  • Geoff has a clang-format (v10) format file.
  • Ongoing code changes?
    • git integration.
  • It formats some things in a really ugly way.
    • Struct tables: it munges them together.
    • We could not find a way to say "leave this alone".
    • If we have a CI job checking the format, it will fail on these.
    • Are we going to ask everyone to put this git-config and just deal with it?
    • How do we make this ongoing?
      • Brian has a to-do to make CI fail a PR if it doesn't pass the format check.
      • It just hasn't bubbled up yet.
      • We will want some way to exclude things.
      • Brian can't turn the check on for the repo yet, because the entire repo hasn't been reformatted yet.
  • How will people cleanly format their changes before they put PRs up (so people don't play whack-a-mole)?
  • We can do the reformat before we have this checker.
    • The CI checker will hit some issues.
  • When you do it, do it with a clean clone, and don't init the submodules.
  • From Christoph Niethammer (HLRS) in the meeting chat:
    • Clang-format also understands special comments that switch formatting off in a delimited range. The code between a comment // clang-format off or /* clang-format off */ up to a comment // clang-format on or /* clang-format on */ will not be formatted. The comments themselves will be formatted (aligned) normally. (Source: https://releases.llvm.org/11.1.0/tools/clang/docs/ClangFormatStyleOptions.html)
  • The sessions branch is pretty big, but Howard wants to wait until v5.0.0 has been released for a while.
    • So the plan was to hold off on the rest of the formatting until sessions is rebased, and then format master.
    • Howard is having a few more issues on sessions, so he is okay with us reformatting.
    • It won't merge to Open MPI master until v5.0.0 is at a point where it won't take big PRs.
    • Rebasing this on top of ULFM is also challenging.
    • Probably will do a 2-stage rebase: rebase up to the OPAL reformat, then squash, possibly run clang-format on the sessions branch, and then try to rebase on top of whatever else is on master.
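
To make the "leave this alone" comments from the chat concrete, here is a minimal sketch that reports which lines of a C source string sit between `clang-format off` and `clang-format on` comments. It is not clang-format's own parser. The function name `protected_ranges` and the sample table are made up for illustration, and the regular expressions only approximate how clang-format recognizes the comments.

```python
import re

# Both comment styles mentioned in the clang-format documentation.
OFF = re.compile(r"(//\s*clang-format off\b)|(/\*\s*clang-format off\s*\*/)")
ON = re.compile(r"(//\s*clang-format on\b)|(/\*\s*clang-format on\s*\*/)")


def protected_ranges(source):
    """Return (first, last) 1-based line ranges lying between off/on comments."""
    ranges = []
    start = None
    lines = source.splitlines()
    for number, line in enumerate(lines, start=1):
        if start is None and OFF.search(line):
            start = number + 1  # protection begins on the line after the comment
        elif start is not None and ON.search(line):
            if number - 1 >= start:
                ranges.append((start, number - 1))
            start = None
    # An unterminated "off" protects everything up to the end of the file.
    if start is not None and start <= len(lines):
        ranges.append((start, len(lines)))
    return ranges


# Hypothetical struct table of the kind that gets munged together.
EXAMPLE = """\
/* clang-format off */
static const struct entry table[] = {
    { "alpha",   1,  10 },
    { "beta",   22, 200 },
};
/* clang-format on */
int f(void) { return 0; }
"""

print(protected_ranges(EXAMPLE))
```

A hand-aligned struct table wrapped this way keeps its layout through a reformat, which is one possible answer to the exclusion question raised above.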

PMIx

  • possibly a few weeks out.

PRRTE v2.0

  • Release a few weeks out.

Some outstanding work for the way that OMPI calls PRRTE configure.

  • Also some changes with libcurl, especially since this breaks the OMPI build.
    • PMIx can interface with REST interfaces (using libcurl)
    • JSON
    • Build system issue in PMIx when we changed to static DSOs.
    • Think this has been resolved

Issue 8801 - mpirun and prefixing.

  • rhc has no strong issues either way.
  • We prepend LD_LIBRARY_PATH pointing to the PRRTE installation.
    • At the moment in OMPI, we overlay this with OMPI library location.
    • Seems like the best fix would be to make these two independent.
  • PREFIX - enable prefix by default.
    • In Open MPI, this happens to be the same as the OMPI prefix.
    • But PRRTE does this by default, because we want the daemons to match the commands.
      • OMPI doesn't want to do that, and that's okay.
  • Instead of --enable prefix-by-default we need --enable mpi-prefix-by-default.
  • Looking at it from OMPI perspective
    • If the user asked for prefixing, the user wants prefixing. They don't care whether the prefixes are the same or not; they just want it to work.
    • If the user doesn't want prefixing, then they don't want either prefixing.
      • But a global PRRTE installation might still want to be prefixed.
  • PRRTE will prefix by default
  • What happens when I want MPI libs redirected?
    • Problem is if you build PRRTE INTERNAL, then you can't redirect MPI libraries.
  • Gotta set PATH and LD_LIBRARY_PATH correctly
    • One of those things: --enable-prefix is not the default in versions before v4.0.
  • There are times when we want to redirect OMPI to a different set of libraries.
    • Right now this is decided at configure/compile time, which is problematic: all of the subcomponents have to be redone.
    • What would be nice is if this was at runtime, so that ompi's mpirun can find all of the subcomponents at runtime.
  • Setting LD_LIBRARY_PATH is the way to point to another set of libraries.
    • This breaks because mpirun will overwrite LD_LIBRARY_PATH.
    • Personally doesn't want this as a default.
    • Joseph doesn't want us setting LD_LIBRARY_PATH.
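
The difference between overwriting and prepending LD_LIBRARY_PATH can be sketched as below. This is not how mpirun or PRRTE actually build the environment. The helper name `prepend_path` and both directory paths are hypothetical, and the sketch only shows one way a launcher could add its own library directory without discarding what the user set.

```python
import os


def prepend_path(env, var, new_dir):
    """Return a copy of env with new_dir placed first in var.

    Existing entries are kept, so a user-supplied library directory
    survives instead of being overwritten.
    """
    updated = dict(env)
    existing = [p for p in updated.get(var, "").split(os.pathsep) if p]
    if new_dir in existing:
        existing.remove(new_dir)  # avoid listing the same directory twice
    updated[var] = os.pathsep.join([new_dir] + existing)
    return updated


# Hypothetical paths: the user points at an alternate set of MPI libraries,
# and the launcher adds its own installation directory in front.
user_env = {"LD_LIBRARY_PATH": "/home/user/alt-mpi/lib"}
launch_env = prepend_path(user_env, "LD_LIBRARY_PATH", "/opt/prrte/lib")
print(launch_env["LD_LIBRARY_PATH"])
```

Prepending still gives the launcher's directory priority over the user's entries, so it does not settle the question of whether LD_LIBRARY_PATH should be touched at all.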

MTT

  • Jeff will discuss upgrading gcc with Absoft (need a C11 compiler for 32-bit support)

Open-MPI v5.0

  • PMIx and PRRTE are close to a release candidate.
    • This week (first full week of April)
  • What do we do with the mpirun man page?
    • Didn't want OMPI requiring Sphinx, but if PRRTE and PMIx in same tar
  • Ralph almost has singleton comm spawn working
    • Single node without the mpirun process

Longer Term discussions

Doc update

  • OMPI docs and man pages, but there is a persistent problem that mpirun is really prrterun

  • PR 8329 - convert README, HACKING, and possibly man pages to reStructuredText.

    • Uses https://www.sphinx-doc.org/en/master/ (Python tool, can pip install)
    • Intent this is for v5.0
      • mpirun / prrterun - we had quite a bit of details in orte, but are updating as much as possible.
    • Ralph has asked about this for PMIx/PRRTE since this is turning out to work
  • No update - 3/16

    • Could be independent of PMIx and PRRTE.
    • PMIx and PRRTE want to follow suit, and not require both pandoc and Sphinx.