
WeeklyTelcon_20210209

Geoffrey Paulsen edited this page Feb 10, 2021 · 1 revision

Open MPI Weekly Telecon

  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees (on Web-ex)

  • Jeff Squyres (Cisco)
  • Howard Pritchard (LANL)
  • Ralph Castain (Intel)
  • Geoffrey Paulsen (IBM)
  • Austen Lauria (IBM)
  • Joseph Schuchart
  • Hessam Mirsadeghi (UCX/nVidia)
  • Edgar Gabriel (UH)
  • Brendan Cunningham (Cornelis Networks)
  • Josh Hursey (IBM)
  • Matthew Dosanjh (Sandia)
  • Naughton III, Thomas (ORNL)
  • Raghu Raja (AWS)
  • Todd Kordenbrock (Sandia)
  • William Zhang (AWS)
  • George Bosilca (UTK)
  • Aurelien Bouteiller (UTK)
  • Christoph Niethammer (HLRS)
  • Harumi Kuno (HPE)
  • Brian Barrett (AWS)
  • David Bernholdt (ORNL)
  • Marisa Roman (Cornelis Networks)

not there today (I keep this for easy cut-n-paste for future notes)

  • Joshua Ladd (nVidia/Mellanox)
  • Michael Heinz (Cornelis Networks)
  • Akshay Venkatesh (NVIDIA)
  • Artem Polyakov (nVidia/Mellanox)
  • Brandon Yates (Intel)
  • Charles Shereda (LLNL)
  • Erik Zeiske
  • Geoffroy Vallee (ARM)
  • Mark Allen (IBM)
  • Matias Cabral (Intel)
  • Nathan Hjelm (Google)
  • Noah Evans (Sandia)
  • Scott Breyer (Sandia?)
  • Shintaro Iwasaki
  • Tomislav Janjusic
  • Xin Zhao (nVidia/Mellanox)

New Topics

  • PR 8435 - https://github.com/open-mpi/ompi/pull/8435/files#r570096876
    • Question as to what George was saying.
    • George just saying that MPI already has that info and we don't need to ask PMIx again.
      • Need it in HAN, and if we need it elsewhere, just move to base
    • That being said, George doesn't want it in Tuned at all.
    • It was a mistake that this targeted v4.1 instead of master.
  • UCX Issue 8321
    • We do need to understand what's going on: there were comments saying we should not support anything older than 1.9.0, but then another comment said it's reproducible in 1.9 as well.
    • Is this a UCX problem, or a PML problem?
      • We don't know if it's PML or UCX
  • UCX 1.9.0 + OMPI 4.0.4 - Issue 8442
    • may not be related to Issue 8321
    • We're ready to cut an RC for both 4.1.1 and 4.0.6, these two are blocking.
  • UCX meeting is on Wednesdays
    • Howard may go tomorrow.
    • UCX community didn't like us configuring out, they're looking into
    • It'd be nice to link this to an issue tomorrow.
  • George will look at 8466 (

4.0.x

  • Schedule - If we could get something for Issue 8321, we can do an RC soon.
  • We'll put out 4.0.6rc2 this week, but we'll know more about UCX and maybe 8466.
  • A few changes to PMIx v3.2 series, that we might want in v4.0.6rc2
    • Setting of hostname to NULL to protect against multiple PMIx init calls.

v4.1

  • Some AVX problems: emails were exchanged between Jeff, Brian, and a user, and the user will email the email distro.

  • Will take PMIx stuff from Ralph

  • Will do a v4.1.1 RC

  • Issue 8379 - UCT appears to be default and not UCX

    • Jeff repinged for request
      • Does UCT BTL even get built?
      • Still in discussion in Issue 8102.
        • Common misconception that people can install over an existing install.
  • Might be an older mca component from

  • We had a PR to have a unique signature for each build.

    • If we had this, we could embed the signature in the modules themselves and only open an MCA component if it comes from the same build, which would avoid this issue at runtime.
      • We currently have something for the MCA version, but we never update the MCA version.
      • So maybe we want to add the OMPI version into this MCA version check.
        • But this might not be enough, as recompiles might use a different configure.
        • We need something that identifies the configure itself.
  • 8431 - git commit checks as action.

  • hwloc: are we tracking the usage of the hwloc topology loads?

    • George wants to take a stab at it. Using it in HAN and Treematch
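
The unique build signature idea discussed above could look roughly like the sketch below. It is only an illustration in Python, not Open MPI code: the function names (`build_signature`, `should_open_component`), the choice of SHA-256, and the example version string and configure arguments are all assumptions. The point it shows is that hashing the version together with the configure command line yields a different signature for a rebuild with different configure options, even when the version number is unchanged.

```python
import hashlib
from typing import Iterable


def build_signature(version: str, configure_args: Iterable[str]) -> str:
    """Derive a short signature from the version and the configure arguments.

    Hypothetical scheme: not how Open MPI identifies builds today.
    """
    h = hashlib.sha256()
    h.update(version.encode("utf-8"))
    for arg in configure_args:
        # A NUL separator keeps ["ab", "c"] distinct from ["a", "bc"].
        h.update(b"\0")
        h.update(arg.encode("utf-8"))
    return h.hexdigest()[:16]


def should_open_component(library_sig: str, component_sig: str) -> bool:
    """Only open a component whose signature matches the core library's."""
    return library_sig == component_sig


# Example values are made up for illustration.
core = build_signature("4.1.1", ["--prefix=/opt/ompi", "--with-ucx"])
same_build = build_signature("4.1.1", ["--prefix=/opt/ompi", "--with-ucx"])
stale_build = build_signature("4.1.1", ["--prefix=/opt/ompi"])

print(should_open_component(core, same_build))
print(should_open_component(core, stale_build))
```

Under this assumed scheme, a component left behind by an older install into the same prefix would carry a different signature and would be skipped rather than opened.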

Open-MPI v5.0

  • Setting a Goal to branch for v5.0 on Last working day of February.
    • geoff will send email to devel list
    • No comments other than good to set a goal, and try to make it.
  • Austen created a tab in the google spreadsheets
    • Excluded Mellanox due to MTT issues, Mellanox is passing 0 tests in their MTT.
    • Cisco Turned off Cisco MTT because the testing harness creates an MPI Window, which is failing
      • One-sided: this is a blocker for release
      • Jeff's talking to Nathan every day.
        • Thinks there is an issue
        • Possibly because pt2pt got removed, so possibly just master only.
        • Christoph thinks it's master only, and that v4.0.x and v4.1.x is okay for one-sided issue
      • Might be
    • May be another issue, in that it should be an MPI_Abort, so shouldn't drop core.
  • MTT is showing that the master branch is pretty good. We don't need to wait for PRRTE to be complete to branch v5.0.x in OMPI.
  • Raghu added an entry for libfabric.
  • One-sided tests are still busted. Do we keep running these if they're failing?
    • Nathan is actively working on it, so we're hopeful we'll get this.
  • XL sheet needs to be updated, as most of the stuff for ompio

What's the state of ULFM (PR 7740) for v5.0?

  • PR on ompi-tests public - Aurelien
    • need to configure with ULFM to run the test.
    • If run inside of a slurm job, it should just figure it out.
  • Edgar atomicity issue for OMPIO. Not sure if it's a full feature, but need to have on radar.
    • Not yet resolved.
    • ETA: a few days after Edgar finds time. 2-3 weeks.

New Topics

  • GitHub 8431 - Git commit checks as GitHub Actions
    • Check for bogus emails.
    • Branches will have a flag in a file at a special path: 0 on master, 1 on release branches.
    • Jeff detailed out the cases for the checker.
    • When it's enabled and a commit is not exempted for any reason.
    • Jeff will move from draft to
  • Josh summarized discussion from last week in issue.
  • Anything else Josh needs to implement?
    • No, Josh will get to it before the end of the month, before v5.0 branches.
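
The commit checks described for 8431 could be sketched along these lines. This is a hedged illustration, not the checker in the PR: the list of "bogus" email patterns, the function names, and the assumption that the per-branch flag (0 or 1) simply turns the checks on or off are all made up here, since the notes do not say what the flag controls. It only shows the shape mentioned above: checks apply when enabled and when a commit is not exempted.

```python
import re

# Assumed patterns for "bogus" addresses; the real checker's rules may differ.
BOGUS_EMAIL_PATTERNS = [
    re.compile(r"^root@", re.IGNORECASE),
    re.compile(r"@localhost$", re.IGNORECASE),
    re.compile(r"\.local$", re.IGNORECASE),
    re.compile(r"^[^@]*$"),  # no '@' at all
]


def is_bogus_email(email):
    """Return True if the address matches any assumed bogus pattern."""
    email = email.strip()
    return any(p.search(email) for p in BOGUS_EMAIL_PATTERNS)


def flag_enabled(flag_text):
    """Parse the contents of the (hypothetical) per-branch flag file."""
    return flag_text.strip() == "1"


def check_commits(commits, enabled, exempt_shas=frozenset()):
    """Return (sha, reason) pairs for commits that fail the email check.

    commits is an iterable of (sha, author_email) tuples.
    """
    if not enabled:
        return []
    failures = []
    for sha, email in commits:
        if sha in exempt_shas:
            continue  # exempted commits are never flagged
        if is_bogus_email(email):
            failures.append((sha, "bogus email: " + email))
    return failures


# Made-up commits for illustration.
commits = [
    ("a1", "jane@example.com"),
    ("b2", "root@localhost"),
    ("c3", "dev@host.local"),
]
for sha, reason in check_commits(commits, flag_enabled("1\n")):
    print(sha, reason)
```

In a GitHub Actions job, a script of this kind would typically exit non-zero when the failure list is non-empty so that the check shows as failed on the pull request.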

Doc update

  • PR 8329 - convert README, HACKING, and possibly Manpages to restructured text.
    • Uses https://www.sphinx-doc.org/en/master/ (Python tool, can pip install)
    • The intent is that this is for v5.0
      • mpirun / prrterun - we had quite a bit of details in orte, but are updating as much as possible.
    • Ralph has asked about this for PMIx/PRRTE since this is turning out to work

master configure issue

  • for v5.0 both of these will need to be fixed.
  1. Lustre configure option: Edgar sees it, but has no idea how to fix it.
    • Not sure if he should open an issue. Ralph thinks Gilles fixed it. Edgar will give it a try.
  2. SharedFP component: Edgar opened an issue this morning.
    • Blocker for v5.0

Longer Term discussions

ROMIO Long Term (12/8)

  • What do we want to do about ROMIO in general?
    • OMPIO is the default everywhere.
    • Gilles is saying the changes we made are integration changes.
      • There have been some OMPI specific changes put into ROMIO, meaning upstream maintainers refuse to help us with it.
      • We may be able to work with upstream to make a clear API between the two.
    • As a 3rd party package, should we move it up to the 3rd party packaging area, to be clear that we shouldn't make changes to this area?
  • Need to look at this treematch thing. Upstream package that is now inside of Open-MPI.
  • Might want a CI bot to watch a set of files, and flag PRs that violate principles like this.
  • Putting new tests there
  • Very little there so far, but working on adding some more.
  • Should have some new Sessions tests

What's going to be the state of the SM Cuda BTL and CUDA support in v5.0?

  • What's the general state? Any known issues?

  • AWS would like to get.

  • Josh Ladd - Will take internally to see what they have to say.

  • From nVidia/Mellanox, Cuda Support is through UCX, SM Cuda isn't tested that much.

  • Hessam Mirsadeghi - All CUDA awareness is through UCX

  • May ask George Bosilca about this.

  • Don't want to remove a BTL if someone is interested in it.

  • UCX also supports TCP via CUDA

  • PRRTE CLI on v5.0 will have some GPU functionality that Ralph is working on

  • Update 11/17/2020

    • UTK is interested in this BTL, and maybe others.
    • Still gap in the MTL use-case.
    • nVidia is not maintaining SMCuda anymore. All CUDA support will be through UCX
    • What's the state of the shared memory in the BTL?
      • This is the really old generation Shared Memory. Older than Vader.
    • Was told after a certain point, no more development in SM Cuda.
    • One option might be to
    • Another option might be to bring that SM in SMCuda to Vader(now SM)
  • Discussion on:

    • Didn't get to this week. :(
    • Draft Request Make default static https://github.com/open-mpi/ompi/pull/8132
    • One con is that many providers hard link against libraries, which would then make libmpi dependent on this.
    • Non-homogeneous clusters (GPUs on some nodes, no GPUs on others)

Video Presentation

  • ECP Community days ( March 30-April 1st )
    • David Bernholdt and/or George Bosilca
    • 90-minute time slots each day.
    • Get proposal in by this Friday.