
WeeklyTelcon_20200114

Geoffrey Paulsen edited this page Jan 14, 2020 · 2 revisions

Open MPI Weekly Telecon


  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees (on WebEx)

  • Geoffrey Paulsen (IBM)
  • Jeff Squyres (Cisco)
  • Brian Barrett (AWS)
  • Austen Lauria (IBM)
  • Charles Shereda (LLNL)
  • Edgar Gabriel (UH)
  • David Bernholdt (ORNL)
  • Harumi Kuno (HPE)
  • Howard Pritchard (LANL)
  • Joseph Schuchart
  • Michael Heinz (Intel)
  • Ralph Castain (Intel)
  • Todd Kordenbrock (Sandia)
  • William Zhang (AWS)
  • Thomas Naughton (ORNL)

Not there today (I keep this for easy cut-n-paste for future notes)

  • Noah Evans (Sandia)
  • George Bosilca (UTK)
  • Artem Polyakov (Mellanox)
  • Matthew Dosanjh (Sandia)
  • Brandon Yates (Intel)
  • Erik Zeiske
  • Mark Allen (IBM)
  • Matias Cabral (Intel)
  • Nathan Hjelm (Google)
  • Xin Zhao (Mellanox)
  • mohan (AWS)
  • Akshay Venkatesh (NVIDIA)
  • Josh Hursey (IBM)
  • Joshua Ladd (Mellanox)
  • Brendan Cunningham (Intel)

New Business

  • Brian has a to-do: Coverity coverage for PRRTE

  • PR 6821 - Merged last week.

    • First PR with a submodule. Uses hwloc via submodules.
    • Name of the hwloc component changed to hwloc2 (not hwloc20x)
  • PR 7284 - https://github.com/open-mpi/ompi/pull/7284

    • The common/PML abstraction break might point to other issues.

Release Branches

Review v3.0.x Milestones v3.0.4

Review v3.1.x Milestones v3.1.4

  • 7267 needs a review of the datatype change.
  • Probably won't get the --hostfile fix, so these may be the last releases.
  • Probably make a 3.0 RC in the immediate future.
  • 7276: possibly a configure test for the PMIx warning/error.
    • Jeff needs to review for v3.1.x

Review v4.0.x Milestones v4.0.3

  • PR 7283 - Nathan may have found the issue last night.
    • Trivial fix.
  • We've had an xpmem rcache issue. This would be good to get in.
  • v4.0.3 is in the works.
    • Schedule: End of January.
    • Try to get rc1 built this Friday.
  • If PMIx 3.1.5 is released, we'd like to take that.
  • Push off comm-spawn.
  • Issue 6960 (close) had something cherry-picked to the release branch, but it's still not fixed.
    • Configuring --enable-ipv6 shouldn't preclude IPv4.
    • Do we need to cherry-pick 6964 back into v4.0.x?

v5.0.0

  • Schedule: April 2020?

Face to face

  • It's official! Portland, Oregon, Feb 17, 2020.
  • Please register on the Wiki page, since Jeff has to register you.
  • Date looks good: Feb 17th, right before the MPI Forum.
    • 2pm Monday, and maybe most of Tuesday.
    • Cisco has a Portland facility and is happy to host.
    • About a 20-30 minute drive from the MPI Forum; will probably need a car.

Infrastructure

Review Master Pull Requests

CI status


Dependencies

PMIx Update

  • There may be a new PMIx v3.1.5 in January, which we could pick up for v4.0.3.
    • We'll know next week.

ORTE/PRRTE

  • ORTE-removal/PRRTE PR is ready to be committed.

    • Mellanox CI is still failing on OSHMEM.
    • Hand testing is looking fine.
    • Uses an ORTE parameter, and OSHMEM then fails because a directory doesn't exist or has the wrong permissions.
    • Mellanox said they'd look at it, but they were not on the WebEx today.
  • Some of these things require some thought.

    • Assume we still want to convert as many ORTE parameters as possible.
    • PRRTE doesn't use MCA parameters the way ORTE did.
    • PRRTE is a persistent model, so we can't just set an MCA parameter for all jobs...
    • Ralph is doing mpirun command line parsing conversion.
    • prun is the one-shot mpirun equivalent.
    • We could have an mpirun binary instead of a symlink.
      • Ralph is using the schizo framework to do this work instead.
    • There are many that there
  • Ralph sent email about submodules

    • The issue was that a pull would update the PRRTE repo,
    • but PMIx was still embedded as a tarball.
    • Once Ralph converted PMIx to a submodule, one could then pull on both.
  • Still a bunch of things to do after this PR goes in.

  • Singleton comm-spawn... how do we make this work? - PMIx understands it.

    • Do we need to support singleton comm-spawn starting the PRRTEs?
    • Now that we will support a persistent infrastructure, maybe we just require users to start it first.
  • Address comm-spawn issues that have been raised.

MTT


Back to 2019 WeeklyTelcon-2019
