WeeklyTelcon_20170606

Geoffrey Paulsen edited this page Jan 9, 2018 · 1 revision

Open MPI Weekly Telcon


  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees

  • Geoff Paulsen
  • Edgar Gabriel
  • Artem Polyakov
  • Jeff Squyres (Cisco)
  • Howard Pritchard
  • Josh Hursey
  • Joshua Ladd
  • Mohan
  • Todd Kordenbrock
  • David Bernholdt
  • Nathan Hjelm
  • Ralph
  • Brian Barrett (Amazon)
  • Geoffroy Vallee
  • Mark Allen (IBM)
  • Sylvain Jeaugey
  • Thomas Naughton

Agenda

  • Discuss Progression Issue 3616
    • openib progression issue.
    • Nathan will try to look at this during the week.
    • Not sure where we might block in the callbacks.
    • George is reworking the progression model; since a new model is coming, an ugly stopgap solution is enough for now.
    • Hit this with a non-contiguous one-sided put that goes down into openib via osc_rdma. The accumulate wants to trigger another callback, followed by a barrier to get the timing right.
    • Nathan thinks he can make the unlock of the accumulate lock non-blocking.
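
The hazard discussed above (a callback that blocks while waiting on work that only the progress loop can deliver) can be illustrated with a toy model. This is a minimal sketch in Python and is not Open MPI code: `ProgressEngine`, `try_unlock`, `deliver_ack`, and the `state` dictionary are hypothetical names invented for illustration, and the actual osc_rdma/openib code paths are not described in these notes. The sketch assumes a single-threaded progress loop in which the unlock depends on an acknowledgement that is delivered by another queued callback.

```python
from collections import deque


class ProgressEngine:
    """Toy single-threaded progress loop (hypothetical, not Open MPI code)."""

    def __init__(self):
        self.pending = deque()

    def post(self, callback):
        """Queue a callback; it returns True when done, False to be retried."""
        self.pending.append(callback)

    def progress(self):
        """Run each currently queued callback once; re-queue unfinished ones."""
        for _ in range(len(self.pending)):
            callback = self.pending.popleft()
            if not callback():
                self.pending.append(callback)
        return len(self.pending)


# Assumed shared state: the unlock can only complete after an ack arrives.
state = {"ack_received": False, "unlocked": False}


def try_unlock():
    # Non-blocking: if the ack is not here yet, return and let the engine
    # retry later. Spinning here instead would hang, because deliver_ack
    # can only run once this callback has returned.
    if not state["ack_received"]:
        return False
    state["unlocked"] = True
    return True


def deliver_ack():
    state["ack_received"] = True
    return True


engine = ProgressEngine()
engine.post(try_unlock)   # queued ahead of the ack it depends on
engine.post(deliver_ack)

passes = 0
while engine.progress():  # keep progressing until nothing is pending
    passes += 1
```

In this model the unlock completes on the second pass through the loop, after the acknowledgement callback has had a chance to run. A blocking version of `try_unlock` would never return, so the acknowledgement would never be delivered.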

2.0.3

  • Released June 1st.
  • No driver for a v2.0.4 at this time.

Review v2.x

  • v2.1.1 went out in May.
  • No driver for v2.1.2 at this time.
  • Planning to do a v3.0 RC today, but there are lots of failures in the nightly MTT runs.
    • Cisco killed a number of runs and will restart them.
  • Datatype and Info Key type errors out of IBM tests.
  • Amazon failures are false positives: they direct launch, and dynamic processes are not supported under direct launch.
  • Howard sent out a request for NEWS updates.
  • One additional PMIx issue.
    • Threading issue in ORTE, OPAL, and PMIx, reported by IBM.
    • Some confusion over whether we have the assembly backwards in PMIx 2.0 (Nathan or George).
    • Nathan can take a look when he gets into the office today.
    • Only seen evidence in PMIx.
    • Issue is in PMIx: https://github.com/pmix/pmix/issues/347
  • Ralph will sync up with Brian and Howard at end of day to hear the status of this issue for the v3.0 RC.

  • Some 'make check' errors have been fixed; others are still being seen.
    • IBM still seeing a hang in 'make check' - must be ppc64le specific. No timeout.
  • 32-bit compiler issues are fixed by the PMIx fix.
  • Geoffroy Vallee - still seeing some problems disabling make check.
  • MPI_Sendrecv_replace - got fixed.
  • Timeouts are all CUDA related (NVIDIA).
    • Still there.
  • Issue: using Red Hat's stock autoconf (rather than building our own).
  • Need a maintainer for rankfile mapper.
    • IBM will take up maintaining rankfile mapper from Ralph.

MTT Dev status:

  • Intel is making lots of progress. Nice features, but not sure how to make the transition.
  • .ini files would need to be converted, because the Python client doesn't support funclets.
  • Does everyone have to transition on the same day, or can the transition happen one by one?
    • Everyone can transition in their own time.

Exceptional topics

  • Face2Face Meeting-2017-07
    • Date: July 11-13 (9am Tuesday to noon on Thursday).
    • Cisco has booked space in Chicago.
      • Cisco has reserved some space right next to O'Hare (can get a shuttle to the hotel).
        • We have met there before.
      • Jeff will come in Monday evening.

Status Updates:

  • Amazon - bringing much more testing online, and CI processes.
    • v3.0.0 Release work
    • Improved Jenkins infrastructure. Hopefully some changes yesterday (in Jenkins setup at Amazon) will make it run a little faster.
  • Travis is now officially deactivated. No longer using Travis.

Status Update Rotation

  1. Amazon
  2. Cisco, ORNL, UTK, NVIDIA
  3. Mellanox, Sandia, Intel
  4. LANL, Houston, IBM, Fujitsu

