
WeeklyTelcon_20161206

Geoffrey Paulsen edited this page Jan 9, 2018 · 1 revision

Open MPI Weekly Telcon


  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees

  • Geoff Paulsen
  • Artem Polyakov
  • Jeff Squyres
  • Karen (Executive Director - Software Freedom Conservancy)
  • Josh Hursey
  • Ralph
  • Todd Kordenbrock (HPE @ Sandia)
  • Sylvain Jeaugey (NVIDIA)

Agenda

Conservancy (Karen)

  • Introductions.
  • Do you have a feel for whether the Conservancy would want to invite us?
    • Yes, the application looks good, and Karen would expect that they would invite us.
  • We have an invite from SPI that expires in January, which drives the Open MPI timetable.
    • Karen will try to speed up an invite, so we know whether we have one.
  • SPI and Conservancy are different organizations, as they have different functions.
    • Chances are we'll need one or the other, but we need to consider what we actually need.
  • Differences:
    • SPI is a different legal model: projects just affiliate with SPI, which holds funds and disburses them through a grant-making process (case by case). It's a loose affiliation.
    • SPI is more bare-bones; it will do finances, but that is not its main push.
    • Conservancy has paid staff, but SPI is run by volunteers.
    • Organizations become a part of Conservancy. The project has a legal status. Conservancy can then execute contracts for them.
      • They have hired contractors and do the paperwork.
    • Conservancy takes 10% from projects (which doesn't cover 1% of Conservancy costs), so it has to do extra fundraising.
  • corporate members -
    • Require projects to establish governance mechanisms.
    • We need an official body in place to
  • Decided we don't need different legal protections to safeguard different projects from one another.
    • Charity, and unlikely target for lawsuit.
    • If we want to sign a contract for a venue, they make sure that the subproject has the funds to pay even if no one comes to the venue.
  • No fundraising requirements on member projects.
    • They just help projects come up with ways to fund themselves. Examples: Inkscape, Twisted, PyPy, Selenium.
  • Software licensing
    • Available if there are problems.
    • Some copyleft projects ask for help with enforcement.
    • Some have had to relicense.
    • A lot of licensing expertise.
    • Have helped with attribution requirements of permissive licensing.
    • They require that the software is "free" and "open" based on ____ definitions.

Review 1.10

  • Milestones: https://github.com/open-mpi/ompi-release/milestones/v1.10.5
    • 10 open PRs on 1.10.5: 2 approved, 7 review required, and 1 pushed back. (The review display was newly changed in GitHub; look closely under the topic, which should say whether the PR has been approved.)
    • The ones that are approved are urgent.
      • Schedule a release of 1.10.5 in January.
    • Nathan is looking at a segv that occurs in PSM2, but not PSM. He will create an issue after reproducing it.
    • This is not the known issue with PSM2; it is something about an interrupt handler.

Review 2.0.x

  • Wiki: https://github.com/open-mpi/ompi/wiki/Releasev20

  • Blocker Issues: https://github.com/open-mpi/ompi/issues?utf8=%E2%9C%93&q=is%3Aopen+milestone%3Av2.0.0+label%3Ablocker

  • Known / ongoing issues to discuss

    • Signal: a report that procs are getting killed with SIGKILL without first being hit by SIGTERM.
      • Ralph fixed, but then requirements changed.
    • Other problem: this revealed the real problem.
      • We issue a SIGTERM, then go into a timer event that issues the SIGKILL.
      • mpirun and the daemon exit and leave the event library, so the timer event never fires to finish cleanup.
      • This is why we're getting zombies. Not trivial to solve. Ralph is going to work on it some more this week.
      • Is this a blocker? Yes, because remote procs will now only get a SIGTERM, but not a SIGKILL.
      • A regression from 1.8 behavior.
      • Ralph has an idea for a simpler fix, should have it today.
    • Any other blockers for 2.0.2?
      • New Issue 2505: osc_pt2pt wrong answer. A pretty vanilla one-sided test case.
      • Blocker: HColl Context Free (PR on 1.10.5, but Mellanox will PR to 2.0.x in the next 2 days)
        • Still not in master; wondering if there is a reason. No one at Mellanox has merged it into master.
        • Artem will talk to Josh about pulling it into master.
    • Gilles pushed a bunch to master, and there was a question of whether it was an accident.
    • Question: an accidental push to master is possible. We may want to look at the direction of going through PRs.
      • Add to face to face agenda.
  • Next week SPI director will be on call.

    • If people are not testing with PMIx Async modex + drop through barriers, maybe they should.
      • For libraries that want all endpoints in Init, using PMIx_Dstore shows a 15% improvement.
    • Collect the data fixed in master.
    • Mellanox is testing 2.1. Until this fix comes in, collect the data. With UCX, any back-end would work,
      • because the first message will block until the endpoint is available.
  • PMIx update

    • A couple of outstanding issues with the dstore.
      • Performance on the Power architecture.
      • Should help memory footprint at scale.
    • Hope to roll a new 1.2 RC2 by Friday.
    • Will update PMIx and Open MPI master.
    • On track for January Open MPI v2.1? PMIx and integrated with embedded.
    • Josh and Artem feel like mid-January for PMIx 1.2 plus integration in Open MPI v2.1.0.
  • OMPI 2.1

    • THE blocking issue is PMIx.
    • The BSD patcher - Nathan's been asked to work on it. Graceful fail is fine.
    • Fuzzy estimate: end of January.
  • Milestones: https://github.com/open-mpi/ompi-release/milestones/v2.0.0
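
For readers unfamiliar with the SIGTERM-then-SIGKILL escalation discussed under the signal issue above, here is a minimal sketch of the general pattern. It is not Open MPI's implementation: it is written in Python for brevity, it uses a blocking wait with a timeout where the notes describe an event-library timer, and the function name `terminate_with_escalation` and the `grace_seconds` parameter are hypothetical. The point it illustrates is the one from the notes: whoever sends the SIGTERM has to stay alive until the grace period ends, otherwise the SIGKILL is never sent and the child is never reaped.

```python
import signal
import subprocess


def terminate_with_escalation(proc, grace_seconds=2.0):
    """Send SIGTERM to a child, then SIGKILL if it is still alive after a grace period.

    `proc` is a subprocess.Popen object. Returns the list of signals sent.
    Hypothetical helper for illustration only (POSIX semantics assumed).
    """
    sent = []
    if proc.poll() is not None:
        return sent  # child already exited; nothing to send

    proc.send_signal(signal.SIGTERM)
    sent.append(signal.SIGTERM)
    try:
        # The parent stays in this wait for the whole grace period.
        # Exiting here instead is the analogue of leaving the event
        # loop before the timer fires: the SIGKILL would never happen.
        proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()  # sends SIGKILL on POSIX
        sent.append(signal.SIGKILL)
        proc.wait()  # reap the child so it does not linger as a zombie
    return sent
```

A caller would pass the `Popen` handle of each local child, for example `terminate_with_escalation(child, grace_seconds=5.0)`. A child that ignores SIGTERM ends up receiving both signals, while a well-behaved child only receives the first.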

Review Master MTT testing (https://mtt.open-mpi.org/)

  • Still no morning messages. Need to pester Brian about it. Apparently not allowed to make changes until after the new year.
    • Mail from our AWS instance is not getting to us.
  • Biggest failures we saw in 2.0.x and 2.1.x
    • OSHMEM: the BTL fix fixed a bunch of things, but there are still a few errors (segv; Put or Get on a location that is not registered).
      • Jeff will make a ticket for the few remaining OSHMEM failures.
  • Sylvain is seeing a bunch of errors in the master oob/ud components
    • Mostly timeouts. Not sure if they are hanging or just really slow.
  • Josh - turned on Jenkins testing at IBM, may result in timeouts. Using PGI on PPC64.

MTT Dev status:

  • Put up a PR for combinatorial executor. Still a bug in submitter.

  • Telcon tomorrow.

  • Face to Face in January - https://github.com/open-mpi/ompi/wiki/Meeting-2017-01

  • SC BOF

    • Should we do 2.2 or 3.0? Poll to the community.
      • 87% said go for 3.0.
    • Went way too long
    • Bad time slot (not sure why): we only had half of the people we normally do.
  • PMIx update - Decided to do a PMIx 2.0 release (what was going to be PMIx 3.0) - January time frame.

  • libevent update - they have put out an RC for 2.1.7 (OMPI 2.x is on libevent 2.0)

    • 2 years of code changes, though most are not in our usage path.
    • Still some somewhat scary changes in the main path, so we need to test well and evaluate before adding to OMPI 2.x.
    • There is an external component for libevent, so there is that option.

Status Updates:


Status Update Rotation

  1. Cisco, ORNL, UTK, NVIDIA
  2. Mellanox, Sandia, Intel
  3. LANL, Houston, IBM

Back to 2016 WeeklyTelcon-2016
