
WeeklyTelcon_20171017

Geoffrey Paulsen edited this page Jan 9, 2018 · 1 revision

Open MPI Weekly Telcon


  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees

  • Geoff Paulsen (IBM)
  • Jeff Squyres
  • Edgar Gabriel
  • Josh Hursey
  • Joshua Ladd
  • Mohan
  • Ralph
  • Thomas Naughton
  • Artem Polyakov
  • Todd Kordenbrock
  • Brian
  • David Bernholdt
  • Geoffroy Vallee
  • George
  • Howard

Agenda

  • String pool framework - Issue 4283
    • Someone incorrectly freed a hostname string.
    • One proposal: strdup() the hostname, which is done in many places.
    • Another proposal: a string pool framework to reference-count strings.
    • PMIx already has the ability to store strings and return a pointer to them; others must not free that pointer.
      • With PMIx shared memory, the string is provided over shmem; without shared memory, each process has one copy.
    • We've survived for years without this. Seems to be just a single issue.
    • Sounds like it might not be applicable across the rest of the code base.
    • PMIx pages could be made read-only.
    • Overengineering? Perhaps we should add valgrind tests instead.

Review v2.0.x Milestones v2.0.4

  • Two issues left:
    • Let's make configure fail when compiling against an external hwloc that is 2.0.0 or later on OMPI v2.0.x and v2.1.x.
    • Bring PMIx v1.2.4 back to v2.0.x?? Basically bug fixes.
      • Most value is the automated 1.13 patch (helps FreeBSD or something?)
  • Issue: DDT is broken on v2.x - asking whether IBM already resolved this internally, and if so, whether we could get that fix back.
  • Schedule: if we get PRs in today, we should aim for a v2.0.x release NEXT week.

Review v2.x Milestones v2.1.2

  • v2.1.3 (unscheduled, but probably Jan 19, 2018)

    • PR 4172 - a mix of feature and bugfix.
  • Are we going to do anything for v2.x for hwloc 2?

    • At least put in a configure error if an external hwloc v2.x is detected.
  • hwloc is about to release v2.0

    • If topology info comes in from outside, which hwloc version was the resource manager using?
    • Is the XML annotated with which version of hwloc generated it?
    • Would be nice to fail gracefully, since the XML is fairly opaque.
    • Seems like we'll need a Rosetta stone for translating between hwloc versions.
    • hwloc is a static framework.
    • Brice is going to get hwloc 2.0 out by Supercomputing, but it might be tight.
    • Are we comfortable releasing with an alpha/beta version of hwloc embedded?
      • Jeff: at a minimum, we should embed a beta-quality version of hwloc.
    • OMPI v2.x will not work with hwloc 2.0, because of changed APIs.
      • May want some configure errors (not in there yet).
    • v3.0 only works with older, pre-2.0 hwloc. In v3.0.x, if hwloc 2.0 is detected, we error out at configure.
    • In the v3.1 branch, an external hwloc can be either hwloc 2.0 or an older hwloc, but you must decide at build time.
    • Still have to run v3.1 everywhere.
    • Do we want to backport the hwloc 2.0 support to v3.0?
      • Since we're closing the door on v1.x and v2.x, that might be a good support statement.

Review v3.0.x Milestones v3.0

  • v3.0.1 - Opened the branch for bugfixes Sep 18th.
    • Still targeting End of October for release of v3.0.1
    • Everything ready to push has been pushed.
    • A few PRs still need review.
  • Schedule:
    • Originally was scheduled for this week, but:
    • Edgar has two open Issues, both fairly important:
      • PR to master - already pending.
      • NFS problem reported on the mailing list - coded, but not yet tested, and more worrying. (Issues 4346, 4334)
  • Issue 3904 - the only milestone issue for v3.0.x, filed by Edgar.
    • Thought this was merged to v3.0.x already.
  • Iterating a bit on PR 4249 (disabling CUDA inside of hwloc) on this branch.
    • Issue 4248 - disabling CUDA in hwloc.
    • On all existing release branches, configure hwloc with CUDA disabled.
    • Merged into v2.x but not v3.0.x yet.

Review v3.1.x Milestones [v3.1](https://github.com/open-mpi/ompi/milestone/27)

  • v3.1.x - currently has hwloc 2.0 alpha

    • Could roll back to hwloc 1.11.7 - it has some perf issues on KNL.
    • Could delay the v3.1.x release to mid-December.
    • Could ship hwloc 1.11.7 by default, but also ship an hwloc 2.0 alpha component that would have to be explicitly requested at configure time.
    • Some strong objections to shipping other parties' alpha/non-released software.
    • Could support this with an external component, plus a blurb in the README describing the new feature and how to use the external component.
      • This would keep the hwloc 2.0 enhancements in OMPI, but back the embedded hwloc version down to v1.11.7.
    • Making a new component in v3.1 and backing the version down to v1.11.7 - Brian will own this (thanks).
  • Schedule - still plan to do a v3.1 release before Supercomputing.

  • v3.1.x Snapshots are not getting posted.

    • Has to do with cron failures - mail went to ompi_team. cron on gater; the nightly cronjob is sync.py.
      • Ralph is forwarding to Brian.
    • This is causing the nightly MTT runs not to happen.
    • Brian didn't get the cron-failure emails.
  • Add v3.1 to MTT tests

    • Database is active now to accept v3.1 tests.
  • Last week MTT disk filled up.

  • PMIx 2.1 should make it in time for v3.1

    • In master, but no PR to OMPI v3.1.x yet, since PMIx 2.1 hasn't been released yet.
    • Still intending to ship OMPI v3.1.0 with PMIx 2.1; the backup plan is PMIx v2.0 (there now).
  • Administration

    • Revised Bylaws:
      • Rewording "group" and "community" to be more explicit, reflecting the involvement level of developers and of contributors in other ways (like providing resources, etc.).
      • Helps support those who support us.
      • Voting via reply in email is sufficient.
  • After the Bylaws pass, will nominate for formal membership.

Review Master Pull Requests


MTT / Jenkins Testing Dev

This Week's Discussion Points

  • Website - openmpi.org
    • Brian is trying to make things more automated, so people can check out the repo, etc. The repo is TOO large.
    • The majority of the problem is the tarballs, and we already store those in S3.

Oldest PR

Oldest Issue

Next face-to-face meeting

  • Jan / Feb
  • Possible locations: San Jose, Portland, Albuquerque, Dallas

Status Updates:

Status Update Rotation

  1. Mellanox, Sandia, Intel
  2. LANL, Houston, IBM, Fujitsu
  3. Amazon
  4. Cisco, ORNL, UTK, NVIDIA

Back to 2017 WeeklyTelcon-2017
