WeeklyTelcon_20180109

Geoffrey Paulsen edited this page Jan 15, 2019 · 1 revision

Open MPI Weekly Telecon


  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees

--- Will fill out as meeting starts

  • Geoff Paulsen
  • Brian
  • Howard Pritchard
  • Artem Polyakov
  • Josh Hursey
  • David Bernholdt
  • Edgar Gabriel
  • Geoffroy Vallee
  • Josh Ladd
  • Mathew
  • Ralph
  • Todd Kordenbrock
  • Nathan Hjelm
  • Thomas Naughton

Agenda

  • Jan 9th:
    • Decided last week to push the face-to-face meeting date to late Feb or March.
    • Discuss abandoning openib btl.
      • Want Chelsio and NVIDIA to be part of the discussion.
      • Abandoning it might leave some people behind.
      • Nathan has a UCX BTL
      • Nathan would like this for general sanity.
      • In what situations does openib give you functionality that you don't get from libfabric or UCX?
        • iWARP.
      • GPU support - UCX is working on this, but it's not ready.
      • Issue - no one wants to support the code.
      • Send an email saying that openib is approaching end of life, and ask for someone to step up and support it.
      • We care until UCX supports GPGPU (soon).
      • People are still using it, probably because they haven't
      • Idea towards a deprecation path.
        • Turn it off by default in non-supported paths, etc.
      • Summary - leave it in there for now, with some bandaids for now, and
    • Should the next release be v3.2 or v4.0?
      • Discussion in PR4401
      • History:
        • Right after v3.0 branched, an ABI break came into master.
        • We then merged the ABI break into v3.0, but never downgraded master back to v3.x.
      • Need to go back and audit to see if there are any ABI breaks in master.
        • Pretty sure there are no ABI breaks between v3.1 and master.
      • PROCESS question - Historically, when we rev MAJOR numbers, we've also revved ALL of the shared libraries.
      • Decided we'll keep it at v4.0, but not break .so versioning unless the audit determines it's needed, on a library-by-library basis.
      • Fortran change - Function prototypes - These couldn't have possibly worked, so treat it like NO ABI change.
        • fix and doc, but not
      • OSHMEM - pull requests against v3.1, and v3.0 as well.
        • Okay to go into v4.0, don't PR back to v3.x releases.
      • Process going forward -
        • ADD / REMOVE / CHANGED - should make it easier for release managers.
        • VERY painful for Release Managers to go through logs to determine if shared library changed.
    • Test infrastructure
      • Mellanox Jenkins is failing on every PR. What are we going to do about this? Is it real or not?
        • Can reproduce with a multithreaded test.
        • Looks like it's hanging with openib and TCP (not yalla); Jenkins says Vader hangs.
        • Sometimes the infrastructure uses the timeout command; other times ompi_timeout happens.
      • One segfault due to vader, rest due to timeout.
      • Forget that segfault is possibly due to timeout command.
      • Multithreaded test-11bw ??
      • looks like an atomic issue (both multi-threaded TCP and Vader).
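
The ADD / REMOVE / CHANGED item under "Process going forward" lines up with the shared-library version update rules in the GNU libtool manual (current:revision:age). The sketch below shows how such labels could drive a version bump. It assumes the labels map directly onto those rules, which the minutes do not confirm; `LibtoolVersion`, `bump`, and the version numbers are hypothetical names and values used only for illustration.

```python
from typing import NamedTuple


class LibtoolVersion(NamedTuple):
    """A libtool current:revision:age triple (hypothetical helper type)."""
    current: int
    revision: int
    age: int


def bump(version, *, code_changed=False, added=False, removed=False, changed=False):
    """Apply the libtool manual's update rules for one release.

    added / removed / changed describe public interfaces since the last
    release; code_changed covers any other source change.
    """
    current, revision, age = version
    if not (code_changed or added or removed or changed):
        return version  # nothing changed, keep the version as-is

    # Any source change at all bumps the revision.
    revision += 1

    if added or removed or changed:
        # Any interface change starts a new "current" and resets revision.
        current += 1
        revision = 0
        if added:
            age += 1  # still backward compatible with one more older interface
        if removed or changed:
            age = 0   # backward compatibility is broken

    return LibtoolVersion(current, revision, age)


if __name__ == "__main__":
    v = LibtoolVersion(40, 1, 0)  # hypothetical starting version
    print("bug fix only:     ", bump(v, code_changed=True))
    print("interface added:  ", bump(v, added=True))
    print("interface removed:", bump(v, removed=True))
```

If each PR carried one of these labels, a release manager could fold them per library instead of reading through commit logs, and only a REMOVE or CHANGED label would reset the age and break .so compatibility.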

Minutes

Review v2.x Milestones v2.1.2

  • Shooting for Release on Jan 19th, and RC later this week.
  • Issue 4682 - just an error-checking enhancement, so it shouldn't go into the release branch.
  • Issues 4336 and 4453 - Edgar to backport a few PRs:
    • PR4454 - backport fixes in v3.x branch to v2.x branch.
    • PR4351 - memory consumption

Review v3.0.x Milestones v3.0

  • Schedule: RC2
    • On 3.x series trying to cut RCs on nightly tarballs.
    • Will
  • Duped issue: Mpool init hang AND Current blocker: Hang on ARM in v3.0.x
    • Only hangs in debug. Bad, but not ship-stopper.
    • Doesn't happen in optimized mode
    • Issue 4563 - not seeing on little arm boxes here, Jenkins uses --disable-builtin-atomics.
      • Because when we disable atomics on powerpc, compiler thinks we have cmp-set128.
      • On arm uses old-school lock-based lifo and fifo.
    • Fix being worked in PR3988 - bug in PGI compiler
  • Issue 4509 madvise hook
    • Jeff and Howard will discuss.
    • Now that we hook madvise, we need to be more careful.
    • Nathan hopes his PR 4576 on master will reduce the occurrences to 0, but we need the user to verify.
      • May have to invalidate a LARGE region, even though it's mostly valid, just because glibc invalidated a small part of it.
    • Tested PR 4576 in master last week.
      • Still need to merge into v2.x, v3.0.x and v3.1.x
  • Do we need to Pull PR 4628 into v3.0.x?
    • Broken in v3.0.0 and later, but it's just launch performance, not a hang.
    • Decided NOT to block v3.0.1 for this; fix it in v3.0.2.

Review v3.1.x Milestones v3.1

  • SCHEDULE: Like to get out in late January
  • BLOCKER:
    • 4605 - update PMIx to v2.1.0 - Just a refresh of that directory.
    • 4523 - OSC monitoring component when portals is configured.
  • Issue 2168 - hasn't that been resolved? Brian will link and close.
  • Brian will Issue 4303

Review Master Master Pull Requests

Process

  • When your PR has been accepted into a release branch, please go to the issue, and remove the target of the release branch that it was just merged into. Attempting to automate this in the future.

MTT / Jenkins Testing Dev

Oldest PR

Oldest Issue

Next face-to-face meeting

  • Pushed date to late Feb or March.

Status Updates:

Status Update Rotation

  1. Mellanox, Sandia, Intel
  2. LANL, Houston, IBM, Fujitsu
  3. Amazon,
  4. Cisco, ORNL, UTK, NVIDIA

Back to 2017 WeeklyTelcon-2017