
Meeting 2021 07_Minutes


Open MPI Developer's July 2021 Virtual Meeting

(due to COVID-19, this will be virtual instead of a face-to-face meeting)

The meeting dates were determined by greatest availability. See the Doodle poll: https://doodle.com/poll/rd7szze3agmyq4m5?utm_source=poll&utm_medium=link

Meeting dates:

  • Thursday, July 22, 2021
    • 12-2pm US Pacific time.
    • 3-5pm US Eastern time.
    • 8-10pm GMT
  • Thursday, July 29, 2021
    • 8-10am US Pacific
    • 11am-1pm US Eastern
    • 4-6pm GMT

Attendance

Austen Lauria (IBM), Brendan Cunningham (Cornelis), Brian Barrett (AWS, July 22 only), Gengbin Zheng (Intel), Geoff Paulsen (IBM), Howard Pritchard (LANL), Jeff Squyres (Cisco), Josh Hursey (IBM), Michael Heinz (Cornelis), Nathan Hjelm (Google, July 22 only), Nysal (IBM), Raghu Raja (Enfabrica), Ralph Castain (Nanook, July 22 only), Samuel Gutierrez (LANL), Srirpaul, Thomas Naughton (ORNL), Todd Kordenbrock (HPE/SNL), William Zhang (AWS, July 22 only?)

Agenda items

Please add Agenda items we need to discuss here.

  • MPI 4.0 Compliance.

    • MPI-4 stuff we already have (either partially or completely - from the MPI-4.0 doc changelog):

      • 7: Persistent collectives
        • Gilles has draft PR 9065
          • [Geoff will take a look]
        • Fujitsu implemented this as MPIX_; it has been moved to MPI_ on master
        • [ACTION: Geoff to ensure this is on v5.0.x]
          • [Done] Yes, All made it to v5.0.x branch
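        • For orientation, the persistent-collective pattern standardized in MPI-4.0 looks roughly like the minimal sketch below (initialize once, then repeatedly Start/Wait); the buffers and iteration count are illustrative only:

          #include <mpi.h>

          int main(int argc, char **argv) {
              MPI_Init(&argc, &argv);
              double in = 1.0, out = 0.0;
              MPI_Request req;
              /* Set up the persistent collective once... */
              MPI_Allreduce_init(&in, &out, 1, MPI_DOUBLE, MPI_SUM,
                                 MPI_COMM_WORLD, MPI_INFO_NULL, &req);
              for (int i = 0; i < 10; i++) {
                  MPI_Start(&req);                   /* ...then start it... */
                  MPI_Wait(&req, MPI_STATUS_IGNORE); /* ...and complete it, repeatedly */
              }
              MPI_Request_free(&req);
              MPI_Finalize();
              return 0;
          }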
      • 24: Sessions https://github.com/open-mpi/ompi/pull/9097
        • Status: Howard is planning to squash many of the commits down (especially newest ones)
        • Has been making progress in hpc/ompi fork
        • Main thing that needs to be addressed before merging...
          • Vader was not written (at least the fast boxes) such that, when it is closed, it pushes all of its outstanding traffic to the wire.
          • When a Session finalize happens right after a reduce, we get segvs, since the fastboxes aren't flushing all buffers before closing / freeing.
        • About one month out, assuming the Vader issue isn't too much work.
        • Got rid of some topo stuff that was not included in MPI 4 standard.
          • Only MPI4 standard stuff, not other pieces.
        • Nathan rewired the way Finalize works, so there are lots of necessary changes to the way OMPI works.
      • Jeff asked a question around Session Init / Finalize - how this interacts with PMIx Init/ Finalize.
        • Howard has been testing. Found some issues and fixed.
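        • For reference, a minimal sketch of the MPI-4.0 sessions flow the PR implements (the "mpi://WORLD" process set is standard; the tag string is illustrative):

          #include <mpi.h>

          int main(void) {
              MPI_Session session;
              MPI_Group group;
              MPI_Comm comm;

              MPI_Session_init(MPI_INFO_NULL, MPI_ERRORS_RETURN, &session);
              /* Derive a group from the built-in "world" process set */
              MPI_Group_from_session_pset(session, "mpi://WORLD", &group);
              /* All participants must pass the same tag string */
              MPI_Comm_create_from_group(group, "org.open-mpi.example",
                                         MPI_INFO_NULL, MPI_ERRORS_RETURN, &comm);
              MPI_Group_free(&group);

              /* ... normal communication on comm ... */

              MPI_Comm_free(&comm);
              MPI_Session_finalize(&session);
              return 0;
          }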
    • MPI-4 stuff no one is working on yet (from the MPI-4.0 doc changelog):

      • This is a list of changes from the MPI-4.0 standard doc that need code changes; this is what Jeff thought might need changes.

      • Using MPI-4.0 Tag on github issues

      • Some PRs have already gone in.

      • 3: Embiggened bindings [Issue 9194]

        • Good beginner project, since it is almost all bindings work.
        • Extra bonus points: can reuse the MPI Forum pythonization effort's work.
        • MPICH did this.
          • They wrote their own Python scraping code a year ago.
          • But at the MPI Forum level, they're generating Python code to dump JSON, and we can just USE this.
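        • For context, the "embiggened" bindings are the new MPI_Count ("_c") variants; a minimal sketch (the function and buffer names here are illustrative):

          #include <mpi.h>

          /* MPI-4.0 adds "_c" variants that take MPI_Count instead of int,
             so counts larger than INT_MAX can be expressed directly. */
          void send_big(const char *buf, MPI_Count nbytes, int dest, MPI_Comm comm) {
              MPI_Send_c(buf, nbytes, MPI_BYTE, dest, 0, comm);
          }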
      • 4: error handling (is this all done by the recent UT/FT work?)

      • 6: MPI_ISENDRECV and MPI_ISENDRECV_REPLACE [Issue 9193]
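        • For reference, the new call combines a nonblocking send and receive under a single request; a minimal sketch (peer rank, tags, and buffers are illustrative):

          #include <mpi.h>

          /* Exchange one int with a peer without blocking in between. */
          void exchange(int *sendval, int *recvval, int peer, MPI_Comm comm) {
              MPI_Request req;
              MPI_Isendrecv(sendval, 1, MPI_INT, peer, 0,
                            recvval, 1, MPI_INT, peer, 0,
                            comm, &req);
              MPI_Wait(&req, MPI_STATUS_IGNORE);
          }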

      • 9: Partitioned communication

        • PR 8641 - Got merged in this summer. Is it complete?
          • YES in terms of MPI 4.0
        • There were some changes late in the MPI-4.0 standard; Howard is pinging on the PR to ask whether it covers those.
        • Also progress changes that came out of this.
          • Whenever the user dips into MPI, is it correct to expect progress to occur???
          • Maybe it's the MPI_Parrived function - see the text there.
          • Any call into the MPI library (e.g., MPI_Info_set) should now cause progress.
          • There's now an implication of progress - came out of discussion.
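        • For orientation, the sender side of an MPI-4.0 partitioned send looks roughly like this sketch (partition count, element count, and buffer are illustrative):

          #include <mpi.h>

          #define PARTS 4
          #define COUNT 1024   /* elements per partition */

          void partitioned_send(double *buf, int dest, MPI_Comm comm) {
              MPI_Request req;
              MPI_Psend_init(buf, PARTS, COUNT, MPI_DOUBLE, dest, 0,
                             comm, MPI_INFO_NULL, &req);
              MPI_Start(&req);
              for (int p = 0; p < PARTS; p++) {
                  /* ... fill partition p of buf ... */
                  MPI_Pready(p, req);   /* mark partition p ready to send */
              }
              MPI_Wait(&req, MPI_STATUS_IGNORE);
              /* the receiver side uses MPI_Precv_init and can poll MPI_Parrived */
              MPI_Request_free(&req);
          }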
      • 10+11: MPI_COMM_TYPE_HW_[UN]GUIDED

        • Nothing yet. Anyone working on a prototype? Didn't they have a prototype written on top of MPI ?
        • Jeff will ping author, Howard will open an issue
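        • For reference, the guided split is driven by an info key; a minimal sketch (the "mpi_hw_resource_type" key is from the standard, while the "NUMANode" value is an illustrative, largely implementation-defined resource name):

          #include <mpi.h>

          void split_by_numa(MPI_Comm comm, MPI_Comm *newcomm) {
              MPI_Info info;
              MPI_Info_create(&info);
              /* resource-type values beyond "mpi_shared_memory" are implementation-defined */
              MPI_Info_set(info, "mpi_hw_resource_type", "NUMANode");
              MPI_Comm_split_type(comm, MPI_COMM_TYPE_HW_GUIDED, 0, info, newcomm);
              MPI_Info_free(&info);
          }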
      • 12: Update COMM_[I]DUP w.r.t. info propagation

        • Someone needs to check
      • 13: MPI_COMM_IDUP_WITH_INFO

        • Someone needs to check
      • 15: new info hints

        • If you search for the MPI-4.0 tag (including closed issues), you'll see code for this.
        • Might still need some work to propagate to network backends.
          • OFI can take advantage of these hints.
      • 16: updated semantics of MPI_*_[GET|SET]_INFO [Issue 9192]

        • The OPAL info layer may already support this; it just needs to get into the MPI layer.
      • 17: update MPI_DIMS_CREATE (this might be done?)

      • 18: alignment requirements for windows

      • 21: MPI_ERR_PROC_ABORTED error class (was this added by UT/FT work?)

      • 22: Add MPI_INFO_GET_STRING [Issue 9192]

      • 23: Deprecate MPI_INFO_GET[_VALUELEN] [Issue 9192]

      • 25: Add MPI_INFO_CREATE_ENV [Issue 9192]
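        • For reference, items 22/23/25 above mostly amount to new or replacement info accessors; a minimal sketch (the "command" key is one of the reserved MPI_INFO_ENV keys):

          #include <mpi.h>
          #include <stdio.h>

          int main(int argc, char **argv) {
              MPI_Info info;
              char value[256];
              int buflen = sizeof(value), flag;

              MPI_Init(&argc, &argv);
              MPI_Info_create_env(argc, argv, &info);   /* new in MPI-4.0 */
              /* MPI_Info_get_string replaces the deprecated
                 MPI_Info_get / MPI_Info_get_valuelen pair */
              MPI_Info_get_string(info, "command", &buflen, value, &flag);
              if (flag)
                  printf("command = %s\n", value);
              MPI_Info_free(&info);
              MPI_Finalize();
              return 0;
          }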

      • 26: Error reporting before INIT / after FINALIZE (was this added by UT/FT work?)

      • 27: Updated error handling (was this done by UT/FT work?)

      • 28: Updated semantics in MPI_WIN_ALLOCATE_SHARED

      • 29: Audit F08 binding for MPI_STATUS_SET_CANCELED

        • We should be good for this, based on looking at the .h.in code.
      • 30: Add MPI_T_* callbacks

        • A large PR 8057 that's still waiting on reviews.
        • Has conflicts, needs rebasing as well.
      • 32: Audit: have MPI_T functions return MPI_ERR_INVALID_INDEX instead of MPI_ERR_INVALID_ITEM

      • 33: Deprecate MPI_SIZEOF

        • Is there a Fortran pragma to mark something as deprecated?
        • Would we be happy if it only worked for certain compilers?
          • Yes.
      • We can NOW use projects to manage issues.

  • [Howard] PMIx Event handling - which events do we want to handle?

    • We don't have a default error handler (blocker) - could result in hung procs. Actually, in the Sessions PR https://github.com/open-mpi/ompi/pull/9097 we do have a default error handler.
    • In Sessions, left in ULFM proc_error_abort code, the test for ___ didn't work. He added a single process callback, and a second callback for o
    • Might be something about the error handler that PMIx is calling, goes into an opal_event_list or something else.
    • Places we could do better (instead of tearing down the job) - ASPIRE
    • 3 types of event handlers (in priority order)
      1. single event handlers (look first)
      2. multi-code handlers (register several event codes with one handler in a single call)
      3. Default handler (pass in NULL), meaning all events will use this handler.
    • As soon as a single event handler handles an event, PMIx doesn't call the more generic handlers.
    • Can specify which processes use which handler.
    • Can also specify to NOT use default handler.
      • ULFM code specified NOT to use the default handler, so ULFM needs its own event handling.
    • LOST_CONNECTION event handler is missing on master.
      • This is also blocked from going to default handler.
      • This is what allows processes to kill themselves when job is terminated by scheduler.
    • Will need a PR before v5.0.0 [ACTION create a blocking issue for v5.0 to TRACK]
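    • For reference, a minimal sketch of registering a PMIx default event handler (signatures per the PMIx standard headers; the handler body is illustrative, and a single-code handler would pass an array of status codes instead of NULL/0):

      #include <pmix.h>

      /* Invoked for any event not consumed by a more specific handler. */
      static void default_evt_cb(size_t evhdlr_registration_id, pmix_status_t status,
                                 const pmix_proc_t *source,
                                 pmix_info_t info[], size_t ninfo,
                                 pmix_info_t results[], size_t nresults,
                                 pmix_event_notification_cbfunc_fn_t cbfunc,
                                 void *cbdata) {
          /* ... decide what to do (e.g., clean shutdown on job termination) ... */
          if (NULL != cbfunc) {
              cbfunc(PMIX_EVENT_ACTION_COMPLETE, NULL, 0, NULL, NULL, cbdata);
          }
      }

      void register_default_handler(void) {
          /* NULL codes / 0 ncodes == default handler for all events */
          PMIx_Register_event_handler(NULL, 0, NULL, 0,
                                      default_evt_cb, NULL, NULL);
      }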
  • [Jeff] Uniform application of OFI and UCX component selection mechanisms

    • What is the strategy that should be used for OFI and UCX components?

    • E.g., https://github.com/open-mpi/ompi/issues/9123

      • Summary: User builds a "universal" Open MPI to use across several different clusters, including support for both OFI and UCX.
      • Somehow the Wrong thing is happening by default in a UCX-based cluster
    • Action: Let's review what the current OFI / UCX selection mechanisms are

    • @rhc54's proposal for fabric selection: https://github.com/open-mpi/ompi/issues/9123#issuecomment-877824063

    • @jjhursey + @jsquyres proposal from 1+ year ago: mpirun --net TYPE where TYPE is defined in a text config file somewhere (i.e., customizable by the sysadmin), and basically gets transmogrified into a set of MCA params+values. E.g., mpirun --net TCP or mpirun --net UCX-TCP pulls the definition of those TYPEs from a config file containing:

      # Definitions for mpirun --net TYPE
      UCX-TCP = -mca pml ucx -x UCX_.._TLS tcp
      TCP = -mca pml ob1 -mca btl tcp,vader,self
      
      • Could easily amend: use --net CLI option as the highest priority, and then take info from PMIx as fallback if user CLI option is not specified.
      • Issue: Do need to keep this simple enough to implement in a reasonable amount of time.
      • What if we add this to mpirun (that turns around and calls prun/prterun)?
        • That approach wouldn't work for non mpirun schedulers.
      • WHY would we do this when the lower level already has selection mechanisms?
        • We don't like telling customers to configure OMPI and also configure the lower levels.
        • Intel MPI today has something like this, but it doesn't always do it "correctly", which is very confusing.
        • But this is duplicating behavior, and if we don't duplicate perfectly, we're causing confusion.
      • We should go by the "90%" rule. Most users will just want something simple (ex: "Use OFI"), some of these other items are for "advanced users"
    • @bosilca's thought that we should be using */mca/common/* more (e.g., have multiple components of the same network type share common functionality for selection)

    • @rhc54: Word of caution. This agenda item conflates two issues - the default selection of what should happen and the user-facing method of informing OMPI on what to do. We should resolve the default selection problem first and separately as this is what is causing the current problems. Default selection must work correctly whether you use "mpirun" or "srun" or "prun" (with a PRRTE DVM), and it should work correctly out-of-the-box without anyone (user or sysadmin) providing parameters.

    • A customer is trying to build a "full featured" Open MPI with both libfabric and UCX support.

      • Trying to use UCX, but something in libfabric segved.
      • Defaults, user settings, and priorities were all designed around the concept of "one and only one right answer". There wasn't any ambiguity, but that environment has changed.
      • libfabric and UCX are both marching towards having a superset of networks they support.
      • There are "better" choices based on performance.
    • Defaults are hard to define and hard to get correct. But even if we had a magical oracle to figure that out, we'd still have some issues between PMLs, MTLs, and BTLs around initialization.

    • Will need to solve this Centrally rather than spread around the code-base.

    • What does it mean to say "OFI" (which component should be selected?)

    • "Use TCP" - goes down a rat's nest of which component to use.

    • Because of how we've implemented One-Sided, the BTLs are integral to how we do this.

      • Deprecating the One-Sided component makes the BTLs integral (Still correct decision tho)
    • Don't know if we want to do device IDs...

      • If we want to do something "Linux-Only" we could iterate over devices.
      • Don't want to have to update some table every-time a new device is released.
      • Vendor-IDs might be sufficient
    • Don't want separate components having separate device trees. Would rather have a centralized location with a single ordering.

    • Jeff advocating for Framework with multiple device-components.

      • Brian argued against, since those would then need to be prioritized, and it would mean many very small components.
    • Can't call lower level init calls (very heavy and other problems)

    • C or Text file for mapping... not sure yet.

      • XML is a pain to parse
    • Might want some wildcarding (everything from Vendor X do Y)

      • What is the output of this? List of components to use?
      • Maybe something like a global mca var to say "This is what should be used"
      • If output ends up with more than one thing, we're in an ambiguous state, so punt to user.
    • This has always been a NICE to have, but now we REALLY NEED IT.

    • Needs to be easy to maintain over time; vendors will want to update this over time.

    • Selection of all components will influence the choice of PMLs, BTLs, and MTLs.

    • Need to keep this scalable. Don't want all nodes/procs reading from hwloc tree during init

      • PMIx does this once and puts it in shared memory.
      • CH4 - look first from PMIx, if can't get it then fallback.
      • PMIx already knows what fabrics are present, so could pass in Vendor IDs of NICs found.
      • either look at hwloc ourselves or get from PMIx (both?)
        • If we get in an ambiguous situation abort and ask user??
        • Probably the right thing to do, please tell us X
    • Really trying to prevent initialization.

      • There are cases you'll see a device, but it's in an uninitialized state.
    • When we have a new NIC from a new vendor and don't know what to do.

      • Really bad for customers, particularly because customers are stuck on older Open MPIs because their ISVs haven't done a new build.
      • We'd have to backport new vendor IDs to very old OMPIs, because OMPI changes its ABI every few years.
        • This argues for Text file, so vendors can update and so could admins.
    • What if we put it in mca param file? mpirun (and pmix/srun) sends that everywhere.

      • Two different formats in same text file? That's gross, but we do that now???
    • CAN we express this information in a NOT HORRIBLE to read text file?

      • Could be a second file that PMIx forwards everywhere.
      • mca params gets expressed as env vars to application.
      • Other items get expressed as PMIx key/value pair.
    • Any prioritization info here too?

      • No, just identification. If there's multiple matches, we go to ambiguous state.
    • How are we going to get this DONE?

    • What is going in this file?

      • Vendor and Part IDs and map this to a string?
      • Does new part force a new release for us?
    • If PMIx is already giving this info, why would we do this ourselves?

      • Jeff and Ralph will put together a proposal
    • Linux ONLY?

      • MacOS wants TCP
      • BSD as well.
    • General scheme:

      • OMPI will get this "identification" information (PMIx/hwloc)
        • We will get a list of strings back; if we get 2 or more, that's ambiguous, and we ask the user.
        • Get 1, use it.
        • Get 0: use TCP?
          • NO! That means you have a new Part.
          • Maybe just roll the dice on OFI / UCX?
      • Probably can't enumerate all ethernet cards in a sane way.
      • Don't forget about RoCE. :)
      • EFA is same VendorID but different DeviceID (for eth vs RDMA 'modes')
      • A Linux-only solution could use the TCP/IP "netdev" device names. The mechanisms are different, but we can figure it out on macOS as well.
    • Stepping back, vendors who have a preference between PMLs - that is REALLY the first-level problem.

      • EFA IDs have a big range, so we can glob these.
      • Interesting if Mellanox has something similar.
      • After this, does it really matter? Since we've made a UCX/libfabric decision, lower level decisions are handled by the component.
    • When there are 0 UCX or OFI devices, do we then fall down to the BTLs?

      • Yes.
    • Some discussion about a "network" framework to order items.

      • Could be the single static mca component framework
      • That would be a bad abstraction break everywhere.
      • How does OPAL decide provider selection? How does it know whether it needs tag matching or not?
      • In many ways this is worse with OPAL/MPI split.
    • The VendorID (PCI ID) is the same for Mellanox RoCE vs RDMA cards.

    • Update - Jeff and Ralph have discussed for 2 hours, and have a few half-baked ideas, they'll bring back to community "soon"

  • July 29 [Josh] MPIR-Shim CI testing

    • No MPIR support in Open MPI v5.0.x
    • So, the MPIR shim provides MPIR for tools, and converts it to the PMIx tools interface
    • https://github.com/openpmix/pmix-tests/pull/101
    • The SHIM also has its own CI for anyone who makes changes to the SHIM
    • CI has been running well. Part of CI for PMIx, so we'll know if this breaks.
      • There are tests in pmix-tests in addition to ibm tests
      • HUGE step up rather than waiting for customer to report that the tools are broken.
      • Tests launch and attach cases.
      • It's one of the existing ibm test cases, so will be included
    • HOPE that we won't have to make more changes to SHIM (should be stable)
      • But because PMIx Tools interface changes, might need SHIM changes.
    • Open MPI v5.0.0 docs need to document this well
    • How to Integrate
      • This is a separate github repository in the openpmix ORG
      • User has to build it and user has to explicitly call it
      • Now that Open MPI has its own mpirun, we COULD link it in. Then if you just mpirun a.out the magic is "just there"
        • Also have this standalone mode, where you mpirshim mpirun
        • MPIR does not have a direct launch mechanism, only indirect mechanism
        • What if we don't include it in Open MPI packaging, but include configure option --path-to-mpir-shim
          • External dependency, Don't include the code at all.
      • The practical side is that no tools that we know of use the PMIx tools interface.
      • Options:
        1. Do nothing, and the shim is a completely separate thing
        2. Do not bundle shim package, but we provide a configure option to search for it, and if found, link it into mpirun
        3. We bundle the shim package, don't build it by default, on config flag build it into mpirun and command line stays the same
        4. We bundle the shim package, and autobuild it and mpirun command line stays the same
        • Is there an element of PMIx versioning that makes one of these better?
          • This is a real problem, since Totalview did this, but pmix tools interface changed underneath it.
          • This argues for NOT bundling it with Open MPI
          • COULD make configury more complex.
        • Leaning towards number 2, providing that the mpir shim is using a compatible PMIx library.
          • Nice thing about this, is a step towards transparency, but without us taking on all of the due-diligence we'd want to do.
          • We could add it to end of configure output, and show up in ompi_info output.
        • Number 2 along with a good doc will go a long way.
      • Library interface is VERY simple, so might need to be enhanced.
        • Few hours of work.
        • HELP NEEDED for v5.0.0 !!!
    • Who owns MPIR shim?
      • openpmix is not interested in supporting a legacy interface.
      • They are still willing to host the repo
      • LICENSE? - BSD similar to Open MPI and PMIx
      • Does Open MPI want to own it?
      • Good reasons not to host in Open MPI organization
      • Those products who use it, should watch the mpir shim, and help with issues that come up.
      • Could add mpir-shim testing to Open MPI CI (or MTT)
  • July 29 [no particular owner] Plans for better support for GPU offload - do we have any? How important is this to our users?

    • Intel has been working on Intel GPU support in OFI MTL
      • Intel has it mostly working for contiguous buffers, except for a GPU buffer lookup caching mechanism.
    • There is a draft PR about this that abstracts at the datatype level (it is CUDA-only)
      • Look at converter code, and may be able to abstract that out.
      • Intel [Gengbin Zheng] is interested in this code, and in abstracting it out to work for Intel GPUs as well.
    • Consider posting a patch, to solicit input.
      • George Bosilca is the primary owner, and it would be good to get input from him.
    • It's unclear who the OFI MTL maintainers are in Open MPI
      • It's a collective effort.
    • Schedule?
      • What's v5.0 timeframe? maybe fall? Unsure
      • CUDA support is only in OFI MTL, and crashes without PR 8822
      • Could get aligned with CUDA support in MTL by fall, but not sure about BTL support.
      • If the BTL doesn't support it, does it fall back? Yes, to a copy mechanism in OB1?
      • Can provide OSU benchmark with GPU extension testing.
  • [Ralph] Overhaul "mpirun --help"

    • Follow the Hydra model of a high-level help and then help per option (i.e., "mpirun --map-by --help")
    • Ralph not here July 29th to discuss further.
    • More online docs would be great.
    • A little more complex because of Prrte side of it.
  • Discussed MPI Language bindings

  • [Brian] Open MPI ABI stability

    • Our lack of a stable ABI across versions is very problematic for customers.

    • Customers who are still stuck on Open MPI 1.10.

    • If we go this route, what about the myriad of sub-libraries?

      • But users don't link against these sub-libraries.
      • One version of linker on Linux.
      • We bumped the non-backward-compatible library version (GitHub issue; Debian brought it up). We broke ABI for OPAL, but not MPI. We versioned correctly, but this still broke the application; they had to relink the application.
    • Might want to squash all libraries down.

    • Should add tests for this.

    • Doing this will break ABI again. So do we want this for v5.0.0?

    • ldd of the app (which recurses) will show libopen-pal and libmpi

      • Looking at elf, still just libmpi.so (which links against libopal)
      • So today, just need to keep libmpi.so
    • But we still need to go back and study the Debian issue; we probably don't have as big of a problem as we did.

    • Where are we on "ABI only applies to MPI, not internals"? Is this still true?

      • We could make this true.... Two parts of what we mean by ABI.
        • ABI of the library is whatever's "public". Other libraries only bump version when subset of all "public"
      • Fortran links against both, so it's okay.
      • Fortran calls some non-public APIs, so we'd need to version these too.
      • Do you support using a libFortran from build A, and libmpi from build B where they might be different.
        • If the answer is NO, then we're okay
      • Could implement a runtime check fairly trivially.
        • But would have to be a library constructor since can't verify which is called first.
        • Possibly easier ways than this.
      • Fortran Precompiled Mod files.
        • So ABI from our point of view, is if they use the same Fortran compiler, we guarantee.
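      • A hypothetical sketch of that runtime check as a library constructor (the OMPI_EXPECTED_LIBMPI_VERSION macro and the idea of baking the build's version into the Fortran library are assumptions, not existing Open MPI code; MPI_Get_library_version itself is callable before MPI_Init):

        #include <mpi.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>

        /* Assumed to be defined at build time of the Fortran support library. */
        #define OMPI_EXPECTED_LIBMPI_VERSION "Open MPI v5.0.0"

        __attribute__((constructor))
        static void ompi_fortran_abi_check(void) {
            char version[MPI_MAX_LIBRARY_VERSION_STRING];
            int len = 0;
            MPI_Get_library_version(version, &len);
            if (NULL == strstr(version, OMPI_EXPECTED_LIBMPI_VERSION)) {
                fprintf(stderr, "Fortran library expects %s but libmpi reports: %s\n",
                        OMPI_EXPECTED_LIBMPI_VERSION, version);
                abort();
            }
        }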
    • Probably not a rush to fix for v5.0.0, since this problem has been and will continue to be an issue.

    • Probably don't need to fold libopen-pal into libmpi, since apps just link against libmpi, and it links against libopen-pal

      • [Action someone needs to go re-review Debian issue]
    • Does ANYONE still use OPAL directly?

    • If MPI 5.0 standard is going to standardize ABI on MPICH ABI, we might not need to do much.

      • https://github.com/cea-hpc/wi4mpi
      • BUILD project also referenced.
      • Also an issue because some container architectures are trying to pull MPI from outside of the container.
    • NOTES FROM AFTER THE MEETING:

      • Brian + Jeff met to discuss ABI issues.

      • We verified that since Open MPI v4.0.x, mpicc and friends just do -lmpi (and Fortran libraries). Meaning: they do not specifically -lopen-pal.

        • ldd of an MPI app executable will somewhat-confusingly show you libopen-pal, but that's because ldd chases down the recursion.
      • Hence, we think that ABI actually isn't an issue for v5.0 and beyond.

      • Specifically, Brian did some testing. We think we're ok with OPAL ABI versioning issues, at least on modern linux. Brian tried Amazon Linux 2, RHEL 8, SLES 15, and Ubuntu 20.04, and all passed this test:

        • Build Open MPI master (c:r:a == 0:0:0 for all libraries) and install
        • Build MPI application against installation
        • Bump Open PAL's c:r:a to 1:0:0 and rebuild / install
        • Run mpi application built against original libmpi against new libmpi, verify it runs
      • As expected, ldd caught the change in library version number after the c:r:a change, so the right things happened.

      • Bottom line: I think we're ok with our current situation; no need to try and do the one big library thing (e.g., for v5.0.0), at least not right now.

    • From July 29

      • How to TEST this?
      • Some tests to compile with version X and run with version Y
      • Also some tests to dump sizeof external MPI structures and compare.
      • Goal initially should be not to break the ABI.
      • Don't need to have them be CALLED; just need the runtime linker to resolve the symbols.
      • Could auto-generate this test from the Language bindings.
        • For C, mpif.h, "use mpi", and "use mpi_f08"
        • Geoff to write up the issue. Good starter task for someone new to MPI.
        • NEED THIS.
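        • A minimal sketch of what such a generated test could look like for C: take the address of each MPI symbol (no calls needed, but the runtime linker still has to resolve them) and dump the sizes of externally visible structures; the symbol list below is just a tiny illustrative subset:

          #include <mpi.h>
          #include <stdio.h>

          /* Taking addresses forces the dynamic linker to resolve the symbols. */
          static void *symbols[] = {
              (void *)MPI_Init,
              (void *)MPI_Send,
              (void *)MPI_Recv,
              (void *)MPI_Reduce,
              (void *)MPI_Comm_rank,
              (void *)MPI_Finalize,
          };

          int main(void) {
              printf("resolved %zu symbols\n", sizeof(symbols) / sizeof(symbols[0]));
              /* Compare externally visible struct sizes across versions. */
              printf("sizeof(MPI_Status) = %zu\n", sizeof(MPI_Status));
              return 0;
          }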
      • Open MPI v5.0 will have one set of MPI APIs
        • Lets say that in v5.1 we add some more MPI 4.0 stuff
        • So would need to be able to exclude these from ABI.
        • But will want to check this new stuff in future.
        • To be solved in future.
      • It's going to be very hard to promise much for Fortran module files
        • This is because Fortran compilers often change the format of their module files, which are compiler-version dependent
        • We should be very careful to test Fortran module files with the exact Fortran compiler version.
      • Nysal posted this to chat:
        • glibc uses this https://lvc.github.io/abi-compliance-checker/
        • This looks pretty great. It might prevent us from needing to write this test with calls to every MPI function and allocations of every type of struct.
        • This supports just C / C++, so might still need to cover Fortran with some tests.
      • We've been talking about this in terms of an Open MPI ABI, but this might come up at the MPI Forum, and it would be either our ABI or MPICH's.
        • What would we do? Might be a massive change to our source base