sessions: add support for ucx and more #12723
Conversation
@rhc54 could you check the PMIx usage here?

Running AWS CI
Hmmm, that coredump is coming from prrte.
```diff
-rc = PMIx_Group_construct(tag, procs, proc_count, &pinfo, 1, &results, &nresults);
-PMIX_INFO_DESTRUCT(&pinfo);
+rc = PMIx_Group_construct(tag, procs, proc_count, pinfo, ninfo, &results, &nresults);
+PMIX_DATA_ARRAY_DESTRUCT(&darray);
```
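For context, a minimal sketch of the new call shape: the single `pmix_info_t` is replaced by an array of directives, one of which (`PMIX_GROUP_INFO`) carries a data array of info to be exchanged during the group construct. The variable names, the `size_t` local cid, and the exact choice of attributes are assumptions for illustration, not the PR's verbatim code:

```c
#include <pmix.h>

/* Sketch: assemble directives for PMIx_Group_construct.  The local-cid
 * type and the particular attributes used here are assumptions. */
static pmix_status_t construct_group(const char *tag,
                                     const pmix_proc_t *procs,
                                     size_t proc_count)
{
    pmix_info_t *pinfo = NULL, *results = NULL;
    pmix_data_array_t darray;
    size_t ninfo = 2, nresults = 0;
    size_t my_local_cid = 42;          /* cid this process selected */
    pmix_status_t rc;

    /* group info to be shared with all members during construct */
    PMIX_DATA_ARRAY_CONSTRUCT(&darray, 1, PMIX_INFO);
    pmix_info_t *grpinfo = (pmix_info_t *)darray.array;
    PMIX_INFO_LOAD(&grpinfo[0], PMIX_GROUP_LOCAL_CID, &my_local_cid, PMIX_SIZE);

    PMIX_INFO_CREATE(pinfo, ninfo);
    /* ask the RTE to assign a context id for the group */
    PMIX_INFO_LOAD(&pinfo[0], PMIX_GROUP_ASSIGN_CONTEXT_ID, NULL, PMIX_BOOL);
    PMIX_INFO_LOAD(&pinfo[1], PMIX_GROUP_INFO, &darray, PMIX_DATA_ARRAY);

    rc = PMIx_Group_construct(tag, procs, proc_count, pinfo, ninfo,
                              &results, &nresults);

    PMIX_DATA_ARRAY_DESTRUCT(&darray);
    PMIX_INFO_FREE(pinfo, ninfo);
    if (NULL != results) {
        PMIX_INFO_FREE(results, nresults);
    }
    return rc;
}
```

`PMIX_INFO_LOAD` copies the data array into the directive, which is why the diff above can destruct `darray` right after the call returns.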
There must be some optimizations behind this centralized group creation for it to scale. I would love to hear more about it.
Depends on the RTE, of course, as all PMIx does is construct the request and pass it up. For PRRTE, it is just the tree-based allgather we use for pretty much all collectives. I don't believe anyone has exerted much effort towards optimizing it.
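To make "tree-based allgather" concrete, here is a schematic of the pattern (an illustration only, not PRRTE source): each daemon sends its contribution up a binary tree to the root, which then broadcasts the aggregate back down the same tree.

```c
#include <stdio.h>

/* Schematic of a tree-based allgather: contributions flow up to the
 * root, the aggregate flows back down.  Not PRRTE code. */
static int parent_of(int rank)   { return (rank - 1) / 2; }
static int left_child(int rank)  { return 2 * rank + 1; }
static int right_child(int rank) { return 2 * rank + 2; }

int main(void)
{
    const int ndaemons = 8;
    for (int r = 0; r < ndaemons; r++) {
        printf("daemon %d: parent %d, children %d %d\n",
               r, r == 0 ? -1 : parent_of(r),
               left_child(r)  < ndaemons ? left_child(r)  : -1,
               right_child(r) < ndaemons ? right_child(r) : -1);
    }
    return 0;
}
```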
That's exactly what I'm concerned about, as I have seen no performance report on this change.
Not surprising - we know that the group construct code in PRRTE master has problems. I've been working on it and am almost done with the fix, but am getting pulled in multiple directions right now. So it is sitting off to the side for now. I'll try to look through the PMIx stuff as time permits.
I did some digging into this and it turns out that Open MPI is feeding a corrupted proc list into the PMIx group construct call.
You mean one proc in the array has a bad nspace? Yeah, I'm not sure how we check for a bad string, though we do know the max size of the nspace - so we should not be charging off into nowhere land.
A namespace sanity check in the PRRTE group construct may have caught this, but I'm not sure.
Force-pushed from 7f17065 to 23865b7
FYI AWS CI passed for this PR

@wenduwan could you rerun AWS CI?
Force-pushed from fbe5c47 to e7ef1be
@wenduwan when you have a chance, could you check the changes per your comments?
@janjust please review - especially the UCX part - when you get a chance.
My comments have been addressed. Also passed AWS CI.
It took me a while to understand the change, but overall I think this is a good and necessary move. Two comments:
- I would move it up from the PML comm into the OMPI comm, and then it will be automatic for all MTL/PML;
- I would be interested in seeing some performance numbers at scale (we now have a critical component depending on software we have no control over).
The dependency has been true for a very long time, actually - OMPI doesn't run at all without PMIx and hasn't for years. In particular for this PR, PMIx has been in the path for connect/accept for years, and the collective used here is the same as the collective currently in that path. The only difference being gradually introduced is to replace the multiple calls to PMIx functions combined with a bunch of MPI ops with a single call to the PMIx group op. So you should eventually see a performance improvement, although it may take another PR step to get it as I withdrew the PR that consolidated to a single op. This just takes an initial step in that direction.

As for control - OMPI continues to have (in some opinions, an undue) significant influence over PMIx. As noted elsewhere, the only real "disconnect" has been the lack of code contribution by OMPI members, which means that things like faster "allgather" in PRRTE don't get implemented - not due to refusal, but just due to lack of contribution.
Just for clarification - we only use the […]
```diff
 ((((uint64_t) (_tag) ) << (PML_UCX_RANK_BITS + PML_UCX_CONTEXT_BITS)) | \
  (((uint64_t)(_comm)->c_my_rank ) << PML_UCX_CONTEXT_BITS) | \
- ((uint64_t)(_comm)->c_index))
+ ((uint64_t)(_c_index)))
```
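For readers following along, a self-contained sketch of the packing this macro performs, assuming the 24/20/20 tag/rank/context bit split implied by the discussion (the helper name and the assert are illustrative, not from the PR):

```c
#include <assert.h>
#include <stdint.h>

#define PML_UCX_TAG_BITS     24  /* assumed split; see the 20-bit discussion below */
#define PML_UCX_RANK_BITS    20
#define PML_UCX_CONTEXT_BITS 20

/* Pack (tag, rank, communicator index) into a 64-bit UCP tag. */
static inline uint64_t pml_ucx_make_tag(int tag, int rank, uint32_t c_index)
{
    /* if c_index needs more than 20 bits, it spills into the rank
     * field - exactly the overflow concern raised below */
    assert(c_index < (1u << PML_UCX_CONTEXT_BITS));
    return (((uint64_t)tag)  << (PML_UCX_RANK_BITS + PML_UCX_CONTEXT_BITS)) |
           (((uint64_t)rank) << PML_UCX_CONTEXT_BITS) |
            ((uint64_t)c_index);
}
```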
Since we only have 20 bits here, shall we check that the cid is always lower than 2^20? Or are we somehow guaranteed that the cid is within the limit?
who has only 20 bits?
PML_UCX_CONTEXT_BITS is 20 bits, so if _c_index were ever to be bigger, it would spill onto the ->c_my_rank field - maybe that's not the case?
Guess I'm curious - OMPI's communicator index has always been 32 bits. So have you just been lucky because the index never grew big enough to violate your 20-bit threshold? If so, then it seems you have a design problem, as there is no guarantee that the index won't be greater than 20 bits.
Yes - the pml_max_contextid field in the UCX module definition should guarantee that the communicator index stays below 20 bits.
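A minimal sketch of that guarantee as described (the initializer is an assumption based on the 20-bit field, not copied from the UCX PML source):

```c
#include <stdint.h>

#define PML_UCX_CONTEXT_BITS 20

/* The PML advertises the largest communicator index it can encode in
 * its tag; the cid allocator is expected to stay at or below this cap. */
static const uint32_t pml_max_contextid = (1u << PML_UCX_CONTEXT_BITS) - 1;
```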
But in connect/accept and comm_spawn operations, the context id is being set by PMIx, which provides a 32-bit ID. There are other places in the code that also set the ID, not just in the PML. So how do you handle IDs beyond 20 bits? Or is UCX simply limited to 20-bit IDs and fails if OMPI sets anything above that?
Actually, the cid in the dpm.c code is generated in the call to ompi_comm_nextcid, using the "OLD" algorithm for allocating cids in the ompi_comm_nextcid_nb function, since the MPI functions supported in dpm.c for communicator construction always require an input communicator.
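A schematic of that "OLD" allocation idea - all members of the input communicator agree on the lowest cid that is free everywhere. The real ompi_comm_nextcid_nb is nonblocking and considerably more involved; the table size and helper below are assumptions for illustration:

```c
#include <mpi.h>
#include <stdbool.h>

#define MAX_CID (1 << 16)          /* assumed table size, illustration only */
static bool cid_in_use[MAX_CID];   /* per-process view of allocated cids */

/* Blocking schematic: pick the lowest cid that is free on every rank. */
static int next_cid(MPI_Comm comm)
{
    for (int cand = 0; cand < MAX_CID; cand++) {
        int mine = cid_in_use[cand] ? 0 : 1;
        int everywhere = 0;
        /* candidate is acceptable only if free on all ranks */
        MPI_Allreduce(&mine, &everywhere, 1, MPI_INT, MPI_MIN, comm);
        if (everywhere) {
            cid_in_use[cand] = true;
            return cand;
        }
    }
    return -1;  /* no cid available */
}
```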
I have seen the question of the size of the context ID raised several times, so let me provide a little perspective on it. In PMIx, the context ID for a group is specified to be […]. @hppritcha had previously (a long time ago, now) requested that PRRTE start assigning values at […].

I don't really understand all the bit manipulation going on here, so I am probably missing something - but it strikes me that if you are limiting the context ID to something less than 32 bits, and using the CID returned by PMIx, then you have a problem.
Okay, sorry for not getting back to this thread sooner - just very busy in my OMPI time with other things at the moment. There's also a variant of this for the MTL components. Anyway, this quantity is used for controlling the max value […]
Also, the PMIx group id is now only used to extract modex data, as can be seen in the code changes in this PR. The excid that Nathan developed (which used the PMIx group id in a 128-bit extended context id) is not really used now except in OB1. Eventually we'll remove that, but not as part of 6.0.x. It was one of those ideas that's good in theory, but in the general case of external libraries providing the lower-level APIs for moving data around, it's not really maintainable. The excid stuff is described in https://doi.org/10.1109/CLUSTER.2019.8891002
Sooo...you are telling me that I went through all the trouble of adding the ability for PMIx and PRRTE to provide context IDs as part of the group operation that returns modex data - for nothing? You'd rather not only do the group op, but also do a kluge multi-stage operation to create the ID??
Huh? Look at the code changes in this PR for the file ompi/communicator/comm_cid.c. There you see we do indeed use the PMIx group context id to look up the local cid for the target receiver. We are just now looking up the receiver's local cid in a different way than Nathan's difficult-to-maintain (much less implement in PMLs besides OB1) out-of-band bootstrap method.
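A hedged sketch of that lookup path as described: query the peer's PMIX_GROUP_LOCAL_CID from modex data, qualified by the group context id. The qualifier usage and value type here are assumptions, not the PR's exact code:

```c
#include <pmix.h>

/* Look up the local cid a peer assigned for the communicator that was
 * created via a PMIx group with the given context id. */
static pmix_status_t lookup_peer_local_cid(const pmix_proc_t *peer,
                                           size_t group_ctxid,
                                           size_t *peer_local_cid)
{
    pmix_info_t qual;
    pmix_value_t *val = NULL;
    pmix_status_t rc;

    /* qualify the fetch with the group's context id */
    PMIX_INFO_LOAD(&qual, PMIX_GROUP_CONTEXT_ID, &group_ctxid, PMIX_SIZE);
    PMIX_INFO_SET_QUALIFIER(&qual);

    rc = PMIx_Get(peer, PMIX_GROUP_LOCAL_CID, &qual, 1, &val);
    PMIX_INFO_DESTRUCT(&qual);
    if (PMIX_SUCCESS == rc && NULL != val) {
        *peer_local_cid = val->data.size;   /* value type assumed size_t */
        PMIX_VALUE_RELEASE(val);
    }
    return rc;
}
```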
Okay, @hppritcha - I confess to confusion. Your statement was:

> It was one of those ideas that's good in theory, but in the general case of external libraries providing the lower-level APIs for moving data around, it's not really maintainable.
I was reacting to this statement as it implied (as I read it) that the use of PMIx to generate a context ID was something to be removed, as PMIx is the only "external library" involved (to the best of my knowledge). I confess I haven't read through this PR to try and fully grok all the changes, so if I have misunderstood, I apologize for it. I do see the use of PMIx group ops - I was just taking the understanding that you intended to remove those at some point post-6.0.

AFAIK, the group operation is working correctly in terms of generating IDs. I know there is a problem in the PRRTE/PMIx master branches (and in their release branches) regarding correctly collecting all required info to be exchanged. I've been working on it, albeit slowly. Had to step away for a while, but am now gradually attempting to restart it.
Force-pushed from 010f176 to 4fe0431
@janjust please review when you have a chance. I think I've addressed the various items noted by the NVIDIA reviewers.
Greatly simplify support for MPI_Comm_create_from_group and MPI_Intercomm_create_from_group by removing the need to support the 128-bit excid notion.
Instead, make use of a PMIx capability - PMIX_GROUP_LOCAL_CID - and the notion of PMIX_GROUP_INFO, introduced in Open PMIx 4.1.3. This allows us to piggyback a local cid selected for the new communicator on the PMIx_Group_construct operation. With this approach, a lot of the complex active-message-style operations implemented in the OB1 PML to support excids can be avoided.
This PR also includes simplifications to the OFI MTL to make use of the PMIX_GROUP_LOCAL_CID feature.
Infrastructure for debugging communicator management routines was also introduced, along with a new MCA parameter - mpi_comm_verbose.
Related to #12566