core: Introduce new inject calls that take buffer descriptor #10433

j-xiong · 2024-10-02T21:43:50Z

Existing inject calls (for msg, tagged, rma, and atomic ops) don't provide a way to supply the memory descriptor of the data buffer. This presents difficulty when FI_HMEM is enabled. Providers either have to detect the HMEM interface the buffer is allocated with, or declare the supported inject_size be zero. Both may result in sub-optimal application performance.

Here a set of new inject calls are introduced. The difference of the new calls vs the corresponding existing calls is the addition of a new parameter "desc", like what have already existed in the send calls. With the new calls, the provider is able to know the iface of the data buffer and handle it properly.

Existing inject calls (for msg, tagged, rma, and atomic ops) don't provide a way to supply the memory descriptor of the data buffer. This presents difficulty when FI_HMEM is enabled. Providers either have to detect the HMEM interface the buffer is allocated with, or declare the supported inject_size be zero. Both may result in sub-optimal application performance. Here a set of new inject calls are introduced. The difference of the new calls vs the corresponding existing calls is the addition of a new parameter "desc", like what have already existed in the send calls. With the new calls, the provider is able to know the iface of the data buffer and handle it properly. Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>

shefty · 2024-10-02T22:22:30Z

A purpose of the inject call is that the buffer doesn't have to be registered. If it needs to be registered, the user can go through any of the other send operations. It seems the gap here is to disallow hmem buffers from being passed into the inject call if FI_MR_HMEM is set.

j-xiong · 2024-10-02T22:46:50Z

The actual usage of the inject calls can deviate. There are many cases that the buffer is pre-registered anyway (e.g. MPI eager buffer), and the user just wants the "not holding the buffer and no completion" semantics of the inject calls. Although that can be achieved with regular send with FI_INJECT flag and selective completion, but it is more cumbersome that way.

shefty · 2024-10-02T23:11:08Z

The normal send call is usable, however. Here, the app has already registered the buffer. But the app wants the provider to copy the GPU memory somewhere, so that when the call returns the app can own the buffer again. I'm struggling to see the benefit of introducing a new call or to follow the application use case. Copying off GPU memory isn't exactly fast, which is what the inject call targets.

j-xiong · 2024-10-03T04:02:28Z

The difficulty comes from handling the inject calls over the LNX provider when the underlying providers have different inject size and FI_MR_HMEM settings. For example, the CXI provider doesn't need FI_MR_HMEM mode and can always return positive inject size; but the SHM provider requires FI_MR_HMEM mode and has to return 0 inject size when FI_HMEM is enabled. If the LNX provider takes the conservative way by setting inject size to 0 when FI_HMEM is enabled, the application can't use inject calls at all even if the traffic goes through CXI. If the LNX provider setting the inject size to the value returned by CXI, it needs to handle device buffer manually if the traffic chooses SHM -- either register + copy + inject, or register + send. The registration here can be redundant if the buffer has already been registered by the application. Application can call these new inject functions if the buffer has already been registered, saving the provider from a duplicated registration.

shefty · 2024-10-03T16:42:58Z

The problem is the current inject calls are not usable. That needs to be indicated somehow, e.g. by restricting inject to host buffers only by definition, or have separate inject sizes for host vs hmem buffers or updating FI_MR_HMEM definition somehow. That needs to be fixed, otherwise, we're simply introducing magic developer required knowledge as to why one call needs to be used over another.

In this case, new API calls are for convenience. They aren't needed and likely have minimal performance benefit over a common path.

If LNX can return an inject size for transfers that go over SHM, then SHM should be able to support the same inject size directly.

amirshehataornl · 2024-10-03T16:53:07Z

When you say: "A purpose of the inject call is that the buffer doesn't have to be registered. " What line of reasoning was used to come to that determination?

shefty · 2024-10-03T17:03:28Z

Because there's no damn place to provide the memory descriptor. Seems pretty damn obvious.

amirshehataornl · 2024-10-03T17:19:39Z

Don't understand the tone of the response. It does not answer my question.

The lack for the descriptor in the original APIs might be due to a historical reason? That was my question and your answer says that the reason the descriptor is not there, is because it's not there.

As far as I understand the inject call is used to inject the data in the message in order not to incur the performance penalty of a more complex protocol. With the introduction of HMEM there is now a reason to add the descriptor, because the data to be injected resides in HMEM. The way CXI got around this is to do a lookup on the address and cache the result. At least in the context of CXI the inject path actually adds some significant performance gains.

aingerson · 2024-10-03T17:35:25Z

I think what grumpy Sean is saying is that the behavior of the inject (in terms of dropping the completion and returning the buffer immediately) can be replicated with the sendmsg call. So the only additional thing the call is adding is that you don't need to register the buffer, hence the original reasoning behind the call. If we add fi_inject2 with the descriptor, it conflicts with what that call is providing (lack of registration). So what we should be doing is specifying that the inject calls can only be called with host memory (or memory does not have to be registered) and then move the middlewares to using fi_sendmsg() to replicate the inject behavior that provides the optimization for device memory if desired.

shijin-aws · 2024-10-03T17:47:52Z

I agree with Sean here, I prefer to keep fi_*inject* call without MR descriptor to indicate an MR is not needed & the send buffer is host type (when FI_MR_HMEM is required) so we have a clear logic.

j-xiong force-pushed the inject2 branch from d827dfd to 305f9aa Compare October 2, 2024 21:44

j-xiong requested review from raffenet, shefty, swelch, hppritcha, aingerson, belynam, shijin-aws, iziemba and amirshehataornl October 2, 2024 21:48

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

core: Introduce new inject calls that take buffer descriptor #10433

core: Introduce new inject calls that take buffer descriptor #10433

j-xiong commented Oct 2, 2024

shefty commented Oct 2, 2024

j-xiong commented Oct 2, 2024

shefty commented Oct 2, 2024

j-xiong commented Oct 3, 2024

shefty commented Oct 3, 2024

amirshehataornl commented Oct 3, 2024

shefty commented Oct 3, 2024

amirshehataornl commented Oct 3, 2024

aingerson commented Oct 3, 2024 •

edited

Loading

shijin-aws commented Oct 3, 2024 •

edited

Loading

core: Introduce new inject calls that take buffer descriptor #10433

Are you sure you want to change the base?

core: Introduce new inject calls that take buffer descriptor #10433

Conversation

j-xiong commented Oct 2, 2024

shefty commented Oct 2, 2024

j-xiong commented Oct 2, 2024

shefty commented Oct 2, 2024

j-xiong commented Oct 3, 2024

shefty commented Oct 3, 2024

amirshehataornl commented Oct 3, 2024

shefty commented Oct 3, 2024

amirshehataornl commented Oct 3, 2024

aingerson commented Oct 3, 2024 • edited Loading

shijin-aws commented Oct 3, 2024 • edited Loading

aingerson commented Oct 3, 2024 •

edited

Loading

shijin-aws commented Oct 3, 2024 •

edited

Loading