Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[UR] Add initial spec for async alloc entry points #2180

Draft
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

hdelan
Copy link
Contributor

@hdelan hdelan commented Oct 8, 2024

First basic work in progress spec.

@github-actions github-actions bot added loader Loader related feature/bug common Changes or additions to common utilities specification Changes or additions to the specification experimental Experimental feature additions/changes/specification level-zero L0 adapter specific issues labels Oct 8, 2024
@hdelan hdelan force-pushed the async-alloc branch 3 times, most recently from de24f3d to 13392a0 Compare October 8, 2024 10:57
Copy link
Contributor

@pbalcer pbalcer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm having a hard time grasping how these functions are supposed to be used w.r.t. events.

With in-order queues, it's simple enough:

void *ptr;
urEnqueueUSMDeviceAllocExp(...,&ptr, 1024, ...); // allocate 1kb
urKernelSetArgPointer(..., ptr);
urEnqueueKernelLaunch(...);

urEnqueueUSMFreeExp(ptr, ...);

void *ptr2;
urEnqueueUSMDeviceAllocExp(...,&ptr2, 1024, ...); // allocate 1kb
assert(ptr == ptr2); // since this is an in-order queue, and we previously freed ptr, we can reuse that same object

urKernelSetArgPointer(..., ptr2);
urEnqueueKernelLaunch(...);

But if the queue is out-of-order, we face a situation where urEnqueueUSMDeviceAllocExp needs to figure out the dependencies of previously enqueued frees:

void *ptr;
urEnqueueUSMDeviceAllocExp(...,&ptr, 1024, ...); // allocate 1kb
urKernelSetArgPointer(..., ptr);

ur_event_handle_t kernel_one_out_event;
urEnqueueKernelLaunch(..., &kernel_one_out_event);

ur_event_handle_t free_out_event;
urEnqueueUSMFreeExp(ptr, waitlist = &[kernel_one_out_event], free_out_event);

void *ptr2;
ur_event_handle_t allocate_out_event;
urEnqueueUSMDeviceAllocExp(...,&ptr2, 1024, ..., waitlist = &[free_out_event], &allocate_out_event); // allocate 1kb. This takes the same events as enqueue kernel launch would.

assert(ptr == ptr2); // the implementation can figure out the ptr object will finish by the time ptr2 waitlist signals.

urKernelSetArgPointer(..., ptr2); // should this automatically add the ptr2 out event?
urEnqueueKernelLaunch(..., waitlist = &[allocate_out_event] , ...); // should this take allocate_out_event or kernel_one_out_event?

To me this looks like as if all the enqueue alloc/free functions have an implicit urEnqueueEventsWait built-in, and SYCL/users will need to account for that.

Theoretically, it should be possible to figure out all those dependencies/events by storing a list of active and pending uses for each pointer, eliminating the need for events.

@hdelan
Copy link
Contributor Author

hdelan commented Oct 8, 2024

But if the queue is out-of-order, we face a situation where urEnqueueUSMDeviceAllocExp needs to figure out the dependencies of previously enqueued frees:

void *ptr;
urEnqueueUSMDeviceAllocExp(...,&ptr, 1024, ...); // allocate 1kb
urKernelSetArgPointer(..., ptr);

ur_event_handle_t kernel_one_out_event;
urEnqueueKernelLaunch(..., &kernel_one_out_event);

Also bear in mind you will need to return an event from urEnqueueUSMDeviceAllocExp and add that as a dependency to the kernel launch.

To me this looks like as if all the enqueue alloc/free functions have an implicit urEnqueueEventsWait built-in, and SYCL/users will need to account for that.

That is correct. I think the goal of these funcs should be to have the same synchronization behaviour as other urEnqueue* entry points. I can't see why this might cause complications for users/SYCL.

Theoretically, it should be possible to figure out all those dependencies/events by storing a list of active and pending uses for each pointer, eliminating the need for events.

However one of the key things this approach is missing is the ability to pass dep events to urEnqueue*USMAllocExp. If I want an async allocation to happen only once say kernel A has completed, there is no way of expressing that without events. With an async free we can infer the dep events by seeing which events use the allocation, so maybe we don't need dep events for this, but I think it would be no harm.

For instance, we really need to be able to express this:

urEnqueueUSMFreeExp();
urEnqueueUSMDeviceAllocExp(); // I only want this allocation to be made once the previous free happens
                              // ie if I want to make sure I don't overflow the mem pool.

But if we don't return an event from Free and pass events to Alloc, then we can't express this just using ptrs for synchronization.

I'd be curious to see what others think about this. Ping @AerialMantis @kbenzie .

@hdelan
Copy link
Contributor Author

hdelan commented Oct 8, 2024

If we consider a subDAG for each async allocation starting at Alloc and ending at Free, then we could elide the events needed to make this subDAG work internally. But we would still need to be able to pass depEvents into Alloc (the start of the subDAG) and we also need to be able to pass out a recorded event from Free. Potentially an API would look something like this instead:

ur_result_t urEnqueueUSMBlahAllocExp(
    ur_queue_handle_t hQueue, ur_usm_pool_handle_t pPool, const size_t size,
    const ur_exp_async_usm_alloc_properties_t *pProperties,
    uint32_t numEventsInWaitList, const ur_event_handle_t *phEventWaitList,
    void **ppMem); // No out event
ur_result_t urEnqueueUSMFreeExp(ur_queue_handle_t hQueue,
                                ur_usm_pool_handle_t pPool, void *pMem,
                                ur_event_handle_t *phEvent); // No in events

I'm not sure if this is a lot better as it makes the APIs a bit inconsistent.

@pbalcer
Copy link
Contributor

pbalcer commented Oct 8, 2024

Yea, the example you gave makes sense... Hm, my main concern that we'd have to spend time here potentially creating new events, doing the relevant append wait for the waitlist events, and then signaling events, when all of this may not be necessary.

But I guess the implementation can avoid all that overhead if no signal event and no waitlist are provided, so ideally the enqueue alloc/free is very quick without heavyweight event-based synchronization.

In general, I agree that, all things being equal, it's better to be consistent.

@hdelan
Copy link
Contributor Author

hdelan commented Oct 8, 2024

Yea, the example you gave makes sense... Hm, my main concern that we'd have to spend time here potentially creating new events, doing the relevant append wait for the waitlist events, and then signaling events, when all of this may not be necessary.

I mean we can forego this overhead by using in order queues and without passing out events. I think when dealing with higher level out of order queues, performance is not as critical as in order queues, so an event here and there will not hurt. Regardless of whether we pass events explicitly to the entry points, if we need to do dep analysis for out of order queues, we will need to record events, either explicitly when we pass events, or implicitly by using ptrs.

But I guess the implementation can avoid all that overhead if no signal event and no waitlist are provided, so ideally the enqueue alloc/free is very quick without heavyweight event-based synchronization.

Yes exactly.

@pbalcer
Copy link
Contributor

pbalcer commented Oct 8, 2024

If we consider a subDAG for each async allocation starting at Alloc and ending at Free, then we could elide the events needed to make this subDAG work internally.

But implementations should do this regardless (for every applicable object in the pool), because we need to return a pointer immediately. Ideally that pointer does not require the subsequent kernel execution to wait, or wait the least possible amount of time. From what I can tell, cuda even allows you to forbid the runtime from allocating objects that would cause an implicit synchronization (cudaMemPoolReuseAllowInternalDependencies).

@hdelan
Copy link
Contributor Author

hdelan commented Oct 8, 2024

But implementations should do this regardless (for every applicable object in the pool), because we need to return a pointer immediately. Ideally that pointer does not require the subsequent kernel execution to wait, or wait the least possible amount of time. From what I can tell, cuda even allows you to forbid the runtime from allocating objects that would cause an implicit synchronization (cudaMemPoolReuseAllowInternalDependencies).

Yes the pointer is immediately valid so no blocking should occur. However there are some nuances with this, ie if allocating a mem pool for the first time there will be a blocking wait before the kernel starts. This is all CUDA backend dependent. The main idea is that once a mem pool has been made, async allocs can be reused without needing expensive CUDA blocking allocations to be made. See here https://developer.nvidia.com/blog/enhancing-memory-allocation-with-new-cuda-11-2-features/#stream-ordered_memory_allocator

@hdelan
Copy link
Contributor Author

hdelan commented Oct 8, 2024

Also just to clarify on an earlier point.

void *ptr;
urEnqueueUSMDeviceAllocExp(...,&ptr, 1024, ...); // allocate 1kb
urKernelSetArgPointer(..., ptr);

ur_event_handle_t kernel_one_out_event;
urEnqueueKernelLaunch(..., &kernel_one_out_event);

ur_event_handle_t free_out_event;
urEnqueueUSMFreeExp(ptr, waitlist = &[kernel_one_out_event], free_out_event);

void *ptr2;
ur_event_handle_t allocate_out_event;
urEnqueueUSMDeviceAllocExp(...,&ptr2, 1024, ..., waitlist = &[free_out_event], &allocate_out_event); // allocate 1kb. This takes the same events as enqueue kernel launch would.

assert(ptr == ptr2); // the implementation can figure out the ptr object will finish by the time ptr2 waitlist signals.

There would be no way to provide such a guarantee (ptr == ptr2) in UR as we are not in control of the native allocator, which may have its own unusual heuristics.

@pbalcer
Copy link
Contributor

pbalcer commented Oct 8, 2024

But implementations should do this regardless (for every applicable object in the pool), because we need to return a pointer immediately. Ideally that pointer does not require the subsequent kernel execution to wait, or wait the least possible amount of time. From what I can tell, cuda even allows you to forbid the runtime from allocating objects that would cause an implicit synchronization (cudaMemPoolReuseAllowInternalDependencies).

Yes the pointer is immediately valid so no blocking should occur. However there are some nuances with this, ie if allocating a mem pool for the first time there will be a blocking wait before the kernel starts. This is all CUDA backend dependent. The main idea is that once a mem pool has been made, async allocs can be reused without needing expensive CUDA blocking allocations to be made. See here https://developer.nvidia.com/blog/enhancing-memory-allocation-with-new-cuda-11-2-features/#stream-ordered_memory_allocator

Yea, I've been looking at that as well. My point wasn't about allocating physically-backed pages (i.e., blocking on a page fault or something), but rather on blocking/waiting on events that originate from urEnqueueUSMBlahAllocExp. If possible, this function should return a pointer that has no/minimal dependencies for being reused given the waitlist. In other words, an optimal implementation may need to internally create a DAG of dependencies for ptrs, and then, at allocation time, figure out which objects will be reusable (or have the least amount of outstanding dependencies) once the events on the waitlist are signaled.

Anyway, I'm convinced. I think an in-order implementation should be straightforward, the out-of-order one will be difficult regardless of how we do the API. I think it will fine if we do a good-enough approximate solution in the latter case.

@hdelan
Copy link
Contributor Author

hdelan commented Oct 8, 2024

Thanks for discussion @pbalcer !

@igchor
Copy link
Member

igchor commented Oct 8, 2024

This looks good. I like the approach with explicit event wait list and signal events and I agree this should be consistent with other enqueue APIs.

For cases where there is only a single event in the wait list (which should be a fairly common case) the event elision should be fairly simple. For some pool implementations (e.g. what we discussed for L0) I guess we could just bump the reference of the input event and return it as a signal event.

First basic work in progress spec.
Add an entry so the user can specify if the native USM pool should be
used.
@github-actions github-actions bot added cuda CUDA adapter specific issues hip HIP adapter specific issues native-cpu Native CPU adapter specific issues labels Oct 22, 2024
@kswiecicki
Copy link
Contributor

Hi @hdelan, I’m currently working on the L0 adapter implementation for the recent spec changes. Would it be possible to separate the spec changes from the implementation of the API that was added to the spec? This would help reduce clutter in PRs that only depend on the spec changes.

@hdelan
Copy link
Contributor Author

hdelan commented Dec 10, 2024

Hi @hdelan, I’m currently working on the L0 adapter implementation for the recent spec changes. Would it be possible to separate the spec changes from the implementation of the API that was added to the spec? This would help reduce clutter in PRs that only depend on the spec changes.

Hi @kswiecicki sorry I missed that comment. I see you added the spec changes here. Thanks for doing so. I'd be inclined to also mark that PR as a [draft] since some further spec changes may be necessary. This spec PR is not ready for a final merge as some things are still up for discussion. I am working on a new team now so this has been on pause for a while, but I will try to get back to this soon so that the API can be finalized.

@github-actions github-actions bot added the opencl OpenCL adapter specific issues label Dec 17, 2024
@hdelan hdelan force-pushed the async-alloc branch 7 times, most recently from 4fbfbc4 to 3088c6e Compare December 20, 2024 11:12
Link some UR mem pool flags up to their CUDA equivalents. There are a
lot of outstanding TODOs for flags whose mapping/purpose is unknown.
Also some small fixes here and there in the CUDA adapter.
@github-actions github-actions bot added the conformance Conformance test suite issues. label Jan 9, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
common Changes or additions to common utilities conformance Conformance test suite issues. cuda CUDA adapter specific issues experimental Experimental feature additions/changes/specification hip HIP adapter specific issues level-zero L0 adapter specific issues loader Loader related feature/bug native-cpu Native CPU adapter specific issues opencl OpenCL adapter specific issues specification Changes or additions to the specification
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants