-
Notifications
You must be signed in to change notification settings - Fork 121
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[UR] Add initial spec for async alloc entry points #2180
base: main
Are you sure you want to change the base?
Conversation
de24f3d
to
13392a0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm having a hard time grasping how these functions are supposed to be used w.r.t. events.
With in-order queues, it's simple enough:
void *ptr;
urEnqueueUSMDeviceAllocExp(...,&ptr, 1024, ...); // allocate 1kb
urKernelSetArgPointer(..., ptr);
urEnqueueKernelLaunch(...);
urEnqueueUSMFreeExp(ptr, ...);
void *ptr2;
urEnqueueUSMDeviceAllocExp(...,&ptr2, 1024, ...); // allocate 1kb
assert(ptr == ptr2); // since this is an in-order queue, and we previously freed ptr, we can reuse that same object
urKernelSetArgPointer(..., ptr2);
urEnqueueKernelLaunch(...);
But if the queue is out-of-order, we face a situation where urEnqueueUSMDeviceAllocExp
needs to figure out the dependencies of previously enqueued frees:
void *ptr;
urEnqueueUSMDeviceAllocExp(...,&ptr, 1024, ...); // allocate 1kb
urKernelSetArgPointer(..., ptr);
ur_event_handle_t kernel_one_out_event;
urEnqueueKernelLaunch(..., &kernel_one_out_event);
ur_event_handle_t free_out_event;
urEnqueueUSMFreeExp(ptr, waitlist = &[kernel_one_out_event], free_out_event);
void *ptr2;
ur_event_handle_t allocate_out_event;
urEnqueueUSMDeviceAllocExp(...,&ptr2, 1024, ..., waitlist = &[free_out_event], &allocate_out_event); // allocate 1kb. This takes the same events as enqueue kernel launch would.
assert(ptr == ptr2); // the implementation can figure out the ptr object will finish by the time ptr2 waitlist signals.
urKernelSetArgPointer(..., ptr2); // should this automatically add the ptr2 out event?
urEnqueueKernelLaunch(..., waitlist = &[allocate_out_event] , ...); // should this take allocate_out_event or kernel_one_out_event?
To me this looks like as if all the enqueue alloc/free functions have an implicit urEnqueueEventsWait
built-in, and SYCL/users will need to account for that.
Theoretically, it should be possible to figure out all those dependencies/events by storing a list of active and pending uses for each pointer, eliminating the need for events.
Also bear in mind you will need to return an event from
That is correct. I think the goal of these funcs should be to have the same synchronization behaviour as other
However one of the key things this approach is missing is the ability to pass dep events to For instance, we really need to be able to express this: urEnqueueUSMFreeExp();
urEnqueueUSMDeviceAllocExp(); // I only want this allocation to be made once the previous free happens
// ie if I want to make sure I don't overflow the mem pool. But if we don't return an event from I'd be curious to see what others think about this. Ping @AerialMantis @kbenzie . |
If we consider a subDAG for each async allocation starting at ur_result_t urEnqueueUSMBlahAllocExp(
ur_queue_handle_t hQueue, ur_usm_pool_handle_t pPool, const size_t size,
const ur_exp_async_usm_alloc_properties_t *pProperties,
uint32_t numEventsInWaitList, const ur_event_handle_t *phEventWaitList,
void **ppMem); // No out event
ur_result_t urEnqueueUSMFreeExp(ur_queue_handle_t hQueue,
ur_usm_pool_handle_t pPool, void *pMem,
ur_event_handle_t *phEvent); // No in events I'm not sure if this is a lot better as it makes the APIs a bit inconsistent. |
Yea, the example you gave makes sense... Hm, my main concern that we'd have to spend time here potentially creating new events, doing the relevant append wait for the waitlist events, and then signaling events, when all of this may not be necessary. But I guess the implementation can avoid all that overhead if no signal event and no waitlist are provided, so ideally the enqueue alloc/free is very quick without heavyweight event-based synchronization. In general, I agree that, all things being equal, it's better to be consistent. |
I mean we can forego this overhead by using in order queues and without passing out events. I think when dealing with higher level out of order queues, performance is not as critical as in order queues, so an event here and there will not hurt. Regardless of whether we pass events explicitly to the entry points, if we need to do dep analysis for out of order queues, we will need to record events, either explicitly when we pass events, or implicitly by using ptrs.
Yes exactly. |
But implementations should do this regardless (for every applicable object in the pool), because we need to return a pointer immediately. Ideally that pointer does not require the subsequent kernel execution to wait, or wait the least possible amount of time. From what I can tell, cuda even allows you to forbid the runtime from allocating objects that would cause an implicit synchronization ( |
Yes the pointer is immediately valid so no blocking should occur. However there are some nuances with this, ie if allocating a mem pool for the first time there will be a blocking wait before the kernel starts. This is all CUDA backend dependent. The main idea is that once a mem pool has been made, async allocs can be reused without needing expensive CUDA blocking allocations to be made. See here https://developer.nvidia.com/blog/enhancing-memory-allocation-with-new-cuda-11-2-features/#stream-ordered_memory_allocator |
Also just to clarify on an earlier point.
There would be no way to provide such a guarantee ( |
Yea, I've been looking at that as well. My point wasn't about allocating physically-backed pages (i.e., blocking on a page fault or something), but rather on blocking/waiting on events that originate from Anyway, I'm convinced. I think an in-order implementation should be straightforward, the out-of-order one will be difficult regardless of how we do the API. I think it will fine if we do a good-enough approximate solution in the latter case. |
Thanks for discussion @pbalcer ! |
This looks good. I like the approach with explicit event wait list and signal events and I agree this should be consistent with other enqueue APIs. For cases where there is only a single event in the wait list (which should be a fairly common case) the event elision should be fairly simple. For some pool implementations (e.g. what we discussed for L0) I guess we could just bump the reference of the input event and return it as a signal event. |
First basic work in progress spec.
Add an entry so the user can specify if the native USM pool should be used.
Hi @hdelan, I’m currently working on the L0 adapter implementation for the recent spec changes. Would it be possible to separate the spec changes from the implementation of the API that was added to the spec? This would help reduce clutter in PRs that only depend on the spec changes. |
Hi @kswiecicki sorry I missed that comment. I see you added the spec changes here. Thanks for doing so. I'd be inclined to also mark that PR as a [draft] since some further spec changes may be necessary. This spec PR is not ready for a final merge as some things are still up for discussion. I am working on a new team now so this has been on pause for a while, but I will try to get back to this soon so that the API can be finalized. |
4fbfbc4
to
3088c6e
Compare
Link some UR mem pool flags up to their CUDA equivalents. There are a lot of outstanding TODOs for flags whose mapping/purpose is unknown.
Also some small fixes here and there in the CUDA adapter.
First basic work in progress spec.