Provide API to map real device ID to parsec ID and back #660

devreal · 2024-06-04T13:19:00Z

Description

PaRSEC numbers its devices differently from the rest of the world (0 for the host, possibly 1 for the recursive device, then 2 and up for accelerators). CUDA and HIP start numbering devices from 0.

Describe the solution you'd like

An efficient API to map between CUDA/HIP IDs and parsec device IDs.

Describe alternatives you've considered

TTG currently implements this mapping but it requires us to find the first device ID and then add that to the CUDA/HIP ID to get the parsec ID. That seems brittle and would fail if we ever support multiple device types.
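To make the brittleness concrete, here is a minimal sketch of the offset approach described above. The types `dev_kind_t` and `dev_entry_t` and both functions are hypothetical stand-ins for illustration; they are not PaRSEC or TTG definitions. The table index plays the role of the parsec device ID.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical device kinds for illustration; NOT PaRSEC's own constants. */
typedef enum { DEV_CPU, DEV_RECURSIVE, DEV_CUDA, DEV_HIP } dev_kind_t;

/* Hypothetical stand-in for one entry of the runtime's device table. */
typedef struct {
    dev_kind_t kind;
    int native_id; /* CUDA/HIP ordinal, -1 for non-accelerators */
} dev_entry_t;

/* Index of the first accelerator of any kind, or -1 if there is none. */
static int first_accel_index(const dev_entry_t *devs, int ndevs)
{
    for (int i = 0; i < ndevs; ++i)
        if (devs[i].kind == DEV_CUDA || devs[i].kind == DEV_HIP)
            return i;
    return -1;
}

/* Offset approach: runtime ID = index of first accelerator + native ID.
 * Only correct while every accelerator in the table has the same kind. */
static int offset_to_runtime_id(const dev_entry_t *devs, int ndevs,
                                int native_id)
{
    assert(native_id >= 0);
    int first = first_accel_index(devs, ndevs);
    return first < 0 ? -1 : first + native_id;
}
```

With a table holding one CUDA device followed by two HIP devices, asking for HIP device 0 returns the index of the CUDA device, because the offset carries no information about the device type.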


bosilca · 2024-06-04T17:50:21Z

Why "different than the rest of the world"? PaRSEC numbers its devices starting from 0 and includes all supported devices, exactly like all the others.

As a user you are not supposed to address an accelerator directly, everything you can do should happen in the context provided by the runtime. In this context you do have access to the real accelerator id via the device_t.

What exactly are you trying to do that would require the real device number, and in what context?

therault · 2024-06-04T17:55:29Z

The context is a TTG application.

The need is to be able to advise the runtime to schedule a given task on a given device.

The issue is that PaRSEC does not exist for the programmer of a TTG application: only TTG exists. So, it doesn't make sense for a TTG application programmer to find a parsec_device_t.

So, TTG needs to expose a concept of device or device index, and there should be a portable way (for the TTG implementation with the PaRSEC backend) to convert from / to a TTG device / device index to the actual runtime device / device index.
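As a rough sketch of what such a two-way conversion could look like, the functions below look a device up by type and native ID, and back. Everything here (`dev_kind_t`, `dev_entry_t`, `native_to_runtime`, `runtime_to_native`) is a hypothetical illustration, not an existing PaRSEC or TTG interface. The table index stands in for the runtime device index.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical device kinds for illustration; NOT PaRSEC's own constants. */
typedef enum { DEV_CPU, DEV_RECURSIVE, DEV_CUDA, DEV_HIP } dev_kind_t;

/* Hypothetical stand-in for one entry of the runtime's device table. */
typedef struct {
    dev_kind_t kind;
    int native_id; /* CUDA/HIP ordinal, -1 for non-accelerators */
} dev_entry_t;

/* Native (kind, ordinal) to runtime index; -1 if no such device exists. */
static int native_to_runtime(const dev_entry_t *devs, int ndevs,
                             dev_kind_t kind, int native_id)
{
    for (int i = 0; i < ndevs; ++i)
        if (devs[i].kind == kind && devs[i].native_id == native_id)
            return i;
    return -1;
}

/* Runtime index to native ordinal; the device kind is written to *kind.
 * Returns -1 for an out-of-range index or a non-accelerator device. */
static int runtime_to_native(const dev_entry_t *devs, int ndevs,
                             int runtime_id, dev_kind_t *kind)
{
    assert(kind != NULL);
    if (runtime_id < 0 || runtime_id >= ndevs)
        return -1;
    *kind = devs[runtime_id].kind;
    return devs[runtime_id].native_id;
}
```

Keying the lookup on the pair (kind, native ID) keeps it well defined when CUDA and HIP devices coexist. The linear scan is only for brevity; a table built once per device type would make the lookup constant time.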

bosilca · 2024-06-07T21:03:08Z

PaRSEC has its own dialect to identify devices, a dialect (and the API going with it) that TTG chooses not to expose to users. Thus, it seems more reasonable to have TTG provide the conversion between TTG and PaRSEC devices instead of forcing PaRSEC to speak the TTG dialect for naming devices.

therault · 2024-06-07T22:09:34Z

But it can also be useful for most applications that do device binding, whatever the DSL.

If we take gemm_gpu in DPLASMA for example, we start by counting how many CUDA GPUs we have (nbgpus), then we create an array that maps the space [0, nbgpus-1] to the actual device number; in the PR from Qinglei to give scheduling advice to POTRF, he needs to do the same thing. Most GPU-enabled tests also compute the number of GPUs available, and some need to create a map in order to easily express that they want to bind data or a task to a specific GPU.
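The count-then-map pattern described here can be sketched as follows. This is not the DPLASMA code; `dev_kind_t`, `dev_entry_t`, and `build_gpu_map` are hypothetical names, and the table index stands in for the actual runtime device number.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical device kinds for illustration; NOT PaRSEC's own constants. */
typedef enum { DEV_CPU, DEV_RECURSIVE, DEV_CUDA, DEV_HIP } dev_kind_t;

/* Hypothetical stand-in for one entry of the runtime's device table. */
typedef struct {
    dev_kind_t kind;
    int native_id; /* CUDA/HIP ordinal, -1 for non-accelerators */
} dev_entry_t;

/* Count the devices of the given kind, then build an array mapping
 * [0, nbgpus-1] to their runtime indices, in table order.
 * Returns a malloc'd array that the caller must free, or NULL when no
 * device matches or the allocation fails (*nbgpus is 0 in both cases). */
static int *build_gpu_map(const dev_entry_t *devs, int ndevs,
                          dev_kind_t kind, int *nbgpus)
{
    assert(nbgpus != NULL);
    int count = 0;
    for (int i = 0; i < ndevs; ++i)
        if (devs[i].kind == kind)
            ++count;
    *nbgpus = 0;
    if (count == 0)
        return NULL;
    int *map = malloc((size_t)count * sizeof *map);
    if (map == NULL)
        return NULL;
    int g = 0;
    for (int i = 0; i < ndevs; ++i)
        if (devs[i].kind == kind)
            map[g++] = i; /* g-th GPU of this kind lives at runtime index i */
    *nbgpus = count;
    return map;
}
```

A caller would then use `map[g]` wherever the runtime expects a device index for the g-th GPU, and free the array when done.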

In the end, every time the user wants to map either data or tasks to a specific device, they end up re-inventing a way to do this mapping. The proposal is just to provide an API that they can use (or ignore).

abouteiller · 2024-06-14T14:47:06Z

Also seen in ICLDisco/dplasma#118

devreal added the enhancement New feature or request label Jun 4, 2024

abouteiller mentioned this issue Jun 25, 2024

Add parsec_advise_data_on_device for zpotrf_L ICLDisco/dplasma#118

Merged

abouteiller added this to the v4.1 milestone Sep 20, 2024
