Options for GPU Sharing between Containers Running on a Workstation #1769

frenchwr · 2024-06-27T22:20:08Z

Describe the support request
Hello, I'm trying to understand options that would allow multiple containers to share a single GPU.

I see that K8s device plugins in general are not meant to allow a device to be shared between containers.

I also see from the GPU plugin docs in this repo that there is a sharedDevNum that can be used for sharing a GPU, but I infer this is partitioning the resources on the GPU so each container is only allocated a fraction of the GPU's resources. Is that correct?

My use case is a tool called data-science-stack that is being built to automate the deployment/management of GPU-enabled containers for quick AIML experimentation on a user's laptop or workstation. In this scenario we'd prefer the containers have the ability to each have access to the full GPU resources - much like you'd expect for applications running directly on the host. Is this possible?

System (please complete the following information if applicable):

OS version: Ubuntu 22.04
Kernel version: Linux 5.15 (HWE kernel for some newer devices)
Device plugins version: v0.29.0 and v0.30.0 are the versions I've worked with
Hardware info: iGPU and dGPU

The text was updated successfully, but these errors were encountered:

eero-t · 2024-06-28T11:44:46Z

sharedDevNum is mostly intended to be used when either:

User running the workloads manually takes care of not oversubscribing the GPUs (e.g. in test cluster),
Cluster runs only GPU workload(s) which can share GPU up to sharedDevNum (e.g. cluster dedicated to running a single GPU workload), or
Pod specs indicate how much resources each container uses from GPU, and cluster is running GAS [1] that shares GPUs based on the specified resource consumption
- There's no enforcement of limiting workload to the specified amount, those values are used only for scheduling / device selection

[1] GPU Aware Scheduling: https://github.com/intel/platform-aware-scheduling/tree/master/gpu-aware-scheduling

In this scenario we'd prefer the containers have the ability to each have access to the full GPU resources

If each container (in cluster) is supposed to have exclusive access to the GPU device, use 1 for sharedDevNum.

frenchwr · 2024-06-28T14:47:19Z

If each container (in cluster) is supposed to have exclusive access to the GPU device, use 1 for sharedDevNum.

But this does not allow the GPU to be shared between containers, correct?

Maybe a bit more context about the use case would help. We are building an application that simplifies the deployment of GPU-enabled containers (for example, using Intel's ITEX and IPEX images). This is not meant for deployments across a clusters of nodes. There is just a single node (user's laptop or workstation).

Each container runs a Jupyter Notebook server. Ideally, a user could be on a workstation with a single GPU and multiple containers running, with each provided full access to the GPU. Notebook workloads are typically very bursty, so container A may run a notebook cell that is very GPU intensive while container B is idle. In cases where both containers are simultaneously requesting GPU acceleration, ideally that would be handled the same way (or close to the same way) as two applications running directly on the host OS requesting GPU resources.

tkatila · 2024-07-01T06:30:52Z

@frenchwr sharedDevNum is the option you would most likely want. Any container requesting the gpu.intel.com/i915 resource with sharedDevNum > 1 will get "unlimited" access to the GPU. "unlimited" in a sense that there's no hard partitioning etc. in works. Obviously if two containers try to run on the same GPU, they will fight for the same resources (execution time, memory).

frenchwr · 2024-07-01T15:15:01Z

@tkatila Thanks for clarifying! I agree this sounds like the way to go. A few more follow up questions:

Am I right that running with resource management disabled (default behavior) would make the most sense for our use case?
Is there any performance impact on the number you choose for sharedDevNum? For example, using 2 vs. 10 vs. 100. I guess not if there is no partitioning of the GPU resources but just want to confirm. Is there any reason not to choose an arbitrarily large number if our goal is to expose the full GPU to each container?

tkatila · 2024-07-01T18:37:59Z

Am I right that running with resource management disabled (default behavior) would make the most sense for our use case?

Yes, that's correct, keep it disabled. To enable resource management you would also need another k8s component (GPU Aware Scheduling, or GAS). It's setup requires some hassle and I don't see any benefit from it in your case.

Is there any performance impact on the number you choose for sharedDevNum? For example, using 2 vs. 10 vs. 100. I guess not if there is no partitioning of the GPU resources but just want to confirm. Is there any reason not to choose an arbitrarily large number if our goal is to expose the full GPU to each container?

I don't think we have any guide for selecting the number, but something between 10 and 100 would be fine. The downside with an extremely large number is that it might incur some extra CPU and network bandwidth utilization. GPU plugin will detect the number of GPUs, multiply the number with the sharedDevNum and then create duplicate resources for the node. Carrying all those resources in resource registration and during scheduling will have some minor effect, but if the sharedDevNum is within a sensible range, the effect shouldn't be noticeable.

misohu mentioned this issue Jul 17, 2024

Familiarize with Intel DSS environment canonical/data-science-stack#145

Closed

frenchwr mentioned this issue Aug 26, 2024

Understanding and controlling multi-GPU behavior #1815

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Options for GPU Sharing between Containers Running on a Workstation #1769

Options for GPU Sharing between Containers Running on a Workstation #1769

frenchwr commented Jun 27, 2024

eero-t commented Jun 28, 2024 •

edited

Loading

frenchwr commented Jun 28, 2024

tkatila commented Jul 1, 2024

frenchwr commented Jul 1, 2024

tkatila commented Jul 1, 2024

Options for GPU Sharing between Containers Running on a Workstation #1769

Options for GPU Sharing between Containers Running on a Workstation #1769

Comments

frenchwr commented Jun 27, 2024

eero-t commented Jun 28, 2024 • edited Loading

frenchwr commented Jun 28, 2024

tkatila commented Jul 1, 2024

frenchwr commented Jul 1, 2024

tkatila commented Jul 1, 2024

eero-t commented Jun 28, 2024 •

edited

Loading