Feat: Automatic GPU Switch #845

Draft
wants to merge 4 commits into
base: master

@Steel-skull Steel-skull commented Oct 30, 2024

Docker Windows GPU Passthrough

[This is not fully tested, as I'm waiting for a GPU to arrive.]

Automated GPU management solution for Windows in Docker containers with NVIDIA GPU passthrough support. This project provides scripts and configurations to dynamically manage GPU binding between host and Docker containers, with support for multiple GPUs and audio devices.

Prerequisites

  • Unraid server (or Linux system with Docker)
  • NVIDIA GPU(s)
  • Docker and Docker Compose
  • VFIO-PCI support in kernel
  • NVIDIA drivers installed on host

Quick Start

  1. Clone the repository:
git clone https://github.com/yourusername/docker-windows-gpu.git
cd docker-windows-gpu
  2. Configure your environment:
# Set to your GPU ID(s), PCI address(es), or 'none'
add NVIDIA_VISIBLE_DEVICES=0
  3. Start the container:
docker-compose up -d

Configuration

Environment Variables

  • NVIDIA_VISIBLE_DEVICES: Specify GPU(s) to use
    • Single GPU: NVIDIA_VISIBLE_DEVICES=0
    • Multiple GPUs: NVIDIA_VISIBLE_DEVICES=0,1
    • PCI addresses: NVIDIA_VISIBLE_DEVICES=0000:03:00.0,0000:04:00.0
    • No GPU: NVIDIA_VISIBLE_DEVICES=none
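
The accepted value forms listed above can be told apart with a few pattern checks. The following is only a sketch: `classify_gpu_id` and `list_requested_gpus` are hypothetical names, not functions from this PR, and the `GPU-` UUID form is assumed from the review suggestion further down the thread rather than from this list.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of gpu-switch.sh).

# Classify one entry of NVIDIA_VISIBLE_DEVICES.
classify_gpu_id() {
    local id="$1"
    # Optional 4-8 hex digit domain, then bus:device.function
    local pci_re='^([0-9A-Fa-f]{4,8}:)?[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}\.[0-7]$'

    if [[ -z "$id" || "$id" == "none" ]]; then
        echo "none"
    elif [[ "$id" =~ ^[0-9]+$ ]]; then
        echo "index"
    elif [[ "$id" =~ ^GPU- ]]; then
        echo "uuid"
    elif [[ "$id" =~ $pci_re ]]; then
        echo "pci"
    else
        echo "invalid"
    fi
}

# Split the comma-separated variable and print "<entry> <kind>" per line.
list_requested_gpus() {
    local entries entry
    IFS=',' read -ra entries <<< "${NVIDIA_VISIBLE_DEVICES:-}"
    for entry in "${entries[@]}"; do
        printf '%s %s\n' "$entry" "$(classify_gpu_id "$entry")"
    done
}
```

Rejecting anything classified as `invalid` before touching drivers would keep a typo in the variable from reaching the bind step.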

Docker Compose

The provided docker-compose.yml includes all necessary configurations for:

  • GPU passthrough
  • RDP access
  • KVM support
  • Network management
  • Persistent storage

Usage

Manual GPU Management (until I find a way to run pre-start and post-stop hooks, use it with Unraid User Scripts)

Bind GPU to container:

NVIDIA_VISIBLE_DEVICES=0 /boot/config/plugins/user.scripts/gpu-switch.sh start windows

Release GPU:

NVIDIA_VISIBLE_DEVICES=0 /boot/config/plugins/user.scripts/gpu-switch.sh stop windows

Script Details

The gpu-switch.sh script handles:

  1. GPU detection and validation
  2. Driver management (NVIDIA ⟷ VFIO-PCI)
  3. Audio device pairing
  4. Docker container configuration
  5. Error handling and logging
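
For step 2, one common way to move a PCI device between drivers on Linux is the sysfs `driver_override` interface. The sketch below assumes that approach; `bind_to_driver` and `SYSFS_PCI` are hypothetical names, and the PR's script may handle the switch differently.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not taken from gpu-switch.sh).
# SYSFS_PCI can point at a fake directory tree for a dry run.
SYSFS_PCI="${SYSFS_PCI:-/sys/bus/pci}"

# Usage: bind_to_driver <full PCI address> <driver name>
bind_to_driver() {
    local addr="$1" driver="$2"
    local dev="$SYSFS_PCI/devices/$addr"

    if [ ! -d "$dev" ]; then
        echo "No such PCI device: $addr" >&2
        return 1
    fi

    # Detach the device from its current driver, if it has one
    if [ -e "$dev/driver/unbind" ]; then
        echo "$addr" > "$dev/driver/unbind"
    fi

    # Restrict which driver may claim the device, then ask the kernel to re-probe it
    echo "$driver" > "$dev/driver_override"
    echo "$addr" > "$SYSFS_PCI/drivers_probe"
}
```

On a real host this needs root, and a call such as `bind_to_driver 0000:01:00.0 vfio-pci` would be mirrored by `bind_to_driver 0000:01:00.0 nvidia` when the container stops.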

gpu switch version: 0.1

# Without GPU:
NVIDIA_VISIBLE_DEVICES="" ./gpu-switch.sh start container_name

# With single GPU:
NVIDIA_VISIBLE_DEVICES="0" ./gpu-switch.sh start container_name

# With multiple GPUs:
NVIDIA_VISIBLE_DEVICES="0,1" ./gpu-switch.sh start container_name

# With PCI addresses:
NVIDIA_VISIBLE_DEVICES="0000:03:00.0,0000:04:00.0" ./gpu-switch.sh start container_name

# Explicitly disable GPU:
NVIDIA_VISIBLE_DEVICES="none" ./gpu-switch.sh start container_name
@Steel-skull
Author

I have to modify the Docker Compose side: I was under the impression it supported pre-start and post-stop scripts, but I misread, and it's post-start and pre-stop. I'll need to find a new way to make this work. The script still works and can be run through User Scripts in Unraid.

[Again, though, I'm waiting on a GPU, so I haven't been able to fully test it.]
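
Until a hook mechanism is found, the pre-start and post-stop behavior can be approximated by a wrapper that owns the container's lifecycle. This is only a sketch under assumptions: `run_with_gpu` is a hypothetical name, the script path is the one from the usage section, and `NVIDIA_VISIBLE_DEVICES` is expected to be set in the caller's environment.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper (not part of this PR).
GPU_SWITCH="${GPU_SWITCH:-/boot/config/plugins/user.scripts/gpu-switch.sh}"

# Usage: NVIDIA_VISIBLE_DEVICES=0 run_with_gpu <container name>
run_with_gpu() {
    local container="$1"

    # "Pre-start": hand the GPU over before the container boots
    "$GPU_SWITCH" start "$container" || return 1

    # Start the container, then block until it exits
    docker start "$container" > /dev/null && docker wait "$container" > /dev/null
    local status=$?

    # "Post-stop": always try to give the GPU back to the host
    "$GPU_SWITCH" stop "$container"
    return "$status"
}
```

The trade-off is that the container has to be started through the wrapper rather than through `docker-compose up -d`, since Compose would not call the switch script on its own.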

@Steel-skull Steel-skull mentioned this pull request Oct 30, 2024
@kroese
Contributor

kroese commented Nov 9, 2024

Very interesting work!! Did you already receive your GPU to test it?

@JosueIsrael-prog

Very good

@maksymdor

Hmm! Interesting

if ! check_gpu_needed; then
    log "Continuing without GPU management"
    exit 0
fi

@vinkay215 vinkay215 Nov 11, 2024

Instead of listing all containers, you can directly check the existence of the container using docker container inspect, which is more efficient since it only checks the specified container without scanning the entire list. Here’s how to replace that line:

if ! docker container inspect "$CONTAINER_NAME" > /dev/null 2>&1; then
    error_exit "Container $CONTAINER_NAME does not exist"
fi

The docker container inspect command returns an error if the container does not exist, so you can use it to directly verify the container’s existence without listing all containers.

Author

@Steel-skull Steel-skull Nov 14, 2024

I'll take a look at implementing this. Thanks for the ideas.

}

# Convert any GPU identifier to PCI address
convert_to_pci_address() {

Incorporating these improvements, here’s the final optimized convert_to_pci_address function:


convert_to_pci_address() {
    local device="$1"
    local gpu_address=""

    if [[ "$device" =~ ^[0-9]+$ || "$device" =~ ^GPU-.*$ ]]; then
        # Convert GPU index or UUID to PCI address
        gpu_address=$(nvidia-smi --id="$device" --query-gpu=gpu_bus_id --format=csv,noheader 2>/dev/null | tr -d '[:space:]')
    else
        # Direct PCI address provided
        gpu_address="$device"
    fi

    # Check for valid output
    if [ -z "$gpu_address" ]; then
        error_exit "Failed to get PCI address for device: $device"
    fi

    # Standardize format
    echo "$gpu_address" | sed -e 's/0000://' -e 's/\./:/g'
}

Author

I'll take a look at implementing this. Thanks for the ideas on this as well.

Author

merged with main

@tl123987

tl123987 commented Nov 12, 2024

share failed? is there something wrong?

@Steel-skull
Author

> Very interesting work!! Did you already receive your GPU to test it?

Sadly, no. The one I ordered from eBay was extremely unstable (it kept crashing my server when I used it with Ollama), so I'm waiting for my money back.

@Steel-skull
Author

> share failed? is there something wrong?

You will have to expand on this; I don't understand.

@tl123987

Looking forward to its completion, thank you. I hope there will be a complete tutorial in the future.

It's really handy man)

@hzxie hzxie left a comment

Syntax error in gpu-switch.sh

local gpu_address=$(convert_to_pci_address "$device")
if [ -z "$gpu_address" ]; then
error_exit "Failed to get PCI address for device: $device"
}

./gpu-switch.sh: line 64: syntax error near unexpected token `}'
./gpu-switch.sh: line 64: `        }'

if [ -z "$gpu_audio_address" ]; then
log "Warning: No audio device found for GPU $gpu_address"
continue
}

./gpu-switch.sh: line 75: syntax error near unexpected token `}'
./gpu-switch.sh: line 75: `        }'
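
Both errors point at `if` blocks that are closed with `}`; in Bash an `if` is closed with `fi`. A minimal sketch of the corrected shape follows, with a stand-in `log` helper and a hypothetical `report_audio_devices` loop, since the surrounding code of the real script is not shown here.

```shell
#!/usr/bin/env bash
# Stand-in for the script's own logging helper (assumed behavior).
log() { echo "GPU-SWITCH: $*"; }

# Hypothetical loop: each argument is "<gpu address>=<audio address>".
report_audio_devices() {
    local pair gpu_address gpu_audio_address
    for pair in "$@"; do
        gpu_address="${pair%%=*}"
        gpu_audio_address="${pair#*=}"
        if [ -z "$gpu_audio_address" ]; then
            log "Warning: No audio device found for GPU $gpu_address"
            continue
        fi  # `fi`, not `}`, closes the `if` block
        log "GPU $gpu_address pairs with audio device $gpu_audio_address"
    done
}
```

Running `bash -n gpu-switch.sh` checks the script's syntax without executing it, which would flag this kind of mistake before the script touches any driver.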

@hzxie

hzxie commented Nov 17, 2024

I got the following error after running gpu-switch.sh.

[root@SLab-Mocap-Server 11 ]$ NVIDIA_VISIBLE_DEVICES=0 ./gpu-switch.sh start windows-11
GPU-SWITCH [2024-11-17 13:49:30]: Warning: No audio device found for GPU 000001:00:0
GPU-SWITCH [2024-11-17 13:49:30]: ERROR: No valid GPU devices found

Here's the output of lspci -v | grep -A 15 " NVIDIA "

01:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] (rev a1) (prog-if 00 [VGA controller])
        Subsystem: ZOTAC International (MCO) Ltd. Device 2503
        Flags: bus master, fast devsel, latency 0, IRQ 133
        Memory at a4000000 (32-bit, non-prefetchable) [size=16M]
        Memory at 90000000 (64-bit, prefetchable) [size=256M]
        Memory at a0000000 (64-bit, prefetchable) [size=32M]
        I/O ports at 5000 [size=128]
        Expansion ROM at a5000000 [virtual] [disabled] [size=512K]
        Capabilities: [60] Power Management version 3
        Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
        Capabilities: [78] Express Legacy Endpoint, IntMsgNum 0
        Capabilities: [100] Virtual Channel
        Capabilities: [250] Latency Tolerance Reporting
        Capabilities: [258] L1 PM Substates
        Capabilities: [128] Power Budgeting <?>
        Capabilities: [420] Advanced Error Reporting
--
01:00.1 Audio device: NVIDIA Corporation TU102 High Definition Audio Controller (rev a1)
        Subsystem: ZOTAC International (MCO) Ltd. Device 2503
        Flags: bus master, fast devsel, latency 0, IRQ 17
        Memory at a5080000 (32-bit, non-prefetchable) [size=16K]
        Capabilities: [60] Power Management version 3
        Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
        Capabilities: [78] Express Endpoint, IntMsgNum 0
        Capabilities: [100] Advanced Error Reporting
        Kernel driver in use: snd_hda_intel
        Kernel modules: snd_hda_intel

01:00.2 USB controller: NVIDIA Corporation TU102 USB 3.1 Host Controller (rev a1) (prog-if 30 [XHCI])
        Subsystem: ZOTAC International (MCO) Ltd. Device 2503
        Flags: fast devsel, IRQ 128
        Memory at a2000000 (64-bit, prefetchable) [size=256K]
        Memory at a2040000 (64-bit, prefetchable) [size=64K]
        Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
        Capabilities: [78] Express Endpoint, IntMsgNum 0
        Capabilities: [b4] Power Management version 3
        Capabilities: [100] Advanced Error Reporting
        Kernel driver in use: xhci_hcd
        Kernel modules: xhci_pci

01:00.3 Serial bus controller: NVIDIA Corporation TU102 USB Type-C UCSI Controller (rev a1)
        Subsystem: ZOTAC International (MCO) Ltd. Device 2503
        Flags: bus master, fast devsel, latency 0, IRQ 130
        Memory at a5084000 (32-bit, non-prefetchable) [size=4K]
        Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
        Capabilities: [78] Express Endpoint, IntMsgNum 0
        Capabilities: [b4] Power Management version 3
        Capabilities: [100] Advanced Error Reporting
        Kernel driver in use: nvidia-gpu
        Kernel modules: i2c_nvidia_gpu

02:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8125 2.5GbE Controller (rev 01)
        Subsystem: ASRock Incorporation Device 8125
        Flags: bus master, fast devsel, latency 0, IRQ 17
        I/O ports at 4000 [size=256]
        Memory at a5320000 (64-bit, non-prefetchable) [size=64K]
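
A possible explanation, offered as a guess rather than a confirmed diagnosis: `nvidia-smi` typically reports the bus ID with an eight-digit domain (`00000000:01:00.0`), and running that through the `sed` expression in `convert_to_pci_address` (`s/0000://` followed by `s/\./:/g`) yields exactly the `000001:00:0` seen in the warning, which does not match the `01:00.1` audio function in the `lspci` output. The sketch below normalizes an address to the `0000:01:00.0` form instead; `normalize_pci_address` and `audio_address_for` are hypothetical names, and treating function 1 as the audio device is an assumption based on the listing above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not taken from gpu-switch.sh).

# Accepts "00000000:01:00.0", "0000:01:00.0", or "01:00.0";
# prints the address with a four-digit domain, e.g. "0000:01:00.0".
normalize_pci_address() {
    local addr="${1,,}"   # lowercase the hex digits
    local re='^(([0-9a-f]{4,8}):)?([0-9a-f]{2}:[0-9a-f]{2}\.[0-7])$'

    [[ "$addr" =~ $re ]] || return 1

    # Default to domain 0000 when none was given
    local domain="${BASH_REMATCH[2]:-0000}"
    # Keep only the last four hex digits of the domain
    printf '%s:%s\n' "${domain: -4}" "${BASH_REMATCH[3]}"
}

# Assumption: the GPU's audio controller is function 1 of the same slot.
audio_address_for() {
    local gpu
    gpu=$(normalize_pci_address "$1") || return 1
    echo "${gpu%.*}.1"
}
```

With the address in this form, the audio device can be looked up directly (for example under `/sys/bus/pci/devices/`) instead of being matched against a reformatted string.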
