Create render group for Ubuntu >= 20, as per ROCm documentation #90
Comments
Had a similar issue when I was building a Docker image with ROCm support.
The Problem
A non-root user can't access the GPU resources and has to run commands as
Groups
A user inside the docker container has to be a member of the
Solution
Using Docker
Bash Script
Create an
#!/bin/bash
sudo groupadd --gid $RENDER_GID render
sudo usermod -aG render $USERNAME
sudo usermod -aG video $USERNAME
exec "$@"
Dockerfile
Inside the Dockerfile we create a new user and copy the entrypoint.sh script to the image. A basic example:
FROM ubuntu
ENV USERNAME=rocm-user
ARG USER_UID=1000
ARG USER_GID=$USER_UID
RUN groupadd --gid $USER_GID $USERNAME \
&& useradd --uid $USER_UID --gid $USER_GID -m $USERNAME \
&& echo $USERNAME ALL=\(root\) NOPASSWD:ALL > /etc/sudoers.d/$USERNAME \
&& chmod 0440 /etc/sudoers.d/$USERNAME
COPY entrypoint.sh /tmp
RUN chmod 777 /tmp/entrypoint.sh
USER $USERNAME
ENTRYPOINT ["/tmp/entrypoint.sh"]
CMD ["/bin/bash"]
docker build -t rocm-image .
Terminal
When starting the container pass the
export RENDER_GID=$(getent group render | cut -d: -f3) && docker run -it --device=/dev/kfd --device=/dev/dri -e RENDER_GID --group-add $RENDER_GID rocm-image /bin/bash
VS Code Devcontainer
Just add the following code to
{
"build": { "dockerfile": "./Dockerfile" },
"overrideCommand": false,
"initializeCommand": "echo \"RENDER_GID=$(getent group render | cut -d: -f3)\" > .devcontainer/devcontainer.env",
"containerEnv": { "HSA_OVERRIDE_GFX_VERSION": "10.3.0" },
"runArgs": [
"--env-file=.devcontainer/devcontainer.env",
"--device=/dev/kfd",
"--device=/dev/dri"
]
}
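One gap in the entrypoint script above is that groupadd fails when RENDER_GID is unset, or when a group in the image already owns that GID. A minimal sketch of a more defensive variant follows. The names group_name_for_gid, ensure_render_group, and main are hypothetical helpers, and the sketch assumes, like the original, that sudo is installed in the image and that RENDER_GID and USERNAME are provided by the caller.

```shell
#!/bin/bash
# Sketch only: a more defensive take on entrypoint.sh.
# Assumes sudo is available and RENDER_GID / USERNAME come from the environment.

# Print the name of the group that owns the given GID, reading an
# /etc/group-style file (hypothetical helper, not part of ROCm or Docker).
group_name_for_gid() {
    local gid="$1" group_file="${2:-/etc/group}"
    awk -F: -v gid="$gid" '$3 == gid { print $1; exit }' "$group_file"
}

# Make sure the user belongs to a group that carries the host's render GID.
ensure_render_group() {
    local gid="$1" user="$2" name
    name="$(group_name_for_gid "$gid")"
    if [ -z "$name" ]; then
        # No group owns this GID yet, so create "render" with it.
        # This still fails if "render" already exists with a different GID.
        name=render
        sudo groupadd --gid "$gid" "$name"
    fi
    sudo usermod -aG "$name" "$user"
}

# Entry point: set up groups, then hand over to the container command.
main() {
    if [ -n "${RENDER_GID:-}" ]; then
        ensure_render_group "$RENDER_GID" "$USERNAME"
    else
        echo "RENDER_GID is not set; skipping render group setup" >&2
    fi
    sudo usermod -aG video "$USERNAME"
    exec "$@"
}
```

To use it as the entrypoint, a final line calling main "$@" would be appended to the script. Reusing an existing group name for the GID avoids a failure when the base image already assigns that GID to another group.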
On one of our machines GID of
Hi @romintomasetti @sergejcodes, thank you for both reporting this issue and providing a detailed solution to the problem. This has been addressed in our newer images by defaulting to a root user in order to maintain access to GPU resources. Please let me know if we can close out this issue.
@harkgill-amd in many cases, clusters (Kubernetes) have security policies that prevent containers from running as root. This limitation will prevent MANY companies from being able to use AMD GPUs for their AI workloads.
In Kubernetes, this is likely something that your https://github.com/ROCm/k8s-device-plugin can resolve by checking the host's
However, I feel there must be a cleaner solution, because Nvidia GPUs have no such problems, either on local docker or with their Kubernetes device plugin. I would check what they are doing, but it might be something like they have every device mount owned by a constant GID (e.g.
Here is the related issue on the AMD Device Plugin repo: ROCm/k8s-device-plugin#39
Also, for context, when using Nvidia GPUs, you don't mount them with the
For reference, here is information about the
Although, I guess the real question is why AMD ever thought it was a good idea to not have a static GID for the
Hi @harkgill-amd, I don't think this should be closed, as the inherent problem with using a non-root user still exists and there isn't a clean solution for it.
@thesuperzapper and @gigabyte132, thank you for the feedback. We are currently exploring the possibility of using
This configuration grants users read and write access to AMD GPU resources. From there, you can pass access to these devices into a container by specifying
I ran this setup with the
@harkgill-amd while changing the permissions on the host might work, I will note that this does not seem to be required for Nvidia GPUs.
I imagine that this is because they mount the device paths specifically because
Because specifying each device is obviously a pain for end users, they added a custom
I also want to highlight the differences between the Kubernetes Device Plugin for AMD and Nvidia, as this is where most people are using lots of GPUs, and the permission issues also occur on AMD but not Nvidia:
@harkgill-amd after a lot of testing, it seems like the major container runtimes (including docker and containerd) don't actually change the permissions of devices mounted with
For example, you would expect the following command to mount
docker run --device /dev/dri/card1 ubuntu ls -la /dev/dri
# OUTPUT:
# total 0
# drwxr-xr-x 2 root root 60 Oct 24 18:52 .
# drwxr-xr-x 6 root root 360 Oct 24 18:52 ..
# crw-rw---- 1 root 110 226, 1 Oct 24 18:52 card1
This also seemingly happens on Kubernetes despite the AMD Device plugin requesting that the container be given
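In the listing above, card1 is owned by root with group 110 and mode crw-rw----, so a non-root process can only open it if GID 110 is among its groups. A minimal sketch of that membership check, runnable inside the container, is shown below. The function names gid_in_list and check_device_group are hypothetical, and the sketch assumes GNU stat, whose -c '%g' format prints the owning GID. It checks group membership only, not the mode bits.

```shell
# Return success if the GID in $1 appears in the space-separated list in $2,
# for example the output of `id -G`.
gid_in_list() {
    local wanted="$1" g
    for g in $2; do
        [ "$g" = "$wanted" ] && return 0
    done
    return 1
}

# Report whether the current user is in the group that owns a device node.
# Returns 2 if the path cannot be inspected, 1 if the group is missing.
check_device_group() {
    local dev="$1" dev_gid
    dev_gid="$(stat -c '%g' "$dev")" || return 2
    if gid_in_list "$dev_gid" "$(id -G)"; then
        echo "$dev: group $dev_gid is among your groups"
    else
        echo "$dev: group $dev_gid is missing; try --group-add $dev_gid"
        return 1
    fi
}
```

For example, check_device_group /dev/dri/card1 could be run as the non-root user to see which GID would need to be passed with --group-add.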
@harkgill-amd We need to find a generic solution which allows a non-root container to be run on any server (with a default install of AMD drivers).
This is problematic because there is no standard GID for the
Note, it seems like ubuntu has a default udev rule under
Possible solutions
Initial issue
As stated in https://rocmdocs.amd.com/en/latest/Installation_Guide/Installation_new.html#setting-permissions-for-groups, for Ubuntu 20 and above, the user needs to be part of the render group.
Therefore, we need to create the render group in the docker image. The following would work:
RUN groupadd render
We might also want to update the documentation because the docker run command should contain --group-add render for Ubuntu 20 and above.
Update - 10th June 2022
I ran the following experiments. The user I'm logged in as on the host is part of the render group. My user ID is 1002.
works because it runs as root (with user ID 0 on the host) and
will not work with Unable to open /dev/kfd read-write: Permission denied.
will not work because inside of rocm/dev-ubuntu-20.04:5.1 there is no render group.
docker run --rm --user=1002 --group-add $(getent group render | cut -d':' -f 3) --device=/dev/kfd rocm/dev-ubuntu-20.04:5.1 rocminfo will work again.
Therefore, I see 2 ways of fixing this.
ARG) but the image would not be portable.
--group-add $(getent group render | cut -d':' -f 3).
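The --group-add $(getent group render | cut -d':' -f 3) approach passes an empty value to docker when the host has no render group, which fails with an unhelpful error. A minimal sketch of a wrapper that fails early instead is shown below. The function names gid_from_group_entry and rocm_run_args are hypothetical, and the user ID and image tag in the usage line are simply the ones from the experiments above.

```shell
# Extract the GID (third colon-separated field) from a getent/group entry
# such as "render:x:110:alice".
gid_from_group_entry() {
    printf '%s\n' "$1" | cut -d: -f3
}

# Print docker run arguments for a non-root user with access to /dev/kfd.
# Fails with a clear message if the host has no render group.
rocm_run_args() {
    local uid="$1" entry gid
    entry="$(getent group render)" || {
        echo "no render group on this host" >&2
        return 1
    }
    gid="$(gid_from_group_entry "$entry")"
    echo "--user=$uid --group-add $gid --device=/dev/kfd"
}
```

Usage would look like: docker run --rm $(rocm_run_args 1002) rocm/dev-ubuntu-20.04:5.1 rocminfo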