Mamba Server Setup

This guide gives a quick overview of how to install everything required for the Mamba Server.

Installation

  1. Install Fedora Server

  2. Install NVidia drivers

    # https://www.reddit.com/r/Fedora/comments/12ju2sg/i_need_help_with_installing_nvidia_drivers_to/
    sudo dnf install https://mirrors.rpmfusion.org/free/fedora/rpmfusion-free-release-$(rpm -E %fedora).noarch.rpm https://mirrors.rpmfusion.org/nonfree/fedora/rpmfusion-nonfree-release-$(rpm -E %fedora).noarch.rpm
    sudo dnf update -y
    sudo dnf install akmod-nvidia -y
    sudo dnf install xorg-x11-drv-nvidia-cuda -y
    sudo reboot now
    
  3. Alternative: Install NVidia drivers via dnf module

    Link to the official install procedure supported by NVidia:
    https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#fedora

    sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/$distro/x86_64/cuda-$distro.repo
    

    Replace $distro with the latest available version matching the server, currently fedora39.

    Remark: NVidia is often late in updating the version number of their repository. For example, the current version of Fedora is 40 while the latest repo version is fedora39. Although the repository version is older, there is no issue installing from it until a newer one is made available.
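
    To check which Fedora version the server is running, you can query the same rpm macro used in the repository URLs above:

    rpm -E %fedora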

    sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/fedora39/x86_64/cuda-fedora39.repo
    sudo dnf makecache
    

    cuda-fedora39-x86_64 should appear in the list of enabled repositories, along with an entry in /etc/yum.repos.d/cuda-fedora39.repo.
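
    You can confirm this with, for example:

    dnf repolist --enabled | grep -i cuda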

    Next, we install the NVidia driver and CUDA toolkit via the now-available module. Choose the appropriate version, and between the closed-source and open-source variants of the module. This installs a dkms driver that is automatically rebuilt with every kernel update.
    Choose between the latest version of the driver, latest-dkms, which will be updated with dnf update, or pin a specific version, for example 555-dkms. Do the same for the CUDA toolkit via the meta-package cuda-toolkit, or target a specific version of CUDA.

    If a driver was previously installed via RPM Fusion (akmod), remove everything first.

    sudo dnf autoremove akmod-nvidia xorg-x11-drv-nvidia-*
    

    Then install the new driver via its module.

    sudo dnf module list
    sudo dnf module install nvidia-driver:latest-dkms
    

    Check that the dkms module is built successfully for all installed kernels.

    $ sudo dkms status 
    nvidia/555.42.02, 6.8.10-300.fc40.x86_64, x86_64: installed
    

    Finally, proceed with the installation of the CUDA toolkit.

    sudo dnf install cuda-toolkit
    

    Select the default cuda version.

    sudo update-alternatives --display cuda
    sudo update-alternatives --config cuda
    

    Set the PATH for nvcc and the other utilities in either
    ~/.bashrc or ~/.bash_profile (per user), or
    /etc/environment or /etc/profile.d/cuda.sh (for all users).

    export CUDACXX=/usr/local/cuda/bin/nvcc
    export PATH=/usr/local/cuda/bin${PATH:+:${PATH}}
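
    For a system-wide setup, a minimal sketch (using /etc/profile.d/cuda.sh as the target path) is:

    printf 'export CUDACXX=/usr/local/cuda/bin/nvcc\nexport PATH=/usr/local/cuda/bin${PATH:+:${PATH}}\n' | sudo tee /etc/profile.d/cuda.sh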
    

    Make the driver persistent

    sudo systemctl enable nvidia-persistenced.service
    sudo systemctl start nvidia-persistenced.service
    

    Before rebooting, update the initial ramdisk.

    sudo dracut -f
    sudo reboot now
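
    After the reboot, verify that the driver and toolkit work, for example:

    nvidia-smi
    nvcc --version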
    
  4. Install Munge

    export MUNGEUSER=1111
    sudo groupadd -g $MUNGEUSER munge
    sudo useradd  -m -c "MUNGE Uid 'N' Gid Emporium" -d /var/lib/munge -u $MUNGEUSER -g munge  -s /sbin/nologin munge
    export SLURMUSER=1121
    sudo groupadd -g $SLURMUSER slurm
    sudo useradd  -m -c "SLURM workload manager" -d /var/lib/slurm -u $SLURMUSER -g slurm  -s /bin/bash slurm
    sudo dnf install munge munge-devel munge-libs -y
    
    sudo dnf install rng-tools -y
    sudo rngd -r /dev/urandom
    
    sudo mungekey
    sudo chown munge:munge /etc/munge/munge.key
    sudo chmod 400 /etc/munge/munge.key
    sudo chown -R munge: /etc/munge/ /var/log/munge/
    sudo systemctl enable munge.service
    sudo systemctl start munge.service
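
    To verify that MUNGE works, generate and validate a credential locally:

    munge -n | unmunge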
    
  5. Install Slurm

    sudo dnf install openssl openssl-devel pam-devel numactl numactl-devel hwloc hwloc-devel lua lua-devel readline-devel rrdtool-devel ncurses-devel man2html libibmad libibumad -y
    sudo dnf install libcgroup libcgroup-tools libcgroup-devel mariadb mariadb-devel mariadb-server -y
    sudo dnf install autoconf automake perl -y
    sudo dnf install dbus-devel -y
    
    # Build Slurm RPM
    sudo su
    cd
    wget https://download.schedmd.com/slurm/slurm-24.05.0-0rc1.tar.bz2
    rpmbuild -ta slurm-24.05.0-0rc1.tar.bz2
    cd rpmbuild/RPMS/x86_64/
    dnf --nogpgcheck localinstall *.rpm -y
    # To reinstall if you recompile
    dnf --nogpgcheck reinstall *.rpm -y
    
  6. Configure Slurm: Copy the configs from norlab-ulaval/dotfiles-mamba-server to /etc/slurm

  7. Ensure the permissions are correct

    mkdir /var/spool/slurmctld
    chown slurm: /var/spool/slurmctld
    chmod 755 /var/spool/slurmctld
    touch /var/log/slurmctld.log
    chown slurm: /var/log/slurmctld.log
    touch /var/log/slurm_jobacct.log /var/log/slurm_jobcomp.log
    chown slurm: /var/log/slurm_jobacct.log /var/log/slurm_jobcomp.log
    
  8. Check that slurm is correctly configured: slurmd -C
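
    slurmd -C prints the hardware slurmd detects on the node; it should match the node definition in slurm.conf. The output has the form (values here are illustrative):

    $ slurmd -C
    NodeName=mamba CPUs=64 Boards=1 SocketsPerBoard=1 CoresPerSocket=32 ThreadsPerCore=2 RealMemory=257000
    UpTime=0-01:23:45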

  9. Start the services

    systemctl enable slurmd.service
    systemctl start slurmd.service
    systemctl status slurmd.service
    
    systemctl enable slurmctld.service
    systemctl start slurmctld.service
    systemctl status slurmctld.service
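
    Once both services are running, the node should appear in the cluster:

    sinfo
    scontrol show node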
    
  10. Set up accounting

    systemctl enable mariadb.service
    systemctl start mariadb.service
    
    # Inspired by: https://github.com/Artlands/Install-Slurm/blob/master/README.md#setting-up-mariadb-database-master
    mysql
    # Change the password in the following line
    > CREATE USER 'slurm'@'localhost' IDENTIFIED BY '${DB_USER_PASSWORD}';
    > CREATE DATABASE slurm_acct_db;
    > GRANT ALL ON slurm_acct_db.* TO 'slurm'@'localhost' WITH GRANT OPTION;
    > SHOW VARIABLES LIKE 'have_innodb';
    > FLUSH PRIVILEGES;
    > quit;
    
    # Verify you can login
    mysql -p -u slurm
    
  11. Copy /etc/my.cnf.d/innodb.cnf from norlab-ulaval/dotfiles-mamba-server

  12. Restart mariadb

    systemctl stop mariadb
    # Move the old InnoDB log files away so they are recreated with the new size
    mv /var/lib/mysql/ib_logfile? /tmp/
    systemctl start mariadb
    
  13. Ensure the ownership of the Slurm files is correct

    # In /etc/slurm
    chown slurm slurmdbd.conf
    touch /var/log/slurmctld.log
    chown slurm /var/log/slurmctld.log
    chown slurm slurm*
    
  14. Check that slurmdbd can start correctly using slurmdbd -D -vvv

  15. Start the services

    systemctl enable slurmdbd
    systemctl start slurmdbd
    systemctl status slurmdbd
    
    systemctl enable slurmctld.service
    systemctl start slurmctld.service
    systemctl status slurmctld.service
    
  16. Add accounts

    sudo sacctmgr add account norlab Description="Norlab mamba-server" Organization=norlab
    sacctmgr add user wigum Account=norlab
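
    You can verify the result with:

    sacctmgr list account
    sacctmgr list user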
    
  17. If the Slurm services crash at startup, add the following lines to each Slurm service (slurmctld, slurmdbd and slurmd) using systemctl edit slurmXXX.service

    [Service]
    Restart=always
    RestartSec=5s
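
    systemctl edit reloads systemd automatically; you can inspect the merged unit and restart the service with, for example:

    systemctl cat slurmd.service
    systemctl restart slurmd.service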
    
  18. Set up the LVM: resize the logical volume.

    lvextend -l +100%FREE /dev/fedora/root
    xfs_growfs /dev/fedora/root
    # Verify the filesystem now uses all the available space
    lsblk -f
    
  19. Install nvidia-container-toolkit and set it up for container use (CDI support)

    sudo dnf install dkms
    curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | \
         sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
    sudo dnf config-manager --enable nvidia-container-toolkit-experimental
    sudo dnf install nvidia-container-toolkit -y
    sudo sed -i 's/^#no-cgroups = false/no-cgroups = true/;' /etc/nvidia-container-runtime/config.toml
    
    # Create a systemd service to run the following line at startup
    sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
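
    As a sketch, such a startup service could look like this (the unit name nvidia-cdi-generate.service and the binary path are assumptions):

    # /etc/systemd/system/nvidia-cdi-generate.service
    [Unit]
    Description=Generate the NVIDIA CDI specification

    [Service]
    Type=oneshot
    ExecStart=/usr/bin/nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

    [Install]
    WantedBy=multi-user.target

    Enable it so the spec is regenerated at every boot:

    sudo systemctl daemon-reload
    sudo systemctl enable nvidia-cdi-generate.service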
    
    sudo reboot now
    
  20. Verify that the network interface is correctly configured

    sudo dnf install speedtest-cli -y
    speedtest-cli --secure
    # Should report approximately 1000 Mbit/s
    

    If not, plug the cable into the other Ethernet port and enable the corresponding interface

    # /etc/NetworkManager/system-connections/enp68s0.nmconnection
    # Enable it with
    autoconnect=true
    
  21. Install a Docker version that supports the buildx plugin (see the official instructions for installing Docker on Fedora)

Add a new user

sudo useradd -c 'Full name' -m -G docker <username>
sudo passwd <username>

We also recommend setting the following env variable in their .bashrc:

export SQUEUE_FORMAT="%.18i %.9P %.25j %.8u %.2t %.10M %.6D %.20e %b %.8c"

Cronjob to clean podman cache

Add the following cronjob to sudo crontab -u root -e:

# Runs every Sunday at 03:00; adjust the schedule as needed
0 3 * * 0 cat /etc/passwd | grep /bin/bash | awk -F: '{ print $1}' | while read user; do echo "Processing user $user..." && sudo -u $user -H bash -c "cd && podman system prune -af"; done

Running jobs

As users do not have root access on the Mamba Server, every project should be run in a container. We recommend using podman.

First, on your host machine, write a Dockerfile to run your project inside a container. Then, build and test that everything works on your machine before testing it on the server.

We recommend putting your data in a directory and symlinking it to the data folder of your project. The commands below show how to mount volumes to avoid copying the data into the container.
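
For example, assuming your datasets live under /data (paths here are illustrative):

ln -s /data/coco ~/myproject/data/coco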

# Build the image
buildah build --layers -t myproject .

# Run docker image
export CONFIG=path/to/config # for example `config/segsdet.yaml`
export CUDA_VISIBLE_DEVICES=0 # or `0,1` for specific GPUs, will be automatically set by SLURM

podman run --gpus all --rm -it --ipc host \
  -v .:/app/ \
  -v /app/data \
  -v ./data/coco/:/app/data/coco \
  -v /dev/shm:/dev/shm \
  myproject bash -c "python3 tools/train.py $CONFIG --gpu $CUDA_VISIBLE_DEVICES"

After you've verified everything works on your machine, copy the code to the server and write a Slurm job script.

#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=16
#SBATCH --time=10-00:00
#SBATCH --job-name=$NAME
#SBATCH --output=%x-%j.out

cd ~/myproject || exit
buildah build --layers -t myproject .

export CONFIG=path/to/config # for example `config/segsdet.yaml`

# Notice there is no -it option
podman run --gpus all --rm --ipc host \
  -v .:/app/ \
  -v /app/data \
  -v ./data/coco/:/app/data/coco \
  -v /dev/shm:/dev/shm \
  myproject bash -c "python3 tools/train.py $CONFIG --gpu $CUDA_VISIBLE_DEVICES"

Then, you can queue the job using sbatch job.sh and see the queued jobs using squeue. For an easier experience, you can use willGuimont/sjm. After you've verified this works, use the following code to kill the container when the slurm job stops.

# Notice the -d option to detach the process, and no -it option
container_id=$(
  podman run --gpus all --rm -d --ipc host \
  -v .:/app/ \
  -v /app/data \
  -v ./data/coco/:/app/data/coco \
  -v /dev/shm:/dev/shm \
  myproject bash -c "python3 tools/train.py $CONFIG --gpu $CUDA_VISIBLE_DEVICES"
)

stop_container() {
  podman logs $container_id
  podman container stop $container_id
}

trap stop_container EXIT
echo "Container ID: $container_id"
podman wait $container_id

You can then run the job using:

sbatch job.sh

And see the running jobs using:

squeue

Remote connection

  1. SSH and X11 forwarding with GLX support

    First, make sure X11 forwarding is enabled server-side. Check that these two lines are set in /etc/ssh/sshd_config

    X11Forwarding yes
    X11DisplayOffset 10
    

    If they were not, enable them and restart sshd

    sudo systemctl restart sshd
    

    Make sure xauth is available on the server or install it

    sudo dnf install xorg-x11-xauth
    

    Next install basic utilities to test GLX and Vulkan capabilities on the server. We'll need them to benchmark the remote connection's performance.

    sudo dnf install glx-utils vulkan-tools
    

    If you encounter problems, make sure that on the client side the server is allowed to display, and that the IP of the server is valid.
    Use + to add a host to the trusted list, - to remove it.

    xhost + 132.203.26.231
    

    Connect to the server from your client using ssh.
    Use the -X or -Y option to redirect X11 through the ssh tunnel. The redirection works even though the server is headless, but Xorg must be installed.
    The -X option will automatically update the DISPLAY env variable. Note: the IP of the server is subject to change; make sure you have the latest one.

    ssh -X user@132.203.26.231
    

    Test that X redirection is working by executing a simple X graphical application.

    $ xterm
    

    Test GLX support with glxinfo

    glxinfo
    

    Test what GLX implementation is used by default

    $ glxinfo | grep -i vendor
    server glx vendor string: SGI
    client glx vendor string: Mesa Project and SGI
        Vendor: Mesa (0xffffffff)
    OpenGL vendor string: Mesa
    

    Check that both the NVidia and Mesa implementations work for GLX passthrough.

    __GLX_VENDOR_LIBRARY_NAME=nvidia glxinfo | grep -i vendor
    __GLX_VENDOR_LIBRARY_NAME=mesa glxinfo | grep -i vendor
    

    Choose the best implementation between NVidia and Mesa.
    On NVidia GPUs, NVidia's implementation gives the best results.

    export __GLX_VENDOR_LIBRARY_NAME=nvidia
    glxgears
    

    For Vulkan applications the process is similar

    vulkaninfo
    VK_DRIVER_FILES="/usr/share/vulkan/icd.d/nvidia_icd.x86_64.json" vkcube
    
