A COTS GPUDirect RDMA data transer demo between two nodes (Back-to-Back) using Mellanox ConnectX-6 Dx and Nvidia Quadro M4000 GPU's over UCX.
Clone the project:
git clone https://github.com/acoustic-warfare/data-transfer.git
Move to the project folder:
cd data-transfer
1. NICs
- ConnectX-6 Lx
- ConnectX-6 Dx (Ours)
- ConnectX-6
- ConnectX-5
- ConnectX-4 Lx
2. GPUs
Nvidia:
- Tesla™ (any)
- Quadro™ K-Series
- Quadro™ P-Series (Ours)
-
NIC In order to use the network interface cards, the package NVIDIA Firmware Tools:
mft
must be installed for firmware management together with correct drivers for LinuxMLNX_OFED
. -
GPU In order to utilize GPUDirect RDMA, the package
nvidia-peer-mem
must be installed. MLNX_OFED 5.1 -
Data Transfer To be able to control the host channel adapter (HCA), the HPC networking library
ucx
is required with support for GPUDirect RDMA.
NVIDIA Drivers
It is preffered to install display drivers using the distribution's native package management tool, i.e apt
. If not installed already, NVIDIA display drivers can be installed from NVIDIA Download Center.
1. CUDA Runtime and Toolkit
To install CUDA Toolkit CUDA 11.7
from Nvidia on Ubuntu 20.04:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/11.7.0/local_installers/cuda-repo-ubuntu2004-11-7-local_11.7.0-515.43.04-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2004-11-7-local_11.7.0-515.43.04-1_amd64.deb
sudo cp /var/cuda-repo-ubuntu2004-11-7-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda
2. MLNX_OFED NIC Firmware (v2.1-x.x.x or later)
To install Linux drivers for ethernet and infiniband adapters, MLNX_OFED
. See Download Center or for Ubuntu 20.04:
wget http://www.mellanox.com/downloads/ofed/MLNX_OFED-5.6-2.0.9.0/MLNX_OFED_LINUX-5.6-2.0.9.0-ubuntu20.04-x86_64.tgz
tar -xvf MLNX_OFED_LINUX*
cd MLNX_OFED_LINUX*
sudo ./mlnxofedinstall --upstream-libs --dpdk --force
sudo /etc/init.d/openibd restart
3. GPUDirect RDMA
To install nvidia-peer-mem
on Ubuntu 20.04:
git clone https://github.com/Mellanox/nv_peer_memory.git
cd nv_peer_memory
./build_module.sh
cd /tmp
tar xzf /tmp/nvidia-peer-memory_*
cd nvidia-peer-memory-*
dpkg-buildpackage -us -uc
sudo dpkg -i /tmp/nvidia-peer-memory_*.deb
sudo service nv_peer_mem restart
3.1 GDRCopy (OPTIONAL)
To install gdrcopy
git clone https://github.com/NVIDIA/gdrcopy.git
cd gdrcopy
sudo apt install check libsubunit0 libsubunit-dev
mkdir final
make prefix=final CUDA="$CUDA_HOME" all install
3.2 Performance Benchmark (OPTIONAL)
sudo apt update -y
sudo apt install -y libpci-dev libibumad
git clone https://github.com/linux-rdma/perftest.git
cd perftest
./autogen.sh && ./configure CUDA_H_PATH=/usr/local/cuda/include/cuda.h && make -j
sudo make install
4. UCX
To install ucx
on Ubuntu 20.04, python bindings are required together with python3+ packages. To maintain a working environment, installing conda is highly recomended.
4.1 Conda
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86.sh
bash Miniconda3-latest-Linux-x86.sh
Disable auto-init
conda config --set auto_activate_base false
Recreate an environment inside data-transfer
conda env create -f environment.yml
Activate and enter the newly created environment
conda activate data-transfer
...
(data-transfer) $
4.2 Development packages for the program
Ensure that the latest updates are installed.
sudo apt update -y
Install development packages in Ubuntu 20.04. libnuma-dev
, cython3
.
sudo apt install -y libnuma-dev cython3
4.3 Install dependencies
Some python modules require more dependencies to run.
pip install pynvml cpython
To build the project requires having the gcc
compiler.
sudo apt install gcc cmake
The adapters need assigned IP addresses, we used the GNOME Network Manager GUI and assigned the IPv4 addresses 10.0.0.x
format with netmask 255.255.255.0
To run the demo, make sure both network adapters are connected and working properly by performing a ping. It is important to specify the correct interface when pinging:
ping -I <LOCAL_INTERFACE> <REMOTE_ADDRESS>
Eg:
(data-transfer) scarecrow@node1:~/data-transfer$ ping -I ens4f0np0 10.0.0.4
PING 10.0.0.4 (10.0.0.4) from 10.0.0.3 ens4f0np0: 56(84) bytes of data.
64 bytes from 10.0.0.4: icmp_seq=1 ttl=64 time=0.110 ms
64 bytes from 10.0.0.4: icmp_seq=2 ttl=64 time=0.112 ms
64 bytes from 10.0.0.4: icmp_seq=3 ttl=64 time=0.109 ms
^C
--- 10.0.0.4 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2029ms
rtt min/avg/max/mdev = 0.109/0.110/0.112/0.001 ms
To find out which interface to use:
(data-transfer) scarecrow@node1:~/data-transfer$ ifconfig
eno1: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
ether 54:bf:64:6a:91:31 txqueuelen 1000 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
device interrupt 16 memory 0x92f00000-92f20000
# The one we used: Mellanox ConnectX-6 Dx (first port)
-----------------
-> ens4f0np0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 10.0.0.3 netmask 255.255.255.0 broadcast 10.0.0.255
inet6 fe80::4e4d:abee:c38d:4b6f prefixlen 64 scopeid 0x20<link>
ether 10:70:fd:60:c1:cc txqueuelen 1000 (Ethernet)
RX packets 1021 bytes 63776 (63.7 KB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 841 bytes 52495 (52.4 KB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
-----------------
ens4f1np1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 10.0.0.1 netmask 255.255.255.0 broadcast 10.0.0.255
inet6 fe80::40ba:65b6:cb4e:4521 prefixlen 64 scopeid 0x20<link>
ether 10:70:fd:60:c1:cd txqueuelen 1000 (Ethernet)
RX packets 36 bytes 4419 (4.4 KB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 23 bytes 3072 (3.0 KB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10<host>
loop txqueuelen 1000 (Local Loopback)
RX packets 1782 bytes 168796 (168.7 KB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 1782 bytes 168796 (168.7 KB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
wlx04421a4d9e71: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 81.236.101.159 netmask 255.255.254.0 broadcast 81.236.101.255
inet6 fe80::7207:baa6:6cea:6bd3 prefixlen 64 scopeid 0x20<link>
ether 04:42:1a:4d:9e:71 txqueuelen 1000 (Ethernet)
RX packets 66840 bytes 48821412 (48.8 MB)
RX errors 0 dropped 7 overruns 0 frame 0
TX packets 61150 bytes 44974817 (44.9 MB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
If the connection is working building can be started
Building the demo will create five executable files: run
, web_gauge
, fpga_emulator
, transfer_real
and transfer_synthetic
.
run
is the main data-transfer program.fpga_emulator
is a program to emulate incoming datastream (data generator) to send from one GPU to the other.web_gauge
is a web GUI showing a speedometer of the current transfer rate.transfer_real
is a demo utilizing incoming data from a socket, eg.fpga_emulator
and displays the bandwidth toweb_gauge
.transfer_synthetic
showcase the maximum internal transferspeed within a program, ie. the program creates an array of ones, default is2^26
bytes and writes to a remote region as fast as possible. The programweb_gauge
is also used to display the results.
To build all programs:
make
To only build fpga_emulator
:
make fpga
To only build web_gauge
:
make web_gauge
To cleanup and removal of executables:
make clean
Running the demo requires that both computers are connected via infiniband and correct software and modules are installed and running.
Start the receiving end (Will receive a continuous flow of data)
./run --server -a <ADDRESS> -p <PORT>
Start the receiving end (Will receive a continuous flow of data)
./run --client -a <SERVER_ADDRESS> -p <PORT>
Start the FPGA device sending data to a UDP socket or run the the emulator:
./fpga_emulator -a "localhost" -p <PORT>
Start the data-transfer demo:
./web_gauge --web-host <IP> -w <PORT> -a "localhost" -p 30000
View the running application running on specified IP
and PORT
:
firefox http://IP:PORT/
If the network adapters stops working after a period of time can be a result of insufficient cooling. The ConnectX cards require continuous cooling, we found that after exceeding 110°C, the modules were being unloaded from the system mlx5_core
followed by many warnings when shown with dmesg
. After exceeding the 120°C mark, the cards were physically shutdown by an onboard safety mechanism resulting in reloading the kernel modules was impossible and required a complete restart of the system.
From Nvidia https://docs.nvidia.com/networking/display/ConnectX5EN/Thermal+Sensors
The adapter card incorporates the ConnectX IC which operates in the range of temperatures between 0C and 105C.
There are three thermal threshold definitions which impact the overall system operation state:
Warning – 105°C: On managed systems only: When the device crosses the 100°C threshold, a Warning Threshold message will be issued by the management SW, indicating to system administration that the card has crossed the Warning threshold. Note that this temperature threshold does not require nor lead to any action by hardware (such as adapter card shutdown).
Critical – 115°C: When the device crosses this temperature, the firmware will automatically shut down the device.
Emergency – 130°C: In case the firmware fails to shut down the device upon crossing the Critical threshold, the device will auto-shutdown upon crossing the Emergency (130°C) threshold.
The card's thermal sensors can be read through the system’s SMBus. The user can read these thermal sensors and adapt the system airflow in accordance with the readouts and the needs of the above-mentioned IC thermal requirements.
To check temperature of the cards install the
To find the cards:
(data-transfer) scarecrow@scarecrow:~/data-transfer$ lspci | grep Mellanox
04:00.0 Ethernet controller: Mellanox Technologies MT2892 Family [ConnectX-6 Dx]
04:00.1 Ethernet controller: Mellanox Technologies MT2892 Family [ConnectX-6 Dx]
To probe 04:00.0
(requires root privileges):
sudo mget_temp -d 04:00.0
53
If during runtime the following error occurs a solution is to use smaller buffers. We are not entirely sure how this issue is emerging. We found that a buffer size greater than >115MB
would result in the following errors:
(data-transfer) scarecrow@scarecrow:~/data-transfer$ ./transfer_synthetic --transmitter -p 12377 -sp 45004 -a 10.0.0.4 --n-bytes 1000000000
[1658921155.651848] [scarecrow:15571:0] ib_log.c:254 UCX ERROR ibv_reg_mr(address=0x7fe574000000, length=1000000000, access=0xf) failed: Bad address
[1658921155.651869] [scarecrow:15571:0] ucp_mm.c:159 UCX ERROR failed to register address 0x7fe574000000 mem_type bit 0x2 length 1000000000 on md[3]=mlx5_0: Input/output error (md reg_mem_types 0x3)
[1658921155.651874] [scarecrow:15571:0] ucp_request.c:519 UCX ERROR failed to register user buffer datatype 0x8 address 0x7fe574000000 len 1000000000: Input/output error
[scarecrow:15571:0:15571] rndv.c:2378 Assertion `status == UCS_OK' failed
==== backtrace (tid: 15571) ====
0 /home/scarecrow/miniconda3/envs/data-transfer/lib/python3.7/site-packages/ucp/_libs/../../../../libucs.so.0(ucs_handle_error+0x2dc) [0x7fe71a27d7cc]
1 /home/scarecrow/miniconda3/envs/data-transfer/lib/python3.7/site-packages/ucp/_libs/../../../../libucs.so.0(ucs_fatal_error_message+0xb8) [0x7fe71a27a708]
2 /home/scarecrow/miniconda3/envs/data-transfer/lib/python3.7/site-packages/ucp/_libs/../../../../libucs.so.0(ucs_fatal_error_format+0x114) [0x7fe71a27a824]
3 /home/scarecrow/miniconda3/envs/data-transfer/lib/python3.7/site-packages/ucp/_libs/../../../../libucp.so.0(ucp_rndv_rtr_handler+0x608) [0x7fe71a331298]
4 /home/scarecrow/miniconda3/envs/data-transfer/lib/python3.7/site-packages/ucp/_libs/../../../../ucx/libuct_ib.so.0(+0x3eecd) [0x7fe6ef785ecd]
5 /home/scarecrow/miniconda3/envs/data-transfer/lib/python3.7/site-packages/ucp/_libs/../../../../libucp.so.0(ucp_worker_progress+0x6a) [0x7fe71a2fecca]
6 /home/scarecrow/miniconda3/envs/data-transfer/lib/python3.7/site-packages/ucp/_libs/ucx_api.cpython-37m-x86_64-linux-gnu.so(+0x290ba) [0x7fe71a3b70ba]
7 /home/scarecrow/miniconda3/envs/data-transfer/lib/libpython3.7m.so.1.0(_PyMethodDef_RawFastCallDict+0x9c) [0x7fe71ea68d7c]
8 /home/scarecrow/miniconda3/envs/data-transfer/lib/libpython3.7m.so.1.0(+0x1ed8ad) [0x7fe71ea698ad]
9 /home/scarecrow/miniconda3/envs/data-transfer/lib/libpython3.7m.so.1.0(PyObject_Call+0x64) [0x7fe71ea69c04]
10 /home/scarecrow/miniconda3/envs/data-transfer/lib/libpython3.7m.so.1.0(_PyEval_EvalFrameDefault+0x434a) [0x7fe71e8e4aea]
11 /home/scarecrow/miniconda3/envs/data-transfer/lib/libpython3.7m.so.1.0(_PyEval_EvalCodeWithName+0x920) [0x7fe71e99de80]
12 /home/scarecrow/miniconda3/envs/data-transfer/lib/libpython3.7m.so.1.0(_PyFunction_FastCallKeywords+0x8a) [0x7fe71ea690ea]
13 /home/scarecrow/miniconda3/envs/data-transfer/lib/libpython3.7m.so.1.0(+0x64416) [0x7fe71e8e0416]
14 /home/scarecrow/miniconda3/envs/data-transfer/lib/libpython3.7m.so.1.0(_PyEval_EvalFrameDefault+0x47a6) [0x7fe71e8e4f46]
15 /home/scarecrow/miniconda3/envs/data-transfer/lib/libpython3.7m.so.1.0(+0x6ddea) [0x7fe71e8e9dea]
16 /home/scarecrow/miniconda3/envs/data-transfer/lib/libpython3.7m.so.1.0(_PyFunction_FastCallDict+0x2e3) [0x7fe71ea694f3]
17 /home/scarecrow/miniconda3/envs/data-transfer/lib/libpython3.7m.so.1.0(_PyObject_Call_Prepend+0xf7) [0x7fe71ea6bc17]
18 /home/scarecrow/miniconda3/envs/data-transfer/lib/libpython3.7m.so.1.0(_PyObject_FastCallKeywords+0xd9) [0x7fe71ea69dd9]
19 /home/scarecrow/miniconda3/envs/data-transfer/lib/libpython3.7m.so.1.0(+0xf409b) [0x7fe71e97009b]
20 /home/scarecrow/miniconda3/envs/data-transfer/lib/libpython3.7m.so.1.0(_PyMethodDef_RawFastCallDict+0xf1) [0x7fe71ea68dd1]
21 /home/scarecrow/miniconda3/envs/data-transfer/lib/libpython3.7m.so.1.0(_PyCFunction_FastCallDict+0x29) [0x7fe71ea699d9]
22 /home/scarecrow/miniconda3/envs/data-transfer/lib/libpython3.7m.so.1.0(_PyEval_EvalFrameDefault+0x7e51) [0x7fe71e8e85f1]
23 /home/scarecrow/miniconda3/envs/data-transfer/lib/libpython3.7m.so.1.0(+0x6ddea) [0x7fe71e8e9dea]
24 /home/scarecrow/miniconda3/envs/data-transfer/lib/libpython3.7m.so.1.0(+0x64416) [0x7fe71e8e0416]
25 /home/scarecrow/miniconda3/envs/data-transfer/lib/libpython3.7m.so.1.0(_PyEval_EvalFrameDefault+0x47a6) [0x7fe71e8e4f46]
26 /home/scarecrow/miniconda3/envs/data-transfer/lib/libpython3.7m.so.1.0(+0x6ddea) [0x7fe71e8e9dea]
27 /home/scarecrow/miniconda3/envs/data-transfer/lib/libpython3.7m.so.1.0(+0x64416) [0x7fe71e8e0416]
28 /home/scarecrow/miniconda3/envs/data-transfer/lib/libpython3.7m.so.1.0(_PyEval_EvalFrameDefault+0x47a6) [0x7fe71e8e4f46]
29 /home/scarecrow/miniconda3/envs/data-transfer/lib/libpython3.7m.so.1.0(+0x6ddea) [0x7fe71e8e9dea]
30 /home/scarecrow/miniconda3/envs/data-transfer/lib/libpython3.7m.so.1.0(+0x64416) [0x7fe71e8e0416]
31 /home/scarecrow/miniconda3/envs/data-transfer/lib/libpython3.7m.so.1.0(_PyEval_EvalFrameDefault+0x47a6) [0x7fe71e8e4f46]
32 ./transfer_synthetic(+0xcb76) [0x55c388319b76]
33 ./transfer_synthetic(+0xed7e) [0x55c38831bd7e]
34 ./transfer_synthetic(+0x1050a) [0x55c38831d50a]
35 ./transfer_synthetic(+0x16e49) [0x55c388323e49]
36 ./transfer_synthetic(+0x10188) [0x55c38831d188]
37 ./transfer_synthetic(+0xae26) [0x55c388317e26]
38 /home/scarecrow/miniconda3/envs/data-transfer/lib/libpython3.7m.so.1.0(PyModule_ExecDef+0x4b) [0x7fe71ea2b3eb]
39 ./transfer_synthetic(+0xb709) [0x55c388318709]
40 ./transfer_synthetic(+0xbdaf) [0x55c388318daf]
41 /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7fe71e6ae083]
42 ./transfer_synthetic(+0xbe91) [0x55c388318e91]
=================================
Aborted (core dumped)
Current port is busy, terminate the process using it or use another port
The networking library could not use Infiniband. A fix is to recompile UCX with Infiniband support, see building UCX for more information.
If running dmesg
and the output looks like this:
(data-transfer) two-face@twoface:~/data-transfer$ dmesg
...
[18846.942527] port_module: 28 callbacks suppressed
[18846.942536] mlx5_core 0000:04:00.0: Port module event: module 0, Cable unplugged
[18846.944179] mlx5_core 0000:04:00.0: mlx5_pcie_event:293:(pid 4865): Detected insufficient power on the PCIe slot (26W).
[18846.944259] mlx5_core 0000:04:00.1: mlx5_pcie_event:293:(pid 4843): Detected insufficient power on the PCIe slot (26W).
[18846.989578] mlx5_core 0000:04:00.0 ens4f0np0: Link down
[18847.678446] mlx5_core 0000:04:00.1: Port module event: module 1, Cable unplugged
[18847.686809] mlx5_core 0000:04:00.1 ens4f1np1: Link down
The physical connection is down, make sure both computers are able to ping eachother before running the demo.
If during runtime, the linker cannot find libpython3.7m.so.1.0
like this:
error while loading shared libraries: libpython3.7m.so.1.0: cannot open shared object file: No such file or directory
A temporal solution is to export the path to the module using:
export LD_LIBRARY_PATH=~/miniconda3/envs/data-transfer/lib/