
InfiniBand setup

Animesh Trivedi edited this page Apr 6, 2021 · 7 revisions

Setup

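Each IB network needs a subnet manager. It typically runs on the switch, but any machine on the fabric can host it. In our case, it runs on al01.

When the links are up but no subnet manager is running, a port shows the physical state LinkUp while its logical state stays at Initializing, with no LID assigned (Base lid 65535, SM lid 0). See mlx5_1 below: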

atr@node1:/home/atr$ ibstat
CA 'mlx5_0'
	CA type: MT41682
	Number of ports: 1
	Firmware version: 18.26.1040
	Hardware version: 0
	Node GUID: 0x1c34da030072bbf6
	System image GUID: 0x1c34da030072bbf6
	Port 1:
		State: Down
		Physical state: Disabled
		Rate: 10
		Base lid: 65535
		LMC: 0
		SM lid: 0
		Capability mask: 0x2651e848
		Port GUID: 0x1c34da030072bbf6
		Link layer: InfiniBand
CA 'mlx5_1'
	CA type: MT41682
	Number of ports: 1
	Firmware version: 18.26.1040
	Hardware version: 0
	Node GUID: 0x1c34da030072bbf7
	System image GUID: 0x1c34da030072bbf6
	Port 1:
		State: Initializing
		Physical state: LinkUp
		Rate: 100
		Base lid: 65535
		LMC: 0
		SM lid: 0
		Capability mask: 0x2651e848
		Port GUID: 0x1c34da030072bbf7
		Link layer: InfiniBand
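
With several adapters, the full ibstat listing is long, so a small filter can help spot a port that is stuck in Initializing. The following is a minimal sketch, not part of the original setup: the function name ib_port_summary is made up for illustration, and it assumes the ibstat text layout shown above.

```shell
# Print one line per port: "<CA> port <N>: <State> / <Physical state>".
# Reads ibstat-style text on stdin (hypothetical helper, layout as shown above).
ib_port_summary() {
    awk -v q="'" '
        /^CA /                   { gsub(q, "", $2); ca = $2 }          # CA name, quotes stripped
        /^[ \t]*Port [0-9]+:/    { port = $2; sub(/:/, "", port) }     # port number
        /^[ \t]*State:/          { state = $2 }                        # logical state
        /^[ \t]*Physical state:/ { print ca " port " port ": " state " / " $3 }
    '
}
```

Usage: `ibstat | ib_port_summary`. For the listing above this should print `mlx5_0 port 1: Down / Disabled` and `mlx5_1 port 1: Initializing / LinkUp`.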

Then I started the subnet manager on node0. After that, node1 reports the mlx5_1 port as Active with a LID assigned:

atr@node1:/home/atr$ ibstat
CA 'mlx5_0'
	CA type: MT41682
	Number of ports: 1
	Firmware version: 18.26.1040
	Hardware version: 0
	Node GUID: 0x1c34da030072bbf6
	System image GUID: 0x1c34da030072bbf6
	Port 1:
		State: Down
		Physical state: Disabled
		Rate: 10
		Base lid: 65535
		LMC: 0
		SM lid: 0
		Capability mask: 0x2651e848
		Port GUID: 0x1c34da030072bbf6
		Link layer: InfiniBand
CA 'mlx5_1'
	CA type: MT41682
	Number of ports: 1
	Firmware version: 18.26.1040
	Hardware version: 0
	Node GUID: 0x1c34da030072bbf7
	System image GUID: 0x1c34da030072bbf6
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 100
		Base lid: 5
		LMC: 0
		SM lid: 1
		Capability mask: 0x2651e848
		Port GUID: 0x1c34da030072bbf7
		Link layer: InfiniBand

At this point the interface can be configured like any Ethernet network interface.

Starting the subnet manager on al01

atr@al01:/home/atr$ sudo service opensmd status 
● opensmd.service - LSB: Manage OpenSM
     Loaded: loaded (/etc/init.d/opensmd; generated)
     Active: active (running) since Tue 2021-04-06 08:18:45 UTC; 1s ago
       Docs: man:systemd-sysv-generator(8)
    Process: 58592 ExecStart=/etc/init.d/opensmd start (code=exited, status=0/SUCCESS)
      Tasks: 92 (limit: 309035)
     Memory: 14.5M
     CGroup: /system.slice/opensmd.service
             └─58609 /usr/sbin/opensm --daemon --pidfile /var/run/opensm.pid

Apr 06 08:18:45 al01 systemd[1]: Starting LSB: Manage OpenSM...
Apr 06 08:18:45 al01 opensmd[58592]: Starting opensm:  * done
Apr 06 08:18:45 al01 OpenSM[58609]: /var/log/opensm.log log file opened
Apr 06 08:18:45 al01 OpenSM[58609]: OpenSM 5.7.2.MLNX20201014.9378048
Apr 06 08:18:45 al01 systemd[1]: Started LSB: Manage OpenSM.
Apr 06 08:18:45 al01 OpenSM[58609]: Entering DISCOVERING state
Apr 06 08:18:45 al01 OpenSM[58609]: Entering MASTER state

OpenSM writes its log file to /var/log/opensm.log
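
To confirm that this instance actually became the master subnet manager, one option is to pull the last "Entering ... state" message out of the service journal. This is a rough sketch that assumes the message format shown in the status output above; the function name opensm_last_state is hypothetical.

```shell
# Print the most recent OpenSM state (e.g. DISCOVERING, MASTER) found in
# journal-style text on stdin. Prints nothing if no state message is present.
opensm_last_state() {
    sed -n 's/.*Entering \([A-Z_-]*\) state.*/\1/p' | tail -n 1
}
```

Usage: `sudo journalctl -u opensmd --no-pager | opensm_last_state`. If the service is not running yet, it can be started with `sudo service opensmd start`.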

Setting up a static IP with netplan

To check which driver the NIC is using, run:

atr@al01:/home/atr$ ethtool -i ibs2f1 
driver: mlx5_core[ib_ipoib]
version: 4.9-2.2.4
firmware-version: 18.28.2006 (MT_0000000244)
expansion-rom-version: 
bus-info: 0000:86:00.1
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes
atr@al01:/home/atr$ 

Make sure the ib_ipoib module is loaded.

Edit the /etc/netplan/00-installer-config.yaml file as follows:
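
A quick way to check this is `lsmod | grep ib_ipoib`. For scripts, the sketch below reads the kernel's module list directly. The function name module_loaded and its optional second argument (an alternative list file, handy for testing) are illustrative assumptions.

```shell
# Return 0 if the named kernel module appears in the module list.
# $1: module name; $2: list file in /proc/modules format (default: /proc/modules)
module_loaded() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }' "${2:-/proc/modules}"
}
```

Usage: `module_loaded ib_ipoib || sudo modprobe ib_ipoib` loads the module only when it is missing.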

# This is the network config written by 'subiquity'
network:
  ethernets:
    eno1:
      addresses:
      - 192.168.1.103/16
      gateway4: 192.168.1.100
      nameservers:
        addresses:
        - 1.1.1.1
        - 8.8.8.8
        search: []
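    # IPoIB interface with a static address
    ibs2f1:
      dhcp4: no
      addresses:
      - 10.10.1.103/16
      # NOTE: this gateway lies outside 10.10.0.0/16 and repeats eno1's default
      # route; it can probably be dropped on an isolated IB subnet.
      gateway4: 192.168.1.100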
      nameservers:
        addresses: [1.1.1.1, 8.8.8.8]
        search: []

  version: 2
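
After editing the file, the configuration still has to be applied. The wrapper below is a minimal sketch of that step. The function name netplan_apply_ib and the DRY_RUN switch are made up for illustration, and the default interface name ibs2f1 is taken from the config above.

```shell
# Generate and apply the netplan config, then show the interface address.
# With DRY_RUN=1 the commands are only printed, not executed.
netplan_apply_ib() {
    ifname=${1:-ibs2f1}
    for cmd in "netplan generate" "netplan apply" "ip addr show dev $ifname"; do
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "sudo $cmd"
        else
            sudo $cmd || return 1   # stop at the first failing command
        fi
    done
}
```

Usage: `DRY_RUN=1 netplan_apply_ib` prints the three commands; `netplan_apply_ib` runs them. When working over SSH, `sudo netplan try` may be the safer choice, since it rolls the change back unless it is confirmed.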

What hardware we have

[atr@node1 ~]$ lspci | grep -i Mellanox 
86:00.0 Infiniband controller: Mellanox Technologies MT416842 BlueField integrated ConnectX-5 network controller
86:00.1 Infiniband controller: Mellanox Technologies MT416842 BlueField integrated ConnectX-5 network controller
86:00.2 DMA controller: Mellanox Technologies MT416842 BlueField SoC management interface
[atr@node1 ~]$ 

The hardware installation guide is available here: https://docs.mellanox.com/display/bluefieldsniceth/Hardware+Installation

Download and install software

Installing MOFED: https://www.mellanox.com/products/infiniband-drivers/linux/mlnx_ofed

Logs

April 6th 2021 : mlx4-installation-log on node 3
