
cluster info

Matthijs Jansen edited this page Dec 14, 2021 · 74 revisions

Events

  • Firewall enabled on head node al01. Only inbound traffic on port 22 is allowed.
  • Nov 11th, 2021 - node4 and node5 are operational
  • node3 is also updated.
  • node1 is upgraded to kernel 5.12+ (deb packages in atr@node1:/home/atr/linux-v5.12, compiled as described at https://wiki.ubuntu.com/KernelTeam/GitKernelBuild)

Cluster users

https://docs.google.com/spreadsheets/d/1yUdQA4BveaQWB5d_t_VRTdcXiYVFT3FkxpIX6Ej_4og/edit#gid=0

Cluster Information

We have a 4-machine cluster on which we will run experiments. Here are the notes.

Please enjoy the cluster responsibly. Whenever in doubt, please post your issue on Slack. There are many clusters in the world, but I like mine :)

Software packages and maintenance

If you install a new package with sudo apt-get on one machine, please install it on all machines to keep the software in sync.

[atr: April 6th log]

  • common packages installed on all machines: sudo apt-get install build-essential cmake git libaio1 libaio-dev ifstat numactl flex libncurses-dev elfutils libelf-dev libssl-dev net-tools inetutils-tools inetutils-traceroute fio
  • change the default editor from nano (duh!) to vim: sudo update-alternatives --config editor (pick the vim.basic)
  • enable passwordless sudo (be careful)
    • sudo visudo
    • then add %sudo ALL=(ALL) NOPASSWD: ALL (if it is not there already)

packages installed

On a freshly installed machine

  1. Make your account using the atl account
  2. Change the default text editor
sudo update-alternatives --config editor
  3. Enable passwordless sudo (sudo visudo, then add the line below)
%sudo   ALL=(ALL) NOPASSWD: ALL
  4. Disable password login (if needed, we have this on al01)

in /etc/ssh/sshd_config

PasswordAuthentication no
PubkeyAuthentication yes

  5. Put the name and IP of every node in /etc/hosts on all nodes
atr@al01:~$ cat /etc/hosts
127.0.0.1 localhost
#127.0.1.1 al01
192.168.1.100 al01
192.168.1.100 node0
192.168.1.101 node1
192.168.1.102 node2
192.168.1.103 node3
192.168.1.104 node4
192.168.1.105 node5

How to access

Ask us to set up an account for you. Send us a username (you) and your SSH public key. No password access please.

Step 1 : ssh VUNETID@ssh.data.vu.nl (login here using your vunetid and password)

Step 2: ssh you@al01.anac.cs.vu.nl

al01 is a special head node, 1 of the 4 machines that we have. For all intents and purposes it is the same as the other machines, but be careful with its network settings: if this node goes down, everything goes down.

Sample of ssh config file that atr is using (~/.ssh/config) :

ServerAliveInterval 10

Host vu-ssh
	HostName ssh.data.vu.nl
	User ati850
	IdentityFile ~/.ssh/das.pub

Host das5
	HostName fs0.das5.cs.vu.nl
	User atrivedi
	ProxyJump vu-ssh

Host al01
	HostName al01.anac.cs.vu.nl
	User atr
	ProxyJump vu-ssh
	IdentityFile ~/.ssh/al01.pub
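
For a one-off login without editing ~/.ssh/config, the two login steps above can be collapsed into one command with OpenSSH's -J (jump host) option. The following is a minimal sketch, assuming an OpenSSH client recent enough to support -J; the function name al01_ssh_command is hypothetical, and it only prints the command so you can inspect it before running it.

```shell
#!/bin/sh
# Print a single ssh command that jumps through ssh.data.vu.nl to al01.
# $1 = your VUNETID on the jump host, $2 = your username on al01.
al01_ssh_command() {
    if [ "$#" -ne 2 ]; then
        echo "usage: al01_ssh_command VUNETID AL01_USER" >&2
        return 1
    fi
    printf 'ssh -J %s@ssh.data.vu.nl %s@al01.anac.cs.vu.nl\n' "$1" "$2"
}
```

Calling al01_ssh_command with your VUNETID and your al01 username prints the full command, which you can then paste into a terminal.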

What is the network configuration

1 Gbps Ethernet network

  • 192.168.1.100 (al01, head node)
  • 192.168.1.101
  • 192.168.1.102
  • 192.168.1.103

IPMI IPs

  • 192.168.1.200 (al01, head node)
  • 192.168.1.201
  • 192.168.1.202
  • 192.168.1.203

IB network

  • 10.10.1.100 (al01, head node)
  • 10.10.1.101
  • 10.10.1.102
  • 10.10.1.103

al01 has the following interfaces:

  IPv4 address for br-fb53e8de1dd2: 172.18.0.1
  IPv4 address for docker0:         172.17.0.1
  IPv4 address for docker_gwbridge: 172.19.0.1
  IPv4 address for eno1:            192.168.1.100
  IPv4 address for eno2:            130.37.193.10
  IPv6 address for eno2:            2001:610:110:6e1::a
  IPv4 address for ibs2f1:          10.10.1.100
  IPv4 address for tun0:            10.8.0.3

atr@al01:~$ ifconfig eno2 
eno2: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 130.37.193.10  netmask 255.255.255.0  broadcast 130.37.193.255
        inet6 2001:610:110:6e1::a  prefixlen 128  scopeid 0x0<global>
        inet6 fe80::3eec:efff:fe04:c317  prefixlen 64  scopeid 0x20<link>
        ether 3c:ec:ef:04:c3:17  txqueuelen 1000  (Ethernet)
        RX packets 989652951  bytes 974806124551 (974.8 GB)
        RX errors 0  dropped 505195  overruns 0  frame 0
        TX packets 203380698  bytes 77332779571 (77.3 GB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
        device memory 0xaae00000-aae1ffff  

Hostnames

This file has been copied to /etc/hosts on all hosts. Please follow the hostnames and naming convention here:

127.0.0.1 localhost
#127.0.1.1 al01
192.168.1.100 al01
192.168.1.100 node0
192.168.1.101 node1
192.168.1.102 node2
192.168.1.103 node3
10.10.1.100 node0-ib
10.10.1.101 node1-ib
10.10.1.102 node2-ib
10.10.1.103 node3-ib

# The following lines are desirable for IPv6 capable hosts
::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

How to setup a new node

  1. Plug in a USB stick with the ubuntu-server ISO
  2. Plug in a monitor / keyboard / mouse
  3. Boot the server
  4. Enter the BIOS, disable hyper-threading and move the USB stick to the top of the boot order
  5. Restart and boot Ubuntu Server. Install on a non-NVMe SSD if possible
  6. Set hostname to nodeX, username to atl and password to ...
  7. Run ip a to find your Ethernet interface, then ethtool eno1 to check whether the Ethernet link is detected
  8. Set the static IP. There are many ways to do this; one is netplan. Edit /etc/netplan/00-installer-config.yaml to look as follows, with X being your node number
# This is the network config written by 'subiquity'
network:
  ethernets:
    eno1:
      addresses:
      - 192.168.1.10X/16
      gateway4: 192.168.1.100
      nameservers:
        addresses: [1.1.1.1, 8.8.8.8]
        search: []
  version: 2
  9. Apply the configuration: sudo netplan generate, then sudo netplan apply
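
To avoid hand-editing the node number, the netplan file from the steps above can be rendered from a small template. This is only a sketch: render_netplan is a hypothetical helper, it assumes the interface is eno1 as in the example, and it accepts a single digit because the address pattern is 192.168.1.10X.

```shell
#!/bin/sh
# Print the netplan file for node number X (a single digit,
# because the address pattern is 192.168.1.10X).
render_netplan() {
    case "$1" in
        [0-9]) ;;
        *) echo "usage: render_netplan X   (X = 0..9)" >&2; return 1 ;;
    esac
    cat <<EOF
network:
  ethernets:
    eno1:
      addresses:
      - 192.168.1.10$1/16
      gateway4: 192.168.1.100
      nameservers:
        addresses: [1.1.1.1, 8.8.8.8]
        search: []
  version: 2
EOF
}
```

A possible use is render_netplan 3 | sudo tee /etc/netplan/00-installer-config.yaml, followed by sudo netplan generate and sudo netplan apply.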

To make NAT work

(needed on the head node after every reboot, as this is not persistent)

echo 1 | sudo tee /proc/sys/net/ipv4/ip_forward
sudo iptables -t nat -I POSTROUTING --out-interface eno2 -j MASQUERADE # check the right NIC 
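
One way to make the forwarding flag survive a reboot is a sysctl.d drop-in file. The sketch below is an assumption about how this could be done, not something currently set up on al01: the file name 99-ip-forward.conf, the function names run and setup_nat, and the DRY_RUN variable are all hypothetical. The MASQUERADE rule itself is still added at call time, so it is only persistent if the function is called again after boot.

```shell
#!/bin/sh
# Sketch: enable IPv4 forwarding now and across reboots, then add the
# MASQUERADE rule. NAT_SYSCTL_FILE is a hypothetical file name.
NAT_SYSCTL_FILE=/etc/sysctl.d/99-ip-forward.conf

# Print the command instead of running it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# $1 = outward-facing NIC (default eno2; check the right NIC with `ip a`).
setup_nat() {
    out_if="${1:-eno2}"
    # Persist the forwarding flag, then load it immediately.
    run sudo sh -c "echo net.ipv4.ip_forward=1 > $NAT_SYSCTL_FILE"
    run sudo sysctl -p "$NAT_SYSCTL_FILE"
    # Same rule as above; not persisted by this sketch.
    run sudo iptables -t nat -I POSTROUTING --out-interface "$out_if" -j MASQUERADE
}
```

Running DRY_RUN=1 setup_nat eno2 first shows the three commands without touching the system.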

Apply network update settings

After editing the files in /etc/netplan:

  1. sudo netplan apply
  2. sudo systemctl restart systemd-networkd

How to add a new user with sudo access

sudo useradd -s /bin/bash -d /home/atr/ -m -G sudo atr

delete user: sudo userdel -f -r [???]

add user to the sudo group: sudo usermod -aG sudo USERNAME

Nodes 2, 3 and 4 have a default username (atl) and password. Use that account to create the user on each machine. After a user is created, change its password:

# as atl 
$ sudo su 
$ (as root) passwd new_user 

Note: we may want to share this with students and let them do their own user account management.

How to power cycle and reboot machines and use IPMI tools

Log in to the al01 node. From there (broken at the moment):

  • Accessing a machine's event logs: ipmitool -H 192.168.1.201 -U username -P password sel
  • Rebooting a machine : ipmitool -H 192.168.1.201 -U username -P password power reset (do not do power cycle)
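
Since the IPMI addresses follow the node numbers (node N is at 192.168.1.(200+N)), a small wrapper can save some typing. This is a sketch only: ipmi_ip and ipmi_node are hypothetical helper names, and IPMI_USER and IPMI_PASS are hypothetical environment variables that you would set to the real credentials.

```shell
#!/bin/sh
# Map a node number to its IPMI address: node N -> 192.168.1.(200+N).
ipmi_ip() {
    case "$1" in
        [0-9]) ;;
        *) echo "usage: ipmi_ip NODE_NUMBER   (0..9)" >&2; return 1 ;;
    esac
    echo "192.168.1.$((200 + $1))"
}

# Run an ipmitool subcommand against a node, e.g. `ipmi_node 1 power reset`.
# IPMI_USER and IPMI_PASS are hypothetical environment variables.
ipmi_node() {
    node="$1"
    shift
    ipmitool -H "$(ipmi_ip "$node")" -U "$IPMI_USER" -P "$IPMI_PASS" "$@"
}
```

With the variables exported, ipmi_node 1 sel would be equivalent to the first command in the list above.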

https://www.thomas-krenn.com/en/wiki/Configuring_IPMI_under_Linux_using_ipmitool

IP forwarding NAT

on al01

sudo iptables -t nat -I POSTROUTING --out-interface eno2 -j MASQUERADE

Accessing the IPMI web interface with Firefox

  1. Create ssh tunnel: ssh -D 1080 -q -N username@al01.anac.cs.vu.nl
  2. Firefox -> Preferences -> Connection settings -> Socks host: localhost, port 1080, SOCKS_v5
  3. Browse to 192.168.1.201
  4. Login
    • username: username
    • password: password

Booting into bios from IPMI

ipmitool -H 192.168.1.203 -U [username] -P [password] chassis bootdev bios

NFS server settings

Ubuntu NFS write up:

Install the package on the server side: apt-get install nfs-kernel-server (in case it is missing)

Then on the server in the /etc/exports file

  1. Add this line: /srv/nfstest (rw,sync,all_squash,anonuid=1026,anongid=1026)
  2. Then set the correct ownership: chown -R nobody:nogroup /srv/nfstest/

[April 6th] atr: I am setting 777 permissions on /srv/nfstest

What the different NFS export options do: https://linux.die.net/man/5/exports

On the client side:

  • packages may be missing: sudo apt-get install rpcbind nfs-common
  • create the mount point, if missing: sudo mkdir -p /mnt/nfs
  • then mount: sudo mount -t nfs 192.168.1.100:/srv/nfstest/ /mnt/nfs/

This is not mounted by default on a freshly booted machine, so if it is missing, please mount it at /mnt/nfs. We should put this in fstab.
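
A possible fstab entry for this share is sketched below. It is an assumption, not what is currently deployed: the helper name nfs_fstab_entry is hypothetical, and the options defaults,_netdev are one reasonable choice (_netdev marks the filesystem as needing the network before it can be mounted).

```shell
#!/bin/sh
# Print an /etc/fstab line for the cluster NFS share.
# Fields: device, mount point, type, options, dump, fsck order.
nfs_fstab_entry() {
    printf '%s %s nfs defaults,_netdev 0 0\n' \
        "192.168.1.100:/srv/nfstest" "/mnt/nfs"
}
```

To try it, append the line with nfs_fstab_entry | sudo tee -a /etc/fstab and then run sudo mount -a to check that it mounts without a reboot.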

[ref] https://linuxize.com/post/how-to-mount-an-nfs-share-in-linux/

How to restart the NFS server : service nfs-kernel-server restart (there are other options too: {start|stop|status|reload|force-reload|restart})

How to check if nfs is mounted:

atr@node1:/mnt/nfs$ mount | grep nfs 
192.168.1.100:/srv/nfstest on /mnt/nfs type nfs4 (rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=192.168.1.101,local_lock=none,addr=192.168.1.100)
atr@node1:/mnt/nfs$ 
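
The same check can be scripted, for example at the start of an experiment. The sketch below uses a hypothetical function name, is_nfs_mounted, and reads /proc/mounts rather than parsing the output of mount; the second argument exists only so the function can be tried against a saved mount table.

```shell
#!/bin/sh
# Return 0 if something of type nfs/nfs4 is mounted at the given mount point.
# $1 = mount point (default /mnt/nfs), $2 = mount table (default /proc/mounts).
is_nfs_mounted() {
    mount_point="${1:-/mnt/nfs}"
    table="${2:-/proc/mounts}"
    # /proc/mounts fields: device, mount point, filesystem type, options, ...
    awk -v mp="$mount_point" \
        '$2 == mp && ($3 == "nfs" || $3 == "nfs4") { found = 1 } END { exit !found }' \
        "$table"
}
```

A typical use would be is_nfs_mounted || sudo mount -t nfs 192.168.1.100:/srv/nfstest/ /mnt/nfs/.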

Boot sequence

  • Boot mode select: DUAL

  • Legacy to EFI support: Disabled

  • Boot option #1: UEFI USB Key

  • Boot option #2: USB Hard Disk: Samsung Flash Drive FIT 1100

  • Boot option #3: UEFI Network

  • Boot option #4: UEFI Hard Disk

  • Boot option #5: Hard Disk: Intel SSDSC2KB400GB

  • Boot option #6: USB Floppy

  • Boot option #7: USB Lan

  • Boot option #8: CD/DVD

  • Boot option #9: Network: IBA GE Slot 1800 v1584

  • Boot option #10: UEFI CD/DVD

  • Boot option #11: USB CD/DVD

  • Boot option #12: UEFI USB CD/DVD:UEFI:Samsung Flash Drive FIT 1100

InfiniBand details

infiniband-setup

Firewall Config

  1. First allow all incoming traffic by default (the second command switches the default to deny)
sudo ufw default allow incoming
sudo ufw default deny incoming
  2. Allow incoming traffic on port 22 at top priority
sudo ufw insert 1 allow from any proto tcp to any port 22
  3. Deny incoming traffic on all other ports on interface eno2 (connection to internet)
sudo ufw deny in on eno2
  4. Commands to enable or disable firewall
sudo ufw enable
sudo ufw disable
  5. Commands to delete rules
sudo ufw status numbered
sudo ufw delete <number>
  6. Command to see added rules without enabling firewall
sudo ufw show added
  7. Command to see default rules
sudo ufw status verbose
  8. Allow traffic from other nodes to be forwarded to the internet
sudo ufw route allow in on eno1 out on eno2

Specs

System Configuration

  • CPU: 2 x Intel Xeon Silver 4210R (10 cores each, up to 3.2GHz turbo) = 20 cores
  • DRAM: 4 x 64GB DDR4-2400 = 256GB
  • Optane: 2 x 280GB Optane SSD 900p = 560GB
  • Boot Drive: 480GB Intel SATA SSD D3-S4510
  • NIC: Mellanox Bluefield (ConnectX-5 generation)
  • PCIe ports: 4 x x16; 1 x x8

Raspberry Pi

Deployment phases of the Raspberry Pi:

  1. By default, a Raspberry Pi 4 can only boot using a dedicated power cable, and a monitor cable (e.g. HDMI).
  2. We want to enable booting a Pi without a dedicated power cable, using only USB-C -> USB-A/C connected to a PC. With this, you can operate a Pi by SSH'ing from your PC. This step is needed to install everything necessary for network booting.
  3. Finally, we want to enable network booting so a Pi is connected with an ethernet cable to a network switch in the cluster (and so to the head node). This allows us to stop using the Pi's SD card by storing the OS on the head node. The Pi gets its power from a USB hub.

USB deployment

  1. Connect the power and a monitor to the Pi, and boot. Internet is not needed. Remember your username and password
  2. Add dtoverlay=dwc2 to /boot/config.txt
  3. Add modules-load=dwc2,g_ether to /boot/cmdline.txt
  4. Reboot
  5. Edit /etc/dhcpcd.conf to set a static IP, similar to this:
# Example static IP configuration:
interface usb0
static ip_address=192.168.100.10/24
static routers=192.168.100.1
static domain_name_servers=8.8.8.8
  6. Shut the Pi down and connect it to a PC (using USB for example)
  7. Your PC will attempt to connect to the Raspberry Pi via the wired link. Open the settings of this network, and on the IPv4 tab (this was tested on Ubuntu) change the following:
    • IPv4 Method: Manual
    • Create a new address entry: Address = 192.168.100.11. Netmask = 24 (same as 255.255.255.0). Gateway = 192.168.100.1
  8. Now the connection should be established. Open a terminal and ping 192.168.100.10 to test the connection. If that works, ssh pi@192.168.100.10 to get access to the Pi over SSH.

Network boot deployment

  1. Enable network booting on the Raspberry Pi. Connect it to a PC via USB and connect it to ethernet as we will download packages. Follow this tutorial to enable network booting on the Pi.