Pegasi Wiki

This wiki acts as a memo for our own work, so why not share it? Feel free to browse and use our notes, and leave a comment while you are at it.

Infiniband RDMA native setup for Linux

Overview

Native Infiniband RDMA enables lossless storage traffic, low CPU load, high throughput and very low latency. We decided to go with 40Gbps native Infiniband for our new NVMe-based storage backend. For the software solution we use Linstor, which supports native Infiniband RDMA and gives us flexibility.

Here is a quick sheet on how to get native Infiniband up and running on Almalinux 8 / CentOS 8 / RHEL 8.

Interfaces

We have ConnectX-3 cards and two Mellanox 56Gbps switches, one of which will be on stand-by while the other is in production. We have connected cables to our two storage backend nodes and one of our front-end nodes. The rest still operate on the legacy storage and will be upgraded once the virtual guests have been migrated to the new storage.

Setup

Here are the tasks required to set up the native Infiniband environment. Do this on all storage servers. One server acts as the primary and must hold the highest PRIORITY value (see below).

  • Almalinux 8 minimal install
  • dnf in vim rdma-core libibverbs-utils librdmacm librdmacm-utils ibacm infiniband-diags opensm
  • systemctl enable rdma
  • systemctl start rdma
  • ip link show ib0
  • ip link show ib1
    • Write down the last 8 bytes of the infiniband MAC addresses
  • vim /etc/udev/rules.d/70-persistent-ipoib.rules. Add the following lines, replacing “xx:xx:xx:xx:xx:xx:xx:xx” with the bytes you copied above:
ACTION=="add", SUBSYSTEM=="net", DRIVERS=="?*", ATTR{type}=="32", ATTR{address}=="?*xx:xx:xx:xx:xx:xx:xx:xx", NAME="mlx4_ib0"
ACTION=="add", SUBSYSTEM=="net", DRIVERS=="?*", ATTR{type}=="32", ATTR{address}=="?*xx:xx:xx:xx:xx:xx:xx:xx", NAME="mlx4_ib1"
  • vim /etc/security/limits.d/rdma.conf. Add lines:
@rdma    soft    memlock     unlimited
@rdma    hard    memlock     unlimited
  • ibstat
    • write down the mlx* port GUIDs
  • do not touch /etc/rdma/opensm.conf
  • vim /etc/sysconfig/opensm. Modify the following, replacing the GUIDs with the ones you wrote down (the master server gets PRIORITY 15, the others a lower value):
GUIDS="0xXXXXXXXXXXXXXX 0xXXXXXXXXXXXXXX"
PRIORITY=15
  • vim /etc/rdma/partitions.conf, add your native partition definition as follows:
DataVault_A=0x0002,rate=7,mtu=4,scope=2,defmember=full:ALL=full;
  • systemctl enable opensm
  • reboot
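The “last 8 bytes” part of the udev step above can be scripted. A minimal sketch, using a made-up example address in place of the one `ip link show ib0` prints:

```shell
# Derive the last 8 bytes of the 20-byte IPoIB hardware address,
# i.e. the part that replaces xx:xx:xx:xx:xx:xx:xx:xx in
# 70-persistent-ipoib.rules. The address below is a made-up example;
# on a real node take it from: ip link show ib0
addr="80:00:02:08:fe:80:00:00:00:00:00:00:f4:52:14:03:00:7b:cb:a1"
suffix=$(printf '%s\n' "$addr" | rev | cut -d: -f1-8 | rev)
echo "$suffix"
# prints: f4:52:14:03:00:7b:cb:a1
```

The `rev | cut | rev` trick simply keeps the last eight colon-separated fields.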

IP setup for Infiniband

Originally I did not want to do this, but since iSER seems to outperform SRP, why not try it.

Let's use nmcli to set up our interfaces. If for some strange reason you do not have the ib0/ib1 devices set up automatically, you can add one with this command:

nmcli connection add type infiniband con-name ib0 ifname ib0 transport-mode Connected mtu 65520

Otherwise you can do

nmcli connection modify ib0 transport-mode Connected
nmcli connection modify ib0 mtu 65520
nmcli connection modify ib0 ipv4.addresses '10.0.0.1/24'
nmcli connection modify ib0 ipv4.method manual
nmcli connection modify ib0 connection.autoconnect yes
nmcli connection up ib0

I skipped gateway / dns setups since I do not need them in a storage network.

iSER Target setup

This is a very compact list of commands with example terminal output.

dnf in targetcli
systemctl enable target --now
firewall-cmd --permanent --add-port=3260/tcp
firewall-cmd --reload
targetcli
/> ls
o- / ......................................................................................................................... [...]
  o- backstores .............................................................................................................. [...]
  | o- block .................................................................................................. [Storage Objects: 0]
  | o- fileio ................................................................................................. [Storage Objects: 0]
  | o- pscsi .................................................................................................. [Storage Objects: 0]
  | o- ramdisk ................................................................................................ [Storage Objects: 0]
  o- iscsi ............................................................................................................ [Targets: 0]
  o- loopback ......................................................................................................... [Targets: 0]
  o- srpt ............................................................................................................. [Targets: 0]
/> iscsi/
/iscsi> create iqn.2021-01-01.com.domain:abc1
Created target iqn.2021-01-01.com.domain:abc1.
Created TPG 1.
Global pref auto_add_default_portal=true
Created default portal listening on all IPs (0.0.0.0), port 3260.
/iscsi> ls
o- iscsi .............................................................................................................. [Targets: 1]
  o- iqn.2021-01-01.com.domain:abc1 .................................................................................... [TPGs: 1]
    o- tpg1 ................................................................................................. [no-gen-acls, no-auth]
      o- acls ............................................................................................................ [ACLs: 0]
      o- luns ............................................................................................................ [LUNs: 0]
      o- portals ...................................................................................................... [Portals: 1]
        o- 0.0.0.0:3260 ....................................................................................................... [OK]
/iscsi> /backstores/block 
/backstores/block> create name=dv1_nvme0n1 dev=/dev/nvme0n1
Created block storage object dv1_nvme0n1 using /dev/nvme0n1.
/backstores/block> create name=dv1_nvme1n1 dev=/dev/nvme1n1
Created block storage object dv1_nvme1n1 using /dev/nvme1n1.
/backstores/block> create name=dv1_nvme2n1 dev=/dev/nvme2n1
Created block storage object dv1_nvme2n1 using /dev/nvme2n1.
/backstores/block> ls
o- block ...................................................................................................... [Storage Objects: 3]
  o- dv1_nvme0n1 .................................................................... [/dev/nvme0n1 (1.5TiB) write-thru deactivated]
  | o- alua ....................................................................................................... [ALUA Groups: 1]
  |   o- default_tg_pt_gp ........................................................................... [ALUA state: Active/optimized]
  o- dv1_nvme1n1 .................................................................... [/dev/nvme1n1 (1.5TiB) write-thru deactivated]
  | o- alua ....................................................................................................... [ALUA Groups: 1]
  |   o- default_tg_pt_gp ........................................................................... [ALUA state: Active/optimized]
  o- dv1_nvme2n1 .................................................................... [/dev/nvme2n1 (1.5TiB) write-thru deactivated]
    o- alua ....................................................................................................... [ALUA Groups: 1]
      o- default_tg_pt_gp ........................................................................... [ALUA state: Active/optimized]
/backstores/block> /iscsi/iqn.2021-01-01.com.domain:abc1/tpg1/portals delete 0.0.0.0 3260
Deleted network portal 0.0.0.0:3260
/backstores/block> /iscsi/iqn.2021-01-01.com.domain:abc1/tpg1/portals create 192.168.222.1 3260
Using default IP port 3260
Created network portal 192.168.222.1:3260.
/backstores/block> /iscsi/iqn.2021-01-01.com.domain:abc1/tpg1/portals/192.168.222.1:3260 enable_iser boolean=true
iSER enable now: True
/backstores/block> /
/> ls
o- / ......................................................................................................................... [...]
  o- backstores .............................................................................................................. [...]
  | o- block .................................................................................................. [Storage Objects: 3]
  | | o- dv1_nvme0n1 ................................................................ [/dev/nvme0n1 (1.5TiB) write-thru deactivated]
  | | | o- alua ................................................................................................... [ALUA Groups: 1]
  | | |   o- default_tg_pt_gp ....................................................................... [ALUA state: Active/optimized]
  | | o- dv1_nvme1n1 ................................................................ [/dev/nvme1n1 (1.5TiB) write-thru deactivated]
  | | | o- alua ................................................................................................... [ALUA Groups: 1]
  | | |   o- default_tg_pt_gp ....................................................................... [ALUA state: Active/optimized]
  | | o- dv1_nvme2n1 ................................................................ [/dev/nvme2n1 (1.5TiB) write-thru deactivated]
  | |   o- alua ................................................................................................... [ALUA Groups: 1]
  | |     o- default_tg_pt_gp ....................................................................... [ALUA state: Active/optimized]
  | o- fileio ................................................................................................. [Storage Objects: 0]
  | o- pscsi .................................................................................................. [Storage Objects: 0]
  | o- ramdisk ................................................................................................ [Storage Objects: 0]
  o- iscsi ............................................................................................................ [Targets: 1]
  | o- iqn.2021-01-01.com.domain:abc1 .................................................................................. [TPGs: 1]
  |   o- tpg1 ............................................................................................... [no-gen-acls, no-auth]
  |     o- acls .......................................................................................................... [ACLs: 0]
  |     o- luns .......................................................................................................... [LUNs: 0]
  |     o- portals .................................................................................................... [Portals: 1]
  |       o- 192.168.222.1:3260 .................................................................................................. [iser]
  o- loopback ......................................................................................................... [Targets: 0]
  o- srpt ............................................................................................................. [Targets: 0]
/> iscsi/iqn.2021-01-01.com.domain:abc1/tpg1/luns create /backstores/block/dv1_nvme0n1
/> iscsi/iqn.2021-01-01.com.domain:abc1/tpg1/luns create /backstores/block/dv1_nvme1n1
/> iscsi/iqn.2021-01-01.com.domain:abc1/tpg1/luns create /backstores/block/dv1_nvme2n1
/> iscsi/iqn.2021-01-01.com.domain:abc1/tpg1/acls create iqn.2021-07.host-z.domain.com
/> iscsi/iqn.2021-01-01.com.domain:abc1/tpg1/acls create iqn.2021-07.host-x.domain.com
/> iscsi/iqn.2021-01-01.com.domain:abc1/tpg1/acls create iqn.2021-07.host-y.domain.com
/> saveconfig

Look up the initiator names (/etc/iscsi/initiatorname.iscsi) on the frontends.

iSER initiator setup

iscsiadm -m discovery -t st -p 192.168.222.1:3260
192.168.222.1:3260,1 iqn.2021-01-01.com.domain:abc1
iscsiadm -m node -T iqn.2021-01-01.com.domain:abc1 -o update -n iface.transport_name -v iser
iscsiadm -m node -l
systemctl enable iscsid --now

NVME-of RDMA Target Setup

Firewalld

firewall-cmd --new-zone=nvmeof --permanent
firewall-cmd --reload
firewall-cmd --zone=nvmeof --add-source=1.2.3.4/24 --permanent
firewall-cmd --zone=nvmeof --add-source=1.2.3.5/24 --permanent
firewall-cmd --zone=nvmeof --add-port=4420/tcp --permanent
firewall-cmd --reload

NVME-of setup with config filesystem

/bin/mount -t configfs none /sys/kernel/config/ #if configfs not mounted
modprobe nvmet-rdma
echo nvmet-rdma > /etc/modules-load.d/nvme.conf
mkdir /sys/kernel/config/nvmet/subsystems/datavault01
cd /sys/kernel/config/nvmet/subsystems/datavault01
echo 1 > attr_allow_any_host
mkdir namespaces/10
cd namespaces/10
echo -n /dev/nvme0n1 > device_path
echo 1 > enable
mkdir /sys/kernel/config/nvmet/ports/1
cd /sys/kernel/config/nvmet/ports/1
echo -n <ip address> > addr_traddr #use the ib0/ib1 address that is connected to the client
echo rdma > addr_trtype
echo 4420 > addr_trsvcid
echo ipv4 > addr_adrfam
ln -s /sys/kernel/config/nvmet/subsystems/datavault01 /sys/kernel/config/nvmet/ports/1/subsystems/datavault01
dmesg | grep "enabling port"
[1034711.759527] nvmet_rdma: enabling port 1 (<ip address>:4420)

Save config and enable boot time

nvmetcli save
systemctl enable nvmet

The nvmet service starts too early, so we must make it start at a more appropriate time. Look for systemd targets with the command

systemctl list-units --type target

And look for network-online.target or a similar target that must be active before nvmet can kick in. Modify /usr/lib/systemd/system/nvmet.service and set the “After” line to this:

After=sys-kernel-config.mount network.target local-fs.target NetworkManager-wait-online.service

Also, in my case the IB interfaces initialize very slowly, so I had to set a delay under [Service]:

ExecStartPre=/bin/sleep 40

Then run “systemctl daemon-reload” and try a reboot.
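An alternative to editing the packaged unit file (which a package update can overwrite) is a drop-in override. A sketch, using the same assumed 40-second delay; the helper name is ours:

```shell
# Write a systemd drop-in that delays nvmet startup, instead of
# editing /usr/lib/systemd/system/nvmet.service directly.
write_nvmet_override() {  # usage: write_nvmet_override [dir]
  local dir="${1:-/etc/systemd/system/nvmet.service.d}"
  mkdir -p "$dir"
  cat > "$dir/override.conf" <<'EOF'
[Service]
ExecStartPre=/bin/sleep 40
EOF
}

# On a real target node:
# write_nvmet_override && systemctl daemon-reload
```

Drop-ins survive package updates, and `systemctl cat nvmet` shows the merged result.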

NVME-of RDMA target add device

cd /sys/kernel/config/nvmet/subsystems/datavault01
mkdir namespaces/11
cd namespaces/11
echo -n /dev/nvme1n1 > device_path
echo 1 > enable
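Since adding a device repeats the same configfs steps each time, they can be wrapped in a small helper. A sketch; the function name and the NVMET_ROOT override are ours (the override exists only so the helper can be dry-run against a scratch directory):

```shell
# The namespace-adding steps above as a reusable helper.
# NVMET_ROOT defaults to the real configfs path used earlier.
NVMET_ROOT="${NVMET_ROOT:-/sys/kernel/config/nvmet}"

add_ns() {  # usage: add_ns <subsystem> <nsid> <block-device>
  local ns="$NVMET_ROOT/subsystems/$1/namespaces/$2"
  mkdir -p "$ns"
  printf '%s' "$3" > "$ns/device_path"
  echo 1 > "$ns/enable"
}

# On a real target node:
# add_ns datavault01 12 /dev/nvme2n1
```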

NVME-of RDMA client setup

dnf in nvme-cli
modprobe nvme-rdma
echo "nvme-rdma" > /etc/modules-load.d/nvme.conf
nvme discover -t rdma -a <server ip address> -s 4420
nvme connect -t rdma -n datavault01 -a <server ip address> -s 4420
systemctl enable nvmf-autoconnect
echo "-t rdma -a <ip address> -s 4420" > /etc/nvme/discovery.conf
echo "-t rdma -a <ip address> -s 4420" >> /etc/nvme/discovery.conf
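With several targets, the discovery.conf lines above can be generated in one go. A sketch; the helper name is ours and the addresses are illustrative (192.168.222.1 is the portal IP used earlier):

```shell
# Build an /etc/nvme/discovery.conf from a list of target addresses.
write_discovery_conf() {  # usage: write_discovery_conf <file> <addr>...
  local dest="$1"; shift
  : > "$dest"                       # truncate, then append one line per target
  for addr in "$@"; do
    echo "-t rdma -a $addr -s 4420" >> "$dest"
  done
}

# On a real initiator node:
# write_discovery_conf /etc/nvme/discovery.conf 192.168.222.1 192.168.222.2
```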

Autoconnect wants to start connecting before IB interface gets IP addresses. Modify /usr/lib/systemd/system/nvmf-autoconnect.service and set a delay under [Service]:

ExecStartPre=/bin/sleep 40

Then run “systemctl daemon-reload” and try a reboot.

Test

Test RDMA connectivity with ibping. First, write down each server's Port GUIDs by issuing the command

ibstat
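Picking the GUIDs out of the ibstat output can be done with awk. A sketch against sample output (the GUID value below is made up; the "Port GUID:" label is what ibstat prints per port):

```shell
# Extract Port GUIDs from ibstat-style output. The sample mimics the
# relevant lines; the GUID value is illustrative only.
sample='CA type: MT4099
        Port 1:
                State: Active
                Port GUID: 0x0002c903002f1234'
printf '%s\n' "$sample" | awk '/Port GUID:/ {print $3}'
# prints: 0x0002c903002f1234

# On a real node:
# ibstat | awk '/Port GUID:/ {print $3}'
```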

Then start ibping in server mode on each server

ibping -S

From each client, run ibping with a Port GUID you wrote down earlier:

ibping -G 0xXXXXX

You should see very low latency pongs as a response.

Other notes

If you want to change your Mellanox ConnectX-3 (or newer) card to Infiniband, Ethernet or autodetect mode, install the mstflint package and use these commands:

lspci | grep Mellanox                         # find out the PCI address
mstconfig -d NN:NN.n q                        # query the card
mstconfig -d NN:NN.n set LINK_TYPE_P1=2      # set mode 1=ib, 2=ethernet, 3=autodetect

Comments

All comments and corrections are welcome.
