- Operating environment
- Should I use Infiniband mode or Ethernet mode?
- Using both ports of a dual-port VPI NIC - what do I get?
- Quick start - Ethernet mode
- Quick Start - Infiniband mode
- Troubleshooting steps
- Checking performance
- Troubleshooting and improving performance
- Links
- Search and buying links
ToC created by gh-md-toc
- Ubuntu Bionic 18.04.3
- Vanilla upstream kernel 5.4.3 from kernel.org (no additional patches applied)
- Lower latency
- Can use for low-latency, high-bandwidth storage links, using iSCSI
- Can additionally use RDMA (remote DMA) for things like NFS
- If you do not have a VPI NIC, you probably cannot put the port in Ethernet mode anyway
- Requires additional packages to be installed - opensm, rdma-core
- I have not explored using both ports of a Dual-Port 40GBit/sec ConnectX-3 VPI Mellanox NIC card in Infiniband mode - though it is possible
- Does not require installing the opensm and rdma-core packages and running the opensm server
- Installation and configuration steps are simpler and more familiar
- Renaming the interface names is easier / more familiar to experienced Linux users
WITHOUT using bonding, you can ONLY have two independent network links - e.g. MachineA <---> MachineB and MachineA <---> MachineC simultaneously.
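For example, a minimal sketch of what two independent links could look like in /etc/network/interfaces.d on MachineA (the interface names eth40a / eth40b and the 10.31.0.0/24 and 10.32.0.0/24 subnets are just illustrative assumptions - MachineB and MachineC would get matching addresses on their own ends):

allow-hotplug eth40a
iface eth40a inet static
    address 10.31.0.1
    netmask 255.255.255.0

allow-hotplug eth40b
iface eth40b inet static
    address 10.32.0.1
    netmask 255.255.255.0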
Using bonding, you can ONLY get balanced round-robin mode - this means:
- You do NOT get double the bandwidth on the bonded link
- You get RESILIENCE from:
- Failure of one network cable / transceiver
- A single network cable being disconnected on one or both ends
- A single port failure on any one machine
- Failure of a single port on both machines at either end of a single network cable
 
The following steps need to be done on BOTH machines
sudo apt install mstflint infiniband-diags
You do not need rdma-core or opensm
sudo mstconfig query
Output will look like:
Device #1:
----------
Device type:    ConnectX3       
PCI device:     /sys/bus/pci/devices/0000:05:00.0/config
Configurations:                              Next Boot
         SRIOV_EN                            True(1)         
         NUM_OF_VFS                          16              
         LINK_TYPE_P1                        ETH(2)          
         LINK_TYPE_P2                        VPI(3)          
         LOG_BAR_SIZE                        5               
         BOOT_PKEY_P1                        0               
         BOOT_PKEY_P2                        0               
         BOOT_OPTION_ROM_EN_P1               True(1)         
         BOOT_VLAN_EN_P1                     False(0)        
         BOOT_RETRY_CNT_P1                   0               
         LEGACY_BOOT_PROTOCOL_P1             PXE(1)          
         BOOT_VLAN_P1                        0               
         BOOT_OPTION_ROM_EN_P2               True(1)         
         BOOT_VLAN_EN_P2                     False(0)        
         BOOT_RETRY_CNT_P2                   0               
         LEGACY_BOOT_PROTOCOL_P2             PXE(1)          
         BOOT_VLAN_P2                        0               
         IP_VER_P1                           IPv4(0)         
         IP_VER_P2                           IPv4(0)         
         CQ_TIMESTAMP                        True(1)         
PCI device:     /sys/bus/pci/devices/0000:05:00.0/config : the device name to use with mstconfig (via the -d option) is the PCI address embedded in this path - 0000:05:00.0
Device type:    ConnectX3 : ConnectX3 says it is a PCI-Express 3.x capable card
LINK_TYPE_P1                        ETH(2) : Says Port 1 is in Ethernet mode
LINK_TYPE_P2                        VPI(3) : Says Port 2 is in VPI (Auto) mode
On a VPI-capable card, port type can be any of:
- 1 : Infiniband
- 2 : Ethernet
- 3 : VPI (Auto)
sudo mstconfig -d 0000:05:00.0 set LINK_TYPE_P1=2
sudo mstconfig -d 0000:05:00.0 set LINK_TYPE_P2=2
Reboot - mstconfig port type settings will only take effect after a reboot.
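After the reboot, you can optionally confirm that the new port types took effect (assuming the same PCI device address 0000:05:00.0 as in the query above) - both LINK_TYPE_P1 and LINK_TYPE_P2 should now show ETH(2):

sudo mstconfig -d 0000:05:00.0 query | grep LINK_TYPE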
After the reboot (even without connecting the cables), you should see two new interfaces listed by the ifconfig command. For example:
ifconfig
Output should look like:
ens5p1: flags=4098<BROADCAST,MULTICAST>  mtu 1500
        ether 00:02:c9:3e:ca:b0  txqueuelen 10000  (Ethernet)
        RX packets 2142  bytes 248089 (248.0 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 6121  bytes 5923956 (5.9 MB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
        
ens5: flags=4098<BROADCAST,MULTICAST>  mtu 1500
        ether 00:02:c9:3e:ca:b1  txqueuelen 10000  (Ethernet)
        RX packets 2142  bytes 248089 (248.0 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 6121  bytes 5923956 (5.9 MB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
The ens5 and ens5p1 interface names may be different on your system
Create a file named /etc/udev/rules.d/70-40g-nic.rules containing the following lines
# Mellanox ConnectX-3 HP 649281-B21 IB FDR/EN 10/40Gb 2P 544QSFP Adapter 656089-001
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:02:c9:3e:ca:b1", ATTR{dev_id}=="0x0", ATTR{type}=="1", NAME="eth40a"
# Mellanox ConnectX-3 HP 649281-B21 IB FDR/EN 10/40Gb 2P 544QSFP Adapter 656089-001
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:02:c9:3e:ca:b0", ATTR{dev_id}=="0x0", ATTR{type}=="1", NAME="eth40b"
IMPORTANT: Change the ATTR{address}=="00:02:c9:3e:ca:b1" and ATTR{address}=="00:02:c9:3e:ca:b0" to reflect your actual MAC addresses. Leave the names as eth40a and eth40b.
If you have only one port, you will only add one (uncommented) line (eth40a)
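One way to look up the MAC addresses to put into ATTR{address} (ens5 / ens5p1 are just the example names from the ifconfig output above - substitute whatever names appeared on your system):

cat /sys/class/net/ens5/address
cat /sys/class/net/ens5p1/address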
Reboot again to let the persistent names take effect.
If you have a single-port 40GBit/sec NIC or have only a single cable, you can only use one port on each end (at least at a time). So in such cases, bonding does not make sense.
Edit /etc/network/interfaces.d/eth40a to contain:
allow-hotplug eth40a
iface eth40a inet static
    hwaddress 00:02:c9:3e:ca:b0
    address 10.30.0.2
    netmask 255.255.255.0
    broadcast 10.30.0.255
Replace 10.30.0.2 with the static IP address of MachineA and MachineB respectively. Adjust the netmask (255.255.255.0) and broadcast address (10.30.0.255) to match your IP addressing.
ifdown eth40a; ifup eth40a
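As a quick sanity check once both machines are configured (10.30.0.1 is only an example peer address - use whatever static IP you assigned to the other machine):

ip addr show eth40a
ping -c 3 10.30.0.1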
- Add bonding to /etc/modules - loads the bonding module on every reboot
- Run modprobe bonding - loads the module now, until the next reboot
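To confirm the bonding module is actually loaded before configuring the bond:

lsmod | grep bonding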
Edit /etc/network/interfaces.d/bond0 to contain:
allow-hotplug eth40a
iface eth40a inet manual
    bond-master bond0
    pre-up ifconfig eth40a txqueuelen 10000 2>/dev/null || true

allow-hotplug eth40b
iface eth40b inet manual
    bond-master bond0
    pre-up ifconfig eth40b txqueuelen 10000 2>/dev/null || true
allow-hotplug bond0
iface bond0 inet static
    bond-mode balance-rr
    bond-use-carrier 1
    bond-slaves eth40a eth40b
    bond-miimon 100
    bond-downdelay 200
    bond-updelay 200
    address 10.30.0.2
    netmask 255.255.0.0
    broadcast 10.30.255.255
    pre-down ifconfig eth40a down 2>/dev/null || true
    pre-down ifconfig eth40b down 2>/dev/null || true
    pre-up ifconfig eth40a up 2>/dev/null || true
    pre-up ifconfig eth40b up 2>/dev/null || true
Note that the IP address, netmask and broadcast address are ONLY associated with the bond0 interface.
ifdown bond0 2>/dev/null
ifdown eth40a 2>/dev/null
ifdown eth40b 2>/dev/null
ifup bond0 &
ifup eth40a; ifup eth40b
cat /proc/net/bonding/bond0
Output should look like:
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)
Bonding Mode: load balancing (round-robin)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 200
Down Delay (ms): 200
Peer Notification Delay (ms): 0
Slave Interface: eth40b
MII Status: up
Speed: 40000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 00:02:c9:3e:ca:b0
Slave queue ID: 0
Do the following steps on BOTH machines connected by 40Gbit Infiniband cable.
sudo apt install rdma-core opensm ibutils infiniband-diags
Create a new file under /etc/network/interfaces.d - e.g. named ib0_4g (the filename is not important) - containing:
allow-hotplug ib0
iface ib0 inet static
    address 10.30.0.1
    netmask 255.255.255.0
    broadcast 10.30.0.255
Change the contents of the file to reflect your network (IP address, netmask, broadcast address). Do NOT change the interface name (ib0).
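To bring the interface up without a full reboot, something like the following should work, assuming ifupdown is managing the interface as in the file above:

sudo ifup ib0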
Do this AFTER performing the above steps on BOTH computers.
Your interface (ib0) should be seen in the output of ifconfig, with the right IP address, netmask etc.
If not, use the command ifconfig -a to see whether interface ib0 was detected at all.
- Use ibstat to see the status of the interface(s)
- cat /sys/class/net/ib0/mode - should show connected or datagram
- Check whether the opensm service is running: systemctl status opensm. Output should resemble the following:
● opensm.service - LSB: Start opensm subnet manager.
   Loaded: loaded (/etc/init.d/opensm; generated)
   Active: active (running) since Mon 2019-12-30 00:07:32 PST; 1 day 10h ago
     Docs: man:systemd-sysv-generator(8)
  Process: 3060 ExecStart=/etc/init.d/opensm start (code=exited, status=0/SUCCESS)
    Tasks: 39 (limit: 23347)
   CGroup: /system.slice/opensm.service
           └─3110 /usr/sbin/opensm -g 0x0002c903003f02f2 -f /var/log/opensm.0x0002c903003f02f2.log
Dec 30 00:07:32 fili opensm[3060]: Starting opensm on 0x0002c903003f02f2:
Dec 30 00:07:32 fili systemd[1]: Started LSB: Start opensm subnet manager..
Dec 30 00:07:32 fili OpenSM[3110]: /var/log/opensm.0x0002c903003f02f2.log log file opened
Dec 30 00:07:32 fili OpenSM[3107]: /var/log/opensm.0x0202c9fffe3f02f0.log log file opened
Dec 30 00:07:32 fili OpenSM[3107]: OpenSM 3.3.20
Dec 30 00:07:32 fili OpenSM[3110]: OpenSM 3.3.20
Dec 30 00:07:32 fili OpenSM[3107]: Entering DISCOVERING state
Dec 30 00:07:32 fili OpenSM[3110]: Entering DISCOVERING state
Dec 30 00:07:32 fili OpenSM[3107]: Exiting SM
Dec 30 00:07:32 fili OpenSM[3110]: Entering STANDBY state
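As an additional check, the low-level Infiniband port state is visible in sysfs (the exact device name, e.g. mlx4_0, depends on your card):

cat /sys/class/infiniband/*/ports/*/state

A healthy, subnet-managed link should report ACTIVE; DOWN or INIT usually points at a cabling problem or at opensm not running on either machine.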
- On both machines, A and B, connected by the 40Gbit Infiniband cable and having IP addresses IP_A and IP_B on interface ib0 respectively, install iperf3: sudo apt install iperf3
- On machine A start iperf3 in server mode: iperf3 -B IP_A -i 3 -s - replace IP_A with the IP address of interface ib0 on Machine A (sudo / root not required)
- On machine B start iperf3 in client mode: iperf3 -B IP_B -i 3 -t 15 -c IP_A - replace IP_A with the IP address of interface ib0 on Machine A and IP_B with the IP address of interface ib0 on machine B (sudo / root not required)
Output on machine B should look like:
Connecting to host 10.30.0.1, port 5201
[  4] local 10.30.0.2 port 51913 connected to 10.30.0.1 port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-3.00   sec  8.24 GBytes  23.6 Gbits/sec    0   1.81 MBytes       
[  4]   3.00-6.00   sec  8.50 GBytes  24.3 Gbits/sec    0   2.81 MBytes       
[  4]   6.00-9.00   sec  9.58 GBytes  27.4 Gbits/sec    0   2.81 MBytes       
[  4]   9.00-12.00  sec  8.26 GBytes  23.6 Gbits/sec    0   2.81 MBytes       
[  4]  12.00-15.00  sec  8.26 GBytes  23.6 Gbits/sec    0   2.81 MBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-15.00  sec  42.8 GBytes  24.5 Gbits/sec    0             sender
[  4]   0.00-15.00  sec  42.8 GBytes  24.5 Gbits/sec                  receiver
iperf Done.
lspci | grep Mellanox
Output will look like:
81:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]
- 81:00.0 is the domain-bus-device number
- MT27500 is the Mellanox part number
- ConnectX-3 indicates the card supports PCI-Express 3.x - the actual PCI-Express version used also depends on the PCI-Express capabilities of your motherboard
Use the domain-bus-device number obtained above:
lspci -v -s 81:00.0
Output will look like:
81:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]
	Subsystem: Hewlett-Packard Company InfiniBand FDR/EN 10/40Gb Dual Port 544QSFP Adapter
	Physical Slot: 5
	Flags: bus master, fast devsel, latency 0, IRQ 47, NUMA node 1
	Memory at ec100000 (64-bit, non-prefetchable) [size=1M]
	Memory at 3800fe000000 (64-bit, prefetchable) [size=32M]
	Expansion ROM at ec000000 [disabled] [size=1M]
	Capabilities: <access denied>
	Kernel driver in use: mlx4_core
	Kernel modules: mlx4_core
The Subsystem: Hewlett-Packard Company InfiniBand FDR/EN 10/40Gb Dual Port 544QSFP Adapter line gives further OEM details
Use the domain-bus-device number obtained above:
sudo lspci -vv -s 81:00.0 | grep LnkSta:
Output will look like:
		LnkSta:	Speed 8GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
- Speed 8GT/s : 8 GT/s indicates PCI-Express version 3.x is currently in use for that slot
- Width x8 : indicates the logical width is x8 (8 lanes)
| PCI-E version | Per lane GT/sec | Per lane MBytes/sec | 
|---|---|---|
| 1.x | 2.5 GT/sec | 250 MBytes/sec | 
| 2.x | 5 GT/sec | 500 MBytes/sec | 
| 3.x | 8 GT/sec | 985 MBytes/sec | 
| 4.x | 16 GT/sec | 1.97 GBytes/sec | 
| PCI-E Version | Per lane GT/sec | Physical | Logical | Bandwidth MBytes/sec | 
|---|---|---|---|---|
| 2.x | 5 GT/sec | x8 | x1 | 500 MBytes/sec | 
| 2.x | 5 GT/sec | x8 | x4 | 2 GBytes/sec | 
| 2.x | 5 GT/sec | x8 | x8 | 4 GBytes/sec | 
| 2.x | 5 GT/sec | x16 | x1 | 500 MBytes/sec | 
| 2.x | 5 GT/sec | x16 | x4 | 2 GBytes/sec | 
| 2.x | 5 GT/sec | x16 | x8 | 4 GBytes/sec | 
| 3.x | 8 GT/sec | x8 | x1 | 985 MBytes/sec | 
| 3.x | 8 GT/sec | x8 | x4 | 3.94 GBytes/sec | 
| 3.x | 8 GT/sec | x8 | x8 | 7.88 GBytes/sec | 
| 3.x | 8 GT/sec | x16 | x1 | 985 MBytes/sec | 
| 3.x | 8 GT/sec | x16 | x4 | 3.94 GBytes/sec | 
| 3.x | 8 GT/sec | x16 | x8 | 7.88 GBytes/sec | 
Notes:
- The physical slot width will never be smaller than the physical width of the PCI-Express device (x8 in this case)
- The logical width will never be larger than the physical slot width
- The logical width will never be larger than the lane width of the PCI-Express device (x8 in this case)
- The PCI-Express version and logical slot width may depend on configurable settings in the BIOS of your motherboard
- The maximum bandwidth of the network link will be the LESSER of the maximum bandwidths of the two connected machines, as explored above
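As a worked example for the card above (PCI-Express 3.x at x8 logical width):

8 GT/sec x (128/130 encoding) x 8 lanes ≈ 63 Gbit/sec ≈ 7.88 GBytes/sec

which is comfortably more than the 40 Gbit/sec link rate, so the PCI-Express slot is not the bottleneck in this configuration.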
Put etc/sysctl.d/60-infiniband.conf under /etc/sysctl.d and reboot
Contents of etc/sysctl.d/60-infiniband.conf
# For Mellanox MT27500 ConnectX-3 (HP InfiniBand FDR/EN 10/40Gb Dual Port 544QSFP)
# Settings from https://furneaux.ca/wiki/IPoIB#Kernel_Tuning
# Originally settings from Mellanox:
#   https://community.mellanox.com/s/article/linux-sysctl-tuning
net.ipv4.tcp_timestamps=0
net.ipv4.tcp_sack=1
net.core.netdev_max_backlog=250000
net.core.rmem_max=4194304
net.core.wmem_max=4194304
net.core.rmem_default=4194304
net.core.wmem_default=4194304
net.core.optmem_max=4194304
net.ipv4.tcp_low_latency=1
net.ipv4.tcp_adv_win_scale=1
net.ipv4.tcp_rmem=4096 87380 4194304
net.ipv4.tcp_wmem=4096 65536 4194304
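To apply these settings immediately without rebooting, the following should also work:

sudo sysctl --system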
Only applicable when using Infiniband mode
- Put etc/systemd/system/setup_ib0.service under /etc/systemd/system/
- Run systemctl enable setup_ib0.service and reboot
Contents of etc/systemd/system/setup_ib0.service
[Unit]
Description=Setup ib0
After=sys-subsystem-net-devices-ib0.device

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'echo connected > /sys/class/net/ib0/mode'
ExecStart=/sbin/ifconfig ib0 mtu 65520

[Install]
WantedBy=multi-user.target
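After enabling the service and rebooting, you can verify that connected mode and the larger MTU actually took effect:

cat /sys/class/net/ib0/mode
ip link show ib0

The first command should print connected, and the second should show mtu 65520.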
- On machine A start iperf3 in server mode: iperf3 -B IP_A -i 3 -P2 -s - replace IP_A with the IP address of interface ib0 on Machine A (sudo / root not required)
- On machine B start iperf3 in client mode: iperf3 -B IP_B -i 3 -t 15 -P2 -c IP_A - replace IP_A with the IP address of interface ib0 on Machine A and IP_B with the IP address of interface ib0 on machine B (sudo / root not required)
On each side, try -P2, -P4 and -P8 to see what extracts the maximum bandwidth from the link. I got the maximum with -P4.
On each machine, start iperf3 (in server and client mode respectively) adding the -p <port_num> option to choose a port different from the iperf3 default of 5201.
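For example, a sketch of two parallel instances (the second port number 5202 is an arbitrary choice; IP_A and IP_B are as above):

On machine A (two servers):
iperf3 -B IP_A -p 5201 -s
iperf3 -B IP_A -p 5202 -s

On machine B (two clients, one per server port):
iperf3 -c IP_A -B IP_B -p 5201 -t 15
iperf3 -c IP_A -B IP_B -p 5202 -t 15

Run each pair in separate terminals (or background them with &) so they run concurrently.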
- HP Mellanox ConnectX Adapters Reference Guide
- Mellanox ConnectX-3 VPI Single and Dual QSFP+ Port Adapter Card User Manual
- Mellanox Software Overview - not required for this HOWTO
- Mellanox Firmware Downloads
- Mellanox - Identifying Adapter cards
- Mellanox ConnectX-3 Firmware Downloads
- Running multiple iperf3 instances
- HP Firmware for HP InfiniBand FDR/EN 10/40Gb Dual Port 544QSFP Adapter: HPE part number 649281-B21
- Mellanox - Performance Tuning for Mellanox Adapters
- Kernel network stack challenges at increasing speeds - Youtube
- OpenSM
- InfiniBand: An Inexpensive Performance Boost For Your Home Network
- Red Hat: Configuring the Subnet Manager
- Mellanox: ConnectX-3 VPI
- Ubuntu 19.04 Linux Inbox Driver User Manual
- HowTo Change Port Type in Mellanox ConnectX-3 Adapter
- Infiniband - Wikipedia
- NFS over Infiniband
- IP over Infiniband (IPoIB)
- Difference between IPoIB and TCP over Infiniband: Stackoverflow
- Infiniband: Archwiki
- Mellanox: Installing Mellanox OFED
- How to Speed Test Your New Infiniband Card
- Mellanox: Linux sysctl tuning
- Search for HP 656089-001 Dual-port 40Gbit NIC on ebay.com (US)
- Search for Dual-port 40Gbit NICs on ebay.com (US)
- Search for Mellanox MC2206130-001 1-meter QSFP Passive Copper Cable on ebay.com (US)
- Search for 40 GbE cable QSFP Passive Copper Cable on ebay.com (US)
- Search for HP 764285-B21 Dual-port 10/40 GBit QSFP ConnectX-3 Adapter on ebay.com (US)
- Search for infiniband 40gb card on ebay.com (US)
- Search for MC2207130-001 1-meter QSFP Passive Copper Cable under $30 on ebay.com (US)

