
Intel NTB Startup Guide


Set Up the Environment

BIOS configurations

The Linux driver expects the BIOS to set up the BARs with the following configuration:
(This is for Xeon E5, v2, v3, v4 and Xeon D [Broadwell based])
Split BAR enabled
BAR2/3: 4M
BAR4: 1M
BAR5: Don't care
CBDMA on and DCA off (2 locations) for using IOATDMA

Skylake does not require split BAR to be enabled:
BAR2/3: 4M
BAR4/5: Don't care (4M or less for ntb_transport unless VT-d is on)

Configuring the kernel

Enable the following config options for your kernel.

CONFIG_NTB_NETDEV=m
CONFIG_NTB=m
CONFIG_NTB_INTEL=m
CONFIG_NTB_PINGPONG=m
CONFIG_NTB_TOOL=m
CONFIG_NTB_TRANSPORT=m
CONFIG_NTB_PERF=m

If you want to use the IOATDMA (CBDMA) in NTB, enable the following option:
CONFIG_INTEL_IOATDMA=m

Enable all the CPUs if on a dual-socket system:
CONFIG_MAXSMP=y
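
To double-check that an installed kernel already has these options, the running kernel's config can usually be inspected with grep (this assumes your distribution ships /boot/config-*):
grep -E 'CONFIG_NTB|CONFIG_INTEL_IOATDMA|CONFIG_MAXSMP' /boot/config-$(uname -r)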

CMA (Contiguous Memory Allocator)

If BAR2 is configured to be larger than 4M, the driver needs CMA enabled in order to allocate properly aligned memory. Memory pointed to by the XLAT registers must be aligned to the BAR size. For example, an 8M buffer must be 8M aligned.

CONFIG_CMA=y
CONFIG_CMA_AREAS=2
CONFIG_CMA_SIZE_MBYTES=8
CONFIG_CMA_SIZE_SEL_MBYTES=y
CONFIG_CMA_ALIGNMENT=11

The alignment is expressed as a page order, i.e. 4096 * 2^N bytes. So an 8M alignment would be 4096 * 2^11. Sometimes Linux may not be able to satisfy the alignment and a larger area may be needed; for example, an 8M allocation will occasionally fail, and reserving a 16M CMA area may help.

Enabling VT-d can help with large memory allocations. However, earlier BIOSes do not support NTB correctly under VT-d, so make sure your BIOS has the proper updates to support NTB with VT-d enabled.
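
A quick way to confirm that the CMA region was actually reserved at boot is to check the kernel log (these are generic CMA messages, not NTB specific):
dmesg | grep -i cma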

NTB driver related items

NOTE: Please disable the irqbalance service when running NTB with the BAR0/1 hang workaround. This applies to all Haswell and Broadwell platforms; the workaround does not handle IRQ migration. To disable irqbalance on newer Fedora:
systemctl stop irqbalance.service
Do not attempt to change the IRQ core affinity.

Also, add the following command line parameter to your kernel boot line in grub:
acpi_irq_nobalance
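
As a sketch for grub2-based distributions such as Fedora (paths and tools may differ elsewhere), the parameter can be made persistent like this:
# add acpi_irq_nobalance to GRUB_CMDLINE_LINUX in /etc/default/grub, then regenerate the config:
grub2-mkconfig -o /boot/grub2/grub.cfg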

Loading the driver

The NTB subsystem has 4 kernel modules:

  1. ntb.ko - the common NTB core that glues the hardware and client drivers together
  2. ntb_hw_intel.ko - the Intel NTB hardware driver
  3. ntb_transport.ko - the NTB transport
  4. ntb_netdev.ko - the driver that exposes NTB as a virtual NIC

Once the kernel is installed, ntb_hw_intel.ko will typically be loaded automatically via the PCI device ID, pulling in ntb.ko as a dependency. Loading ntb_netdev.ko will automatically load ntb_transport.ko. ntb_transport has a few module parameters that may be of interest:

transport_mtu: MTU size for the transport. Note that you cannot configure the MTU via the Ethernet device. By default this is set to 64k, which seems to provide the best throughput.

copy_bytes: Threshold below which NTB uses the CPU to copy instead of the DMA engine, in order to reduce latency. By default this is set to 1k. Setting it to 0 forces all copies to use DMA.

use_dma: Enables DMA for data transport. By default this is turned off. We have found that, depending on the CPU SKU, CPU copy with write-combining can outperform DMA.
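
As an illustration, these parameters can be set at load time and, if their permissions allow, read back through sysfs (the values below are only examples):
modprobe ntb_transport use_dma=1 copy_bytes=0
cat /sys/module/ntb_transport/parameters/transport_mtu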

Root Port Mode / Transparent Bridge config

Unless the BIOS on your platform enables the NTB on the RP side, the following steps are needed to load the NTB driver. For the TB side to see the PCIe device, the NTB link must be enabled on the RP side. To do so, load the ntb_hw_intel kernel module on the RP side, then reboot the TB side so that the device shows up. After that, load all the other modules as normal. HOWEVER, due to how the NTB workaround is implemented, this is a one-shot deal: unlike the B2B configuration, you cannot reload the NTB drivers.

modprobe ntb
modprobe ntb_transport
modprobe ntb_hw_intel
modprobe ntb_netdev

Usage of ntb_netdev

Once the driver is loaded, the interface can be brought up using typical network management tools. Configure unique private IP addresses on each side.

For example, on the local host:
ifconfig eth0 192.168.0.1
and on the remote host:
ifconfig eth0 192.168.0.2
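
On distributions without the legacy net-tools, the iproute2 equivalents should work as well (assuming the interface is named eth0 as above):
ip addr add 192.168.0.1/24 dev eth0
ip link set eth0 up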

You should see something like this in the dmesg:
[ 345.165195] Intel(R) PCI-E Non-Transparent Bridge Driver 2.0
[ 345.165317] ntb_hw_intel 0000:00:03.0: Reduce doorbell count by 1
[ 345.165433] ndev_spad_write: NTB unsafe scratchpad access
[ 345.165465] ntb_hw_intel 0000:00:03.0: NTB device registered.
[ 345.169869] Software Queue-Pair Transport over NTB, version 4
[ 345.172829] ntb_hw_intel 0000:00:03.0: NTB Transport QP 0 created
[ 345.173027] ntb_hw_intel 0000:00:03.0: eth0 created
[ 346.179707] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[ 351.229314] ntb_hw_intel 0000:00:03.0: qp 0: Link Up
[ 351.229323] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready

Once eth0 reports that its link is ready, you can start communicating, for example with ping.
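
For example, from the host configured as 192.168.0.1:
ping -c 4 192.168.0.2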

ifconfig eth0
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 65510
		inet 192.168.0.1  netmask 255.255.255.0  broadcast 192.168.0.255
		inet6 fe80::a84d:b0ff:fe85:5b19  prefixlen 64  scopeid 0x20<link>
		ether aa:4d:b0:85:5b:19  txqueuelen 1000  (Ethernet)
		RX packets 8  bytes 648 (648.0 B)
		RX errors 0  dropped 0  overruns 0  frame 0
		TX packets 8  bytes 648 (648.0 B)
		TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

NOTE: The MAC address for NTB is randomly generated. Therefore, if the driver is reloaded too quickly, the peer's ARP cache may be stale, which can block network traffic until ARP packets are exchanged and a new route is established. If this happens, the old ARP entry can be deleted via the arp tool.
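
For example, assuming the peer is 192.168.0.2, the stale entry can be removed with either the legacy arp tool or iproute2:
arp -d 192.168.0.2
ip neigh del 192.168.0.2 dev eth0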

NOTE: Please disable the firewall before network testing with iperf.

Performance Tuning

Recent CPUs have many power management configurations. To achieve the best performance, especially with CPU-copy NTB, the CPUs must be put into performance mode. This can be done via the cpupower tool in the kernel-tools package:
/usr/bin/cpupower frequency-set --governor performance
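
The active governor can be verified afterwards through sysfs, for example:
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor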

If the system is a dual-socket system, pay attention to the NUMA configuration in order to get the best performance. On a dual-socket system, you may see kernel log messages such as these:

[    0.000000] NUMA: Initialized distance table, cnt=2  
[    0.000000] mempolicy: Enabling automatic NUMA balancing. Configure with numa_balancing= or the kernel.numa_balancing sysctl  
[    0.910777] pci_bus 0000:00: on NUMA node 0  
[    0.915052] pci_bus 0000:80: on NUMA node 1  

This indicates that there is a proper NUMA configuration table and that Linux is assigning the PCI buses to specific NUMA nodes. The numactl package should be installed; it allows us to examine the NUMA configuration and to launch processes according to it.

numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13
node 0 size: 15926 MB
node 0 free: 15579 MB
node 1 cpus: 14 15 16 17 18 19 20 21 22 23 24 25 26 27
node 1 size: 16117 MB
node 1 free: 15882 MB
node distances:
node   0   1 
  0:  10  21 
  1:  21  10

This output shows that there are 2 sockets and which memory and cores are assigned to each socket. In this setup, each socket has an NTB device. To get the best performance, any process using the NTB must use a local core and local memory. The NTB driver is already NUMA aware and will pick the correct DMA channel on the desired socket.

For example, suppose we use the iperf application to measure performance. We have 2 sides of the NTB, which we will call side A and side B. Side A runs the iperf server and side B runs the iperf client.

On side A, we start the iperf server and bind it to a core on socket 0 with memory on socket 0:
numactl -m 0 -N 0 iperf -s

On side B, we start the iperf client and bind it to a core on socket 0 with memory on socket 0:
numactl -m 0 -N 0 iperf -c 192.168.0.1 -P4 -t60 -i2

Note: Typically the receive and transmit processing are done on the same core. Under heavy traffic, fluctuation in iperf may be observed because transmit starves receive processing. To even out the performance, it may be a good idea to run iperf on a core other than core 0 (but on the same NUMA node). This can be done by explicitly setting the CPU mask when running numactl so that core 0 is excluded, as shown below.
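
For instance, using the node 0 CPU list from the numactl -H output above, the iperf server could be pinned to cores 1-13 so that core 0 stays free (adjust the core list to your topology):
numactl -m 0 -C 1-13 iperf -s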

Debugging

If there are issues, there are several things that can be done. First, look at the network device statistics via ifconfig and make sure that TX and RX packets are being counted when the network is up.

Check the debugfs stats for the hardware driver and the transport queues. The hardware driver stats are located at /sys/kernel/debug/ntb_hw_intel/*/info, where * is the PCIe device. For example:

cat /sys/kernel/debug/ntb_hw_intel/0000\:00\:03.0/info 
NTB Device Information:
Connection Topology -   NTB_TOPO_B2B_USD
B2B Offset -            0x0
B2B MW Idx -            2147483647
BAR4 Split -            yes
NTB CTL -               0x0000
LNK STA -               0xa103
Link Status -           Up
Link Speed -            PCI-E Gen 3
Link Width -            x16
Memory Window Count -   1
Scratchpad Count -      16
Doorbell Count -        14
Doorbell Vector Count - 4
Doorbell Vector Shift - 5
Doorbell Valid Mask -   0x3fff
Doorbell Link Mask -    0x8000
Doorbell Mask Cached -  0x3fff
Doorbell Mask -         0x3fff
Doorbell Bell -         0x0

NTB Incoming XLAT:
XLAT23 -                0x000000087f000000
XLAT4 -                 0xfee00000
XLAT5 -                 0x0000
LMT23 -                 0x0000000000000000
LMT4 -                  0xa0100000
LMT5 -                  0xc0000000

NTB Outgoing B2B XLAT:
B2B XLAT23 -            0x2000000000000000
B2B XLAT4 -             0x20000000
B2B XLAT5 -             0x40000000
B2B LMT23 -             0x0000000000000000
B2B LMT4 -              0x0000
B2B LMT5 -              0x0000

NTB Secondary BAR:
SBAR01 -                0x000000000000000c
SBAR23 -                0xa00000000000000c
SBAR4 -                 0xa0000000
SBAR5 -                 0xc0000000

XEON NTB Statistics:
Upstream Memory Miss -  0

XEON NTB Hardware Errors:
DEVSTS -                0x0000
LNKSTS -                0xa103
UNCERRSTS -             0x100000
CORERRSTS -             0x2000

Mainly we want to make sure that the link is up and the register values look sane.

The transport stats are located in /sys/kernel/debug/ntb_transport/*/qp0/stats, where * is the PCIe device:

cat /sys/kernel/debug/ntb_transport/0000\:00\:03.0/qp0/stats 

NTB QP stats:

rx_bytes -      648
rx_pkts -       8
rx_memcpy -     8
rx_async -      0
rx_ring_empty - 16
rx_err_no_buf - 0
rx_err_oflow -  0
rx_err_ver -    0
rx_buff -       0xffff88087f000000
rx_index -      8
rx_max_entry -  63

tx_bytes -      648
tx_pkts -       8
tx_memcpy -     8
tx_async -      0
tx_ring_full -  0
tx_err_no_buf - 0
tx_mw -         0xffffc90004800000
tx_index (H) -  8
RRI (T) -       7
tx_max_entry -  63
free tx -       62

Using TX DMA -  No
Using RX DMA -  No
QP Link -       Up

We want to make sure that the queue pair link is up as well, that the queue stats look sane, and that there are no QP errors. Here we can also see whether DMA is being used.

Also, one can turn on dynamic debug for the kernel modules, specifically the debugging statements for the hardware driver and the transport. Enable them by adding "dyndbg=+p" as a module parameter when loading the kernel modules. The output will be displayed in dmesg. For example:

modprobe ntb_hw_intel dyndbg=+p

[18865.135842] Intel(R) PCI-E Non-Transparent Bridge Driver 2.0
[18865.136030] ntb_hw_intel 0000:00:03.0: ppd 0x49 topo NTB_TOPO_B2B_USD
[18865.136034] ntb_hw_intel 0000:00:03.0: PPD 73 split bar
[18865.136035] ntb_hw_intel 0000:00:03.0: ppd 0x49 bar4_split 1
[18865.136037] ntb_hw_intel 0000:00:03.0: Reduce doorbell count by 1
[18865.136038] ntb_hw_intel 0000:00:03.0: not using b2b mw
[18865.136040] ntb_hw_intel 0000:00:03.0: PBAR23SZ 0x16
[18865.136042] ntb_hw_intel 0000:00:03.0: SBAR23SZ 0x16
[18865.136044] ntb_hw_intel 0000:00:03.0: PBAR4SZ 0x14
[18865.136046] ntb_hw_intel 0000:00:03.0: SBAR4SZ 0x14
[18865.136048] ntb_hw_intel 0000:00:03.0: PBAR5SZ 0x14
[18865.136050] ntb_hw_intel 0000:00:03.0: SBAR5SZ 0x14
[18865.136051] ntb_hw_intel 0000:00:03.0: SBAR01 0x0000000000000000
[18865.136054] ntb_hw_intel 0000:00:03.0: SBAR23 0xa00000000000000c
[18865.136055] ntb_hw_intel 0000:00:03.0: SBAR4 0xa0000000
[18865.136057] ntb_hw_intel 0000:00:03.0: SBAR5 0xc0000000
[18865.136059] ntb_hw_intel 0000:00:03.0: SBAR23LMT 0xa000000000000000
[18865.136061] ntb_hw_intel 0000:00:03.0: SBAR4LMT 0xa0100000
[18865.136063] ntb_hw_intel 0000:00:03.0: SBAR5LMT 0xc0000000
[18865.136065] ntb_hw_intel 0000:00:03.0: SBAR4XLAT 0xfee00000
[18865.136067] ntb_hw_intel 0000:00:03.0: SBAR5XLAT 0x000
[18865.136070] ntb_hw_intel 0000:00:03.0: PBAR23XLAT 0x2000000000000000
[18865.136072] ntb_hw_intel 0000:00:03.0: PBAR4XLAT 0x20000000
[18865.136073] ntb_hw_intel 0000:00:03.0: PBAR5XLAT 0x40000000
[18865.136074] ntb_hw_intel 0000:00:03.0: B2BXLAT 0x0000000000000000
[18865.136076] ntb_hw_intel 0000:00:03.0: BAR 4
[18865.136077] ntb_hw_intel 0000:00:03.0: BAR addr: 0x91c00000
[18865.136078] ntb_hw_intel 0000:00:03.0: BAR size: 0x100000
[18865.136083] ntb_hw_intel 0000:00:03.0: BAR vaddr: ffffc90004400000
[18865.136219] ntb_hw_intel 0000:00:03.0: Using msix interrupts
[18865.136221] ntb_hw_intel 0000:00:03.0: local lower MSIX addr(0): 0xfee00000
[18865.136222] ntb_hw_intel 0000:00:03.0: local MSIX data(0): 0x404d
[18865.136224] ntb_hw_intel 0000:00:03.0: local lower MSIX addr(1): 0xfee00000
[18865.136225] ntb_hw_intel 0000:00:03.0: local MSIX data(1): 0x405d
[18865.136226] ntb_hw_intel 0000:00:03.0: local lower MSIX addr(2): 0xfee00000
[18865.136228] ntb_hw_intel 0000:00:03.0: local MSIX data(2): 0x406d
[18865.136229] ntb_hw_intel 0000:00:03.0: local lower MSIX addr(3): 0xfee00000
[18865.136230] ntb_hw_intel 0000:00:03.0: local MSIX data(3): 0x407d
[18865.136232] ndev_spad_write: NTB unsafe scratchpad access
[18865.136284] ntb_hw_intel 0000:00:03.0: NTB device registered.

modprobe ntb_transport dyndbg=+p
modprobe ntb_netdev

[18865.141346] Software Queue-Pair Transport over NTB, version 4
[18865.141378] ntb_transport 0000:00:03.0: doorbell is unsafe, proceed anyway...
[18865.141380] ntb_transport 0000:00:03.0: scratchpad is unsafe, proceed anyway...
[18865.141426] ntb_hw_intel 0000:00:03.0: Enabling link with max_speed -1 max_width -1
[18865.145908] ntb_hw_intel 0000:00:03.0: Using CPU memcpy for TX
[18865.145911] ntb_hw_intel 0000:00:03.0: Using CPU memcpy for RX
[18865.145942] ntb_hw_intel 0000:00:03.0: NTB Transport QP 0 created
[18865.146207] ntb_hw_intel 0000:00:03.0: eth0 created
[18865.179805] ntb_hw_intel 0000:00:03.0: vec 3 vec_mask f8000
[18865.199385] ntb_hw_intel 0000:00:03.0: Remote version = 4
[18865.199388] ntb_hw_intel 0000:00:03.0: Remote max number of qps = 1
[18865.199389] ntb_hw_intel 0000:00:03.0: Remote number of mws = 1
[18865.199391] ntb_hw_intel 0000:00:03.0: Remote MW0 size = 0x400000
[18866.155540] ntb_hw_intel 0000:00:03.0: ntb_transport_rxc_db: doorbell 0 received
[18866.155543] ntb_hw_intel 0000:00:03.0: qp 0: RX ver 0 len 0 flags 0
[18866.155544] ntb_hw_intel 0000:00:03.0: done flag not set
[18866.155709] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[18866.155820] ntb_hw_intel 0000:00:03.0: Remote QP link status = 404d
[18866.155822] ntb_hw_intel 0000:00:03.0: qp 0: Link Up
[18866.155836] ntb_hw_intel 0000:00:03.0: ntb_transport_rxc_db: doorbell 0 received
[18866.155838] ntb_hw_intel 0000:00:03.0: qp 0: RX ver 0 len 0 flags 0
[18866.155839] ntb_hw_intel 0000:00:03.0: done flag not set
[18866.155840] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
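
If the modules are already loaded, the same debug statements can usually be toggled at runtime through the dynamic debug control file instead of reloading the modules (this assumes the kernel was built with CONFIG_DYNAMIC_DEBUG):
echo 'module ntb_hw_intel +p' > /sys/kernel/debug/dynamic_debug/control
echo 'module ntb_transport +p' > /sys/kernel/debug/dynamic_debug/control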

NTB Performance Measuring Tool

The NTB perf tool, which resides in the test directory, attempts to measure performance as close to the hardware as possible, showing what NTB can achieve with minimal software interaction. Although the measured CPU copy performance is very close to bare metal, the DMA performance cannot reach a similar level because it has to go through the Linux DMA driver.

The current perf driver only supports single-direction performance measurement. Of course, if you run the test long enough, you can in theory run tests on both sides at the same time and interpret the results from each side.

To get started, both sides need to have ntb_perf loaded. The defaults set by the perf driver can be used to quickly check whether things are working properly, but a longer test is suggested to get consistent results.

In dmesg (on a Haswell platform) after both sides have loaded:
Intel(R) PCI-E Non-Transparent Bridge Driver 2.0
ntb_hw_intel 0000:00:03.0: Reduce doorbell count by 1
ndev_spad_write: NTB unsafe scratchpad access
ntb_hw_intel 0000:00:03.0: NTB device registered.

Start the run:
echo 1 > /sys/kernel/debug/ntb_perf/XXXX:XX:XX.X/run

In dmesg:
kthread ntb_perf 0 starting...
ntb_perf 0: copied 17179869184 bytes
ntb_perf 0: lasted 1691968 usecs
ntb_perf 0: MBytes/s: 10153

Kernel module configurable parameters:
seg_order: size order [2^N bytes] of the buffer segment used for testing (uint)
run_order: size order [2^N bytes] of the total data to transfer (uint)
use_dma: use the DMA engine to measure performance (bool)

Both seg_order and run_order can be changed after module load. use_dma must be set upon module load and cannot be changed.

Default:
seg_order = 19 (default 512Kbytes, max 1Mbytes)
run_order = 32 (default 4Gbytes)
use_dma = 0

Additional configurable parameters in debugfs:
/sys/kernel/debug/ntb_perf/XXXX:XX:XX.X/
threads: Number of concurrent threads to run. (Default 1, max 32).
run: Start or stop the test.
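
Putting it together, a longer CPU-copy run might look like the sketch below, with ntb_perf loaded on both sides first and the actual PCIe device address substituted for the placeholder (run_order=34 and 4 threads are only example values):
modprobe ntb_perf use_dma=0 run_order=34
echo 4 > /sys/kernel/debug/ntb_perf/XXXX:XX:XX.X/threads
echo 1 > /sys/kernel/debug/ntb_perf/XXXX:XX:XX.X/run
dmesg | tail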

References

NTB kernel documentation

Intel NTB git repo
branch davejiang/ntb

NTB Wiki

NTB Testing Howto