SR-IOV in Docker containers


In this article, we will talk about SR-IOV and how to use it in Docker containers. The variant we will use here is native SR-IOV (sometimes called "SR-IOV flat"): the VF devices are moved from the default Linux network namespace into the namespace of the container, with no VLAN tagging added on top.
So, on the switch, the packets exchanged will be the native packets sent/received by the containers, with no extra headers (like VLAN tags) on top.

This article is useful for those who are trying to build high-performance containerized applications on a compute host. It is intended to give a clear picture of how SR-IOV (and only SR-IOV) can be used with Docker.
We will not cover any container orchestration here; for the purposes of this article we will deploy Docker containers on a bare-metal compute host (no Kubernetes, Docker Swarm, etc).
I am using a Dell server here, running Red Hat Enterprise Linux 7.5.

  • First, we will enable SR-IOV on two 10G ports of a compute server.
  • Next, we will create two containers on the same server.
  • Next, we will move one VF from each of the two PFs into each of the containers (one VF per container).
  • Finally, we will connect the two 10G ports (of the compute server) externally via a switch.

Enable SR-IOV on two ports of a NIC:

1. Find out the driver of the NIC's physical network device:

[root@MyServer ~]# ethtool -i p1p1
driver: ixgbe
version: 4.4.0-k-rh7.4
firmware-version: 0x8000095d
expansion-rom-version:
bus-info: 0000:41:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no
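Before enabling anything, you can also check how many VFs the device supports at all. A small sketch (the PF name p1p1 is this server's; adjust the path to your own interface, and note that these sysfs attributes only exist on SR-IOV-capable devices):

```shell
# Check the NIC's SR-IOV capability: how many VFs the device supports
# (sriov_totalvfs) and how many are currently enabled (sriov_numvfs).
for f in sriov_totalvfs sriov_numvfs; do
    path=/sys/class/net/p1p1/device/$f
    if [ -r "$path" ]; then
        echo "$f: $(cat "$path")"
    else
        echo "$f: not available on this host"
    fi
done
```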

2. List out the Ethernet devices in your system:

[root@MyServer ~]# lspci | grep Ethernet
01:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
01:00.1 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
01:00.2 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
01:00.3 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
41:00.0 Ethernet controller: Intel Corporation Ethernet 10G 2P X520 Adapter (rev 01)
41:00.1 Ethernet controller: Intel Corporation Ethernet 10G 2P X520 Adapter (rev 01)

3. If you are using an Intel chipset, enable the IOMMU by adding intel_iommu=on to the kernel command line in /etc/default/grub:

[root@MyServer ~]# cat /etc/default/grub
GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR="$(sed 's, release .*$,,g' /etc/system-release)"
GRUB_DEFAULT=saved
GRUB_DISABLE_SUBMENU=true
GRUB_TERMINAL_OUTPUT="console"
GRUB_CMDLINE_LINUX="crashkernel=auto rd.lvm.lv=rhel/root rd.lvm.lv=rhel/swap rhgb quiet intel_iommu=on"
GRUB_DISABLE_RECOVERY="true"
=> Regenerate the grub config after the changes:
[root@MyServer ~]# /sbin/grub2-mkconfig -o /boot/grub2/grub.cfg
Generating grub configuration file …
Found linux image: /boot/vmlinuz-3.10.0-693.17.1.el7.x86_64
Found initrd image: /boot/initramfs-3.10.0-693.17.1.el7.x86_64.img
Found linux image: /boot/vmlinuz-0-rescue-f2841ba114734af8ac2b507172cba935
Found initrd image: /boot/initramfs-0-rescue-f2841ba114734af8ac2b507172cba935.img
done

4. Configure the ixgbe kernel driver with the number of VFs (virtual functions) it should create per PF (physical function).

[root@MyServer ~]# cat /etc/modprobe.d/ixgbe.conf
options ixgbe max_vfs=8
=> In my case, setting max_vfs for the ixgbe driver via modprobe alone did not enable SR-IOV. I also had to add it to the kernel command line in the grub settings to finally make it work.
[root@MyServer ~]# cat /etc/default/grub
GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR="$(sed 's, release .*$,,g' /etc/system-release)"
GRUB_DEFAULT=saved
GRUB_DISABLE_SUBMENU=true
GRUB_TERMINAL_OUTPUT="console"
GRUB_CMDLINE_LINUX="crashkernel=auto rd.lvm.lv=rhel/root rd.lvm.lv=rhel/swap rhgb quiet intel_iommu=on ixgbe.max_vfs=8"
GRUB_DISABLE_RECOVERY="true"
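On kernels that support it, an alternative to the max_vfs module option is to set the VF count at runtime through sysfs. A sketch, using this article's interface name and VF count (run as root; unlike the module option, this does not persist across reboots):

```shell
# Runtime alternative to "options ixgbe max_vfs=8": write the desired VF
# count into sysfs. p1p1 is this server's PF name; adjust to yours.
pf=p1p1
vfs=/sys/class/net/$pf/device/sriov_numvfs
if [ -w "$vfs" ]; then
    echo 0 > "$vfs"   # reset first if VFs are already enabled
    echo 8 > "$vfs"
else
    echo "sriov_numvfs not writable for $pf on this host"
fi
```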

5. After the configuration is done, reboot the system. Once it is back up, check that the virtual adapters (VFs) have been created from each of the physical adapters (PFs) matching the driver you configured for SR-IOV (ixgbe); they show up with "Virtual Function" in their description field.

[root@MyServer ~]# lspci | grep Ethernet
01:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
01:00.1 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
01:00.2 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
01:00.3 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
41:00.0 Ethernet controller: Intel Corporation Ethernet 10G 2P X520 Adapter (rev 01)
41:00.1 Ethernet controller: Intel Corporation Ethernet 10G 2P X520 Adapter (rev 01)

41:10.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
41:10.1 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
41:10.2 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
41:10.3 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
41:10.4 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
41:10.5 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
41:10.6 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
41:10.7 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
41:11.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
41:11.1 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
41:11.2 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
41:11.3 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
41:11.4 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
41:11.5 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
41:11.6 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
41:11.7 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
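Besides lspci, sysfs also shows which VFs belong to which PF: each PF's device directory carries virtfn* symlinks pointing at the VFs' PCI addresses. A sketch (names assume this server's p1p1):

```shell
# List each VF of PF p1p1 with its PCI address, via the virtfn* symlinks.
for vf in /sys/class/net/p1p1/device/virtfn*; do
    if [ -e "$vf" ]; then
        echo "$(basename "$vf") -> $(basename "$(readlink "$vf")")"
    else
        echo "no VFs found under p1p1"
    fi
done
```

`ip link show p1p1` similarly lists one `vf 0` … `vf 7` line per VF, with their MAC addresses.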

6. List out the network interfaces created:

  • Use the command 'ifconfig'. It is a rather long output; check that there are 8 VFs (named <PF>_<0-7>) for each of the PFs, in addition to the already existing PF devices.
  • A snapshot of some of the interfaces, which we will also use later in this article:
[root@MyServer ~]# ifconfig

p1p1_0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
ether 6a:4d:0e:87:a0:05 txqueuelen 1000 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

p1p2_0: flags=4099<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
ether fa:70:03:96:86:bf txqueuelen 1000 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

Create the (Docker) containers:

I will choose the "nginx:alpine" Docker image, as it already bundles network utilities like ping and ifconfig. That makes it easier and quicker to test the concepts we are trying out here.
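The creation commands are not shown below, but the two containers can be started with something like this (a sketch; the names and image match the docker ps output that follows):

```shell
# Start two detached nginx:alpine containers, named as in this article.
docker run -d --name docker01 nginx:alpine
docker run -d --name docker02 nginx:alpine
```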

[root@MyServer ~]# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
5388e03c0e95 nginx:alpine "nginx -g 'daemon of…" 16 seconds ago Up 15 seconds 80/tcp docker02
9f6403b280f8 nginx:alpine "nginx -g 'daemon of…" 20 seconds ago Up 18 seconds 80/tcp docker01
[root@MyServer ~]# docker exec -it docker01 /bin/sh
/ # ifconfig
eth0 Link encap:Ethernet HWaddr 02:42:AC:11:00:03
inet addr:172.17.0.3 Bcast:172.17.255.255 Mask:255.255.0.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:8 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:648 (648.0 B) TX bytes:0 (0.0 B)
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1
RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
[root@MyServer ~]# docker exec -it docker02 /bin/sh
/ # ifconfig
eth0 Link encap:Ethernet HWaddr 02:42:AC:11:00:04
inet addr:172.17.0.4 Bcast:172.17.255.255 Mask:255.255.0.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:8 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:648 (648.0 B) TX bytes:0 (0.0 B)
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1
RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)

As expected, the containers are created with only eth0, attached to the default Docker bridge on the default network 172.17.0.0/16.


Move a VF into (the namespace of) each container:

We will do this in two ways. For the first container, we will do it the easy way, using a fine tool called "pipework".
For the second container, we will do it the "hard way". It is actually the same thing pipework does internally; we will just run the steps ourselves this time.

  • Pipework: a tool (shell script) that makes complex networking setups for containers easy.
  • Download (and install) the pipework tool:
$ wget -O /usr/local/bin/pipework https://raw.githubusercontent.com/jpetazzo/pipework/master/pipework && sudo chmod +x /usr/local/bin/pipework
  1. The "pipework" way:
[root@MyServer ~]# ifconfig p1p1_0
p1p1_0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
ether 6a:4d:0e:87:a0:05 txqueuelen 1000 (Ethernet)
RX packets 1240 bytes 118898 (116.1 KiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 1262 bytes 118580 (115.8 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
[root@MyServer ~]# pipework --direct-phys p1p1_0 -i eth1 docker01 194.168.1.1/24
[root@MyServer ~]# ifconfig p1p1_0
p1p1_0: error fetching interface information: Device not found
[root@MyServer ~]# docker exec -it docker01 ifconfig
eth0 Link encap:Ethernet HWaddr 02:42:AC:11:00:03
inet addr:172.17.0.3 Bcast:172.17.255.255 Mask:255.255.0.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:8 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:648 (648.0 B) TX bytes:0 (0.0 B)
eth1 Link encap:Ethernet HWaddr 6A:4D:0E:87:A0:05
inet addr:194.168.1.1 Bcast:194.168.1.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:1240 errors:0 dropped:0 overruns:0 frame:0
TX packets:1263 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:118898 (116.1 KiB) TX bytes:118622 (115.8 KiB)
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1
RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)

2. The "hard" way:

This is the equivalent of "pipework --direct-phys p1p2_0 -i eth1 docker02 194.168.1.2/24":
[root@MyServer ~]# ifconfig p1p2_0
p1p2_0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
ether fa:70:03:96:86:bf txqueuelen 1000 (Ethernet)
RX packets 1241 bytes 118958 (116.1 KiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 1242 bytes 117852 (115.0 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
[root@MyServer ~]# ip_addr=194.168.1.2/24
[root@MyServer ~]# host_if=p1p2_0
[root@MyServer ~]# guest_if=eth1
[root@MyServer ~]# namespace=$(docker inspect --format='{{ .State.Pid }}' docker02)
[root@MyServer ~]# echo $ip_addr
194.168.1.2/24
[root@MyServer ~]# echo $host_if
p1p2_0
[root@MyServer ~]# echo $guest_if
eth1
[root@MyServer ~]# echo $namespace
45074
[root@MyServer ~]# mkdir -p /var/run/netns/ (Docker will not create a symlink to the container netns by itself)
[root@MyServer ~]# ln -sfT /proc/$namespace/ns/net /var/run/netns/docker02
[root@MyServer ~]# ip link set $host_if netns docker02 (this moves the VF into the container's namespace)
[root@MyServer ~]# ip netns exec docker02 ip link list
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
16: p1p2_0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether fa:70:03:96:86:bf brd ff:ff:ff:ff:ff:ff
41: eth0@if42: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
link/ether 02:42:ac:11:00:04 brd ff:ff:ff:ff:ff:ff link-netnsid 0
[root@MyServer ~]# ip netns exec docker02 ip link set $host_if name $guest_if
[root@MyServer ~]# ip netns exec docker02 ip addr add $ip_addr dev $guest_if
[root@MyServer ~]# ip netns exec docker02 ip link set $guest_if up
[root@MyServer ~]# docker exec -it docker02 ifconfig
eth0 Link encap:Ethernet HWaddr 02:42:AC:11:00:04
inet addr:172.17.0.4 Bcast:172.17.255.255 Mask:255.255.0.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:8 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:648 (648.0 B) TX bytes:0 (0.0 B)
eth1 Link encap:Ethernet HWaddr FA:70:03:96:86:BF
inet addr:194.168.1.2 Bcast:0.0.0.0 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:1245 errors:0 dropped:0 overruns:0 frame:0
TX packets:1246 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:119274 (116.4 KiB) TX bytes:118132 (115.3 KiB)
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1
RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
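The manual steps above can be collected into one small helper. This is only a sketch: the function prints the commands rather than executing them (pipe its output into `sh` as root to apply), and the argument values are this article's examples; the container's init PID comes from `docker inspect --format '{{ .State.Pid }}'` as shown earlier.

```shell
#!/bin/sh
# Print the command sequence that moves an SR-IOV VF into a container's
# network namespace, renames it, assigns an address, and brings it up.
# Args: container name, host-side VF, name inside container, CIDR, init PID.
vf_move_cmds() {
    container=$1; host_if=$2; guest_if=$3; ip_addr=$4; pid=$5
    # Expose the container's netns to "ip netns" (Docker does not symlink it).
    echo "mkdir -p /var/run/netns"
    echo "ln -sfT /proc/$pid/ns/net /var/run/netns/$container"
    # Move the VF into the container, rename it, address it, bring it up.
    echo "ip link set $host_if netns $container"
    echo "ip netns exec $container ip link set $host_if name $guest_if"
    echo "ip netns exec $container ip addr add $ip_addr dev $guest_if"
    echo "ip netns exec $container ip link set $guest_if up"
}

# This article's values for docker02; pipe into "sh" as root to execute.
vf_move_cmds docker02 p1p2_0 eth1 194.168.1.2/24 45074
```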

Connect the 2x 10G ports externally

I connected the 2x 10G ports of the compute host to two ports on an external switch. That is closer to how a real deployment would look.

compute host (port p1p1) ===== (port 17) switch
compute host (port p1p2) ===== (port 18) switch

However, you can also connect the two ports directly (back-to-back), or in any other way you like.

Since the packets from the containers are untagged, I configured a "VMAN" on the switch (and added the two ports, 17 and 18, to it) to carry the packets over transparently. You can adapt your setup accordingly.
Remember that, up to this point in the article, the containers do not apply any VLAN tagging on their SR-IOV interfaces, so the packets arriving at the switch are untagged Ethernet frames.


Test connectivity and verify the setup:

1. Test if the containers are able to ping each other.

[root@MyServer ~]# docker exec -it docker02 /bin/sh
/ # ifconfig eth1
eth1 Link encap:Ethernet HWaddr FA:70:03:96:86:BF
inet addr:194.168.1.2 Bcast:0.0.0.0 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:1245 errors:0 dropped:0 overruns:0 frame:0
TX packets:1246 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:119274 (116.4 KiB) TX bytes:118132 (115.3 KiB)
/ # ping 194.168.1.1
PING 194.168.1.1 (194.168.1.1): 56 data bytes
64 bytes from 194.168.1.1: seq=0 ttl=64 time=0.194 ms
64 bytes from 194.168.1.1: seq=1 ttl=64 time=0.132 ms
64 bytes from 194.168.1.1: seq=2 ttl=64 time=0.134 ms
^C
--- 194.168.1.1 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 0.132/0.153/0.194 ms

2. Check the ARP entries to confirm that the ping packets really went over the SR-IOV interfaces (eth1) and not the Docker bridge:

[root@MyServer ~]# docker exec -it docker02 arp -a
? (172.17.0.1) at 02:42:0d:20:23:b5 [ether] on eth0
? (194.168.1.1) at 6a:4d:0e:87:a0:05 [ether] on eth1
[root@MyServer ~]# docker exec -it docker01 arp -a
? (194.168.1.2) at fa:70:03:96:86:bf [ether] on eth1
? (172.17.0.1) at 02:42:0d:20:23:b5 [ether] on eth0

3. Make sure that the two SR-IOV VF devices are now in the namespaces of the two separate containers, and no longer in the "default" netns.

[root@MyServer ~]# ip link show p1p1_0
Device "p1p1_0" does not exist.
[root@MyServer ~]# ip link show p1p2_0
Device "p1p2_0" does not exist.

4. Check on the switch ports that the packets are being sent/received.
-> On an Extreme switch, the output looks like:

* Sw0706-020.1 # sh port 17,18 statistics
Port Statistics Thu Oct 25 01:16:09 2018
Port Link Tx Pkt Tx Byte Rx Pkt Rx Byte Rx Pkt Rx Pkt Tx Pkt Tx Pkt
State Count Count Count Count Bcast Mcast Bcast Mcast
========= ===== =========== =========== =========== =========== =========== =========== =========== ===========
tenant-c> A 1119 111706 1120 111770 1 0 1 0
tenant-c> A 1120 111770 1119 111706 1 0 1 0
========= ===== =========== =========== =========== =========== =========== =========== =========== ===========
> in Port indicates Port Display Name truncated past 8 characters
> in Count indicates value exceeds column width. Use 'wide' option or '0' to clear.
Link State: A-Active, R-Ready, NP-Port Not Present L-Loopback
0->Clear Counters U->page up D->page down ESC->exit

So, that's it. You have now moved an SR-IOV VF into a container's namespace.

Notes:

  • SR-IOV provides almost line-rate performance for network packets, because the VF devices are presented as just-another-PCI-device to the user-plane application; in this case, the containers. Using the default bridge instead sends packets from the containers down the slow Linux path, traversing the Docker bridge/network.
  • PF passthrough (or PCI passthrough) has the disadvantage that each physical network device is given away exclusively to one container/guest machine. This limits the number of containers that can be hosted on a compute host; also, not every container can generate full line-rate traffic anyway.
  • OVS-DPDK is another option, but it still slows packets down, due to the user-plane OVS processing.
  • SR-IOV gives the best balance here. Containers are typically limited in the CPU cycles and memory they get, as they are expected to be lightweight. Hence, providing a container with a VF (presented as a PCI bus device), where a single PF can create multiple such VFs, gives the best utilization for performance-driven containerized deployments.

Hope you found this article helpful! Happy networking…