Using namespaces on HAProxy to segregate your traffic

William Dauchy
Criteo Tech Blog
Apr 28, 2020

Our Criteo infrastructure handles millions of requests coming from the outside world. Those requests are handled by our HAProxy servers, which are hosted on the same commodity hardware available to all teams. In this article, we will take a closer look at how we use HAProxy namespacing to segregate our public traffic.

Machine management

Our machines are installed using Chef, which retrieves a set of recipes written by different teams in our R&D organization. One of the challenges we wanted to tackle lately was the close interaction between our public traffic and code our network load balancer team does not directly own. Some daemons are maintained by other teams, such as collectd to gather metrics, Consul to register services, or any other common service you can think of that takes care of a machine in a large infrastructure. Each team automatically benefits from the others' work through regular automatic updates. If we go back to our load balancer setup, it was as follows:

It means that any daemon present on the machine could easily bind a VIP present on the tunnel interface. This scenario is scary in the context of a daemon updated by another team, where faulty code could be pushed and start listening to this traffic. We can also imagine several other scenarios involving a malicious daemon.
A higher-level overview of the traffic flow looks like this:

Our traffic comes from the L4 load balancers, which perform tunnel encapsulation so that HAProxy can answer directly to the outside world (direct return).
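As a rough sketch of this starting point (interface names and the address are illustrative, reusing the VIP from the configuration shown later), the decapsulated traffic ends up on a tunnel device living in the root namespace, with the public VIP configured directly on it:

ip link set gre0 up                 # fallback GRE device, present by default in the root namespace
ip addr add 10.0.0.0/32 dev gre0    # the public VIP, visible to (and bindable by) every local daemon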

Segregating our public traffic

We want to make sure only allowed daemons can bind the public IPs we use on those machines. That's where we looked at the namespace feature of HAProxy. One interesting point is that it allows putting the specified socket in a namespace, meaning that we can simply isolate our bind lines and keep the server lines intact:

bind 10.0.0.0:80 name http_ip4 process 1/all tfo namespace haproxy
bind 10.0.0.0:443 name https_ip4 process 1/all ssl crt /etc/haproxy/tls/ alpn h2,http/1.1 tfo allow-0rtt namespace haproxy
bind fe80:0::19:80 name http_ip6 process 1/all tfo namespace haproxy
bind fe80:0::19:443 name https_ip6 process 1/all ssl crt /etc/haproxy/tls/ alpn h2,http/1.1 tfo allow-0rtt namespace haproxy
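One prerequisite, not shown above, is that the haproxy network namespace referenced by those bind lines has to exist before HAProxy binds its sockets; a minimal sketch of its creation (the name only has to match the one used in the configuration):

ip netns add haproxy              # creates /var/run/netns/haproxy
ip -n haproxy link set lo up      # bring the loopback up inside the new namespace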

In order to implement that on our tunneling setup, we first looked at existing technical blog posts to refresh our view of how things are usually done. An interesting point we found was that a lot of examples showed this kind of setup:

This complex setup makes use of inter-namespace communication with veth pair interfaces, sending all the public traffic through them. Proposed alternatives relied on similar devices, such as macvlan, or any other device allowing inter-namespace communication. Moving the whole HAProxy process into the HAProxy namespace could have been a possibility, but we still needed to be able to connect to our backend servers and to let the other OS daemons work as usual; it would have required either duplicated physical interfaces or a more advanced setup making use of network card virtualization such as SR-IOV. We weren't satisfied with these options as they would potentially add unnecessary overhead for the simple problem we wanted to solve. Another, more interesting point was that we weren't able to move the tunnel interface from the root namespace to the HAProxy namespace, the way it is possible with veth interfaces to achieve such cross-namespace communication:

# ip link add veth0 type veth peer name veth1
# ip link set veth1 netns haproxy
# ip -n haproxy link show veth1
17: veth1@if18: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether aa:15:11:92:55:c5 brd ff:ff:ff:ff:ff:ff link-netnsid 0
# ip link set gre0 netns haproxy
RTNETLINK answers: Invalid argument

The reason might be the existing interfaces which are created by default in each namespace by the tunnel modules:

# ip netns add foo
# ip -n foo link show gre0
2: gre0@NONE: <NOARP> mtu 1476 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/gre 0.0.0.0 brd 0.0.0.0

We then discovered the sysctl setting named `net.core.fb_tunnels_only_for_init_net`, which allows disabling the creation of those default interfaces (by the way, they are called "fb" interfaces, for fallback interfaces, as they have historically been present in the drivers' code and are selected as a last resort during the packet interface lookup — writing this here as I had a hard time understanding the "fb" naming while reading the tunnel driver code).

# sysctl -w net.core.fb_tunnels_only_for_init_net=1
net.core.fb_tunnels_only_for_init_net = 1
# ip netns add foo
# ip -n foo link show
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
# ip link set gre0 netns foo
RTNETLINK answers: Invalid argument

Still no luck. To better understand the difference between the two types of interfaces, we had to dig deeper into the features of the different devices. A simple way to look at them is to use ethtool:

# ethtool -k veth0
Features for veth0:
rx-checksumming: on
tx-checksumming: on
tx-checksum-ipv4: off [fixed]
tx-checksum-ip-generic: on
tx-checksum-ipv6: off [fixed]
tx-checksum-fcoe-crc: off [fixed]
tx-checksum-sctp: on
scatter-gather: on
tx-scatter-gather: on
tx-scatter-gather-fraglist: on
tcp-segmentation-offload: on
tx-tcp-segmentation: on
tx-tcp-ecn-segmentation: on
tx-tcp-mangleid-segmentation: on
tx-tcp6-segmentation: on
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off [fixed]
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: off [fixed]
receive-hashing: off [fixed]
highdma: on
rx-vlan-filter: off [fixed]
vlan-challenged: off [fixed]
tx-lockless: on [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: on
tx-gre-csum-segmentation: on
tx-ipxip4-segmentation: on
tx-ipxip6-segmentation: on
tx-udp_tnl-segmentation: on
tx-udp_tnl-csum-segmentation: on
tx-gso-partial: off [fixed]
tx-sctp-segmentation: on
tx-esp-segmentation: off [fixed]
tx-udp-segmentation: off [fixed]
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off [fixed]
rx-all: off [fixed]
tx-vlan-stag-hw-insert: on
rx-vlan-stag-hw-parse: on
rx-vlan-stag-filter: off [fixed]
l2-fwd-offload: off [fixed]
hw-tc-offload: off [fixed]
esp-hw-offload: off [fixed]
esp-tx-csum-hw-offload: off [fixed]
rx-udp_tunnel-port-offload: off [fixed]
tls-hw-tx-offload: off [fixed]
tls-hw-rx-offload: off [fixed]
rx-gro-hw: off [fixed]
tls-hw-record: off [fixed]

An interesting point to note here is the `netns-local: off [fixed]` feature, which is different on the gre0 interface:

# ethtool -k gre0 | grep netns
netns-local: on [fixed]

It seems to indicate we won’t be able to move this interface anyway. A deeper look in the kernel code seems to show that the NETIF_F_NETNS_LOCAL flag is explicitly set on the fallback tunnel interfaces (in ip_tunnel_init_net(), net/ipv4/ip_tunnel.c):

if (!IS_ERR(itn->fb_tunnel_dev)) {
    itn->fb_tunnel_dev->features |= NETIF_F_NETNS_LOCAL;
    itn->fb_tunnel_dev->mtu = ip_tunnel_bind_dev(itn->fb_tunnel_dev);
    ip_tunnel_add(itn, netdev_priv(itn->fb_tunnel_dev));
    itn->type = itn->fb_tunnel_dev->type;
}

This is precisely when we understood that fallback interfaces are somewhat special, as a lot of kernel code handles them specifically. Creating a new gre interface was more successful; the only thing you need to create a similar tunnel interface is one different parameter, such as the underlying interface it is linked to (which is none in the case of the fallback interface):

# ip link add gre1 type gre dev eth0
# ethtool -k gre1 | grep netns
netns-local: off [fixed]
# ip link set dev gre1 netns haproxy
# ip -n haproxy link show gre1
7: gre1@if2: <NOARP,UP,LOWER_UP> mtu 1476 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/gre 0.0.0.0 brd 0.0.0.0 link-netnsid 0

Notice the interesting point here: we now have an explicit link to the root namespace with “link-netnsid 0”. Our issue was finally reaching a satisfactory resolution: being able to move the public traffic into a dedicated namespace, ensuring other local daemons could not see this traffic unless explicitly allowed by the HAProxy configuration, all of this without too much overhead. Note that we arbitrarily chose the ipvlan module for the outgoing traffic, as it is a very simple driver routing traffic directly to the switch, without triggering a lookup in the root namespace.
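Putting the pieces together, here is a possible sketch of the final plumbing, with illustrative interface names and addresses (eth0, 192.0.2.0/24) rather than our actual production values: the gre1 tunnel carries the incoming public traffic into the haproxy namespace, while an ipvlan device attached to the physical interface carries the return traffic straight out to the switch.

ip link add gre1 type gre dev eth0                    # tunnel endpoint, linked to the physical interface
ip link set gre1 netns haproxy                        # incoming public traffic now lands in the namespace
ip -n haproxy link set gre1 up
ip -n haproxy addr add 10.0.0.0/32 dev gre1           # public VIP, only visible inside the namespace

ip link add link eth0 name ipvl0 type ipvlan mode l2  # outgoing path, no lookup in the root namespace
ip link set ipvl0 netns haproxy
ip -n haproxy link set ipvl0 up
ip -n haproxy addr add 192.0.2.10/24 dev ipvl0        # address used to reach the gateway
ip -n haproxy route add default via 192.0.2.1 dev ipvl0

The server lines in the HAProxy configuration are left untouched, so connections to the backend servers still go out through the root namespace as before.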

Conclusion

That was an interesting journey that gave us a deeper understanding of several Linux components, and allowed us to fix some long-standing bugs and improve other parts of the Linux kernel, for all the different types of tunneling we use on our infrastructure. We indeed use `ipip` or `gre` tunneling depending on which devices support it. All the necessary plumbing is also already in place for the day our devices register themselves in IPv6 in our service catalog. It was a good occasion to test all the possibilities and fix the failing cases. We hope to write a follow-up post in the coming months, when we are able to talk more about our IPv6 deployment within our datacenter.

Thanks for reading! We hope you enjoyed the insight. If you are interested in joining the journey, check out our career opportunities below.
