Where containers got networking wrong.

The Trouble with Service Mesh

John Boero
HashiCorp Solutions Engineering Blog
17 min read · Jul 21, 2021


Along with Transparent Proxy, an exciting minor feature lands with the Consul 1.10 release, and I’m happy to say it’s been on my wish list since HashiCorp entered the service mesh space by adding Connect to Consul. Since then, a lot of options have emerged in the service mesh industry for secure, seamless connectivity between services and orchestrated containers. In my mind this feature helps get services and containers back on track from where they went wrong almost 12 years ago with LXC and eventually Docker. I’m talking about UNIX domain sockets, or UDS.

Photo by The Creative Exchange on Unsplash

Containers and Coffee Shop Wifi

If you think back to the first time you tried Docker or LXC, it was remarkable to get an isolated environment up in seconds with its own filesystem and IP address without needing to start a VM and an OS. Legacy chroots never isolated networking. If two developers need to run services on port 443, just give them each a container with its own IP and let them use whatever ports they want via network namespaces and virtual interfaces. You might compare your container host to a virtual coffee shop with a wifi router performing NAT on its single public IP to give everybody a private IP and masquerade their outgoing requests.

This seems simple enough, but what if you need to send a message directly to someone in the coffee shop next door, which has its own network? What if someone outside of any coffee shop needs to send your IP a message or ask a question? Imagine a physical analogue where you need to grab a pen and fill out a TCP/IP header form to communicate just one ~1500 byte MTU (Maximum Transmission Unit) segment of your message to someone in another network. Each form would look something like this:

Imagine you fill out this form for every KB of that 500MB YouTube video you’re uploading.

On WiFi with poor connection quality this overhead earns its keep, since any packet with a detected defect must be resent entirely via TCP retry, but that’s not normally an issue between most servers. For background on Consul traffic routing, Christoph Puhl has a fantastic write-up and talk on The Life of a Packet through a service mesh in a container environment. Also, here is a great history of the 1500-byte MTU.

Problem

These are some of the networking problems faced by container orchestration and simple NAT networks everywhere. IPC, or interprocess communication, is great for communicating within a single host, but bridging that to another host is tricky. It’s handy for every container to have its own IP address because most legacy apps with a TCP listener can be easily containerized, but how do we pair up all of these private/public IPs and connect them only when necessary? The answer comes in the form of various proxies and overlay networks that hand out dynamic DHCP IPs and port numbers. You might think of container or pod communication like this:

  1. Pod to service: ClusterIP or NodePort, kube-proxy.
  2. Pod to pod, same host: shared bridge.
  3. Pod to pod, different host: overlay network or proxy.
  4. Pod to external (egress): bridge masquerade (NAT).
  5. External to Pod (ingress): LB or NodePort.

If that’s not complicated enough, have a look at iptables on a Kubernetes host running kube-proxy and an overlay network like Calico or Flannel. The good news is that routing these packets via the kernel’s iptables gives good performance while presenting a traditional IPv4 address to container developers. As long as the packet is within all applicable MTU limits, the kernel only needs to flip a few bits and checksum the packet header. There is no need to copy a full packet into an application in user space to process communications. The bad news is that all of this complex traffic doesn’t scale well. In fact, performance highlighted the need to completely replace the kernel’s internal iptables structures with nftables. A Linux machine running 500 containers, managing complex routing with hundreds of internal/external IP addresses and possibly external load balancers, started to push the limits of what iptables was originally intended for. In essence, the complex IPv4 arrangements required for large-scale Linux container workloads helped drive major changes to the kernel itself. Luckily nftables was built to be so backwards compatible with classic iptables management that casual users didn’t notice.
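
A quick, unscientific way to see this for yourself is to count rules on a Kubernetes node; numbers vary wildly by cluster size and CNI, so treat this as a rough gauge rather than a benchmark:

# Roughly gauge iptables bloat on a node (output depends on cluster and CNI)
sudo iptables-save -t nat | wc -l
# Count just the kube-proxy generated chains and rules
sudo iptables-save | grep -c KUBE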

The insecure global nature of TCP/IP networking means every container gets its own interface.

Note that all of this network routing via the Linux kernel has only enabled connectivity. It says nothing about encryption or ACL protection. In fact, most overlay networks give each node a whole new bridge which must use a unique subnet within the network. This usually allows broad network-wide connections whether you want them or not: all or nothing. Also, don’t forget encryption is still up to you and your application. If you use a proxy or terminate TLS outside of this, the TCP flow will most likely be desegmented and reassembled across the network, duplicating work at TCP layer 4.

Overlays show just how needlessly complex container networking has become. This doesn’t even cover access control or encryption. Source

If all of the above has bored you with tedium then I’ve achieved exactly what I wanted to. When you use TCP over IP to communicate there is a lot of overhead involved, especially with containers. Oftentimes your message gets wrapped into TCP or UDP frames, passed around to multiple inspectors, and then returned after it was determined that the destination was actually your local loopback adapter all along. Essentially the kernel maintains a virtual network adapter for localhost just so local IPC can use TCP. Then there’s MTU, which dictates the maximum size of a packet and defaults to 1500 on most interfaces. Loopback often uses a much larger MTU (65536 by default on Linux) to submit more at a time, but that must be reduced, or packets restructured, when meeting an interface with a lower MTU, which costs CPU time. With 10Gb and 100Gb networking, this is why we’re seeing a renewed interest in smart NICs and TCP offload for flow control, encryption, and compression. Note that as loopback is a software interface in the kernel, local traffic probably won’t benefit from any offloading on your fancy smart NIC.
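
You can check the loopback MTU for yourself; a quick sketch on a typical Linux host (interface names will differ on yours):

ip link show lo      # lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 ...
ip link show eth0    # a physical NIC usually reports mtu 1500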

Orchestrator networking can be inefficient.

UNIX Domain Sockets

UNIX Domain Sockets, or UDS, present a much simpler local in-memory IPC mechanism that solves many of the problems presented by loopback TCP sockets. If you’re running a Microsoft shop this hasn’t historically been an option. This type of socket wasn’t introduced into Windows until 2017, a long overdue addition to something that has been standard equipment among POSIX environments. One reason I think applications have relied so much on loopback TCP is that Windows applications had to be built with TCP in mind, as they had no alternative. In keeping with the UNIX and Linux tradition of everything being a file, domain sockets are represented alongside files in a filesystem as a stream. They can be listened on or connected to just like a local port but with several advantages (see the short sketch after this list):

  1. No TCP. UDS doesn’t need to segment data into MTU-sized packets. Instead, UDS communication usually operates natively at the full speed of system memory with relatively low CPU usage.
  2. No IP. It seems insignificant, but your container doesn’t need to wait for DHCP to give it an IP address on a container bridge. That doesn’t sound like a huge time saver, but Docker can actually run out of IPs in its default IPv4 pool, which is usually a /16 (class B sized) network. This scale of deployment is an example of why iptables needed to be replaced with nftables in the kernel. Keeping track of tens of thousands of container IPs and routes, including DHCP leases, has significant overhead at scale. Instead of requesting a dynamic IP and port from a pool, applications can say exactly where they want a socket descriptor.
  3. Unique. Sockets are represented by a unique inode with a stat description of type “socket”. Thousands of containers can each have a listening socket called /path/mysocket.sock, each under a unique inode, and not worry about IP address or port conflicts. With no layer 4 clashes nobody needs to worry about someone else using their port, so they also don’t need their own private IP address or interface.
  4. Secure. Since UDS are represented within the file system by a stat description, they are protected by all the normal system privilege mechanisms. Ownership and group ownership are honored. Who can read? Who can write? What if you want to give access to a different group or hand off permission? All of this is possible, including extended attributes for things like SELinux and AppArmor xattrs. No ACL is required. If a socket is in your container’s root file system then nobody blocked from that path can access it. By contrast, IP:port is global to an environment with no concept of ownership. This is why containers tend to have their own virtual interface and why applications evolve complex proprietary ACLs with token authentication.
  5. Pure speed. With all of the above resolved, you save a significant amount of overhead until you need to leave your host. Anybody can easily benchmark their system. The simple test I use is just a few hundred lines of C by Erik Rigtorp: https://github.com/rigtorp/ipc-bench. It measures your system’s latency and throughput for various transmission sizes, and the difference in my own sample is significant: not just 4% or 5% faster but 4.7x the throughput (9,070Mb/s vs 1,927Mb/s) on small 4KB transmissions. Note that is megaBITS, not megaBYTES. Small messages like microservice REST API calls are the typical case where that inefficiency adds up. Latency is a bit closer with a 1.6x gain (14,799ns vs 23,817ns), since we’re never touching the actual network either way. Maybe ~8μs doesn’t mean much to you, but if you’re an HFT or HPC team it can make a difference. In either case bandwidth is the big winner, and don’t forget the TCP side of this comparison is only the local loopback adapter, so it is still just communicating within memory. It does not include the overhead of switching packets across virtual bridges and external networks; Docker NAT or DNAT is not free. The benchmarks below are from my local machine, which is fairly dated Intel Haswell with hefty but slower DDR4 LRDIMMs. You should absolutely benchmark your own environment to get a feel for it. Repeating the same tests on a cloud VM with latest-gen Intel or AMD Epyc yields up to 90Gb/s throughput with 32KB UDS tests (vs my ~55Gb/s) but suffers similar bottlenecks with TCP. Most modern environments have DMA offloading optimizations in memory controllers which can leave the CPU free to do other tasks, and in some cases the CPU doesn’t even need to touch memory itself during UDS communication. Imagine taking it a step further and implementing service connectivity with RDMA (Remote DMA), which can support this across Infiniband networks (thanks to Jonathan Vermeulen for the brainstorm).
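
As a minimal sketch of what those advantages look like in practice, here is a throwaway UDS echo listener built with socat; the paths and permissions are arbitrary choices for illustration:

mkdir -p /tmp/demo
# Listen on a UDS and echo back whatever is written; mode= sets ordinary file permissions
socat UNIX-LISTEN:/tmp/demo/echo.sock,fork,mode=660 EXEC:cat &
# The socket is just an inode of type "socket" guarded by normal ownership rules
ls -l /tmp/demo/echo.sock        # srw-rw---- ... /tmp/demo/echo.sock
stat -c %F /tmp/demo/echo.sock   # socket
# Any process allowed to reach the path can connect; no IP, port, or DHCP involved
echo hello | socat - UNIX-CONNECT:/tmp/demo/echo.sock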

UDS can give a ~5x speed increase without maxing out the CPU, compared to a maxed-out CPU frantically packetizing or even repacketizing TCP flows while your application waits.

Raw local latency and throughput are both drastically better under UDS than loopback TCP.

I can think back to my days as a DBA and the sound advice that MySQL should listen on a UNIX socket by default. Local database performance is much better with local sockets, and you should only use a TCP socket if you’re connecting externally. Loopback TCP to 127.0.0.1:3306 with MySQL is pointless, as you may as well just be using UNIX sockets. In fact our friends over at Percona have a great write-up on this topic. Encryption and data manipulation will require CPU intervention unless they are supported by your smart NIC or proprietary hardware. This means you probably won’t see full native UDS performance end-to-end with Envoy proxy, but you also won’t have the added overhead of TCP just talking to the proxy itself. Instead TCP will only be used when necessary.

Effects on Consul

At this point we’ve shown how local communication is much better under UNIX domain sockets, but what good is that if we can’t leave the host? In most cases we will still need to send something from a container to another host over TCP. The good news is that Consul Connect and Envoy proxy can bridge UDS to TCP connections. In fact you can do this on your own using the simple socat command. Reusing the MySQL example, we can listen on port 3306 and forward it to the /var/lib/mysql/mysql.sock UDS:

socat TCP-LISTEN:3306 UNIX-CONNECT:/var/lib/mysql/mysql.sock

Note that the user running this command must have permissions on that socket, and this actually circumvents all ownership security on the socket since TCP has no concept of permissions. It also doesn’t encrypt anything! In other words, be careful how you use it.
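
The reverse direction is just as easy if you ever need to present a remote TCP listener as a local socket; a sketch with hypothetical host and path names:

# Expose a remote MySQL TCP listener as a local UDS (unencrypted, for illustration only)
socat UNIX-LISTEN:/tmp/mysql_remote.sock,fork TCP:db.example.com:3306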

Consul, on the other hand, automatically manages all encryption via mTLS using intentions for policy, and now it can connect UNIX domain sockets. It also works across heterogeneous environments and isn’t just tied to Kubernetes or containers. This means that every service or container you create can have its own secure /path/mysql.sock or /path/microservice.sock for plain-text read/write even without an IP address or network. There is no need to look up whatever dynamic IP or port we were allocated because there is only our local socket file, which we can name whatever we want. Forget troubleshooting complex tcpdumps and pcap files when configuration mistakes are made, because there is no virtual coffee shop network within our host. Consul will handle the TLS encryption and authentication for you, and it will even find the right network route. Instead of the convoluted architecture where containers need a private IP, bridge, overlay, proxy, and loopback adapter, a container can be a simple protected directory with no network.

Use UNIX domain sockets for fast and simple IPC communication in your apps. Let Consul sort out the rest for you including automatic mTLS networking between containers, hosts, bare metal, VMs, heterogeneous environments, Cisco and F5 devices, ingress, and more.

UDS offers a much simpler and faster option for service communication. Just one IP per node.

How does this look within Consul? Since the 1.10 beta2 release I can test this with a simple proof of concept. To keep things simple I’ll use a local MySQL container listening on the UDS as described above. It doesn’t need to be a container, but it helps me go full Houdini straitjacket and not use TCP. I’ll completely skip networking on my container so it has no IP address, public or private. As I’m running this locally on Fedora with Podman, the way to start and register a Consul service is as follows:

  1. Run the MySQL server listening at /run/mysqld/mysqld.sock.
  2. Mount a shared host volume over /run/mysqld/.
  3. Register a service on the host with Consul’s new options.

This took a bit of experimenting, as there are additional restrictions on UNIX domain sockets with regards to Podman and kernel contexts. These restrictions exist by design for security. Your host can see the socket in the container’s backing filesystem, but you can’t access it directly. Instead, use a shared volume on the host for where your sockets will live. I’ll mount them from my host’s /tmp/mysql directory for example, and I’ll run the container interactively to show how to initialize and start MySQL. Note the label=disable requirement for Podman shared volumes with SELinux.

podman run -ti --network=none --security-opt label=disable \
-v /tmp/mysql:/run/mysqld:shared \
docker.io/library/mariadb /bin/bash

This will download the image layers for MySQL and give me a tty terminal where I can initialize and run a blank database. For an actual use case I would have an orchestrator automate all of this.

root$ chown mysql /var/lib/mysql     # give the mysql user ownership of the data directory
root$ su mysql
mysql$ mysql_install_db              # initialize a blank database
mysql$ mysqld --skip-networking      # start the server with no TCP listener, socket only

Now I’m running a database in a container with no networking, but I’m listening on a full-performance, secure UNIX socket on the host at /tmp/mysql/mysqld.sock. There is no network attack vector as I don’t need a network in the pod. If someone compromises my container somehow they have no way to connect in or out via network or host access, and yet my database is accessible at the host level. Now I just need to register my service in the host’s Consul agent by dropping this HCL file or its JSON equivalent into /etc/consul.d/. Make sure that health check scripts are enabled in agent config. Beware these will be running under the Consul user.

services = [
  {
    id          = "mysql"
    name        = "mysql"
    socket_path = "/tmp/mysql/mysqld.sock"

    connect {
      sidecar_service {}
    }

    check {
      id       = "mysql-connect"
      name     = "mysql-test"
      args     = ["/usr/bin/mysqladmin", "ping", "-S", "/tmp/mysql/mysqld.sock"]
      interval = "5s"
      timeout  = "1s"
    }
  },
  {
    id   = "client"
    name = "client"

    connect {
      sidecar_service {
        proxy {
          upstreams = [
            {
              destination_name       = "mysql"
              local_bind_socket_path = "/tmp/mysql/mysql_local.sock"
            }
          ]
        }
      }
    }
  }
]

Special thanks to Blake Covarrubias and Mark Anderson of the Consul team for helping me put together a PoC to demonstrate the new UDS support.

If your application needs an HTTP or API health check, you can do that with curl over the socket as well. The interesting thing is you’ll need to specify both the socket and the request URL, as most web servers still need a host to route vhosts, server names, etc.
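
For example, curl can speak HTTP over a UDS directly; the hostname in the URL is only used for Host-header routing, and the socket path and URL below are hypothetical:

curl --unix-socket /tmp/myapp/myapp.sock http://myapp.service/healthz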

First look at the “mysql” service. This HCL looks like a normal Consul service definition, but it uses socket_path instead of ports. I also use a health check of mysqladmin ping on that socket so Consul monitors the health of my database. The connect block is required for Envoy, but I don’t need to define details as it will configure itself automatically via a consul bootstrap command. It will be great when Nomad does this for us automatically in addition to deploying our service, but for now I’ve just opened an issue with the Nomad team. Hopefully it will be a quick feature.
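
As mentioned above, the agent must allow locally defined script checks for that mysqladmin check to run. A minimal sketch, assuming an HCL config directory at /etc/consul.d:

# Allow script-based health checks defined on the local agent
cat >> /etc/consul.d/agent.hcl <<'EOF'
enable_local_script_checks = true
EOF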

Next look at the “client” service. I don’t include any checks because for this example I just want the Connect proxy to run on the host no matter what. Consider it always healthy and always on. A real use case would define a service and how to identify it. All this says is take the upstream service “mysql” and proxy it to a local socket on the host at /tmp/mysql/mysql_local.sock, which is ironically right next to the original socket /tmp/mysql/mysqld.sock. Not exactly impressive even as a party trick. I could easily place that client socket in the path of another container.

There’s a fine line of code between party tricks and practical use cases.

Next define an intention in Consul. This is analogous to firewall rules but at the service level instead of the network layer. Nothing can talk without an explicit intention and admins don’t need to know how many IP addresses are involved in the path or how often they change.

Allow one or more services to connect to my MySQL service.
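
The same allow rule can be created from the CLI; a minimal sketch using the service names registered above:

consul intention create -allow client mysql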

Finally run the Envoy proxy for each service. Again this would normally be done by a scheduler or orchestrator but we are doing this example from the ground up. Consul has a simple built-in proxy but Envoy is the tool of choice and its quick C++ codebase is well supported. Each service will need a proxy running before Consul will consider it healthy and able to connect. The user running these proxies will need permissions to the paths used for sockets.

$ consul connect envoy -sidecar-for=client &
$ consul connect envoy -sidecar-for=mysql &

If all goes well, Envoy is now waiting for proxy control requests from Consul, and a local socket has been created by the client proxy. You can even connect to it with MySQL and let your plaintext traffic be encrypted and decrypted locally:

$ mysql --socket=/tmp/mysql/mysql_local.sock

This feels kind of silly. Why would I care about an extra socket just pointing to another socket? For one thing, I can put this socket in different directories with different user or group access to control who uses it. For a much more powerful example, all I need to do is add another host to my cluster. That client service will run on the other host too, and I can register any application service to access the same socket on any host.

$ mysql --socket=/tmp/mysql/mysql_local.sock 
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 36972
Server version: 5.5.5-10.5.10-MariaDB-1:10.5.10+maria~focal mariadb.org binary distribution

Copyright (c) 2000, 2021, Oracle and/or its affiliates.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql>

Now I’m accessing a MySQL database on a different host in a container that doesn’t even have a network. I can even access it from a client container that also has no network. Consul also supports hybrid cloud and multi-cloud if I want to expose my local service to a cloud resource via UDS. All I need is a local client socket and valid user/group permissions to access my database. No ACL is required thanks to intentions. There’s no need for container DNS and no network to configure whatsoever so long as the host has one. From there it goes end to end via automatic mTLS encryption and it only gets packetized for TCP across the network once via Envoy. If I wanted to I could even bridge that remote UDS as a local TCP listener (IPv4 or IPv6) and vice-versa, but why bother? The fewer IP addresses, the less complicated and more secure the infrastructure will be. Just make sure any services you develop offer TCP/UDS socket options.

Come to think of it, Portal is a great analogy for Consul Connect.

Benchmark

For kicks I put together a basic benchmarking tool I call sockbench to test drive some actual comparisons of UDS and TCP between two Consul hosts. To keep network bandwidth from being the bottleneck, I used my x86_64 workstation connecting to my ARM server across a 10Gb network. At first Envoy 1.18.3 was CPU bound on the ARM box at 30MB/s per flow (each thread), which wasn’t OK. It turns out there was an aarch64 BoringSSL encryption acceleration bug, which was fixed in 1.19 and now yields up to 300MB/s per thread. This is still bottlenecked on CPU by encryption, so I will set up a more thorough test using premium cloud VMs on full-speed networks. In the meantime, UDS performance is marginally faster than TCP on average:

> sockbench -c     (UDS) = 296.094MB/s
> sockbench -c9000 (TCP) = 264.980MB/s

Final Thoughts

Somewhere in the evolution of application architecture, communication has compounded in complexity. The goal in its purest form is to get data from point A to point B quickly and securely. The addition of UNIX domain socket support to Consul Connect opens up a lot of new and interesting use cases, including the possibility of simplified container communication without requiring complex virtual networking within every container host. Now it’s up to orchestrators like Nomad and possibly Kubernetes to automate it and save the manual work done for examples like this. Imagine if all your systemd sockets could sync to Consul.

I would argue that loopback was created as a test mechanism, and any local IPC is actually much quicker and simpler without it. Use UDS for local communication and let Consul sort out everything outside the host, including integrations with your physical network kit like F5 or Cisco and more. As Consul already supports hybrid service discovery and connectivity across container and non-container workloads, this feature adds another great option to the mix for hybrid cloud and zero trust networking at a whole new level of performance and simplicity.
