Linux Capability NET_ADMIN in a Hardened Kubernetes World

You don’t have to be scared of NET_ADMIN, even if you have shared responsibilities and grant it to your customers!

Tobias Giese
Mercedes-Benz Tech Innovation
5 min read · May 20, 2021


Signed-off-by: Tobias Giese
Co-authored-by: Mario Constanti

Since Q2/2017, Daimler has offered a fully managed, on-premises Kubernetes platform. This platform, called Daimler Hybrid Cloud — Container as a Service (DHC CaaS), is built on top of OpenStack (IaaS) and runs in three different geographical regions (NAFTA, EMEA, APAC) on multiple, independent OpenStack installations. To date, our customers, i.e. application teams from various parts of Daimler, have ordered 630 clusters with 3,646 worker nodes and 86,365 running Pods.

Do we grant full access to the Kubernetes clusters?

No, but we give customers cluster-admin-like access to the Kubernetes API. For this we use the Open Policy Agent (OPA) (click the link if you are interested in how we use OPA).

Restrictions like non-root containers are enforced via a Pod Security Policy (PSP).
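A minimal sketch of such a restricted PSP (the values are illustrative only, not our exact production policy) looks roughly like this:

apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: restricted
spec:
  privileged: false
  allowPrivilegeEscalation: false
  requiredDropCapabilities:
    - ALL                      # no extra capabilities may be added, hence no NET_ADMIN
  runAsUser:
    rule: MustRunAsNonRoot     # enforce non-root containers
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: MustRunAs
    ranges:
      - min: 1
        max: 65535
  fsGroup:
    rule: MustRunAs
    ranges:
      - min: 1
        max: 65535
  volumes:
    - configMap
    - secret
    - emptyDir
    - projected
    - downwardAPI
    - persistentVolumeClaim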

This PSP is not editable by the customer through our OPA restrictions.

This PSP is fine if you are not using network-related debug tools like tcptraceroute or tcpdump. Furthermore, other tools like service meshes are also not possible due to the lack of privileges.

But why not just allow NET_ADMIN?

As mentioned before, we are running a fully managed Kubernetes platform with 24/7/365 on-call support. Therefore, we need to ensure that our cluster-critical components will not be interrupted, changed, or even deleted. Also, we don’t want to get paged at night more often than necessary. Fair, right?

That’s why our customers are not able to access or modify workloads in the kube-system namespace.
How about customer-owned namespaces? Would it be safe to grant all customers the possibility of adding the NET_ADMIN capability to their containers?

During multiple internal discussions with colleagues, we assumed that allowing NET_ADMIN and NET_RAW should theoretically be fine, yet we decided to validate this further from a security point of view.
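For reference, granting the capability is simply a matter of the container’s securityContext; a minimal sketch (the Pod name and image are placeholders, not anything we ship) looks like this:

apiVersion: v1
kind: Pod
metadata:
  name: netadmin-demo
spec:
  containers:
    - name: debug
      image: registry.example.com/network-debug:latest   # placeholder image with the usual network tooling
      command: ["sleep", "infinity"]
      securityContext:
        capabilities:
          add: ["NET_ADMIN", "NET_RAW"]   # the capabilities discussed in this post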

From man (7) capabilities

If you look at the capabilities man page, you can find a list of allowed operations.

CAP_NET_ADMIN
Perform various network-related operations:
* interface configuration;
* administration of IP firewall, masquerading, and accounting;
* modify routing tables;
* bind to any address for transparent proxying;
* set type-of-service (TOS);
* clear driver statistics;
* set promiscuous mode;
* enabling multicasting;
* use setsockopt(2) to set the following socket options:
SO_DEBUG, SO_MARK, SO_PRIORITY (for a priority outside the
range 0 to 6), SO_RCVBUFFORCE, and SO_SNDBUFFORCE.

What can we do with all these operations?

As explained in Cluster Networking,

Every Pod gets its own IP address. […] This creates a clean, backwards-compatible model where Pods can be treated much like VMs or physical hosts from the perspectives of port allocation, naming, service discovery, load balancing, application configuration, and migration.

Therefore, the capability NET_ADMIN should be safe for us to use for most use cases.

But let’s take a closer look at all operations.

Interface configuration

Let’s start with some interface configuration and see what we may break. As we like Jumbo Frames, we configured the network interface on a client and a server Pod by setting the MTU to 9000 and ran an iperf test.

In our example application we ran a shell, therefore we had to set the capability on the shell binary instead of on the ip and iperf binaries.
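A sketch of such a setcap call at image-build time (the shell path is an assumption, not our original image) looks like this:

$ setcap cap_net_admin,cap_net_raw+eip /bin/bash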

On the server side, the MTU was set to 9000 and the iperf command was run to listen for requests.
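A sketch of what such a server Deployment could look like (the image, registry and labels are assumptions, not the original manifest):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: server
spec:
  selector:
    matchLabels:
      app: server
  template:
    metadata:
      labels:
        app: server
    spec:
      containers:
        - name: iperf-server
          image: registry.example.com/iperf-debug:latest   # placeholder image containing bash, iproute2 and iperf
          command: ["/bin/bash", "-c"]
          args:
            - ip link set dev eth0 mtu 9000 && iperf -s   # raise the MTU, then listen for requests
          securityContext:
            capabilities:
              add: ["NET_ADMIN"]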

The client also changed the MTU to 9000 and ran the iperf command to send traffic to the server.
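The client side mirrors the server as a Job; again a sketch under the same assumptions (it additionally assumes a Service named server in front of the Deployment):

apiVersion: batch/v1
kind: Job
metadata:
  name: client
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: iperf-client
          image: registry.example.com/iperf-debug:latest   # same placeholder image as the server
          command: ["/bin/bash", "-c"]
          args:
            - ip link set dev eth0 mtu 9000 && iperf -c server   # raise the MTU, then send traffic to the server Service
          securityContext:
            capabilities:
              add: ["NET_ADMIN"]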

After deploying the application, we checked the client side logs:

$ kubectl logs job/client

Server side logs:

$ kubectl logs -l app=server --tail=20

As we can see in the logs, it doesn’t matter how the virtual interface is configured, as long as the underlying host network interface and the virtual switches in between are configured consistently.

Modify routing tables

Modifying the routing table is not expected to be a problem for us because the routing table is also namespaced and will not affect the host network.

A good example is the use of go-mmproxy. Internally, it uses the PROXY protocol together with local routing rules inside the Pod’s network namespace. Hence, you can use it without fear of harming the underlying host system.
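Following go-mmproxy’s documented setup (the table number is arbitrary; this is a sketch, not our exact configuration), the required routing rules inside the Pod boil down to:

$ ip rule add from 127.0.0.1/8 iif lo table 123
$ ip route add local 0.0.0.0/0 dev lo table 123

Both commands need NET_ADMIN, but they only modify the routing tables of the Pod’s own network namespace.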

The usage of the PROXY protocol in our environment might be a good separate article and would go beyond the scope of this one.

Bind to any address for transparent proxying

Theoretically, it’s not an issue for us because we don’t have shared clusters, so spoofing is only possible inside a single dedicated cluster. Traffic leaving the cluster isn’t affected, as we do source NAT on the router interfaces and the source address is rewritten anyway.

Practically, we also didn’t have an issue with IP spoofing, because Calico applies an iptables rule which prevents IP spoofing. The rule in the cali-PREROUTING chain looks like this:

-A cali-PREROUTING -m comment --comment "cali:V6ooGP15glg7wm91" -m mark --mark 0x40000/0x40000 -m rpfilter --invert -j DROP

Before Calico v3.12, IP spoofing was restricted via the rp_filter sysctl (link to PR).

From our understanding (we haven’t tested it), Cilium does something similar to prevent IP spoofing by setting the rp_filter sysctl.

Clear driver statistics

Calico creates a virtual interface (veth pair) for each newly started Pod.

The veth driver only implements a limited set of ethtool_ops. Therefore, it’s not possible to get or set the ring parameters; trying to do so returns -EOPNOTSUPP (operation not supported).
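A quick check from inside such a Pod (illustrative; the exact wording may vary between ethtool versions) is expected to fail exactly as described:

$ ethtool -g eth0
Ring parameters for eth0:
Cannot get device ring settings: Operation not supported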

As a consequence, you won’t run into any issues here if you allow NET_ADMIN.

We haven’t tested it, but from our understanding Cilium also creates a veth pair.

Set promiscuous mode

To validate whether promiscuous mode only affects the namespaced network, we build and deploy a tcpdump container and set its interface to promiscuous mode.
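A minimal sketch of such a Deployment (image and names are assumptions, not the original manifest; tcpdump enables promiscuous mode on the interface by itself unless -p is passed):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: promisc-tcpdump
spec:
  selector:
    matchLabels:
      app: promisc-tcpdump
  template:
    metadata:
      labels:
        app: promisc-tcpdump
    spec:
      containers:
        - name: tcpdump
          image: registry.example.com/tcpdump:latest   # placeholder image shipping tcpdump
          command: ["tcpdump", "-ni", "eth0"]           # capture on the Pod interface in promiscuous mode
          securityContext:
            capabilities:
              add: ["NET_ADMIN", "NET_RAW"]             # NET_RAW for the capture, NET_ADMIN for promiscuous mode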

After deploying the application, we can check the container’s logs. In this case, it is able to sniff traffic outside its own network namespace; the command below reveals the traffic of the kube-apiserver.

$ kubectl logs -f promisc-tcpdump-5b7ff5f6d9-46q5f1
[Screenshot: tcpdump output in promiscuous mode]

As we only see DNS resolution traffic, it’s no problem for us.

Enabling multicasting

At the time of writing this post, we aren’t sure why “Enabling multicasting” is listed as a separate operation in the capabilities man page. Multicasting is toggled on and off per interface, so it should be part of “Interface configuration”. This in turn would mean that, just as with the MTU above, whatever a user configures inside the Pod doesn’t matter as long as the parent network interface is configured the same way.
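For completeness, toggling the flag inside a Pod (an illustrative command, requiring NET_ADMIN) is a single call on the Pod’s interface:

$ ip link set dev eth0 multicast on

The MULTICAST flag then shows up in ip link show eth0, again only inside the Pod’s own network namespace.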

Administration of IP firewall, masquerading, and accounting
Set type-of-service (TOS)
Use setsockopt

We’ve rated these topics as non-critical, as each of them is either scoped to the Pod’s network namespace or depends on the underlying host filesystem (i.e. sockets), which is neither mounted nor readable by default. Therefore, it is safe to say that all these network-related operations are user-level tasks which will not harm the system.
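As an illustration for the firewall and TOS items (a hypothetical example, not something we run in production), a customer could mark their own outgoing traffic with a TOS value via the Pod-local netfilter tables:

$ iptables -t mangle -A OUTPUT -j TOS --set-tos 0x10
$ iptables -t mangle -L OUTPUT -n

These rules live only in the Pod’s network namespace; the node’s own iptables rules, including the Calico chains shown above, remain untouched.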

Conclusion

All in all, allowing the NET_ADMIN capability will be safe for you.
But if you know better, or if we missed some important details, please leave a comment or contact us directly. Seriously!

Thanks to Alexander, Johannes and Sean for reviewing this blog post.
