Deploying a scalable STUN service in Kubernetes

Gabor Retvari
L7mp Technologies
Jun 21, 2024 · 10 min read

So you want to deploy your own STUN service. You already know that STUN is an ancillary protocol used in real-time communications that lets hosts located behind a NAT learn their public IP address. You also understand that STUN is critical for WebRTC clients to make direct peer-to-peer connections between themselves. You have good reasons, probably related to privacy and security, to avoid the popular free public STUN services. You are a cloud-native enthusiast, you host your own blog in a dedicated Kubernetes cluster, and you see no reason not to deploy your STUN servers in Kubernetes too.

You’d be surprised how many users ask us how to deploy a cloud-native STUN service in Kubernetes. This post will show a simple solution, leveraging everyone’s favorite Kubernetes media gateway: STUNner.

STUN vs TURN: What’s the big deal?

Let’s start with a short reminder for those of you unfamiliar with the acronym-soup of protocols in WebRTC.

WebRTC connections are inherently peer-to-peer, meaning that peers must establish a direct UDP connection between themselves for transmitting real-time media. Theoretically, this makes it possible for any two browsers to connect to each other without the involvement of a server. Given today’s Internet reality, this is more difficult than it should be: most clients use a private, “unroutable” IP address and sit behind a NAT or a firewall, which prevents them from establishing the direct connection.

Enter the Session Traversal Utilities for NAT (STUN) protocol. STUN helps clients negotiate the connection through a NAT or a firewall by letting any Internet host self-discover its public IP address. Think of a STUN server as something similar to the popular WhatsMyIP.org service, just baked into a nice protocol standardized by the IETF.

Once clients know their own public IP address and port mapping, they can share this information between themselves. In WebRTC speak, during the ICE negotiation the public IP and port mapping learned from STUN is included in the ICE candidate list sent to the peer as a server-reflexive candidate. After exchanging the ICE candidates, the peers attempt to create a UDP connection between the learned public IPs of their respective NATs.
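
For illustration, a server-reflexive candidate in the SDP exchanged during ICE negotiation looks roughly like the below (all addresses are made up: 198.51.100.7 stands for the public IP learned via STUN and 192.168.1.10 for the client’s private address):

a=candidate:1 1 udp 1686052607 198.51.100.7 52431 typ srflx raddr 192.168.1.10 rport 52431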

Whether this succeeds depends on whether both clients can route to the other’s discovered public IP address and port, which is a big “if”. In reality, many NATs and firewalls block direct connections, which will cause the WebRTC peer connection to fail. If your WebRTC service already uses a media gateway hosted in Kubernetes then STUN is also unnecessary (and borderline dangerous due to the fragile mappings it tends to create): just ingest your media traffic into the cluster through TURN (a more “heavy-weight” NAT traversal protocol) using STUNner and disable STUN altogether.

Deploying STUNner as a STUN service

Not convinced to ditch STUN for good? Don’t worry, we understand that in certain cases STUN is unavoidable and we have you covered. Below we show how to deploy STUNner as a scalable STUN service in Kubernetes.

For the purposes of this write-up we use a fresh “GKE standard” Kubernetes cluster with two nodes (the fact that we have two nodes will be important later!) and we install STUNner with the default configuration. The resultant setup is as follows: the direct connection between the public interfaces of the two NAT boxes is what we intend to use for our WebRTC connection.

STUN service in Kubernetes: Testbed setup

First we prepare the usual boilerplate: a GatewayClass to register STUNner as a controller for the Gateway API, plus a GatewayConfig to set basic config.

apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: stunner-gatewayclass
spec:
  controllerName: "stunner.l7mp.io/gateway-operator"
  parametersRef:
    group: "stunner.l7mp.io"
    kind: GatewayConfig
    name: stunner-gatewayconfig
    namespace: stunner
  description: "STUNner is a WebRTC ingress gateway for Kubernetes"
---
apiVersion: stunner.l7mp.io/v1
kind: GatewayConfig
metadata:
  name: stunner-gatewayconfig
  namespace: stunner
spec:
  authType: plaintext
  userName: "dummy-user"
  password: "dummy-password"

Note that the user name and password are arbitrary: by default STUN does not need authentication.

Next we can deploy a STUNner Gateway that will represent our STUN server. The Gateway will be called stun-server-udp and will run a STUN listener on port UDP:3478.

apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: stun-server-udp
  namespace: stunner
spec:
  gatewayClassName: stunner-gatewayclass
  listeners:
    - name: udp-listener
      port: 3478
      protocol: TURN-UDP

Don’t be confused by the fact that we set the protocol to TURN-UDP for the STUN listener: since TURN is an extension of STUN, this still produces a perfectly valid STUN server; it just happens to run a TURN speaker as well. Not that anyone could ever connect anywhere via that TURN server: for that we would need a UDPRoute, which we, quite ingeniously, refuse to add here.
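
For reference, if we ever did want to open up the TURN side towards an in-cluster backend, the UDPRoute would look roughly like the below. This is only a sketch based on the UDPRoute format used in the STUNner docs (the exact apiVersion and fields may differ across STUNner versions), and the backend Service name is hypothetical; do not apply it if all you want is a plain STUN service.

apiVersion: stunner.l7mp.io/v1
kind: UDPRoute
metadata:
  name: media-plane            # hypothetical; omit for a STUN-only service
  namespace: stunner
spec:
  parentRefs:
    - name: stun-server-udp    # the Gateway defined above
  rules:
    - backendRefs:
        - name: media-server   # a hypothetical media server Service
          namespace: default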

STUNner automatically creates a Kubernetes Deployment for each Gateway (with the same name) to run the Gateway’s STUN/TURN servers in stunnerd pods. Gateways will also have an associated LoadBalancer Service deployed automatically, which will expose the Gateway’s listeners to the Internet on a public IP. Among many useful things, this enables scaling the STUN server behind the load balancer dynamically (say, in concert with load).
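
For example, since the Deployment shares the Gateway’s name, bumping the STUN server to two replicas by hand is a one-liner (adjust the replica count to your liking):

$ kubectl -n stunner scale deployment stun-server-udp --replicas=2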

Let’s check that everything works as expected.

$ kubectl -n stunner get pods,svc
NAME                                   READY   STATUS    RESTARTS   AGE
pod/stun-server-udp-7b89c659fd-4ppcf   1/1     Running   0          6m59s

NAME                      TYPE           CLUSTER-IP     EXTERNAL-IP   PORT(S)          AGE
service/stun-server-udp   LoadBalancer   10.15.168.20   A.B.C.D       3478:31211/UDP   22m

Note that you may have to wait a little until the cloud provider assigns a public IP to the LoadBalancer Service (which we replaced with A.B.C.D above).
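
If the EXTERNAL-IP column still shows <pending>, you can simply watch the Service until the address appears:

$ kubectl -n stunner get service stun-server-udp --watch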

Let’s also check the dataplane config of our Gateway. This can be done with the handy stunnerctl tool (make sure to install it first).

$ stunnerctl -n stunner config stun-server-udp
Gateway: stunner/stun-server-udp (loglevel: "all:INFO")
Authentication type: static, username/password: dummy-user/dummy-password
Listeners:
  - Name: stunner/stun-server-udp/udp-listener
    Protocol: TURN-UDP
    Public address:port: A.B.C.D:3478
    Routes: []
    Endpoints: []

All is well: our STUN server is deployed and it is ready to answer STUN requests.

Testing

We will use the icetest.info service to test our STUN service (but feel free to pick your favorite ICE test tool). Open icetest.info, remove any STUN and TURN servers in the default list, and add the new STUN service URI. For that, we first need the public IP and port. We already know these but, as a nice and purely academic exercise, let’s obtain them directly with stunnerctl, shall we?

$ stunnerctl -n stunner config stun-server-udp -o jsonpath='{.listeners[0].public_address}'
A.B.C.D
$ stunnerctl -n stunner config stun-server-udp -o jsonpath='{.listeners[0].public_port}'
3478
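
If you prefer, the full STUN URI can be assembled in one go from these two queries:

$ IP=$(stunnerctl -n stunner config stun-server-udp -o jsonpath='{.listeners[0].public_address}')
$ PORT=$(stunnerctl -n stunner config stun-server-udp -o jsonpath='{.listeners[0].public_port}')
$ echo "stun:${IP}:${PORT}"
stun:A.B.C.D:3478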

Use the Add STUN input box to add the URI stun:A.B.C.D:3478 (replace A.B.C.D with your own IP!) and click START TEST. You should see a result similar to the one below (we removed the host ICE candidates for clarity):

ICE server list
URL: stun:A.B.C.D:3478
Results
IceGatheringState: complete

srflx udp 10.128.0.11:21228 0.0.0.0:0
srflx udp 10.128.0.11:30279 0.0.0.0:0

We got two server reflexive (srflx) ICE candidates: this means our STUN service is functional!
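
If you would rather test from the command line, coturn’s stock STUN client issues the same binding request and prints the reflexive address it learns (this assumes you have the coturn client utilities installed; replace A.B.C.D with your own IP):

$ turnutils_stunclient -p 3478 A.B.C.D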

Preserving clients’ source IP

All is not good, however. Our STUN service should return a public IP address, but it seems we got a private IP 10.128.0.11 instead. This kind of defeats the purpose of STUN, and it will most definitely not work to establish peer-to-peer connections. What is going on here?

The trick is in how Kubernetes ingests external network traffic into the cluster. To summarize a fairly complex process: first the cloud load balancer selects a more or less random node of the cluster, and from there a separate proxy, the kube-proxy, forwards the connection on to one of the pods. Now, kube-proxy ordinarily performs a source NAT, replacing the source address of the forwarded packets with its own. The result is that by the time a packet reaches STUNner it carries the private IP address of the kube-proxy instead of the original client IP we want to learn.

Indeed, a quick check attests that the private address 10.128.0.11 we got in the above STUN response is in fact the IP address of one of the kube-proxy pods:

$ kubectl -n kube-system get pods -l component=kube-proxy \
-o=custom-columns="NAME:.metadata.name,IP:.status.podIP"
NAME                                                IP
kube-proxy-gke-stunner-test-default-pool-xxxx-yyy   10.128.0.11

There are two ways to prevent Kubernetes from touching the source IP address in STUN requests: one ugly and one that is less ugly but still suboptimal. Remember: the default mode of operation in Kubernetes is to swap the client IP, so anything we do to preserve it will involve some sort of hack. Unfortunately, hacks always come with unavoidable long-term consequences and hard-to-debug issues. Yet another argument against relying on STUN for your Kubernetes-bound WebRTC service.

The ugly way

Still not convinced to avoid STUN? Then let’s see the ugly hack first.

This hack requires deploying our STUN server into the host network namespace of the Kubernetes node it runs on. This way the STUN server will share the network namespace of the Kubernetes node, which eliminates the need for the kube-proxy to NAT STUN requests and touch the source IP.

Deploying into the host network namespace is very simple with STUNner (we intentionally made it so): just set the hostNetwork: true config option in the Dataplane spec (the resource is called default unless you have customized it), which serves as a template for creating the stunnerd pods:

$ kubectl patch dataplane default --type=merge -p '{"spec":{"hostNetwork":true}}'

This will immediately redeploy the STUN server pods in the host network namespace and, as we can check using icetest.info, result in a STUN response with a valid public IP:

ICE server list
URL: stun:A.B.C.D:3478
Results
IceGatheringState: complete

srflx udp E.F.G.H:21228 0.0.0.0:0
srflx udp E.F.G.H:30279 0.0.0.0:0

The IP address returned should correspond to the public IP address of the host you are conducting the test from, or, if your host connects via a NAT (which is like 99% of the cases), the external IP address of the NAT box you connect through. (We replaced the real IP address above with E.F.G.H for reasons that are mostly related to privacy and sheer paranoia.) A quick test with WhatsMyIP.org should confirm that the returned address is indeed the public IP associated with the test host.
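
If you prefer the command line for this sanity check, any public IP echo service will do; ifconfig.me is just one example of such a third-party service:

$ curl -4 https://ifconfig.me
E.F.G.H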

If you’re really interested in what’s happening on the STUNner side, you can look into the stunnerd logs (this is always the best way to learn STUNner internals). First elevate the loglevel to all:TRACE to make stunnerd emit extensive debug messages:

$ kubectl -n stunner patch gatewayconfig stunner-gatewayconfig --type=merge \
-p '{"spec":{"logLevel":"all:TRACE"}}'

Then run the icetest.info test again and dump the logs:

$ kubectl -n stunner logs $(kubectl -n stunner get pods -l app=stunner \
-o jsonpath='{.items[0].metadata.name}')

[…] server.go:38: turn DEBUG: Received 20 bytes of udp from E.F.G.H:38532 on [::]:3478
[…] server.go:63: turn DEBUG: Handling TURN packet
[…] stun.go:12: turn DEBUG: Received BindingRequest from E.F.G.H:38532

It turns out that STUNner got the STUN request (it is actually logged as a TURN request but, recall, TURN is just a highly evolved form of STUN) from the correct IP and it responds accordingly.

Didn’t look so ugly, no? Unfortunately, that’s just the surface. In fact, host-networking is a terrible hack that goes against the very spirit of Kubernetes, and using it will unavoidably have terrible long-term consequences involving difficult-to-debug port clashes and serious security problems. In many hosted Kubernetes services (like GKE Autopilot) host-networking is disabled altogether.

Once you are done testing, don’t forget to reset the hostNetwork setting to redeploy the STUNner dataplane into a private network namespace:

$ kubectl patch dataplane default --type=merge -p '{"spec":{"hostNetwork":false}}'

The less ugly way

The less ugly solution is to use the built-in option provided by Kubernetes to preserve clients’ source IP address when ingesting external traffic into the cluster. For this, we must set the externalTrafficPolicy:Local config option in the LoadBalancer Service that exposes our Gateway. There is one catch though (in fact there are two, but let’s start slow): the Service is automatically created by STUNner so there’s no easy way to change it. Luckily, STUNner offers a flexible way to customize LoadBalancer Services via Kubernetes annotations.

Let’s add the stunner.l7mp.io/external-traffic-policy:local annotation to our Gateway:

$ kubectl -n stunner annotate --overwrite gateway stun-server-udp \
stunner.l7mp.io/external-traffic-policy=local
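
You can verify that the annotation has been propagated to the LoadBalancer Service by querying the corresponding Service field, which should now report Local:

$ kubectl -n stunner get service stun-server-udp -o jsonpath='{.spec.externalTrafficPolicy}'
Local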

Running the ICE test again will yield the correct public IP:

Results
IceGatheringState: complete

srflx udp E.F.G.H:52794 0.0.0.0:0
srflx udp E.F.G.H:47375 0.0.0.0:0

Towards a scalable STUN service in Kubernetes

All is not that rosy, however: if you run the test several times (you may need to reload icetest.info along the way) you will see that sometimes the ICE connectivity test fails to produce valid server-reflexive candidates. This depends on the particular Kubernetes install: our GKE tests report a 100% success rate, while other Kubernetes setups produce completely different results. It seems there’s something still missing.

The catch is in the way many Kubernetes deployments implement the externalTrafficPolicy: Local setting. Recall, this config option instructs the kube-proxy to retain the clients’ original source IP address, which is a requirement for our Kubernetes-bound STUN server to produce a valid STUN response. This has the side-effect that if there is no stunnerd pod deployed on the Kubernetes node that happens to receive the request, then Kubernetes has no way to pass the request on to another node since that would involve a NAT step and change the client IP. Thus, the request is silently dropped.
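
You can see the problem for yourself by listing which nodes the stunnerd pods actually run on: with a single replica and two nodes, one node is left without a local endpoint, so requests landing there are lost. The NODE column of the output tells you where each pod landed:

$ kubectl -n stunner get pods -l app=stunner -o wide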

The solution is to make sure at least one stunnerd pod runs on each node. There are a zillion ways to achieve this, from deploying STUNner into a DaemonSet instead of a Deployment to using the topologySpreadConstraints config option. Below we show the simplest possible way: ask the Kubernetes scheduler to prefer nodes that do not already have a stunnerd pod when scheduling new stunnerd pods. This can be done by adding a podAntiAffinity setting to the STUNner dataplane template:

apiVersion: stunner.l7mp.io/v1
kind: Dataplane
metadata:
  name: default
spec:
  ...
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchExpressions:
                - key: app
                  operator: In
                  values:
                    - stunner
            topologyKey: kubernetes.io/hostname   # the standard per-node label
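
For completeness, the topologySpreadConstraints alternative mentioned above would look something like the below. This is only a sketch and assumes your STUNner version exposes the standard pod-level topologySpreadConstraints field in the Dataplane spec:

apiVersion: stunner.l7mp.io/v1
kind: Dataplane
metadata:
  name: default
spec:
  ...
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: ScheduleAnyway   # prefer, but do not force, an even spread
      labelSelector:
        matchLabels:
          app: stunner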

Now, provided you run at least as many stunnerd replicas as nodes (recall the scaling one-liner from earlier), redo the ICE test and you will get a valid server-reflexive candidate every time you try!

Conclusions

Of course, in production you may want to add a horizontal pod autoscaler as well, to adaptively scale the STUN service with user demand. We leave it as an exercise to find out how to ensure that pods are spread evenly across nodes in this case too. Hint: it’s not difficult. This is Kubernetes after all, so the question is usually not “can this be solved at all?”, but rather “given the zillion options to do this, which is the best choice for my use case?”. Deciding this will require some domain knowledge and may also involve some deep Kubernetes hackery. Don’t forget: if lost, we’re always there for you to provide professional help! Just contact us and let’s talk!
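
As a starting point, a minimal HorizontalPodAutoscaler for the dataplane Deployment created for our Gateway could look like the below. This is only a sketch: it scales on CPU utilization, which may or may not be a good proxy for STUN load in your setup, and the replica bounds are arbitrary.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: stun-server-udp
  namespace: stunner
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: stun-server-udp     # the Deployment STUNner created for the Gateway
  minReplicas: 2              # at least one pod per node in our two-node cluster
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70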


Gabor Retvari
L7mp Technologies

Academic & industry researcher in Algorithms & Systems, PhD in EE. Co-founder & CTO of l7mp.io, the Kubernetes WebRTC company.