Kubernetes, Istio and The World Outside Rapido

Sree Rajan · Rapido Labs · Jun 30, 2020

If you are running Kubernetes (k8s) clusters in production and security is of utmost importance to you, you have probably found yourself at a crossroads, choosing between a private and a public cluster. Most of the major cloud providers offer both options via their managed Kubernetes services, and all you need to do is pick one. Once you choose a private cluster, one of the immediate problems to tackle is handling egress traffic (to the outside world). With a private cluster, the nodes only have private IPs, and all egress traffic needs to be routed through some kind of gateway that can talk to the internet.

We were at this crossroads too and decided to use a GKE private cluster with internet access via Cloud NAT. While this worked very well for us from a security perspective, it caused problems when it came to handling outbound traffic bursts, and we had to think about alternative solutions to handle this more efficiently.

Since we were already using a Service Mesh within our k8s clusters, we started thinking about whether we could leverage it in some way to build a better solution to our problem.

The final design we came up with was to use the Service Mesh to route traffic to a set of proxies with SSL pass-through, running on nodes outside the k8s cluster with external IP addresses, thereby bypassing Cloud NAT. We also had to ensure that nothing changed from an application perspective (such as the URLs being configured) and that the proxies had failover.

The setup when we started off was very simple and is shown below.

Egress Via NAT

There is a private k8s cluster and a Cloud NAT, which was set up to perform NAT on the primary address range of the subnet. This works well when the RPS is not that high (how high can be derived from the math stated below). Once you have high-RPS workloads in the cluster, the problem slowly starts to show up. We started seeing connection refused errors in the application logs, and debugging further led us to the Cloud NAT logs, where connection dropped errors were being recorded.

This prompted us to go back to the Cloud NAT documentation and look at the section that explains how ports are allocated based on the external IP addresses (https://cloud.google.com/nat/docs/ports-and-addresses).

At a high level, this is what it says:

The number of ports allocated per node restricts the total number of simultaneous connections that can be established to one unique destination from that node. The destination is derived from IP address, port and protocol.
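
In other words, for any single node:

simultaneous_connections_to(dest_ip, dest_port, protocol) <= ports_allocated_per_node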

The math we overlooked:

ports_per_vm         = 64
total_nat_ip_address = 3
ports_per_nat_ip     = 64512
total_vms            = 100

Doesn’t look bad, right? We need a total of 64 * 100 = 6400 ports and we have much more than that here, 3 * 64512 = 193536.

The issue was not the total number of ports Cloud NAT could allocate; rather, it was the number of ports allocated per node. In this case, it's 64. This means we can only have 64 simultaneous connections to, say, https://example.com (assuming it resolves to a single public IP) from that node.

Now imagine two pods running on the same node, each serving 100 RPS and making one external call per request. This can exhaust the ports on that node and surface as errors in the application, which is exactly what the Cloud NAT logs were telling us when they reported dropped connections. GCP also mentions a 2-minute delay before the gateway can reuse the same source address and port for a new connection, which only makes things worse. Check out the link above for more details.
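
To make that concrete, here is the same scenario in the notation used above. The assumption that each external call holds its NAT source port for roughly a second is ours, purely for illustration:

pods_per_node              = 2
rps_per_pod                = 100
external_calls_per_request = 1
connections_at_peak        ≈ 2 * 100 * 1 = 200   (to the same destination)
ports_per_vm               = 64

Roughly 200 connections needed versus 64 ports available per node is the gap that surfaces as dropped connections in the NAT logs.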

So, as a first step, we decided to ensure there were enough ports allocated per node to handle the traffic bursts.

The new math:

ports_per_vm         = 8192
ports_per_nat_ip     = 64512
total_vms            = 100
total_nat_ip_address = (ports_per_vm * total_vms) / ports_per_nat_ip
                     = (8192 * 100) / 64512 ≈ 13

Whether you need this many ports per node is a question you will have to answer based on your traffic patterns. We tried lower numbers and kept increasing in steps until we reached this value and stopped seeing the connection dropped errors. (This could be because our infrastructure is spread across multiple networks and certain connections have to be made through public IPs.)

Yes, 13 addresses is not that bad. But we had two issues with this:

  1. It became a bit cumbersome to get all these IPs whitelisted at the third-party networks we had to connect to.
  2. To satisfy the use case for a subset of the workloads, we were ensuring that all nodes had the minimum required number of ports allocated. This meant most of these ports sat idle, which is not an efficient use of resources.

To address these problems, we started thinking about a solution that would let us route our high-RPS external calls through a cluster of egress proxies, each with a public IP assigned to it. This would let us bypass the NAT, so we would no longer have to figure out the right value for the ports_per_vm variable; we could keep the NAT IPs to a minimum and expand the egress cluster as and when needed. Nginx with SSL pass-through via the stream module could serve as the egress proxy, and we would no longer have to worry about the limit on simultaneous outbound connections.

Proposed Solution
Proposed Solution (With Istio)

Since all the calls originate from within a k8s cluster equipped with Istio, we could use ServiceEntry, WorkloadEntry, DestinationRule, and VirtualService resources to configure Istio, and in turn Envoy, to route traffic through the egress proxies in a reliable way.

Let's briefly look at what each of these Istio service mesh components is responsible for:

A ServiceEntry allows you to add services outside the mesh to Istio's service registry, thereby enabling traffic management for these services. We will create service entries for the external hosts we are connecting to and for the egress proxies, as both reside outside the mesh.

A WorkloadEntry, along with a ServiceEntry, allows you to configure clusters in Envoy. A cluster is nothing but a group of upstream targets to which traffic is routed based on certain match conditions. We will use workload entries to create a cluster for the egress proxy with multiple endpoints.

A DestinationRule allows you to configure what happens to the traffic destined for a given cluster. We will use a destination rule to configure outlier detection and ejection for the egress proxy endpoints.

A VirtualService allows you to configure routes in Envoy. A route specifies the upstream cluster to which traffic should be sent when a set of conditions matches. We will use a virtual service to route traffic to the external hosts via the egress proxies.

In our case, let's say the external API we are calling is https://example.com and the egress proxies are egress-1.mydomain.internal and egress-2.mydomain.internal. What we need to tell Envoy is:

If you see an SNI named example.com in the request, please forward it to one of the healthy instances among egress-1.mydomain.internal or egress-2.mydomain.internal.

The details of the service entries, workload entries, destination rules and virtual services for achieving the above result are shown below:

  • A workload entry per egress proxy, tagged with a label (app: egress-proxies)
  • Service entry with a workload selector targeting the two workload entries created in the above step.
  • Destination rule instructing Envoy to eject endpoints that have crossed the threshold of failures within a given period of time. You can play around with the outlier detection configuration to match your needs.
  • Virtual service instructing Envoy to route the traffic to example.com via the egress proxy.
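
For reference, here is a minimal sketch of what these resources could look like for this example. The resource names, API versions, and outlier detection thresholds below are illustrative assumptions on our part; adjust them to your Istio version and traffic patterns.

# Workload entries, one per egress proxy, carrying the app: egress-proxies label
apiVersion: networking.istio.io/v1beta1
kind: WorkloadEntry
metadata:
  name: egress-proxy-1
spec:
  address: egress-1.mydomain.internal
  labels:
    app: egress-proxies
---
apiVersion: networking.istio.io/v1beta1
kind: WorkloadEntry
metadata:
  name: egress-proxy-2
spec:
  address: egress-2.mydomain.internal
  labels:
    app: egress-proxies
---
# Service entry for the egress proxies, selecting the workload entries above
apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: egress-proxies
spec:
  hosts:
  - egress.mydomain.internal
  location: MESH_EXTERNAL
  resolution: DNS
  ports:
  - number: 443
    name: tls
    protocol: TLS
  workloadSelector:
    labels:
      app: egress-proxies
---
# Service entry for the external host we want to reach
apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: example-com
spec:
  hosts:
  - example.com
  location: MESH_EXTERNAL
  resolution: DNS
  ports:
  - number: 443
    name: tls
    protocol: TLS
---
# Destination rule enabling outlier detection (ejection) for the egress proxy endpoints
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: egress-proxies
spec:
  host: egress.mydomain.internal
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 60s
      maxEjectionPercent: 100
---
# Virtual service matching the SNI example.com and routing it to the egress proxies
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: example-via-egress
spec:
  hosts:
  - example.com
  tls:
  - match:
    - port: 443
      sniHosts:
      - example.com
    route:
    - destination:
        host: egress.mydomain.internal
        port:
          number: 443

With DNS resolution and the workload selector above, Envoy builds a cluster named outbound|443||egress.mydomain.internal with both proxies as endpoints, which is what the istioctl output further below shows.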

Once the above setup is done, traffic to example.com should be routed via the egress proxies. To verify all is good, you could run the sleep deployment from the Istio docs, make a curl request, and check the Istio proxy logs as mentioned there. You should see the traffic being routed to the cluster we created and not to the PassthroughCluster (a virtual cluster to which all traffic Istio doesn't know about is routed).

You could verify the Istio configuration using these commands:

## Verify the listeners
istioctl proxy-config listeners <pod-name> --port 443 --address 0.0.0.0 -o json | jq

## Output
"filterChainMatch": {
  "serverNames": [
    "example.com"
  ]
  ......
}

## Routing rule when matched:
{
  "name": "envoy.tcp_proxy",
  "typedConfig": {
    "@type": "type.googleapis.com/envoy.config.filter.network.tcp_proxy.v2.TcpProxy",
    "statPrefix": "outbound|443||egress.mydomain.internal",
    "cluster": "outbound|443||egress.mydomain.internal",
    .....

## Verify the cluster and its endpoints
istioctl proxy-config clusters <pod-name> --fqdn "outbound|443||egress.mydomain.internal" -o json | jq

## Output
{
  "name": "outbound|443||egress.mydomain.internal",
  .....
  "clusterName": "outbound|443||egress.mydomain.internal",
  "endpoints": [
    {
      "locality": {},
      "lbEndpoints": [
        {
          "endpoint": {
            "address": {
              "socketAddress": {
                "address": "egress-1.mydomain.internal",
                "portValue": 443
              }
            }
          },
          .....
        },
        {
          "endpoint": {
            "address": {
              "socketAddress": {
                "address": "egress-2.mydomain.internal",
                "portValue": 443
              }
            }
          },
          ....
        }
      ]
    }
  ]
}

If you want to test the ejection of unhealthy endpoints, you could kill one of the egress servers, keep firing requests, and watch the output of the endpoints command to see that endpoint fail the outlier check.

watch -n 1 'istioctl -n services proxy-config endpoints <pod-name> --cluster "outbound|443||egress.mydomain.internal"'

### You should see something like this:
ENDPOINT    STATUS     OUTLIER CHECK    CLUSTER
x.x.x.x     HEALTHY    OK               outbound|443||egress.mydomain.internal
x.x.x.x     HEALTHY    Fail             outbound|443||egress.mydomain.internal

A gist of the nginx config we used is shown below for your reference. Note the use of the stream module. We are using nginx 1.17.9, where the stream module is enabled by default; we have come across some older versions where it is not.

stream {
    # proxy_pass below uses a variable, so nginx needs a resolver;
    # point this at a DNS server reachable from the proxy node
    resolver 8.8.8.8;

    server {
        listen 443;
        # read the SNI from the TLS ClientHello without terminating TLS
        ssl_preread on;
        # pass the connection through to the host named in the SNI, on the same port
        proxy_pass $ssl_preread_server_name:$server_port;
    }
}

Since these egress proxies run on nodes outside the k8s cluster, we do not have an auto-scaling solution for them today. Autoscaling groups could be one option, or a small, separate public k8s cluster hosting the egress proxies might be another (subject to how dynamic external IP addresses are handled, since those can cause issues with whitelisting). We are spiking on these options and will keep publishing our findings.

In summary, service entries and workload entries allow us to add services outside the mesh to Istio's service registry and apply additional traffic management configuration to them via virtual services and destination rules. We used these to add the hosts of the third-party services we talk to, along with the egress proxies, to the service registry, and configured the traffic to be routed via these proxies. This allowed us to bypass Cloud NAT, as these proxies have their own public IPs, and to stop worrying about tuning the ports allocated per node to suit our workloads.

We are always looking out for passionate people to join our Engineering team in Bangalore. Check out the link for open roles: https://bit.ly/2V08LNc
