Container Platform Networking at Cruise
Using Google Kubernetes Engine with a multi-cloud, private hybrid network.
This is part three of our ongoing series on the Cruise PaaS.
Stay tuned for more on observability and deployment!
In our previous posts, we covered how the Cruise PaaS spans multiple Google Kubernetes Engine (GKE) clusters in multiple Google Cloud Platform (GCP) environments and projects, with a bunch of addons to increase the functionality and security of GKE and make it work on our private hybrid-cloud network.
In this post, we’ll cover why we need a private hybrid-cloud network and how it works to provide another layer of security around the Cruise PaaS and its interactions with other internal Cruise workloads.
Why Private? Why Hybrid?
In order for Cruise to move quickly (but safely) towards our goal of launching a ride-hailing service using fully autonomous vehicles, we need access to huge amounts of hardware (both virtual and physical) to run a wide variety of workloads. These workloads include massive test pipelines, machine learning clusters, and data science analysis in addition to multiple distributed backend systems to facilitate ride sharing, mapping, and fleet management. To satisfy the variety and scale of hardware requirements, we use both on-premises data centers and multiple public clouds.
For example, one workflow that requires hybrid connectivity is the process of analyzing recordings from car cameras, lidars, radars, and other sensors. Many people assume that this kind of data is streamed from the car to the cloud, but in reality, so much data is generated from these instruments that trying to stream it all over LTE or even 5G in real-time is simply impossible — there’s just not enough bandwidth. Instead, sensor data is buffered on local disks in the car. When the car returns to a hub location, the recorded data is extracted and uploaded to the cloud over private fiber lines.
There are plenty of other use cases as well, but this alone is enough to require a private hybrid network in order to securely transfer potentially sensitive (and voluminous) data between data centers and clouds. Once in the cloud, the car data is processed, reformatted, chopped into pieces, and consumed by a multitude of internal purposes, like testing, data analysis, simulation, and machine learning.
Cruise takes security and safety very seriously. Because of this, we use private networks for almost all of our computing infrastructure, only exposing to the public internet the services that need to be accessible externally. Being on a private network doesn’t remove the need for internal services to be security-hardened — it just adds another layer of defense against compromise.
The Cruise PaaS runs a majority of our long-running services, batch jobs, and streaming pipelines on the private network. It also provides secure ingress and egress to and from the private network as well as the public internet.
Providing this networking functionality alongside GKE requires configuring, managing, and monitoring many different cloud resources:
- Virtual Private Clouds (VPCs)
- Virtual Private Networks (VPNs)
- CIDR Ranges
- NAT Gateways
- Firewall Rules
- DNS Servers
- Public HTTP(S) Load Balancers
- Public Network Load Balancers
- Internal HTTP(S) Load Balancers
- Internal TCP/UDP Load Balancers
Alongside all those GCP services, we also deploy several custom and open source Kubernetes addons to supplement, integrate, or reconfigure GKE networking.
GCP Hybrid Connectivity Options
There’s an old fundamental truth in network engineering: “One size never fits all.” Every network is different: each has its own set of business goals and associated challenges, and there’s no silver bullet.
Google knows this, and as such has provided a number of transport options to their cloud, each offering its own set of advantages and disadvantages. Below is a quick high-level glimpse into these connectivity methods:
Dedicated Interconnect
With a dedicated interconnect, traffic between your datacenter and the cloud traverses a private, physical line installed by a service provider (or between cages in a shared colocation facility) to connect your site to Google’s edge network. You pay (at minimum) monthly port costs to Google for maintaining the connection, as well as any associated service provider costs.
- Covered by GCP networking SLA
- Supports up to 200 Gbps (2 x 100 Gbps lines) or 80 Gbps (8 x 10 Gbps lines) per individual connection
- Port and egress costs apply
- Internal VPC subnet prefixes only (no public peering option)
- Requires presence in an internet exchange (IX) facility
- Slow provisioning lead times if new lines need to be installed
Partner Interconnect
With a partner interconnect, traffic between your datacenter and the cloud goes over existing lines owned by a service provider. That provider then sells you capacity on their shared line.
- Covered by GCP networking SLA
- Fast provisioning time due to pre-provisioned capacity to third party vendors
- Supports 50 Mbps to 10 Gbps per individual connection
- Port and egress costs apply
- Internal VPC subnet prefixes only (no public peering option)
- Lower bandwidth than dedicated interconnect
- Third-party network traversal and costs required
Virtual Private Networking
With a Virtual Private Network (VPN), traffic between your datacenter and the cloud is encrypted over the public internet, using your existing Internet connection and bandwidth. VPN termination requires that you host an on-premises device capable of static or route-based VPN.
- Covered by GCP networking SLA
- Instant provisioning time
- No private line required, due to Internet-overlayed nature
- Supports dynamic routing using the Border Gateway Protocol (BGP)
- Slower throughput due to IPSec overhead
- Expensive at scale due to internet access and crypto acceleration costs
Direct Peering
With direct peering, traffic between your datacenter and Google traverses a direct connection, like Dedicated Interconnect, but it only peers public IPs, not private IPs. You don’t even need to be a Google Cloud customer here. Direct peering is the traditional route of Internet peering with Google for all of their public services (YouTube, G Suite, etc.), not just GCP.
- 10Gb & 100Gb free public peering with Google
- No port or traffic costs
- Reduced Internet egress rates to your network for your GCP projects
- Not covered by GCP networking SLA
- Public Google IP prefixes only (not VPC-aware)
- Requires presence in an internet exchange (IX) facility
- Requires provider-independent IP address space (PI) — a minimum /24 of public IPv4 space registered with a public autonomous system number (ASN)
- Integrating GCP with your on-premises network can be a manual process
- Not all interested parties are accepted by Google for direct peering
Cruise Hybrid Connectivity Choices
Cruise uses a mix of these connectivity options for cost-efficiency and redundancy: dedicated interconnect for high bandwidth access to GCP VPCs, VPNs for cheap low bandwidth emergency fallback, and direct peering for free high bandwidth access to public GCP services.
One of our most important network metrics is throughput capability to Google Cloud Storage (GCS), which we use as a data lake. However, because GCS is a public service, traffic to GCS from on-premises won’t traverse GCP interconnects by default, instead using (potentially slower) public ISP connections.
To work around this issue, it would be ideal to set up public peering over the GCP interconnects (like AWS Direct Connect Virtual Interfaces), but GCP doesn’t currently offer this. However, GCP does offer Private Google Access (PGA) for On-Premises Hosts, which makes public Google services available on a small, predictable public IP range (199.36.153.4/30). In order for the GCP API domains to resolve to the PGA IPs on hosts within the private intranet, you need to configure private DNS with a DNS CNAME (*.googleapis.com -> restricted.googleapis.com) that masks the public A record. Then, you can configure the GCP Cloud Routers in your VPC to advertise (via BGP) the PGA IP range to your on-premises router. Once advertised, the PGA IPs are reachable from on-premises through the private interconnects (or VPNs).
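As a rough illustration of the DNS half of this setup, the sketch below checks whether the addresses a Google API hostname resolves to fall inside the range Google documents for restricted.googleapis.com (199.36.153.4/30). The function name is hypothetical and this is only a sanity check under that assumption, not part of any GCP tooling.

```python
import ipaddress

# Range Google documents for restricted.googleapis.com (Private Google Access).
PGA_RANGE = ipaddress.ip_network("199.36.153.4/30")


def resolves_to_pga(resolved_ips):
    """Return True only if every resolved address is inside the PGA range.

    `resolved_ips` is a list of IPv4 address strings, for example the
    addresses returned by a DNS lookup of storage.googleapis.com.
    An empty list is treated as a failure.
    """
    addresses = [ipaddress.ip_address(ip) for ip in resolved_ips]
    return bool(addresses) and all(addr in PGA_RANGE for addr in addresses)
```

On an intranet host, the input could come from something like `socket.gethostbyname_ex("storage.googleapis.com")[2]`. If the private CNAME is in place, the check should pass; a public address in the result suggests the host is still using public DNS.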
Finally, we also need routed connectivity between our private GCP VPC subnets and our non-GCP private subnets on-premises and in other clouds. While our private throughput requirements (e.g. office to GKE) are lower than our public throughput requirements (e.g. upload to GCS), reliability and redundancy are critically important so that services running in GKE are always accessible to the rest of our network.
Private Network Backbone
After evaluating our options, we decided to deploy a point-of-presence (PoP) in IX colocation facilities where Google and other cloud providers also have a presence. These PoPs could then be integrated with our WAN backbone, allowing us to easily (and quickly) connect to most cloud providers, carriers, and Internet service providers (ISPs) with minimal effort.
With our cloud connections established in these facilities, our routers act as a bridge between our many private on-premises and cloud networks.
At each IX facility, we build duplicate interconnects to different edge availability domains. This allows us to maintain availability and an SLA with Google. These domains provide isolation during scheduled maintenance, because two domains in the same metro won’t be down for maintenance at the same time. This isolation is important when you’re building for redundancy.
Internal GCP Routing
Once physical connections are established between Cruise and GCP, we ‘divide’ up those links with logical connections known as VLAN Attachments. This ultimately allows us to establish BGP peering sessions between our physical routers and the regional Cloud Routers containing our VPC subnets.
At this point, our VPCs dynamically populate routes from the rest of our network and allow GKE and other compute services full internal reachability across our backbone. Any time a resource needs to communicate with an IP address outside of the VPC in which it lives (including to our other VPCs), it will traverse the Interconnect link at the PoP based on the best route metric to get there.
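The "best route" decision described above can be pictured as a longest-prefix match with a metric tie-break. The sketch below is deliberately simplified (real BGP best-path selection and GCP's route ordering involve more steps), and the `Route` type, next-hop names, and metric values are all hypothetical.

```python
import ipaddress
from typing import NamedTuple


class Route(NamedTuple):
    prefix: str    # destination CIDR learned via BGP
    next_hop: str  # hypothetical label for the interconnect or PoP
    metric: int    # lower is preferred


def select_route(destination, routes):
    """Pick the most specific matching route, then the lowest metric."""
    addr = ipaddress.ip_address(destination)
    candidates = [r for r in routes if addr in ipaddress.ip_network(r.prefix)]
    if not candidates:
        return None  # no route: the destination is unreachable
    return min(
        candidates,
        key=lambda r: (-ipaddress.ip_network(r.prefix).prefixlen, r.metric),
    )
```

With two routes to the same /16 through different PoPs, the one with the lower metric wins; a covering /8 is used only when nothing more specific matches.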
The Kubernetes network model gives each service and pod its own IP address to simplify application development, but that design choice complicates platform configuration and operation, especially when managing multiple Kubernetes clusters. There are two options in GKE to enable this Kubernetes-style networking: route-based networking and VPC-native networking.
Route-based networking statically allocates a contiguous block of IPs (from the subnetwork’s primary IP range) to each node for use by pods on that node. Each node then acts as a router for those IPs. This is the original GKE networking implementation, but it’s not an overlay network like you might use for Kubernetes clusters in other environments.
VPC-native networking uses Alias IPs (secondary IPs) managed by the same subnetwork that manages the node IPs (primary IPs). VPC-native networking is the default and recommended option for a few good reasons:
- Alias IPs are natively routable within the network, including peered networks.
- Alias IPs can be announced through BGP by Cloud Router over interconnects & VPNs, if desired.
- Alias IPs can’t be used by other compute resources on the network, which prevents conflict and allows for dedicated firewall rules.
- Alias IPs can access GCP hosted services without egressing through a NAT gateway.
On the downside, VPC-native networking places a burden on the cluster operator to select and configure dedicated secondary IP ranges for each cluster. Slicing and dicing IPv4 CIDR ranges requires making educated guesses for a number of things:
- Number of clusters
- Number of regions
- Number of nodes
- Number of pods per node
Unfortunately, unless you’re adept at predicting the future, it’s unlikely that your initial GKE CIDR blocks will stand the test of time. For this reason, it’s important to design your GKE clusters to be disposable, so you can change your networking choices later.
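To make the CIDR guesswork a little more concrete, here is a minimal sketch of the sizing arithmetic. It assumes GKE's documented behavior of giving each node an alias range large enough for at least twice its maximum pod count (a /24 for the default maximum of 110 pods); the function names are hypothetical and the node counts are only examples.

```python
def ceil_log2(n):
    """Smallest k such that 2**k >= n, using integer math only."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return (n - 1).bit_length()


def per_node_prefix(max_pods_per_node):
    """Prefix length of the pod range assigned to a single node.

    Assumption: the range must hold at least twice the max pods per node.
    """
    return 32 - ceil_log2(2 * max_pods_per_node)


def pod_range_prefix(max_nodes, max_pods_per_node):
    """Prefix length of the secondary pod range needed for a whole cluster."""
    return per_node_prefix(max_pods_per_node) - ceil_log2(max_nodes)
```

For example, `pod_range_prefix(1000, 110)` gives 14, meaning a /14 pod range, while planning for 5,000 nodes at the same density needs a /11. Lowering pods per node shrinks each node's slice and lets the same range cover more nodes.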
Secondary IP Ranges
Initially, we decided on a strategy of having one subnet per region, each shared by multiple clusters. However, while the primary IP range is expandable after creation, secondary IP ranges can’t be expanded while in use by a GKE cluster. Even if you could expand them, you would need to leave contiguous unallocated IPs available on the network, at which point they might as well be pre-allocated to the subnet.
In practice, IP blocks must be allocated based on incomplete information, and later need to be re-evaluated as the cluster scales and requires additional pod or service IPs. This requires re-creation of the cluster, which in turn requires migrating workloads between clusters: a time consuming, complicated, and error-prone operation.
Until recently, GCP subnets only allowed a maximum of 5 secondary IP ranges, and while pod IP ranges can be shared between clusters, service IP ranges cannot. This restriction meant that you could deploy at most 4 GKE clusters on a single subnet. However, after deploying and operating GKE clusters for a while, it was apparent that we were going to need more than 4 clusters per region. We needed clusters for testing, development, and workloads that require extra isolation, in addition to the general-purpose shared clusters most of our tenants used.
In response to our feedback, Google thankfully increased the maximum number of secondary IP ranges to 30 (per network limits), which gave us some breathing room.
Planning For Change
Another challenge is that having clusters share pod IP ranges means that we can’t delete a pod IP range without deleting all the clusters using it, thus making it hard to change which IPs are used by a cluster. To make things simpler and easier to change, we switched to provisioning a subnet for each GKE cluster. It means a little more CIDR math up front, but is a good architectural choice to keep things easy to change in the future.
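The "CIDR math up front" for a subnet-per-cluster layout might look something like the sketch below. The parent block (10.0.0.0/8), the /13 per cluster, and the /14, /20, /20 split for pods, nodes, and services are all hypothetical sizes chosen for illustration, not Cruise's actual allocation.

```python
import ipaddress


def plan_cluster_ranges(parent_cidr, cluster_names):
    """Carve one non-overlapping /13 per cluster out of a parent block.

    Inside each /13 (hypothetical layout):
      - the first /14 is the pod secondary range
      - the first /20 of the second /14 is the node (primary) range
      - the next /20 is the service secondary range
    """
    parent = ipaddress.ip_network(parent_cidr)
    blocks = list(parent.subnets(new_prefix=13))
    if len(cluster_names) > len(blocks):
        raise ValueError("parent block is too small for that many clusters")

    plan = {}
    for name, block in zip(cluster_names, blocks):
        pods, remainder = block.subnets(new_prefix=14)
        small_blocks = remainder.subnets(new_prefix=20)
        plan[name] = {
            "pods": pods,
            "nodes": next(small_blocks),
            "services": next(small_blocks),
        }
    return plan
```

Because every cluster owns its whole block, deleting a cluster frees its ranges without touching any other cluster, which is the property the subnet-per-cluster approach is after.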
A lot of this complexity comes from the constraints of IPv4. Once GCP, GKE, and Kubernetes support IPv6, it should be much simpler, with fewer up-front decisions and no CIDR math. You could also reduce the complexity of managing clusters by using route-based networking (instead of VPC-native networking); however, performance would take a hit, ingress options would be more limited, anti-spoofing checks would be disabled, and the cluster would be limited to at most 2,000 nodes.
GKE comes with a suite of ingress integrations that should be good enough for most basic use cases. However, one thing to consider is that the public ingress options (from the internet to the cluster) are more robust and mature than the private ingress options (from the intranet to the cluster).
GKE currently provides four integrated ingress solutions:
- External HTTP(S) Load Balancer provides public ingress that supports HTTP(S) and uses layer 7 load balancing (reverse proxies).
- Internal HTTP(S) Load Balancer (Beta) provides private ingress that supports HTTP(S) and uses layer 7 load balancing (reverse proxies).
- Network TCP/UDP Load Balancer (NLB) provides public ingress that supports TCP or UDP and uses regional layer 4 routing (IP translation).
- Internal TCP/UDP Load Balancer (ILB) (Beta) provides private ingress that supports TCP or UDP and uses regional layer 4 routing (IP translation).
For TCP/UDP public ingress, the NLB can be configured by Cruise PaaS tenants using a standard Kubernetes Service resource of type LoadBalancer. The controller that provides the integration is baked into upstream Kubernetes’ GCP cloud provider. The NLB is implemented pretty low in the networking stack, so it doesn’t provide advanced features, like session stickiness, built-in authentication, or path-based routing. Generally, we only use the NLB for non-HTTP traffic, unless it’s a middleman for an application-layer HTTP proxy, like Nginx or Envoy.
For HTTP(S) public ingress, we use the Google External HTTP(S) Load Balancer. The open source Google Load Balancer Controller (GLBC) comes integrated with GKE by default, allowing Cruise PaaS tenants to opt into public ingress using a standard Kubernetes Ingress resource. This (recently generally available) solution has a decent feature set, including Google Cloud Armor (for IP whitelisting), Cloud Identity-Aware Proxy (for authorization), HTTP/2 and websocket support, full isolation between ingress instances, and optional TLS termination.
For TCP/UDP private ingress, the ILB can be configured by Cruise PaaS tenants using a Kubernetes Service resource of type LoadBalancer, the same mechanism used to configure the NLB, except with an annotation to make it private (cloud.google.com/load-balancer-type: "Internal"). The ILB is very similar to the NLB, except that it’s only accessible from within the VPC and, by default, only within the same region. So like the NLB, we generally use the ILB for non-HTTP traffic.
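For readers who have not seen it, the Service described above can be sketched as a small manifest builder. The service name, label, and port are hypothetical; only the `type: LoadBalancer` field and the load-balancer-type annotation come from the text. Since `kubectl` accepts JSON as well as YAML, the output can be applied directly.

```python
import json


def internal_lb_service(name, app_label, port):
    """Build a Service manifest that asks GKE for an internal TCP load balancer."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {
            "name": name,
            # Without this annotation, GKE provisions a public NLB instead.
            "annotations": {"cloud.google.com/load-balancer-type": "Internal"},
        },
        "spec": {
            "type": "LoadBalancer",
            "selector": {"app": app_label},
            "ports": [{"protocol": "TCP", "port": port, "targetPort": port}],
        },
    }


if __name__ == "__main__":
    # Hypothetical tenant service; pipe the output to `kubectl apply -f -`.
    print(json.dumps(internal_lb_service("telemetry", "telemetry", 9000), indent=2))
```

Dropping the annotation from the same manifest yields the public NLB variant, which is why the two are easy to confuse in review.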
For HTTP(S) private ingress, Google recently released the Envoy-based Internal HTTP(S) Load Balancer (still in beta as of mid-2019). This solution looks really promising, but the beta doesn’t support Shared VPCs, Google Cloud Armor, or Cloud Identity-Aware Proxy yet, so it needs more maturity before we could consider using it. Instead, we’ve used the Nginx Ingress Controller, which allows Cruise PaaS tenants to configure their load balancing using annotations on a Kubernetes Ingress resource definition.
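As an illustration of annotation-driven configuration, the sketch below builds an Ingress that selects the Nginx Ingress Controller. It uses the current `networking.k8s.io/v1` schema, which postdates this article, and the class annotation (newer clusters generally use `spec.ingressClassName` instead). The host, service name, and body-size value are hypothetical.

```python
def nginx_ingress(name, host, service_name, service_port):
    """Build an Ingress manifest handled by the Nginx Ingress Controller."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "Ingress",
        "metadata": {
            "name": name,
            "annotations": {
                # Route this Ingress to the Nginx controller rather than GLBC.
                "kubernetes.io/ingress.class": "nginx",
                # Example of per-tenant tuning through an annotation.
                "nginx.ingress.kubernetes.io/proxy-body-size": "8m",
            },
        },
        "spec": {
            "rules": [
                {
                    "host": host,
                    "http": {
                        "paths": [
                            {
                                "path": "/",
                                "pathType": "Prefix",
                                "backend": {
                                    "service": {
                                        "name": service_name,
                                        "port": {"number": service_port},
                                    }
                                },
                            }
                        ]
                    },
                }
            ]
        },
    }
```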
Here’s a diagram of our initial private ingress solution, with a bonus nod to External DNS, which we would like to add to manage DNS records more easily:
Our ingress journey is far from over, though. As we explore multiple regions and service meshes, we’ll definitely have to iterate on our ingress solutions. Stay tuned for future blog posts on these topics!
If your GKE cluster nodes have public IPs, you get egress to the internet for free, but if you deploy your nodes on private subnets for added security (like we do), then your public egress traffic has to transit a NAT gateway to reach the internet. When we originally deployed GKE, we had to deploy our own NAT gateways, using a fork of Google’s NAT Gateway Terraform Module. However, in early 2019 Google launched Cloud NAT, a fully managed solution that has reduced this management overhead for our engineering team.
It’s worth noting that Kubernetes doesn’t natively provide any Quality of Service (QoS) features to isolate ingress or egress. All network traffic shares the individual node’s resources, and more broadly, the network’s resources. As a result, it’s pretty easy for a single Kubernetes pod to consume all the bandwidth of a shared node or the shared NAT gateways and cause a bottleneck, if you’re not careful.
Upgrading to Skylake VMs helped with some of our networking bottlenecks, because Google raised the egress bandwidth cap to 32 Gbps (on 16+ core instances). However, the highest speeds are limited to same-zone VM-to-VM traffic. While the new architecture is still faster than the previous Broadwell architecture, it’s not possible to achieve 32 Gbps on all traffic within a multi-zonal cluster or between regions/clouds.
Traffic shaping is one of the ways we’re looking to increase network isolation between workloads. For example, a service mesh like Istio can enable rate limiting and quotas using sidecar proxies and ingress/egress gateways. Another way would be to use Kubernetes Network Plugins, like the Bandwidth Plugin, which is included in Calico in v3.8.0 (GKE still ships with v3.2.7).
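To give a feel for what rate limiting means in practice, here is a minimal token bucket, a common rate-limiting technique. It is a generic sketch, not the implementation used by Istio or the Bandwidth Plugin; the class name, rate, and burst values are hypothetical, and the caller supplies timestamps so the behavior is easy to reason about.

```python
class TokenBucket:
    """Allow up to `rate` bytes per second on average, with bursts up to `burst`."""

    def __init__(self, rate_bytes_per_s, burst_bytes, now=0.0):
        self.rate = float(rate_bytes_per_s)
        self.capacity = float(burst_bytes)
        self.tokens = float(burst_bytes)  # start with a full bucket
        self.last = now

    def allow(self, nbytes, now):
        """Return True and spend tokens if `nbytes` may be sent at time `now`."""
        elapsed = max(0.0, now - self.last)
        # Refill for the time that has passed, never exceeding the burst size.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = max(self.last, now)
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False
```

A sender that stays under the average rate is never throttled, while a noisy pod drains its bucket and is held back until tokens refill, which is the isolation property a shared node or NAT gateway lacks by default.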
Autoscaling based on network usage is another way to avoid resource exhaustion, reducing the impact tenants can have on each other. NAT gateway autoscaling would help maximize availability at the VPC-level. However, if the NAT Gateways have static public IPs used for firewall whitelisting, those whitelists have to be updated automatically or pre-emptively to allow autoscaling. GKE cluster autoscaling based on workload network requirements would in theory also help maximize availability at the node-level, but Kubernetes doesn’t yet support scheduling based on network resources. Both of these options are possible, but will require significant investment to get right.
Container network QoS is still in its infancy compared to software defined networking capabilities provided by virtual machine infrastructure. If you absolutely need ingress and egress isolation now, you may have to peel back the higher layer abstractions and use a lower level abstraction instead, like virtual machines. Virtual machines on Google Compute Engine (GCE) share host machine networking resources the same way containers do conceptually, but GCE enforces QoS by using a per-VM egress throughput cap.
Recently, Google published an article with guidelines for creating scalable GKE clusters, a great resource that pulls together a lot of the constraints that are strewn throughout the GCP and GKE documentation. It also builds on some of the scalability thresholds put together by the Kubernetes Scalability Special Interest Group (SIG).
One of the most notable constraints is the limit of 5,000 GKE nodes. The GKE docs call this out as a limit, and the Kubernetes docs call it out as a threshold of degraded performance, but it turns out it’s also a networking limitation:
“A single VPC network can have up to 15k connected VMs, so, effectively, a VPC-native cluster scales up to 5000 nodes.”
It may be possible to get this limit of 15k VMs per network increased, but it’s worth noting that this is a VPC-wide constraint, not just a cluster constraint, so adding additional clusters within the VPC won’t get you more than 5k nodes.
Given these constraints, you might be tempted to maximize node size and density, keeping the default 110 pods per node. This strategy is also a good idea when paying per-agent for daemons, like Datadog. But as mentioned above, one must consider how well the networking components of the platform scale vertically.
A final concern with large nodes is that Kubernetes offers no isolation for disk input/output operations per second (IOPS). This means your application workloads can use up all the disk IOPS and break critical components, like the Docker daemon.
The only way to combat these network and disk sharing concerns with GKE’s current capabilities is to maximize the resources available per node (by using the latest VM architectures and the largest instance types) and reduce the maximum number of pods that can be scheduled per node. This way, each pod has more individual capacity available.
Reducing pods per node is generally effective, but not an ideal solution. Ideally, disk and network resources would be managed similarly to other resources in Kubernetes (CPU, memory, and GPU). Having quotas, limits, and hints for scheduling would allow for better isolation between pods, namespaces, and tenants. Another issue with reducing pods per node is that it increases the number of nodes required for the same workload, which can make it more likely that you’ll hit node count limits.
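The trade-off between pod density and node count is simple arithmetic, sketched below. The 5,000-node limit is the one cited in this post; the workload size and per-node pod caps in the example are hypothetical, and the calculation ignores system pods and CPU or memory fit.

```python
GKE_NODE_LIMIT = 5000  # per-cluster node limit cited in the text


def nodes_required(total_pods, max_pods_per_node):
    """Minimum node count if every node were packed to its pod cap."""
    if total_pods < 0 or max_pods_per_node < 1:
        raise ValueError("total_pods must be >= 0 and max_pods_per_node >= 1")
    return -(-total_pods // max_pods_per_node)  # ceiling division


def fits_in_one_cluster(total_pods, max_pods_per_node, node_limit=GKE_NODE_LIMIT):
    """Check the packed node count against the cluster node limit."""
    return nodes_required(total_pods, max_pods_per_node) <= node_limit
```

For a hypothetical 200,000 pods, a cap of 110 pods per node needs 1,819 nodes, but a cap of 32 needs 6,250, which no longer fits in a single cluster.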
You may not need more than 5,000 nodes, especially if they’re 80-core machines running 100 pods each. But as you get closer to the hard node limits, you will probably run into other limitations that require you to replace components and integrations that come with GKE, such as load balancers, KubeDNS, or the cluster autoscaler. At Cruise, our scaling needs have required us to replace some of these components already, but your mileage may vary.
To Be Continued…
In this post, we’ve explained how Cruise deploys Kubernetes clusters on private networks, how we connect these networks with our on-premises datacenters, and how we deploy scalable ingress and egress for our clusters. In the next blog post of this series, we will look at the observability of our clusters and the workloads running on them.
Interested in helping us build the platform paving the way for autonomous vehicles? Check out our open positions.
Updated on 2019-10-16 to correct the description of route-based networking and the benefits of VPC-native networking. Thanks to Tim Hockin for reporting the error.