Networking a globally distributed infrastructure spread across our own hardware in multiple data centers, all while migrating to AWS, has been a challenge. The Zendesk Foundation Network team has helped morph the architecture over time, and we're excited about where it landed. In this series of posts, we want to share how our core networking has evolved and where we plan to go next.
Where we’ve come from
Legacy data centers
Pre-2015, we had four physical data centers across the globe that were fully interconnected using multiple VPLS Layer 2 circuits and BGP routers for IP high availability. It started out relatively simple (as things often do) and looked like this:
Transitioning to AWS
Before we get into our cloud journey, it’s important to clarify how we use a partitioning strategy for our infrastructure at Zendesk. We have multiple partitions distributed around the world, where each one has everything (services, datastores, workers, etc.) that is needed to run Zendesk. Each customer account resides in one home partition.
In 2016, Zendesk decided to build a partition outside of our data centers, and we picked AWS as our cloud provider. If you're curious about the whys, make sure you check Part 2 of the three-part series from our VP of Engineering, Jason Smale, where he goes down the rabbit hole of Zendesk infrastructure history.
As we spun up our first VPC in Tokyo, we started our hybrid journey. At this time we decided not to leverage VPLS or Direct Connect circuits: it didn’t make sense to pay for a private circuit all the way to Japan given our low throughput utilization (mostly code deployments, write queries to our global account database cluster, and HashiCorp Consul gossip protocol traffic). Instead, we decided to spin up VPN tunnels from the data centers to AWS.
This is how it looked:
At this stage, we had two firewalls per datacenter, with two site-to-site VPN tunnels per firewall to the AWS VPC. This meant each data center had four tunnels to AWS, but only one was active. Consequently, we had high availability but subpar performance. Although not perfect, we ran steadily with this model throughout 2016.
In 2017, we began our journey to decommission the data centers and morph into a cloud-native platform. As such, we fully committed to migrating all the data center services to AWS. We said to ourselves, “The Tokyo partition came out working well. Let’s build another partition in Oregon!” That’s where we encountered some complexities.
At that point in time, when we built a new VPC, we had to come back and touch the configuration of all the existing environments 😱 This meant performing firewall changes in four data centers and all other VPCs to make sure they were configured with new VPN tunnels and their route policies. This required a manual config plan and peer review to make sure all IPs, subnet masks, keys, tunnels, BGP peers and policies, etc. were all properly set.
Even though we were only creating new tunnels (we would not touch existing tunnels), the data center firewalls were also at the edge of multiple partitions for egress traffic. It was the type of slightly alarming risk where you say, “if we break something in this firewall we may disrupt cross-region traffic, and might interrupt all local data center traffic going out to the internet!” That never happened, but it was nonetheless very time consuming to make sure the config plan was ready and safe to be executed. A change execution plan considering our eight firewalls would span something like 500 lines, just to add one more VPC! Even with config management tools, the risk at each addition was still present and the overall solution wasn’t scalable.
We started to get concerned. We knew we’d eventually need to create at least five more VPCs as we transitioned to AWS. With just two VPCs, we already had 33 IPsec tunnels to manage, while only nine of them were active (in other words, a lot of overhead). We also had another problem: at that time, AWS had no VPC peering feature, so we had to build the inter-VPC connectivity ourselves by leveraging EC2 instances with an open-source implementation of IPsec called Libreswan.
As the new partitions were created in AWS, we started moving masses of services and data from the data center partitions to the new AWS partitions. After some time, AWS launched VPC peering, but initially it was not available in all the regions we required. We ended up with a mix of VPC peerings, VPNs between VPCs, and VPNs down to the data centers. It was very complicated to scale. This was the spaghetti situation we found ourselves in.
For everyone’s benefit, we went back to the whiteboard and said, “we have to find a way to make this simple, cost-effective, quick and safe to deploy.” We wrote down these requirements:
- Provide full-mesh encrypted and dual-stack connectivity
- Support on-premise data centers and multi-cloud providers
- Scalable data plane
- Describe the network stack as code
- Fully automated: bootstrap, scaling, and self-healing
We came up with a model to build a Dynamic Multipoint VPN (DMVPN) network using the latest tooling available for describing the infrastructure as code. Internally, we needed a name for this new project. When I first showed my co-worker our current VPN network diagrams with all those tunnels, his instant reaction was to panic. He said, “that looks like Medusa!” … and from then on we called it Project Medusa. The name seemed particularly fitting as engineers tended to turn to stone when assigned to execute or peer-review a new VPN config plan 😅
Where we’re at now
Our current architecture involves a suite of protocols that includes IPsec, Multipoint Generic Routing Encapsulation (mGRE), Next Hop Resolution Protocol (NHRP), and Border Gateway Protocol (BGP). The setup we run today was built in-house on top of open-source software.
Let’s talk about our motivations for taking this path.
- Linux portability, so it could run in data centers and across multiple cloud providers
- Cost-effectiveness: we didn’t want to purchase network virtual appliance licenses for every new VPC build
- Integration with our existing monitoring tools, such as Datadog for performance metrics and logs, and ThousandEyes for packet loss, latency, jitter, and path visualization
- Flexible foundation for extending functionality and building add-ons
At the time, there was no product offering that fit all of our requirements, so we built it ourselves.
This network model starts with a hub-and-spoke topology. To discover the entire network topology, each spoke connects to the hub routers and receives the network prefixes fed by BGP. Direct tunnels between all the spokes are then created dynamically, so communication happens directly. This technique spares us from statically configuring a full mesh of tunnels between spokes, which greatly simplifies the initial configuration.
The next question is: how do we start building, and where do we put the DMVPN hub routers?
At Zendesk, the majority of our VPCs are home to Zendesk partitions, but we also have regional Shared Service VPCs. These outlying VPCs host services for multiple partitions. Even amongst our own engineers, it’s not common knowledge that we have Shared Service VPCs. They are a cool bit of our infrastructure that is becoming more critical over time.
We decided to put the DMVPN hub routers in those Shared Service VPCs. It fits like a glove: they are neutral ground (belonging to no partition), and they were already present in our infrastructure with dial-tone services such as LDAP and DNS available, which saved us some time. We deployed a pair of EC2 DMVPN hubs in each of two regions, us-east-1 and eu-central-1, to act as the network port of entry for the private backbone.
To connect a partition (spoke) VPC to the DMVPN network, we deployed EC2 instances (Medusa routers) in an autoscaling group for each availability zone. We also leveraged our custom Ubuntu 18 based AMI to include the DMVPN daemons for IKE, IPsec, openNHRP, and BIRD for BGP (all the required protocols are now bundled in recent releases of FRRouting).
We also enabled the ECMP Layer 4 hash policy for multipath routing in the Linux kernel (`net.ipv4.fib_multipath_hash_policy = 1`) and the BGP Additional Paths extension in BIRD (`merge paths on; add paths rx;`), so we’re no longer capped to a single ingress path. All tunnels are now active, using per-flow load balancing.
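To illustrate what per-flow load balancing means in practice, here is a minimal Python sketch of Layer 4 five-tuple hashing. This is only a conceptual model of the behavior the kernel setting enables, not the kernel's actual hash function, and the tunnel names are invented:

```python
import struct
import zlib

def pick_tunnel(src_ip, dst_ip, src_port, dst_port, proto, tunnels):
    """Hash the L4 five-tuple and map it onto one of the available tunnels.

    Conceptually this is what Layer 4 multipath hashing does: flows are
    spread across paths, but every packet of a given flow always takes
    the same path, so per-flow packet ordering is preserved.
    """
    key = struct.pack("!4s4sHHB",
                      bytes(map(int, src_ip.split("."))),
                      bytes(map(int, dst_ip.split("."))),
                      src_port, dst_port, proto)
    return tunnels[zlib.crc32(key) % len(tunnels)]

tunnels = ["gre1", "gre2", "gre3", "gre4"]  # hypothetical tunnel interfaces
# The same flow always maps to the same tunnel...
a = pick_tunnel("10.1.0.5", "10.2.0.9", 44321, 443, 6, tunnels)
b = pick_tunnel("10.1.0.5", "10.2.0.9", 44321, 443, 6, tunnels)
assert a == b
# ...while different flows can land on different tunnels, keeping all
# tunnels active at once.
```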
We had multiple challenges while fine-tuning the Medusa routers in our staging environment to reach their best performance before we could go live in production. We started development on the stable 4.10 Linux kernel, but soon found out that the most recent (at the time) kernel, 4.12, was required to enable Layer 4 multipath.
After upgrading our routers from kernel 4.10 to 4.12, instead of the performance improvement we expected from load balancing across multiple tunnels, we saw a drastic reduction in IPsec throughput. After tracing the issue down to a recent kernel commit, I reported a kernel bug and cooperated with a very cool netdev contributor to come up with and test a patch, which was later upstreamed to the next 4.12 kernel release. Fun times!
Another important piece of the Medusa router is how it propagates the BGP-learned routes from the DMVPN network into the VPC. A daemon written in Python runs on every router and is responsible for hijacking all non-local VPC private traffic, so that traffic can be routed onward to its remote destination in the network.
It looks like this:
Everybody’s next question is, “how does the routing work?”
At a basic level, Medusa routers inject the RFC1918 prefixes (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16) into the VPC routing tables within their own availability zone. The next hop for the injected prefixes is the Medusa router's own Elastic Network Interface (ENI). All VPC private traffic destined for somewhere outside the VPC is then routed through the Medusa routers.
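The routing decision those injected routes produce can be sketched with Python's stdlib `ipaddress` module. This is a toy model of the forwarding logic, and the ENI identifier and CIDRs are hypothetical:

```python
import ipaddress

RFC1918 = [ipaddress.ip_network(n) for n in
           ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]

def next_hop(dst, local_vpc_cidr, medusa_eni="eni-medusa"):
    """Decide where the VPC route table sends a given destination.

    Traffic to addresses inside the local VPC stays local; private
    (RFC1918) traffic for anywhere else is steered to the Medusa
    router's ENI; everything else follows the default route (NAT/IGW).
    """
    addr = ipaddress.ip_address(dst)
    if addr in ipaddress.ip_network(local_vpc_cidr):
        return "local"
    if any(addr in net for net in RFC1918):
        return medusa_eni
    return "default"

assert next_hop("10.20.1.7", "10.10.0.0/16") == "eni-medusa"   # remote partition
assert next_hop("10.10.3.4", "10.10.0.0/16") == "local"        # same VPC
assert next_hop("52.94.0.1", "10.10.0.0/16") == "default"      # internet-bound
```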
We implemented a distributed lock mechanism to enable high availability. If one Medusa router goes down, failover happens very fast: another healthy Medusa router in the same availability zone rewrites the existing RFC1918 routes’ next hop to use its own ENI (analogous to VRRP, but since AWS doesn’t support multicast we had to write our own). This mechanism also lets us roll out upgrades by bringing up a new generation of Medusa routers with a new version of the AMI, allowing traffic to gradually converge from the old routers to the new ones before shutting the old ones down. All of this happens without losing a single packet!
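The failover step itself amounts to rewriting three next hops. A minimal simulation, assuming the route table is reduced to a dict and using hypothetical ENI names (the real implementation calls the EC2 API under a distributed lock):

```python
RFC1918 = ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")

def fail_over(route_table, healthy_eni):
    """Point every RFC1918 aggregate at the surviving router's ENI.

    A toy stand-in for the replace-route call the lock winner performs
    against the AZ's VPC route table during failover.
    """
    for prefix in RFC1918:
        route_table[prefix] = healthy_eni
    return route_table

# The AZ route table initially points at router A...
table = {prefix: "eni-router-a" for prefix in RFC1918}
# ...router A dies, router B wins the lock and takes over all 3 routes.
fail_over(table, "eni-router-b")
assert all(eni == "eni-router-b" for eni in table.values())
```

The same mechanism drives upgrades: the new-generation router simply "fails over" traffic to itself while the old router is still draining.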
Following VPC best practices, we create one public routing table, where the default route is an Internet Gateway (IGW), and three private routing tables (one per availability zone), where the default route is an in-zone NAT Gateway. This means that when private instances go out to the Internet, we don’t incur cross-AZ traffic and the extra costs associated with it. We apply the same principle by deploying a pair of Medusa routers in each AZ.
Fortunately, this setup also solved another problem. AWS has a known hard limit of 50 routes per route table, and if we had kept creating bilateral routes, we would have hit that wall very soon. Medusa routers act as an aggregator, injecting only three route entries into their own AZ routing table. Meanwhile, the Linux kernel routing table holds the full network topology, with specific routes via direct mGRE tunnels, all fed by the DMVPN hub routers using BGP. This approach dramatically simplifies bringing new VPCs onto the network, as the bootstrap scripts only need to know the IP addresses and keys of the four DMVPN hub routers.
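The interplay between the coarse VPC routes and the specific kernel routes is plain longest-prefix matching. A small sketch (the prefixes, tunnel interfaces, and ENI name here are invented for illustration):

```python
import ipaddress

def lookup(table, dst):
    """Longest-prefix match: the most specific route wins, as in the kernel FIB."""
    addr = ipaddress.ip_address(dst)
    matches = [(ipaddress.ip_network(p), nh) for p, nh in table.items()
               if addr in ipaddress.ip_network(p)]
    return max(matches, key=lambda m: m[0].prefixlen)[1]

# The VPC route table holds only the three RFC1918 aggregates...
vpc_table = {"10.0.0.0/8": "eni-medusa",
             "172.16.0.0/12": "eni-medusa",
             "192.168.0.0/16": "eni-medusa"}
# ...while the Medusa router's kernel table has the full BGP-fed topology.
kernel_table = {"10.20.0.0/16": "gre-tokyo",
                "10.30.0.0/16": "gre-oregon",
                "10.0.0.0/8": "gre-hub"}   # hub as fallback for unknown spokes

assert lookup(vpc_table, "10.20.1.7") == "eni-medusa"
assert lookup(kernel_table, "10.20.1.7") == "gre-tokyo"   # /16 beats /8
assert lookup(kernel_table, "10.99.0.1") == "gre-hub"     # falls back to hub
```

The VPC table never grows past three entries no matter how many VPCs join, while the kernel table absorbs the per-VPC detail.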
At this point in the story, people usually say, “Packets are routed from one VPC to another through the hub routers as a triangle? How’s that different from AWS Transit VPC?”
Next Hop Resolution Protocol (NHRP)
NHRP is one of my favorite moving parts because it’s where most of the magic happens. Let’s have a closer look.
In the diagram above we have two spoke VPCs. If we want to send a packet from one VPC to the other, the traffic must initially go through the hub, as we don’t know the direct route yet. In turn, the hub forwards the packet to the destination VPC. Of course, there is a latency penalty because the packet goes through the hub as a mediator, but NHRP on the hub then sends a redirect message to the origin VPC’s Medusa router with the public IP address of the destination VPC’s Medusa router, basically saying, “set up a direct tunnel to your destination router and stop bothering me.” The origin initiates a new IPsec tunnel to the destination, and all further communication takes place over this new direct tunnel.
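The redirect dance can be modeled in a few lines. This toy Spoke class only tracks hop counts and shortcut state; the real work (NHRP resolution, IPsec negotiation, mGRE tunnel setup) is elided, and the spoke names are made up:

```python
class Spoke:
    """Toy model of NHRP shortcut creation between two spoke routers."""

    def __init__(self, name):
        self.name = name
        self.direct_tunnels = set()   # peers we have a shortcut tunnel to

    def path_to(self, peer):
        """Return the hop list a packet takes, building a shortcut on redirect."""
        if peer.name in self.direct_tunnels:
            return [self.name, peer.name]          # 2 hops: direct tunnel
        # No shortcut yet: this packet relays through the hub, and the
        # hub's NHRP redirect tells us the peer's public address so we
        # can bring up a direct tunnel for all subsequent packets.
        self.direct_tunnels.add(peer.name)
        return [self.name, "hub", peer.name]       # 3 hops via the hub

tokyo, oregon = Spoke("tokyo"), Spoke("oregon")
first = tokyo.path_to(oregon)     # ['tokyo', 'hub', 'oregon']
second = tokyo.path_to(oregon)    # ['tokyo', 'oregon']
assert len(first) == 3 and len(second) == 2
```

This mirrors the traceroute behavior described below: 3 hops for the first probe, 2 hops once the shortcut converges.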
It’s fun to see this happening in practice. When you log in to a node in the origin VPC and issue a traceroute to a node in a not-yet-directly-connected VPC, you see a path with 3 hops and high latency, as it goes through the hub. A second traceroute shows 2 hops and drastically reduced latency, as traffic converges directly between the two VPCs. It’s impressive to watch the network converge as all the dynamic full-mesh tunnels are created in real time.
Essentially it’s a model where VPCs are initially connected to the hubs only. As we scale, and traffic starts flowing between VPCs, direct tunnels are built on demand from one router to all the other routers. After full convergence, the connection to the hub is avoided, and only direct tunnels are carrying data, forming a full-mesh topology.
It looks like this:
Connecting a VPC with Medusa
I’ve described our current backend, but it’s at the front line that Zendesk engineers see the benefits of an architecture that uses infrastructure as code. It gives them speed, simplicity, and the ability to spin up an entire infrastructure with a single command. We use Terraform for our Medusa infrastructure tooling. These days, if an engineer needs to connect an existing VPC to Medusa, there are five parameters they need to input (instead of the 500 lines we had in the old days):
- AWS region
- Availability zones
With the correct approvals and privileges, we’re now able to fully connect VPCs within minutes.
Where we want to go
We’ve made good progress over the last two years, but we want to go further! Our cross-region traffic patterns have changed a lot from the early days of our journey: they have become much more critical and grow every day with the launch of new global microservices and as we ramp up our Istio-based service mesh in our Kubernetes clusters.
Managing EC2 instances for the network data plane turned out to be more daunting than we expected, with higher overhead than we were used to with the physical routers in our data centers. We have to scale the instance type up or increase replicas during network utilization spikes, build and rotate AMIs for patch management, and watch not only common network metrics such as bandwidth and packets-per-second utilization, but also new ones intrinsic to the cloud environment, such as CPU steal time.
Also, we want the process of creating a fully connected VPC to be so user-friendly that our engineers don’t even need to describe it with Terraform. The vision is to have this process integrated into our deployment pipeline, where a single flag indicating the need for a new VPC for a deployment type will trigger Kubernetes Operators owned by Foundation Network to build all the required resources and connectivity.
In Part 2, I’ll talk about how we’re transforming Medusa to use cloud-native components, and building the self-service future where our engineering community can create fully connected VPCs to host new environments and Kubernetes clusters (we’re a big fan of K8s here).
We also want to end-of-life a lot of the functionality within Medusa (EC2 routers, IPsec, etc.) by moving to the new AWS feature called Transit Gateways. We plan to create one TGW in every region, peer them with each other, and attach all of our VPCs to their respective regional TGW. Of course, we would still need to update routing tables within each VPC as we do with Medusa (and between TGW peerings, as that won’t be supported by AWS out of the box), but this time we intend to leverage AWS Lambda to run our routing-table propagation code.
We’ve made good progress in research and development using AWS Transit Gateways, but the ability to peer two or more transit gateways in different regions is still missing. The absence of this critical feature is completely blocking us from going further. We have our fingers crossed that AWS will make it available soon, at the next re:Invent in December 2019.
Speaking of AWS re:Invent, here is my lightning talk about Project Medusa from the 2017 edition in Las Vegas.