Yammer Datacenter Networking with BGP and BIRD

Kyle Gordon
Yammer Engineering
Aug 3, 2016

It’s been 4 years since Microsoft acquired Yammer, and we remember well how the acquisition became our springboard into a bustle of structural adaptations. Being acquired by such a large company presents all kinds of new and exciting challenges on its own. Not to mention, Yammer is a pretty unique product under the Microsoft umbrella; we’re a Linux shop.

Microsoft-owned datacenters are, not surprisingly, built primarily around Windows. As a result, a lot of the great datacenter automation tooling handed to us didn’t translate to our stack. That left us with a great challenge: building out an automation framework that could run in our new datacenters, which also meant leveraging Microsoft datacenters to increase our global footprint. For the past year we have operated out of multiple datacenters using a hybrid, BGP (Border Gateway Protocol)-based networking infrastructure. Here’s how we did it.

Configuring Our Network

Yammer’s datacenters are managed by an internal project called “Zeus.” One of Zeus’s jobs is configuring the internal networking for our datacenters. Resource efficiency is important to our configuration; in part, that means not having to reconfigure routers to bring new hosts online, and with BGP we don’t have to. Each newly provisioned host gets an ASN (Autonomous System Number) and uses BGP to advertise its own IPs to the rest of the DC fabric. As long as we have enough IP space, new hosts become routable.

(A typical Yammer Rack setup)

Every physical host in our datacenter acts as a router, and using BIRD we advertise the routes for that host. For our services we use LXC for containerization. When a container is started, it is given an IP address by Zeus, and we then reconfigure BIRD to advertise the new IP. The ToRs (Top of Rack switches) pick up this IP and, through the magic of BGP, the container becomes routable.
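Concretely, picking up a newly assigned container IP can be as simple as writing a /32 route into a file that the main BIRD config pulls in with an include directive and then asking BIRD for a soft reload. A sketch, with a hypothetical container IP, file path, and bridge name:

# Sketch: Zeus has just handed a container 10.1.2.34 (hypothetical).
# Regenerate the static-route stanza for this host's containers and
# soft-reload BIRD; existing BGP sessions stay up during the reload.
cat > /etc/bird/containers.conf <<'EOF'
protocol static containers {
    route 10.1.2.34/32 via "lxcbr0";   # newly provisioned container
}
EOF
birdc configure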

An example bird configuration. Replace $ variables with your values.
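A minimal sketch of such a per-host setup, assuming BIRD 1.x, might look like the following; the $ values (host ASN, ToR neighbor, prefixes, community) are placeholders:

# bird.conf sketch, BIRD 1.x syntax; $ values are placeholders
router id $HOST_IP;

include "/etc/bird/containers.conf";   # per-container /32s managed by Zeus

# Pick up addresses configured on the host and its container bridge
protocol direct {
    interface "lo", "lxc*";
}

protocol device {
    scan time 10;
}

# Push routes learned over BGP into the kernel routing table
protocol kernel {
    scan time 10;
    import none;
    export all;
}

# Only advertise our own /32s, stamped with a community so the
# upper layers of the fabric can summarize them
filter bgp_export {
    if net ~ [ $CONTAINER_SUBNET{32,32} ] then {
        bgp_community.add(($DC_ASN, $RACK_ID));
        accept;
    }
    reject;
}

# Peer with the ToR; every host gets its own trusted ASN
protocol bgp tor {
    local as $HOST_ASN;
    neighbor $TOR_IP as $TOR_ASN;
    import all;
    export filter bgp_export;
}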

As shown in the example configuration, we assign each host a unique ASN that is trusted by our ToRs, and we configure each host to act as a router for its containers. It’s important to note that the use of communities keeps the load on our routers low. The bgp_export filter ensures a neighbor only learns routes for traffic destined to that host, and since each upstream router only needs to know how to route a wider subnet rather than every individual IP, communities let us advertise an entire IP slice to the higher-level routers.
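Our ToRs and the routers above them are vendor switches configured with their own tooling, but the idea is easy to illustrate in BIRD’s filter language: per-host /32s carrying the community stay inside the fabric, and only a covering aggregate is announced upward. A sketch with a hypothetical community and prefix:

# Illustration only: on an upstream router, suppress the /32s tagged
# (65000, 100) and announce just the covering /16 toward the next tier.
protocol static aggregates {
    route 10.1.0.0/16 reject;   # covering prefix originated here
}

filter export_to_spine {
    if (65000, 100) ~ bgp_community then reject;   # keep per-host routes local
    if net = 10.1.0.0/16 then accept;
    reject;
}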

Additional Applications for BGP

We use BGP for more than just routing to individual hosts. By advertising a single IP from multiple hosts, we can create network-level failover for certain systems, such as our Puppet mirrors and DNS.

To advertise a single IP, we just make a few modifications to our previous BIRD configuration.

Another bird snippet
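A minimal sketch of those changes, assuming the shared service address (call it $SERVICE_VIP) is bound to a dummy interface on every participating host:

# Sketch: each Puppet mirror / DNS host originates the same /32, and the
# fabric spreads traffic across them (see the ECMP caveat below).
protocol static anycast {
    route $SERVICE_VIP/32 via "dummy0";
}

filter bgp_export {
    if net = $SERVICE_VIP/32 then accept;                # shared service IP
    if net ~ [ $CONTAINER_SUBNET{32,32} ] then accept;   # host/container routes as before
    reject;
}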

With this configuration, the single IP becomes a sink that can be served from multiple sources. The same technique is used in our software-defined load balancer.

Yammer Load Balancing

Here we use 3 hosts running LVS as a sink for incoming traffic, which then forwards to multiple HAProxy hosts for Layer 7 load balancing. In Zeus we call the topology above a “Cell,” and it allows us to scale our infrastructure as needed. If we need more network throughput, we can add LVS nodes. If we are having issues at Layer 7 or with SSL termination, we can add more HAProxy hosts. Currently we are running 4 completely separate Cells for network isolation of tasks.
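At the LVS layer, a Cell is ordinary IPVS configuration. A sketch with hypothetical addresses, where 10.2.0.5 is the VIP advertised over BGP and the HAProxy hosts sit on other subnets:

# Sketch: expose the BGP-advertised VIP as an IPVS virtual service and
# tunnel (-i = IPIP) to HAProxy real servers on other subnets. Each
# HAProxy host also binds the VIP on tunl0 and suppresses ARP for it.
ipvsadm -A -t 10.2.0.5:443 -s wlc
ipvsadm -a -t 10.2.0.5:443 -r 10.3.1.10:443 -i -w 100
ipvsadm -a -t 10.2.0.5:443 -r 10.3.2.11:443 -i -w 100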

Hangups

Implementing this software-defined networking stack didn’t come without issues. Using BGP on our internal networking infrastructure required tweaking multiple settings throughout the datacenter fabrics. For example, each router has to be configured with BGP communities in mind. Our initial rollout did not include these communities, and this put pressure on the routers two levels above our ToRs.

We also had to be conscious of the BGP routing method used. If you plan on advertising a single IP from multiple sources, as in the LVS example above, make sure you have Equal Cost Multipath (ECMP) configured correctly. If ECMP is not enabled or is misconfigured, you can see sub-optimal performance, with packets bouncing between hosts since they all share the same BGP weights.

Since our HAProxy hosts can be on different subnets, we have to use IPIP tunnel mode for LVS. This adds 20 bytes of overhead to every packet passing through the load balancer, and that extra overhead is not considered when the segment size of IP traffic is calculated. The result was that any packet greater than our configured MTU minus 20 bytes became malformed. On our HAProxy hosts, we used MSS clamping in IPTables to enforce a maximum size of MTU minus 20 bytes, which combats the issue.
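The clamp itself is a single iptables rule on each HAProxy host. A sketch assuming a 1500-byte MTU: advertising an MSS of 1440 keeps the inner packet at 1480 bytes (MTU minus 20), so it still fits after IPIP encapsulation. The mangle/OUTPUT placement here is an assumption; it clamps the MSS we advertise in our own SYN-ACKs so clients never send segments that won’t survive the tunnel:

# Sketch: clamp the advertised MSS so encapsulated packets fit a 1500 MTU.
# 1500 - 20 (IPIP) - 20 (IP) - 20 (TCP) = 1440
iptables -t mangle -A OUTPUT -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 1440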

Moreover, while BGP gives us great flexibility in launching our systems, it shouldn’t be relied on for detecting failures. Routers exchange keepalive packets at configured intervals to refresh the BGP hold timer, and these values bound how long it takes to react to a routing failure. Keepalives only reset the hold timer, so if your routers have a 90s hold timer and a 30s keepalive timer, a network failure can mean as much as 90s of downtime before your routers reconverge.
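In BIRD those timers are set per BGP session, so the 90s/30s example corresponds to something like this on the ToR-facing session from the earlier sketch:

# Sketch: explicit timers; a silent failure can take up to the 90s hold
# time to be noticed and routed around.
protocol bgp tor {
    local as $HOST_ASN;
    neighbor $TOR_IP as $TOR_ASN;
    hold time 90;
    keepalive time 30;
    import all;
    export filter bgp_export;
}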

BGP has greatly increased our tolerance of network failures and has allowed us to provision new hosts for our services team extremely quickly. However, if you plan to do rolling restarts of multiple hosts sharing an IP, I suggest using BGP path calculations to artificially take a host out of rotation by increasing the length of its path.
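One way to do that in BIRD is an export filter that prepends the host’s own ASN a couple of extra times before you restart it, so its advertisement loses the path-length comparison to its anycast peers. A sketch:

# Sketch: artificially lengthen the AS path for the shared IP so traffic
# drains to the other hosts before this one is restarted.
filter bgp_export_drain {
    if net = $SERVICE_VIP/32 then {
        bgp_path.prepend($HOST_ASN);
        bgp_path.prepend($HOST_ASN);
        accept;
    }
    reject;
}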
