Facebook’s Recent Outage Explained

Jag Tangirala
3 min read · Oct 6, 2021


This post tries to explain Facebook’s recent outage in a little more detail by providing some background. It builds on Facebook Engineering’s note about the outage:

https://engineering.fb.com/2021/10/05/networking-traffic/outage-details/

First, a disclaimer: I do not work at Facebook, and I know nothing about Facebook’s network or the outage beyond the information available in the public domain and my own understanding of it.

As shown in the following picture, Facebook’s vast network comprises several Data Centers (DCs), an Internal Backbone, a global DNS load-balancing system and dozens of Points of Presence (PoPs) connected to the Internet. Each of these is briefly explained below:

Data Centers (DCs): These are the facilities that house the large fleets of servers hosting Facebook’s services, such as Newsfeed, Messenger and Livestream, along with the associated routers, switches and storage.

PoP: A PoP is a small remote facility used to connect to local ISPs. Facebook uses dozens of PoPs around the globe to serve its billions of users. PoPs let Facebook extend its network to locations close to its users and so serve them better. These PoPs also host servers, switches, routers and storage, but on a smaller scale than the DCs.

Internal Backbone: These are private WAN links that connect the DCs with each other and with the PoPs. (Facebook has two backbones: one connects all the DCs and carries the so-called machine-to-machine traffic, and the other connects the DCs to the Internet via the PoPs, as shown above.)

Global DNS Load Balancer: Users connect to Facebook’s services from their devices via their ISPs and the Internet. A global load-balancing DNS (Domain Name System) then assigns each user to the best PoP. The DNS is a directory infrastructure that translates domain names like facebook.com into the IP address of a Facebook server. The best PoP for a user is determined by conditions such as the PoP’s capacity, latency, route and health.
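To make the idea of picking the "best" PoP concrete, here is a minimal sketch in Python. It is purely illustrative and is not Facebook’s actual algorithm: the `PoP` class, the `pick_best_pop` function, the PoP names and the 90% load threshold are all assumptions made up for this example. It simply filters out unhealthy or overloaded PoPs and then picks the one with the lowest latency.

```python
from dataclasses import dataclass


@dataclass
class PoP:
    """Hypothetical snapshot of one PoP's state (not Facebook's data model)."""
    name: str
    healthy: bool
    latency_ms: float  # measured latency from the user's network to this PoP
    load: float        # fraction of capacity in use, 0.0 to 1.0


def pick_best_pop(pops, max_load=0.9):
    """Return the healthy, non-saturated PoP with the lowest latency, or None."""
    # max_load is an assumed threshold chosen only for illustration.
    candidates = [p for p in pops if p.healthy and p.load < max_load]
    if not candidates:
        return None
    return min(candidates, key=lambda p: p.latency_ms)


pops = [
    PoP("pop-a", healthy=True, latency_ms=12.0, load=0.95),   # fast but overloaded
    PoP("pop-b", healthy=True, latency_ms=25.0, load=0.40),   # slower, has capacity
    PoP("pop-c", healthy=False, latency_ms=8.0, load=0.10),   # fastest but unhealthy
]

best = pick_best_pop(pops)
print(best.name if best else "no PoP available")
```

In this toy data set the closest PoP is unhealthy and the next closest is overloaded, so the user is sent to the third one. A real system would weigh many more signals, but the shape of the decision is the same.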

Thousands of DNS servers run as part of Facebook’s infrastructure, and an ISP’s DNS resolvers direct lookups for Facebook’s domain names to these servers. A Facebook software system that continuously analyzes the health of the PoPs feeds these DNS servers real-time data about the best PoPs. BGP (Border Gateway Protocol) is the routing protocol through which these DNS servers are advertised to the rest of the Internet.

So, with the above background, let’s see what happened:

  1. During a routine maintenance job, Facebook issued a command to check the backbone’s capacity. This command took down the connections on their backbone network, disconnecting all their DCs from the PoPs, the Internet and the global DNS load-balancing system.
  2. Facebook’s DNS servers are designed to stop advertising themselves to the rest of the Internet if they are unable to talk to the DCs. Since that was the case here, the BGP advertisements for Facebook’s DNS servers were withdrawn from the rest of the Internet. Facebook’s domain names could then no longer be resolved, which led to the outage.
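The two steps above can be sketched as a small simulation. This is only a toy model of the behaviour described in Facebook’s note, not their implementation: the `DnsServer` class, its `health_check` method and the `resolvable` helper are hypothetical names invented for this example, and a real server would announce or withdraw routes through a BGP daemon rather than flip a boolean.

```python
class DnsServer:
    """Toy model of a DNS server that advertises itself only while healthy."""

    def __init__(self, name):
        self.name = name
        self.advertised = True  # stands in for "BGP route is announced"

    def health_check(self, can_reach_datacenters):
        # Assumed rule: if the server cannot reach the DCs, it withdraws
        # its advertisement so that traffic is not sent to a broken location.
        self.advertised = bool(can_reach_datacenters)
        return self.advertised


def resolvable(servers):
    """A domain can be resolved only if at least one DNS server is reachable."""
    return any(s.advertised for s in servers)


servers = [DnsServer(f"dns-{i}") for i in range(3)]

# Normal operation: the backbone is up, every server keeps advertising itself.
for s in servers:
    s.health_check(can_reach_datacenters=True)
before = resolvable(servers)

# The maintenance command takes the backbone down for every server at once.
for s in servers:
    s.health_check(can_reach_datacenters=False)
after = resolvable(servers)

print("resolvable before:", before)
print("resolvable after: ", after)
```

The withdrawal rule is sensible when a single location fails, because the other servers keep answering. The outage happened because the backbone failure made every server fail the same check at the same time, so nothing was left advertising.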
