The Facebook Outage — A Postmortem

Sonny Dewfall
Published in The Pinch
7 min read · Oct 29, 2021

We thought this week we would take a look at the recent Facebook outage in the form of a fantasy postmortem — what do we know about the incident and what do we think about Facebook’s response?

A caveat before we start: we only have the information that was available online at the time of writing, so please let us know in the comments if we have missed anything (or just if you have anything to add).

Summary

In brief, shortly before 04-Oct-21 16:00 UTC Facebook experienced an outage that brought down the entire suite of Facebook applications, including WhatsApp and Instagram, as well as many of Facebook’s own internal systems. Full service was not restored until the following day (whilst partial service was restored earlier, WhatsApp was not fully operational until 05-Oct-21 03:00 UTC[1]).

Impact

It’s difficult to determine exactly when the service began to degrade, but by 04-Oct-21 17:00 UTC users had lost access to all services. We even saw Facebook employees unable to access their offices because their keycards were tied to the company LDAP[2].

From a severity point of view, we know Facebook’s daily user count is about 1.62 billion[3] (although we should perhaps take this with a pinch of salt), WhatsApp has about 2 billion monthly users[4] and Instagram clocks in at around 500 million[5]. We can assume a lot of these userbases overlap, but that still means an impact in the region of 2 billion users. Besides this, Facebook’s share price dropped by 5% on the day[6]. Facebook have also reported that no user data was compromised as a result of the outage[7].

Root Cause

Most of our root cause analysis will be taken from this excellent blog post from Cloudflare.

The root cause of the issue seems to lie in changes made to BGP (Border Gateway Protocol), the protocol networks use to advertise to the rest of the internet which IP prefixes they can reach. At 04-Oct-21 15:40 UTC Cloudflare reported a spike in BGP routing changes made by Facebook, chief among them withdrawals of the routes to Facebook’s own DNS servers.

Almost immediately after these changes, users were unable to resolve Facebook domain names, suggesting that Facebook’s DNS servers (Facebook is a large enough company to operate its own authoritative DNS servers) were unreachable because the routes to them had effectively been removed from the internet.
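To make “removed from the internet” a little more concrete, the sketch below shows a toy longest-prefix-match lookup of the kind routers perform. The addresses are RFC 5737 documentation placeholders standing in for Facebook’s real DNS prefixes, and the two-entry routing table is entirely invented; this illustrates the mechanism, not Facebook’s or Cloudflare’s systems.

# Toy illustration of why withdrawing a BGP prefix makes the hosts inside it
# unreachable. Addresses are documentation placeholders, not Facebook's real ones.
import ipaddress

routing_table = {
    ipaddress.ip_network("192.0.2.0/24"): "peer-facebook",    # stand-in for a Facebook DNS prefix
    ipaddress.ip_network("198.51.100.0/24"): "peer-other",    # some unrelated prefix
}

def next_hop(addr):
    """Longest-prefix match: pick the most specific route covering the address."""
    ip = ipaddress.ip_address(addr)
    matches = [net for net in routing_table if ip in net]
    if not matches:
        return None  # no route at all: packets to this address are simply dropped
    return routing_table[max(matches, key=lambda net: net.prefixlen)]

dns_server = "192.0.2.53"                                  # stand-in for a Facebook DNS server
print(next_hop(dns_server))                                # "peer-facebook" while the route is announced

del routing_table[ipaddress.ip_network("192.0.2.0/24")]    # the withdrawal
print(next_hop(dns_server))                                # None: the server has vanished from the internet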

Once Facebook had withdrawn the BGP routes for its DNS server prefixes, DNS resolvers across the internet had no way of translating Facebook domain names into valid IP addresses. Some resolvers could probably still answer these queries for a short period from cached records, but those cache entries expired once their TTLs ran out. The result was that any service trying to reach a Facebook domain was served an error response and was unable to connect.
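As a small illustration of what every client-side service would have experienced once those caches emptied, the generic sketch below (standard library only, nothing Facebook-specific) shows that the failure happens at the name-lookup step, before a connection is even attempted:

# Minimal sketch of a client hitting a domain whose DNS has disappeared:
# the lookup itself fails, so no connection attempt ever happens.
import socket

def resolve(hostname):
    try:
        return socket.gethostbyname(hostname)
    except socket.gaierror as exc:
        # With the authoritative DNS servers unreachable and resolver caches
        # expired, this is the branch every client ended up in.
        return f"resolution failed: {exc}"

print(resolve("facebook.com"))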

Why did the change fail?

So, we know why Facebook disappeared from the internet, but what actually went wrong with the change to cause the incident in the first place? Facebook’s blogpost, published the day after the outage, sheds more light on this:

“During one of these routine maintenance jobs, a command was issued with the intention to assess the availability of global backbone capacity, which unintentionally took down all the connections in our backbone network, effectively disconnecting Facebook data centers globally.”

Whilst what we saw from the outside was a change to BGP, the change itself was to Facebook’s internal routing systems. The blogpost goes on to explain that, as a failsafe, Facebook’s DNS servers disable their own BGP advertisements “if they themselves cannot speak to our data centers”.
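Facebook haven’t published how this failsafe is actually implemented, but the behaviour described might look roughly like the sketch below. Every endpoint, function name and interval here is a hypothetical placeholder, not Facebook’s real code.

# Rough sketch of the described failsafe: a DNS edge node withdraws its own BGP
# advertisement when it can no longer reach the data centers behind it.
# The probe endpoint and announce/withdraw functions are hypothetical placeholders.
import socket
import time

BACKBONE_PROBE = ("dc-health.internal.example", 443)   # hypothetical internal health endpoint

def backbone_reachable(timeout=2.0):
    """Crude reachability probe standing in for whatever real health check is used."""
    try:
        with socket.create_connection(BACKBONE_PROBE, timeout=timeout):
            return True
    except OSError:
        return False

def announce_dns_prefix():
    print("announcing this node's DNS prefix via BGP")     # placeholder action

def withdraw_dns_prefix():
    print("withdrawing this node's DNS prefix from BGP")   # placeholder action

def failsafe_loop(poll_seconds=30):
    while True:
        if backbone_reachable():
            announce_dns_prefix()   # healthy: keep answering DNS queries
        else:
            # On 4 October this branch effectively fired on every DNS node at once,
            # taking Facebook's DNS off the internet entirely.
            withdraw_dns_prefix()
        time.sleep(poll_seconds)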

Facebook’s explanation is fairly opaque, but it at least tells us they know the issue was related to a change they made. What’s more interesting is what Facebook’s response tells us about SRE and operations culture in general at Facebook.

The Response

We know that Facebook has used SRE[8] for over a decade, albeit through their own derivative which they call “production engineering”, and we can see some of the principles and practices at work here. So as a trailblazer and global representative of Site Reliability, how do we expect Facebook to change their processes to prevent outages like this from occurring in the future?

Blameless Culture

It’s difficult to know exactly how blameless the reaction from Facebook has been internally. We may never know what the internal incident response was or whether anyone lost their jobs.

What we can say is that the external aspect of the incident response demonstrates blameless language. Santosh Janardhan calls out that the erroneous command that took down all services did so “unintentionally”. The post also never refers to an individual or even a team; instead, we are told the issue was caused by a “command” and compounded by an “audit tool”.

We are pointed to machine, rather than human, error, which suggests that Facebook are trying to identify systemic issues rather than assign culpability. Of course, we can only guess to what extent this reflects the internal response.

Separation of Duty

In his blog post, Santosh Janardhan specifically calls out the security of the servers that were at fault during the outage:

“They’re hard to get into, and … are designed to be difficult to modify even when you have physical access to them.”[9]

This makes a lot of sense; of course you need strong security on such critical systems. An alleged Facebook insider on Reddit pointed to another layer of confusion here, though:

“…the people with physical access is separate from the people with knowledge of how to actually authenticate to the systems and people who know what to actually do…”[10]

If we believe this, then there seems to be a separation of duty issue here: shouldn’t it be the job of a single team to deal with this kind of outage? Perhaps there are gaps in the tools being used to collaborate on outages, although this could itself be a consequence of the outage: with LDAP down, responders may have had to run incident calls on alternative tools like Discord[11].

From the blog post, Facebook accept increased time to recovery as a tradeoff for increased security, a tradeoff that may become increasingly common in systems as complex as Facebook’s.

‘Dogfooding’

…or “drinking your own champagne”, has long been an encouraged practice in modern engineering, the subject of memes and quotes about building your best service or product, solid test coverage and so on. It is, in essence, the idea of using your own product or service to build and support that same product or service, and Facebook have been using Facebook for what appears to be everything, including their own building access.

For example, Facebook’s domains are registered through ‘registrarsafe.com’, a registrar that happens to be owned by Facebook itself. Hosting the registrar internally meant it suffered from the outage as well, making Facebook’s DNS records appear to vanish and further complicating the issue.

This poses an important disaster recovery conundrum — should there be a secondary, independent route into your toolset so you can recover it if needed? Is this worth the increased security risk of having more ingress routes?

War Games

Facebook’s engineering blog also makes reference to “storm” drills, which seem to be simulations of incidents. Whilst these exercises never went as far as simulating a comparable error on the backbone servers, they seem to have been useful in modeling the increase in load experienced after the service was brought back up. The experience garnered from these simulations allowed Facebook to manage the increased load as services were brought back online, preventing further outages.
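We don’t know how Facebook actually gated traffic during recovery, but a staged ramp-up of the kind such drills would rehearse might look something like the toy sketch below; the stages, percentages and request counts are invented for illustration.

# Toy sketch of a staged ramp-up: admit an increasing fraction of traffic as each
# stage is confirmed healthy, rather than letting the full thundering herd in at once.
# All numbers here are invented.
import random

RAMP_STAGES = [0.05, 0.10, 0.25, 0.50, 1.00]   # fraction of requests admitted per stage

def admit(fraction):
    """Randomly admit roughly `fraction` of incoming requests."""
    return random.random() < fraction

def staged_restore(requests_per_stage=1000):
    for fraction in RAMP_STAGES:
        admitted = sum(admit(fraction) for _ in range(requests_per_stage))
        print(f"stage {fraction:.0%}: admitted {admitted}/{requests_per_stage} requests")
        # In reality each stage would run until health checks and load metrics look good.

staged_restore()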

Actions

By way of conclusion, we have sought to imagine some of the actions that Facebook might take to make sure this kind of outage doesn’t happen again.

1. Tooling — the first port of call would be the tooling used to make the change itself. Was automation used? If so, was it fit for purpose? One key action will certainly be to patch the buggy audit tool that failed to catch the issue in the change (a sketch of what such a check might look like follows this list).

2. Access — One of the biggest issues with the outage seems to have been that all employees were immediately locked out of internal tools and even the servers where the problem occurred. It wouldn’t be a surprise to see Facebook change the way access is managed, segregating it from their main backbone servers.

3. Redundancy — Whilst it’s difficult to understand the architecture of the backbone servers Facebook is using, it might not be a surprise to see them deploy some sort of redundancy here, perhaps a “cold” set of servers that could be failed over to in the event of another catastrophic failure such as this one.

4. CAO — Salesforce recently appointed Darryn Dieken as its Chief Availability Officer[12], a C-suite level executive whose entire job is to ensure Salesforce is reliable. Might this be something that Facebook consider in future?

5. Stability — Although the outage will fade quickly from public memory, the response will undoubtedly be significant within Facebook itself. Another outage of this magnitude in the near future would certainly affect Facebook’s reputation and user base. We might see the pace of change at Facebook slow down for a period as releases are double checked.
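On the first point above, we can only guess at what Facebook’s audit tooling actually does, but a blast-radius check of the kind that apparently failed might look something like the sketch below; the data model, rule and link names are all invented for illustration.

# Sketch of a hypothetical pre-change audit gate: estimate the blast radius of a
# maintenance command and refuse anything that would drain every backbone link at once.
# The data model and rule are invented; Facebook has not published how its audit tool works.
from dataclasses import dataclass

@dataclass
class BackboneLink:
    name: str
    in_service: bool = True

def audit_change(links, links_to_drain):
    """Reject a change that would leave no backbone capacity in service."""
    remaining = [l for l in links if l.in_service and l.name not in links_to_drain]
    if not remaining:
        raise RuntimeError("audit failed: change would disconnect the entire backbone")

links = [BackboneLink("dc1-dc2"), BackboneLink("dc2-dc3"), BackboneLink("dc1-dc3")]
audit_change(links, {"dc1-dc2"})                       # fine: capacity remains elsewhere

try:
    audit_change(links, {"dc1-dc2", "dc2-dc3", "dc1-dc3"})
except RuntimeError as err:
    print(err)                                         # the check the buggy tool should have made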

It will be interesting to hear how the rest of the aftermath from the Facebook outage pans out. Do you have any thoughts on how the outage went? Let us know below in the comments if there is anything you feel we are missing in our article.

Articles and comments are my own views and do not represent the views of my employer, Accenture.

[1] WhatsApp Official Twitter

[2] https://www.businessinsider.com/facebook-employees-no-access-conference-rooms-because-of-outage-2021-10?r=US&IR=T

[3] https://thesocialshepherd.com/blog/facebook-statistics-2021#:~:text=1.62%20BILLION%20users%20on%20average,population%20are%20daily%20active%20users!

[4] https://www.businessofapps.com/data/whatsapp-statistics/

[5] https://backlinko.com/instagram-users

[6] https://www.cnbc.com/2021/10/04/facebook-shares-drop-5percent-after-site-outage-and-whistleblower-interview.html

[7] https://engineering.fb.com/2021/10/05/networking-traffic/outage-details/

[8] https://engineering.fb.com/2010/02/08/data-center-engineering/site-reliability-engineering-at-facebook/

[9] https://engineering.fb.com/2021/10/05/networking-traffic/outage-details/

[10] https://interestingengineering.com/insiders-posted-about-the-cause-of-facebooks-fall-heres-the-details

[11] https://twitter.com/tomwarren/status/1445136146095349760

[12] https://www.geekwire.com/2020/meet-chief-availability-officer-salesforce-names-seattle-area-exec-unusual-c-suite-role/


Sonny Dewfall
The Pinch

SRE, DevOps and Quality Engineering specialist at Accenture.