That massive Slack outage this month? What caused it?

Lagaram
Jan 23, 2021 · 3 min read

The first working day of the year was relatively quiet for many Slack users. The usual barrage of Slack messages from multiple channels was missing on the morning of January 4, 2021.

What caused this major outage?

The issue started with scaling problems and slowly cascaded across Slack's infrastructure, which is hosted predominantly on AWS. According to sources who spoke to Protocol, the root cause of the initial issue points to AWS Transit Gateway. In Slack's own words:

"around 6:00 a.m. PST we began to experience packet loss between servers caused by a routing problem between network boundaries on the network of our cloud provider."

AWS Transit Gateway did not scale fast enough to accommodate the spike in demand for Slack's service on the morning of Jan. 4, as users came back from the holiday break.

What is a Transit Gateway?

A gateway, in simple terms, is a router at the periphery of your network that acts as the entry and exit point for traffic flowing in and out of that network. A Transit Gateway enables services in one network (VPC) to communicate with a resource or service in another network. These networks can be in different AWS accounts, or in remote or on-premises networks.

Courtesy: AWS
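To make the idea concrete, here is a minimal sketch, using Python and boto3, of creating a Transit Gateway and attaching a VPC to it. The VPC and subnet IDs are placeholders; this only illustrates the shape of the API, not Slack's actual setup.

```python
import boto3

# Illustrative only: the resource IDs below are placeholders.
ec2 = boto3.client("ec2", region_name="us-east-1")

# Create a Transit Gateway: a regional hub that VPCs (and VPN /
# Direct Connect links) attach to instead of peering with each other.
tgw = ec2.create_transit_gateway(
    Description="hub for shared VPCs",
    Options={
        "DefaultRouteTableAssociation": "enable",
        "DefaultRouteTablePropagation": "enable",
    },
)["TransitGateway"]

# Attach a VPC to the Transit Gateway via one of its subnets.
attachment = ec2.create_transit_gateway_vpc_attachment(
    TransitGatewayId=tgw["TransitGatewayId"],
    VpcId="vpc-0123456789abcdef0",           # placeholder VPC
    SubnetIds=["subnet-0123456789abcdef0"],  # placeholder subnet
)["TransitGatewayVpcAttachment"]

print(tgw["TransitGatewayId"], attachment["TransitGatewayAttachmentId"])
```

Once VPCs are attached, traffic between them is routed through the Transit Gateway's route tables rather than through a mesh of point-to-point connections.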

Why did Slack start using Transit Gateway?

The messaging service, though a household name now, began as the internal communication tool of a game company building Glitch. Slack's use of AWS went through an evolution of sorts: from the early days of a few hand-built EC2 instances to an infrastructure spanning thousands of servers across multiple AWS regions, using several AWS services to build a scalable solution for hundreds of clients today.

Stage 1: All eggs in one basket

As the customer base grew, Slack built out several other services and provisioned more and more servers on AWS, but all in one big AWS account. This resulted in management overhead and in hitting rate limits on various AWS services.

Stage 2: A sense of sanity

To bring in some sanity, Slack started creating child accounts and used VPC peering to connect them. This worked well for a while, but as the company continued to grow, so did the number of child accounts. Managing CIDR ranges and IP space across hundreds of AWS accounts became a serious operational nightmare.
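For illustration, a rough boto3 sketch of pairwise VPC peering is below; the account IDs, VPC IDs, and CIDR blocks are made up. Every peering connection needs routes pointing at the other side's CIDR, which is why keeping address space non-overlapping across hundreds of accounts becomes so painful.

```python
import boto3

# Illustrative only: account IDs, VPC IDs and CIDRs are placeholders.
requester = boto3.client("ec2", region_name="us-east-1")  # parent account
accepter = boto3.client("ec2", region_name="us-east-1")   # child-account credentials

# 1. Request a peering connection from the parent VPC to a child-account VPC.
pcx = requester.create_vpc_peering_connection(
    VpcId="vpc-0aaa111122223333a",      # parent VPC
    PeerVpcId="vpc-0bbb444455556666b",  # child VPC
    PeerOwnerId="222222222222",         # child account ID
)["VpcPeeringConnection"]

# 2. The child account has to accept the request.
accepter.accept_vpc_peering_connection(
    VpcPeeringConnectionId=pcx["VpcPeeringConnectionId"]
)

# 3. Both sides need routes to the other side's (non-overlapping!) CIDR.
requester.create_route(
    RouteTableId="rtb-0aaa111122223333a",
    DestinationCidrBlock="10.20.0.0/16",  # child VPC CIDR
    VpcPeeringConnectionId=pcx["VpcPeeringConnectionId"],
)
```

Because peering is point-to-point, every new account multiplies the number of connections and route-table entries to manage.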

Stage 3: One step closer

With the introduction of AWS shared VPCs, Slack's Cloud Engineering team devised a plan to create VPCs in one AWS account and share them with the other accounts. This took care of the earlier issues with rate limits and with maintaining CIDR ranges and IP space. It was built out in the primary region first, followed by all other AWS regions. The challenge now was to connect the regions with each other and with the primary region in us-east-1.
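VPC sharing works through AWS Resource Access Manager (RAM): the owning account shares subnets, and participant accounts launch resources directly into them. A hedged boto3 sketch of that flow, with placeholder ARNs and account IDs, might look like this:

```python
import boto3

# Illustrative only: the subnet ARN and account IDs are placeholders.
ram = boto3.client("ram", region_name="us-east-1")

# The VPC-owning account shares one of its subnets with other accounts
# via a RAM resource share.
share = ram.create_resource_share(
    name="shared-vpc-subnets",
    resourceArns=[
        "arn:aws:ec2:us-east-1:111111111111:subnet/subnet-0123456789abcdef0"
    ],
    principals=["222222222222", "333333333333"],  # child account IDs
    allowExternalPrincipals=False,  # keep the share within the organization
)["resourceShare"]

print(share["resourceShareArn"])
# Participant accounts can now launch instances and other resources into the
# shared subnet: no pairwise peering and no per-account CIDR planning.
```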

Stage 4: Here comes Transit Gateway

In late 2019, AWS introduced a feature called Transit Gateway inter-region peering. Each region's local VPCs attach to that region's own Transit Gateway, and the Transit Gateways are then peered with those of the other regions to build inter-region connectivity.

Courtesy: Slack Engineering
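A rough boto3 sketch of inter-region Transit Gateway peering follows; the Transit Gateway IDs, account ID, and CIDR are placeholders. Note that inter-region peering attachments do not propagate routes dynamically, so each side still needs a static route pointing at the peered region's address space.

```python
import boto3

# Illustrative only: all IDs and CIDRs below are placeholders.
use1 = boto3.client("ec2", region_name="us-east-1")  # primary region
usw2 = boto3.client("ec2", region_name="us-west-2")  # remote region

# 1. Peer the us-east-1 Transit Gateway with the one in us-west-2.
peering = use1.create_transit_gateway_peering_attachment(
    TransitGatewayId="tgw-0aaa111122223333a",      # local TGW (us-east-1)
    PeerTransitGatewayId="tgw-0bbb444455556666b",  # remote TGW (us-west-2)
    PeerAccountId="111111111111",
    PeerRegion="us-west-2",
)["TransitGatewayPeeringAttachment"]

# 2. Accept the attachment on the us-west-2 side (in practice, wait for it
#    to reach the pendingAcceptance state first).
usw2.accept_transit_gateway_peering_attachment(
    TransitGatewayAttachmentId=peering["TransitGatewayAttachmentId"]
)

# 3. Add a static route for the remote region's CIDR on the local
#    Transit Gateway route table.
use1.create_transit_gateway_route(
    DestinationCidrBlock="10.32.0.0/16",  # us-west-2 address space
    TransitGatewayRouteTableId="tgw-rtb-0123456789abcdef0",
    TransitGatewayAttachmentId=peering["TransitGatewayAttachmentId"],
)
```

This hub-and-spoke design is what the Transit Gateway has to scale: when demand spiked on the morning of Jan. 4, the gateways did not scale up quickly enough, and the packet loss cascaded through the rest of Slack's infrastructure.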

References:

  1. https://slack.engineering/building-the-next-evolution-of-cloud-networks-at-slack/
  2. https://www.linkedin.com/pulse/transit-gateway-best-networking-happens-cloud-santosh-deshpande
