Analysis of Streamcord’s outage on February 24th

Akira
Published in Streamcord
8 min read · Feb 27, 2022
The outage, as visualized by Streamcord’s status page.

Hundreds of thousands of Discord servers rely on Streamcord to provide them with up-to-date Twitch notifications and Live Role. Uptime and consistency are among the things our users depend on most, and unfortunately, we recently failed to deliver them. On the morning of February 24th, our internal systems suffered a failure that disrupted notifications and Live Role for approximately two and a half hours, followed by a further disruption of Legacy notifications that lasted another 21 hours.

We apologize for the downtime and any disruptions caused by it, and we’re making changes to ensure that this kind of outage never happens again.

What caused the outage

Around 6:12 AM Eastern Time, our servers began experiencing occasional network issues. These caused some of the bot’s websocket connections to Discord’s gateway to be dropped, forcing them to reconnect. This continued for several hours leading up to the outage.

Graph of shard connects, resumes, and disconnects before and after the outage.

Because connecting to the gateway can be intensive, Discord limits most bots to 1,000 connections per 24 hours. Streamcord, along with many other popular bots, is granted an extended limit of 2,000 per 24 hours.

So, what happens if a bot tries to connect too many times?

According to Discord’s API documentation,

Upon hitting this limit, all active sessions for the bot will be terminated, the bot’s token will be reset, and the owner will receive an email notification. It’s up to the owner to update their application with the new token.

A bot’s token acts like a password: it tells Discord which bot is trying to connect and proves that it is authorized to do so. Hitting the connection limit means that Discord will reset the token, preventing the bot from communicating with Discord in any way. The owner of the bot must then generate a new token via Discord’s developer portal and update the bot to use it.

This is what happened to Streamcord. The graph below shows how many times Streamcord attempted (successfully and unsuccessfully) to connect to Discord on the morning of the outage. You can see large spikes at around 6:00 AM on both the 23rd and 24th.

Graph showing the number of times that Streamcord tried to connect to Discord before the outage.

At 10:56 AM, Streamcord’s token was reset because it surpassed the 2,000 connections per 24 hours limit.

Now, this probably leaves you with a very important question: doesn’t Streamcord have anything in place to prevent this?

The short answer is no, otherwise the outage wouldn’t have happened.
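For illustration, here is a minimal sketch of what such a safeguard could look like. It is not Streamcord’s code. It assumes the bot periodically reads the `session_start_limit` object that Discord returns from its Get Gateway Bot endpoint (fields `total`, `remaining`, `reset_after`, and `max_concurrency`) and refuses to open new sessions once the remaining budget drops below a margin. The `reserve` value and the sample payload numbers are made up.

```python
from dataclasses import dataclass


@dataclass
class SessionStartLimit:
    """Mirrors the session_start_limit object in Discord's Get Gateway Bot response."""
    total: int            # sessions allowed per reset window
    remaining: int        # sessions still available in this window
    reset_after: int      # milliseconds until the window resets
    max_concurrency: int  # identify requests allowed per 5 seconds


def can_start_sessions(limit: SessionStartLimit, wanted: int, reserve: int = 200) -> bool:
    """Return True only if starting `wanted` sessions leaves `reserve` untouched.

    `reserve` is a hypothetical safety margin, not a value from the article.
    """
    return limit.remaining - wanted >= reserve


# Example payload shape; the numbers are invented for illustration.
payload = {"total": 2000, "remaining": 150, "reset_after": 3_600_000, "max_concurrency": 16}
limit = SessionStartLimit(**payload)

if not can_start_sessions(limit, wanted=1):
    wait_seconds = limit.reset_after / 1000
    print(f"Holding reconnects for {wait_seconds:.0f}s instead of risking a token reset")
```

The point of the margin is that a bot which stops reconnecting stays partially offline, while a bot which exhausts the budget loses its token and goes fully offline until a human intervenes.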

Here’s a basic map of our backend systems, separated by programming language. Parts in blue were affected by the initial network issues explained above (and later by the token reset), and services in red were affected only as a result of the token reset.

Map of Streamcord’s infrastructure.

So, because the token was reset, Live Role, commands, Legacy and Spyglass notifications, and the dashboard stopped working. I wasn’t able to access my computer when this happened, leaving me unable to fix the outage.

Even though the token was reset, the Legacy notifications process continued to run and kept trying to send notification messages through the Discord API. Because the old token was no longer valid, Discord rejected every one of those requests with a response just like this one:

{"message": "401: Unauthorized","code": 0}
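One way a sender could avoid repeating a request that is guaranteed to fail is to latch on the first 401 and stop contacting Discord until an operator supplies a new token. The sketch below is an assumption about how that might be structured, not Legacy’s actual code; `GuardedSender`, `TokenInvalidated`, and `fake_send` are hypothetical names, and `fake_send` stands in for a real HTTP call.

```python
class TokenInvalidated(Exception):
    """Raised when Discord has rejected the bot token."""


class GuardedSender:
    def __init__(self, send):
        # `send` is any callable that performs the request and returns an HTTP status code.
        self._send = send
        self.token_rejected = False

    def post(self, payload):
        # Once the token is known to be bad, fail locally without touching the network.
        if self.token_rejected:
            raise TokenInvalidated("token rejected earlier; not contacting Discord")
        status = self._send(payload)
        if status == 401:
            self.token_rejected = True
            raise TokenInvalidated("Discord returned 401; halting further requests")
        return status


calls = []


def fake_send(payload):
    # Stand-in for a real HTTP request: records the call and simulates a reset token.
    calls.append(payload)
    return 401


sender = GuardedSender(fake_send)
for message in ["stream A is live", "stream B is live", "stream C is live"]:
    try:
        sender.post(message)
    except TokenInvalidated:
        pass  # a real service would alert an operator here

print(len(calls))  # only the first message reached the (simulated) API
```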

Discord has a limit on how many invalid requests can be sent to the API in order to cut down on server load from bad actors. This is known as the infamous “Cloudflare ban” among developers.

According to the documentation,

IP addresses that make too many invalid HTTP requests are automatically and temporarily restricted from accessing the Discord API. Currently, this limit is 10,000 per 10 minutes.

However, this can sometimes affect legitimate services like Streamcord. 10,000 requests in 10 minutes seems like a lot, but with services at large scales, this can be quickly reached in the event of an improper configuration, or in our case, a token reset. Considering that Legacy manages almost one million notifications, it didn’t take much for us to reach that threshold.
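To make the arithmetic concrete, the sketch below counts invalid responses in a sliding 10-minute window and signals when a service should pause well before the 10,000 mark. It assumes the statuses that count are 401, 403, and 429, as Discord’s documentation describes (simplified here). The class name, the 50% safety factor, and the failure rate in the usage note are illustrative assumptions.

```python
from collections import deque

INVALID_STATUSES = {401, 403, 429}  # statuses counted as invalid requests (simplified)
LIMIT = 10_000                      # invalid requests allowed per window
WINDOW_SECONDS = 600                # 10 minutes


class InvalidRequestWindow:
    def __init__(self, limit=LIMIT, window=WINDOW_SECONDS, safety=0.5):
        # Pause at a fraction of the real limit; 0.5 is an arbitrary illustrative margin.
        self.threshold = int(limit * safety)
        self.window = window
        self._hits = deque()  # timestamps of invalid responses, oldest first

    def record(self, status, now):
        """Remember the time of a response if it counts as invalid."""
        if status in INVALID_STATUSES:
            self._hits.append(now)

    def should_pause(self, now):
        """Drop hits older than the window, then compare against the threshold."""
        while self._hits and now - self._hits[0] >= self.window:
            self._hits.popleft()
        return len(self._hits) >= self.threshold


def seconds_until_limit(invalid_per_second, limit=LIMIT):
    """How long a steady stream of invalid requests takes to reach the limit."""
    return limit / invalid_per_second
```

At a hypothetical 50 failed requests per second, `seconds_until_limit(50)` gives 200 seconds, so a service sending at that rate with a dead token would cross the threshold in under four minutes.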

Thankfully, though, Legacy’s age actually saved us from further downtime in this case. It was running an old version of the discord.py library that still referenced discordapp.com, Discord’s old domain, so the Cloudflare ban applied only to requests made to that domain. This “loophole” left Spyglass notifications, which use discord.com, untouched and able to function, even though Spyglass is hosted on the same server and IP address as Legacy.

How we fixed it

Once I was able to reach my computer, the process of fixing Streamcord was fairly simple. All that needed to be done was to reset the bot’s token from the developer portal, update the config files that referenced the token, and start the bot.

The bot also needed to be resharded, which took an additional 20 minutes while the configuration was updated. Resharding is the process of scaling up a bot’s connections to Discord’s websocket gateway to cope with the increased demand of being used in more servers.
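For readers unfamiliar with sharding: Discord’s documentation assigns each server (guild) to a shard with the formula `(guild_id >> 22) % num_shards`. The short sketch below applies that formula to show why changing the shard count redistributes servers across connections. The guild IDs are made up for illustration.

```python
def shard_for_guild(guild_id: int, num_shards: int) -> int:
    """Discord's documented sharding formula."""
    return (guild_id >> 22) % num_shards


# Invented guild IDs; real IDs are much larger snowflakes.
guild_ids = [1 << 22, 5 << 22, 9 << 22, 14 << 22]

for num_shards in (4, 8):
    assignment = {gid: shard_for_guild(gid, num_shards) for gid in guild_ids}
    print(num_shards, sorted(assignment.values()))
```

Because the shard count appears in the formula, raising it moves many servers to different shards, which is why every connection has to be re-established when a bot is resharded.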

However, the Cloudflare ban caused by Legacy was a whole other task. Unfortunately, there was nothing we could do for 24 hours (the duration of the ban) besides turning Legacy off to let it “cool down” and avoid wasting CPU and network resources.

After the ban ended, it was as easy as starting the Docker containers for it again.

What we’re changing because of it

After this outage and the impact that it had on Streamcord, we’ve learned a few lessons that we can use to prevent something like this from happening in the future.

  • We need to properly handle invalid and disconnected sessions on the gateway between clusters
  • We need to increase communication between our services to prevent Cloudflare bans
  • We need to create automated alerts to inform administrators of problems before they result in downtime

We can apply these lessons to our backend infrastructure and in turn implement these systems:

  • Proxying all of our requests to Discord to prevent excessive 4xx errors and mitigate Cloudflare bans
  • Pushing up the timeline of deprecating Legacy notifications
  • Adding automated alerts for the dev team when there is an abnormally high number of shards connecting to Discord
  • Investing more development resources into the rewrite
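As a rough illustration of the alerting idea, the sketch below projects the last hour’s shard connects onto a full day and compares the result with the 2,000-per-24-hours limit mentioned earlier. The function names, the 50% warning fraction, and the simple linear projection are assumptions made for this example, not a description of the alerts Streamcord built.

```python
def projected_daily_connects(connects_last_hour: int) -> int:
    """Naive linear projection: assume the last hour's rate continues all day."""
    return connects_last_hour * 24


def connect_alert(connects_last_hour: int, daily_limit: int = 2000,
                  warn_fraction: float = 0.5) -> str:
    """Classify the current connect rate as 'ok', 'warning', or 'critical'."""
    projected = projected_daily_connects(connects_last_hour)
    if projected >= daily_limit:
        return "critical"
    if projected >= daily_limit * warn_fraction:
        return "warning"
    return "ok"


# Hypothetical hourly counts of shard connects.
for hourly in (10, 50, 100):
    print(hourly, connect_alert(hourly))
```

A projection like this reacts hours before the budget is exhausted, which matters when the person who can fix the problem may not be at a computer.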

We believe these are all much-needed changes to Streamcord, and will make it much easier to continue to scale our services in the future. When implementing a microservice architecture for your backend, as Streamcord does, there are additional challenges that must be dealt with when communicating with third-party services, such as Discord and Twitch. We’ve already built and implemented our own solution to Twitch ratelimits in production and are looking into proxies like twilight-rs/http-proxy for Discord.

Additionally, we’re working hard to replace outdated infrastructure on our backend and use more modern, supported, and efficient technologies.

Let’s take a look again at our infrastructure map:

Map of Streamcord’s infrastructure.

The blue services, all linked together as one program, serve many important functions, from giving the bot its online status in the member list to handling commands and Live Role and communicating with the dashboard. Unfortunately, this program is also one of the oldest parts of Streamcord’s backend and, like Legacy notifications, is written with discord.py.

In case you don’t follow any of the drama within the Discord developer community, here’s the TL;DR of the situation with discord.py. Danny, the lead maintainer of the library, disagreed with the decision made by Discord to force bots to switch over to slash commands, and stepped away from maintaining it.

Now, the terms of this disagreement and Danny’s choice are beyond the scope of this article. However, this left a lot of bots like Streamcord with uncertainty about the future. We wondered if we were going to be forced to rewrite the bot to a different language and didn’t know if the bot would break with the addition of any new features or changes in the API.

Eventually, we did decide to rewrite Streamcord’s gateway connection using our own custom code, based upon discordgo, but at the time of writing, it’s still a heavy work-in-progress. We’re also working on rewrites for Live Role and slash commands, all of which are currently being tested with our PTB program.

This obviously leaves us in a bit of an awkward spot until we can finish the rewrite. discord.py does have a couple of established forks now, namely pycord and nextcord, but until we can release our own custom slash command handler, we’ve determined that it wouldn’t be worth the development effort required to switch.

So, until then, we’re sticking with discord.py. It’s worked well enough for Streamcord so far, but it’s not here to stay: the code is not suitable for a production setting with over 560,000 servers.

On the other hand, one piece of backend infrastructure that we’re working very hard to remove is Legacy notifications. This is the original notification system, in use ever since its initial release in 2018 (although it has gone through a few updates over the years).

We’ve been trying to replace it since 2019, when we released Webhooks notifications (the predecessor to Spyglass).

Now, nearly 4 years later, we finally plan to deprecate and remove Legacy notifications. To clarify, we’re not going to delete any existing notifications that were set up with Legacy; instead, they’ll be transferred to Spyglass automatically. We’ll have a blog post with more information shortly.

At Streamcord, we’re constantly working to improve the experience our users have, and unfortunately, the outage took us a step backwards. Nevertheless, we hope to learn from this setback and use that knowledge to create a better service.

As a side note, if you’re looking for some freelance work related to backend and/or frontend development, we’re looking for some talented programmers to join our team. Please email akira@streamcord.io or DM Akira#1000 on Discord if you’re interested in a paid independent contractor position.

Or, if you want to test some of Streamcord’s beta software (like the rewrite mentioned above), you can join our PTB program at https://discord.gg/kAYysbtrWN.
