Feb 19, 2020 Raidbots Outage — What Happened

Seriallos · Published in Raidbots · Feb 20, 2020 · 6 min read

High traffic + server configuration error + being off-the-grid = site go boom.

The very quick version: Raidbots broke and I was in a remote area of the country without cell service or my laptop so I couldn’t fix the problem quickly.

I’m deeply sorry for the outage — my first priority is always making sure the site is stable and I wasn’t able to provide that yesterday.

If you subscribed on Feb 19th during the outage and want a refund, email billing@raidbots.com with the subject “Outage Refund” and I’ll get back to you as soon as I can.

Some technical details of the issue are at the end of this post. If you want the dramatic version of the story from my perspective with some nice photos, keep reading.

The Long Version

Raidbots went down hard today. It was particularly bad because I was very far away from the internet — the interior of Yellowstone National Park in the middle of winter.

Yellowstone National Park, Hayden Valley

During winter, most of Yellowstone is inaccessible for normal vehicles — you can only use these snow coaches and snowmobiles since the roads can be covered in several feet of snow. I was on a ~10 hour tour on one of these coaches with some other photographers hoping to find some of the more exciting and elusive Yellowstone animals (when I’m not working on Raidbots, my wife and I are often out photographing wildlife).

In the middle of our tour we were in Hayden Valley and had just spotted an ermine on the side of the road (ermine is the fancy name for a short-tailed weasel in its winter coat). There are tons of them in the park, but they are incredibly well camouflaged and pretty rare to spot, so we jumped out of the bus and started taking photos.

These little guys are unbelievably cute and very difficult to find
This one hopped around pretty close to us for about 15 minutes

Cell data is extremely limited in the park — there’s almost no service on most of the roads and only a tiny bit at the visitor centers. While we were photographing this ermine, my phone got a sliver of a bar of connectivity and just started exploding with notifications. Discord pings, monitoring alerts, twitter mentions, emails, and more.

I was briefly able to connect to Discord, see there was a massive problem with the site, and get a few quick messages out, but the tour had to keep moving so I lost signal again without being able to fully identify the problem or apply any fixes.

Once we arrived at the Canyon Visitor Center about an hour later to warm up and eat lunch, I was able to get a little more information about what was going on. Cell data was still extremely limited, so loading web pages was very slow and often just failed outright.

Canyon Visitor Center

I could see that one of the site's safety mechanisms had kicked in to block sims because it looked like the site was going into a failure cascade. I re-enabled sims from my phone and had about 10 minutes to watch while some sim traffic started back up.

We then started the long trek back to civilization. While luck was not with me in terms of server stability, we did have an amazing encounter with a pack of wolves at sunset.

Probably some of the younger pups from the Wapiti pack
A group of the pack walked right by our snow coach
They all howled as a pack for a couple of minutes

Once the tour was finished, I got back to my laptop and was able to dig into the details of what happened, get the site back into a working state, and monitor things for the rest of the night.

The Key Problems

A number of things happened that set the stage for this failure:

  • Raidbots traffic is the highest it has ever been. High traffic excels at exposing weaknesses in complex systems. Peak traffic for 8.3 has resulted in Raidbots running thousands of servers to keep up — it’s about double what was required for 8.2.
  • I migrated some Redis databases to a new backend (Google Memorystore) before 8.3 to be able to handle higher load and get better data on server usage.
  • Memorystore is limited to a single region, but I run servers in multiple regions. This required setting up some additional haproxy servers to allow cross-region communication, and I don’t have a ton of experience with haproxy, especially in high traffic situations. I had seen some configuration weirdness with these servers but hadn’t been able to pinpoint the underlying problem.
  • Nearly all Raidbots servers require access to a single Redis instance that manages the queue state. This architectural decision was made early in the design of the site and is exceedingly difficult to change without significant engineering work that I haven’t been able to get to.
  • I built some automated safety mechanisms a few patches ago. They worked to stop sims when there was an error, but they cannot automatically restore the health of the site, and they result in error messaging that is wrong/confusing (a rough sketch of the idea follows this list).
  • I had encountered some issues with haproxy configuration over the last few weeks of high traffic and thought I had fixed them. Turns out, I had not.
  • I was away from the internet for 10 hours.
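
For a bit of context on those safety mechanisms: conceptually they work like a circuit breaker. When sims start failing in bulk, a flag gets set that blocks new sims until things calm down. The real implementation is messier than this, but the idea looks roughly like the following sketch (ioredis as a stand-in client, with made-up key names, threshold, and window):

```typescript
// Hypothetical sketch of a queue-protection switch. Key names, the threshold,
// and the error-counting scheme are illustrative, not the actual Raidbots code.
import Redis from "ioredis";

const redis = new Redis({ host: "redis.internal", port: 6379 }); // placeholder address

const BLOCK_KEY = "queue:sims-blocked";  // hypothetical "emergency stop" flag
const ERROR_KEY = "queue:recent-errors"; // hypothetical rolling error counter
const ERROR_THRESHOLD = 50;              // arbitrary example threshold

// Called whenever a sim fails; counts errors in a short rolling window.
export async function recordSimError(): Promise<void> {
  const count = await redis.incr(ERROR_KEY);
  if (count === 1) {
    await redis.expire(ERROR_KEY, 60); // window resets after 60 seconds
  }
  if (count >= ERROR_THRESHOLD) {
    // Trip the breaker: block new sims until a human (or a recovery job) clears it.
    await redis.set(BLOCK_KEY, "1");
  }
}

// The web tier checks the flag before accepting new sims.
export async function simsAllowed(): Promise<boolean> {
  return (await redis.get(BLOCK_KEY)) !== "1";
}
```

The problem, as noted above, is that the current versions can stop sims but can't put the rest of the site back together on their own.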

The actual failure cascade looks to be something like this:

  • The maximum connection limit in haproxy reset to its default value (2,000 connections) from the configured value (8,000 connections).
  • High traffic resulted in enough servers to reach that lower connection limit.
  • Critical servers lost their connections to Redis and couldn’t reconnect (see the sketch after this list).
  • At that point, all hell broke loose. In particular, Warchief was unable to manage the queue protection (the task that blocks/unblocks sims in case of emergency), nor could it scale servers down, which might have reduced the overall pressure on the system.
  • None of the systems could automatically recover in those conditions so things were broken until I could intervene.
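
To make the "couldn't reconnect" bullet a bit more concrete: most Redis clients have a bounded retry policy, so once the proxy is stuck at its connection cap, every reconnect attempt fails and the client eventually gives up even though Redis itself is perfectly healthy. Something roughly like this (made-up hostnames and retry numbers, not my actual configuration):

```typescript
// Illustrative only: how a Redis client behind a connection-capped proxy ends up
// permanently disconnected. Hostname, port, and retry numbers are made up.
import Redis from "ioredis";

const redis = new Redis({
  host: "haproxy.internal", // hypothetical cross-region haproxy in front of Memorystore
  port: 6379,
  // ioredis calls retryStrategy after each failed connection attempt. Returning a
  // number retries after that many milliseconds; returning null stops retrying.
  retryStrategy(times) {
    if (times > 20) {
      return null; // give up -- from here the process needs outside intervention
    }
    return Math.min(times * 500, 5000); // simple backoff
  },
  maxRetriesPerRequest: 3, // commands fail fast instead of queueing forever
});

redis.on("error", (err) => {
  // While the proxy is at its connection limit, every attempt fails (refused or
  // timed out), so the retry budget burns down even though Redis is fine.
  console.error("redis connection error:", err.message);
});
```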

What I’ve already done:

  • Implement an incredibly hacky cron job band-aid that may solve an haproxy configuration issue (the general shape is sketched below this list).
  • Be at a computer next week during the high traffic of the weekly reset so I can manage any problems immediately.
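
For the curious, the band-aid boils down to "periodically re-apply the settings haproxy is supposed to have and reload it". The real job is uglier than this, but the general shape is something like the following (placeholder paths, and the render script is hypothetical):

```typescript
// Hypothetical cron-driven band-aid (not the real script): re-render the intended
// haproxy config, sanity-check it, and do a graceful reload. Paths are placeholders.
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

async function main(): Promise<void> {
  // Re-generate the config with the intended settings (e.g. the higher maxconn)...
  await run("/usr/local/bin/render-haproxy-config.sh", []); // hypothetical helper
  // ...make sure haproxy accepts it...
  await run("haproxy", ["-c", "-f", "/etc/haproxy/haproxy.cfg"]);
  // ...then reload gracefully so existing connections aren't dropped.
  await run("systemctl", ["reload", "haproxy"]);
}

main().catch((err) => {
  console.error("haproxy band-aid failed:", err);
  process.exit(1);
});
```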

Medium term plans (next few weeks):

  • Try to find a durable, non-hacky solution to the Redis connection limit issue that likely triggered the entire failure cascade. I should be able to do this in the next week or two.
  • In addition, ensure critical servers have a priority connection that bypasses the limit or skips that proxy entirely (sketched after this list).
  • Put in some more work on the safety mechanisms to provide better messaging and add some more abilities to try to automatically recover in case of failures.
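
The "priority connection" item boils down to this: critical servers like Warchief shouldn't have to compete with thousands of worker connections for the same proxy. One possible shape (hostnames and the CRITICAL_SERVICE flag are invented for illustration):

```typescript
// Sketch of routing critical services around the shared proxy. The hostnames and
// the CRITICAL_SERVICE flag are invented for illustration.
import Redis from "ioredis";

export function queueRedis(): Redis {
  const critical = process.env.CRITICAL_SERVICE === "1";
  return new Redis({
    // Critical servers (queue protection, scaling) would talk to Memorystore
    // directly within its region; everything else keeps going through the
    // cross-region haproxy that enforces the shared connection limit.
    host: critical ? "memorystore.internal" : "haproxy.internal",
    port: 6379,
    connectionName: critical ? "critical" : "worker", // visible in CLIENT LIST when debugging
  });
}
```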

Long term plans:

  • Rearchitect the site to remove the dependency on the single bottleneck Redis instance. This is major work and will likely happen over the summer before Shadowlands.
