June 27 Site Issues - Post Mortem
Visitors to Raidbots today may have run into some pretty major issues with the site. I want to provide some details on what happened and what I’m doing to make sure it doesn’t happen again. I love reading honest and transparent post mortems from other folks running large systems so I figure I’ll throw my own out there!
I want to apologize for the outage — especially on a busy raid night for a lot of folks. This type of failure was one I hadn’t seen before and I intend to ensure this specific failure won’t happen again. Hopefully I can build in some additional safeguards and redundancy to make this kind of outage less likely in the future.
Weds, 2PM PST: The worker cycling issue cropped up again this morning despite increasing the size of Redis. I also made a change to the report URLs to potentially avoid a possible duplicate ID scenario I had observed. I think there still may a deeper issue in the queue system I’m using so I’m doing some more research and experimentation related to that. I’m actively monitoring in case the queue goes wonky again.
Saturday: Things have been stable the last few days. I’m not 100% convinced the problem has been fully solved but I do have a lot more redundancy and instrumentation in place as well as various simplifications/hardenings that may help overall stability. It’s likely the issue earlier this week was due to the peak traffic the site was experiencing so maybe I’ll be writing another one of these after 7.3.0 lands!
Series of Events
At about 5:42pm PST, the SimC worker machines stopped processing incoming work from the website:
It took me a bit to find out there was an issue because my monitoring doesn’t pick up this scenario and I was out of the house. I checked my phone notifications at about 6:05pm and found out something was terribly wrong.
After a 30 minutes of fumbling with a very slow tethered connection to a laptop (and also draining the battery of my car in the process…), I was able to verify the weird issue of workers running but not actually taking any work from the queue. lithium_ and Xanzara were trying to attack some other problems as I had given them access to a couple of admin tools but it wasn’t making much of a difference (huge thanks to those two for chipping in!). At 6:40pm, I forced all running workers to immediately terminate with the hope that Warchief would immediately spin up new workers in their place. At this point the queue was up to nearly 3,000 jobs so I also spun up additional workers in a different availability zone (under the alternate possibility of a networking issue between some of GCE’s AZs).
New workers started taking work and the queue started draining. Things looked decent and I was able to get a jump for my car so I started heading home to do some deeper diagnosing.
Unfortunately, problems hit again shortly after 7PM. All stats stopped reporting and the website stopped responding completely.
Once I got home, I determined that Redis was under some kind of duress so restarted the process (Redis is the database used to manage the queue and login sessions).
Once Redis restarted without issue the website came back online. The workers still weren’t handling work properly so I cycled all of them once again at 7:50pm. After all of that, work started flowing through the system again and eventually got back to normal levels.
I’ve now had some time to some deeper analysis of the outage (digging in logs, looking at stats, etc) and have a few more details about what happened.
The second outage was caused by Redis getting killed by the operating system because it used too much memory — the huge burst of activity once workers came back caused a rapid rise higher than levels the site had previously experienced.
This surprised me because I believed I had configured the system properly to perform volatile Least Recently Used eviction — I haven’t verified yet but I believe my configuration was just wrong (and/or debian’s OOM killer threshold is just lower than I thought).
Redis did not get restarted automatically and so it took a manual action to get things running again.
The first outage is still a bit of a mystery. The logs don’t contain a smoking gun as to why the workers just stopped processing work. In fact they don’t really contain anything useful — the normal logs of processing work just slow down to a trickle without any obvious errors or exceptions from the broken machines.
Update: It seems like Redis may be getting overloaded with the amount of workers running and/or the size of sims that are happening. I’ll be playing with resizing the core Redis instance to at least buy myself some time.
Work to Do
This incident revealed a bunch of deficiencies and weaknesses in the overall Raidbots system, some of which I knew about and some that were newly discovered:
- Redis configuration tweaked (done)
- Redis memory capacity increased
- Worker logging and error tracking needs to be improved to make sure that if this problem happens again, I can get more detailed information on why and fix the root issue in the code
- Build tools for trusted folks other than me to kick off possible site recovery methods (cycle workers, rebuild SimC). I can’t be awake 24/7!
- Create monitors for the scenario of workers not actually completing work in this manner (I have some other alerts but the there was a tiny amount of work being completed so they didn’t fire)