Raidbots Tech: The Main Queue

Seriallos · Published in Raidbots · May 1, 2020 · 16 min read

Adventures in live migrating a core system

Raidbots is now running on a new queue library that (I hope) will be a bit more reliable, reduce pressure on a number of backend systems, handle traffic spikes better, and cause me fewer headaches.

I’ve mostly finished the migration to the new library without any significant downtime and it’s looking like a promising upgrade. My approach centered around slowly diverting traffic to the new queue in order to identify any bugs or performance problems as quickly as possible while being able to cut back to the original system if things went sideways (or when I wanted to go to sleep without worrying about weird issues popping up overnight).

This technical post will provide some history of the main job queue’s evolution and go through the process I used to perform the migration.

Overview

The Raidbots queue is the heart of the site. From a backend perspective, Raidbots is essentially a system that puts text into a queue, runs that text through SimulationCraft on big beefy servers, saves the output, and then displays the results. Every user of the site and just about every server on the backend interacts with the queue in some way. Today, it needs to be able to handle over a million jobs a day with peaks of over a thousand sims running at the same time.

Broadly speaking, web servers receive sim requests from users and put them in the queue — it’s a first-in-first-out priority queue where paying Premium members skip ahead of free users. Flightmasters grab sims from the queue based on priority and get to work, providing updates on the status of the sim and eventually saving the data so that users can see the results. Warchief is a standalone server that watches the queue, estimates the number of servers needed, and then scales servers up or down. The goal is to make sure there are enough servers so that no one has to wait too long in the queue as well as making sure that servers aren’t sitting idle for too long.

Web servers, Flightmasters, and Warchief are the core systems that interact with the main job queue

This is a simplification of the overall architecture but is probably sufficient for this post. If you want to know more, there are some older posts that go into more detail.

When I first started developing Raidbots a little over three years ago, I chose the NodeJS library Kue to handle job processing. This was largely due to familiarity — I had used it for a couple projects in a previous job. Kue had a number of out-of-the-box features that I required: things like handling priority, tracking job progress, job logs, and the ability to scale workers horizontally. I didn’t spend a lot of time on the decision as Raidbots was a side project that I was hacking together for fun in my free time.

In those early days, peak traffic was along the lines of maybe 1,000 sims in a day, all running on a single worker server. Growing the site by roughly three orders of magnitude has not been without its challenges and bumps — and most of those challenges have centered on the main job queue.

History

Kue is a library that provides an interface for creating jobs, processing them, and writing/reading metadata about those jobs (current progress, any logs for the job, current size of the queue and the states of various jobs, the position of a specific job in the queue, etc). It uses Redis to store all this data so that multiple servers can coordinate work (multiple web servers submitting, multiple workers pulling jobs when they can).
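
If you haven’t used Kue before, the basic shape is roughly this: a producer creates prioritized jobs, a worker processes them, and Redis sits in the middle. Here’s a simplified sketch with made-up job data (not the actual Raidbots code):

```js
const kue = require('kue');

// Connects to Redis on localhost by default; all job data and state live there.
const queue = kue.createQueue();

// Producer side (web servers): create a job with a priority.
queue
  .create('sim', { simcInput: '...', reportId: 'abc123' })
  .priority('high') // Kue priorities: low, normal, medium, high, critical
  .save();

// Consumer side (workers): pull jobs, report progress and logs, finish.
queue.process('sim', (job, done) => {
  job.log('starting SimulationCraft');
  job.progress(50, 100); // 50 of 100 units of work complete
  done(); // or done(err) to mark the job as failed
});
```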

Early on, Kue was responsible for quite a lot of things:

  • Submitting new sims which included the full input for SimC and metadata about the sim (what items were selected for a Top Gear sim, etc)
  • Dispatching those jobs to worker servers that run SimC
  • Managing locks for detection of stalled jobs (recover from worker crash or shutdown)
  • Notifying Warchief when jobs change (primarily for tracking and estimating scaling needs)
  • Updating and retrieving progress for each sim (the simulation processing screen)
  • Storing and retrieving SimC output logs for each sim (also the processing screen)
  • Handling other backend jobs (mostly reporting)

It turns out that Kue was a poor long-term choice. It has a number of performance problems, it does not guarantee atomic operations on jobs, and maintenance stopped on it a year or so ago.

Kue uses Redis for persistence — all jobs are stored there and queries about job status, progress, position in queue, and others go through a single Redis instance. That queue Redis server is a pretty significant single point of contention and I’ve had to make many adjustments over the years to ensure it doesn’t get overloaded (and it has been overloaded a few times, which resulted in site downtime).

  • Features like Top Gear can result in giant SimC input text files and a lot of user-selected data. Initially this was all stored in Kue — the problem is that many operations store and/or retrieve the full job data even when you only need a single field. This results in a lot of unnecessary CPU time and network traffic on Redis. Most of this data has been shifted to Google Storage and the job data itself is very lightweight.
  • SimC output logs can also be very large and update rapidly (up to a few updates per second). Kue’s log support was very basic and always returned all log lines with no way to paginate/filter. This has all been shifted to an independent Redis server and I don’t use the built-in log functionality at all.
  • Job progress is another area of high-frequency reads and writes — workers need to write this data frequently so the website can show users an up-to-date status of their sim. Much like logs, I shifted this to a separate Redis server and used pubsub since it didn’t need to be permanently stored (the pattern is sketched below). There is still some progress stored directly on the Kue jobs but it’s updated less frequently and not user-visible.
  • I shifted all other jobs to other servers so that Kue is only dealing with the main job queue.
  • Every time a job is updated in Kue (state change or job progress event), it sends a pubsub message to every other online job processor even though that data is never used. As traffic has grown, more processors are needed concurrently, which magnifies the amount of traffic and the time it takes for Redis to publish those messages. The only stopgap for this problem was to vastly decrease the frequency of various progress updates.

That rising yellow time is the amount of time spent running PUBLISH as servers are scaled up

That last bullet point is particularly problematic as it has quadratic properties. Each additional server ends up broadcasting to all other servers: with N processors online, every job update is published to all N of them, so total pubsub work grows roughly with N². This has major implications for where the breaking point is for the site and spurred me to make the queue migration project a much higher priority.
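
The progress change in the third bullet is worth a quick illustration: instead of writing every update into the queue’s Redis, workers publish progress on a channel in a dedicated Redis instance and web servers subscribe while a user is watching the processing screen. Here’s a minimal sketch of that pattern using ioredis (the channel names and payload shape are made up, not the actual implementation):

```js
const Redis = require('ioredis');

// A Redis instance dedicated to ephemeral progress data,
// separate from the one backing the job queue.
const pub = new Redis('redis://progress-redis:6379');
const sub = new Redis('redis://progress-redis:6379');

// Worker side: publish progress as the sim runs.
function reportProgress(jobId, percent) {
  pub.publish(`progress:${jobId}`, JSON.stringify({ percent, at: Date.now() }));
}

// Web server side: subscribe while the processing screen is open.
async function watchProgress(jobId, onUpdate) {
  await sub.subscribe(`progress:${jobId}`);
  sub.on('message', (channel, message) => {
    if (channel === `progress:${jobId}`) onUpdate(JSON.parse(message));
  });
}
```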

In addition to the numerous performance problems (some of which stemmed from my own initial designs), Kue doesn’t perform any operations atomically. Each server dealing with job data simply issues independent Redis commands to read/write the data without any guarantee that it’s the only server operating on that job at the time. This can result in race conditions and corrupted/invalid job states, which might cause a sim to fail, make a sim run twice, or produce other unexpected behavior.

At some point in the last year I forked my own copy of Kue and began making my own updates because the project had gone pretty stale. But the level of change required to address issues like the worker pubsub broadcast problem was a much deeper dive into a stale codebase, which seemed like a bridge too far.

Kue has been a trooper and has taken Raidbots a long way, but it was clearly time (probably long past time) to figure out how I could level up the main job queue.

Which Queue?

I’ve been looking into options off-and-on for quite a while. The major options that kept appearing on my radar were:

  • Google Pub/Sub — Google’s distributed messaging service.
  • RabbitMQ (or similar) — a dedicated messaging service I could run on my own VMs.
  • Bull — a NodeJS library very similar to Kue but written with consistency and performance in mind. Has the biggest feature list.
  • bee-queue — A more minimal NodeJS job queue with fewer features but potentially higher performance and consistency compared to Kue.
  • Write my own.

Google Pub/Sub is Google Cloud Platform’s hosted, large-scale, distributed messaging service. The major advantages of a hosted service like this are that I don’t need to maintain a new server and it should provide effectively unlimited scaling. The big tradeoff is that features like priority and consistency aren’t available and guarantees about message delivery times are much laxer. On top of that, it doesn’t have any built-in concept of jobs (current state, progress, retries, etc) so I would have to use it as just a layer for message delivery and rebuild all the semantics of jobs on top of it.

I actually had a working prototype of another backend queue using Google Pub/Sub a couple years ago but it was insidiously tricky to get it to work correctly and didn’t have the kind of responsiveness I wanted for a user-facing system. All the same tradeoffs exist today so it’s not a suitable candidate for the main job queue without radically changing a lot of other things.

RabbitMQ or other similar solutions would also require building a job layer on top of the messaging layer. I’d expect it to probably have better performance characteristics but be more difficult to scale compared to Google Pub/Sub. While it’s on my list to research some of these further in the future, the amount of change required to use something like this definitely put it lower in my rankings.

Bull, bee-queue, or writing my own all map very closely to the current system. I believe that Bull and bee-queue were largely created as more modern versions of Kue with a focus on performance and consistency, the main areas where I’ve been running into problems. Their APIs are fairly similar and they handle all the basics of a job processing system (job status, progress, detecting stalled jobs, etc).

I actually use bee-queue on Raidbots for some systems that are not user-facing and it has been rock solid for over a year. The big gap with bee-queue is that it doesn’t have support for priority. If bee-queue had built-in support for priority, it would be an easy choice.

Bull is widely used, supports priority queues, and is a very close match to how Kue works so migration would not require significant re-architecture. It does have more features than I need (delayed/scheduled jobs, rate limiting, etc) which may have overhead that doesn’t benefit my use case.
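
For comparison with the Kue sketch earlier, the same submit/process flow in Bull looks something like this (again a simplified sketch rather than real Raidbots code; note that in Bull a lower priority number means higher priority):

```js
const Queue = require('bull');

const simQueue = new Queue('sims', 'redis://localhost:6379');

// Producer: Premium sims get a smaller priority number so they jump ahead.
async function submitSim(simcInput, isPremium) {
  return simQueue.add({ simcInput }, { priority: isPremium ? 1 : 10 });
}

// Consumer: process up to 5 jobs concurrently on this worker.
simQueue.process(5, async (job) => {
  await job.progress(10);
  // ... run SimulationCraft and upload the results ...
  return { ok: true }; // the resolved value is stored as the job's return value
});
```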

I think I could write my own job library that is tailored to my specific needs but that also just seems like taking on a ton of responsibility with a lot of complicated edge cases. I imagine it would just cause me more pain.

For this project, I chose Bull as the first candidate to try. It doesn’t require any significant changes in architecture and has a strong chance at solving the most immediate problem (the worker pubsub broadcast problem). It probably wouldn’t suffice if I needed to scale up by another order of magnitude but I think that’s pretty unlikely.

The Migration

Replacing the main job queue has been a project looming over my head for a long time — it’s something I’ve desperately wanted to do for quite a while but it felt daunting because of how core it is to the site. I needed to figure out how to perform the migration incrementally to verify a new system would actually work in production so that I could diagnose/fix any issues all while keeping the site up and sims running. It’s sort of like changing a tire on a car while it’s barreling down the highway.

I’ve done smaller scale migrations like this before on Raidbots but none with as many dependencies or moving parts.

The first thing to do was just hack together a local prototype in a feature branch that only used Bull to see if there were any significant differences or gaps in functionality. This prototype focused on the core functionality and completely ignored figuring out how to run both queues at the same time across all the servers.

Initial results were promising — it took about half a day to get things running and it seemed like there weren’t any major gaps. The APIs of the two libraries were pretty similar but not exactly the same, and there were some minor things that seemed a little amiss (e.g., I wasn’t seeing events published when a job was enqueued, which I use to help determine scaling needs), but overall the initial showing was strong. So I threw it away.

Just kidding, I simply used `git checkout master` and kept the prototype for reference

The primary value of the prototype was proving that it could work and identifying all the points in the codebase that would be affected. The actual implementation of the new queue was the easiest part and could be recreated without issue.

The API differences were my biggest concern at this point. In order to safely migrate, the entire codebase would need to be able to switch between queues or even use both at the same time. Since I had a pretty fresh understanding of all the queue touch-points in the code, I built an abstraction layer that unified the API and replaced the Kue-specific code with the abstracted code.

Snippet of the Kue implementation
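
In rough terms, the Kue side of that abstraction looks something like this (a simplified sketch with invented names rather than the actual snippet):

```js
const kue = require('kue');

// Hypothetical unified interface: the rest of the codebase only ever calls
// submit() and process(), never Kue or Bull directly.
class KueJobQueue {
  constructor() {
    this.queue = kue.createQueue();
  }

  submit(data, priority) {
    return new Promise((resolve, reject) => {
      const job = this.queue.create('sim', data).priority(priority);
      job.save((err) => (err ? reject(err) : resolve(job.id)));
    });
  }

  process(concurrency, handler) {
    this.queue.process('sim', concurrency, (job, done) => {
      handler(job).then(() => done(), done);
    });
  }
}
```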

This abstraction layer was only using Kue at this point; I hadn’t re-added any Bull code yet. I deployed the updated code with the added layer and monitored it for a day to make sure that the abstraction layer didn’t have its own bugs.

Once that was all sorted out, I finally started building the real, production-ready implementation for Bull. I had the prototype as reference so it was pretty straightforward. I could switch my development environment between each queue by changing which implementation was used and was able to verify that both worked when used one-at-a-time.

Snippet of the Bull implementation
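
And the matching Bull-backed version of the same hypothetical interface (again a sketch, not the actual code):

```js
const Queue = require('bull');

class BullJobQueue {
  constructor() {
    // Defaults to Redis on localhost; the real config is more involved.
    this.queue = new Queue('sim');
  }

  async submit(data, priority) {
    const job = await this.queue.add(data, { priority });
    return job.id;
  }

  process(concurrency, handler) {
    this.queue.process(concurrency, (job) => handler(job));
  }
}
```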

While at this point I could have put the whole site into maintenance, swapped the queues, and brought the site back up, this seemed far too risky — if there were significant problems with Bull, I’d have to take the whole site down again, switch back to Kue, bring it up, fix the bugs, then repeat.

What I wanted was a knob I could turn that would let me direct a dynamic amount of traffic to Bull. This would let me start out small so that any major bugs would have a small impact on users. I could also slowly ramp up traffic and detect any performance problems early before they caused any widespread problems.

Running experiments like this is pretty standard practice in large systems. The tricky bits for this migration were that the number of job consumers (Flightmasters in this case) needed to stay balanced with current traffic levels and that many systems would have to deal with both queues running at the same time. If there weren’t enough Flightmasters, or if one of the queues became too busy, wait times could skyrocket. The flip side is that if there were too many, the backlog would be too small to test the behavior of Bull at normal queue loads.

What I ended up doing was something like this:

  • A single experiment configuration value called BULL_FLIGHT_PERCENT, ranging from 0 to 100 — 0 is fully on Kue, 50 splits traffic evenly between Kue and Bull, and 100 would be the Full Bull.
  • Web servers used this value to decide which queue a newly submitted sim went to by rolling a 100-sided die. APIs like the job status endpoint used on the processing screen would just check both queues and return data from whichever one had the job.
  • Flightmasters (the servers that receive the main jobs and coordinate with workers) would run either in Kue mode or Bull mode, but not both.
  • Warchief (the core “orchestration” server for Raidbots) is constantly figuring out how many Flightmasters and workers need to be running to satisfy incoming traffic. I modified this to also look at BULL_FLIGHT_PERCENT to determine how many total Flightmasters were required and then split that total between Kue mode and Bull mode. E.g., if 100 Flightmasters were needed and BULL_FLIGHT_PERCENT was 10, it would run 90 in Kue mode and 10 in Bull mode (the die roll and this split are sketched below the diagram).
Warchief scales the Bull and Kue groups based on the experiment; web servers also talk to each queue as needed
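
In code, the two pieces of experiment logic amount to something like this (a sketch; the names are mine, not the actual web server or Warchief code):

```js
const BULL_FLIGHT_PERCENT = 10; // 0 = all Kue, 100 = Full Bull

// Web servers: roll a 100-sided die per sim submission.
function pickQueueForNewSim() {
  return Math.random() * 100 < BULL_FLIGHT_PERCENT ? 'bull' : 'kue';
}

// Warchief: split the total Flightmaster count between the two modes.
function splitFlightmasters(totalNeeded) {
  const bull = Math.round(totalNeeded * (BULL_FLIGHT_PERCENT / 100));
  return { bull, kue: totalNeeded - bull };
}

// e.g. with BULL_FLIGHT_PERCENT = 10:
// splitFlightmasters(100) => { bull: 10, kue: 90 }
```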

Having done all of that work, I could finally start really putting Bull to the test in a live environment and see how it compared to Kue.

Over the next few days, I would slowly ramp up the percentage of sims being sent to Bull, watch metrics, make any adjustments needed, and then ramp it back down before going to bed.

Purple = Bull, Blue = Kue

I also spent some time creating a more customized admin view so I could watch jobs and queue state a bit more easily. Both Kue and Bull have their own dashboards but they are a bit janky and each only shows the status of its own queue.

There were a few bumps along the way but overall Bull proved itself to be significantly better than Kue. It only took a few days until I was confident enough to flip Bull to 100% and it’s been running that way for nearly a week. If new issues do pop up, I can still flip back over to Kue in case of an emergency.

The Results

Performance/Scaling

The biggest performance win looks to be that Bull scales linearly with site load. I haven’t seen any evidence of quadratic growth in Redis CPU or traffic. The Bull codebase also bears this out as event listeners are opt-in and lazily allocated.
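
For example, a Bull queue instance only deals with queue-wide event traffic if you explicitly register listeners for “global” events, which is the kind of thing a monitoring process like Warchief would do. A sketch based on my reading of Bull’s event API (not the actual Warchief code):

```js
const Queue = require('bull');

// Monitoring process: opt in to queue-wide ("global:") events.
const simQueue = new Queue('sims', 'redis://localhost:6379');

simQueue.on('global:waiting', (jobId) => {
  // A job entered the queue somewhere; useful for estimating scaling needs.
});

simQueue.on('global:completed', (jobId, result) => {
  // A job finished on some worker in the fleet.
});
```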

Wednesdays are the busiest days on Raidbots as both NA and EU have reset, weekly chests are being opened, and raids are hopping into their reclears. Here’s a comparison of Redis CPU and network traffic from Bull at 100% this week (the purple lines) compared to Kue at 100% last week.

Purple is Bull from this week, Blue is Kue from last week. Lower lines are better.

The bottom charts are CPU / NumWorkers and NetworkTraffic / NumWorkers which gives a good sense of just how linear the Bull performance is compared to Kue. Traffic levels were very similar this week compared to last week so this is a pretty slam dunk result for Bull.

Consistency

It’s a little bit harder to quantify, but the fact that Bull performs its job operations atomically (primarily by using a lot of small Redis Lua scripts) does appear, so far, to have reduced the number of failures that are specific to Raidbots queue issues.
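
The general shape of that approach (not Bull’s actual scripts) is to bundle a read-check-move into a single Lua script so Redis executes it as one atomic unit and two workers can’t both claim the same job. A sketch using ioredis, with made-up key names:

```js
const Redis = require('ioredis');

const redis = new Redis();

const claimJobScript = `
  -- Pop the next job id off the wait list and push it onto the active
  -- list in one step, so no other worker can claim the same id.
  local jobId = redis.call('RPOPLPUSH', KEYS[1], KEYS[2])
  if jobId then
    -- Record which worker claimed it and for how long (illustrative only).
    redis.call('SET', KEYS[3] .. jobId, ARGV[1], 'PX', ARGV[2])
  end
  return jobId
`;

async function claimNextJob(workerId) {
  // EVAL runs the whole script as one atomic unit inside Redis.
  return redis.eval(
    claimJobScript,
    3,
    'queue:wait',   // KEYS[1]
    'queue:active', // KEYS[2]
    'queue:lock:',  // KEYS[3] - lock key prefix
    workerId,       // ARGV[1]
    30000           // ARGV[2] - lock TTL in milliseconds
  );
}
```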

Failure rates are generally very low but I’m seeing even fewer issues in logs and metrics. Kue has a tendency to just lose track of jobs and spew non-critical errors when it gets confused, and I’m not seeing the same level of cruft from Bull.

I’m hopeful that job reliability will be just a little bit higher moving forward.

Issues

I did run into one performance problem with Bull but I was able to develop a workaround.

The job processing screen polls a job status API to get info like the job state (in queue, processing, complete) as well as the current progress of the job. In Kue, this is all available from a single call when you use job.get(jobId). In Bull, the job state requires a separate call, job.getState(). Only after I had Bull in production and saw some concerning metrics did I notice the comment on that API call:

Please take note that the implementation of this method is not very efficient, nor is it atomic. If your queue does have a very large quantity of jobs, you may want to avoid using this method.

The way that Bull stores data is the culprit. The actual job data is stored in Redis hashes but does not include the current state of the job. Job state is instead implied by which of several Redis lists the job ID sits in, and Redis lists have no index on their values, so finding the job state requires blindly iterating through all of the jobs currently in the queue.

Fortunately, I was able to devise a shortcut. The job status API always makes a call to determine the job’s position in the queue — in Bull this is a ZRANK on a sorted set which is pretty quick (O(log(N))). It either returns a number (the position in the queue) or null (job is not in the backlog). If a number is returned, I immediately know the job state is “queued” so I can skip the call to getState(). Otherwise, I do need to make a call to getState() to determine if the job is being processed or has completed. The vast majority of calls to the job status API happen while the job is queued, so I’m able to skip the expensive call nearly all of the time.
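
In code, the shortcut amounts to something like this (a sketch: the sorted set key in getQueuePosition is my guess at Bull’s internal layout rather than a documented API, and the real Raidbots code differs):

```js
// Rough shape of the job status endpoint's state check.
async function getJobStatus(queue, jobId) {
  const job = await queue.getJob(jobId);
  if (!job) return { state: 'not-found' };

  // ZRANK is O(log N): a number means the job is still in the backlog,
  // null means it has already left it.
  const position = await getQueuePosition(queue, jobId);
  if (position !== null) {
    return { state: 'queued', position };
  }

  // Only fall back to the expensive getState() scan for jobs that have
  // left the backlog (active, completed, or failed).
  const state = await job.getState();
  return { state };
}

// Illustrative only: assumes prioritized jobs live in a sorted set named
// "bull:<queueName>:priority", which is an internal detail of Bull.
function getQueuePosition(queue, jobId) {
  return queue.client.zrank(`bull:${queue.name}:priority`, jobId);
}
```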

There were some other minor issues that turned out to be simple bugs, which I was able to fix in my own fork and submit upstream as a pull request.

So that’s it! I’d been desperately hoping to do this migration for a long time, as Kue has been a thorn in Raidbots’ side for years. It looks like I’ve bought a bit more headroom for the site with the switch, and I’m glad to be rid of a whole class of problems that came from Kue.

Hopefully this migration will be sufficient for a while. Redis is still a single server so there’s a strong chance I’ll have to evaluate more changes in the future if load becomes an issue again. The core queue is already pretty stripped down to minimal functionality so future changes would likely require bigger architectural changes that may also impact the user experience. We’ll see what happens!

This migration is also just one step towards making larger sims a reality — there are still several difficult problems ahead that I hope to tackle: generating large numbers of combinations is slow, alternative optimization algorithms need testing, and the data files involved keep getting larger (and there are probably more problems that I haven’t even run into yet).

I always love talking about this stuff so if you’re curious about anything in particular or have any questions, come chat in the Raidbots Discord or hit me up on Twitter!
