Raidbots Tech: Flightmaster

Seriallos · Raidbots · Aug 12, 2018 · 11 min read

Technical details about how Raidbots runs larger SimulationCraft jobs even faster.

This is a followup post to Raidbots Technical Architecture which covers the core Raidbots architecture up until about early 2018.

Turns out I’m chatty about this stuff so this is a long one. I’ve tried to pepper the post with shiny charts to keep you entertained!

Quick Overview

Through all of 2017, the architecture of Raidbots looked something like this:

  • Frontend is a React application that communicates with the web API servers
  • Web servers handle all the web API calls and create jobs that go into the queue.
  • Workers pick up those jobs, run SimC, and push the artifacts to Google Cloud Storage
  • Warchief is a singleton server that handles a bunch of “offline” responsibilities. Of primary interest for this blog post is that it constantly monitors the queue and handles scale-up and scale-down of workers as needed.
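
To make that flow a little more concrete, here's a rough sketch of what the web-server side of the queueing could look like with Kue. This is a simplified illustration rather than the actual Raidbots code — `enqueueSim` and its options are made up for the example.

```js
const kue = require('kue');

// Connect to the same Redis instance the workers watch.
const queue = kue.createQueue({ redis: process.env.REDIS_URL });

// Hypothetical helper: the web API calls this after validating a request.
function enqueueSim(simcInput, { premium }) {
  return queue
    .create('simc', { input: simcInput }) // workers process the 'simc' job type
    .priority(premium ? 'high' : 'normal') // Premium/Patron sims jump the line
    .attempts(3) // retry if a preemptible worker disappears mid-sim
    .save();
}
```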

The size of jobs from users also varies dramatically. A Quick Sim generally requires about 10k iterations of SimC and may only require ~10 seconds of Worker time. A Top Gear simulation may require 2.5m iterations and can take 10–20 minutes.

2017 Architecture Results

The core architecture through all of 2017 had worked really well for running SimC on fast cloud servers and handling some pretty significant scaling swings. When the last post was written (mid-May 2017), Raidbots was handling about 25k sims per day — when Antorus Mythic/LFR was released in early December, Raidbots ran about 190k sims. At peak traffic, I was seeing about 5 sims submitted per second. Redis and Kue held up like champs, Warchief kept GCP busy spinning up new workers (peaked at ~120 machines which was 2,880 cores working on your sims), and 23 billion virtual boss fights were simulated.

Mythic Antorus release — Dec 4–6

Even with that traffic surge, queue times remained stable — I aim for Premium/Patrons to get near-immediate sims (<10 seconds) and public users to see 2–4 minute waits.

Queue Wait Time — Dec 4–6

There were some pretty significant problems and site outages in the summer of 2017, mostly related to the size of messages in the queue and the resulting traffic/contention in Redis. Eventually, with a combination of scaling up Redis, compressing queue messages (especially large SimC input text), and lots of other optimizations, the site became pretty stable.
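
For the message-compression piece, the fix can be as simple as gzipping the SimC input before it goes into Redis and unzipping it on the worker. A minimal sketch of that idea (the helper names are mine, not Raidbots'):

```js
const zlib = require('zlib');

// Compress SimC input text before it's attached to a queue message.
function compressInput(simcInput) {
  return zlib.gzipSync(Buffer.from(simcInput, 'utf8')).toString('base64');
}

// Workers reverse it before handing the input to SimC.
function decompressInput(compressed) {
  return zlib.gunzipSync(Buffer.from(compressed, 'base64')).toString('utf8');
}
```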

But the biggest limitation of Raidbots since launch has been the size of simulation possible. Until recently, the size was maxed out at ~2.5m iterations for a single sim. The core reasons for this limitation are technical: SimC sims are atomic (the entire sim must finish to get results) and worker machines are ephemeral (they are preemptible/spot instances so they can disappear at any time and warchief may scale them down based on load).

The result was that the larger a sim, the higher the chance it would be killed before finishing, causing it to lose all work done, get restarted, and clog up the queue. So I set the maximum size and kept working on changes through most of early 2018.

Flightmaster — “MapReduceish” for SimC

The result of that work and research is Flightmaster — a service that sits between the web servers and the workers which has a couple core responsibilities:

  • Split profileset sims into smaller chunks, run the chunks, and merge the results
  • Manage multistage sims for Smart Sim

Flightmaster is the core technology that let me quadruple the iteration limit for Top Gear sims and build Smart Sim for BfA. It’s a relatively small (900 LoC) NodeJS service that manages all the work orchestration for any sim that’s run on Raidbots. Most of the code is dedicated to managing those larger sims that get split into smaller pieces.

I chose to write my own system for this for a few reasons:

  1. I already had a lot of code built up around Kue’s job system (the browser client gets near real-time feeds of the job status and logs, scaling is closely linked to the job system, etc.).
  2. Having an intermediate layer that I fully control lets me explore additional ideas in the future (mostly around multistage sims)
  3. It seemed like an interesting project

The Core Concept

The idea of splitting a big job into smaller parts, running each part independently, and then merging the results isn’t new by any stretch of the imagination. The term “MapReduce” came from Google in 2004 and I’ve been thinking and talking about it for over a year:

There’s a good reason I try not to make promises about timelines

A “MapReduce-like” approach would let me reliably run very large sims on “unreliable” hardware — as each chunk was completed I could save the results and not lose as much progress if a server went down. In addition, an intermediate layer in the software could take on additional tasks, like handling multistage sims or other more complex sims where stages depend upon results from the previous stage. The worker service/machines are designed to be very “dumb” to make sure they are simple, easily replaceable, and easy to maintain.

From pretty early on, I had a few clear design goals for Flightmaster:

  • Handle any SimC input — this keeps it flexible, lets it handle advanced input, and reduces dependencies on any custom input code I had.
  • Incremental/transparent rollout — I wasn’t sure if the idea would work, so I wanted to throw real traffic at it while developing to be able to test ideas quickly. I also wanted to avoid any significant downtime that would affect users.
  • Resilient/fault tolerant — The cloud is often a fickle place so I always try to build my services to be able to handle intermittent failure. The other upside to this is that if you can handle failure consistently, you can often run on cheaper preemptible/spot resources. It’s also a great way to ensure a service can easily be scaled up and down at a moment’s notice.

Splitting and Merging

One of the SimC features that made a lot of this easier is profilesets. Profilesets were introduced in SimC mid-2017, around patch 7.2.5, as several SimC users/devs were running larger and larger sims (Raidbots, HeroDamage, and Bloodmallet to name some of the early folks). They were created primarily for performance purposes (lower RAM requirements and often just faster) but they have the additional characteristic of being simpler to parse and model. Each profileset is a mutation of the base actor (unlike copy actors which can inherit from any other copy actor).

That’s right, screenshots of text primarily used as a visual delimiter between huge blocks of text

The nice characteristic here is that you can take any SimC input with profilesets and break it down into smaller parts without needing more complex actor dependency code.

  1. Parse base actor
  2. Parse all profilesets
  3. Create several chunks, each containing the base actor and N profilesets
  4. Run those chunks
  5. When complete, merge all the profileset JSON results into one file

Flightmaster splits the profileset input it receives and then queues and monitors these chunks on the workers in series. When a chunk is complete, it moves on to the next chunk until all work is done. Once every chunk has finished, it grabs the JSON from each chunk, uses the first as the “base” JSON, and then merges all the rest of the profilesets into that final file.
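
In code, the split-and-merge idea looks roughly like the sketch below. It's a simplified illustration: the regex assumes the common `profileset."Name"+=...` input form, the chunk size is arbitrary, and the `sim.profilesets.results` path is based on SimC's JSON report — the real Flightmaster code handles a lot more edge cases.

```js
// Split SimC input into chunks: every chunk gets the full base actor plus a
// slice of the profilesets. Lines for the same profileset stay together.
function splitProfilesets(simcInput, chunkSize = 25) {
  const baseLines = [];
  const setsByName = new Map();

  for (const line of simcInput.split('\n')) {
    const match = line.match(/^profileset\."([^"]+)"/);
    if (match) {
      if (!setsByName.has(match[1])) setsByName.set(match[1], []);
      setsByName.get(match[1]).push(line);
    } else {
      baseLines.push(line);
    }
  }

  const names = [...setsByName.keys()];
  const chunks = [];
  for (let i = 0; i < names.length; i += chunkSize) {
    const chunkLines = [];
    for (const name of names.slice(i, i + chunkSize)) {
      chunkLines.push(...setsByName.get(name));
    }
    chunks.push([...baseLines, ...chunkLines].join('\n'));
  }
  return chunks;
}

// Merge the per-chunk JSON reports: use the first as the base and append
// every other chunk's profileset results onto it.
function mergeChunkResults(chunkReports) {
  const [base, ...rest] = chunkReports;
  for (const report of rest) {
    base.sim.profilesets.results.push(...report.sim.profilesets.results);
  }
  return base;
}
```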

As a bonus to this approach, each chunk is a real sim on Raidbots that has its own progress and file outputs which made the whole thing very transparent as I developed it. I could watch the overall progress as well as each chunk progress to find bugs and inconsistencies.

A Flightmaster in flight

Incremental Development and Rollout

I ended up with a design that accepted all the same input as an individual worker (a Kue job with simc input) and resulted in most of the same outputs (data.json saved to Google Cloud Storage). I also replicated most of the job data structure so that the website needed very few changes for things like the job processing stage — the only significant difference was that the website needed to know whether the job was an original simc type or a newer simc-flight type — all the metadata on the jobs was the same (current progress, simc logs, etc.). The job processing stage could even look for the same artifacts to determine completion (is the job in a complete state? Can I retrieve data.json?).

This meant that with no changes to the worker and only small changes to the web server, I could direct a configurable amount of sim traffic at Flightmaster while all the rest kept going to the tried-and-true workers. I had to make a few additional changes to Warchief scaling so that both simc and simc-flight jobs were taken into account and there were enough workers to handle the load.
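
The rollout knob itself can be very small — something along the lines of the sketch below, where the percentage is just a config value to turn up or down. The flag name and routing logic are illustrative, not the actual implementation.

```js
// Portion of sims to route through Flightmaster (tunable without a redeploy).
const FLIGHTMASTER_TRAFFIC_PERCENT = 10;

function chooseJobType() {
  return Math.random() * 100 < FLIGHTMASTER_TRAFFIC_PERCENT
    ? 'simc-flight' // new path: Flightmaster splits/merges chunks
    : 'simc';       // old path: a single worker runs the whole sim
}

// The job payload and metadata stay the same either way, so the frontend and
// job-processing code barely care which path a sim took.
const job = queue.create(chooseJobType(), { input: compressedInput }).save();
```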

While this approach probably took a bit more time, it resulted in no significant downtime and I was gaining valuable information the whole time. It let me run experiments during the day while I was awake and then go back to the stable method overnight. As a one-person shop, the ability to not worry about the site going down overnight is pretty huge. I could also tune various parameters in nearly real time. I spent a ton of time just letting it run, watching the metrics, and implementing small changes to see how it would affect the whole site.

Make small changes quickly

Fault Tolerance

Handling failures and retries gets more complicated with Flightmaster — instead of one sim being one job, there are now quite a few jobs in flight. If a Flightmaster is terminated or scaled down, another instance should be able to pick up the same job without losing too much progress (otherwise the site is in the same position as before, just with a ton more moving parts).

The nice thing about the single-sim-per-worker approach is that no additional state tracking is required outside of the queue system itself. With chunked sims, there’s more tracking that needs to occur. I’ve opted to store progress in the main persistent database (Google Datastore), which requires Flightmaster to be a bit chattier with the database. When Flightmaster receives a job from the queue, it checks to see if any progress has already been made on it and then skips ahead to the point where the last instance failed.
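
A stripped-down version of the resume logic might look like this. The Datastore entity kind (`FlightProgress`), its fields, and `runChunkOnWorker` are all hypothetical placeholders for the real bookkeeping:

```js
const { Datastore } = require('@google-cloud/datastore');
const datastore = new Datastore();

// Load whatever progress a previous Flightmaster instance recorded.
async function loadProgress(jobId) {
  const [entity] = await datastore.get(datastore.key(['FlightProgress', jobId]));
  return entity || { completedChunks: [], results: [] };
}

// Record a finished chunk so another instance can skip it after a handoff.
async function saveChunkResult(jobId, progress, chunkIndex, resultLocation) {
  progress.completedChunks.push(chunkIndex);
  progress.results.push(resultLocation);
  await datastore.save({ key: datastore.key(['FlightProgress', jobId]), data: progress });
}

async function runFlight(jobId, chunks) {
  const progress = await loadProgress(jobId);
  for (let i = 0; i < chunks.length; i++) {
    if (progress.completedChunks.includes(i)) continue; // already done before a restart
    const resultLocation = await runChunkOnWorker(chunks[i]); // queue the chunk and wait (not shown)
    await saveChunkResult(jobId, progress, i, resultLocation);
  }
  return progress.results;
}
```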

As mentioned above, this approach means that Warchief can scale Flightmasters up and down without much concern about the status of sims in-flight. There may be a little bit of duplicated work if a job is handed off to a new Flightmaster but it generally won’t be a huge amount.

Before this, Warchief was pretty conservative with scale down events both because A) I didn’t want to restart a large sim as it would be a poor user experience and B) because there were nasty scaling bugaboos that could cause issues for large sims.

As an example, one of the intermittent issues with the previous system looked something like this:

  1. Big sim is submitted by a user at a low activity time (not that many workers running)
  2. Warchief scales up new Workers to meet the new demand — due to the approximate nature of the scaling algorithm, it would scale up more than required.
  3. After a couple of minutes, Warchief would determine that a bunch of servers are idle and trigger a scale down event. The scale down would kill the large sim that started the scale up.
  4. This could happen multiple times in a row until the big job ran out of attempts and would fail.

So the result was that the user who wanted to run a big sim would see a much longer than expected or outright failed sim. It was gross.

Sample from Oct 2017 — Yellow on Workers chart indicates idle servers. Yellow on Queue Status is queued sims, Red is a sim in progress.

These scenarios made me pretty conservative in order to avoid the worst cases. Conservative scale down also resulted in a wave-like behavior for the queue size, which you can see in the Queue Status chart above — Warchief would wait to see idle servers before scaling down to avoid terminating active sims.

With chunking and proper resuming in Flightmaster, this scenario almost never occurs now, which gives me the freedom to scale more rapidly. I also built a much more detailed queue time estimation algorithm (it approximates wait time by simulating the current queue against the current worker pool — that’s right, simulating sims) to provide a much more consistent experience.
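
The estimation is conceptually simple: replay the current queue against the current worker pool and see when each job would start. A toy version, with the duration model left as an assumption:

```js
// Estimate wait times by "simulating the sims": assign each queued job to the
// simulated worker that frees up first. estimateDuration(job) is assumed to
// guess runtime from something like iteration count.
function estimateWaits(queuedJobs, workerCount, estimateDuration) {
  const freeAt = new Array(workerCount).fill(0); // seconds until each worker is free
  return queuedJobs.map(job => {
    const idx = freeAt.indexOf(Math.min(...freeAt)); // soonest-available worker
    const start = freeAt[idx];
    freeAt[idx] = start + estimateDuration(job);
    return start; // estimated wait for this job
  });
}
```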

Same chart from mid-July 2018

This provides more efficiency in worker usage and sets a more consistent expectation for public users. Although it is way more boring — before Flightmaster, public users would get wildly different queue times based on the luck of the draw. If they submitted right after a big scale up, they’d get near instantaneous queue times since the backlog would be drained.

Operational Details

Flightmasters are currently running on g1-small GCP machines (1 vCPU, 1.7GB memory) and each Flightmaster manages 2 running jobs. Flightmaster and Worker scaling is pretty closely linked which is why a single Flightmaster only handles 2 jobs — that number represents the minimal possible scaling delta I can use for workers.

Making sure that Flightmasters and Workers are scaled in tandem has been fairly tough. Warchief runs a monitoring loop which determines scaling needs every 5 seconds. I have a series of metrics that are kept updated to help figure out when Raidbots should scale and by how much. It’s still fairly rough — sometimes Flightmasters will start running before Workers are ready, so the site has more jobs in motion than the number of Workers can reasonably handle. It works pretty well right now but I think there are a few nasty failure scenarios if for some reason I can’t provision Workers but have a bunch of Flightmasters in play.

In addition, Flightmasters have a bit more overhead since they’re doing a bunch of bookkeeping work (maintaining state in the datastore, monitoring the worker jobs, merging JSON, etc.). Pretty early on it was clear that Workers were sitting idle if Flightmaster jobs and Workers were maintained at a 1:1 ratio. After a bunch of testing, I ended up at the current configuration where jobs are effectively overcommitted by up to 20% — this keeps workers at a very high efficiency without significantly impacting how long it takes to finish a user’s sim.
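
Tying the two pools together ends up being a little bit of arithmetic in the Warchief loop — roughly like the sketch below, where the constants, metric names, and helpers (`fetchQueueMetrics`, `resizeInstanceGroup`) are illustrative rather than the real thing:

```js
const JOBS_PER_FLIGHTMASTER = 2; // smallest scaling step for the worker pool
const OVERCOMMIT = 1.2;          // keep ~20% more work in flight than workers

function targetCounts(queuedFlightJobs, activeFlightJobs) {
  const totalJobs = queuedFlightJobs + activeFlightJobs;
  const flightmasters = Math.ceil(totalJobs / JOBS_PER_FLIGHTMASTER);
  // Size the worker pool slightly below the overcommitted job count so
  // workers stay busy while Flightmasters do their bookkeeping.
  const workers = Math.ceil((flightmasters * JOBS_PER_FLIGHTMASTER) / OVERCOMMIT);
  return { flightmasters, workers };
}

// Warchief-style monitoring loop: re-evaluate scaling every 5 seconds.
setInterval(async () => {
  const metrics = await fetchQueueMetrics();                         // assumed helper
  const targets = targetCounts(metrics.queued, metrics.active);
  await resizeInstanceGroup('flightmaster', targets.flightmasters);  // assumed helper
  await resizeInstanceGroup('worker', targets.workers);
}, 5000);
```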

Chunk Delay is the amount of time between Flightmaster putting a job into the queue and a worker starting it

With BfA fast approaching, we’ll see if this newer architecture is affordable and scalable given the much larger sims that are possible.

Looking forward, I want to spend some time experimenting with Docker/Kubernetes a bit more. Flightmaster is a fairly low resource service that would benefit a lot from being able to run an easily configurable number of instances. Running a g1-small for every 2 jobs is fairly expensive for the work that it’s doing. Using Docker images across the board would also likely greatly speed up my service build and deploy times — right now it takes about 15–20 minutes from committing code to having it running live in production, which I often find too slow since I tend to make very small changes that I want to observe in production with little delay.

Would You Like to Know More?

I love talking about tech! Come chatter about stuff in the Raidbots Discord, hit me up on Twitter, or leave a comment here!
