Raidbots Technical Architecture

Seriallos · Raidbots · May 13, 2017

Several folks have expressed interest in knowing more about how Raidbots is built, so I figured I’d provide some of the technical details. This is primarily a “what Raidbots uses” with a little bit of “why”; if there’s further interest, I can dive deeper into the details of “how” various pieces work.

Update Aug 2018: I wrote about a new piece of Raidbots: Flightmaster

Overview

If you got here without knowing anything about the site, Raidbots is a web application with two goals:

  • Provide an easy-to-use web UI to generate scripts for SimulationCraft, an open source tool that simulates World of Warcraft combat and that players use to answer all kinds of questions about the game.
  • Provide access to powerful cloud hardware to run simulations (SimC is incredibly hungry for CPU power).

There are a few core areas/codebases behind everything:

  • Frontend — The browser client
  • Web Server — Serves up HTML, handles API calls
  • Worker — Processes SimC jobs
  • Warchief — Manages the worker fleet and handles various backend jobs
  • Infrastructure — Databases, deployments, monitoring, etc

95% of the code I write is JavaScript. The other 5% is Bash for managing the infrastructure.

Using JS on the frontend and backend lets me reuse code for things like SimC script generation/validation. Most of the code I write also isn’t particularly performance-intensive; it’s mostly glue between various systems, which is an area where NodeJS can be pretty good.

Everything depends on SimulationCraft which is written in C++. Raidbots would not exist without SimC and its dedicated developers. I want to throw out a huge thanks to Navv and Collision for the work they do on the project and the help they’ve provided to me in getting everything running smoothly. All the class module developers also deserve a huge heaping of thanks from me (and the entire community) for their tireless theorycrafting and coding.

Frontend

The primary responsibility of the frontend is to be a fast, easy-to-use UI for managing everything. It handles all the usual client-application concerns: routing, API requests and response handling, user data entry (text input, UI element selections, etc), and the display of various bits of information.

[Screenshot: The Redux DevTools Extension is pretty great.]

The frontend is a “Single Page App” — on initial load the web server provides a nearly empty HTML document, some JS is loaded, and from then on the JavaScript code is in complete control. When done well, this kind of app can provide a very snappy user interface although it does come with a bunch of challenges.

React and Redux are the core tech in use. Together, these provide the bulk of the UI rendering and client state management. I’m using a whole constellation of helper libraries (react-router for URL routing, redux-actions for simplified Redux code, Axios for AJAX requests, and a bunch of others).

React and Redux provide a pretty fantastic approach for building dynamic, responsive user interfaces, and they come with a ton of great development and debugging tools.
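To give a flavor of the pattern, here’s a minimal sketch of a reducer built with redux-actions. The action names and state shape are invented for illustration; this isn’t Raidbots’ actual client state.

```js
import { createAction, handleActions } from 'redux-actions';

// Hypothetical actions tracking a sim's lifecycle in the client.
export const simQueued = createAction('SIM_QUEUED');
export const simCompleted = createAction('SIM_COMPLETED');

// State is a map of sim id -> status; handleActions removes the
// usual switch-statement boilerplate from the reducer.
export const simsReducer = handleActions(
  {
    [simQueued]: (state, { payload }) => ({
      ...state,
      [payload.id]: { status: 'queued' },
    }),
    [simCompleted]: (state, { payload }) => ({
      ...state,
      [payload.id]: { status: 'complete', reportUrl: payload.reportUrl },
    }),
  },
  {} // initial state
);
```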

Web Server

The web server is a NodeJS app primarily using Express and Kue. Some of its responsibilities:

  • Serve up the initial HTML to users
  • Perform item/relic lookups
  • Check the validity of incoming simulation requests and add valid sims to the job queue
  • Handle Patreon OAuth flows
  • Manage user sessions, login, logout, etc
  • Provide various other API services (sim history, etc)

The critical architectural piece of the web server is what it doesn’t do: run SimC. All it does is create a job with the SimC input that needs to be processed. This decoupling makes managing the web server simpler and lets me scale those machines separately from the worker machines.
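To make that concrete, here’s a minimal sketch of the validate-and-enqueue flow with Express and Kue. The route name, payload shape, and size limit are hypothetical, not Raidbots’ actual API.

```js
const express = require('express');
const kue = require('kue');

const app = express();
const queue = kue.createQueue(); // connects to Redis

app.use(express.json());

app.post('/api/sim', (req, res) => {
  const { input } = req.body;

  // Reject obviously bad requests before they ever reach a worker.
  if (typeof input !== 'string' || input.length === 0 || input.length > 500000) {
    return res.status(400).json({ error: 'invalid SimC input' });
  }

  // The web server never runs SimC itself; it only creates a job.
  const job = queue
    .create('sim', { input })
    .attempts(3) // allow retries for transient failures
    .save((err) => {
      if (err) return res.status(500).json({ error: 'failed to enqueue sim' });
      res.json({ jobId: job.id });
    });
});

app.listen(3000);
```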

Workers

The worker app and machines are tightly coupled and have a very limited scope (sketched in code below):

  • Grab sim jobs from the queue
  • Spawn and monitor the SimC process, feeding it the SimC input
  • On success, save the HTML, JSON, and stdout/stderr output to permanent storage
  • On failure, determine whether the job should fail outright or be retried (e.g. due to intermittent network errors)
  • Forcibly kill jobs that overrun their time limit

[Screenshot: Kue, the job-processing library used on Raidbots, provides a nice admin UI for monitoring status]
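The heart of the worker is roughly this shape. This is a simplified sketch assuming Kue and Node’s child_process; the file paths, time limit, and SimC flags are trimmed down for illustration.

```js
const fs = require('fs');
const os = require('os');
const path = require('path');
const { spawn } = require('child_process');
const kue = require('kue');

const queue = kue.createQueue();
const TIME_LIMIT_MS = 10 * 60 * 1000; // hypothetical per-sim time limit

queue.process('sim', (job, done) => {
  // Write the SimC input from the job payload to a scratch directory.
  const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'sim-'));
  const inputFile = path.join(dir, 'input.simc');
  fs.writeFileSync(inputFile, job.data.input);

  const simc = spawn('simc', [inputFile, `html=${path.join(dir, 'report.html')}`]);

  // Forcibly kill sims that overrun their time limit.
  const timer = setTimeout(() => simc.kill('SIGKILL'), TIME_LIMIT_MS);

  simc.on('exit', (code) => {
    clearTimeout(timer);
    if (code !== 0) {
      // Kue retries the job if it was created with attempts(n).
      return done(new Error(`simc exited with code ${code}`));
    }
    // The real worker uploads the HTML/JSON/stdout to permanent storage here.
    done();
  });
});
```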

Warchief

While web servers and workers are “herd/fleet” servers (they all have random names and are constantly being spun up / spun down), Warchief is more of a “pet” server. Warchief performs a variety of non-user-facing functions (the fleet management loop is sketched after this list):

  • Manage the worker fleet based on workload size (queue getting too big? Spin up more workers. Workers sitting idle? Terminate some)
  • Synchronize Patreon/Raidbots accounts
  • Perform various bits of queue maintenance (watch for stuck jobs, clean up old complete/failed jobs)
  • Report metrics and notifications to various locations (Discord and Datadog)
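The fleet management bullet boils down to a feedback loop like this. Everything here is a simplification: the thresholds are invented, and resizeWorkerGroup is a stand-in for a call to the Compute Engine API.

```js
const kue = require('kue');
const queue = kue.createQueue();

const MIN_WORKERS = 2;     // hypothetical floor so the queue never stalls
const JOBS_PER_WORKER = 4; // hypothetical sims a single worker can chew through

function resizeWorkerGroup(size) {
  // In reality this calls the Compute Engine API to resize
  // the workers' managed instance group.
  console.log(`resizing worker group to ${size}`);
}

// Check the queue every minute and scale the fleet to match.
setInterval(() => {
  queue.inactiveCount((err, queued) => {
    if (err) return console.error(err);
    const desired = Math.max(MIN_WORKERS, Math.ceil(queued / JOBS_PER_WORKER));
    resizeWorkerGroup(desired);
  });
}, 60 * 1000);
```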

The codebase is a small NodeJS app, and the server itself has greater access to the rest of my infrastructure so that it can do things like resize worker instance groups. There are also some good ol’ cron jobs to do things like kick off the nightly worker build.

[Screenshot: Discord webhooks make it possible for me to track status/errors in real time in a private Discord channel]
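Posting to a Discord webhook is just an HTTP request, so the notification side is tiny. A sketch using Axios (mentioned earlier for the frontend, but it works fine in Node too), with the webhook URL coming from an environment variable:

```js
const axios = require('axios');

// Send a message to a private channel via a Discord webhook.
function notifyDiscord(message) {
  return axios.post(process.env.DISCORD_WEBHOOK_URL, { content: message });
}

notifyDiscord('Queue maintenance: cleaned up 42 stale jobs').catch(console.error);
```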

Infrastructure

Raidbots is using Google’s cloud offerings for the virtual machines as well as several of their services.

  • Google Compute Engine is used for Warchief, web servers, workers, and Redis
  • Google Datastore is the primary database, holding users, summaries of sims run, and other persistent data
  • Google Storage houses all the SimC reports/JSON as well as the compiled/bundled frontend assets (primarily the JS run in the browser); a report upload is sketched below
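As an example of the Google Storage piece, here’s roughly how a worker might persist a finished report using the @google-cloud/storage client. The bucket and object names are made up for illustration.

```js
const { Storage } = require('@google-cloud/storage');

// Credentials come from the VM's service account on Compute Engine.
const storage = new Storage();

async function saveReport(simId, html) {
  await storage
    .bucket('raidbots-reports') // hypothetical bucket name
    .file(`reports/${simId}.html`)
    .save(html, { metadata: { contentType: 'text/html' } });
}
```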

I initially chose Google Cloud Platform (GCP) because of their low CPU pricing but have quickly become reliant on their other services and ease of use. At peak loads seen after the MMO-Champion post, Raidbots was running ~500 CPUs for the worker machines and handling the load like a champ.

[Screenshot: Snapshot of some of the instance groups in use in Google Compute Engine]

I treat site reliability as a very, very high priority so I use a bunch of additional services for monitoring and debugging.

  • Datadog — all of my servers send tons of metrics that I can use to monitor the health of the site and help diagnose issues.
[Screenshot: My primary command center in Datadog]
  • Papertrail — centralized logging for all servers. This is super important for Raidbots given that servers are constantly being created and destroyed. Once a machine is gone, the logs on that machine are deleted as well.
[Screenshot: Many logs]

  • Sentry — Frontend and backend exceptions are sent to Sentry, which gives me a nice way to determine how widespread an error is, as well as access to helpful debugging tools.

[Screenshot: Most of these are actually just noise right now, as far as I can tell]

Would You Like to Know More?

Hopefully this was interesting! If you have any questions, thoughts, feedback, or rants, I’d love to hear it!

Feel free to join the Raidbots Discord or hit me up on Twitter.
