Iterating faster with open-source

Johan Mjönes
Sep 11, 2019
Using Agones to start a game server on Kubernetes.

Size used to be a big advantage when making games.

Big studios with lots of resources could invest in large backend teams and build powerful custom internal infrastructure to support massive numbers of players in a way that, up until a few years ago, was impossible for smaller indie studios to even consider.

But that was back then. We're now at a point where things that aren't, and shouldn't be, part of a game studio's core business can be open-sourced or bought as cloud services, in a way that scales with your needs.

The advantage of having a large internal infrastructure operation is eroding, to the point that it’s becoming a burden. And it’s one way in which the barrier to building games is becoming lower.

Embark Studios was started less than a year ago, and our intention is to be a part of and take advantage of this ongoing democratization in our industry.

I’m Johan Mjönes, and I head up infrastructure at Embark. In this post, I’ll be detailing how we’ve used open-source software to run our first ever multiplayer playtest — in production.

Multiplayer game servers

We’ve been deliberately cautious in detailing too much about the game we’re working on. But we’ve said that it’s a cooperative free-to-play action game.

From an infrastructure standpoint, that means that we’ll likely stick with a traditional client-server architecture, with a game client running on the player’s PC or console connecting to a game server operated by us. This game server runs the entire simulation (including physics) for all connected clients.

Game servers are one of the main operational expenses for multiplayer games. One reason is simply that they're computationally intensive. Another is that they tend to be optimized late in development: games are commonly developed and tested on high-end workstations rather than on the lower-end CPUs they'll run on in production.

This results in a situation where the only viable deployment options are high-end instance types. Not only are these instance types more expensive, there are also fewer of them, which increases the risk of servers running out when you need them.

Production first

To reduce the likelihood of this happening to us, we've merged development game servers and production game servers into one and the same thing: we run them in the same place, with the same software and configuration, regardless of which phase of the game's life-cycle we're in.

We can then put tight constraints on the CPU and memory available to game servers from the start. This way, any change in resource footprint, like increasing the number of cores or the amount of memory available to a game server, becomes an explicit choice by the game team.

Making production our primary platform leads to other operational advantages too. Teams working with the game server have a natural incentive to add observability to the game servers, like log collection, metrics, and crash reports.

Game server infrastructure

We run most of our workloads on Kubernetes. We have a small on-prem cluster in-house, but most workloads run on Google Cloud’s GKE. Using the Kubernetes APIs as a baseline for very simple services (like a game server) allows us to run them on almost any provider or co-location, if we need to.

Traditionally, deploying a multiplayer game required custom tooling to manage fleets of game servers. Google's Agones, launched in the spring of 2018, does precisely this. It includes a game server SDK that manages communication between the game server and the fleet management controllers, including health checks and the ability for game servers to label themselves based on capabilities (later to be used for matchmaking).
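To make the SDK's role concrete, here is a minimal sketch of what that game-server-to-controller communication looks like. Agones ships gRPC SDKs, but its sidecar also exposes a local HTTP gateway (per its documentation, on port 9358 by default, configurable via the `AGONES_SDK_HTTP_PORT` environment variable). This is not our actual integration code, just an illustration of the calls involved:

```python
import json
import os
import urllib.request

# The sidecar's local HTTP gateway; Agones sets AGONES_SDK_HTTP_PORT in the
# game server's environment (9358 is the documented default).
SDK_PORT = os.environ.get("AGONES_SDK_HTTP_PORT", "9358")
BASE_URL = f"http://localhost:{SDK_PORT}"

def sdk_request(path: str, method: str, payload: dict) -> urllib.request.Request:
    """Build a request against the local Agones SDK sidecar."""
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode(),
        method=method,
        headers={"Content-Type": "application/json"},
    )

def mark_ready() -> None:
    # Tell the fleet management controllers this server can accept players.
    urllib.request.urlopen(sdk_request("/ready", "POST", {}))

def report_health() -> None:
    # Called periodically; if the pings stop, Agones treats the server as unhealthy.
    urllib.request.urlopen(sdk_request("/health", "POST", {}))

def set_label(key: str, value: str) -> None:
    # Self-labeling based on capabilities, later usable by matchmaking.
    urllib.request.urlopen(sdk_request("/metadata/label", "PUT", {"key": key, "value": value}))
```

A real game server would call `mark_ready` once after start-up and `report_health` on a timer from then on.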

Deploying our first game server

Game servers usually have low operational complexity. Apart from needing horsepower, they run as a single process and only need a publicly addressable UDP port on which the game clients communicate.

For this particular game, we're using Unreal Engine. Our first step towards running a Linux game server was to have our builds compile and cook game data on Linux, and then create a container with the executable, the data, and the runtime requirements.

With an Agones-compatible game server ready to go, we installed Agones on a GKE cluster and started our first game server.

However, setting up game servers manually is not well aligned with our goal of getting rid of toil by automating menial tasks. We have since automated the launching of game servers to be on-demand, driven by our matchmaking backend.

One Last Thing

All well and good. Except for one thing.

Our game server containers are currently two gigabytes in size and will grow significantly as we continue to add content. This is problematic for two reasons, both related to cost: start-up time and build distribution.

We want to keep matchmaking times to a minimum, and if the start-up time for a game server exceeds a certain threshold (let's say ten seconds), we need to spin up game servers in anticipation of players rather than in reaction to a matchmaking success. These idle servers cost us money.
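The connection between start-up time and idle cost can be made concrete with a back-of-envelope calculation. This is not our actual provisioning logic, and the numbers below are made up; it only illustrates why shaving seconds off the boot directly shrinks the warm pool:

```python
import math

def warm_buffer(matches_per_minute: float, startup_seconds: float,
                headroom: float = 1.5) -> int:
    """Game servers that must already be idle to absorb the matches expected
    to start before a freshly requested server has finished booting."""
    # Matches that will begin while one new server is still starting up.
    in_flight = matches_per_minute / 60.0 * startup_seconds
    # Pad for demand spikes; ceil because servers come in whole units.
    return math.ceil(in_flight * headroom)

# Hypothetical numbers: at 120 matches/minute, a 15-second boot needs
# 45 idle servers standing by; cutting the boot to 10 seconds needs only 30.
```

The `headroom` factor is an assumption standing in for whatever safety margin a real capacity planner would use.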

Build distribution also affects cost, both indirectly and directly: indirectly by increasing our start-up time, and directly through egress and ingress networking charges, depending on our ability to cache container images.

We could use container layer caching, but the way these game server builds are constructed means that most deltas appear as small changes to two large files: the executable and a single pak file containing the game data.

Luckily for us, this problem domain has been well explored by others: systemd's casync (and its Go counterpart, desync) is a generic yet efficient solution that we wanted to explore.
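casync's real chunker (a buzhash-based content-defined chunker with tuned size limits) is more sophisticated than this, but the core idea fits in a few lines: cut chunks wherever a rolling hash of the last few bytes hits a fixed pattern. Because cut points depend only on nearby content, an edit early in a file leaves the later chunks, and therefore their content-addressed store keys, untouched. A toy sketch:

```python
import hashlib

WINDOW = 48            # bytes of context the rolling hash sees
B, MOD = 31, 1 << 32   # Rabin-Karp style polynomial rolling hash
POW = pow(B, WINDOW, MOD)
MASK = (1 << 12) - 1   # cut when the low 12 bits are zero: ~4 KiB average chunks
MIN_CHUNK = 256        # avoid degenerate tiny chunks

def chunk(data: bytes) -> list[bytes]:
    """Toy content-defined chunking: cut points depend only on the last
    WINDOW bytes of content, not on absolute file offsets."""
    chunks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling = (rolling * B + byte) % MOD
        if i >= WINDOW:
            # Drop the byte that just left the window.
            rolling = (rolling - data[i - WINDOW] * POW) % MOD
        if i + 1 - start >= MIN_CHUNK and (rolling & MASK) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def chunk_ids(data: bytes) -> set[str]:
    # Content-addressed store keys: identical chunks share one id,
    # so only chunks with new ids need to be uploaded or downloaded.
    return {hashlib.sha256(c).hexdigest() for c in chunk(data)}
```

Chunking two builds that differ by a small patch yields almost entirely overlapping id sets, which is exactly why the transfer delta stays tiny even though the files themselves are huge.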

By using this technique, we were able to reduce the delta between two consecutive game server releases to about ten kilobytes (down from the full two gigabytes). We also reduced the start-up time from about a minute (and growing) to about fifteen seconds flat. Still a bit too high for on-demand startups, but the idling part of the fleet can at least be reduced while we optimize start-up times further.

A new way of working

By combining a production-first approach with the use of open-source, we were able to launch our first multiplayer playtest in production after about a week’s worth of work.

We see two major benefits with this approach:

  1. Running with a production set-up early makes everybody comfortable with the tools and workflows from the start.
  2. Cost control is improved, with game teams able to make informed decisions about the trade-off between optimizing and throwing hardware at problems.

I recently spoke at a Google Cloud Summit event about how we work with infrastructure.

We’re planning to keep you posted on how we’re setting up our infrastructure. Next time, I hope to go into a bit more detail about how our team is organized around Service Delivery: the art of removing delivery friction despite using tools that aren’t always well-adjusted to automation. More about that later!

PS. We’re expanding our team. So if you’ve read this far you should probably check out our career page!
