How Battlehouse saved $60,000 a year on AWS

Battlehouse operates a portfolio of massively-multiplayer strategy games on Facebook, serving over 5 million accounts with non-stop battle action in titles like Thunder Run: War of Clans.

I’ve been managing Battlehouse’s back-end infrastructure on Amazon AWS since 2012. Cloud services represent a major part of our operational costs, so controlling our spending on AWS is a high priority.

In this article I’ll explain how we optimized our AWS usage to cut cloud costs by 50%, saving about $60,000 a year, while also improving the reliability and speed of our services.

Before we started this optimization, our AWS spending was split roughly as follows:

  • 40% on EC2 instances
  • 30% on network bandwidth and the CloudFront CDN
  • 10% each on S3 storage, RDS instances, and miscellaneous other services

We looked for opportunities to cut spending, starting with the biggest individual pieces.

Reserved instances for EC2 and RDS

The quickest and easiest “win” was to ensure all our ongoing EC2 and RDS capacity was covered by reserved instances. Amazon’s reservation system allows you to pre-pay for 12 or 36 months of usage, and in return receive a discount of about 50% compared to hourly pricing.

It’s important to note a few points about using reserved instances:

Coverage tracking: Back in 2012, it was difficult to know whether all our EC2 and RDS instances were covered by the reservations we purchased, because there were no tools to display coverage. I developed an internal script, now open-sourced at this link, to alert us about any running instances that were not covered by a reservation, or extra reserved capacity we weren’t using.

Today, Amazon’s Cost Explorer includes some features for tracking reserved instance coverage. We still prefer to use our own script because it is simple and easy to use.
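
As a rough illustration of the coverage check described above (not the actual open-sourced script), here is a minimal Python sketch. It assumes the lists of running instances and purchased reservations have already been fetched from AWS, and it matches them only by instance type and availability zone. The function name `coverage_report` and the input shapes are hypothetical, and real reservation matching has more rules than this.

```python
from collections import Counter


def coverage_report(running, reserved):
    """Compare running instances against purchased reservations.

    running:  iterable of (instance_type, zone) tuples, one per running instance
    reserved: iterable of (instance_type, zone, count) tuples, one per reservation
    Returns (uncovered, unused), both Counters keyed by (instance_type, zone).
    Simplifying assumption: a reservation covers only an exact type/zone match.
    """
    running_counts = Counter(running)
    reserved_counts = Counter()
    for instance_type, zone, count in reserved:
        reserved_counts[(instance_type, zone)] += count
    # Counter subtraction drops zero and negative results, so each side
    # keeps only the excess that the other side does not account for.
    uncovered = running_counts - reserved_counts
    unused = reserved_counts - running_counts
    return uncovered, unused


if __name__ == "__main__":
    running = [("m4.large", "us-east-1a"), ("m4.large", "us-east-1a")]
    reserved = [("m4.large", "us-east-1a", 1), ("r4.large", "us-east-1a", 2)]
    uncovered, unused = coverage_report(running, reserved)
    print("Running without a reservation:", dict(uncovered))
    print("Reserved but not running:", dict(unused))
```

Both outputs matter: the first list is capacity billed at the full hourly rate, and the second is pre-paid capacity that is going to waste.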

Cloud credits: Like many startups, we took advantage of special one-time credit offers from cloud providers like AWS. Unfortunately, these credits cannot be used to purchase reserved instances. This makes accurate coverage tracking even more important. To optimize spending, we needed to deliberately leave our instances un-covered by reservations right up to the point when our free credits ran out, and then quickly buy the reservations once we were spending our own cash.

Annual updates: We preferred 12-month reservations over the 36-month option, for two reasons. First, our own capacity requirements change rapidly as new games are released, so 12 months is a rather long time horizon for future planning. Second, Amazon releases new instance types on a regular basis, and we wanted the flexibility to switch services to newly-released types when they offered superior price/performance.

Typically, when Amazon releases a next-generation instance type, they keep the same pricing on previous-generation instances, and set better prices on the new instances. This means that if you do not move your workload to next-generation instances, you are now paying more than you should. Effectively, Amazon imposes a “tax” on users that do not take advantage of new instance types quickly!

With our handy script, we were able to switch almost all of our EC2 and RDS usage to reserved instances, cutting about 50% off the hourly rates we were previously paying.

From CloudFront to CloudFlare

We originally served up game assets like images and music through Amazon’s CloudFront CDN. Due to the large download size and growing user base, CloudFront and network bandwidth threatened to become the single largest piece of our cloud spend.

The CDN industry is highly competitive and we started to look for better deals outside of Amazon. As of 2018, CloudFlare is the clear industry leader. After running some tests, we found that their global caching system is far superior to CloudFront in both performance and price. By sending our traffic through CloudFlare, plus careful optimization of HTTP headers and compression settings, we knocked off a gigantic portion of AWS network spend.
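
To give a flavor of what header and compression tuning can look like, here is a small Python sketch of a policy for static game assets. It is an assumption for illustration, not our production configuration: the function name `asset_response`, the extension list, and the one-year cache lifetime are all hypothetical. The idea is to let the CDN cache assets for a long time, and to gzip only text-like files, since images and music are already compressed.

```python
import gzip
import os

# Hypothetical policy: text-like formats that benefit from gzip.
# Images and audio are already compressed, so they are served as-is.
COMPRESSIBLE = {".js", ".css", ".json", ".html", ".svg"}


def asset_response(path, body, accept_encoding="", max_age=31536000):
    """Return (headers, body) for a static asset.

    path:            asset file name, used only to pick a policy by extension
    body:            raw bytes of the asset
    accept_encoding: value of the client's Accept-Encoding request header
    max_age:         cache lifetime in seconds (assumed: one year)
    """
    headers = {"Cache-Control": f"public, max-age={max_age}"}
    extension = os.path.splitext(path)[1].lower()
    if extension in COMPRESSIBLE:
        # The response depends on the request header, so tell caches.
        headers["Vary"] = "Accept-Encoding"
        if "gzip" in accept_encoding.lower():
            body = gzip.compress(body)
            headers["Content-Encoding"] = "gzip"
    headers["Content-Length"] = str(len(body))
    return headers, body
```

A long cache lifetime like this only works if asset URLs change whenever the content changes, for example by including a version number in the file name.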

CloudFlare also brings robust protection against DDoS attacks. This became vital when we suffered an attack that overwhelmed Amazon’s Elastic Load Balancers, but that’s a story for another time!

Sharing Infrastructure Across Games

At Battlehouse we operate a portfolio of many games, with 6 currently active as of 2018. Originally, each game title operated as a completely independent stack, with its own load balancer, HAProxy router, API server, and database instances.

The main benefit of this architecture was that we could upgrade, maintain, and repair one game at a time, without affecting service for any other game. Each game operated within a different availability zone, so even a total zone outage could not take down the whole portfolio. At first, we kept all of these stacks in sync manually, then later switched to Terraform to automate configuration changes.

As time passed, however, some game titles became much more popular than others. The top games fully utilized their free-standing infrastructure. However, the lower-performing games were under-utilizing their allocated resources. We were paying for load balancers, proxies, and databases that weren’t ever loaded above 10%, even on the smallest instance types. We had no intention of shutting these games down, but it was annoying to see how much we were spending to keep them operating.

In 2017, we made the decision to share AWS resources across all our game titles. We moved to a single load balancer, consolidated the routing proxies to handle traffic from all game titles, and merged RDS databases. We gained in both reliability and cost efficiency. In the new architecture, each layer was spread across multiple availability zones. For example, we now run generic routing proxies in two zones, each one capable of handling the full traffic load from all games, so a zonal outage cannot take down routing for any game.

After the dust settled, we had dropped from 6 per-game, one-zone duplicates at each layer down to 2 all-game, zone-redundant instances. That cut our number of instances and load balancers dramatically, delivering significant cost savings.

Hot/cold storage and archival pruning

The final piece of our optimization work focused on data storage. Our games use many different storage methods, but they can all be classified as either “hot” or “cold” data.

“Hot” data needs (expensive) low-latency access, and its storage requirement scales no worse than linearly with the number of active players. Examples of “hot” data are things like player levels and leaderboard scores.

“Cold” data can be served on (cheaper) high-latency systems, but has no constraints on how large it can grow. This includes tens of gigabytes per day of metrics, logs, and backups.

Currently, we store “hot” data in MongoDB and RDS and “cold” data in S3. Limiting the size of the “hot” data optimizes our spending on MongoDB and RDS, because we pay for fast service only on a small amount of data. For instance, we use i3-class machines to run MongoDB, which would be prohibitively expensive to scale to unlimited storage, but we can rest easy knowing that we’re only using a fraction of their fast NVMe drives.

S3 is extremely cheap (even with terabytes of archived data, it’s less than 10% of our AWS bill), so we don’t worry about dumping large chunks of data into it every day. However, the amount of data added daily is proportional to the number of active players N. If the player base grows linearly over time, each day adds more data than the last, and total storage, being the running sum of those daily additions, risks scaling as O(N²). To combat this long-term growth, we implemented a pruning system that selectively drops redundant archived data. For example, we retain daily backups up to the previous month, but only monthly backups prior to that.
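
The retention rule for backups can be sketched in a few lines of Python. This is a simplified illustration rather than our actual pruning code: the function name `backups_to_keep` is hypothetical, “the previous month” is approximated as a 31-day window, and keeping the earliest backup of each older calendar month is an assumed choice.

```python
from datetime import date, timedelta


def backups_to_keep(backup_dates, today, daily_window_days=31):
    """Return the set of backup dates to retain.

    Keeps every backup taken within the last `daily_window_days` days,
    plus the earliest backup of each older calendar month.
    Anything not in the returned set is a candidate for deletion.
    """
    cutoff = today - timedelta(days=daily_window_days)
    keep = set()
    earliest_in_month = {}
    for backup_date in backup_dates:
        if backup_date >= cutoff:
            # Recent backup: always retained.
            keep.add(backup_date)
        else:
            # Older backup: remember only the earliest one per month.
            key = (backup_date.year, backup_date.month)
            if key not in earliest_in_month or backup_date < earliest_in_month[key]:
                earliest_in_month[key] = backup_date
    keep.update(earliest_in_month.values())
    return keep


if __name__ == "__main__":
    today = date(2018, 6, 15)
    all_backups = [date(2018, 1, 1) + timedelta(days=i) for i in range(166)]
    kept = backups_to_keep(all_backups, today)
    print(f"Keeping {len(kept)} of {len(all_backups)} backups")
```

With a rule like this, the number of retained backups grows by about one per month instead of one per day, which slows the long-term growth of archived data.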

Future Work

We’re quite happy with the achievements so far. In case we need to optimize even more in the future, the first target would be the API server, which cannot yet run on shared infrastructure. Despite the infrastructure-sharing work mentioned above, we still need to run each game’s API server on its own instance, because the code has some legacy aspects that involve keeping too much state locally. The clear solution is to refactor the code, along the lines of the 12-Factor App guidelines, and run it in a shared environment, perhaps via Docker or Kubernetes. But that’s a story for another day!