Performance Testing at Space Ape Games

Louis McCormack
spaceapetech
Mar 5, 2024


Performance is paramount in the world of free-to-play mobile gaming. The obvious reason: if we offer a sub-par experience to our players, they are unlikely to stick around, unlikely to come back and unlikely to spend money.

But there are also more opaque reasons: the Google discovery algorithms, for instance, treat performance issues (along with app crashes) as key signals for discoverability. At Space Ape Games, we’ve seen a huge variance in the number of organic installs just from dropping out of certain performance tiers.

But measuring device performance is difficult. We need to support a wide range of devices; how can we ensure our games run smoothly on all of them? We release regular updates and new features; how do we ensure these changes haven’t degraded performance?

It certainly is challenging. One approach is to simply eyeball it: play through the game and hope to notice any performance issues before they make it into the wild. However, this doesn’t scale. We have a world-class QA function here at Space Ape, but there is no way they could regularly play through our games on all of the devices we support. Plus their judgement would be subjective, and they may miss iterative degradations that eventually add up to a larger problem.

No, we needed a more scientific approach. As a tech organisation, we have a lot of experience measuring the performance of our backend systems. Could we apply some of that knowledge to this domain?

This article is the story of how we did just that, with the help of two of our partners — GameBench and AWS.

What to Measure?

How can we objectively measure the on-device performance of our mobile games?

The way we approached this was to consider the factors that may cause our players to have a bad experience, and then find representative metrics. For instance:

  • Are our players waiting too long to play our game? App startup time, scene load time
  • Is the game stuttering or juddering? Frames per second (FPS)
  • Is our game having an adverse impact on the player’s device? CPU and memory usage

Of course, identifying the metrics is only half the battle. We also needed to pick values that represent good or bad measurements. This is where things get a little complicated, due to the diverse range of mobile devices that we need to support.

For instance: ideally, our games run at 60 FPS. On some low-end devices we accept that they will run at 30 FPS, while on some mid-range devices we allow parts of the game to run at 30 FPS and parts at 60. It’s a similar story with load times: high-end devices will load the game faster than low-end ones.

We found CPU usage to be somewhat volatile from device to device, and mostly useful as an investigative aid. Memory usage, thankfully, is more clear-cut: iOS will terminate apps that use more than 70% of a device’s RAM, so we want to know if we are even approaching that threshold.
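
To make these targets concrete, we express them as per-tier thresholds that our tooling can check against. Below is a minimal sketch of what such a configuration might look like in Python; the tier names and numbers are purely illustrative, not our actual values.

```python
# Illustrative thresholds only: the tier names and numbers are examples,
# not our real values.
PERFORMANCE_THRESHOLDS = {
    "high_end":  {"min_fps": 60, "max_scene_load_s": 10, "max_memory_pct": 60},
    "mid_range": {"min_fps_menus": 60, "min_fps_gameplay": 30,
                  "max_scene_load_s": 15, "max_memory_pct": 60},
    "low_end":   {"min_fps": 30, "max_scene_load_s": 20, "max_memory_pct": 60},
}

def within_memory_budget(used_bytes: int, total_bytes: int, tier: str) -> bool:
    """Flag runs whose peak memory use gets anywhere near the iOS kill threshold."""
    used_pct = 100 * used_bytes / total_bytes
    return used_pct <= PERFORMANCE_THRESHOLDS[tier]["max_memory_pct"]
```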

By this point we were building up a clearer picture of what we wanted to measure; we just needed a way to actually measure it. Enter GameBench…

GameBench

We fairly quickly settled on a tool to help us in this endeavour, from a company called GameBench.

GameBench are, according to the company blurb, a world leader in performance management, and their product (the Studio SDK) promised to help us scientifically measure and optimise our games. They are also trusted by some pretty big game studios. Sounds good!

We make our games using Unity, and GameBench provided us with a Unity SDK which we dropped into our projects. Now, whenever anyone in the studio plays one of our games, a rich set of metrics is sent off to GameBench and presented to us in a web UI. All without any instrumentation of our own code!

The GameBench web UI

Another requirement which GameBench ticked off neatly was the ability to define different sections of our games, through their Marker system. This means we are able to set different performance targets for different parts of our game. It also reduces the danger of metrics being smoothed out over the course of a test: if we maintained 60 FPS for 90% of a test but dipped to 20 FPS for the other 10%, the overall test might still pass, yet that wouldn’t be a great user experience.

CVC Autoplay Build sends metrics to GameBench cloud service

In order to provide a repeatable process, we concocted an autoplay build of our games. The first game we tried this with was our match-three hit Chrome Valley Customs (henceforth referred to as CVC). The CVC autoplay build steps through the vehicles, completing various parts of the car restoration process, and plays through match-three levels just like a player would!

Now that we had a way to reliably measure our performance indicators on-device, we needed a way to run performance tests at scale, on a wide range of devices. Enter AWS Device Farm…

AWS Device Farm

Imagine thousands of mobile phones in a data-centre, twinkling and beeping away whilst being cooled by giant fans. That’s AWS Device Farm, probably. Regardless, what it does is provide us compute time on a huge range of devices running in a controlled environment.

That last point is an important one: performance tests must be run in a controlled environment. If external factors are able to influence the outcomes, we cannot trust the results. Before we went much further we needed to be assured that AWS Device Farm did indeed fit the bill.

We performed a simple validation of this fact: we ran the same performance test on the same devices, twice a day, for a whole month. At the end we collated all of the results, analysed the deviation from the mean and found that the variance was within acceptable bounds.
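
The analysis itself was simple. Here is a minimal sketch of the kind of check we ran, assuming the month of results has been exported as per-device lists of FPS readings; the 10% coefficient-of-variation bound is illustrative rather than the exact figure we used.

```python
from statistics import mean, stdev

def is_stable(fps_by_device: dict[str, list[float]], max_cv: float = 0.10) -> bool:
    """Check that each device's repeated readings vary within an acceptable bound.

    fps_by_device maps a device name to the FPS readings gathered over the month,
    e.g. {"Pixel 4a": [58.9, 59.2, ...], "iPhone 11": [59.8, ...]}.
    Uses the coefficient of variation (stdev / mean); 10% is an illustrative bound.
    """
    stable = True
    for device, readings in fps_by_device.items():
        cv = stdev(readings) / mean(readings)
        print(f"{device}: mean={mean(readings):.1f} FPS, cv={cv:.1%}")
        if cv > max_cv:
            stable = False
    return stable
```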

By this point, things were shaping up nicely. We had a nightly autoplay build of CVC with the GameBench SDK integrated. We had a way to run this build in parallel across a diverse set of devices. We had our desired metrics being collated and presented back to us in the GameBench web portal.

But there was still something missing. To make this truly useful we needed to expose this information to our developers. It wasn’t enough to simply collect the data; we needed to be told if we had breached our performance thresholds. We needed alerts.

GameBench does provide alerts, which would work just fine for smaller projects. But we had dozens of tests running each night, on several devices, each with different performance criteria. A failure would trigger dozens of Slack messages which would, we know from experience, lead to alert fatigue. Furthermore, we now had Device Farm logs and screenshots to aid in any investigation. We felt we needed something to orchestrate all of this and present all of the information in a single place… Enter AWS Step Functions.

AWS Step Functions

In order to explain what AWS Step Functions is, it is useful to show what we wanted to do:

The diagram shows the workflow that we wanted our performance tests to follow. Each node in the diagram represents a state in the workflow, and the adjoining lines are transitions between states. What we have here is an example of a State Machine.

That, essentially, is AWS Step Functions: a way to define a State Machine (or workflow) in AWS. Each step in the workflow is a state, which can either be a flow-control state (such as a Choice or a Wait) or enact some sort of Task. Tasks are where things get interesting: Step Functions is tightly integrated with many other AWS services, so a Task can easily trigger some other action.
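
For a flavour of what that looks like, here is a toy state machine written in Amazon States Language, expressed as a Python dictionary so it can be serialised and handed to Step Functions. Every name and ARN below is a placeholder for illustration, not our actual definition.

```python
import json

# A toy workflow: run a Task, Wait a while, then use a Choice to either loop
# back or finish. All names and ARNs are placeholders.
TOY_WORKFLOW = {
    "StartAt": "DoSomeWork",
    "States": {
        "DoSomeWork": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:do-work",
            "Next": "Pause",
        },
        "Pause": {"Type": "Wait", "Seconds": 60, "Next": "Finished?"},
        "Finished?": {
            "Type": "Choice",
            "Choices": [
                {"Variable": "$.done", "BooleanEquals": True, "Next": "Done"}
            ],
            "Default": "DoSomeWork",
        },
        "Done": {"Type": "Succeed"},
    },
}

definition = json.dumps(TOY_WORKFLOW)  # what you would pass to CreateStateMachine
```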

In our case, we configured a Task which kicks off a Device Farm run. Within each run, we play the performance build of CVC (the autoplay build with the GameBench SDK integrated) on a pre-configured collection of different devices (called a Device Pool). We then configured another step that loops until all of the runs have completed.
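
As a rough sketch, those two steps might look something like the Lambda handlers below, using boto3's Device Farm client. The ARNs, event shape and test type are placeholders; in particular, the real test specification depends on how the autoplay build is driven.

```python
import boto3

# Device Farm's API is served from us-west-2 regardless of where the rest of
# the stack lives. All ARNs and event fields below are placeholders.
devicefarm = boto3.client("devicefarm", region_name="us-west-2")

def start_run(event, context):
    """Step Functions Task: schedule the autoplay build on a Device Pool."""
    run = devicefarm.schedule_run(
        projectArn=event["project_arn"],
        appArn=event["app_arn"],                # the uploaded performance build
        devicePoolArn=event["device_pool_arn"],
        name="nightly-performance-test",
        # Placeholder test spec: the actual type depends on the test harness.
        test={"type": "BUILTIN_FUZZ"},
    )
    return {"run_arn": run["run"]["arn"]}

def check_run(event, context):
    """Step Functions Task: report whether the Device Farm run has finished."""
    run = devicefarm.get_run(arn=event["run_arn"])["run"]
    return {"run_arn": event["run_arn"],
            "status": run["status"],            # e.g. RUNNING or COMPLETED
            "result": run.get("result")}        # e.g. PASSED or FAILED
```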

We now have a way to orchestrate our nightly performance tests, to trigger them on a range of devices and watch until they all complete. But what then? How do we extract the vital statistics and broadcast the results?

Well, another nicety of AWS Step Functions is that each of the steps can also invoke a Lambda function (or indeed an ECS task). Once all of our Device Farm runs are complete, we trigger a Lambda function which:

  • Identifies the GameBench session ID of each run
  • Queries the GameBench API to get the performance metrics for that session
  • Compares the results with our pre-defined thresholds
  • Collates all of the information into a nice Slack message (a simplified sketch of this handler follows below)
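
Here is a heavily simplified sketch of that reporting Lambda. The GameBench call is hidden behind a hypothetical helper (their actual API isn’t shown here), and the event shape, thresholds and Slack wiring are illustrative.

```python
import json
import os
import urllib.request

SLACK_WEBHOOK_URL = os.environ["SLACK_WEBHOOK_URL"]  # a Slack incoming-webhook URL

# Illustrative per-tier thresholds, keyed to match the devices in the pool.
THRESHOLDS = {"low_end": {"min_fps": 30}, "high_end": {"min_fps": 60}}

def fetch_session_metrics(session_id: str) -> dict:
    """Hypothetical helper standing in for a query to the GameBench API."""
    raise NotImplementedError

def handler(event, context):
    lines = []
    for run in event["runs"]:  # placeholder shape produced by the earlier steps
        metrics = fetch_session_metrics(run["gamebench_session_id"])
        limits = THRESHOLDS[run["tier"]]
        passed = metrics["median_fps"] >= limits["min_fps"]
        icon = ":white_check_mark:" if passed else ":x:"
        lines.append(f"{icon} {run['device']}: {metrics['median_fps']} FPS "
                     f"(target {limits['min_fps']}) <{run['logs_url']}|logs>")
    payload = {"text": "Nightly performance tests\n" + "\n".join(lines)}
    req = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
```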

Ultimately, it is the Slack message which is the output of this entire fandango. Here is an example:

As you might expect, it shows clearly which devices have succeeded and which have failed. But we can also see in which section of the game a failure occurred. We also provide links to a wealth of information that can be used to investigate why a run failed: the GameBench web UI for the specific session, Device Farm logs, application logs, even a link to a screen recording of the failed run!

Conclusion

We now have a way of objectively knowing if our games are suffering from poor performance. We run these tests nightly against our development builds, so we can be told if any changes we’ve made during the day have impacted performance.

This is incredibly useful: we catch issues on a regular basis, from unexpected memory creep to unwanted dips in FPS caused by new features.

This means we can be assured that the products we put out will continue to please our players, and we won’t fall afoul of any performance penalties from our platform partners.

But we’ve also seen a change in our developers. They are now able to take a much more analytical approach to performance: we have performance champions within our games teams, we now talk of performance budgets, and in general device performance has shifted to the forefront of our collective minds.

Thanks for reading!

Donna and Uncle Hank approve
