Testing like a pro: tips & tricks to improve mobile game testing

Discover testing insights to transform your game’s life cycle

Ignacio Monereo
Google Play Apps & Games

--

This article comes from a presentation at Google Playtime 2019. Watch the full talk here:

Testing is a common practice among mobile game developers. However, developers often struggle to prioritize which features to test and find it difficult to interpret the results, so changes are often postponed or never made.

In this article, we will learn from two mobile game developers about their experience with testing, including:

  • the methodologies they use.
  • prioritizing, timing, and tracking a pipeline of tests.

We have one example each for pre-launch and post-launch testing to show how to increase the efficacy of tests across different areas, in these cases retention and monetization.

The importance of careful testing

It is our nature to try new things, whether our goal is to improve key business metrics, to check if we were right about a certain hypothesis, or simply to satisfy our curiosity. This is the reason why, before jumping from a cliff with a pair of handmade wings, aviation pioneer Otto Lilienthal made hundreds of test flights.

Google Doodle commemorating Otto Lilienthal’s anniversary.

Otto was able to fly relatively long distances, over 250 m with some of his gliders. However, perhaps the most interesting aspect of his flights, compared to his contemporaries, is that he documented each test flight thoroughly and consistently. By analyzing his results with an unbiased mind and focusing on what the tests told him rather than what he believed to be true, he was able to improve constantly. This led to great strides (and great flights) that later laid the groundwork for the Wright brothers.

When it comes to mobile games tests, whether you want to improve monetization, retention, or user acquisition, having a methodology and carefully interpreting the results are just as important — otherwise, you might find yourself falling off a cliff of your own…

“The main goal to run any test is to gain some degree of certainty in a specific course of action.”

- Bryan Mashinter, VP Live Games Wooga

Testing along the whole player funnel

Most game developers we asked split their tests based on where they sit in the player funnel. We see tests happening either outside the game or inside it: in user acquisition ads, store listings, welcome screens, and more. Players’ preferences may be tested more often than they realize.

External tests: Often related to design, marketing, and user acquisition activities such as:

  • Art style: before starting production of the game, developers consider which art style is most suitable for their audience, such as cartoon, Asian, or realistic styles.
  • User acquisition: testing everything from the style of ad creatives to when to show the call to action, and the value of different markets and advertising platforms.
  • Store listing: how to improve the conversion in the store itself — from the icon, video listings, copy, and screenshots to localized assets. Check out how to convert more visits to installs with store listing experiments for more information on store listing tests.

Internal tests: take place inside the game and cover factors such as:

  • Technical and QA tests: focusing on improving the game performance across all devices and avoiding bugs within the game.
  • First time user experience: testing the onboarding of new players.
  • Retention and engagement: to improve these metrics in the short and long terms.
  • Monetization: to improve conversion of players or optimize ad placement and formats.

Which are the most common methodologies used?

When it comes to understanding how developers run tests, we’ll be looking at some common methodologies:

  • Google Play testing and surveys: surveys and feedback conversations with specific demographics can help during the pre-testing phases by quickly narrowing the field of issues, so later A/B tests can ask more focused questions.
  • A/B tests: two versions (A and B), identical except for one variation that might affect a user’s behavior, are compared. This is one of the most popular methodologies and is widely supported by a number of tools (a minimal significance check is sketched after this list).
  • Multi-Armed Bandit and Bayesian testing offer more advanced options that can reduce costs and enable better refinement of live tests. While not a focus of this article, these testing methods are definitely worth researching.
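As a rough illustration of the A/B methodology above, here is a minimal sketch of a two-proportion significance check on D1 retention between two variants. The player counts are hypothetical, and real tooling handles assignment, tracking, and significance testing for you.

```python
# Minimal sketch (hypothetical numbers): comparing D1 retention between two
# A/B variants with a two-proportion z-test, using only the standard library.
from math import sqrt, erf

def two_proportion_z_test(success_a, total_a, success_b, total_b):
    """Return (uplift, two-sided p-value) for two retention/conversion rates."""
    p_a, p_b = success_a / total_a, success_b / total_b
    pooled = (success_a + success_b) / (total_a + total_b)
    se = sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # normal approximation
    return p_b - p_a, p_value

# Variant A: 4,400 of 10,000 players returned on D1; variant B: 4,600 of 10,000.
uplift, p = two_proportion_z_test(4400, 10000, 4600, 10000)
print(f"D1 retention uplift: {uplift:+.1%}, p-value: {p:.3f}")
```

If the p-value stays above your chosen threshold (commonly 0.05), the test is inconclusive at that sample size, which is exactly the situation several developers describe later in this article.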

Common challenges when testing games

While testing offers several benefits, it also presents developers with challenges. Typical comments from the developers I interviewed for this article include:

“Some tests are taking too much time, we have a test which is running for more than 4 months but we don’t know whether we should kill it or we should wait till a result is found.”

-Can Hasan Gökmen, Co-Founder @ Tiramisu Games

Conclusive results are not always guaranteed, especially in free-to-play (F2P) games where the payer population is small. According to one developer, this issue resulted in up to 15% of tests being inconclusive.

“Test groups are not diverse enough, and the experiments lack statistical significance; this is the most common issue that results in rejecting a test group.”

-Hugh Binns, VP/Co-Founder @ Eight Pixels Square

Testing might also have undesired consequences in the long term.

“We tested a welcome pack in Food Street recently, although it led to an uplift in revenues, longer term LTV was lower because it took longer for the player to make a subsequent spend”

-Edward Chin, CEO @ Supersolid

Examining these common challenges, we find that they fall into the three stages of testing:

  1. Planning, how to:
  • prioritize testing ideas.
  • define the metrics to track.
  • determine a valid hypothesis.

  2. Execution, how to:
  • determine the right methodology.
  • ensure enough data from a diverse enough sample is collected.
  • decide the right timing and duration for a test.

  3. Results, how to:
  • deal with noisy or inconclusive results.
  • act after the test.

In the following sections we will cover these testing challenges and how two developers approached these questions to improve their business. First, we’ll cover the pre-launch stage and testing for engagement and retention, then move on to the post-launch stage and testing for monetization improvements.

Testing engagement during the pre-launch phase: Battle Legion

The origins of Battle Legion

Based in Tampere, Finland, and founded 10 years ago, Traplight has a clear mission to “Build fun games and have a great time doing so!”

Two years ago, the company started working on a new prototype and wanted to validate whether it was worth making the game.

Source: Traplight

After many iterations, Battle Legion was born: an asynchronous player versus player (PvP) spectator battle game. It has deep strategy elements, requires no player action during matches, and has a unique “slot machine-like” interaction for starting new matches.

Since the first iteration of the game, back in March 2018, Traplight has been able to significantly increase D1 retention, double D3 retention, and retain 4 times more users at D7.

Here is how they tested this and what they learned in the process.

Before testing

Before starting the engagement tests of the initial prototype at scale, Traplight:

  • Designed basic user acquisition (UA) creatives: 20-second video ads showcasing the gameplay. The goal was to determine cost per install (CPI) rates, as UA is key to understanding the economic viability of the project.
  • Set up the basic analytics infrastructure: to be able to extract test results in a consistent and reliable way.
  • Conducted initial usability tests on the first time user experience (FTUE) and the user interface (UI) with focus groups to get confirmation that players would not get lost, even though the game was in the early stages of development.

Plan the beta test

Next, Traplight wanted to ensure that the test data was reliable, which they started to address by standardizing all UA test audience parameters. This was extremely important, as even slight changes in audience targeting or creatives could skew the results.

For these initial tests, Traplight wanted to target games with a similar theme (fantasy) and ensure the data was relevant enough without expending huge quantities of resources. After considering the options, Traplight decided to run these tests in Brazil, a country that provided them with a big enough user pool at a reasonable cost (under $0.25 per user). According to Traplight, making sure the test data is reliable is critical and starts by standardizing UA test parameters throughout the whole test (see below).

Source: Traplight

One important note is that, together with the paid traffic, Traplight wanted to achieve as much organic traffic as possible and released the game in open beta in several markets.

Another critical step for Traplight was to set the key metrics to track during the test, while defining the objectives for each case. These clear objectives enabled the team to structure the testing cycle and prioritize the features to test based on predicted impact to the targets.

For the tests described below, Traplight set the following engagement metric targets (a minimal sketch of how such metrics can be computed from raw event data follows the list):

  • Retention on D1 of at least 40%
  • First session duration of 20mins
  • D3/D1 ratio of at least 70%
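Here is a minimal, self-contained sketch of how these engagement metrics can be derived from per-player install and session records. The data schema and numbers are illustrative, not Traplight’s actual pipeline.

```python
# Minimal sketch (illustrative schema): D1/D3 retention, the D3/D1 ratio, and
# average first-session duration from per-player install dates and sessions.
from datetime import date

# Hypothetical raw data: install date and (session_date, minutes) per player.
players = {
    "p1": {"install": date(2019, 5, 1),
           "sessions": [(date(2019, 5, 1), 22), (date(2019, 5, 2), 15), (date(2019, 5, 4), 10)]},
    "p2": {"install": date(2019, 5, 1),
           "sessions": [(date(2019, 5, 1), 18)]},
}

def retention(players, day_n):
    """Share of players with at least one session exactly N days after install."""
    returned = sum(
        1 for p in players.values()
        if any((s_date - p["install"]).days == day_n for s_date, _ in p["sessions"])
    )
    return returned / len(players)

d1, d3 = retention(players, 1), retention(players, 3)
avg_first_session = sum(p["sessions"][0][1] for p in players.values()) / len(players)

print(f"D1: {d1:.0%}  D3: {d3:.0%}  D3/D1: {d3 / d1:.0%}  "
      f"avg first session: {avg_first_session:.0f} min")
```

Comparing the output of a calculation like this against the targets above is what allows each beta to be judged consistently.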

The next step was to release the first version to public beta and iterate on the core test.

In this first set of tests, Traplight was aiming to improve short-term retention, measured by D1 and D3 retention rates.

Breaking down this process:

Beta 1

The objective of this test was to get a baseline for retention and technical testing incidents such as crash rates. For this test, the game included only basic gameplay and content for 1 or 2 days. The goal of all subsequent betas was to increase engagement metrics, such as playtime, retention, and the like.

Source: Traplight
  • Results:
    ➢ Retention D1: 44%
    ➢ Retention D3: 18%
    ➢ D3/D1 ratio: 41%
  • Learnings: D1 retention suggests a solid core, but the lack of content leads to a weak D3/D1 ratio; no major technology issues were found.

Beta 2

The objective of this test was to see how adding a basic game economy might create long-term goals for players. The features and content added for the test included an upgrade system for battle units, basic currency, and additional content.

Source: Traplight
  • Results:
    ➢ Retention D1: 51% (+7pps)
    ➢ Retention D3: 22% (+3pps)
    ➢ D3/D1 ratio: 43% (+2pps)

  • Learnings: increased grinding (performing repetitive tasks) slowed the content burn, and daily playtime jumped from 30 min to 48 min, with similar improvements seen in daily retention figures.

Beta 3

The objective of this test was to measure the impact of adding more content, primarily more battle units, and making cosmetic improvements to the UI, including a new first-time user experience.

Source: Traplight
  • Results:
    ➢ Retention D1: 48% (-3pps)
    ➢ Retention D3: 23% (+1pps)
    ➢ D3/D1 ratio: 48% (+5pps) due to the decrease of retention on D1
  • Learnings: additional content fuels progression systems, but the lack of growth suggests it doesn’t affect D1 or D3 results, as players may not reach the new content in that period. The new FTUE funnel reduced D1 retention rather than improving it.

Beta 4

The objective of this test was to measure the impact of adding timed session control, idle resources, and longer progression systems with daily targets.

Source: Traplight
  • Results:
    ➢ Retention D1: 56% (+8pps)
    ➢ Retention D3: 39% (+16pps)
    ➢ D3/D1 ratio: 69% (+21pps)
  • Learnings: time limits for sessions help to reduce content burn and ensure users are motivated to return on D3. Idle features and longer progression paths give users more to do in the initial stages, as well as considerably more end-game content, so retention improved, with bigger gains on D3 than on D1.

Beta 5

The objective of this test was to measure the impact of improvements in the first time user experience, more content, and the addition of a side quest system.

Source: Traplight
  • Results:
    ➢ Retention D1: 59% (+3pps)
    ➢ Retention D3: 41% (+2pps)
    ➢ D3/D1 ratio: 69% (+0pps)
  • Learnings: onboarding improvements and additional content continue to secure higher retention, but improvements are minimal and the D3/D1 ratio is steady, suggesting we are close to a plateau.

A common learning from all of the betas was that a meaningful upgrade system and session control affect retention and engagement the most. In the long term, having enough content is key.

These tests show that, on top of a fun and exciting prototype, before testing a game it is useful to have some basic elements in place such as onboarding, video creatives, and a basic analytics infrastructure.

Moving the needle past a 70% D3/D1 ratio is hard and, having completed this testing, Traplight shifted focus towards D7 and D30 retention.

When it comes to paid UA tests, making sure that the campaign parameters are standardized before starting is essential. Using an open beta can be very helpful to get organic downloads and increase the player audience to test features.

Lastly, deciding beforehand which metrics to test and, more importantly, the objectives to achieve is essential for prioritizing features. In the case of Battle Legion, core progression systems and soft limits had the biggest impact on short-term retention rates.

Testing monetization during post-launch: SuperScale and Tanks A Lot!

With over 70 specialists in 6 locations in Europe and America, SuperScale is a profit-scaling company that works with developers who have ambitions to be top grossing and helps them maximize their success. Among many others, SuperScale helped Boombit scale up their game Tanks A Lot!, and improve its monetization by running several tests.

Tanks A Lot! is a multiplayer online tank battle game where 3 players compete against another 3 opponents in real time.

Source: Tanks A Lot!

Why monetization tests are important

According to SuperScale, the majority of mobile games aspire to become games-as-a-service businesses. The crucial challenge developers face is that games inevitably get older, and a constant influx of fresh titles reduces the effectiveness of user acquisition campaigns, often leading to an increase in CPI.

Increasing the lifetime value (LTV) of a game becomes critical. Developers, therefore, need to work on both areas that can affect this metric: engagement and monetization.

See ‘Predicting your game’s monetization future’ for more information about calculating LTV.

Based on SuperScale’s experience, engagement and retention improvements post-release are hard to achieve, as they involve costly re-development, while monetization improvements with significant business impact are often easier to come by.

Planning the monetization test

▶ Prioritize and build a hypothesis

When it comes to monetization improvements, SuperScale often uses a “lowest hanging fruit” approach to identify opportunities to improve. To do this, it’s helpful to look at the current performance metrics and compare these to a peer benchmark. Additionally, analyzing the cumulative average revenue per user (ARPU) can quickly identify issues with the monetization of the game.

In the case of Tanks A Lot!, SuperScale realized that, while D30 payers continued to play the game, they were paying much less than expected: the average revenue per user (ARPU) curve was flattening too soon.

Source: Tanks A Lot! — ARPU curve to D63

Considering that game testing had identified no obvious technical or game design bottlenecks, the game economy seemed the most obvious area to work on, for three main reasons:

  • Changing rewards or balancing the design of game items is generally quite easy and the impact is substantial.
  • There is no need for new content, just better use of existing systems and content.
  • Economy changes often have an immediate, significant, lasting, and measurable impact on a game.

For Tanks A Lot!, SuperScale had the hypothesis that a small change in the capabilities of game items (more specifically, an update of gun power levels) could impact the whole game economy and increase LTV, and so they began testing.

Source: Tanks A Lot!

▶ Defining the metrics to test

Before diving into the tests themselves, SuperScale had to determine which metrics to optimize. While average revenue per daily active user (ARPDAU) is a popular key monetization metric, it has some disadvantages: it makes it hard to measure cannibalization or the long-term effect of a monetization change.

Instead, in SuperScale’s opinion, ARPU at a certain day (for example, D60) measured against a control group provides a better picture of the overall impact of a change and enables precise evaluation of its effects on retention and cannibalization.

In Tanks A Lot!, SuperScale therefore chose to focus on ARPU at D30 and D60 as the key metrics to optimize and the place to expect the biggest possible uplift.
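As an illustration of this metric, here is a minimal sketch of computing cumulative ARPU at a given day for a test cohort versus a control cohort. The data and numbers are hypothetical and not from Tanks A Lot!.

```python
# Minimal sketch (hypothetical data): cumulative ARPU at day N for a test
# group versus a control group, the comparison favoured here over ARPDAU.
def arpu_at_day(cohort, day_n):
    """Cumulative revenue per installed player up to and including day N."""
    total = sum(
        amount
        for player in cohort
        for day, amount in player["payments"]
        if day <= day_n
    )
    return total / len(cohort)

# Each player: list of (days_since_install, payment_amount); most pay nothing.
control = [{"payments": [(5, 4.99)]}, {"payments": []}, {"payments": [(40, 9.99)]}]
test    = [{"payments": [(3, 4.99), (35, 4.99)]}, {"payments": []}, {"payments": [(50, 19.99)]}]

for label, cohort in (("control", control), ("test", test)):
    print(f"{label}: ARPU D30 = {arpu_at_day(cohort, 30):.2f}, "
          f"ARPU D60 = {arpu_at_day(cohort, 60):.2f}")
```

Plotting this value for every day of a cohort’s life produces exactly the kind of cumulative ARPU curve shown earlier, which is how a “flattening too soon” problem becomes visible.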

Executing the test

Having reliable data is critical, so SuperScale set thresholds to be sure the sample was large enough to deliver statistically significant results (a rough sample-size sanity check is sketched after this list). More precisely, for this case the sample should include:

  • 400–500 payments.
  • 150k players.
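For a rough sense of where thresholds like these come from, here is a minimal sketch using the standard two-proportion sample-size formula. The baseline payer conversion and target uplift are hypothetical, not SuperScale’s figures.

```python
# Minimal sketch (standard two-proportion formula, hypothetical numbers):
# players needed per group to detect a relative uplift in payer conversion.
from math import ceil

def sample_size_per_group(p_base, relative_uplift, z_alpha=1.96, z_beta=0.84):
    """Players per group for ~95% confidence and ~80% power (normal approximation)."""
    p_test = p_base * (1 + relative_uplift)
    variance = p_base * (1 - p_base) + p_test * (1 - p_test)
    return ceil(((z_alpha + z_beta) ** 2) * variance / (p_test - p_base) ** 2)

# Example: ~0.5% of new players pay; we want to detect a 20% relative uplift.
n = sample_size_per_group(p_base=0.005, relative_uplift=0.20)
print(f"~{n:,} players per group, ~{ceil(n * 0.005):,} expected payers per group")
```

With a low payer conversion rate, the required number of players quickly runs into the tens of thousands per group, which is why F2P monetization tests need sample thresholds of this magnitude.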

These tests would involve changes in monetization, something players may find controversial, so community management was involved to minimize the impact of testing. To further mitigate this risk, during the economy A/B test in Tanks A Lot!, SuperScale ran the tests only with new players.

Analyzing the results

There are generally two outcomes from a test:

  • Conclusive: the test has a clear result:
    For the change: The results show you should make the change
    Against the change: The results show you should not make the change
  • Inconclusive: the test did not give clear answers, in which case go with the “gut”.

In Tanks A Lot!, SuperScale measured a +25% ARPU D60 uplift without any impact on retention. Even better, the community praised the changes on social media saying that the game felt more balanced. SuperScale, therefore, proceeded with the full roll-out.

Taking into consideration that the change was about two weeks’ work of a game economy designer, the return on investment of this change proved to be substantial.

ARPU curve at D63 before (blue) and after (red) the test

Takeaways

Effective testing has clearly made a huge difference to the lifecycle of these games, generating more revenue per day and keeping players passionate for longer. These examples illustrate the need for care when testing: how you define your A/B groups, the number of changes made between iterations, and what you take from the results are all highly sensitive factors.

When it comes to the execution of the test, having reliable data and a consistent method is critical to success. It’s also important to remember the sensitivity of players to groups being treated differently, especially with monetization tests, so plan ahead to avoid these issues.

Testing best practices

From these examples, we have learned that to test like a pro we need to follow a clear 3-step process. Here are some best practices for each of those steps:

Planning

  • Define the game’s KPIs and identify any metrics that are below market benchmarks or failing and features that can be further improved.
  • Have a clear objective and hypothesis.
  • Create a pre-testing procedure.
  • Make sure your data is reliable: set sample requirements and test for long enough to achieve them.

Execution

  • Tools and methodologies are important, so spend time deciding on these.
  • For in-game testing, have KPI thresholds to determine whether to continue with the tests and establish milestones for the test.
  • Consider assigning a budget to test marketing and UA regardless of ROI, for example, 10% for UA marketing tests.

Results

  • Run no more than 1 or 2 in-game A/B tests at the same time and ensure they are mutually exclusive so they do not impact each other (a sketch of one way to keep test groups mutually exclusive follows this list).
  • Monitor the right KPIs, with a long term metrics focus.
  • Extract key learnings and take into account that many tests (some developers mentioned around 50%) will not result in the tested change being implemented.
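One common way to keep concurrent tests mutually exclusive is deterministic bucketing of player IDs, so each experiment owns a disjoint slice of the audience. This is an illustrative sketch, not any particular SDK’s API.

```python
# Minimal sketch (illustrative, not a specific SDK): deterministic, mutually
# exclusive experiment assignment by hashing the player ID into 100 buckets,
# so two concurrent A/B tests never share players.
import hashlib

EXPERIMENTS = [                          # hypothetical layout with disjoint bucket ranges
    ("economy_test", range(0, 10)),      # 10% of players
    ("ftue_test",    range(10, 20)),     # next 10%, disjoint from the first
]

def assign(player_id: str):
    """Return (experiment, variant) for a player, or None if not enrolled."""
    bucket = int(hashlib.sha256(player_id.encode()).hexdigest(), 16) % 100
    for name, buckets in EXPERIMENTS:
        if bucket in buckets:
            variant = "A" if bucket % 2 == 0 else "B"
            return name, variant
    return None

print(assign("player-12345"))
```

Because the assignment is a pure function of the player ID, a player always sees the same variant across sessions and devices, and the untouched 80% of buckets remain available as a clean control or for future tests.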

Testing and the Google Play console

The Google Play Console offers several testing tools for before launch and live game testing.

When running tests before the launch, the Google Play Console has several distribution tracks (internal app sharing, internal testing tracks, closed testing tracks, and open testing tracks) which offer a number of benefits, such as:

  • Reviews are private to the developer.
  • Faster rollout of APKs to beta testers in the internal app sharing and internal test tracks.
  • Full reporting in the Google Play Console: from device performance to Android Vitals and user acquisition reports.
  • Ability to test conversion in the Google Play Store using store listing experiments.
  • Options to run paid acquisition campaigns.

Additionally, the production track can be useful for full user acquisition tests on live games.

Want to know more?

This article is part of a series drawn from Playtime 2019. You can find the full list of presentations here.
