Quantifying the Relationship Between App Performance and Player Retention: Part 1

Pocket Gems Tech Blog

Jul 29, 2021

By Maxim Levet

Background

Imagine an app that began to crash every other time you loaded it. You’d almost certainly use it less. But exactly how much less? 5%? 50%? Would you stop using it altogether? Conversely, what if it went from crashing occasionally to never crashing at all? Would you use it more, or would you not even notice the difference? As a mobile gaming company, we need to understand these connections so that we can prioritize our engineering resources toward the performance improvements that will make the biggest positive impact for our players.

As such, we decided to tackle this problem by building a machine learning model to predict player retention as a function of each player’s technical performance metrics. From there, we took two separate approaches to interpreting the results. The first method used what are known as “Partial Dependence Plots” (PDPs). In this article (Part 1 of 2), we’ll describe this first method, discuss some of its advantages and shortcomings, and explain why we ultimately decided to try a different approach.

Methodology

A/B Testing: A Non-Starter

Before we delve into the details, you might be wondering why we didn’t just run an A/B test. Isn’t that the simplest and most robust way to figure out whether some change in the app results in a statistically significant difference in the metric(s) you care about?

You’re right — it is.

Then, what are we doing here? Well, let’s think this through. If we were to A/B test how an increase in crash rates might affect retention, we’d literally have to make the app crash more on purpose in the treatment group and then compare the results with those of the control group. Try pitching that: “Hey, so, we’d like to make the app worse, on purpose, so that we can figure out exactly how much our players dislike the experience.” Absolutely not!

Ok, well, what if we did the opposite? That is, make the app better and measure how much our players like it. That’s absolutely a worthwhile goal, but it could take months of engineering work to accomplish and might end up making no difference at all! We would have wasted valuable time that could have been spent on other upgrades that meaningfully improve the experience of our players. In essence, the goal is to figure all of this out before the engineering work is done so that we can put our effort towards the projects with the highest impact.

The Model

Now that we’ve shown why A/B testing wouldn’t have suited our needs here, let’s discuss how we actually went about solving our problem.

To begin, we built a machine learning model to predict each player’s retention rate as a function of several technical performance metrics as well as other device-based features. We used LightGBM, but, in theory, we could have chosen any supervised learning algorithm. Then, we trained the model using real-world data so that it could learn from the natural variation in these metrics to tease out the relationship between performance and retention without a formal A/B test.
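To make that concrete, here’s a minimal sketch of what fitting such a model could look like, using the scikit-learn-style `LGBMClassifier` interface and purely synthetic data with hypothetical feature names (an illustration of the setup, not our production pipeline):

```python
import lightgbm as lgb
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_players = 10_000

# Purely synthetic stand-in data: one row per player, with hypothetical
# feature names. None of this reflects real game data.
players = pd.DataFrame({
    "crash_rate": rng.beta(1, 50, n_players),           # fraction of sessions that crashed
    "device_ram_gb": rng.choice([2, 3, 4, 6, 8], n_players),
    "avg_load_time_s": rng.gamma(2.0, 2.0, n_players),  # average load time in seconds
})
# Synthetic label: 1 if the player "retained" on day 7, else 0.
players["retained_d7"] = rng.integers(0, 2, n_players)

X = players[["crash_rate", "device_ram_gb", "avg_load_time_s"]]
y = players["retained_d7"]

# Gradient-boosted trees; predict_proba gives each player's expected retention rate.
model = lgb.LGBMClassifier(n_estimators=300, learning_rate=0.05)
model.fit(X, y)
expected_retention = model.predict_proba(X)[:, 1]
```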

Next, let’s discuss how we interpreted the outputs to get the answers we were looking for.

Partial Dependence Plots

We used PDPs for our first attempt to interpret the output of the model and better understand the relationship between performance and retention.

Basic Concept

Fundamentally, PDPs are pretty straightforward. To explain how they work, let’s use a simple example. Imagine our model only used two features, crash rate and device RAM, to predict a player’s retention rate. In functional form, it would look like this:

Expected Retention = f(Crash Rate, Device RAM)

Eq. 1: Hypothetical model in functional form

In other words, plug in a crash rate and a RAM value and presto! The model spits out an expected retention rate.

But, what if we plugged in some data? Below are some hypothetical players along with the associated predictions made by our model:

Fig. 2: Predictions from hypothetical model

Note:

  • We’re using D7 retention above, but in theory, we could have used any day. (Any player who logs a session on the Xth day after install is defined as having “retained” on DX; see the sketch after this note for how that label can be computed.)
  • The above data are made up and are for demonstrative purposes only. As such, they do not necessarily reflect the true performance of any of our games.
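For reference, here’s a minimal sketch of how a DX label like that can be computed from install and session dates; the table layout and values are hypothetical:

```python
import pandas as pd

# Hypothetical logs: one row per session, and one install date per player.
sessions = pd.DataFrame({
    "player_id": [1, 1, 2, 2, 3],
    "session_date": pd.to_datetime(
        ["2021-07-01", "2021-07-08", "2021-07-01", "2021-07-03", "2021-07-02"]
    ),
})
installs = pd.DataFrame({
    "player_id": [1, 2, 3],
    "install_date": pd.to_datetime(["2021-07-01", "2021-07-01", "2021-07-02"]),
})

# Days between each session and the player's install.
merged = sessions.merge(installs, on="player_id")
merged["day_offset"] = (merged["session_date"] - merged["install_date"]).dt.days

# D7 retention: did the player log a session exactly 7 days after install?
retained_d7 = (
    merged["day_offset"].eq(7)
    .groupby(merged["player_id"])
    .any()
    .astype(int)
)
print(retained_d7)  # player 1 -> 1, players 2 and 3 -> 0
```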

Implementation

To build the PDP for the feature crash rate and plot how the model sees the relationship between retention and crash rate, we’ll take the above results and:

  1. For each data point:

     a. Vary the crash rate from its minimum value to its maximum value, while holding the RAM constant.

     b. Plot the resulting retention predictions as a single curve.

  2. Average the individual curves we just plotted to generate the final PDP (a from-scratch sketch of these two steps is shown below).
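Here’s a from-scratch sketch of those two steps, reusing the hypothetical `model` and `X` from the earlier sketch (an illustration of the idea, not the code we actually ran):

```python
import matplotlib.pyplot as plt
import numpy as np

feature = "crash_rate"  # hypothetical column name
grid = np.linspace(X[feature].min(), X[feature].max(), 50)

# Step 1: one curve per player -- sweep crash_rate across the grid while
# holding each player's other features (e.g. device RAM) constant.
curves = np.empty((len(X), len(grid)))
for j, value in enumerate(grid):
    X_swept = X.copy()
    X_swept[feature] = value
    curves[:, j] = model.predict_proba(X_swept)[:, 1]

# Step 2: the PDP is the average of the individual curves.
pdp_curve = curves.mean(axis=0)

plt.plot(grid, curves[:200].T, color="steelblue", alpha=0.1)  # a sample of individual curves
plt.plot(grid, pdp_curve, color="yellowgreen", linewidth=3, label="PDP (average)")
plt.xlabel("Crash rate")
plt.ylabel("Predicted D7 retention")
plt.legend()
plt.show()
```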

Pretty easy, right? To do this, we used a package called `pdpbox`. Here’s the basic Python code:
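(A minimal sketch, assuming the `pdpbox` 0.2.x API and the hypothetical `crash_rate` column used above.)

```python
from pdpbox import pdp
import matplotlib.pyplot as plt

# `model` was fit on the feature set `X` and the observed retention labels `y`
# (see the earlier training sketch).
pdp_crash = pdp.pdp_isolate(
    model=model,
    dataset=X,
    model_features=list(X.columns),
    feature="crash_rate",
)

# Plot the individual curves plus their average (the PDP itself).
fig, axes = pdp.pdp_plot(
    pdp_crash,
    feature_name="crash_rate",
    plot_lines=True,    # draw the per-player curves
    frac_to_plot=0.1,   # plot a 10% sample of them to keep the figure readable
)
plt.show()
```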

Where `model` is our fitted LightGBM model, `X` is a Pandas DataFrame of the feature set on which the model was trained, and `y` is a Pandas DataFrame of the observed retention of every player in the feature set.

Below is the resulting plot after some slight modifications in `matplotlib` to clean it up a bit.

Fig. 3: PDP for hypothetical model

Note:

  • The thinner blue curves are the individual conditional expectation curves, while the thicker yellow-green curve is the average of all of those individual curves. The latter is ultimately what we care about.
  • Once again, the above data are for demonstrative purposes only and do not necessarily reflect the true performance of any of our games.

So, that’s it? According to Fig. 3, it looks as though reducing crash rates from 2% to 1% would lift D7 retention from 10% to 13%. In actuality, however, it’s not quite that simple.

Disadvantages

PDPs may be straightforward to understand and easy to plot, but they have a major disadvantage: they require that the given feature not be correlated with any other features in your model. In practice, this is often not the case. Just take another look at Fig. 2. It’s pretty clear that crash rate and device RAM are correlated (in this case, negatively correlated).

And why does it matter whether or not features are correlated? Because when we draw the curves, we force the model to predict on some unlikely, or even unreasonable, feature combinations. For example, take the first data point in Fig. 2: when plotting that player’s individual curve, we’d have the model generate predictions for crash rates as high as 20%. But, with a device RAM of 4 GB, it’s incredibly unlikely that this player would ever have a crash rate that high. This makes any predictions in that range unreliable at best and throws the accuracy of the entire PDP into doubt.
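A quick sanity check before trusting a PDP is to measure how strongly the feature of interest correlates with the rest of the feature set (again using the hypothetical columns from the sketches above):

```python
# Correlation between crash_rate and every other feature. A strong correlation
# (e.g. a negative one with device_ram_gb) warns that the PDP will be averaging
# predictions over feature combinations that rarely, or never, occur in real data.
correlations = X.corr()["crash_rate"].drop("crash_rate")
print(correlations.sort_values())
```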

Parting Thoughts

Estimating the effect of a change in the app without being able to A/B test it is challenging. With that said, we feel that our approach here was promising. By using some machine learning and a bit of clever statistics, we created a model and sought to extract meaningful insights about performance and retention from it, all with the ultimate goal of using that information to better allocate engineering resources. Ultimately, it didn’t quite get us where we needed to go, primarily due to the PDP’s strict requirement that features be uncorrelated (a requirement that is difficult to satisfy in the real world).

With this in mind, join us next time for Part 2 of this series where we’ll discuss a new — and we think better — approach to solving this problem… See you then!

***

We hope you found this insightful! If you’d like to learn more about what we do at Pocket Gems, take a look at some of our other blog posts. Or if you’re interested in joining Pocket Gems, we’re hiring!
