How to Use an Obscure Statistical Technique to Market Your Games

Pedro Carvalho
Wildlife Studios Tech Blog
11 min read · May 6, 2020

Arguably the single most important problem that a data scientist working in mobile gaming will try to solve is user lifetime value (LTV) prediction.

First, it’s necessary for us to know what the actual profitability or size of the game is. A number of decisions within the company hinge on this: how many people to allocate to its continuous development, whether it was worth the investment, what kinds of features drive players the most in the game.

We should also think of the shareholders themselves: the people who have a very real stake in the success of the games and who want to see their investments bear fruit. In the long term, they need reliable figures to assess which projects to back and support.

Those reasons would probably be enough to at least give you pause. This blog post, however, focuses on the third great use for LTV prediction — namely, user acquisition, and, more specifically, how to use a modified Kalman Filter to generate those predictions for users that have yet to install our apps.

User Acquisition

You may create the single most interesting, entertaining, engrossing, and well-thought-out mobile game out there, but if you have no way to tell people about it then no one will play it. That’s where user acquisition — in the form of advertising — comes in. From the very first users to install your app to users who only downloaded it because they saw it on a ranking list, almost everything has come from someone having seen ads for it somewhere.

That’s all well and good, but what does that have to do with user lifetime value prediction, you might ask? Everything, dear reader.

There are many details of the user acquisition ecosystem that I will be glossing over to explain this to you, but the fundamental connection between those things is quite simple: to show an ad somewhere, you need to pay someone, and to decide how much money you’re willing to spend on that ad, you need to know how much money that’s going to net you in return. That is a simplistic way to put it, but I do think it gets to the heart of it.

Projected versus expected LTV

It is perhaps not immediately clear, but upon some reflection, I think you will agree that user acquisition efforts need to know player LTV in a somewhat different way than shareholders or other people making resource allocation decisions. To acquire users through some medium you need to have some model of how much money future users you’ve never interacted with before are expected to give you. No such constraint applies to generating projections of game profitability, where you can in fact observe user behavior for a while and use that data to great effect in order to create those estimates.

Here at Wildlife Studios, we have different names for those two results:

  • The projected lifetime value (pLTV) project aims to create very precise predictions of how much money our currently-existing user base is likely to spend on our game.
  • The expected lifetime value (eLTV) project tries to guess how much money we should expect to get from the average user we will acquire, in the future, from the various sources we work with.

It’s straightforward to picture how one would go about generating our pLTV estimates. Even if you are not personally acquainted with predictive algorithms, I’m sure you can agree that there must be some amount of data plus buzzwords like “machine learning” and “regression” and “AI” that would allow one to infer the future from the past. And I don’t want to imply that that is not an interesting or difficult problem; au contraire, it is so interesting and difficult that I’m not even allowed to talk about it openly.

But when we’re talking about predicting the unseen, the revenue of users future, where to even begin?

Underlying values and how to guess them

Imagine, if you will, that you had an oracle of user LTV, but only after they installed your game. Equivalently, imagine your pLTV model is amazing and you trust it implicitly.

Then imagine that there’s some existing ad campaign you’ve been acquiring users with — it could be, say, some video that’s played on Facebook to a certain relevant demographic of users. And so, of course, you can use your oracle once they’ve downloaded your game to know exactly how much money each of those users will spend on it until they stop playing it.

Suppose one day 150 users download your game from that campaign, and they will spend, on average, $5 on your game before they stop playing (according to the oracle). The next day, 200 users download your game and they will spend an average of $4. The next day, 100 users with an average LTV of $4.50. What would you guess users on the fourth day would spend, on average?

  1. $0.50
  2. $20
  3. $4.50
  4. $0

Of course the fundamental answer is that you don’t know for sure. But you sure do have some intuition here, don’t you? Or, put it another way: you would be pretty surprised if you found out that the fourth-day users had an LTV of $50 on average, wouldn’t you?

Users who installed your game after seeing that specific video probably have something in common. It doesn’t really matter what, and it doesn’t even really matter whether Facebook had something to do with it. We could spend all day speculating about the psychology of users who are more interested in simplistic banners while playing sudoku versus those who will only be interested if they see five minutes of gameplay before a YouTube video. If you have reliable information about the average LTV of past users from a given segment, guessing that future users will be similar to them seems like a very safe bet.
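As a rough illustration of that intuition (a naive baseline, not the method this post builds toward), one could weight each day's average LTV by its number of installs. The numbers are the ones from the example above; the variable names are made up for this sketch.

```python
# (installs, average LTV in dollars) for the three days in the example.
cohorts = [(150, 5.00), (200, 4.00), (100, 4.50)]

total_installs = sum(n for n, _ in cohorts)
# Install-weighted average: bigger cohorts count for more.
weighted_avg = sum(n * ltv for n, ltv in cohorts) / total_installs

print(f"Naive guess for day four: ${weighted_avg:.2f}")  # $4.44
```

Of the four options listed, $4.50 is the one closest to this baseline.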

The Kalman Filter

I am going to take a pause in the revenue chat to talk about a remote-controlled truck.

Imagine you own one. However, you don’t have a camera or anything like that; instead, you’re tracking it through GPS. Now, you of course need to actually drive that truck, turn when it needs to turn, etc. So it would be very useful to have an accurate idea of where this truck is at any given time.

The truck, at a given time i, is at position xᵢ and has velocity vᵢ. You have, of course, direct access to its initial position x₀ and its initial velocity v₀ = 0. You also have access to estimated position data from the three satellites currently tracking the truck, which you could call gᵢ, and your remote control, which applies some acceleration aᵢ to it.

There is also, of course, noise in all parts of the system. In particular, you could say the GPS estimates are normally distributed around the true value:

At any given instant, the velocity is given by:

(Because there is some noise also in transmitting your acceleration to the truck). Finally, the (true) position is given by:
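Written out in one possible notation, the three relations described above might look like the following. The noise scales σ_g (GPS) and σ_a (acceleration) are names assumed here for illustration, as are the unit time step and the choice to update the position with the new velocity.

```latex
\begin{aligned}
g_i &\sim \mathcal{N}\left(x_i,\ \sigma_g^2\right) \\
v_i &\sim \mathcal{N}\left(v_{i-1} + a_i,\ \sigma_a^2\right) \\
x_i &= x_{i-1} + v_i
\end{aligned}
```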

This very simple model is an example of the setting in which what is known as the Kalman Filter applies: a linear system with Gaussian noise.

In the literature and normal explanations you will find notation that I personally think is somewhat more obscure than the above, but it is still fundamentally the same: you have some hidden variable (such as, in our case, xᵢ) that you want to reason about, some way to noisily observe it (gᵢ), some way to control it (aᵢ), and some transition relation that turns all of those things at time i - 1 into the unobserved variable’s state at time i. And in the end, you can use all of that to find some posterior distribution for the unobserved variable, conditioning on your previous observations and control inputs.
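To make that concrete, here is a minimal sketch of the truck example in plain Python. It assumes a unit time step, a velocity that changes by the commanded acceleration plus Gaussian noise, a position that advances by the new velocity, and a GPS reading equal to the true position plus Gaussian noise. The function names, noise levels, and control schedule are all hypothetical.

```python
import random


def kalman_step(x, v, P, a, g, var_a, var_g):
    """One predict/update cycle for the truck.

    Assumed model: v_i = v_{i-1} + a_i + noise, x_i = x_{i-1} + v_i,
    and GPS reading g_i = x_i + noise.
    P = (pxx, pxv, pvv) holds the entries of the 2x2 state covariance.
    """
    pxx, pxv, pvv = P
    # Predict: push the estimate through the transition relation.
    v_pred = v + a
    x_pred = x + v_pred
    # The acceleration noise enters both velocity and position.
    pxx_pred = pxx + 2 * pxv + pvv + var_a
    pxv_pred = pxv + pvv + var_a
    pvv_pred = pvv + var_a
    # Update: blend the prediction with the noisy GPS reading.
    s = pxx_pred + var_g
    kx, kv = pxx_pred / s, pxv_pred / s
    resid = g - x_pred
    x_new = x_pred + kx * resid
    v_new = v_pred + kv * resid
    P_new = ((1 - kx) * pxx_pred, (1 - kx) * pxv_pred, pvv_pred - kv * pxv_pred)
    return x_new, v_new, P_new


def simulate(steps=200, sigma_a=0.1, sigma_g=5.0, seed=0):
    """Drive a simulated truck and compare raw GPS error with filter error."""
    rng = random.Random(seed)
    x_true = v_true = 0.0
    x_est = v_est = 0.0
    P = (0.0, 0.0, 0.0)  # initial position and velocity are known exactly
    gps_sq = filt_sq = 0.0
    for i in range(steps):
        a = 0.5 if i < 20 else 0.0  # hypothetical control input
        v_true += a + rng.gauss(0.0, sigma_a)
        x_true += v_true
        g = x_true + rng.gauss(0.0, sigma_g)
        x_est, v_est, P = kalman_step(
            x_est, v_est, P, a, g, sigma_a**2, sigma_g**2
        )
        gps_sq += (g - x_true) ** 2
        filt_sq += (x_est - x_true) ** 2
    return (gps_sq / steps) ** 0.5, (filt_sq / steps) ** 0.5


gps_rmse, filter_rmse = simulate()
print(f"GPS RMSE: {gps_rmse:.2f}, filter RMSE: {filter_rmse:.2f}")
```

The posterior mean of the position (x_est) should track the truck more closely than the raw GPS readings do, because it also uses the control inputs and every earlier observation.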

LTV as an underlying value

The title of this section probably gives the whole game away: we can model LTV as a hidden variable we’re trying to infer based on our past observations using the Kalman Filter I just talked about. We will, however, simplify the model by a lot and then take a slight left turn somewhere else.

The two main simplifications are: 1. we do not assume we have any fine-grained control over LTV and 2. it has no terms of long-term trend or seasonality. These things imply that the evolution of a given segment’s LTV over time (in this case, each “instant” of the model is one day) is a random walk, with no control term and with the state transition matrix being the identity matrix:
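In symbols, and using notation assumed here rather than taken from the original (ℓ_i for a segment's underlying LTV on day i, σ_w² for the variance of its day-to-day drift), that random walk could be written as:

```latex
\ell_i \sim \mathcal{N}\left(\ell_{i-1},\ \sigma_w^2\right)
```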

This “underlying value”, however, is not the same as the observed LTV of a given segment on a given day. Rather, you should think of it as the “expected value of an average user from that cohort”. Another way to think about it is what that cohort’s LTV would be if it had contained infinite people.

For any given day that we have some users from some segment installing our game, a lot of stuff we might want to term “noise” may happen. So we could plausibly say that, actually, the observed LTV of each given specific user follows some distribution around this “true”, underlying value.

We run into a little bit of trouble here, though, because while I could maybe handwave something for the random walk that the LTV itself follows, it would be patently absurd to say that observed LTV is normally distributed around the true LTV — for one, a large number of users, probably a majority, will just never spend a single cent on the game.

Let’s take my aforementioned left turn here and think about the Central Limit Theorem: for a sufficiently large number of installs on a given day, Nᵢ, the average of all of those users’ LTVs will in fact be normally distributed:
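Assuming the individual LTVs within a cohort are independent, with mean equal to the underlying value ℓ_i and some finite per-user variance σ_o² (both symbols introduced here for illustration), the statement above could be written for the cohort's average observed LTV, ō_i, as:

```latex
\bar{o}_i \;\overset{\text{approx.}}{\sim}\; \mathcal{N}\left(\ell_i,\ \frac{\sigma_o^2}{N_i}\right)
```

The variance shrinks with N_i, which is why larger cohorts end up being more informative.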

That lets us dodge one bullet, but there’s another one coming our way: we don’t have recent observed LTVs! We have only recent observations of average revenue per user (ARPU), and for a fairly low number of days since install at that! We can’t bid on users based on what recent users have spent on their first day, or if we’re charitable first week, of play.

But we do have something I mentioned earlier to save us: existing users’ projected LTV. We can do a similar Central Limit Theorem trick here:

And we can skip the middleman there and integrate the observed revenue right out:
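One way to spell out those two steps, under assumed notation: p̄_i is the cohort's average projected LTV, ō_i its (unseen) average observed LTV, ℓ_i the underlying value, and σ_p² and σ_o² are hypothetical per-user variances for the projection error and for the observed LTV respectively.

```latex
\begin{aligned}
\bar{p}_i \mid \bar{o}_i &\sim \mathcal{N}\left(\bar{o}_i,\ \frac{\sigma_p^2}{N_i}\right),
\qquad
\bar{o}_i \mid \ell_i \sim \mathcal{N}\left(\ell_i,\ \frac{\sigma_o^2}{N_i}\right) \\
p(\bar{p}_i \mid \ell_i) &= \int p(\bar{p}_i \mid \bar{o}_i)\, p(\bar{o}_i \mid \ell_i)\, d\bar{o}_i \\
\bar{p}_i \mid \ell_i &\sim \mathcal{N}\left(\ell_i,\ \frac{\sigma^2}{N_i}\right),
\qquad \sigma^2 = \sigma_o^2 + \sigma_p^2
\end{aligned}
```

The last line holds because two independent Gaussian errors stacked on top of each other add their variances.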

What we really want here, though, is not the underlying LTV for some cohort of past users who are already in our game; we want, instead, the LTV of future users. According to our model:

What we have is the sequence of projected LTVs. But then we can just do a little bit of maths here:

The step from the first to the second line there is straightforwardly a consequence of the conditional independence embedded into our model: tomorrow’s (true) LTV is independent from today’s projected LTV and yesterday’s projected LTV and so on when conditioned on today’s (true) LTV.
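In assumed notation (ℓ_n for the true LTV on day n, p̄_{0:n} for the sequence of average projected LTVs up to day n), the two lines in question would plausibly read:

```latex
\begin{aligned}
p(\ell_{n+1} \mid \bar{p}_{0:n})
  &= \int p(\ell_{n+1} \mid \ell_n, \bar{p}_{0:n})\, p(\ell_n \mid \bar{p}_{0:n})\, d\ell_n \\
  &= \int p(\ell_{n+1} \mid \ell_n)\, p(\ell_n \mid \bar{p}_{0:n})\, d\ell_n
\end{aligned}
```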

Now, entertain me for a second and pretend that the probability density function of today’s true LTV, conditioned on every projected LTV we have seen so far, is actually a normal distribution with some unknown mean μₙ and some unknown variance.

Then, it is actually pretty straightforward to show that:

So actually, our best guess for tomorrow’s true LTV… is the same as today! This, of course, follows trivially from our model definition as a random walk: if we have no further observations about tomorrow’s LTV and all that we know about it is that today’s expected LTV was μₙ, then, of course, that’s what we’ll think about tomorrow. So now all that’s left is to prove that that huge mess up there is actually normally distributed.
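As a sketch of that claim in assumed notation (ℓ_n for today's true LTV, p̄_{0:n} for the projected LTVs seen so far, s_n² for the unknown variance, σ_w² for the random-walk variance): convolving a normal posterior with a normal random-walk step leaves the mean alone and adds the variances.

```latex
\begin{aligned}
\ell_n \mid \bar{p}_{0:n} &\sim \mathcal{N}\left(\mu_n,\ s_n^2\right) \\
\ell_{n+1} \mid \bar{p}_{0:n} &\sim \mathcal{N}\left(\mu_n,\ s_n^2 + \sigma_w^2\right)
\end{aligned}
```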

Spoilers: it is.

The second term of the right-hand side there on the third line looks suspiciously like something we’ve seen before, doesn’t it?

So we’ve come full circle, and discovered a recurrence relation:
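One way to write that recurrence, again in assumed notation (ℓ_n for the true LTV on day n, p̄_n for that day's average projected LTV), is via Bayes' rule followed by the same conditional-independence argument as before:

```latex
\begin{aligned}
p(\ell_n \mid \bar{p}_{0:n})
  &\propto p(\bar{p}_n \mid \ell_n, \bar{p}_{0:n-1})\, p(\ell_n \mid \bar{p}_{0:n-1}) \\
  &= p(\bar{p}_n \mid \ell_n)\, p(\ell_n \mid \bar{p}_{0:n-1}) \\
  &= p(\bar{p}_n \mid \ell_n) \int p(\ell_n \mid \ell_{n-1})\, p(\ell_{n-1} \mid \bar{p}_{0:n-1})\, d\ell_{n-1}
\end{aligned}
```

The density inside the integral is the previous day's posterior, so each day's posterior is defined in terms of the one before it.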

Let’s start this from 0, then. As sayeth Bayes:

If we give our LTV sequence a prior, something like:

Then it follows immediately that:

With the following definitions:
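A plausible form for this base case, with the prior parameters μ_prior and s_prior² and the per-user noise variance σ² all being names assumed for this sketch, is the standard normal-normal conjugate update:

```latex
\begin{aligned}
\ell_0 &\sim \mathcal{N}\left(\mu_{\text{prior}},\ s_{\text{prior}}^2\right) \\
p(\ell_0 \mid \bar{p}_0) &\propto p(\bar{p}_0 \mid \ell_0)\, p(\ell_0)
\;\Longrightarrow\;
\ell_0 \mid \bar{p}_0 \sim \mathcal{N}\left(\mu_0,\ s_0^2\right) \\
\frac{1}{s_0^2} &= \frac{1}{s_{\text{prior}}^2} + \frac{N_0}{\sigma^2} \\
\mu_0 &= s_0^2 \left( \frac{\mu_{\text{prior}}}{s_{\text{prior}}^2} + \frac{N_0\, \bar{p}_0}{\sigma^2} \right)
\end{aligned}
```

Here p̄_0 is the first cohort's average projected LTV and N_0 its number of installs. Precisions (inverse variances) add, and the posterior mean is the precision-weighted average of the prior mean and the observation.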

Something begins to take form! Now let’s go on to the next step.

We can have our full definitions, then, for the recursive relation.

With the definitions for the case where i = 0 given above. And, of course, it follows that for the “future” LTV we’re estimating:
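Under the assumptions used throughout (a random walk with variance σ_w², and average projected LTVs p̄_i that are normal around the true value with variance σ²/N_i, all in notation assumed for this sketch), the recursion for i ≥ 1 and the resulting forecast would take roughly this form:

```latex
\begin{aligned}
\frac{1}{s_i^2} &= \frac{1}{s_{i-1}^2 + \sigma_w^2} + \frac{N_i}{\sigma^2} \\
\mu_i &= s_i^2 \left( \frac{\mu_{i-1}}{s_{i-1}^2 + \sigma_w^2} + \frac{N_i\, \bar{p}_i}{\sigma^2} \right) \\
\ell_{n+1} \mid \bar{p}_{0:n} &\sim \mathcal{N}\left(\mu_n,\ s_n^2 + \sigma_w^2\right)
\end{aligned}
```

Each day, yesterday's posterior is widened by σ_w² and then combined with the new cohort's average, weighted by how many installs that cohort had.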

This and more at a mobile gaming company near you

At a higher abstraction level, something like the above result makes pretty intuitive sense. As in the example I gave, we always expect that:

  1. Future users will be similar to past ones.
  2. Cohorts with more installs provide more information than cohorts with fewer installs.
  3. More recent cohorts are more relevant than ones further in the past.

So this modified Kalman Filter can be interpreted as a sort of “moving average” that encodes those assumptions (plus some other more technical ones but anyway). That’s pretty pleasant both from a computational perspective and from a business logic perspective, as it resonates with what one would assume a priori.
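The "moving average" reading can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the production implementation: the function name, the parameter names, and every numeric value are hypothetical. Here `sigma2` stands for the per-user noise variance and `sigma_w2` for the day-to-day random-walk variance.

```python
def eltv_forecast(cohorts, prior_mean, prior_var, sigma2, sigma_w2):
    """Forecast tomorrow's underlying LTV for one segment.

    cohorts: list of (n_installs, avg_projected_ltv), oldest day first.
    Returns (mean, variance) of the forecast distribution.
    """
    mean, var = prior_mean, prior_var
    for day, (n_installs, avg_pltv) in enumerate(cohorts):
        if day > 0:
            var += sigma_w2  # the underlying LTV drifts between days
        if n_installs == 0:
            continue  # nothing observed today; only uncertainty grows
        obs_var = sigma2 / n_installs  # larger cohorts are less noisy
        gain = var / (var + obs_var)   # how far to move toward today's data
        mean += gain * (avg_pltv - mean)
        var = gain * obs_var           # same as (1 - gain) * var, but stabler
    # One more random-walk step to get from today to tomorrow.
    return mean, var + sigma_w2


# The three-day example from earlier, with made-up noise parameters.
days = [(150, 5.00), (200, 4.00), (100, 4.50)]
mean, var = eltv_forecast(days, prior_mean=4.0, prior_var=1.0,
                          sigma2=100.0, sigma_w2=0.01)
print(f"Forecast eLTV: {mean:.2f} (std {var ** 0.5:.2f})")
```

With `sigma_w2 = 0` and a very wide prior this collapses to the plain install-weighted average. A positive `sigma_w2` makes older cohorts count for less, which is assumption 3 in the list above.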

As a bonus, you don’t even need all of it if you assume your user acquisition operation is risk-neutral, in which case you are only interested in the expected values and not in any tail risks.

This is just a small (but still quite central) part of the much larger revenue prediction project at Wildlife Studios, and it was entirely developed by data scientists. It’s the sort of work my colleagues and I do on a day-to-day basis, and I personally find it quite exciting. Who knows what we’ll do in the future?

(Me. I do. But I’m not telling. Shhh.)

Pedro Carvalho
Wildlife Studios Tech Blog

Data Scientist at Wildlife Studios, aspiring rationalist, tireless nitpicker, opinionated meddlesome kid.