How We Strengthened Our Approach to Predicting Lifetime Value

Sylvia Wu
Pocket Gems Tech Blog
6 min read · Jul 28, 2020
War Dragons Creative Ad

There is no 100% accurate way to predict user behavior, but we get as close as possible by continually updating our process. We focus on estimating lifetime value (LTV), the total monetary value a player generates within a certain timespan after installation, typically referred to as the “payback window.” LTV is the central factor in much of our decision-making. For example, it determines how much the marketing team can spend on acquiring new users. It also serves as a way to measure each mobile game’s overall performance and helps teams identify user subsets that might benefit from special attention.

Getting an early estimate of LTV is crucial for us to quickly react to changes in user behavior instead of waiting several months for the full payback window to elapse. We accomplish this by building machine learning models for each of our games. One of our greatest hits, “War Dragons,” has an LTV model that we’ve greatly improved. We’ll go through that progression in detail for others who may be configuring their own LTV strategy.

Model

For illustration purposes, let’s assume that the payback window of “War Dragons” is 180 days. If we want to make an LTV prediction seven days after the user installs the game, then the problem can be modeled as a regression of the form y = f(X), where X is a vector of features that summarizes user behavior during the observation window (Day 0 to Day 7) and the target variable y is the cumulative revenue that the user generates during the prediction window (Day 8 to Day 180).
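As a concrete (and purely hypothetical) illustration of how such training data could be assembled, the sketch below builds X and y from a per-user, per-day activity table. The file name and columns (daily_revenue.csv, user_id, day_since_install, revenue, sessions) are assumptions for the example, not our actual schema.

```python
import pandas as pd

# Hypothetical per-user, per-day table: user_id, day_since_install, revenue, sessions
daily_rev = pd.read_csv("daily_revenue.csv")

OBS_END, PAYBACK_END = 7, 180  # observation window and payback window, in days

# Features X: aggregate behavior observed during Day 0 to Day 7
obs = daily_rev[daily_rev["day_since_install"] <= OBS_END]
X = obs.groupby("user_id").agg(
    d7_revenue=("revenue", "sum"),
    d7_sessions=("sessions", "sum"),
)

# Target y: cumulative revenue generated during Day 8 to Day 180
pred_window = daily_rev[
    (daily_rev["day_since_install"] > OBS_END)
    & (daily_rev["day_since_install"] <= PAYBACK_END)
]
y = pred_window.groupby("user_id")["revenue"].sum().reindex(X.index, fill_value=0.0)
```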

Challenges

There are three main limitations when it comes to predicting LTV for “War Dragons.”

  1. The process requires us to profile user behavior within a small observation window and then make predictions stretching throughout the entire user life cycle. This leaves us making long-term predictions based on short-term data.
  2. Free-to-play games get the majority of their revenue from a small portion of users. This often results in extremely skewed and high variance distributions of LTV.
  3. We don’t know the target variable until 180 days after installation, so our LTV model will always be at least six months out of date.

Baseline Performance

We built our initial model using LightGBM, a fast and powerful gradient-boosting package. We included features such as country, device type, total day-7 sessions, and cumulative day-7 revenue. The table below summarizes the performance of this model.
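As a rough sketch of what such a baseline looks like in code, assuming a per-user frame df containing the features named above and the realized 180-day LTV (the column names and hyperparameters here are illustrative, not our production configuration):

```python
import lightgbm as lgb
from sklearn.model_selection import train_test_split

# Hypothetical frame: one row per user, baseline features plus realized 180-day LTV
df["country"] = df["country"].astype("category")
df["device_type"] = df["device_type"].astype("category")
features = ["country", "device_type", "d7_sessions", "d7_revenue"]

X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["ltv_180"], test_size=0.2, random_state=42
)

# LightGBM's default objective is L2 (mean squared error) regression
baseline = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.05)
baseline.fit(X_train, y_train)
predicted_ltv = baseline.predict(X_test)
```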

Baseline Model Performance. Notes: (1) all metrics are calculated on out-of-sample data; (2) errors are normalized by the mean; (3) the NRMSE over the entire user base is much higher than over purchasers only, because even slight over-prediction for users with zero LTV inflates NRMSE dramatically; restricting the calculation to purchasers yields a significantly lower NRMSE.

The initial model was a great start, but we soon became dissatisfied with its systematic 14% over-prediction. About a year later, we set out to improve it.

Model Improvements

Feature Engineering

The original features were mostly snapshots of the user’s status on day seven, including things like levels reached, number of sessions, and total spending. As we explored the input data and investigated the prediction results, we found that users with identical snapshot features could behave quite differently during the observation window, and end up with different LTV as a result.

Since the short observation window provides such limited data, we decided that deeper feature engineering was the most efficient way to extract characteristics of player behavior. We categorized those characteristics into three dimensions: Retention, Monetization, and Engagement/Social Interaction. We hypothesized that the model would also benefit from additional information in those realms to capture user progression, so we added more dynamic features that account for changes in player behavior over time. Some examples include:

  • Gradient of daily purchases: Does this user spend more or less over time?
  • Standard deviation of the number of daily sessions: Does this user log in and play every day as a habit, or only when there’s an event?

These features turned out to be strong indicators of LTV and greatly improved the model’s performance.
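As an illustrative sketch (not our production pipeline), dynamic features like these could be derived from a per-user, per-day activity table; the table and column names below are assumptions:

```python
import numpy as np
import pandas as pd

def slope(series: pd.Series) -> float:
    """Linear trend of a day-by-day series; positive means the value grows over time."""
    if len(series) < 2:
        return 0.0
    return float(np.polyfit(np.arange(len(series)), series.to_numpy(dtype=float), 1)[0])

# daily: hypothetical table with user_id, day_since_install, revenue, sessions
obs = daily[daily["day_since_install"] <= 7].sort_values(["user_id", "day_since_install"])

dynamic_features = obs.groupby("user_id").agg(
    purchase_gradient=("revenue", slope),  # does spending trend up or down over the week?
    session_std=("sessions", "std"),       # steady daily habit vs. event-driven bursts
).fillna(0.0)
```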

Loss Function Alternatives

By default, the model was trained to minimize the Mean Squared Error (L2 loss), which is the mean of the squared difference between the target and the prediction. This type of model is sensitive to outliers. Unfortunately, the nature of free-to-play games like “War Dragons” makes their LTV distribution highly skewed towards outlying high spenders. This is especially problematic for gradient-boosting models because they build trees on previous errors and therefore focus a disproportionate amount of attention on outliers.

The baseline model performance suggests that the model was optimized to minimize the under-prediction loss for a handful of high spenders at the expense of over-predicting for the vast majority of regular users, which reduced its overall performance. One way we addressed this was by incorporating another loss function that is more robust to outliers: Mean Absolute Error (L1 loss), which minimizes the absolute residuals rather than the squared residuals.

L1 Loss (Absolute) and L2 (Squared)

Finding the sweet spot that balances typical users and outliers can be tricky. In practice, it’s always about the business goal. L2 wins when the company intends to recruit more high spenders, while L1 is preferred when the model is used for finely tuned marketing campaigns that precisely target specific user populations. Luckily, there is a loss function that combines the best properties of both: Huber Loss. It is quadratic (L2) for smaller errors and linear (L1) otherwise. By selecting a δ parameter that fits the business case, the model stays robust against large residuals while remaining differentiable at its minimum (which L1 is not).
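All three losses are available in LightGBM out of the box. A minimal sketch of comparing them might look like the following, where LightGBM’s alpha parameter plays the role of δ for the huber objective; the δ value and the metrics computed here are purely illustrative:

```python
import numpy as np
import lightgbm as lgb

candidates = {
    "l2": lgb.LGBMRegressor(objective="regression"),            # MSE: chases high spenders
    "l1": lgb.LGBMRegressor(objective="regression_l1"),         # MAE: robust to outliers
    "huber": lgb.LGBMRegressor(objective="huber", alpha=50.0),  # quadratic below δ, linear above
}

for name, model in candidates.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    bias = (preds.sum() - y_test.sum()) / y_test.sum()               # systematic over/under-prediction
    nrmse = np.sqrt(np.mean((preds - y_test) ** 2)) / y_test.mean()  # normalized RMSE
    print(f"{name}: bias={bias:+.1%}, NRMSE={nrmse:.2f}")
```

The loss that best matches the business goal, judged on out-of-sample bias and NRMSE for all users versus purchasers only, is the one worth keeping.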

Huber Loss
L1, L2, and Huber Loss

Updated Performance

By implementing the above changes to the model, we were able to improve its performance by roughly 20%.

Model Performance of Baseline Model and Improved Model

One more thing…

Our job wasn’t done after we deployed the new model into production. There’s something else to keep in mind: mobile games evolve very quickly. User behaviors, statistical properties of LTV, and any hidden relationships therein can change over time due to content updates, in-game events, market competition, etc. We address this issue by closely tracking the model’s performance and regularly updating it to include the most recent historical data and best capture new relationships. We have found that updating the model on a monthly basis can improve its performance by about 3%.

There are additional ways to tackle the challenges posed by continuously evolving user behavior. For example, applying higher weight to newer data helps the model to focus on recent relationships. Another strategy that our team has explored is Temporal Difference Learning, where the training signal is a subsequent model prediction instead of actual LTV. The model is adjusted to bring the old prediction into line with the newer prediction. This can enable the model to take advantage of recent data without needing to wait for the payback window to elapse.
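For recency weighting specifically, one simple approach is to decay each training example’s weight with the age of its data. A minimal sketch, assuming a hypothetical days_old array and an arbitrary 90-day half-life:

```python
import numpy as np
import lightgbm as lgb

# days_old: hypothetical array giving, for each training example, how many days
# have passed since that user's cohort was observed
HALF_LIFE = 90.0  # illustrative: an example's weight halves every 90 days of age
weights = np.power(0.5, days_old / HALF_LIFE)

model = lgb.LGBMRegressor(objective="huber", alpha=50.0)
model.fit(X_train, y_train, sample_weight=weights)  # recent cohorts count more
```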

Closing

Predicting LTV is a tricky business. By applying intensive feature engineering techniques and using a loss function that’s more appropriate for our needs, we’ve been able to make substantial improvements to the performance of our model. Moving forward, we will continuously look for other methods to make our predictions even more accurate.

We hope you found this insightful! If you’d like to learn more about what we do at Pocket Gems, take a look at some of our other blog posts, or if you’re interested in joining Pocket Gems, we’re hiring!
