If there’s only one option, how can we know it is the right one for us? It sounds like a philosophical debate, but in fact there are many recurring situations in which a company needs to answer this kind of question. Today, I’ll walk you through a real case in which data science helped us determine whether we made a good decision without running an experiment.
Wildlife Studios is the first video game company I’ve ever worked for. It’s fun and challenging — not only because of the magic of the games but also the magic of advertising. Have you ever thought about all the things that happen behind the advertising videos that appear when you play a mobile game? Many things happen, and they are exciting.
What is an SDK?
SDKs (Software Development Kits) are packages of code that implement particular parts of an app’s functionality. In the mobile industry, effectively reaching users with an ad generally relies on a specific SDK responsible for connecting the app with the advertising market. Part of the action, and of my work, happens when this software needs to be improved.
For that, a new version is created, tested, and finally deployed to production. But even after testing it many times, we may still wonder whether the latest version is indeed better than the old one. And even when we are sure it is better, we still want to know how much better.
A/B Testing and SDKs
Usually, when this type of question arises, data science answers it through experimentation, for example by designing an A/B test. In this case, we could release the new SDK to some randomly chosen users (the treatment group) and keep the rest on the old version (the control group). Then we would compare performance in the new-SDK group versus the old-SDK group.
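With randomized assignment, a simple two-sample comparison recovers the true effect. Here is a minimal sketch in base R with simulated data; the metric, group sizes, and lift are all hypothetical:

```r
set.seed(123)
n <- 10000

# Hypothetical per-user revenue under each SDK version
control   <- rgamma(n, shape = 2, rate = 4)         # old SDK
treatment <- rgamma(n, shape = 2, rate = 4) + 0.05  # new SDK, true lift = 0.05

# Because assignment is random, the groups are comparable, so a
# two-sample test estimates the lift without bias
res  <- t.test(treatment, control)
lift <- unname(res$estimate[1] - res$estimate[2])
lift  # close to the true lift of 0.05
```

The randomization is exactly what we lose in the SDK setting: without it, the two groups may differ for reasons unrelated to the SDK version.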
But in the app world, we can’t serve different SDK versions to different users: the SDK ships inside the app itself, so every user on a given build gets the same version. Here’s where things get complicated, because the A/B test is unfeasible.
You could argue that several versions coexist at the same time, since some users still have the old version, and you would be right. But it would be a mistake to simply compare the users who have the new version with those who have older versions. It’s easy to imagine situations in which the two groups aren’t comparable. For example, a user whose device has plenty of free storage space is more likely to have the latest version, while a user with little free space may not. So a direct comparison could introduce bias into our analysis.
How can we measure the impact of releasing a new SDK version?
The idea is to estimate what would have happened if we hadn’t changed the SDK version. There is no way to answer that question without proposing a counterfactual. A counterfactual is a situation that could have happened but did not really happen.
In this way, what has actually happened will be our treatment group, whereas the counterfactual will represent our control group. Then we can compare both cases and draw conclusions about the effect of the version change.
A more technical approach
Perhaps the most straightforward reasoning is to assume that nothing would have changed if we had changed nothing. Let’s imagine a case where our metric of interest always stays within roughly the same range. As the following diagram shows, it would be reasonable to think that the metric would have kept its value in the absence of changes.
But now imagine that the metric had a positive trend over time. It would be a mistake to propose a counterfactual that simply replicates what happened in the previous period. As we can see in the following example, the comparison with the counterfactual would overestimate the impact of our new version.
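We can see this overestimation numerically. Here is a minimal sketch in base R with a simulated upward-trending metric, where the true effect of the intervention is zero (all values are hypothetical):

```r
set.seed(42)

# Pre-period (days 1-60): metric trending upward, plus noise
pre  <- 100 + 1.0 * (1:60) + rnorm(60, sd = 2)
# Post-period (days 61-75): the trend simply continues; true effect = 0
post <- 100 + 1.0 * (61:75) + rnorm(15, sd = 2)

# Naive counterfactual: replicate the pre-period average
naive_effect <- mean(post) - mean(pre)   # large, spurious "impact"

# Trend-aware counterfactual: extrapolate a linear fit of the pre-period
fit  <- lm(y ~ t, data = data.frame(y = pre, t = 1:60))
pred <- predict(fit, newdata = data.frame(t = 61:75))
trend_effect <- mean(post - pred)        # close to the true effect of 0
```

The naive estimate attributes the pre-existing trend to the intervention; the trend-aware counterfactual does not.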
So, to improve our estimate of the counterfactual, we should account for the pre-existing elements that governed the evolution of the metric: mainly the trend, the cycles, and the seasonality of the series. A counterfactual that includes these elements gives us a reasonable basis for measuring the impact of our intervention.
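Base R can already separate these components. A sketch with simulated data, where the weekly seasonality and linear trend are assumptions of the example:

```r
set.seed(1)
t <- 1:140  # 20 weeks of daily data

# Simulated metric: linear trend + weekly seasonality + noise
y <- ts(50 + 0.3 * t + 5 * sin(2 * pi * t / 7) + rnorm(140),
        frequency = 7)

# STL decomposition into seasonal, trend, and remainder components
parts <- stl(y, s.window = "periodic")
head(parts$time.series)  # columns: seasonal, trend, remainder
plot(parts)              # visual inspection of the three components
```

A counterfactual built on the estimated trend and seasonal components is already far more credible than a flat extrapolation.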
However, it is not that simple: although we are now in a much more comparable situation, unexpected factors may still affect our inference. For example, there can be abrupt but perhaps temporary drops in spend from some big buyers (ad networks, or “adnets”).
In that case, something that can help is to analyze what happens with other games that don’t use our new version yet, and to build a counterfactual that is more robust to those unexpected events.
R, CausalImpact, and the Google team
Now we need a method capable of incorporating all the elements of a good counterfactual. An essential part of the task is learning to predict the time series, which means capturing characteristics such as trend, seasonality, and cycles. We also have to choose which of the other games to use to complete the counterfactual scenario. The Google team proposed an elegant way to solve all these challenges at once. Thanks to their work, we have both a paper with the academic details and an R package called CausalImpact.
Proposed Solution: The Bayesian Approach
The solution uses a structural time series model within a Bayesian approach. Don’t be scared. The important thing is that this type of model lets us express each component of the counterfactual explicitly.
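Under the hood, CausalImpact builds on the bsts package (Bayesian structural time series). Here is a minimal sketch with simulated data, assuming bsts is installed; the series and the choice of trend and seasonal components are illustrative:

```r
library(bsts)

set.seed(7)
day <- 1:120
# Simulated metric with a trend and weekly seasonality
y <- 50 + 0.2 * day + 4 * sin(2 * pi * day / 7) + rnorm(120)

# State specification: a local linear trend plus weekly seasonality,
# the structural components the counterfactual needs to express
ss <- AddLocalLinearTrend(list(), y)
ss <- AddSeasonal(ss, y, nseasons = 7)

# Fit the structural model by MCMC
model <- bsts(y, state.specification = ss, niter = 500)
plot(model, "components")  # trend and seasonal contributions
```

Each `Add*` call adds one structural component, which is what makes this model family so convenient for building counterfactuals.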
I won’t tell you how to get trend and seasonality, which you can learn in any time series analysis course. I’m going to jump straight to the most interesting part of the estimation, which is how we will use the information that comes from the other games.
The best scenario we could hope for is a game, or group of games, that behaves very similarly to the game we are trying to analyze. In practice, though, we see many games with little or no correlation with the time series we want to model, and only a few with considerable correlation. So the art lies in finding the group of games with a strong correlation.
How does the package tackle this difficulty? Using a Bayesian variable selection technique called spike-and-slab, it determines the degree to which each contemporaneous covariate (each game) should be included as a predictor of the time series. This lets us decide which games to use and what weight to give them. And best of all, since this variable selection technique works even when the number of predictors exceeds the number of observations, we can test as many games as we want!
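A sketch of how spike-and-slab selection looks in bsts (which CausalImpact uses internally). The covariates are simulated stand-ins for “other games”, and `expected.model.size` encodes the prior on how many predictors we expect to matter:

```r
library(bsts)

set.seed(21)
n <- 90
# Two covariate "games" truly related to the metric, eight pure noise
x_related <- matrix(rnorm(n * 2), n, 2)
x_noise   <- matrix(rnorm(n * 8), n, 8)
y <- 10 + x_related %*% c(2, -1.5) + rnorm(n, sd = 0.5)

df <- data.frame(y = as.numeric(y), x_related, x_noise)

ss <- AddLocalLevel(list(), df$y)
# The regression formula adds all candidate games; the spike-and-slab
# prior decides which ones to keep and with what weight
model <- bsts(y ~ ., state.specification = ss, data = df,
              niter = 500, expected.model.size = 2)

# Posterior inclusion probabilities: high for the related games,
# near zero for the noise games
plot(model, "coefficients")
```

This is why we can throw a large pool of candidate games at the model and let it sort out which ones deserve to be in the counterfactual.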
Finally, how were the results? Robust in most cases. In the following graph, we can see that the launch of the new SDK had a neutral impact on this particular game: compare the actual metric (solid black line) with the counterfactual (dotted blue line) after the new SDK release (first vertical dotted grey line). For privacy reasons, the absolute values are omitted, but we can see that the margin of error is relatively low.
We also had good experiences with other games, in several ways: not only did the model show the positive impact of the SDK release, it also predicted outliers like the one marked by the red circle. In particular, the drop in the following graph was related to a generalized situation after the presidential elections in the United States. Credit goes to using a technique that includes other games as predictors!
Implementation in R
Best of all, the implementation requires little effort. As a simple demonstration, no more than ten lines of code are needed, as shown below.
```r
# Import the library
library(CausalImpact)

# Load the data frame. Each column is the time series of one game,
# ordered in ascending order by time. The first column is the game to be
# evaluated; the remaining columns are the other games used as covariates.
df <- SparkR::sql("SELECT * FROM temp_df") %>%
  SparkR::collect() %>%
  as_tibble()

# Delimit the training period of the model (pre.period) and the
# counterfactual period (post.period)
pre.period  <- c(1, 60)   # We use the first 60 days for training
post.period <- c(61, 75)  # We evaluate the impact on the following 15 days

# Fit the model (parameterization is optional)
impact <- CausalImpact(df, pre.period, post.period,
                       model.args = list(nseasons = 7))

# Evaluate the model
plot(impact)               # Visual inspection of the fit
summary(impact)            # Numerical summary of main statistics and model fit
summary(impact, "report")  # Written summary of main statistics and model fit
```
The results were very robust and allowed us to quantify the impact of releasing a completely renewed version of the SDK. The possibility of using other games as predictors was vital. And finally, obtaining results quickly was a real advantage, since the CausalImpact package is quite intuitive and requires little effort to implement.