Insights on Churn Prediction Complexity

If only conducting a churn prediction were like competing in a Kaggle competition.

You already have a dataset, solid infrastructure, and a criterion to measure the success of your prediction, and your target and features are well defined. You only have to do some feature engineering, test a few algorithms, and voilà, you have something. I wish it were that easy.

In reality, it’s quite complex. I have been working on churn in the mobile gaming industry for quite some time, and this article will lay out some of the complexity involved in this kind of prediction. Let’s consider some questions we have to answer before conducting a churn prediction.


  1. How do you plan to use your results in order to generate ROI? This is extremely important, yet too often neglected. If a certain user has a high probability of churning, how do you intend to act? Do the infrastructure and/or the organizational structure enable you to act? There is no point in shipping something complex that cannot be acted upon.
  2. What is the goal of the prediction? Is it to understand churn or to predict which users are likely to churn? These goals often lead to two different sets of algorithms.
  3. Are there benchmarks or best practices in your industry that you should be aware of? They will certainly have an impact on how you try to explain and/or predict churn.
  4. How do you measure the success of your prediction? If you are treating churn as a binary classification problem, users who do not churn can be pretty rare. You might be tempted to use the partial area under the ROC curve (PAUC) rather than the AUC, or a custom weighted metric (e.g. score = sensitivity * 0.3 + specificity * 0.7).
  5. How do you know when to stop improving your prediction and ship your first version? Once you have established your scoring criterion, you must set a threshold; the threshold value will tell you when to ship.
  6. Have you agreed with stakeholders on the measure of success of your model?
  7. Do you want to make predictions for all your users or only a segment? For example, new users and veteran users might not call for the same features.
  8. What is the target? You could treat churn prediction as a classic binary classification problem (1: churned, 0: not churned), but you could also consider the target to be, say, the number of sessions played.
  9. What are your features? Are you going to use time-dependent features? Based on your industry knowledge, what are the most important features for predicting churn? Are they easily computed from your current database? Have you considered the famous RFM features (recency, frequency, monetary value)?
  10. How do you get the data? Do you need to make batch SQL calls to gather the desired data shape, do you need to start with streaming data, or do you intend to evolve from batch to streaming as you ship better versions?
  11. How do you intend to clean your data? How do you deal with missing, aberrant, and extreme values that can skew your model?
  12. As you code, what should you test? Google published a great article that provides a set of actionable tests to help you get started.
  13. How do you push your work to production? Do you need to push your prediction results to a database, or do you need to make your model available as a REST API? How do you collaborate with the software engineers?
  14. How are you going to monitor the quality of your predictions in production? Are you going to create a dashboard? Will you create a table in your database containing your logs? What will you monitor? How do you intend to act based on certain KPIs of your prediction quality?
  15. What is the maintenance process for your model? Do you intend to change your model once every month, or every quarter? What are you planning to change, and how will you track the history of those changes?
  16. What are your deliverables? How do you plan to improve your model over time?
  17. Which tools are you going to use? In most cases, you will be using R, Python, or Spark. Spark is recommended when you really have big data (say, 10 million rows to predict). Note that the Spark ML library is limited in contrast to the top Python libraries (TensorFlow, scikit-learn…).
  18. Which packages are you going to use to make the prediction? There are so many out there (scikit-learn, H2O, TPOT, TensorFlow, Theano, etc.).
  19. How do you deal with re-engaged users? Suppose you decided to predict which users will churn within the next 30 days; what if they come back after 30 days? How do you account for this bias in your analysis?
  20. Which algorithm are you going to use? It could be random forest, SVM, linear regression, a neural network, or something else. You could also use an automated machine learning (AutoML) tool to find your first model. Whatever you choose, you need to know the data assumptions of each algorithm.
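The custom weighted metric mentioned in question 4 can be sketched in a few lines of plain Python. The function name, the weights, and the toy labels below are illustrative assumptions, not taken from any library:

```python
def weighted_churn_score(y_true, y_pred, w_sens=0.3, w_spec=0.7):
    """Weighted combination of sensitivity (recall on churners)
    and specificity (recall on non-churners)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return w_sens * sensitivity + w_spec * specificity

# Toy example: 1 = churned, 0 = retained
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]
print(round(weighted_churn_score(y_true, y_pred), 3))  # prints 0.76
```

The weights encode a business decision — here, misclassifying a retained user is penalized more than missing a churner — so they should come out of question 1, not out of the data.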
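One way to make the binary target from question 8 concrete — and to surface the re-engagement bias from question 19 — is to label users by an inactivity window. This is a minimal sketch assuming you can query each user's most recent session date; the names and the 30-day window are illustrative:

```python
from datetime import date, timedelta

def label_churn(last_sessions, observed_on, window_days=30):
    """1 = churned (no session within the window), 0 = retained.
    A user labeled 1 who comes back on day 31 is noise in this label."""
    cutoff = observed_on - timedelta(days=window_days)
    return {user: int(last_seen < cutoff)
            for user, last_seen in last_sessions.items()}

last_sessions = {
    "u1": date(2018, 1, 2),   # months of inactivity
    "u2": date(2018, 5, 28),  # played a few days ago
}
print(label_churn(last_sessions, observed_on=date(2018, 6, 1)))
# prints {'u1': 1, 'u2': 0}
```

Whatever window you pick, it becomes part of the model's definition of churn, so it needs to match how the stakeholders from question 6 understand the word.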
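Question 9 mentions the RFM features. As a rough sketch, recency, frequency, and monetary value can be derived in one pass over raw session events; the `(user, session_date, spend)` tuple layout is an assumption for illustration:

```python
from datetime import date

def rfm_features(events, observed_on):
    """events: iterable of (user_id, session_date, spend).
    Returns user -> (days since last session, session count, total spend)."""
    feats = {}
    for user, day, spend in events:
        r, f, m = feats.get(user, (None, 0, 0.0))
        recency = (observed_on - day).days
        r = recency if r is None else min(r, recency)
        feats[user] = (r, f + 1, m + spend)
    return feats

events = [
    ("u1", date(2018, 5, 20), 0.99),
    ("u1", date(2018, 5, 30), 0.0),
    ("u2", date(2018, 4, 1), 4.99),
]
print(rfm_features(events, observed_on=date(2018, 6, 1)))
# prints {'u1': (2, 2, 0.99), 'u2': (61, 1, 4.99)}
```

In production this would typically be a SQL aggregation (question 10) rather than a Python loop, but the feature definitions stay the same.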
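For the monitoring in question 14, even a simple rule over a logged KPI can turn a dashboard into an action. A hypothetical check — the threshold, window, and function name are made up for illustration:

```python
def should_retrain(weekly_scores, alert_threshold=0.70, bad_weeks=2):
    """Flag the model if the logged quality score stays below the
    agreed threshold for `bad_weeks` consecutive weeks."""
    streak = 0
    for score in weekly_scores:
        streak = streak + 1 if score < alert_threshold else 0
        if streak >= bad_weeks:
            return True
    return False

print(should_retrain([0.78, 0.74, 0.69, 0.66]))  # prints True
```

Requiring consecutive bad weeks avoids retraining on a single noisy week; the right threshold is whatever success measure the stakeholders signed off on in question 6.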

And here is the best part: you have not even started to code…

