User Churn Prediction using Neural Network with Keras

Fuaddi Yustindra
Published in FiNC Tech Blog
Oct 23, 2019

Based on users’ first-week behavior in the app, we create a model to predict whether they churn in the first month.

Background

The cost of acquiring new users is much higher than that of retaining existing ones. That's why the Product and Engineering teams make an all-out effort to keep users from churning by experimenting with new features every week. Knowing users' churn probability at an early stage helps the team craft better preventive strategies. In this article, you will see how to create a simple and decently accurate prediction model in Python using the Keras neural network library.

Importing necessary libraries
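The original import cell is not reproduced here; a plausible set, assuming the libraries used later in the article (pandas, scikit-learn, Keras via TensorFlow, SciPy, and Talos), would be:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats                            # Z-score outlier removal
from sklearn.preprocessing import PowerTransformer  # Yeo-Johnson feature scaling
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
# import talos                                     # hyperparameter optimization
```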

Dataset

A quick look at the raw data (the user_id values do not correspond to the actual ones)

Our dataset contains basic info and the first-week in-app behavior of 771,000 users who registered over a span of two months. Since one of our app's core functions is the lifelog feature, its records are worth including in this dataset. The dataset has 8 variables (4 categorical, 4 numeric).

  • gender (Female, Male, Undefined)
  • age_class (10s, 20s, 30s, 40s, 50s, 60s and more, Undefined)
  • onbo_completed (Yes, No): whether or not the user completed our onboarding process (basically asking user basic info such as name, gender, age, weight, height, health concerns, etc)
  • active_days: how many days users logged in
  • weight_logs: how many times users recorded their weight
  • sleep_logs: how many times users recorded their sleeping time
  • meal_logs: how many times users recorded the meal they ate
  • churn (Yes, No): whether or not the user had stopped using the app as of day 30
From left: Onboarding process, weight log, sleep log, meal log

Basic Exploratory Data Analysis

Categorical variables (gender, age_class, onbo_completed)

Most of our users are female, ranging from teenagers to those in their thirties. The Undefined values most likely come from users who have not completed the onboarding process.

Numerical variables

Since we apparently have an outlier problem in the lifelog data (e.g., it is very uncommon for users to record their sleeping time 50 times, or their weight 100 times, in the first 7 days), this time we just take a quick-and-dirty step and remove outliers using the classic Z-score method.

Z-score with threshold = 3.5
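The removal step in the screenshot above can be sketched roughly as follows; the `remove_outliers` helper and the column list are hypothetical names, not the article's actual code:

```python
import numpy as np
import pandas as pd
from scipy import stats

def remove_outliers(df, cols, threshold=3.5):
    """Drop rows whose absolute Z-score exceeds the threshold in any column."""
    z = np.abs(stats.zscore(df[cols]))       # column-wise (x - mean) / std
    return df[(z < threshold).all(axis=1)]

# e.g. remove_outliers(df, ['active_days', 'weight_logs',
#                           'sleep_logs', 'meal_logs'])
```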
Numerical variables (after outlier removal)
Correlation matrix

All features correlate positively with each other. This is fairly obvious: the more active a user is, the more likely he or she is to log data in the app.

Data Preparation

Preparing model-ready input data is critical before feeding it into the neural network. First, we convert the categorical data into numeric form using one-hot encoding: each value of each categorical feature becomes a new feature that takes only a 0 or 1 value (hence the name).
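With pandas, one-hot encoding is a single call to `get_dummies`; here is a minimal illustration on a toy frame (the real code would pass all four categorical columns):

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["Female", "Male", "Undefined"],
    "active_days": [5, 2, 7],
})
# Each category value becomes its own 0/1 column.
encoded = pd.get_dummies(df, columns=["gender"])
print(list(encoded.columns))
# ['active_days', 'gender_Female', 'gender_Male', 'gender_Undefined']
```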

Dataset after one-hot encoding

Then we apply feature scaling to all non-binary features to get zero mean and a standard deviation of 1, which also speeds up the computation. We use PowerTransformer with the Yeo-Johnson method instead of StandardScaler since our data have relatively non-Gaussian distributions.
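A sketch of that scaling step on skewed, count-like data (the exponential sample here is just a stand-in for the lifelog columns):

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

np.random.seed(0)
X = np.random.exponential(size=(1000, 4))   # skewed, non-negative counts

# Yeo-Johnson maps toward a Gaussian shape; standardize=True (the default)
# then rescales to zero mean and unit standard deviation.
pt = PowerTransformer(method="yeo-johnson", standardize=True)
X_scaled = pt.fit_transform(X)
```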

Dataset after feature scaling

Creating the Model & Hyperparameter Optimization

Now comes the fun part: the dataset is ready and we can start building the neural network. We split the data into a 90% training set and a 10% validation set. Since this is a relatively simple dataset, we just create a standard architecture as shown below:

  • 2 densely-connected hidden layers
  • L2 regularization (10⁻³) (to prevent overfitting)
  • Binary cross-entropy loss
  • Output layer activation function : Sigmoid
  • Batch size : 1024
  • Epochs : 10
Our neural network architecture which takes 15 features as input and outputs either churn or no_churn.
A function containing our model overview
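Since the model function is shown only as a screenshot, here is a sketch of what it would look like; the function name is hypothetical, and the L2 factor of 10⁻³ is an assumption on my part. The `(x_train, y_train, x_val, y_val, params)` signature is the one Talos expects from a model function:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.regularizers import l2

def churn_model(x_train, y_train, x_val, y_val, params):
    model = Sequential([
        Input(shape=(15,)),                       # 15 one-hot + scaled features
        Dense(params["first_neuron"],
              activation=params["activation"],
              kernel_regularizer=l2(1e-3)),
        Dense(params["second_neuron"],
              activation=params["activation"],
              kernel_regularizer=l2(1e-3)),
        Dense(1, activation="sigmoid"),           # churn probability
    ])
    model.compile(optimizer=params["optimizer"],
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    history = model.fit(x_train, y_train,
                        validation_data=(x_val, y_val),
                        batch_size=1024, epochs=10, verbose=0)
    return history, model
```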

Sequential allows us to stack a layered network to which we add the Dense layers. Within each layer we could set more parameters, such as kernel/bias/activity initializers and constraints, but in this case we only set a regularizer and an activation function per layer. Then we configure (compile) the model by passing three parameters (the optimizer, the loss function to minimize, and the metrics to evaluate) before training the model with the fit function, passing in our training set and validation set.

From the code above, you might notice an additional parameter called params. Instead of arbitrarily putting some hyperparameter values into our architecture, we want to find the values most likely to give the best result (in this case, accuracy) by optimizing them with the Talos library; you can find more details about Talos here. Simply put, Talos runs the model with all combinations of the hyperparameters we previously assigned and then outputs a performance comparison. We find it helpful as we do not have to manually rerun the model each time we try a new hyperparameter setting.

We can tune any hyperparameters with as many values as we like, but for simplicity we assign just 2 values to each: first_neuron (5 or 10), second_neuron (3 or 5), activation (ReLU or tanh), and optimizer (Adam or RMSprop), as shown in the code below. The model will be run over the 16 possible scenarios.

Let Talos do the repetitive job
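The grid in the screenshot above amounts to a dict of candidate values; 2 × 2 × 2 × 2 gives the 16 scenarios. The `talos.Scan` call is sketched only in a comment (its exact arguments are assumptions, and `churn_model` is the hypothetical model function):

```python
from itertools import product

# Two candidate values per hyperparameter, as in the article.
params = {
    "first_neuron": [5, 10],
    "second_neuron": [3, 5],
    "activation": ["relu", "tanh"],
    "optimizer": ["Adam", "RMSprop"],
}

n_scenarios = len(list(product(*params.values())))
print(n_scenarios)  # 16

# Talos then evaluates every combination, roughly:
# scan = talos.Scan(x=X_train, y=y_train, params=params,
#                   model=churn_model, experiment_name="churn")
```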

On my computer, it took roughly 22 minutes to run the model with all 16 hyperparameter combinations. You can make this shorter or longer by playing with the batch size, the number of epochs, the number of hyperparameters and their values, and so on (though this may also affect model performance). Finally, Talos outputs something like this:

Performance report of each scenario

Based on the report above, it seems that the more nodes we have in the first layer, the better the performance gets; this could be a hint for further tuning. The network with 10 first-layer neurons, 5 second-layer neurons, the ReLU activation function, and the Adam optimizer gives the best result, with 79.2% accuracy.

Now we can make predictions with the model we just created by calling model.predict() and model.predict_classes() to obtain the churn probability and the class prediction, respectively. The final result (shown below) will hopefully help the team understand whom to target and when to launch features to retain users.
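One caveat for readers using a newer Keras: `predict_classes()` has since been removed, but thresholding the probabilities gives the same result. The probability array below is a stand-in for an actual `model.predict(X_new)` output:

```python
import numpy as np

proba = np.array([[0.81], [0.12], [0.55]])   # stand-in for model.predict(X_new)
pred = (proba > 0.5).astype(int)             # 1 = churn, 0 = no_churn
print(pred.ravel().tolist())  # [1, 0, 1]
```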

The final result

TL;DR and Further Approach

In this article, we have done:

  • Standard EDA (Visualizing raw data, getting rid of outliers, inspecting correlation)
  • Data preparation (Converting categorical variables into numerical, scaling the data)
  • Building the fully-connected neural network using the Keras Sequential API
  • Finding the optimal value for hyperparameters using Talos
  • Obtaining the churn probability and predictions

We have implemented a binary classifier using a regular neural network. There is indeed plenty of room for model improvement, ranging from adding more features that better track in-app behavior, handling outliers more carefully (as they might contain useful information), advanced feature engineering, and dimensionality reduction, to spending more time on the architecture (adding dropout, adjusting initializers/regularizers, early stopping) and hyperparameter tuning. Of course, your feedback and suggestions are highly appreciated.
