Understanding Customer Churning

Jessie Shao
Published in The Startup · Aug 7, 2020

Big data analytics with a real-life example from a digital music service

Customer Churn

Customer churn is a key predictor of the long-term success or failure of a business. It is the rate at which customers leave your business and take their subscription dollars elsewhere. Why users churn, and how to win them back, keep them, and attract new ones, are questions every business asks itself constantly.


Let’s use a digital music service as our example, thinking of the most familiar platforms, like Spotify or Pandora. Every time you, as the user, interact with the service, every small step, such as playing a song, logging out, or liking a track, generates data. Here comes the big data! All of this data contains key insights for predicting user churn and keeping the business thriving. Because of the size of the data, this is a challenging and common problem that we regularly encounter in any customer-facing business.

Here we are going to analyze a real-life large dataset for a music streaming service with Spark. We attempt to build machine learning models to predict which users are likely to churn and to understand the features that contribute to churning behavior.

Let’s start with a mini subset (~128 MB) of the full data (12 GB) to understand and explore the dataset. We load the dataset (JSON format) with the following commands:

# Create a Spark session
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local")
         .appName("sparkify")
         .getOrCreate())

# Read the dataset
events_df = spark.read.json('mini_sparkify_event_data.json')

We can also take a quick look at all the features and their datatypes.
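The printout below comes from Spark's built-in printSchema method:

events_df.printSchema()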

root
|-- artist: string (nullable = true)
|-- auth: string (nullable = true)
|-- firstName: string (nullable = true)
|-- gender: string (nullable = true)
|-- itemInSession: long (nullable = true)
|-- lastName: string (nullable = true)
|-- length: double (nullable = true)
|-- level: string (nullable = true)
|-- location: string (nullable = true)
|-- method: string (nullable = true)
|-- page: string (nullable = true)
|-- registration: long (nullable = true)
|-- sessionId: long (nullable = true)
|-- song: string (nullable = true)
|-- status: long (nullable = true)
|-- ts: long (nullable = true)
|-- userAgent: string (nullable = true)
|-- userId: string (nullable = true)

The page column seems to be the most important feature, as it records all the user interactions. It contains values such as Logout, Save Settings, Roll Advert, Settings, Submit Upgrade, Cancellation Confirmation, Add Friend, etc. The Cancellation Confirmation events of page define the churn we are interested in (0 for un-churned, 1 for churned).
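As a sketch of how that label can be constructed (one common approach, not necessarily the exact code used in the original analysis): flag the Cancellation Confirmation events, then spread the flag over every row of each user with a window:

from pyspark.sql import Window
from pyspark.sql import functions as F

# Flag the Cancellation Confirmation events
flag_churn = F.when(F.col('page') == 'Cancellation Confirmation', 1).otherwise(0)
df = events_df.withColumn('churn_event', flag_churn)

# Spread the flag across all rows of each user: 1 = churned, 0 = stayed
user_window = Window.partitionBy('userId')
df = df.withColumn('churn', F.max('churn_event').over(user_window))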

Exploratory Data Analysis (EDA)

We want to perform some exploratory data analysis to observe the behavior of users who stayed versus users who churned.

The first bar plot compares the average length of songs played by churned and un-churned users: un-churned users have a longer mean listening length than the churned group.
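As a sketch, the numbers behind that plot can be reproduced with a simple aggregation (assuming the churn column built above):

# Average length of played songs for churned vs. un-churned users
(df.filter(df.page == 'NextSong')
   .groupby('churn')
   .agg(F.avg('length').alias('avg_length'))
   .show())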

The second bar chart shows the relationship between churn rate and the users’ user agent. From the data, we can conclude that X11 and iPhone users tend to churn more, which gives us some insight for further investigation of those platforms.
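One plausible way to get that breakdown is to pull a rough platform token out of the userAgent string with a regular expression (the pattern here is an illustrative assumption, not the exact one used):

# One row per user, then a rough platform token from the userAgent string
users = df.select('userId', 'userAgent', 'churn').dropDuplicates(['userId'])
platform = F.regexp_extract('userAgent', r'\(([^;)]+)', 1)
(users.withColumn('platform', platform)
      .groupby('platform')
      .agg(F.avg('churn').alias('churn_rate'))
      .sort(F.desc('churn_rate'))
      .show())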

By checking the correlation matrix of the page values together with our domain knowledge, we pick several features (Thumbs Up, Thumbs Down, Add Friend, Add to Playlist, Error, Help) to observe the difference between churned and un-churned customers. The box plot below shows some detailed information.

What can we gain from the plots? From my perspective, churned users are:

  • less likely to click thumbs up
  • less likely to add friends
  • less likely to add songs to the playlist

However, this doesn’t necessarily mean that they encounter more errors or need more help from the service.

Once we have familiarized ourselves with the data, let’s build out the features we find promising to train the model on.

Feature Engineering

Here are some features that I found interesting:

  1. Features from page, with unrelated ones removed

# Imports used throughout feature engineering
from pyspark.sql import functions as F
from pyspark.sql.functions import col

# Per-user counts of each page interaction
df_features = (df.groupby(['userId'])
               .pivot('page')
               .count()
               .fillna(0))
df_features = df_features.withColumnRenamed('Cancellation Confirmation', 'Churn')
df_features = df_features.drop('About', 'Cancel', 'Login', 'Logout',
                               'Roll Advert', 'Submit Registration',
                               'Register', 'Save Settings')

2. Total length of songs the user listened to

# Sum of song lengths over all NextSong events, per user
total_length = (df.filter(df.page == 'NextSong')
                .groupby(df.userId)
                .agg(F.sum(df.length).alias('total_songlength')))
df_features = df_features.join(total_length, on=['userId'], how='inner')

3. Gender, encoded as dummy variables

# One row per user with gender, then one-hot encode the categories
gender_df = df.select('userId', 'gender').dropDuplicates()
categories = (gender_df.select('gender')
              .distinct()
              .rdd.flatMap(lambda x: x)
              .collect())
exprs = [F.when(F.col('gender') == category, 1)
         .otherwise(0)
         .alias(category) for category in categories]
gender_df = gender_df.select('userId', *exprs)
df_features = df_features.join(gender_df, on=['userId'], how='inner')

4. Number of days the user was active

# Days between first and last event; ts is a millisecond epoch timestamp
days = df.groupby('userId').agg(F.max(df.ts), F.min(df.ts))
days = days.withColumn('days_active',
                       (col('max(ts)') - col('min(ts)')) / (1000 * 60 * 60 * 24))
df_features = (df_features.join(days, on=['userId'], how='inner')
               .drop('max(ts)', 'min(ts)'))

5. Number of days since the account was registered

# Days between registration and the last event, both millisecond timestamps
days_reg = df.groupby('userId').agg(F.max(df.registration), F.max(df.ts))
days_reg = days_reg.withColumn('days_register',
                               (col('max(ts)') - col('max(registration)')) / (1000 * 60 * 60 * 24))
df_features = (df_features.join(days_reg, on=['userId'], how='inner')
               .drop('max(ts)', 'max(registration)'))

6. The final level of the user (paid/free)

from pyspark.sql import Window

# Last event time per user and level
final_level = (df.groupby('userId', 'level')
               .agg(F.max(df.ts).alias('finalTime')))

# Keep only each user's most recent level, then one-hot encode it
w = Window.partitionBy('userId').orderBy(F.col('finalTime').desc())
final_level = (final_level.withColumn('rn', F.row_number().over(w))
               .filter(F.col('rn') == 1)
               .drop('rn', 'finalTime'))
categories = (final_level.select('level')
              .distinct()
              .rdd.flatMap(lambda x: x)
              .collect())
exprs = [F.when(F.col('level') == category, 1)
         .otherwise(0)
         .alias(category) for category in categories]
final_level = final_level.select('userId', *exprs)
df_features = df_features.join(final_level, on=['userId'], how='inner')

Modeling

After engineering the features, we will build three models: logistic regression, random forest, and gradient boosting trees. Let’s start by generating the table, splitting, and scaling the data.

from pyspark.ml.feature import VectorAssembler, StandardScaler

# Rename the churn column as the label in df_features_label
df_features_label = df_features.withColumnRenamed('Churn', 'label')
# Generate the features table
df_features = df_features.drop('Churn', 'userId')
# Split the data
train, test = df_features_label.randomSplit([0.8, 0.2])
# Instantiate a VectorAssembler for the pipeline
vector_assembler = VectorAssembler(inputCols=df_features.columns, outputCol='Features')
# Scale each column for the pipeline
scale_df = StandardScaler(inputCol='Features', outputCol='ScaledFeatures')

Here we give an example of building the logistic regression model; the other models are built in a similar way.

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression(featuresCol='ScaledFeatures', labelCol='label',
                        maxIter=10, regParam=0.01)
# Create the pipeline
pipeline_lr = Pipeline(stages=[vector_assembler, scale_df, lr])
# Fit the model
model_lr = pipeline_lr.fit(train)
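For instance, a gradient boosting trees version differs only in the estimator; here is a sketch with default-ish settings rather than a tuned configuration:

from pyspark.ml.classification import GBTClassifier

# Same pipeline shape, swapping in a gradient boosting trees classifier
gbt = GBTClassifier(featuresCol='ScaledFeatures', labelCol='label', maxIter=10)
pipeline_gbt = Pipeline(stages=[vector_assembler, scale_df, gbt])
model_gbt = pipeline_gbt.fit(train)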

In order to evaluate the models, we write a function that reports results on the held-out set. Since churned users are a fairly small subset, we use the F1 score as the metric to optimize.

from pyspark.ml.evaluation import MulticlassClassificationEvaluator

def performance(model, data, evaluation_metric):
    # Set up the evaluator
    evaluator = MulticlassClassificationEvaluator(metricName=evaluation_metric)

    # Generate predictions
    predictions = model.transform(data)

    # Get the score
    score = evaluator.evaluate(predictions)

    # Confusion matrix: actual label vs. prediction counts
    confusion_matrix = (predictions.groupby('label')
                        .pivot('prediction')
                        .count()
                        .toPandas())

    return score, confusion_matrix

We check the performance of the model as follows:

# Performance
score_lr, confusion_matrix_lr = performance(model_lr, test, 'f1')
print('The f1 score for the Logistic Regression model: {}'.format(score_lr))
print(confusion_matrix_lr)

Here is the resulting output for the Logistic Regression Model:

From the analysis, the gradient boosting trees model did the best job, with an F1 score of up to 0.88. Note that since usually only a small group of users churn, we care more about identifying the churned users correctly than about pursuing high overall performance. In this case, we did not perform grid search to tune the parameters.
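If one did want to tune, Spark's ParamGridBuilder and CrossValidator would be the natural route; here is a minimal sketch, where the parameter values are illustrative assumptions:

from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Small illustrative grid over the GBT pipeline built above
param_grid = (ParamGridBuilder()
              .addGrid(gbt.maxDepth, [3, 5])
              .addGrid(gbt.maxIter, [10, 20])
              .build())
cv = CrossValidator(estimator=pipeline_gbt,
                    estimatorParamMaps=param_grid,
                    evaluator=MulticlassClassificationEvaluator(metricName='f1'),
                    numFolds=3)
cv_model = cv.fit(train)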

Feature Importance

Using our best GBT model and its feature importance function, we visualize the relative importance of each feature obtained in the feature engineering process. As the figure below shows, the number of days active, the number of days since registration, and the number of times users add songs to playlists are the most important features in the GBT model we built.
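As a sketch, the importances can be read off the fitted pipeline's final stage and paired with the feature column names:

# The GBT model is the last stage of the fitted pipeline
gbt_model = model_gbt.stages[-1]
importances = sorted(zip(df_features.columns, gbt_model.featureImportances.toArray()),
                     key=lambda x: x[1], reverse=True)
for name, score in importances:
    print(name, round(score, 3))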

What actions can we take to decrease the churn rate, then?

Finishing the analysis is never the end; applying it to the business is the most important part, and the part that makes our model valuable. With the feature importances we gained, we can come up with business strategies to counter customer churn in real life. Here are some brief ideas related to our analysis:

Since the number of active days is one of the most important factors for churning, rewards and discounts can be considered to keep users active. The same applies to the add-friend system: for example, if a user recommends the service to and adds five friends in the community, they could unlock a unique badge.

Wow! We are finally here! Do you still remember what we did to use big data methods to uncover the churn behavior of customers?

Let’s do a recap:

  1. Data loading
  2. Exploratory data analysis
  3. Feature engineering
  4. Model building and evaluation
  5. Identifying important features
  6. Business strategy (actions)

If you are interested in more details of these procedures, you could check out my entire code for this Sparkify analysis at my GitHub repository.

I hope you enjoyed reading this long blog and learned some strategies to boost your business the data science way!
