# Kaggle Problem Solution: WSDM - KKBox’s Churn Prediction Challenge

Nov 18, 2020

In this blog post I highlight my work on KKBox Churn Prediction, a Kaggle problem.

KKBox is the leading music streaming service across Asia. It was established in 2005 in Taiwan. KKBox features over 40 million legal tracks and is currently available in Taiwan, Hong Kong, Japan, Singapore and Malaysia, with over 10 million users.

They are committed to creating a truly immersive online music experience for their users, and to empowering artists and their music through technical innovation. In 2014, KKBox launched the KKBox Music Awards and KKBox live concerts, which brought a large number of new users to the service.

It’s always better to follow a stepwise approach rather than simply diving in. So the highlights of the blog are:

3. Data
4. Evaluation Metric
5. EDA
6. Existing Solutions
7. First Cut Approach
8. Data Preparation
9. Modeling
10. Models Comparisons
11. Future Work

So let’s start,

As the title suggests, this is a churn prediction problem, which is typically a binary classification problem: determine whether a customer will churn or not. The first thing to understand is “What is churn?” Churn typically refers to the number of paid customers/subscribers who stop using a company’s product or service within a particular period.

The next question is “Why is the churn rate useful?” A high churn rate reduces a company’s growth rate. According to research from Harvard Business School, increasing customer retention by just 5% results in a 25% to 95% increase in profit. Harvard’s research is not alone here; many other research organizations make similar claims.

Now that we know the basics of churn prediction, we can look at the actual business problem: “Based on observations (data), the model has to predict whether a paid subscriber will churn after their subscription expires.” For the purposes of the competition, KKBox also defines the term “churn”: a user has churned if there is no new valid subscription within 30 days after the current membership/subscription expires.

# 2. Business Objectives and Constraints:

• Need to penalize each misclassification.
• Some form of interpretability, so that the company can act on the results for further improvement.

# 3. Data:

For this case study, I used the version 2 data only. Four main comma-separated files (CSVs) are provided.

• train_v2.csv: This file contains 970960 rows and just 2 features, msno (unique id) and is_churn (class label).
• members_v3.csv: This file contains basic customer data. It has 6769473 rows and 6 features: msno (unique id), city, bd (age), gender, registered_via and registration_init_time.
• transactions_v2.csv: This file contains the transaction data for all users. It has 1431009 rows and 9 features: msno (unique id), payment_method_id, payment_plan_days, plan_list_price, actual_amount_paid, is_auto_renew, transaction_date, membership_expire_date and is_cancel.
• user_logs_v2.csv: This file contains the day-to-day activity data for all users. It has 18396362 rows and 9 features: msno (unique id), date, num_25, num_50, num_75, num_985, num_100, num_unq and total_secs.
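
Here is a minimal sketch of how the four files could be loaded with pandas. The helper names `load_kkbox` and `parse_yyyymmdd` are hypothetical (they are not from my original notebook), and the sketch assumes the files sit in one local folder and that date columns are stored as integers such as 20170331.

```python
from pathlib import Path

import pandas as pd

# File names as listed above (version 2 data only).
FILES = ("train_v2", "members_v3", "transactions_v2", "user_logs_v2")


def load_kkbox(data_dir):
    """Read the four CSV files into a dict of DataFrames keyed by file name."""
    data_dir = Path(data_dir)
    return {name: pd.read_csv(data_dir / f"{name}.csv") for name in FILES}


def parse_yyyymmdd(series):
    """Convert integer dates such as 20170331 into pandas timestamps (assumed format)."""
    return pd.to_datetime(series.astype(str), format="%Y%m%d")
```

Something like `frames = load_kkbox("data")` followed by `parse_yyyymmdd(frames["transactions_v2"]["transaction_date"])` would then give typed dates, where `"data"` is a placeholder path.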

# 4. Evaluation Metric:

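The performance metric is log-loss:

$$\text{log-loss} = -\frac{1}{N}\sum_{i=1}^{N}\left[\,y_i \log(p_i) + (1 - y_i)\log(1 - p_i)\,\right]$$

In the formula above, the $y_i$ are the actual class labels (ground truth), the $p_i$ are the predicted probabilities and $N$ is the number of observations. Log-loss can lie anywhere between 0 and positive infinity, and values close to 0 are considered good. So with log-loss as the performance metric, our main goal is to keep it as close to 0 as possible.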
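
To make the metric concrete, here is a small NumPy sketch of binary log-loss. Clipping the probabilities with `eps` is my assumption, added so that `log(0)` never occurs; the function name is only illustrative.

```python
import numpy as np


def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of the true labels under the predicted probabilities."""
    y = np.asarray(y_true, dtype=float)
    # Clip so that log(0) is never evaluated (assumed safeguard).
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))


# An uninformed prediction of 0.5 for every user scores log(2).
uninformed = binary_log_loss([1, 0], [0.5, 0.5])
# Confident and correct predictions score much closer to 0.
confident = binary_log_loss([1, 0], [0.9, 0.1])
print(uninformed, confident)
```

Confident but wrong predictions are punished heavily, which is why every misclassification matters under this metric.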

# 5. EDA:

• For train_v2.csv —

We can start our EDA from train_v2.csv

This plot gives a clear picture of the class imbalance. The data is highly imbalanced, and not all performance metrics work well in this situation.
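
The imbalance can also be checked numerically. This is a minimal sketch; the `toy` frame below is made-up data standing in for train_v2.csv, so the shares it prints are not the real ones.

```python
import pandas as pd


def class_balance(train):
    """Return the count and share of each is_churn label."""
    counts = train["is_churn"].value_counts().sort_index()
    return pd.DataFrame({"count": counts, "share": counts / counts.sum()})


# Made-up stand-in for train_v2.csv: 9 non-churners and 1 churner.
toy = pd.DataFrame({"msno": list("abcdefghij"), "is_churn": [0] * 9 + [1]})
balance = class_balance(toy)
print(balance)
```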

• For members_v3.csv —

To analyze the members dataset (members_v3.csv), we have to merge it with the train dataset (train_v2.csv).

This train_members dataset contains 970960 rows and 7 features.

As we can see, the dataset contains many null values.
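
A sketch of the merge, assuming a left join on msno. That assumption is consistent with train_members keeping the same 970960 rows as train_v2.csv, and it also explains the nulls: users without a members record get missing values. The toy frames are made up.

```python
import pandas as pd


def merge_with_train(train, other):
    """Left-join another table onto the train labels using the msno key."""
    return train.merge(other, on="msno", how="left")


# Made-up rows: user "b" has no members record.
train = pd.DataFrame({"msno": ["a", "b", "c"], "is_churn": [0, 1, 0]})
members = pd.DataFrame({"msno": ["a", "c", "d"], "city": [1, 5, 13]})

train_members = merge_with_train(train, members)
print(train_members.shape)
print(train_members.isna().sum())  # null count per column
```

The same helper applies to transactions_v2.csv and user_logs_v2.csv, where one user can match several rows.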

To understand this plot we need to decode the age groups: 0 represents 0–10, 1 represents 10–20, 2 represents 20–30, 3 represents 30–40, 4 represents 40–50, 5 represents 50–60, 6 represents 60–70 and 7 represents 70–80.

One thing is now clear: there are a lot of young users on this platform, typically in groups 1 to 4. Since there are a lot of outliers, for this analysis I restricted age to be greater than 0 and less than 72, after performing a percentile-based analysis.

Looking at this plot, we can say that from 2010 onward customer registrations increased every year except 2014 (due to multiple reasons). Note that for 2017 we only have data for the first 3 months (January, February and March).
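
The age filter and the group encoding described above can be sketched like this. Using integer division by 10 is my assumption about how boundary ages are assigned (age 20 lands in group 2), and `add_age_group` is a hypothetical helper name.

```python
import pandas as pd


def add_age_group(members):
    """Keep ages strictly between 0 and 72 and add a decade-based age_group column."""
    valid = members[(members["bd"] > 0) & (members["bd"] < 72)].copy()
    # 0 -> ages 1-9, 1 -> 10-19, ..., 7 -> 70-71 (boundary handling is assumed).
    valid["age_group"] = (valid["bd"] // 10).astype(int)
    return valid


# Made-up ages, including typical outliers such as 0, negatives and 1000.
toy = pd.DataFrame({"bd": [0, 5, 25, 71, 72, -3, 1000]})
grouped = add_age_group(toy)
print(grouped)
```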

• For transactions_v2.csv —

To analyze the transactions dataset (transactions_v2.csv), we have to merge it with the train dataset (train_v2.csv).

This train_transactions dataset contains 1169418 rows and 10 features.

This dataset also contains many null values.

Observations —

• A lot of users (almost 54%) used payment method 41. A good number of users also used payment methods 36, 39 and 40. Payment methods 8, 12, 13, 22, 35, 20, 17, 15 and 32, however, show a high churn rate.
• Almost a million users (94.5%) purchased the 30-day subscription, while far fewer users purchased the other plans. Users who purchased a plan other than the 30-day one have a very high tendency to leave the service.
• Most users (almost 94.25%) have a plan price of 99, 100, 129, 149 or 180 NTD. Users who purchased a plan at any other price have a very high tendency to churn.
• The distribution and churn rate of the actual amount paid look very similar to those of the plan price, possibly with subtle differences.
• A lot of users (almost 89%) set up their account to auto-renew their plan.
• Most of the data (almost 90%) comes from 2017; the rest of the transaction dates belong to 2015 and 2016. Transactions dated 2015 and 2016 show a very high churn rate.
• For most users (almost 94%) the membership expires in 2017. Other expiry years are also present, and it seems that users who purchased long-duration plans have a higher chance of churning.
• A lot of users (94%) didn’t cancel their subscription, and 8% of those users churned. Users who did cancel their subscription have a very high tendency to churn.
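
Observations like these come from looking at each feature's share of rows together with its churn rate. A minimal sketch of that calculation follows; `churn_by` is a hypothetical helper and the toy rows are made up, so the numbers are not the ones quoted above.

```python
import pandas as pd


def churn_by(df, column):
    """For each value of a column: row count, share of all rows, and churn rate."""
    out = df.groupby(column)["is_churn"].agg(rows="size", churn_rate="mean")
    out["share"] = out["rows"] / len(df)
    return out.sort_values("rows", ascending=False)


# Made-up rows standing in for the merged train_transactions data.
toy = pd.DataFrame({"is_cancel": [0, 0, 0, 0, 1], "is_churn": [0, 0, 0, 1, 1]})
summary = churn_by(toy, "is_cancel")
print(summary)
```

The same call works for payment_method_id, payment_plan_days, plan_list_price and the other transaction features.
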
• For user_logs_v2.csv —

To analyze the user logs dataset (user_logs_v2.csv), we have to merge it with the train dataset (train_v2.csv).

This train_logs dataset contains 18396362 rows and 9 features.

The dataset contains a lot of null values.

All the observations show the same thing: the presence of a lot of outliers.

One of the key takeaways is that the combined dataset contains a lot of null values and outliers, so it’s necessary to preprocess the data first.
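
One way to quantify both problems is to measure the share of nulls per column and to clip each numeric column to percentile bounds. This is only a sketch: the 1st and 99th percentiles are assumed values, and clipping is one possible treatment rather than necessarily the one I applied.

```python
import numpy as np
import pandas as pd


def null_share(df):
    """Fraction of missing values per column, largest first."""
    return df.isna().mean().sort_values(ascending=False)


def percentile_bounds(series, lower=1, upper=99):
    """Lower and upper percentile of a numeric column, ignoring nulls."""
    lo, hi = np.nanpercentile(series.to_numpy(dtype=float), [lower, upper])
    return float(lo), float(hi)


def clip_outliers(series, lower=1, upper=99):
    """Clip values outside the chosen percentile bounds (assumed thresholds)."""
    lo, hi = percentile_bounds(series, lower, upper)
    return series.clip(lower=lo, upper=hi)


# Made-up listening times: the values 0.0, 1.0, ..., 100.0.
total_secs = pd.Series(np.arange(101, dtype=float))
clipped = clip_outliers(total_secs)
print(percentile_bounds(total_secs), clipped.min(), clipped.max())
```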

# 6. Existing Solutions:

Bryan Gregory used all the data, version 1 as well as version 2. In the end he used xgboost to achieve the minimal log-loss. What I like most about his write-up is that he wrote “Feature Engineering is the key of increasing the performance of this model”. Hence I extracted a lot of features. The 10 most important features he reported are:

He created 208 features and used 76 of them for the final xgboost model. With that model he got a log-loss of 0.0797484 on the test data.

• 17th place solution —

He created a lot of features using simple statistical operations, and a few from feature interactions. Finally he used catboost, lightGBM and xgboost models. He got a log-loss of 0.10614 on the test data. The most important features he got are listed below.

# 7. First Cut Approach:

The next step is featurization. I started with basic statistical features and some interaction features such as discount, is_discount etc. After featurization I wanted to prepare the data, which typically means removing duplicates and doing a train-test split.

Finally I wanted to try Logistic Regression, a Decision Tree Classifier, and some ensembles. And if things did not go in the desired direction, I could also try a Neural Network.
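
As an illustration of the interaction features mentioned above, here is one plausible definition of discount and is_discount. The formulas are assumptions (list price minus amount paid, and a flag for a positive difference); the text only names the features.

```python
import pandas as pd


def add_discount_features(transactions):
    """Add discount and is_discount columns derived from two price columns."""
    out = transactions.copy()
    # Assumed definition: how much less than the list price the user paid.
    out["discount"] = out["plan_list_price"] - out["actual_amount_paid"]
    out["is_discount"] = (out["discount"] > 0).astype(int)
    return out


# Made-up transactions: one at full price, one fully discounted.
toy = pd.DataFrame({"plan_list_price": [149, 99], "actual_amount_paid": [149, 0]})
featured = add_discount_features(toy)
print(featured)
```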

# 8. Data Preparation:

• Feature Engineering: Here I created basic statistical features and some features from the interaction of two features. The feature engineering code looks like this:
• Preparing data for modeling: Here I wanted to remove all the non-useful features as well as the duplicate rows. The code that I used:
• Normalization: In this implementation I used a min-max scaler, more specifically a custom version, which looks like this:

Along with these, I also did the train-test split and converted the pandas DataFrames to numpy arrays before modeling.
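
The last two steps above can be sketched as follows. This is not the original code: `prepare` and `SimpleMinMaxScaler` are hypothetical names, and guarding against constant columns is my own addition.

```python
import numpy as np
import pandas as pd


def prepare(df, drop_columns):
    """Drop non-useful columns, then remove duplicate rows."""
    return df.drop(columns=drop_columns).drop_duplicates().reset_index(drop=True)


class SimpleMinMaxScaler:
    """Custom min-max scaler: maps each column of the fitted data to [0, 1]."""

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        self.min_ = X.min(axis=0)
        span = X.max(axis=0) - self.min_
        # Avoid division by zero for constant columns (assumed safeguard).
        self.span_ = np.where(span == 0, 1.0, span)
        return self

    def transform(self, X):
        return (np.asarray(X, dtype=float) - self.min_) / self.span_


raw = pd.DataFrame({"msno": ["a", "b", "c"], "x": [1, 1, 2]})
cleaned = prepare(raw, drop_columns=["msno"])

scaler = SimpleMinMaxScaler().fit([[1, 10], [3, 10], [5, 10]])
scaled = scaler.transform([[1, 10], [3, 10], [5, 10]])
print(cleaned, scaled, sep="\n")
```

Fitting the scaler on the training rows only and reusing it for the test rows keeps test information out of the scaling, at the cost that test values can fall slightly outside [0, 1].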
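
A sketch of that split using scikit-learn's `train_test_split`. The 20% test size, the stratification on is_churn and the seed are assumptions on my part; stratifying is a common choice when the classes are as imbalanced as they are here.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def split_to_arrays(df, target="is_churn", test_size=0.2, seed=42):
    """Convert a DataFrame to numpy arrays and split them, keeping the churn ratio."""
    X = df.drop(columns=[target]).to_numpy(dtype=float)
    y = df[target].to_numpy()
    return train_test_split(X, y, test_size=test_size, stratify=y, random_state=seed)


# Made-up data: 100 users, 10 of them churners.
toy = pd.DataFrame({"feature": np.arange(100), "is_churn": [1] * 10 + [0] * 90})
X_train, X_test, y_train, y_test = split_to_arrays(toy)
print(X_train.shape, X_test.shape, y_train.sum(), y_test.sum())
```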

# 9. Modeling:

Best parameters after RandomizedSearchCV: `{'n_estimators': 500, 'min_samples_split': 3, 'min_samples_leaf': 2, 'max_features': 'sqrt', 'max_depth': 15, 'bootstrap': False}`

With these hyperparameters I achieved a private leaderboard test log-loss of 0.15910.

• Decision Tree Classifier: I used DTC with 5 parameters, all of which were tuned. The Randomized Search CV code for this is:

Best parameters provided by RandomizedSearchCV: `{'splitter': 'best', 'min_samples_split': 3, 'min_samples_leaf': 1, 'max_depth': 5, 'criterion': 'gini'}`

With these hyperparameters I achieved a private leaderboard test log-loss of 0.15624.
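
For readers who have not used it, a randomized search over those five decision tree parameters could look like the sketch below. The parameter names match the best-parameter dictionary above, but the candidate values, `n_iter`, `cv` and the synthetic data are all assumptions, so this will not reproduce my scores.

```python
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Candidate values are illustrative, not the grid used originally.
param_distributions = {
    "criterion": ["gini", "entropy"],
    "splitter": ["best", "random"],
    "max_depth": [3, 5, 10, 15],
    "min_samples_split": [2, 3, 5, 10],
    "min_samples_leaf": [1, 2, 5],
}


def tune_tree(X, y, n_iter=10, seed=42):
    """Randomly sample parameter combinations and keep the one with the best log-loss."""
    search = RandomizedSearchCV(
        DecisionTreeClassifier(random_state=seed),
        param_distributions=param_distributions,
        n_iter=n_iter,
        scoring="neg_log_loss",  # scikit-learn maximizes, so log-loss is negated
        cv=3,
        random_state=seed,
    )
    return search.fit(X, y)


# Synthetic stand-in for the prepared training arrays.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=300) > 1).astype(int)

search = tune_tree(X, y)
print(search.best_params_, -search.best_score_)
```

Scoring with `neg_log_loss` keeps the search aligned with the competition metric rather than with accuracy.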

• Logistic Regression: I used Logistic Regression with 5 basic parameters, of which only 1 was tuned. The randomized search cross-validation code is:

Best parameters provided by RandomizedSearchCV: `{'C': 10}`

With this hyperparameter I achieved a private leaderboard test log-loss of 0.15148.

• CatBoost: For catboost I used just 2 parameters and ran it for 3 boosting rounds. The implemented code is:

With this model I got a test log-loss of 0.13013 on the private leaderboard.

• XGBoost: For xgboost I again used 2 parameters (the same as for catboost) and ran it for 12 boosting rounds. The implemented code is:

This model gave me a test log-loss of 0.12603 on the private leaderboard.

• LightGBM: For lightGBM I used 4 parameters with 20 boosting rounds. The code snippet is:

With this model I got a test log-loss of 0.12600 on the private leaderboard.

• Averaged Model (xgboost + lightGBM): This model is the simple average of the predictions of xgboost and lightgbm. The code snippet for this is:

With this model I got a test log-loss of 0.12562 on the private leaderboard.
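
The averaging itself is a one-liner. In this sketch the two probability arrays are made-up placeholders for the xgboost and lightgbm test predictions.

```python
import numpy as np


def average_predictions(*probabilities):
    """Element-wise mean of several arrays of predicted churn probabilities."""
    stacked = np.vstack([np.asarray(p, dtype=float) for p in probabilities])
    return stacked.mean(axis=0)


# Placeholder predictions for three users.
xgb_pred = np.array([0.10, 0.80, 0.30])
lgb_pred = np.array([0.20, 0.60, 0.30])

blend = average_predictions(xgb_pred, lgb_pred)
print(blend)
```

Averaging tends to help when the two models make slightly different errors, which fits the small improvement over either single model here.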

• Neural Network Model: I used a very simple Neural Network to land in the top 50 on the private leaderboard. The architecture of the NN is:

With this model I got a test log-loss of 0.11866 on the private leaderboard. This score falls exactly at 50th place out of 574 entries, so the score I got is in the top 10%.
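
To compare the models side by side, the private leaderboard scores reported in this section can be collected and sorted. The label of the first entry is my own placeholder, since that model is not named in the text above.

```python
# Private leaderboard test log-loss as reported in this section.
scores = {
    "Tree ensemble (first model, unnamed above)": 0.15910,
    "Decision Tree": 0.15624,
    "Logistic Regression": 0.15148,
    "CatBoost": 0.13013,
    "XGBoost": 0.12603,
    "LightGBM": 0.12600,
    "Average (XGBoost + LightGBM)": 0.12562,
    "Neural Network": 0.11866,
}

# Lower log-loss is better, so sort ascending.
ranking = sorted(scores.items(), key=lambda item: item[1])
for name, loss in ranking:
    print(f"{name:<45} {loss:.5f}")

# Share of the leaderboard at or above 50th place out of 574 entries.
rank_share = 50 / 574
print(f"{rank_share:.1%}")
```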

# 11. Future Work:

1. Use the entire data, including version 1, and then try all of these models.
2. Stacking may work very well, so I have to try stacking.
3. More featurization, just like Bryan Gregory, who extracted 208 features.

# 12. References:

If you want to have a look at the code, then go through this GitHub page —

And also if you want to connect with me, then this is my LinkedIn profile.
