Customer Relationship Prediction: KDD Cup 2009

KUSHAL CHAKRABORTY
Jun 28, 2019 · 14 min read


(A Machine Learning Case Study)

Customer Relationship Management (CRM)

1. Business Problem

Description

Customer Relationship Management (CRM) is a key element of modern marketing strategies. The KDD Cup 2009 offers the opportunity to work on large marketing databases from the French Telecom company Orange to predict the propensity of customers to switch provider (churn), buy new products or services (appetency), or buy upgrades or add-ons proposed to them to make the sale more profitable (up-selling).

In a CRM system, the most practical way to build knowledge about customers is to produce scores. A score (the output of a model) is an evaluation, for every instance, of a target variable to be explained (i.e. churn, appetency or up-selling). Tools that produce scores make it possible to project quantifiable information onto a given population. The score is computed from input variables that describe the instances. Scores are then used by the information system (IS), for example, to personalize the customer relationship. Orange Labs has developed an industrial customer analysis platform able to build prediction models from a very large number of input variables. This platform implements several processing methods for instance and variable selection, prediction and indexation, based on an efficient model combined with variable selection regularization and a model averaging method. The main characteristic of this platform is its ability to scale to very large datasets with hundreds of thousands of instances and thousands of variables. The rapid and robust detection of the variables that contribute most to the output prediction can be a key factor in a marketing application.

The challenge is to beat the in-house system developed by Orange Labs. It is an opportunity to prove that you can deal with a very large database, including heterogeneous, noisy data (numerical and categorical variables) and unbalanced class distributions. Time efficiency is often a crucial point; therefore, part of the competition is time-constrained to test the participants' ability to deliver solutions quickly.

Problem Statement

The task is to estimate the churn, appetency and up-selling probability of customers; hence there are three target values to be predicted. The challenge is staged in phases to test how rapidly each team can produce results. A large number of variables (15,000) is made available for prediction. However, to engage participants with access to less computing power, a smaller version of the data set with only 230 variables is made available in the second part of the challenge.

  • Churn (Wikipedia definition): Churn rate is also sometimes called attrition rate. It is one of two primary factors that determine the steady-state level of customers a business will support. In its broadest sense, churn rate is a measure of the number of individuals or items moving into or out of a collection over a specific period of time. The term is used in many contexts, but is most widely applied in business with respect to a contractual customer base. For instance, it is an important factor for any business with a subscriber-based service model, including mobile telephone networks and pay TV operators. The term is also used to refer to participant turnover in peer-to-peer networks.
  • Appetency: In our context, appetency is the propensity to buy a service or a product.
  • Up-selling (Wikipedia definition): Up-selling is a sales technique whereby a salesman attempts to have the customer purchase more expensive items, upgrades, or other add-ons in an attempt to make a more profitable sale. Up-selling usually involves marketing more profitable services or products, but it can also be simply exposing the customer to other options he or she may not have considered previously. Up-selling can imply selling something additional, or selling something that is more profitable or otherwise preferable for the seller instead of the original sale.

2. Data

Data Overview

  • Source: https://www.kdd.org/kdd-cup/view/kdd-cup-2009/Data
  • There are two data set files: orange_small_train_data.zip and orange_small_test_data.zip.
  • Both the training and test sets contain 50,000 examples.
  • The data sets have numerical and categorical variables.
  • The first 190 variables are numerical and the last 40 are categorical.

3. Mapping the real-world problem to Machine Learning Problem

It is a binary classification problem, with two classes (+1 and -1) for each of the three targets: churn, appetency and up-selling.

Evaluation metrics

Performance is evaluated as the arithmetic mean of the AUC over the three tasks (churn, appetency and up-selling). This mean is known as the "Score".
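In other words, if AUC_churn, AUC_appetency and AUC_up-selling denote the test AUC on each task, then:

Score = (AUC_churn + AUC_appetency + AUC_up-selling) / 3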

Our goal is to obtain the highest possible Score.

4. Exploratory Data Analysis

First, we load the provided data set into our kernel.

Format of the provided data set (only the first five instances are shown in this image)

Printing just the head of the data set shows that it contains a significant number of NaN values, so it is important to find suitable imputation methods to deal with them.
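As a rough sketch, the loading step could look like the following (the file name orange_small_train.data and the tab separator are assumptions based on the KDD Cup download):

```python
import pandas as pd

# Load the Orange small training data; the KDD Cup files are
# tab-separated text (file name and separator are assumptions).
train = pd.read_csv("orange_small_train.data", sep="\t")

print(train.shape)   # expected: (50000, 230)
print(train.head())  # even the first rows show many NaN values
```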

Analysis of NaN values in the data set

First, we check whether there are any columns (features) in the data set that contain only NaN values. On performing this analysis, we observed that:

  • There are 18 columns in the data set that contain only NaN values in all their rows.

These 18 columns are useless features for our model, so we need to eliminate them.
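A minimal sketch of this check (the variable names are ours):

```python
# Columns in which every single value is NaN.
all_nan_cols = train.columns[train.isna().all()]
print(len(all_nan_cols))  # 18 for this data set

# Drop them, since they carry no information.
train = train.drop(columns=all_nan_cols)
```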

Now we plot the number of NaN values per feature to get an idea of how the NaN values are distributed across the data set. We obtain the following graph:

Graph of Features vs Number of NaN values
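The plot can be produced along these lines (the plotting details are our choice):

```python
import matplotlib.pyplot as plt

# NaN count for each remaining feature.
nan_counts = train.isna().sum()

plt.figure(figsize=(12, 4))
plt.plot(range(len(nan_counts)), nan_counts.values)
plt.xlabel("Feature index")
plt.ylabel("Number of NaN values")
plt.title("Features vs number of NaN values")
plt.show()
```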

From the above graph, we have the following observations:

  • Even after removing the columns containing only NaN values, many NaN values remain in the data set.
  • Analysing the above graphs for the train and test data sets, there are still many features that contain more than 47,500 NaN values (out of 50,000 rows), which is a very significant number.

We also checked whether there are any instances (rows) that contain only NaN values across all their columns/features, but we did not find any such rows.
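This row-level check is a one-liner:

```python
# True only if some row is NaN in every column; here it prints False.
print(train.isna().all(axis=1).any())
```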

Loading and Analysing the Target Variables

Next, we load the labels for each of the target variables: Churn, Appetency and Up-selling.

Load labels into the kernel
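A sketch of the label loading (the .labels file names are assumptions based on the KDD Cup 2009 download; each file holds one +1/-1 value per training instance):

```python
# Each label file contains a single column of +1 / -1 values.
churn = pd.read_csv("orange_small_train_churn.labels", header=None)[0]
appetency = pd.read_csv("orange_small_train_appetency.labels", header=None)[0]
upselling = pd.read_csv("orange_small_train_upselling.labels", header=None)[0]
```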

Checking the type of the data set (Balanced / Imbalanced)

Now we check whether the data set is balanced or imbalanced.

Balanced Dataset: a data set in which the number of data points belonging to each class is (roughly) equal.

Imbalanced Dataset: a data set in which the number of data points belonging to each class is substantially unequal.

We perform the following analysis :

Number of datapoints of each class present in Appetency, Churn, Up-selling

We plot the following graphs to perform the analysis :

Number of datapoints of each class in Appetency
Number of datapoints of each class in Churn
Number of datapoints of each class in Up-Selling
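The counts behind these plots can be obtained as follows (a sketch using the label Series loaded above):

```python
# Print the +1 / -1 class counts for each of the three targets.
for name, labels in [("Churn", churn),
                     ("Appetency", appetency),
                     ("Up-selling", upselling)]:
    print(name, labels.value_counts().to_dict())
```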

From the above analysis, we have the following observations:

  • The above histograms clearly show the class imbalance for all three targets.
  • All the analysis done above shows that this is a highly imbalanced data set.

Analysing the number of unique values in the dataset

Now we analyse the number of unique values present in each feature of the data set.

Line plot of number of unique values in each feature of the dataset
Bar plot of number of unique values in each feature of the dataset
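A sketch of how these counts can be computed and plotted:

```python
# Number of distinct non-NaN values per feature.
unique_counts = train.nunique()

plt.figure(figsize=(12, 4))
plt.bar(range(len(unique_counts)), unique_counts.values)
plt.xlabel("Feature index")
plt.ylabel("Number of unique values")
plt.show()
```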

From the above two plots, we can observe that:

  • There is a lot of variation in the number of unique values from feature to feature.

5. Deletion of unimportant features from the dataset

By unimportant we mean features that contain too many NaN values. After a lot of research and experimentation, we decided to delete the features in which more than 70% of the values are missing (NaN).

We found out that there were 156 such features. So we deleted those features.
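A sketch of this filtering step (the 70% threshold comes from the experiments described above):

```python
# Fraction of missing values per feature.
frac_missing = train.isna().mean()

# Drop features with more than 70% NaN values.
cols_to_drop = frac_missing[frac_missing > 0.70].index
print(len(cols_to_drop))  # 156 for this data set

train = train.drop(columns=cols_to_drop)
```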

6. Splitting of the dataset into Train and Test dataset

We split the entire data set into train and test sets in the following proportions:

  • 80% of the dataset is used for training the model.
  • Remaining 20% of the dataset is used for testing and evaluation of the model.
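A sketch of the split, shown here for the churn labels (the stratification and the random seed are our assumptions):

```python
from sklearn.model_selection import train_test_split

# 80/20 split; stratifying on the imbalanced labels keeps the same
# class proportions in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    train, churn, test_size=0.20, stratify=churn, random_state=42
)
```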

7. Feature Engineering

We have extracted some extra features from the existing features to improve the performance of the model. The extra features are as follows (a short sketch is given after the list):

  • Number of NaN values in each row: we simply count the number of NaN values present in each instance of the data set.
  • A binary feature indicating the presence or absence of a NaN value for each feature in each instance: we put 0 if the value is present for that feature and data point, and 1 if it is NaN.
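Both features can be derived directly from the NaN mask, roughly as follows:

```python
# One binary indicator column per original feature:
# 1 where the value is NaN, 0 where it is present.
nan_flags = train.isna().astype(int).add_suffix("_is_nan")

# Count of NaN values in each row.
nan_count = train.isna().sum(axis=1).rename("nan_count")

train = pd.concat([train, nan_flags, nan_count], axis=1)
```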

8. Handling Numerical Features

After removing many NaN-heavy features as described in the sections above, some NaN values still remain in both the categorical and the numerical features. The NaN values in numerical features can be replaced with various values:

a) A very common approach is to replace the NaN values with the mean of the respective column.

b) Another common approach is to replace the NaN values with 0s.

c) Replace NaN values with the median or mode.

d) Replace NaN values with min(of the respective column) - 1.

e) Replace NaN values with max(of the respective column) + 1.

After various rounds of research and experimentation, we found that replacing NaN values with max(of the respective column) + 1, i.e. option (e), provided the best AUC score.
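A sketch of option (e); selecting the numerical columns by dtype is our assumption:

```python
# Numerical columns of the data set.
num_cols = train.select_dtypes(include="number").columns

# Replace NaN in each numerical column with that column's max + 1,
# which places the imputed values just outside the observed range.
train[num_cols] = train[num_cols].fillna(train[num_cols].max() + 1)
```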

9. Handling Categorical Features

We can handle the NaN values in categorical features in following ways :

a) We can consider NaN to be a separate category.

b) We can replace NaN values with 0s.

In this case study, the second option was more convenient to use and gave better results.
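The corresponding sketch for the categorical columns (again selecting them by dtype):

```python
# Categorical columns (object dtype) still containing NaN.
cat_cols = train.select_dtypes(include="object").columns

# Option (b): replace NaN with the value 0.
train[cat_cols] = train[cat_cols].fillna(0)
```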

10. Encoding of Categorical Features

Before providing the data as input to a model, we must encode the categorical features.

There are various ways to encode categorical features. Some of them are:

  • One Hot Encoding: a binary column is added for the presence or absence of each category. To understand it in detail, you can refer here. We have not used this encoding because it creates a very high-dimensional sparse matrix, which is difficult to handle and can cause a "Memory Error" in this case study.
  • Hashing Encoding: a categorical encoding technique. To understand it in detail, you can refer here. We have not used it because, for this data set, it needs a very large amount of memory (approx. ≥ 32 GB RAM) and may otherwise cause memory issues.
  • Frequency Encoding: a categorical encoding technique in which we replace each category with its frequency. To understand it in detail, you can refer here. We have used this encoding for our data set. It is advantageous because implementing it does not require high-configuration resources (a short sketch follows the list).
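A sketch of frequency encoding on the train/test splits (computing the counts on the training split only, to avoid leaking test information, is our own precaution):

```python
# Replace each category by how often it occurs in the training split;
# categories unseen in training map to 0 on the test split.
for col in cat_cols:
    freq = X_train[col].value_counts()
    X_train[col] = X_train[col].map(freq)
    X_test[col] = X_test[col].map(freq).fillna(0)
```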

11. Standardisation of Numerical Features

Next, we scale and standardise all the numerical features, since their values vary over very different ranges.
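A standard-scaling sketch; fitting on the training split only is, again, our own precaution:

```python
from sklearn.preprocessing import StandardScaler

# Scale each numerical feature to zero mean and unit variance.
scaler = StandardScaler()
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])
```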

12. Feature Selection

In this section we select important features that can improve the performance of the model. We have already removed some features in the previous sections. There are different kinds of feature selection techniques; we have tried some of them to improve the AUC score of the models.

We have tried the following strategies for feature selection :

  • Variance Threshold: selects features whose variance is above a certain threshold. To understand the concept with more clarity, you can refer here.
  • Lasso Regression: selects features with a non-zero coefficient in a regression fitted with L1 regularization. To understand the concept with more clarity, you can refer to the following links: Link1, Link2, Link3.
  • Support Vector Machines: can reduce the feature space to a great extent, but the main disadvantage is that it consumes a large amount of memory during execution. To understand the concept with more clarity, you can refer to the following links: Link1, Link2.
  • Extremely Randomised Trees: can be used to compute feature importances, which in turn can be used to discard irrelevant features. To understand the concept with more clarity, you can refer to the following links: Link1, Link2, Link3.

We have performed various experiments and have observed that Feature Selection using Lasso Regression provides the best AUC Score.
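A minimal sketch of the Lasso-based selection (the alpha value is an assumption and should be tuned):

```python
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# Keep only features with a non-zero coefficient in an
# L1-regularised linear fit against the target labels.
selector = SelectFromModel(Lasso(alpha=0.01))
selector.fit(X_train, y_train)

X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)
print(X_train_sel.shape)
```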

13. Implementation of Various Machine Learning Models

13.1 Gradient Boosting Classifier

a) Churn Rate

In this case we use the Churn Rate labels.

  1. First we perform hyper-parameter tuning using cross-validation to find out the best hyper-parameters.
Hyper-parameter Tuning using Cross-validation

2. Next we create the model with the best hyper-parameters.

3. Now we evaluate the performance of the model.

Evaluation of the model

4. As features selected using Lasso Regression give a better AUC score, we repeated steps 1), 2) and 3) with the Lasso-selected features and obtained a better AUC score.

Evaluation of the model using the selected features of Lasso Regression
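A condensed sketch of steps 1-4 (the parameter grid is an illustrative assumption, not the grid used in the original experiments):

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV

# Step 1: hyper-parameter tuning with cross-validated AUC.
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [3, 5],
    "learning_rate": [0.05, 0.1],
}
search = GridSearchCV(GradientBoostingClassifier(), param_grid,
                      scoring="roc_auc", cv=3)
search.fit(X_train_sel, y_train)

# Steps 2-3: GridSearchCV refits the best model; evaluate it on the
# held-out test split.
best_model = search.best_estimator_
test_proba = best_model.predict_proba(X_test_sel)[:, 1]
print("Test AUC:", roc_auc_score(y_test, test_proba))
```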

5. Results

Final Test AUC Score : 0.7602147

We have obtained the following ROC AUC plot.

ROC AUC Plot for Train and Test data
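The ROC plot itself can be drawn along these lines (the plot styling is ours):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

# ROC curves for the train and test splits of the tuned model.
for X_, y_, label in [(X_train_sel, y_train, "train"),
                      (X_test_sel, y_test, "test")]:
    fpr, tpr, _ = roc_curve(y_, best_model.predict_proba(X_)[:, 1])
    plt.plot(fpr, tpr, label=label)
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```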

b) Appetency

In this case we use Appetency labels.

Note: We have followed the same sequence of steps as for churn, so here we directly discuss the results obtained.

Results:

Final Test AUC Score : 0.882946419344

We have obtained the following ROC AUC plot.

ROC AUC Plot for Train and Test data

c) Up-Selling

In this case we use Up-Selling labels.

Note: We have followed the same sequence of steps as for churn, so here we directly discuss the results obtained.

Results:

Final Test AUC Score : 0.90323489744

We have obtained the following ROC AUC plot.

ROC AUC Plot for Train and Test data

13.2 XGBoost Classifier

a) Churn Rate

  1. First we perform hyper-parameter tuning using cross-validation to find out the best hyper-parameters.
Hyper-parameter Tuning using Cross-validation

2. Next we create the model with the best hyper-parameters.

3. Now we evaluate the performance of the model.

Evaluation of the model

4. As features selected using Lasso Regression give a better AUC score, we repeated steps 1), 2) and 3) with the Lasso-selected features and obtained a better AUC score.

Evaluation of the model using the selected features of Lasso Regression
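A sketch of the XGBoost variant (the grid, and mapping the {-1, +1} labels to {0, 1} as XGBoost expects, are our assumptions):

```python
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# XGBoost expects {0, 1} labels.
y_train01 = (y_train == 1).astype(int)
y_test01 = (y_test == 1).astype(int)

param_grid = {
    "n_estimators": [200, 500],
    "max_depth": [4, 6],
    "learning_rate": [0.05, 0.1],
}
search = GridSearchCV(XGBClassifier(eval_metric="logloss"),
                      param_grid, scoring="roc_auc", cv=3)
search.fit(X_train_sel, y_train01)

print("Test AUC:", roc_auc_score(
    y_test01, search.predict_proba(X_test_sel)[:, 1]))
```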

5. Results :

Final Test AUC Score : 0.756067503868

We have obtained the following ROC AUC plot.

ROC AUC Plot for Train and Test data

b) Appetency

In this case we use Appetency labels.

Note: We have followed the same sequence of steps as for churn, so here we directly discuss the results obtained.

Results:

Final Test AUC Score : 0.879067503868

We have obtained the following ROC AUC plot.

ROC AUC Plot for Train and Test data

c) Up-Selling

In this case we use Up-Selling labels.

Note: We have followed the same sequence of steps as for churn, so here we directly discuss the results obtained.

Results:

Final Test AUC Score : 0.899954163868

We have obtained the following ROC AUC plot.

ROC AUC Plot for Train and Test data

13.3 Logistic Regression

a) Churn Rate

  1. First we perform hyper-parameter tuning using cross-validation to find out the best hyper-parameters.
  2. Next we create the model with the best hyper-parameters.
  3. Now we evaluate the performance of the model.
  4. As features selected using Lasso Regression give a better AUC score, we repeated steps 1), 2) and 3) with the Lasso-selected features and obtained a better AUC score.
  5. Results :

Final Test AUC Score : 0.69

We have obtained the following ROC AUC plot.

ROC AUC Plot for Train and Test data
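For reference, a minimal sketch of the logistic-regression setup (the C grid and the class_weight choice are assumptions):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Tune the inverse regularisation strength C with cross-validated AUC.
search = GridSearchCV(
    LogisticRegression(max_iter=1000, class_weight="balanced"),
    {"C": [0.01, 0.1, 1, 10]},
    scoring="roc_auc", cv=3)
search.fit(X_train_sel, y_train)
```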

b) Appetency

In this case we use Appetency labels.

Note: We have followed the same sequence of steps as for churn, so here we directly discuss the results obtained.

Results:

Final Test AUC Score : 0.69

We have obtained the following ROC AUC plot.

ROC AUC Plot for Train and Test data

c) Up-Selling

In this case we use Up-Selling labels.

Note: We have followed the same sequence of steps as for churn, so here we directly discuss the results obtained.

Results:

Final Test AUC Score : 0.69

We have obtained the following ROC AUC plot.

ROC AUC Plot for Train and Test data

13.4 Random Forest Classifier

a) Churn Rate

  1. First we perform hyper-parameter tuning using cross-validation to find out the best hyper-parameters.
Hyper-parameter Tuning using Cross-validation

2. Next we create the model with the best hyper-parameters.

3. Now we evaluate the performance of the model.

Evaluation of the model

4. As features selected using Lasso Regression give a better AUC score, we repeated steps 1), 2) and 3) with the Lasso-selected features and obtained a better AUC score.

Evaluation of the model using the selected features of Lasso Regression
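A sketch of the random-forest tuning (a randomized search keeps this affordable; the parameter ranges are illustrative assumptions):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param_dist = {
    "n_estimators": [200, 500, 1000],
    "max_depth": [None, 10, 20],
    "min_samples_leaf": [1, 5, 10],
}
search = RandomizedSearchCV(
    RandomForestClassifier(class_weight="balanced", n_jobs=-1),
    param_dist, n_iter=10, scoring="roc_auc", cv=3, random_state=42)
search.fit(X_train_sel, y_train)
```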

5. Results :

Final Test AUC Score : 0.75306

We have obtained the following ROC AUC plot.

ROC AUC Plot for Train and Test data

b) Appetency

In this case we use Appetency labels.

Note: We have followed the same sequence of steps as for churn, so here we directly discuss the results obtained.

Results:

Final Test AUC Score : 0.86786

We have obtained the following ROC AUC plot.

ROC AUC Plot for Train and Test data

c) Up-Selling

In this case we use Up-Selling labels.

Note: We have followed the same sequence of steps as for churn, so here we directly discuss the results obtained.

Results:

Final Test AUC Score : 0.86786

We have obtained the following ROC AUC plot.

ROC AUC Plot for Train and Test data

14. Performance Table of Various Models

Finally, we compare the performance of the various models in tabular form.

Performance table of the test AUC scores of the different models (values collected from the results reported above):

Model               | Churn  | Appetency | Up-Selling | Score (mean AUC)
--------------------|--------|-----------|------------|-----------------
Gradient Boosting   | 0.7602 | 0.8829    | 0.9032     | 0.8487
XGBoost             | 0.7561 | 0.8791    | 0.9000     | 0.8450
Logistic Regression | 0.69   | 0.69      | 0.69       | 0.69
Random Forest       | 0.7531 | 0.8679    | 0.8679     | 0.8296

15. Conclusion

  1. Gradient Boosting Classifier and XGBoost Classifier are the models that gave the best performance on this data set.
  2. Random Forest also performed fairly well.
  3. Logistic Regression gave the worst performance on this data set, with overfitting.
  4. Hence we can observe that ensemble learning methods perform best on this data set.
  5. We can also see that the up-selling AUC is somewhat higher compared to the AUCs of the other target variables.

16. Final Result :

Gradient Boosting Classifier has given the best performance on this data set, with a Score of 0.8487.

17. References :

  1. http://proceedings.mlr.press/v7/niculescu09/niculescu09.pdf
  2. https://cs.mtsu.edu/~cen/6350/kdds/Peter.pdf
  3. KDD Cup (2012). Retrieved from: http://www.sigkdd.org/kddcup
  4. Niculescu-Mizil, A., Perlich, C., Swirszcz, G., Sindhwani, V., Liu, Y., Melville, P., Wang, D., Xiao, J., Hu, J., Singh, M., Shang, W., Zhu, Y. (2009). Winning the KDD Cup Orange Challenge with Ensemble Selection. Retrieved from: http://vikas.sindhwani.org/KDDCup-jmlr09.pdf
  5. https://www.kaggle.com/yasmino/kdd-cup-2009-data-analysis
  6. https://www.kdd.org/kdd-cup/view/kdd-cup-2009/Intro
