Customer Relationship Management (CRM)

Customer Relationship Prediction : KDD Cup 2009

Vishnu D. Pathak

--

1. INTRODUCTION

Customer relationship management (CRM) is a technology for managing all your company’s relationships and interactions with customers and potential customers. The goal is simple: Improve business relationships. A CRM system helps companies stay connected to customers, streamline processes, and improve profitability.

When people talk about CRM, they are usually referring to a CRM system, a tool that helps with contact management, sales management, productivity, and more.

A CRM solution helps you focus on your organization’s relationships with individual people — including customers, service users, colleagues, or suppliers — throughout your lifecycle with them, including finding new customers, winning their business, and providing support and additional services throughout the relationship.

The KDD Cup 2009 offers the opportunity to work on large marketing databases from the French Telecom company Orange to predict the propensity of customers to switch provider (churn), buy new products or services (appetency), or buy upgrades or add-ons proposed to them to make the sale more profitable (up-selling).

In a CRM system, the most practical way to build knowledge about customers is to produce scores. A score (the output of a model) is an evaluation, for every instance, of a target variable of interest (i.e. churn, appetency or up-selling). Tools that produce scores make it possible to project quantifiable information onto a given population. The score is computed from input variables that describe the instances. Scores are then used by the information system (IS), for example, to personalize the customer relationship. Orange Labs has developed an industrial customer analysis platform able to build prediction models with a very large number of input variables. This platform implements several processing methods for instance and variable selection, prediction and indexation, based on an efficient model combined with variable selection regularization and model averaging. The main characteristic of this platform is its ability to scale to very large datasets with hundreds of thousands of instances and thousands of variables. The rapid and robust detection of the variables that contribute most to the output prediction can be a key factor in a marketing application.

2. PROBLEM STATEMENT

The KDD Cup 2009 challenge was to predict, from customer data provided by the French Telecom company Orange, the propensity of customers to switch providers (churn), buy new products or services (appetency), or buy upgrades or add-ons (up-selling).

The task is to estimate the churn, appetency and up-selling probability of customers, hence there are three target values to be predicted.

  • Churn Rate/Attrition Rate: Customer churn is the percentage of customers who stopped using your company’s product or service during a certain time frame. You can calculate the churn rate by dividing the number of customers you lost during that period (say, a quarter) by the number of customers you had at the beginning of it. For example, if you start your quarter with 400 customers and end with 380, your churn rate is 5%, because you lost 20 of your 400 customers.
  • Appetency: Appetency means a strong desire for something. The appetency of a customer is therefore the inclination or natural tendency to long for the services or products of a particular brand or company. For example, iPhone fans form lines outside Apple stores every September to get their hands on the new smartphone as soon as possible.
  • Up-selling: Up-selling is the practice of giving customers the option to buy an item that is slightly better than the one they are considering. To generate an up-sale, a salesperson may offer a more expensive product, suggest an upgrade, or convince the customer to purchase add-ons. For example, an airline prompts a passenger flying coach to upgrade to a first-class seat as part of the check-in process.

3. DATASET

Two versions of the dataset are available for the same problem; they differ only in size. Each version has a training and a testing set, and both contain 50000 examples. The training and testing data of the larger version have 15000 columns, while those of the smaller version contain only 230. In the smaller dataset, the first 190 variables are numerical and the last 40 are categorical. For the sake of computation, the smaller dataset has been used here. The datasets are available at https://www.kdd.org/kdd-cup/view/kdd-cup-2009/Data, which also includes the appetency, churn and up-selling labels (.labels format).

Data Format:

  • One header line with the variable names
  • One line per instance
  • There are missing values

Labels:

  • The target values (.labels files) have one example per line in the same order as the corresponding data files.
  • Churn, appetency, and up-selling are three separate binary classification problems. The target values are +1 or -1. We refer to examples having +1 (resp. -1) target values as positive (resp. negative) examples.
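As a minimal loading sketch (not the author’s exact code), assuming the small-version file names as distributed on the KDD Cup page:

```python
import pandas as pd

# Tab-separated data with one header line; missing values are left blank.
train_data = pd.read_csv("orange_small_train.data", sep="\t")

# Each .labels file holds one +1/-1 value per line, in the same order
# as the corresponding data file.
churn = pd.read_csv("orange_small_train_churn.labels", header=None)[0]
appetency = pd.read_csv("orange_small_train_appetency.labels", header=None)[0]
upselling = pd.read_csv("orange_small_train_upselling.labels", header=None)[0]

print(train_data.shape)      # expected: (50000, 230)
print(churn.value_counts())  # strong imbalance between -1 and +1
```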

4. MAPPING THE REAL-WORLD PROBLEM TO MACHINE LEARNING PROBLEM

It is a binary classification problem, with two classes (+1 and -1) for each of churn, appetency and up-selling.

Evaluation metrics

The performance is evaluated according to the arithmetic mean of the AUC for churn, appetency, and up-selling. This is known as the “Score”.

The goal is to get the best possible AUC Score.
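As a small sketch, the Score can be computed with scikit-learn as follows (the argument names are hypothetical placeholders for the three targets and their predicted scores):

```python
from sklearn.metrics import roc_auc_score

def kdd_score(y_churn, s_churn, y_app, s_app, y_up, s_up):
    """Arithmetic mean of the three per-task AUCs (the challenge 'Score')."""
    aucs = [roc_auc_score(y_churn, s_churn),
            roc_auc_score(y_app, s_app),
            roc_auc_score(y_up, s_up)]
    return sum(aucs) / len(aucs)
```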

5. EXPLORATORY DATA ANALYSIS

First we load the provided data set into our kernel.

10 Instances of the loaded train data (train_data.head(10))

Just by going through the first 10 instances, we can see that this data contains numerous NaN values for both categorical and numerical variables; we will need to find a way to deal with this problem.

Below is the plot of percentile values in the 1–100 range with a step of 10 (i.e. the 10th, 20th, 30th, ..., 90th percentiles) for the feature Var73; the same was done for the other features as well.

This EDA was done to observe how the values of each feature are distributed with respect to the class labels (-1 and 1), and whether we can tell from a feature alone whether an instance belongs to class -1 or 1. But looking at the box plots we can hardly say anything, as the distributions of many variables for class labels -1 and 1 largely overlap.
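A minimal sketch of the percentile computation behind such a plot, reusing train_data and churn from the loading sketch above:

```python
import numpy as np

# Deciles (10th..90th percentiles) of Var73, computed separately per class.
for label in (-1, 1):
    values = train_data.loc[churn == label, "Var73"].dropna()
    print(label, np.percentile(values, np.arange(10, 100, 10)))
```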

Analysis of the missing values:

  • First of all, I looked for columns (features) in the dataset that contain only NaN values. This analysis turned up 18 columns whose rows are all NaN. Because these columns are entirely empty, they are useless, so I dropped all 18 of them.
  • Next, I plotted the number of NaN values per column against its frequency, to get an idea of how missing values are distributed in the dataset. I got the following graph:
Number of NAN Values vs Frequency

From the above graph, I came to know that there were still many columns with more than 40k NaN values (out of 50k rows). I decided to drop these unimportant columns from the data, namely the ones with 70% or more NaN values. There were 156 such columns (a count that includes the 18 fully empty ones, since 230 − 156 = 74), and on dropping them I was left with 74 columns.
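A minimal sketch of these two dropping steps with pandas (assuming the train_data frame from above):

```python
# Step 1: drop columns that are entirely NaN (18 columns in this dataset).
empty_cols = train_data.columns[train_data.isna().all()]
train_data = train_data.drop(columns=empty_cols)

# Step 2: drop columns with 70% or more missing values.
nan_ratio = train_data.isna().mean()
train_data = train_data.drop(columns=nan_ratio[nan_ratio >= 0.70].index)

print(train_data.shape)  # 74 columns remained in this project
```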

  • Next, I loaded the churn, appetency and up-selling labels to analyse them.
Loading of churn, appetency and upselling labels data

Then, I found the frequencies of 1’s and -1’s in the labels data by plotting a graph.

Frequency of labels (-1, 1) in appetency, churn and upselling data

From this graph, it was clear that the dataset was imbalanced (a dataset in which the number of data points belonging to each class is unequal).

Analysis of Unique Values:

  • After getting a brief idea of the existing NaN values in the dataset, I analysed the number of unique values occurring in each feature.
Lineplot of occurrence of unique values in each feature

From the above plot, I observed extreme variation in the number of unique values across features.
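A minimal sketch of this unique-value analysis (pandas nunique plus a line plot):

```python
import matplotlib.pyplot as plt

unique_counts = train_data.nunique()  # number of distinct values per feature
unique_counts.plot(kind="line")
plt.xlabel("feature")
plt.ylabel("number of unique values")
plt.show()
```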

6. SPLITTING OF THE DATA

Later, I split the dataset into training and testing sets in the following proportion:

  • 80% of the dataset for training the model.
  • 20% of the dataset for testing and evaluation of the model.
Train data and Test data shape of appetency, churn and upselling
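A minimal sketch of the 80/20 split with scikit-learn, shown for the appetency target (stratifying on the label is one reasonable choice here, as it keeps the class imbalance identical in both splits):

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    train_data, appetency, test_size=0.20, random_state=42,
    stratify=appetency)
print(X_train.shape, X_test.shape)  # 40000 and 10000 rows
```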

7. FEATURE ENGINEERING

Next, we have to perform feature engineering. Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. If the right features are created from raw data, they increase the predictive power of machine learning algorithms.

In this project, I extracted two additional features (the first two items below); two further feature-construction techniques were also considered:

  • Number of NaN values in each row: the value of this feature gives the count of NaN values present in each instance.
A new feature with frequency of NaNs
  • Binary feature indicating the presence or absence of a NaN value for a feature in each instance: 0 indicates absence and 1 indicates presence of NaN values.
A new feature that tells about the presence or absence of NaNs
  • Encoding of the top 5 (or even 10) values of each categorical feature: to avoid the explosion of features mentioned in the IBM Research paper.
Encoding of top 5 values of categorical features
  • Decision Tree: here you can recode each feature by training a decision tree of limited depth (e.g. 2, 3 or 4) on that feature alone and letting the tree directly predict the target. The tree’s predicted probabilities, which are linearly correlated with the target, can then be used as an additional feature.

Another technique that could be used is feature binning. I used neither the decision-tree recoding nor feature binning in this project, as they did not improve the results. A sketch of these feature constructions follows the caption below.

Code to implement a Decision Tree for feature engineering
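A minimal sketch of these feature constructions (Var73 is just an arbitrary example column, and the decision-tree recoding is shown even though it was not used in the end):

```python
from sklearn.tree import DecisionTreeClassifier

# Feature 1: count of NaNs per instance.
# Feature 2: per-feature missingness indicator (0 = present, 1 = missing).
for df in (X_train, X_test):
    df["nan_count"] = df.isna().sum(axis=1)
    df["Var73_isnan"] = df["Var73"].isna().astype(int)

# Decision-tree recoding: fit a shallow tree on a single feature and use its
# predicted positive-class probability as a new feature.
single = X_train[["Var73"]].fillna(0)
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(single, y_train)
X_train["Var73_tree"] = tree.predict_proba(single)[:, 1]
```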

8. HANDLING MISSING VALUES

Earlier, we removed the columns that contained NaN in all of their rows, as well as the columns with ≥70% NaN values. But according to the null-value analysis performed earlier and the plot obtained, there were still columns left with missing values. Here I deal with those missing values.

a) Numerical Columns:

There are many strategies that can be implemented to handle missing values in numerical features. Some of the most common approaches are as follows:

  1. Replacing NaN with the mean of that column.
  2. Similarly, the mode or median can also be used.
  3. Forward fill or backward fill.
  4. Interpolation of the data.
  5. Replacing NaN with 0s.

For this project, I have imputed the missing values with the mean of the respective column.

Replacing NaN values with the mean of the respective column

b) Categorical Columns:

Just as there are strategies to handle missing values in numerical columns, there are a few for the categorical columns. Some of which are as follows:

  1. Replacing the NaN with 0s.
  2. Treating the NaN values as a separate group.

In this project, I have taken the NaN as a different group.

NaNs as a different group
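A minimal sketch of both imputation steps; the means are computed on the training split and reused on the test split to avoid leakage:

```python
# Split the remaining columns by type.
num_cols = X_train.select_dtypes(include="number").columns
cat_cols = X_train.select_dtypes(exclude="number").columns

# Numeric: impute with the column mean of the training split.
col_means = X_train[num_cols].mean()
X_train[num_cols] = X_train[num_cols].fillna(col_means)
X_test[num_cols] = X_test[num_cols].fillna(col_means)

# Categorical: treat NaN as its own group.
X_train[cat_cols] = X_train[cat_cols].fillna("missing")
X_test[cat_cols] = X_test[cat_cols].fillna("missing")
```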

9. HANDLING CATEGORICAL DATA

In many practical data science activities, the data set will contain categorical variables. These variables are typically stored as text values. (Practical Business Python) Since machine learning is based on mathematical equations, keeping categorical variables as-is would cause problems. Some algorithms do not support categorical values at all, and in that case we are left with encoding methodologies. Some of the most prominent methods are:

Encoding Methodologies

https://www.datacamp.com/community/tutorials/encoding-methodologies

Label encoding

In label encoding, we map each category to a number or a label. The labels chosen for the categories have no relationship to each other, so categories that are related or close to each other lose that information after encoding.

Frequency encoding

Frequency encoding uses the frequency of each category as its label. In cases where the frequency is somewhat related to the target variable, it helps the model understand and assign weights in direct or inverse proportion, depending on the nature of the data.

One-hot encoding

In this method, we map each category to a vector of 1s and 0s denoting the presence or absence of that category. The number of vectors depends on the number of categories we want to keep. For high-cardinality features, this method produces a lot of columns, which slows down learning significantly.

Target or Impact or Likelihood encoding

Target encoding is similar to label encoding, except that here the labels are correlated directly with the target. This encoding method brings out the relation between similar categories, but the relations are bounded within the categories and the target itself.

For this project, I have used Frequency encoding.

Frequency encoding of the categorical features
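A minimal sketch of frequency encoding over the categorical columns (cat_cols as defined in the imputation sketch above); categories unseen in the training split map to 0 on the test split:

```python
for col in cat_cols:
    freq = X_train[col].value_counts()  # category -> count in training data
    X_train[col] = X_train[col].map(freq)
    X_test[col] = X_test[col].map(freq).fillna(0)
```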

10. NORMALIZATION OF THE DATA

Normalization is a data preparation technique in machine learning. It is used to change the values of numeric columns in the dataset to a common scale, without distorting the differences in their ranges.

There are a number of normalization techniques available directly in sklearn; here I have used the MinMaxScaler.

Normalisation of appetency data using MinMaxScaler
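A minimal sketch with sklearn’s MinMaxScaler, fitted on the training split only:

```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns per-column min/max
X_test_scaled = scaler.transform(X_test)        # reuses the training ranges
```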

11. FEATURE SELECTION

Feature selection is the process of selecting, from a set of features, those that contribute most to a machine learning prediction. Selecting the wrong features can hurt a prediction badly. There are various strategies that can be employed for the feature selection process, so selecting the right strategy is a key factor.

Feature Selection Strategies

Ridge Regression and LASSO

Ridge regression performs L2 regularization, i.e. it adds a penalty equal to the sum of the squares of the coefficients to the optimization objective.

Lasso regression performs L1 regularization, i.e. it adds a penalty equal to the sum of the absolute values of the coefficients to the optimization objective.

Variance Threshold

It works on the principle that data with low variance carries little information. First the variance of every feature is calculated, and then columns whose variance is less than a certain threshold are dropped. This should be applied only on scaled data.

Chi-squared score

The chi-squared test checks whether there is a significant difference between the observed and expected frequencies of two categorical variables.

Fisher Score

Fisher score selects each feature independently according to its score under the Fisher criterion, which leads to a suboptimal subset of features. It can be used only for categorical data and is typically used for binary classification.

Extremely Randomized Trees

This is a more advanced version of random forest in which the randomness of the splits goes up a notch. As in random forests, a random subset of candidate features is used, but instead of looking for the most discriminative thresholds, thresholds are drawn at random for each candidate feature and the best of these randomly generated thresholds is picked as the splitting rule. This usually reduces the variance of the model a bit more, at the expense of a slightly greater increase in bias.

Here, I have used Lasso Regression.

Using Lasso Regression for feature selection

Initially, I took an arbitrary set of alpha values and, with the help of a line plot, plotted an Alpha vs AUC Score graph to get a good idea of a reasonable range for the parameter.

Alpha vs AUC Score

NOTE: As we can see from the above plot, there is a quick fall; next we zoom in and look for the best alpha between 0 and 0.0005.

Improved Lasso with alpha values in the range 0 and 0.0005
Zoomed in version of the above graph (Alpha vs AUC Score)
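A minimal sketch of such an alpha sweep; the grid of alpha values is illustrative, and for brevity the Lasso’s own predictions are scored here, whereas the project scored a downstream classifier on the selected features:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import roc_auc_score

for alpha in np.linspace(1e-5, 5e-4, 10):
    lasso = Lasso(alpha=alpha).fit(X_train_scaled, y_train)
    kept = np.flatnonzero(lasso.coef_)  # features with non-zero weight
    auc = roc_auc_score(y_test, lasso.predict(X_test_scaled))
    print(f"alpha={alpha:.5f}  features kept={len(kept)}  AUC={auc:.3f}")
```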

12. IMPLEMENTATION OF MACHINE LEARNING MODELS

For this project, I used two machine learning algorithms. Their implementations are as follows:

12.1 Random Forest Classifier:

Random Forest Classifier is a supervised learning algorithm commonly used for classification. A random forest, as the name implies, is a forest of many decision trees that each make their own prediction; at the end, the best result is selected by voting. It is preferred over a single decision tree because it reduces over-fitting by averaging the results. It is an ensemble method.

https://www.researchgate.net/figure/The-flowchart-of-random-forest-RF-for-regression-adapted-from-Rodriguez-Galiano-et_fig3_3
A Random Forest Classifier

We have to apply Random Forest to all three label sets (churn, appetency and upselling) and take the mean of their individual AUC scores.

  • First of all, I performed hyperparameter tuning (grid search) to find the best parameters.
Grid Search to tune the hyperparameters on RF model
  • Next, I created a model with the best hyperparameters obtained from the grid search.
RF with best parameters after hyperparameter tuning
  • Then, I evaluated the model and plotted graphs of the best parameters vs the AUC score obtained during training and testing.
  • Later, I employed Lasso regression to select the most appropriate features, applied grid search on that model and evaluated it in the same way. But I found that the results obtained without feature selection were better, so I did not use Lasso regression afterwards. (A sketch of this workflow appears below.)
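A minimal sketch of this grid-search-plus-refit workflow (the parameter grid is illustrative, not the exact one used in the project):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV

param_grid = {"n_estimators": [100, 300, 500], "max_depth": [5, 10, 20]}
grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid,
                    scoring="roc_auc", cv=3, n_jobs=-1)
grid.fit(X_train_scaled, y_train)

best_rf = grid.best_estimator_  # refit on the full training split
print(grid.best_params_)
print(roc_auc_score(y_test, best_rf.predict_proba(X_test_scaled)[:, 1]))
```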

a. Appetency Data:

  • Before Feature Selection:
Best parameters vs AUC_Score (Appetency)
Classification Report of Appetency data
Scores before applying Lasso Regression

AUC ROC Score on train data is 0.8959841927889465
AUC ROC Score on test data is 0.8367474758567672
  • After Feature Selection:
Best parameters vs AUC_Score (Appetency)
Scores after applying Lasso Regression

AUC ROC Score on train data is 0.8100827274646003
AUC ROC Score on test data is 0.8164073561409673

Three Conclusions:

  1. As we can see, the results got even worse after feature selection.
  2. It seems that n_estimators does not contribute much to the results, i.e. even if we increase n_estimators, there won’t be much increase in the score.
  3. As usual, if we increase the depth, overfitting increases.

b. Churning Data:

Best parameters vs AUC_Score (Churning)
Classification Report of Churning data
Scores

AUC ROC Score on test data is 0.7001838193130551
AUC ROC Score on train data is 0.7346710049897438

c. Upselling Data:

Best parameters vs AUC_Score (Upselling)
Classification Report of Upselling data
Scores

AUC ROC Score on test data is 0.8379723168694151
AUC ROC Score on train data is 0.8737326756484024

12.2 eXtreme Gradient Boosting (XGBoost) Classifier:

XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the gradient boosting framework. XGBoost provides parallel tree boosting (also known as GBDT or GBM) that solves many data science problems in a fast and accurate way.

Evolution of XGBoost

NOTE: An important thing to remember is that the data we are working with is highly imbalanced, so not all algorithms are going to be effective. Random Forest (or even decision trees and XGBoost) can work well with such data, but algorithms like MLPs (multi-layer perceptrons) will not.

Here are a few things you can do to work with such imbalanced data:

  1. Apply appropriate resampling techniques (see the resampling sketch after this list).
  • Upsampling the minority class: adding more samples of the minority class. This should be used when you do not have much data to work with. It causes overfitting and poor generalization if applied before splitting the data into train and test.
  • Downsampling the majority class: removing samples from the majority class. This should be used when working with a large amount of data. It can cause underfitting and poor generalization through the removal of valuable data.

2. Choose appropriate algorithms to work with.

This is an alternative if you want to avoid resampling. Cost-sensitive algorithms like decision trees or random forests suit such data well.

3. Choose the right performance metric.

When you have imbalanced data, using a metric like accuracy is not going to help: you can score as high as the majority-class proportion just by always predicting the majority label. Metrics that provide better insight in such cases are:

  • AUC Score: a measure of separability. It tells how capable the model is of distinguishing between classes. The higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s.
  • F1 Score: the weighted average of precision and recall. (You can also use F1 Score for this project)
  • Confusion Matrix: a table showing correct predictions and types of incorrect predictions.
  • Precision: the number of true positives divided by all positive predictions. Precision is also called Positive Predictive Value. It is a measure of a classifier’s exactness. Low precision indicates a high number of false positives.
  • Recall: the number of true positives divided by the number of positive values in the test data. Recall is also called Sensitivity or the True Positive Rate. It is a measure of a classifier’s completeness. Low recall indicates a high number of false negatives.
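A minimal sketch of upsampling the minority class with scikit-learn, applied to the training split only:

```python
import numpy as np
from sklearn.utils import resample

# Resampling before the train/test split would leak duplicated rows
# into the test set, hence we only touch the training split.
minority = X_train_scaled[y_train.values == 1]
majority = X_train_scaled[y_train.values == -1]
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)

X_balanced = np.vstack([majority, minority_up])
y_balanced = np.array([-1] * len(majority) + [1] * len(minority_up))
```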

Just as I applied the RF classifier to all three label sets, we have to do the same with the XGBoost classifier and find the mean of the AUC scores.

  • First of all, I performed hyperparameter tuning (grid search) to find the best parameters.
Grid Search to tune the hyperparameters on XGB classifier
  • Next, I created a model with the best hyperparameters obtained from the grid search.
XGB with best parameters after hyperparameter tuning
  • Finally, I evaluated the model and plotted graphs of the best parameters vs the AUC score obtained during training and testing. (A sketch of this workflow appears below.)
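A minimal sketch of the XGBoost workflow, again with an illustrative parameter grid; note that XGBoost’s scikit-learn wrapper expects {0, 1} labels, so the +1/-1 targets are remapped first:

```python
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

y_train01 = (y_train == 1).astype(int)  # map -1/+1 to 0/1
y_test01 = (y_test == 1).astype(int)

param_grid = {"n_estimators": [100, 300], "max_depth": [3, 5, 7]}
grid = GridSearchCV(XGBClassifier(), param_grid,
                    scoring="roc_auc", cv=3, n_jobs=-1)
grid.fit(X_train_scaled, y_train01)

print(grid.best_params_)
print(roc_auc_score(y_test01, grid.predict_proba(X_test_scaled)[:, 1]))
```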

a. Appetency Data:

Best parameters vs AUC_Score (Appetency)
Classification Report of Appetency data
Scores

AUC ROC Score on test data is 0.8210211998288638
AUC ROC Score on train data is 0.8338440390638763

b. Churn Data:

Best parameters vs AUC_Score (Churning)
Classification Report of Churning data
Scores

AUC ROC Score on test data is 0.6965773467324506
AUC ROC Score on train data is 0.7560090373832761

c. Upselling Data:

Best parameters vs AUC_Score (Upselling)
Classification Report of Upselling data
Scores

AUC ROC Score on test data is 0.8456694656031765
AUC ROC Score on train data is 0.8664720890825892

13. COMPARISON BETWEEN THE PERFORMANCE OF RF AND XGBoost

Some important conclusions that were inferred from this project:

  1. From the results obtained, we can say that the models are not overfitting and are able to predict the up-selling situation.
  2. The machine learning models, even after some feature engineering, were unable to predict the churn data points well.
  3. The machine learning models performed quite well on the appetency and upselling data.
  4. The average AUC score obtained for the RF classifier (and even XGBoost) was lower than expected because of the churn results, where the models performed poorly (AUC 0.70); they performed well for appetency (AUC 0.84) and upselling (AUC 0.846). The final (average) AUC score was 0.795.
  5. We also tried an MLP classifier with upsampling techniques (on the train data) to predict the data points of all three classes (appetency, churn and upselling), but its performance was even worse than random forest and XGBoost: it was not able to predict the minority class (1).

Here, I have used prettytable to generate the table that encompasses the comparison data; a sketch appears after the caption below.

Comparison between the AUC Scores of XGboost and RF
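A minimal sketch of building such a table with prettytable, filled with the final Scores reported below:

```python
from prettytable import PrettyTable

# Scores as reported in the conclusions above.
table = PrettyTable(["Model", "Mean AUC Score"])
table.add_row(["Random Forest", 0.795])
table.add_row(["XGBoost", 0.787])
print(table)
```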

The best performer is the Random Forest Classifier with a Score of 0.795, while XGBoost has also performed fairly well with a Score of 0.787.

14. REFERENCES

  1. http://proceedings.mlr.press/v7/niculescu09/niculescu09.pdf
  2. https://www.kdd.org/exploration_files/v11-2-12-KDDCUP09_analysis.pdf
  3. https://medium.com/@kushaldps1996/customer-relationship-prediction-kdd-cup-2009-6b57d08ffb0t
