Building Knowledge on the Customer Through Machine Learning
The cost of acquiring new customers is high, so companies are spending more on customer loyalty and retention. Identifying the total value generated by a customer in the entire customer life cycle would help companies in business campaigns and in other activities. So naturally Customer Relationship Management (CRM) becomes a key element of modern marketing strategies.
If we can predict a score that allows us to project, on a given population, quantifiable information then it can be used by the information system (IS) to personalize the customer relationship.
KDD (Knowledge Discovery and Data Mining) Cup 2009 challenge consists of three tasks, predicting the churn, appentency and upselling, through the data provided by the telecom company Orange. The business idea is to :
- Identify the churning customers before they switch operator (churn),
- Identify the new potential customers for the brand (appentency),
- And identify customers who may buy something additional or more profitable items from the brand (upselling)
The challenge is to beat the in-house system developed by Orange Labs For large dataset, in-house AUC score is following:
- Churn : 0.7435
- Appentency : 0.8522
- Up-selling : 0.8975
We have two versions of the data and both have 50,000 samples but the large version contains 15,000 features and the small version contains only 230. The target variable values are +1 or -1 indicating positive and negative class labels respectively.
In the small version of the dataset 40 features are categorical with high cardinality and rest are all numerical. As per challenge rules the performance of the predictions was evaluated according to the average area under the ROC curve of three tasks; churn, appentency and upselling (collectively called score).
In the paper authors explain their approach and the different strategies they adopted that helped them win the competition. Since there was two challenges slow and fast in KDD 2009 cup, the paper summarizes the different approaches for each of them.
The data posed a number of challenges like missing values, high cardinality of categorical variables and so on. After per-processing the data and removing duplicate features they built many different classifiers.
The idea was to generate a library of base classifiers to build an ensemble classifier, which was based on the idea that some of the base classifiers will perform well either in isolation or in combination of other classifiers. They generated classifier libraries using random forests, boosted trees, logistic regression, SVMs, LibLinear, LibSVM, decision trees, Naïve Bayes, kNN, regularized least squares regression and co-clustering. These learning algorithms saw different or reduced feature sets obtained through PCA and through feature selection using filter methods based on Pearson correlation. Calibration using Platt Scaling was applied to make all base models output probabilities.
In total classifier libraries were composed of 500–1000 unoptimized individual models for each of the three tasks of problem. For ensemble selection to prevent overfitting, they adopted two methods. First was bagging — multiple ensembles were built from random subsets of classifiers and averaged together and second method was to initialize the ensemble with a set of N classifiers that have the best uni-model performance on the hillclimbing set containing data not used to train the base classifiers. The resultant test set performance was 0.75 on churn, 0.87 on appetency and 0.90 on up-selling task.
They further improved the performance by constructing 20 additional binary features corresponding to the 20 bins and used decision stumps to identify optimal splitting points using each feature. Predictions from this tree was used as an additional feature which was correlated with the target.
Binary missing indicator feature for groups of missing valued features at once was added as another set of features, which is called constant bi-clusters problem in the bio-informatics domain. Authors used probabilistic bi-clustering algorithm to identify promising bi-clusters which were then used to generate these new indicator features.
Authors attribute their success on the exploration of a large variety of learning methods, ensemble building technique and feature engineering to capture non-linearity in the data and that is what the paper summarizes on.
- Understand the data through EDA
- Missing Values Imputation
- Encode categorical features and normalize real valued features
- Build simple models with complete feature set
- Based on the bias variance trade-off and performance metrics like precision, recall and AUC score, analyze performance by performing feature selection
- Experimenting with SMOTE for upsampling of minority class
- Compare performance with ensembles like boosted trees
- Final feature engineering using K-Means and Decision Stumps
- Trying similar approach for other two tasks
- Deployment on Heroku
1. Exploratory Data Analysis and missing values imputation
We have two versions of the data and both have 50,000 samples but the large version contains 15,000 features and the small version contains only 230. For our use case we shall focus only on the smaller version with 230 features.
df = pd.read_table('orange_small_train.data')
We have 50 thousand samples with 230 features that are all anonymized. Let’s see different data types we have.
We have only numerical and string or categorical data. Next we address the elephant in the room i.e. missing values.
By eyeballing just numerical features, it looks like we have many features with large number of missing values. We must do quantitative analysis and try to calculate following for each feature:
- Number of missing values
- Percentage of missing values
We observe that there are 136 features with more than 90% missing values, so we must get rid of them and there are 9 features with less than 90% but more than 30% missing values.
Looking at the blue line on the plot we decide to drop them too, but only after creating missing value indicator features below. Later on we will see in feature importance section that these features show significant importance.
df_missing_indicator = df.isna().astype(int).add_suffix('_indicator')
Remove all the features with more than 30% missing values and now we can do imputation on remaining features.
threshold = 30
df_short = df.drop([i for i,j in dict(zip(df.columns.values, missing_vals_per)).items() if j > threshold], axis=1)
We are left with 67 features, out of which 39 are numerical and rest are categorical.
We must also check for class imbalances.
For each task we have different target variable, and these plots depict that we have a highly unbalanced class distribution for all three tasks.
We can also check for correlation among numerical features.
Apart from Var22 which is perfectly correlated with Var21, no other feature shows high correlation with any another feature.
Numerical Value Imputation
We tried various numerical value imputation methods like
- Mean replacement,
- Median replacement,
- Max or min replacement,
- Replacing with zero,
- Predicting with KNN
and found that mean replacement is giving better performance.
X_train = X_train.fillna(X_train.mean())
X_test = X_test.fillna(X_train.mean())
Categorical Value Imputation
For categorical values we treated missing values as an ‘unknown’ category.
2. Encoding Categorical Features and experimenting with models
For all 28 categorical variables first we understand the number of unique values in each variable. This will help us pick the right encoding method.
One feature has more than 12000 unique values, while 6 other features has unique value count between 500 and 6000. This shows our categorical features have high cardinality.
There are many types of encodings for categorical features. We tried following encodings with every model we build.
- One Hot Encoding
- Leave One Out Encoding
- Binary Encoding, &
- Frequency Encoding
We found frequency encoding worked best overall for all the three tasks.
Compared to all other encodings logistic regression showed best performance with frequency encoded variables. With AUC of 68.33, precision of 0.10 and recall of 0.81 on minority class. It was also observed that this model least over-fitted compared to other classifiers.
Now when we tried with various upsampling techniques like SMOTE, SMOTE + random downsampling of majority class, we observed that there was no improvement in the performance on the test set and it became harder to prevent overfitting.
Decision Tree Classifier
We also experimented with all encoding techniques and upsampling techniques with decision tree algorithm. Decision tree divides the feature space into axis parallel hyper spheres and therefore can capture the non-linearities in the data very well.
However it did not worked well for our case.
Skipping the results of SVM and random forest because they were not worth mentioning, but you can check them in this jupyter notebook.
XGBoost gave an AUC of 0.72 which is similar to what we got from sklearn’s gradient boosting classifier, without any feature engineering or dimensionality reduction. But the major difference compared to other algorithms was recall of 0.63 and precision of 0.14 on minority class. It is still lower than logistic regression but overall AUC is higher.
We saw earlier winners of this challenge also used ensembles to achieve an AUC of ~0.75 on this task.
Trying XGBoost with SMOTE didn’t increase AUC significantally whereas precision and recall reduced. Hence we discarded that model. Now we can check for the feature importance.
We see only few of the features have some importance rest all have become moot. We see numerical, frequency encoded and indicator features.
After experimenting with combinations of models and upsampling techniques we came to the conclusion that our best performing model is XGBoost with highest AUC scores and reasonable Precision/Recall scores on minority classes.
3. Feature Engineering
- Since we have lot of missing values we added count of nan’s and count of zeros in each sample as two new features.
Number of nan’s feature proved useful while number of zeros did not.
- We also created a new feature using decision tree. We checked for highest percentage of node path followed by the samples and created that interaction as a new feature.
This feature did not improved the model performance.
3. We tried clustering with K-Means++ and experimented with 5 to 3000 clusters but it did not increase our AUC.
4. Predictions of logistic regression were also used as a feature in XGBoost, while this completely altered our feature importance but AUC improved in-significantatally.
Final AUC plot with feature engineering of task churn:
4. Results on other tasks
We also tried similar approaches on other two tasks, appetency and upselling so we will directly share the results of best model which is XGBoost classifier.
Appetency : 0.848
Upselling : 0.849
Now we dump these 3 models in pickle files along with scaler object that we created to standardize numerical features, to be used in deployment pipeline.
5. Deloyement on Heroku
Had little experience with Django before therefor I chose to create pipeline for this model using the django portfolio project which also has other apps of our other machine learning projects.
We made a pipeline to process the input because we want users to send data from browser and the model from back-end will return predictions to the user on our webpage.
Below class takes an array of length 299 as input, which is our raw data sample vector size and
- processes the array,
- encodes categorical data,
- standardizes numerical data,
- adds the features that were used to train the model
and returns class predictions with un-caliberated probability scores.
Now using heroku CLI we can deploy our django app on heroku, and this is how it is looking at this link.
Since we were already using git, next time if we make some changes, we just have to commit the changes and push the app on heroku master.
We can try different feature engineering hacks to come up with new features that may improve our AUC score on all three tasks.
We can also build individual classifiers and take majority voting to make a powerful classifier.
Try different algorithms like CatBoost, which is faster and has less number of hyper-parameters.