3 Challenges While Working on Classification — the Arvato Case


The Goal

Predicting trends and identifying possible new customers is one of the biggest challenges and applications in machine learning and data science, and one that businesses heavily request.

Unfortunately, building a classifier from data and making accurate predictions often requires a lot of work, and the quality of the data and of its processing is crucial for the final results.

The final goal is to participate in the Kaggle competition on the Arvato dataset, and we will go over the biggest challenges that occurred while producing a capable model.

The task is to correctly predict new possible customers for a mail handout marketing campaign for the company, which offers products by mail order. This mostly comes down to a binary classification problem.

The solution we are going to test for this classification relies on typical machine learning algorithms, so feeding them a clean and meaningful dataset as input is very important.

The company Arvato handed out four datasets of German demographic and customer data to build the model on, characterized by heavy class imbalance, a relatively high number of features (already partially tidied up) and a high number of samples.

The datasets available were split among four different CSV files:

  1. Demographics data for the general population of Germany; 891 211 persons x 366 features.
  2. Demographics data for customers of a mail-order company; 191 652 persons x 369 features.
  3. Demographics data for individuals who were targets of a marketing campaign; 42 982 persons x 367 features.
  4. Demographics data for individuals who were targets of a marketing campaign; 42 833 persons x 366 features.

Along with them, a descriptor file was available that briefly describes most of the features, their context and the meaning of their values.

The number of features for the standard datasets was 366; the mail marketing campaign training set carries one additional feature, RESPONSE, useful to train a binary classifier. The final evaluation for the competition is performed on the fourth (and last) dataset, and the AUC score is used as the metric (we will see later why).

Now that the contest is all set, we can go over the challenges that inevitably occurred.

1. Understanding the data

The main challenge during the preprocessing phase was interpreting the different features provided.

The final steps necessary were:

  1. Replace the values representing unknown data in each feature with NaNs, so that the DataFrame can later be imputed
  2. Rework the categorical columns and ordinal-encode them for imputation
  3. Use IterativeImputer from scikit-learn with n_nearest_features=4 to derive the missing data
  4. One-hot encode all features that represent categories according to the attached description file
  5. Reduce memory usage to the minimum feasible

But it was not as easy as that; the issues here were countless:

  • Though a descriptor file was made available, it was partial and many feature names were different from the dataset names, which required hunting for the individual labels in the DataFrame.
  • Most of the features were numerical but their meaning was not! They were already encoded as ordinals and a lot of time was necessary to understand what was what.
  • Most of the features used different conventions to mark missing values: they were either absent from the input or represented as 0s, -1s or 9s, and several of these conventions could apply to a single label at the same time. A detailed reading of the descriptor was necessary.
  • Selecting the imputer took a lot of time, and no insight into the best choice was readily available before fitting the models at a later stage.
  • Refactoring, refactoring, refactoring, refactoring, refac…

Hence, the final refactored function to clean up and preprocess the datasets came down to a single pipeline, sketched below.
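(This is a condensed, illustrative version: the unknown-value map, the categorical column names and the imputer settings are stand-ins for the example; the full implementation lives in the notebook linked at the end.)

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Illustrative mapping of "unknown" codes per feature, built from the descriptor file.
UNKNOWN_VALUES = {"CAMEO_DEUG_2015": [-1, "X"], "ALTERSKATEGORIE_GROB": [0, 9]}
# Illustrative list of categorical columns to one-hot encode after imputation.
CATEGORICAL_COLS = ["CAMEO_DEUG_2015", "D19_KONSUMTYP"]

def clean_and_preprocess(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()

    # 1. Replace the codes that mean "unknown" with NaN so they can be imputed.
    for col, codes in UNKNOWN_VALUES.items():
        if col in df.columns:
            df[col] = df[col].replace(codes, np.nan)

    # 2. Ordinal-encode the remaining object columns so the imputer only sees numbers.
    for col in df.select_dtypes(include="object").columns:
        df[col] = df[col].astype("category").cat.codes.replace(-1, np.nan)

    # 3. Impute missing values from the four most correlated features.
    imputer = IterativeImputer(n_nearest_features=4, random_state=42)
    df = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

    # 4. One-hot encode the categorical features (rounded back to integer codes).
    present = [c for c in CATEGORICAL_COLS if c in df.columns]
    df[present] = df[present].round().astype(int)
    df = pd.get_dummies(df, columns=present)

    # 5. Downcast to float32 to reduce memory usage.
    return df.astype(np.float32)
```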

Challenge #1: there is always something that can be improved in the preprocessing of the data, and a lot of time needs to be invested to understand the features and their real meaning. Nailing down the proper data-handling functions early is FUNDAMENTAL to save time over the countless iterations.

2. Customer segmentation

With the first two datasets it is now possible to see the similarities between the general German demographics and the company's customers.

The most common unsupervised learning method is clustering, which is used for exploratory data analysis to find hidden patterns or grouping in data. The clusters use a measure of similarity defined by metrics such as Euclidean distance.

Such algorithms could be for example K-Means Clustering, Agglomerative Hierarchical Clustering, Gaussian Mixture Models, DBSCAN, Mean-Shift Clustering and more.

K-Means clustering is adopted here, but further implementations could surely lead to better results.

This part was also challenging due to my limited knowledge of unsupervised learning, so among the various algorithms I chose to focus on a single technique that could provide relatively quick results.

First the data was scaled, and then a Principal Component Analysis (PCA) was run to reduce the complexity.

We can see that the number of components explaining at least 90% of the variance is ~170, so we refit the PCA model with this many components to reduce complexity.
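For reference, a minimal sketch of this scaling-plus-PCA step (the name azdias_clean is just an assumed placeholder for the preprocessed general-population DataFrame):

```python
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Scale first, then inspect the cumulative explained variance of a full PCA.
scaler = StandardScaler().fit(azdias_clean)
scaled = scaler.transform(azdias_clean)

pca_full = PCA().fit(scaled)
n_components = int((pca_full.explained_variance_ratio_.cumsum() < 0.90).sum()) + 1
print(f"Components needed for >= 90% of the variance: {n_components}")  # ~170 here

# Refit with the reduced number of components to cut complexity.
pca = PCA(n_components=n_components)
reduced = pca.fit_transform(scaled)
```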

I then fit the K-Means clustering model with a variable number of clusters to produce the elbow chart and find a number that trades off error against complexity.

The value I selected was 10 clusters, although the elbow was not very prominent.

I used this value to finally assess the cluster difference between the overall German demography and our customers.
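A sketch of the elbow search and the population-vs-customers comparison might look like this (reduced, pca and scaler come from the previous step; customers_clean is an assumed name for the preprocessed customer data with the same columns):

```python
import numpy as np
from sklearn.cluster import KMeans

# Elbow chart: fit K-Means for a range of k and record the inertia (within-cluster SSE).
inertias = []
for k in range(2, 21):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(reduced)
    inertias.append(km.inertia_)

# Final model with the chosen number of clusters.
kmeans = KMeans(n_clusters=10, n_init=10, random_state=42).fit(reduced)

# Compare how the general population and the customers distribute over the clusters.
pop_clusters = kmeans.predict(reduced)
cust_clusters = kmeans.predict(pca.transform(scaler.transform(customers_clean)))
pop_share = np.bincount(pop_clusters, minlength=10) / len(pop_clusters)
cust_share = np.bincount(cust_clusters, minlength=10) / len(cust_clusters)
print((cust_share - pop_share).round(3))  # positive: cluster over-represented among customers
```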

In the notebook provided on GitHub (see the end of the article), it is also shown how one could map each cluster back to the features weighing the most on it, providing a better picture of the similarities.

Challenge #2: when many features are present, a reduction is necessary to improve the readability of the data, but this alone does not help much with human-readable visualization. Pinpointing the features that weigh the most on each cluster during unsupervised learning helps a lot in understanding the similarities between datasets.

3. “Bring balance to the force”

Here comes the core of the task, since the data available in the third dataset is heavily imbalanced: almost 99% of the contacted people did not respond to the marketing mail campaign. SURPRISE!

We have therefore very few positive samples in the output for this binary classification problem, with the minority class represented by the people who responded to the campaign.

Here the usual metrics no longer work, so I had to understand how to properly evaluate the performance of the models I would create.

Plain accuracy or precision are not reliable metrics here, since they score misleadingly high thanks to the strong performance on the over-represented class. Therefore the Area Under the Curve (AUC) score of the ROC curve is used. For more detailed information please look at this very good article.
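A quick toy example shows why: with a 99/1 split, always predicting the majority class already reaches 99% accuracy, while the ROC AUC correctly flags it as uninformative.

```python
from sklearn.metrics import accuracy_score, roc_auc_score

# Toy example mimicking the ~99/1 class split of the mailout data.
y_true = [0] * 99 + [1]
y_pred = [0] * 100            # always predict the majority class
y_score = [0.0] * 100         # constant scores carry no ranking information

print(accuracy_score(y_true, y_pred))   # 0.99, looks excellent
print(roc_auc_score(y_true, y_score))   # 0.5, no better than random guessing
```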

For each model tested, 30% (sometimes even 40%) of the data from the third dataset was used for validation, while the rest was used for training.

Train data (blue), Validation data (orange)

As a benchmark I first selected a LinearSVC model from scikit-learn without further optimization and, as can be seen in the picture, the precision and recall for the minority class are both 0, as expected. The model also scored a low AUC (0.63) on the validation set (orange curve).
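The benchmark amounts to something like the following (X and y stand for the preprocessed mailout features and the RESPONSE column; the exact split and parameters used in the notebook may differ):

```python
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report, roc_auc_score

# Hold out 30% of the (still imbalanced) third dataset for validation.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

svc = LinearSVC(max_iter=10000).fit(X_train, y_train)

# LinearSVC has no predict_proba; the decision function provides the ranking score for AUC.
print(classification_report(y_val, svc.predict(X_val), zero_division=0))
print("Validation AUC:", roc_auc_score(y_val, svc.decision_function(X_val)))
```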

I then focused on improving the minority-class precision and recall as well as the validation AUC, which unfortunately still scored poorly even after many tryouts.

Of course, I could use my hard-earned cleaning function to preprocess this dataset, but first I had to make further adjustments to smooth out all the little issues that occurred during processing.

The reindex function from pandas came in handy: it made it possible to work with the same labels and column order, since the one-hot encoding could otherwise produce different dummy variables depending on the available data.
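In code, the alignment is essentially a one-liner (the names follow the earlier sketch and are illustrative):

```python
# One-hot encoding can yield different dummy columns on different data, so the processed
# test set is aligned to the training columns; missing dummies are filled with zeros.
mailout_test_processed = clean_and_preprocess(mailout_test)
mailout_test_aligned = mailout_test_processed.reindex(columns=X_train.columns, fill_value=0)
```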

The whole concept of handling imbalanced data was new to me, so I had to do my research to understand the best techniques to tackle it. I had the pleasure of reading several articles; a beginner-friendly one can be found here.

Most of it comes down to various sampling techniques applied to the dataset and to weighting the outputs during the training process.

A combination of undersampling techniques for the majority class (e.g. random undersampling, Tomek links) and oversampling techniques for the minority class (random oversampling, or synthetic sampling such as SMOTE and ADASYN) can be used, with the aim of obtaining a balanced dataset for training our model, as sketched below.
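A minimal sketch with imblearn could look like this; the sampling ratios and the classifier are illustrative, not the tuned values:

```python
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.tree import DecisionTreeClassifier

# SMOTE brings the minority class up to 10% of the majority, then random undersampling
# reduces the majority down to a 2:1 ratio, before fitting the classifier.
resampled_model = Pipeline(steps=[
    ("over", SMOTE(sampling_strategy=0.1, random_state=42)),
    ("under", RandomUnderSampler(sampling_strategy=0.5, random_state=42)),
    ("model", DecisionTreeClassifier(max_depth=5, random_state=42)),
])
# Samplers act only during fit; prediction uses the original data unchanged.
resampled_model.fit(X_train, y_train)
```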

For each model I performed hyperparameter optimization with scikit-learn's GridSearchCV.
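For example, a grid search over the pipeline above, scored with ROC AUC (the parameter grid is illustrative):

```python
from sklearn.model_selection import GridSearchCV

# Hyperparameter search scored on ROC AUC, the same metric used for the final evaluation;
# "resampled_model" is the imblearn pipeline from the previous sketch.
param_grid = {
    "model__max_depth": [3, 5, 8],
    "model__min_samples_leaf": [1, 10, 50],
}
search = GridSearchCV(resampled_model, param_grid, scoring="roc_auc", cv=5, n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_, round(search.best_score_, 3))
```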

Improvements were observed when introducing the balanced ensembles from imblearn, which randomly undersample the majority class so that each estimator works with roughly balanced inputs.

It was observed that these ensembles produced a lot of false minority-class predictions. One idea to reduce the issue is to apply the aforementioned sampling techniques to balance the classes instead of the random sampling performed by the ensembles. Unfortunately, up to this point I was not able to use them successfully, with some progress only in the neural network approach, and much more time should definitely be spent on fine-tuning them to find the best combination and algorithm for the issue at hand.

The final selection of models tested, using scikit-learn, imblearn and PyTorch, was:

  • Support Vector Classification (linear)
  • Balanced Random Forest classifier from imblearn
  • Balanced Bagging classifier from imblearn with Decision Tree estimator
  • A deep neural network (not developed in depth, for brevity)

In the end the best performance was achieved with the balanced bagging classifier with a decision tree estimator (see the GitHub repository at the end for details of the comparison):

ROC for Bagging Classifier w/ Decision Tree estimator

The bagging classifier with decision tree estimator works better (AUC 0.76) than both the balanced random forest model (AUC 0.74) and the simple linear support vector classifier (AUC 0.64), but overall the differences among the balanced solutions tested were minor.
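For completeness, a minimal sketch of such a balanced bagging model with imblearn (its base estimator is a decision tree by default; n_estimators here is illustrative, not the tuned value):

```python
from imblearn.ensemble import BalancedBaggingClassifier
from sklearn.metrics import roc_auc_score

# Balanced bagging: each bootstrap sample is randomly undersampled so that every base
# estimator (a decision tree by default) sees roughly balanced classes.
bbc = BalancedBaggingClassifier(n_estimators=100, random_state=42)
bbc.fit(X_train, y_train)

val_scores = bbc.predict_proba(X_val)[:, 1]
print("Validation AUC:", roc_auc_score(y_val, val_scores))
```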

A neural network was also set up to start testing the deep learning approach, in order to verify whether a quick and significant improvement could be achieved.

Neural net result (2 hidden layers)

Even though the first results look promising (the AUC is still low, but precision is comparable with the other models and recall is non-zero), hyperparameter tuning is necessary, as well as the exploration of other network structures.
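As a rough idea of the direction taken, here is a minimal PyTorch sketch of a two-hidden-layer binary classifier; the layer sizes, dropout and class weighting are assumptions rather than the exact network used:

```python
import torch
import torch.nn as nn

class MailoutNet(nn.Module):
    """Two hidden layers, one output logit for the binary RESPONSE target."""
    def __init__(self, n_features: int):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(n_features, 128), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(128, 32), nn.ReLU(),
            nn.Linear(32, 1),
        )

    def forward(self, x):
        return self.layers(x).squeeze(-1)   # raw logits; apply sigmoid for probabilities

model = MailoutNet(n_features=X_train.shape[1])
# pos_weight up-weights the rare positive class inside the loss (~99:1 here).
criterion = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([99.0]))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```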

Overall the tuning and optimization of the above discussed strategies could lead to further improvements.

For the purpose of this project we simply select the model with the best ROC-AUC score, which is the bagging classifier with decision tree estimator.

Challenge #3: when handling imbalanced data, correctly predicting the minority class is very hard, and out of the box the models perform extremely poorly on it. A lot of time must be spent understanding how to counter the imbalance and fine-tuning the inputs; after that, model selection should offer better results overall. The neural network implementation is also promising but very time-consuming to experiment with; starting from structures found in the literature could be a good way to improve this solution.

Conclusion

Overall the model was not tuned to perfection, and the Kaggle leaderboard shows that much higher scores have been achieved.

The model scored 0.7399 in the Kaggle competition (position 169), where at the time of writing the leader reaches almost 0.85, meaning there is still a lot of room for improvement.

The end-to-end approach was the following:

  • Understanding the data
  • Reducing memory requirements and redundancies, simplifying the features
  • Properly selecting the categorical features
  • Imputing the missing values
  • One-hot encoding the categorical features
  • Clustering the customers and Germany datasets to verify similarities (this could also be tested as input to the following models, which was not done here)
  • Verifying the mail dataset output distribution for model training → imbalanced
  • Verifying the performance of a simple model
  • Introducing the sampling techniques to reduce the imbalance
  • Selecting the best performing model according to the AUC score, which is used for the final Kaggle competition

The main 3 challenges we faced were:

  • Challenge #1: there is always something that can be improved in the preprocessing of the data, and a lot of time needs to be invested to understand the features and their real meaning. Nailing down the proper data-handling functions early is FUNDAMENTAL to save time over the countless iterations.
  • Challenge #2: when many features are present, a reduction is necessary to improve the readability of the data, but this alone does not help much with human-readable visualization. Pinpointing the features that weigh the most on each cluster during unsupervised learning helps a lot in understanding the similarities between datasets.
  • Challenge #3: when handling imbalanced data, correctly predicting the minority class is very hard, and out of the box the models perform extremely poorly on it. A lot of time must be spent understanding how to counter the imbalance and fine-tuning the inputs; after that, model selection should offer better results overall. The neural network implementation is also promising but very time-consuming to experiment with; starting from structures found in the literature could be a good way to improve this solution.

The project was a perfect example of how to work with messy data that needs a lot of preprocessing. It was a chance to apply what was learned during the Udacity Data Scientist course, such as how to handle categorical and numerical data, and to explore machine and deep learning techniques further. The hands-on experience with a heavily imbalanced dataset was also very important to understand the value of quality datasets.

Further improvements can be achieved through better data sampling techniques (under-, over- and synthetic sampling), through proper weighting and architecture especially in the deep learning approach, or more simply by obtaining more real samples of the minority class. A further improvement could also come from the preprocessing step, through better feature selection or reduction.

I’d like to thank Udacity and Arvato for the opportunity to work on the topic, get my hands dirty and gather precious experience.

For more about this analysis, see my GitHub repository, linked here.
