Modeling imbalanced datasets
One of the major problems of training a machine learning algorithm on imbalanced data is that the algorithm tends to predict the majority class. Worse, by always predicting the majority class, an algorithm can appear to perform well while having learned nothing useful about the minority class. In this article, I will share some methods for dealing with such data, using a problem from the banking sector as an example.
One of the main aims of most businesses is to find new customers while retaining existing ones. In the banking sector, where competition is high, retaining customers can be difficult because customers have many alternatives to choose from.
In most banks, the investment and portfolio department wants to identify customers who are likely to subscribe to their term deposits. As a result, marketing managers have a heightened interest in carefully tuning their direct campaigns through rigorous selection of contacts; the goal is to find a model that can predict which future clients will subscribe to a term deposit. An effective predictive model can increase campaign efficiency: it becomes possible to identify the customers most likely to subscribe and direct marketing efforts toward them.
This increased efficiency helps conserve bank resources such as human effort, phone calls, and time. The bank therefore collected a large dataset containing the profiles of customers who subscribed to term deposits and of those who did not. The goal is to develop a robust predictive model that identifies which customers will or will not subscribe to a term deposit in the future. The data contains the following features.
1 — age (numeric)
2 — job: type of job (categorical: ‘admin.’, ‘blue-collar’, ‘entrepreneur’, ‘housemaid’, ‘management’, ‘retired’, ‘self-employed’, ‘services’, ‘student’, ‘technician’, ‘unemployed’, ‘unknown’)
3 — marital: marital status (categorical: ‘divorced’, ‘married’, ‘single’, ‘unknown’; note: ‘divorced’ means divorced or widowed)
4 — education: education level (categorical: ‘basic.4y’, ‘basic.6y’, ‘basic.9y’, ‘high.school’, ‘illiterate’, ‘professional.course’, ‘university.degree’, ‘unknown’)
5 — default: has credit in default? (categorical: ‘no’, ‘yes’, ‘unknown’)
6 — housing: has a housing loan? (categorical: ‘no’, ‘yes’, ‘unknown’)
7 — loan: has a personal loan? (categorical: ‘no’, ‘yes’, ‘unknown’)
Related to the last contact of the current campaign:
8 — contact: contact communication type (categorical: ‘cellular’, ‘telephone’)
9 — month: last contact month of the year (categorical: ‘jan’, ‘feb’, ‘mar’, …, ‘nov’, ‘dec’)
10 — day_of_week: last contact day of the week (categorical: ‘mon’, ‘tue’, ‘wed’, ‘thu’, ‘fri’)
11 — duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0, then y=‘no’). Yet the duration is not known before a call is performed, and after the end of the call ‘y’ is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
12 — campaign: number of contacts performed during this campaign and for this client (numeric, includes the last contact)
13 — pdays: number of days that passed after the client was last contacted from a previous campaign (numeric; 999 means the client was not previously contacted)
14 — previous: number of contacts performed before this campaign and for this client (numeric)
15 — poutcome: outcome of the previous marketing campaign (categorical: ‘failure’, ‘nonexistent’, ‘success’)
Social and economic context attributes:
16 — emp.var.rate: employment variation rate — quarterly indicator (numeric)
17 — cons.price.idx: consumer price index — monthly indicator (numeric)
18 — cons.conf.idx: consumer confidence index — monthly indicator (numeric)
19 — euribor3m: Euribor 3-month rate — daily indicator (numeric)
20 — nr.employed: number of employees — quarterly indicator (numeric)
Output variable (desired target):
21 — y: has the client subscribed to a term deposit? (binary: ‘yes’, ‘no’)
Given the number of categorical features and the imbalanced nature of the data, several preprocessing steps are essential to obtain a good model.
A quick count of the distinct values in the y column shows an unequal distribution.
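A minimal sketch of that count, using a synthetic stand-in for the `y` column (the real values would come from the bank marketing DataFrame):

```python
import pandas as pd

# Illustrative stand-in for the target column; in practice this is the
# "y" column of the bank marketing DataFrame described above.
y = pd.Series(["no"] * 900 + ["yes"] * 100)

# Count each class to expose the imbalance.
counts = y.value_counts()
print(counts)

# Relative frequencies make the imbalance explicit.
print(counts / len(y))
```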
The data were preprocessed by one-hot encoding the categorical columns, and outliers in the numeric columns were removed using the interquartile range (IQR).
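Those two steps can be sketched as follows; the toy DataFrame, the column names, and the 1.5 × IQR cutoff are illustrative assumptions, not the project's actual code:

```python
import pandas as pd

# Small illustrative frame standing in for the full bank DataFrame.
df = pd.DataFrame({
    "age": [25, 30, 35, 40, 45, 120],  # 120 is an obvious outlier
    "marital": ["single", "married", "single",
                "divorced", "married", "single"],
})

# One-hot encode the categorical column(s).
df = pd.get_dummies(df, columns=["marital"])

# IQR rule: keep rows within [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
# for each numeric feature.
q1 = df["age"].quantile(0.25)
q3 = df["age"].quantile(0.75)
iqr = q3 - q1
mask = df["age"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df_clean = df[mask]
```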
MinMaxScaler was used to scale the data, enabling the PCA algorithm to decompose the data effectively.
PCA was then applied to reduce the dimensionality of the data, enabling the algorithms to fit the model more effectively. The code for the PCA step can be found in the repository; about 95% of the variance was retained after the decomposition.
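A sketch of the scaling and PCA step, using a random matrix in place of the preprocessed features; passing a float to `n_components` tells scikit-learn's PCA to keep just enough components to retain that fraction of the variance:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

# Random stand-in for the preprocessed feature matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))

# Scale features to [0, 1] before PCA.
X_scaled = MinMaxScaler().fit_transform(X)

# A float n_components keeps the smallest number of components
# that explains at least 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)
print(X_reduced.shape, pca.explained_variance_ratio_.sum())
```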
MODELLING AND PERFORMANCE EVALUATION
As mentioned earlier, the data is imbalanced. This problem can be addressed with the imbalanced-learn (imblearn) library through oversampling or undersampling. Two algorithms are worth mentioning in this regard: SMOTE and Tomek links.
The Tomek links algorithm removes majority-class points that participate in a Tomek link, i.e., a pair of points from different classes that are each other's nearest neighbors. This is an undersampling technique.
The Synthetic Minority Over-sampling Technique (SMOTE) generates new minority-class points by interpolation: for each minority sample, it picks one of its k nearest minority-class neighbors and creates a synthetic point at a random position along the line segment joining the two. This is an oversampling technique.
In this study, the SMOTE algorithm yielded better performance. This is not surprising, since undersampling tends to discard potentially important information.
The machine learning algorithms employed were LogisticRegression, MLPClassifier, and XGBoost, all of which are well suited to classification problems of this nature. K-fold and stratified K-fold cross-validation were used to split the data.
However, there was no significant difference in performance between the two, likely because SMOTE had already balanced the classes, so stratification added little. RandomizedSearchCV and GridSearchCV were used to tune the hyperparameters of MLPClassifier and LogisticRegression, respectively.
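A sketch of stratified splitting combined with grid search; the synthetic data and the parameter grid are assumptions, since the grids actually used in the project are not shown here:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X, y = make_classification(n_samples=300, weights=[0.8, 0.2],
                           random_state=0)

# Stratified folds preserve the class ratio in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Hypothetical grid for illustration only.
param_grid = {"C": [0.01, 0.1, 1, 10]}
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    scoring="f1",  # tune for F1, matching the evaluation metric
    cv=cv,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```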
The performance of the algorithms was estimated using the F1 score, which takes both precision and recall into account and can be a better measure than accuracy on imbalanced binary classification data.
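The F1 score is the harmonic mean of precision and recall; a small worked example with toy labels:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy predictions on an imbalanced label set.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 1]

p = precision_score(y_true, y_pred)   # TP=3, FP=1 -> 0.75
r = recall_score(y_true, y_pred)      # TP=3, FN=1 -> 0.75
f1 = f1_score(y_true, y_pred)         # harmonic mean -> 0.75
print(p, r, f1)
```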
Of the three algorithms, XGBoost achieved the highest F1 score.
The code and data for this project can be found here.