Classifying Client Behaviour with ML

Artem Arutyunov · Published in The Power of AI · 10 min read · Mar 22, 2023

In today's world, banks handle an enormous amount of data every day, including transactions, user behavior, and account information. This data can be analyzed to gain insights into client behavior, which is essential for developing effective business strategies, identifying potential fraud, and improving customer experience. One crucial aspect of data analysis in banking is the classification of client behavior. In this blog, we will discuss various techniques and tools banks can use to classify their clients. We will see how to use several Python libraries for dataset preprocessing, feature selection, and model training to classify the data. So, let's dive in!

To see all of the detailed explanations for the concepts mentioned here, and to analyze and experiment with the code for this blog, click on Client Behavior Classification in Banking.

You can also take many FREE courses and projects about data science or any other technology topics from Cognitive Class.

Data:

Let's see what kind of data we will be working with. The data that we are going to use is a subset of the open-source Bank Marketing Data Set from the UCI ML repository. The data is related to direct marketing campaigns of a Portuguese banking institution, and in our case the classification goal will be to predict whether the client will subscribe to a term deposit. We will be using the pandas library to work with the data. Let's first load the data:

import pandas as pd

df = pd.read_csv('bank-additional/bank-additional-full.csv', sep=';')
df.head(5)

Let’s also check the shape of our data by using:

df.shape
#Returns (41188, 21)

As you can see, the dataset consists of 41188 rows and 21 columns, the last of which is the target. Some of the columns are:

Input features (column names):

  1. age - client age in years (numeric)
  2. job - type of job (categorical: admin., blue-collar, entrepreneur, housemaid, management, retired, self-employed, services, student, technician, unemployed, unknown)
  3. marital - marital status (categorical: divorced, married, single, unknown)
  4. education - client education (categorical: basic.4y, basic.6y, basic.9y, high.school, illiterate, professional.course, university.degree, unknown)
  5. default - has credit in default? (categorical: no, yes, unknown)
  6. housing - has housing loan? (categorical: no, yes, unknown)
  7. loan - has personal loan? (categorical: no, yes, unknown)
  8. contact - contact communication type (categorical: cellular, telephone)
  9. euribor3m - euribor 3 month rate, daily indicator (numeric)
  10. nr.employed - number of employees, quarterly indicator (numeric)

Output feature (desired target):

  1. y - has the client subscribed to a term deposit? (binary: yes, no)

*If you want to see the complete list of columns, click here.

Note that all categorical features are recognized as objects. We must change their type to "category".

col_cat = list(df.select_dtypes(include=['object']).columns)
df.loc[:, col_cat] = df[col_cat].astype('category')

To see the unique values of a specific feature (column) we can use:
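The original snippet is not shown here, so as a minimal illustration (the choice of the 'education' column is arbitrary), pandas' unique() method does the job:

#List the unique values of a single (categorical) column
df['education'].unique()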

As was mentioned earlier, the dataset contains 41188 objects (rows), each described by 21 features (columns), including 1 target feature (y). 11 features, including the target, are categorical. These values cannot be used for classification directly, so we must transform them to int or float.
To do this we can use LabelEncoder and OrdinalEncoder. These encoders represent categorical features as an integer array.

Let's split the dataset into input and output (target) parts and convert the categorical features to numerical ones:

from sklearn.preprocessing import OrdinalEncoder

X = df.iloc[:, :-1]  #input columns
y = df.iloc[:, -1]   #target column
col_cat = list(X.select_dtypes(include=['category']).columns)
oe = OrdinalEncoder()
oe.fit(X[col_cat])
X_cat_enc = oe.transform(X[col_cat])

#Convert back into the dataframe
X_cat_enc = pd.DataFrame(X_cat_enc)
X_cat_enc.columns = col_cat
X_cat_enc

Numerical fields can have very different scales and can contain negative values, which may lead to rounding errors and problems for some ML methods. To avoid this, these features should be normalized.

Let's create a list of the numerical fields and normalize them using MinMaxScaler:

col_num = ['age', 'duration', 'campaign', 'pdays',
'previous', 'emp.var.rate', 'cons.price.idx',
'cons.conf.idx', 'euribor3m', 'nr.employed']

from sklearn.preprocessing import MinMaxScaler

#Apply the scaler
scaler = MinMaxScaler(feature_range=(0, 1))
X_num_enc = scaler.fit_transform(X[col_num])

#Convert back into the dataframe
X_num_enc = pd.DataFrame(X_num_enc)
X_num_enc.columns = col_num
X_num_enc

Then we should concatenate both data frames into a single input data frame, and not forget to encode the target variable as well.
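This step is not shown in code here; a minimal sketch, assuming the combined input frame is called x_enc and the encoded target y_enc (the names used in the snippets that follow), could look like this:

from sklearn.preprocessing import LabelEncoder

#Combine the encoded categorical columns with the scaled numerical columns
x_enc = pd.concat([X_cat_enc, X_num_enc], axis=1)

#Encode the 'yes'/'no' target as integers
le = LabelEncoder()
y_enc = le.fit_transform(y)

After these steps we have a processed dataset ready for the initial analysis.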

Feature selection:

As mentioned before, the input data consists of 20 features. Of course, some of them are more significant for classification than others. There are two popular feature selection techniques that can be used with categorical input data and a categorical (class) target variable.

They are:

  • Chi-Squared Statistic.
  • Mutual Information Statistic.

Let's take a closer look at each in turn. To do this we can use SelectKBest, a class from the scikit-learn library. Let's see what it is and how we can use it:

Chi-Squared Statistic: Pearson's chi-squared statistical hypothesis test is an example of a test for independence between categorical variables. You can learn more about this statistical test in the tutorial: A Gentle Introduction to the Chi-Squared Test for Machine Learning.

The results of this test can be used for feature selection, where those features that are independent of the target variable can be removed from the dataset. The scikit-learn machine library provides an implementation of the chi-squared test in the chi2 function. This function can be used in a feature selection strategy, such as selecting the top k most relevant features (largest values) via the SelectKBest class.

We can define the SelectKBest class to use the chi2() function and select all (or most significant) features, then transform the train and test sets. Let’s apply SelectKBest class to extract the top 10 best features:

from sklearn.feature_selection import SelectKBest, chi2

bestfeatures = SelectKBest(score_func=chi2, k=10)
fit = bestfeatures.fit(x_enc, y_enc)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(x_enc.columns)  #same column order as the frame we fitted on

Let’s see which features are the most significant ones:

Specs           Score
euribor3m       890.69
loan            547.96
emp.var.rate    541.30
nr.employed     502.66
poutcome        441.45
campaign        358.02
education       321.92
marital         167.61
previous        157.98
day_of_week      98.23

Let’s apply another method and compare the results:

Mutual information from the field of information theory is the application of information gain (typically used in the construction of decision trees) to feature selection. Mutual information is calculated between two variables and measures the reduction in uncertainty for one variable given a known value of the other variable. You can learn more about mutual information in the following tutorial.

The scikit-learn machine learning library provides an implementation of mutual information for feature selection via the mutual_info_classif function. Like chi2(), it can be used in the SelectKBest feature selection strategy (and other strategies).

from sklearn.feature_selection import mutual_info_classif

bestfeatures = SelectKBest(score_func=mutual_info_classif, k=10)
fit = bestfeatures.fit(x_enc, y_enc)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(x_enc.columns)  #same column order as the frame we fitted on

Let’s see which features are the most significant ones:

Specs           Score
campaign         0.08
euribor3m        0.07
cons.price.idx   0.07
cons.conf.idx    0.07
nr.employed      0.06
emp.var.rate     0.05
previous         0.04
day_of_week      0.04
contact          0.03
poutcome         0.02

As you can see, these two functions select different significant features. Let's try one more method.

Feature importance: You can get the importance of each feature in your DataFrame by using the feature importance property of a particular classification model. Feature importance gives you a score for each feature of your data; the higher the score, the more important or relevant the feature is to your output variable. For example, feature importance is a built-in attribute of tree-based classifiers in the scikit-learn library; we will use the Extra Trees Classifier to extract the top 10 features for the dataset.

from sklearn.ensemble import ExtraTreesClassifier
import matplotlib.pyplot as plt

#Train our model
model = ExtraTreesClassifier()
model.fit(x_enc, y_enc)

#Extract the feature importances, turn them into a Series and plot the top 10
feat_importances = pd.Series(model.feature_importances_, index=x_enc.columns)
feat_importances.nlargest(10).plot(kind='barh')
plt.show()

As you can see, the Extra Trees Classifier ranks the features differently than the previous two methods. This means there are no universal rules for feature selection: feature importance depends heavily on the model.

Correlation Matrix with Heat-map:

Correlation states how features are related to each other. Correlation can be positive (an increase in one feature tends to increase the value of the other) or negative (an increase in one feature tends to decrease the value of the other). A heat map makes it easy to identify which features are most related to each other; we will plot the heat map of feature correlations using the seaborn library.

import seaborn as sns

corrmat = x_enc.corr()
top_corr_features = corrmat.index
plt.figure(figsize=(20, 20))
g = sns.heatmap(x_enc[top_corr_features].corr(), annot=True, cmap="RdYlGn")

As you can see, the fields 'euribor3m', 'emp.var.rate' and 'nr.employed' are strongly correlated with each other. This means that two of them should be removed from the calculation, because there are near-linear dependencies between them: if we know one of them, we can easily estimate the other two. We will remove 'emp.var.rate' and 'nr.employed' from the dataset.
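The corresponding line of code is not shown in the post; a minimal sketch of this step:

#Drop the two fields that are (almost) linearly dependent on 'euribor3m'
x_enc = x_enc.drop(columns=['emp.var.rate', 'nr.employed'])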

Now that we know which features to include, let's try to make predictions using our processed dataset and some classification models.

Classification Models:

First of all, we must split the data into training and testing sets so that we can measure the accuracy of our models. To do this we can use train_test_split. Let's split the data with a 0.33 test proportion:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(x_enc, y_enc, test_size=0.33, random_state=1)

Logistic regression: There are many different techniques for scoring features and selecting features based on scores; how do you know which one to use?

A robust approach is to evaluate models using different feature selection methods (and numbers of features) and select the method that results in a model with the best performance.
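The post itself does not include code for such a comparison, but here is a minimal sketch of the idea, assuming the X_train/X_test split defined above and keeping the top 10 features chosen by each scoring function:

from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

for name, score_func in [('chi2', chi2), ('mutual information', mutual_info_classif)]:
    #Keep the 10 best features according to this scoring function
    selector = SelectKBest(score_func=score_func, k=10)
    X_train_fs = selector.fit_transform(X_train, y_train)
    X_test_fs = selector.transform(X_test)

    #Train and evaluate a logistic regression model on the selected features
    model = LogisticRegression(solver='lbfgs', max_iter=1000)
    model.fit(X_train_fs, y_train)
    print(name, accuracy_score(y_test, model.predict(X_test_fs)))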

Logistic regression is a good model for testing feature selection methods as it can perform better if irrelevant features are removed from the model. Let’s train it and evaluate the accuracy:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

model = LogisticRegression(solver='lbfgs')
model.fit(X_train, y_train)
yhat = model.predict(X_test)
accuracy = accuracy_score(y_test, yhat)

It returns a stunning 91% accuracy. Let's try one more model.

Decision Trees are a popular supervised learning method for a variety of reasons. The benefits of decision trees include that they can be used for both regression and classification, they don’t require feature scaling, and they are relatively easy to interpret as you can visualize decision trees. This is not only a powerful way to understand your model, but also to communicate how your model works.

A Decision Tree is a supervised algorithm used in machine learning. It uses a binary tree graph (each node has two children) to assign a target value to each data sample. The target values are stored in the tree leaves. To reach a leaf, the sample is propagated through the nodes, starting at the root node. In each node a decision is made about which descendant node to go to next, based on the selected features of the sample. Decision Tree learning is the process of finding the optimal rule in each internal tree node according to the selected metric. Let's fit a Decision Tree model from the scikit-learn library:

from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()
model.fit(X_train, y_train)
yhat = model.predict(X_test)
accuracy = accuracy_score(y_test, yhat)

It gives us an accuracy of 89.14%, a bit worse than the logistic regression one.

As was mentioned earlier, this method allows us to calculate the features' importance. Let's experiment a bit by calculating them, choosing the best 10 to refit the model, and seeing whether the accuracy increases or decreases.


feat_importances = pd.Series(model.feature_importances_, index=x_enc.columns)
feat_importances.nlargest(10).plot(kind='barh')
plt.show()

Let’s refit the model on the most important features:

col = feat_importances.nlargest(10).index
X_train_dt = X_train[col]
X_test_dt = X_test[col]
model.fit(X_train_dt, y_train)
yhat = model.predict(X_test_dt)
accuracy = accuracy_score(y_test, yhat)

It returns an accuracy of about 89%. As you can see, the accuracy is a little worse because we do not use all of the features, so sometimes removing features can hurt a model's accuracy. It is important to note, however, that we kept only half of the features while the accuracy dropped only slightly, which is a great gain in training efficiency.

Visualization of the decision tree:

As was mentioned earlier, decision trees are relatively easy to interpret because you can visualize them. Let's visualize the decision tree that we trained earlier. There are many ways to do it; one of the fastest is to use the python-graphviz library. Use the following code to do it:

import graphviz
from sklearn import tree

dot_data = tree.export_graphviz(model,
                                feature_names=col,
                                class_names=sorted(y.unique()),  #sorted so the names match the order of the encoded classes
                                filled=True)

After creating the dot data, you can plot the graph:

graph = graphviz.Source(dot_data, format="png") 
graph
Visualisation of a Decision Tree

As you can see, our decision tree is very big and complex, but to better visualise how it works, here is a small part of it:
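The post does not show how this excerpt was produced; one simple way (an assumption on my part, not necessarily what was done here) is to limit the depth with export_graphviz's max_depth parameter:

#Export only the top levels of the tree to get a readable excerpt
dot_data_small = tree.export_graphviz(model,
                                      feature_names=col,
                                      class_names=sorted(y.unique()),
                                      filled=True,
                                      max_depth=2)
graphviz.Source(dot_data_small)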

You can also render it to a file:

graph.render("decision_tree_graphivz")

In conclusion, we did some preliminary data processing: we changed data types, normalized the data, processed the categorical features, and performed feature selection with several different methods. We learned how to build training and test sets, saw how to work with different classifiers, and visualized the results. The final result is around 90% accuracy, which is not bad at all.

If you want to know more ways to visualise your trees, check the code for the project, or experiment yourself with possible classification models, then click on Client behavior classification in Banking or explore other FREE courses and projects about data science or machine learning on Cognitive Class.

Thanks for reading.


Hey, Artem here, I love helping people to learn, and learn myself. IBM Data Science Intern + Studying Math and Stats at University of Toronto.