Machine Learning and Its Application to Banking Institution Problems

Adhi Rakhmat
Mandiri Engineering
6 min read · Jun 9, 2020

Disclaimer: This article is written for educational purposes only and is aimed at non-expert to beginner readers who want to learn how Banking Institutions can work with machine learning. It covers two ways of getting more in-depth insight from customer data: NPL prediction and customer segmentation. That insight is expected to help decision-makers create new products or improve existing ones. The detailed process and results are not shown in full, in order to protect customer data, to avoid misleading information caused by the current model’s capability, and to conceal the company’s future strategies.

Part 1: What is Machine Learning?

Do you know what Machine Learning is? There are many definitions of the term, but in this article the writer would like to keep it as simple as ‘the machine that learns’. If that definition seems too simple, let the writer explain further by picturing the machine as ‘a baby’. When we first came into this world as babies, we couldn’t do anything. All we could do was cry and move our bodies randomly. Gradually, we learned to understand words and coordinate our bodies until we could speak fluently and even walk by ourselves.

Of course, we did not succeed on the first try; we may have needed dozens, hundreds, or even thousands of attempts. Each time we fail, our brain learns from it. The more we try, the more data our mind gathers, so it can adjust the next attempt (iteration) and narrow the gap to the target until we achieve the goal. A machine learning model improves in the same way: it repeats its attempts on data and adjusts itself after each error.
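
The try, fail, and adjust loop described above can be sketched in a few lines. This is only an illustration under assumptions: the data (points on the line y = 2x), the function name fit_slope, and the learning rate are hypothetical, and gradient descent is just one common way a machine can learn from its errors.

```python
def fit_slope(xs, ys, learning_rate=0.01, attempts=200):
    """Learn w in y ~ w * x by repeated attempts (gradient descent)."""
    w = 0.0  # like the baby, the model starts out knowing nothing
    errors = []
    for _ in range(attempts):
        # Mean squared error of the current attempt.
        error = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
        errors.append(error)
        # Gradient of the error with respect to w.
        gradient = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        # Adjust w a little in the direction that reduces the error.
        w -= learning_rate * gradient
    return w, errors


# Hypothetical data that follows y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w, errors = fit_slope(xs, ys)
print(round(w, 3), errors[0] > errors[-1])
```

Each pass through the loop is one attempt: the error is measured, and the next attempt starts from a slightly better value of w.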

Part 2: Types of Machine Learning

There are two major types of Machine Learning, distinguished by whether the data is labelled: supervised (with labelled data) and unsupervised (without labelled data). Here is an illustration to help you understand:

The illustration above shows that:

  • Supervised learning has labels to guide it (here, male and female). A model’s score can be calculated by comparing its predictions with the labels. For example, if the model predicts ‘male’ for a new data point and the label is ‘male’, the prediction is correct; otherwise it is wrong.
  • Unsupervised learning does not use any label to guide the prediction. Instead, the machine relies on similarity in the data or behaviour, so there is no exact calculation to score a model. For example, given a new data point (the green one), the machine determines whether it is similar in features or behaviour to the green cluster. It is harder to determine whether the answer is right or wrong.
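
The difference between the two bullets can be sketched as follows. The data, the cluster centres, and the function names accuracy and nearest_cluster are hypothetical; the point is only that the supervised case is scored against labels, while the unsupervised case relies on similarity (here, plain Euclidean distance, which is an assumption).

```python
import math


def accuracy(predictions, labels):
    """Supervised: score a model by comparing predictions with known labels."""
    correct = sum(1 for p, t in zip(predictions, labels) if p == t)
    return correct / len(labels)


def nearest_cluster(point, centroids):
    """Unsupervised: assign a point to the most similar (closest) cluster."""
    return min(centroids, key=lambda name: math.dist(point, centroids[name]))


# Hypothetical labelled data: the label says whether each prediction is right.
labels = ["male", "female", "male", "female"]
predictions = ["male", "female", "female", "female"]
print(accuracy(predictions, labels))  # 3 of the 4 predictions match the labels

# Hypothetical cluster centres: no labels, only similarity (distance).
centroids = {"green": (1.0, 1.0), "blue": (5.0, 5.0)}
print(nearest_cluster((1.5, 0.5), centroids))
```

Note that the second function returns an assignment but no score: without labels there is nothing to compare the answer against.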

In this article, the use case for supervised learning is NPL prediction, and the use case for unsupervised learning is customer segmentation.

Part 3: Supervised Learning — NPL Prediction

The writer uses Logistic Regression to predict whether a customer is categorized as Performing Loan (PL) or Non-Performing Loan (NPL). Here are several methods and ideas that the writer applied in developing the model:

  1. The writer uses several features: the type of the customer’s business, original loan amount, current outstanding, job type, loan to value, etc.
  2. The observation window is one year. So, if the writer uses January of last year as the base, it is compared with January of this year.
  3. The writer uses class weights to handle the imbalanced dataset (e.g. only 1% of the portfolio is NPL). The calculation therefore gives a higher penalty when the model misclassifies an NPL customer than when it misclassifies a PL customer.
  4. The writer uses the F1 score to evaluate the model because the dataset is imbalanced. The result appears as a confusion matrix below:
  • This is how to calculate Precision and Recall, with NPL as the positive class: Precision = TP / (TP + FP) and Recall = TP / (TP + FN), where TP is NPL predicted as NPL, FP is PL predicted as NPL, and FN is NPL predicted as PL.
  • So, Precision is the number of correct positive predictions (NPL predicted as NPL) divided by all positive predictions (PL predicted as NPL plus NPL predicted as NPL). Recall is the number of correct positive predictions (NPL predicted as NPL) divided by all actual positives (NPL predicted as PL plus NPL predicted as NPL).
  • This is how to calculate the F1 score (the F1 score is the harmonic mean of Precision and Recall): F1 = 2 × Precision × Recall / (Precision + Recall).
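
The three metrics described in the bullets above can be computed directly from confusion-matrix counts. This is a minimal sketch: the function name precision_recall_f1 and the counts in the usage lines are hypothetical and are not the article’s results.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts.

    tp: NPL predicted as NPL
    fp: PL predicted as NPL
    fn: NPL predicted as PL
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts, chosen only to illustrate the arithmetic.
precision, recall, f1 = precision_recall_f1(tp=30, fp=70, fn=30)
print(precision, recall, round(f1, 3))
```

True negatives (PL predicted as PL) do not appear in any of the three formulas, which is why F1 is less flattered by a large PL majority than plain accuracy would be.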

The F1 Score for the current model (‘1’ is NPL and ‘0’ is PL) is:

In this scenario, the writer uses 70% of the data as the train dataset and 30% as the test dataset. First, the machine learning algorithm searches for patterns in the train dataset. Once it has learned from the train dataset and produced a model, the model is applied to the test dataset. The train score therefore compares the actual labels with the predictions on the train dataset, and the test score does the same on the test dataset. The two scores differ because they are computed on different datasets.
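
A 70/30 split and the class weights mentioned earlier can be sketched together in plain Python. Everything here is an assumption for illustration: the article does not say how the data was shuffled or which weighting scheme was used. The weights below follow the common "balanced" heuristic, n_samples / (n_classes * class_count), which is also what scikit-learn documents for class_weight="balanced". The names split_train_test and balanced_class_weights are hypothetical.

```python
import random
from collections import Counter


def split_train_test(rows, test_fraction=0.3, seed=42):
    """Shuffle the rows and hold out test_fraction of them for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    return {c: len(labels) / (len(counts) * n) for c, n in counts.items()}


# Hypothetical portfolio of 100 loans: 1 = NPL (rare), 0 = PL (common).
labels = [0] * 99 + [1]
train, test = split_train_test(labels)
print(len(train), len(test))

# The rare class gets a much larger weight, so errors on it cost more.
print(balanced_class_weights(labels))
```

With only 1% NPL, a purely random split can leave very few NPL rows in either part, so a split that preserves the class ratio (a stratified split) may be worth considering; whether the writer did this is not stated.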

The confusion matrix above shows that the model still makes a lot of False Positive mistakes (PL predicted as NPL). From the writer’s perspective, however, the model does better on False Negatives (NPL predicted as PL). As a result, the F1 test score is 28.5%, which is still a poor result.

Part 4: Unsupervised Learning — Customer Segmentation

The writer uses UMAP to reduce the data dimensions and Spectral Clustering to perform customer segmentation. Here are several methods and ideas that the writer applied in developing the model:

  1. The writer uses several features: the type of the customer’s business, original loan amount, current outstanding, job type, loan to value, etc.
  2. The data has many features (it is high-dimensional), and UMAP reduces them to 2 dimensions so the writer can visualize the data. This 2-dimensional UMAP visualization gives more insight for segmenting the customers.
  3. The writer then performs customer segmentation on the 2-dimensional data using Spectral Clustering. In the picture below, customers in the same cluster are shown in the same colour.
  4. After visualizing the clusters, the writer analyzes each cluster to find insightful patterns in the data.
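
To give a feel for step 3, here is a simplified sketch of the idea behind spectral clustering, written in plain Python. It is not the writer’s pipeline: UMAP is not reimplemented (the points are assumed to be already reduced to 2 dimensions), only two clusters are produced, and the function name spectral_bipartition, the RBF similarity, and the gamma value are all assumptions. In practice, libraries such as umap-learn and scikit-learn provide full implementations.

```python
import math
import random


def spectral_bipartition(points, gamma=1.0, iterations=500, seed=0):
    """Split 2-D points into two clusters using the Fiedler vector.

    Builds an RBF similarity graph, forms the graph Laplacian L = D - W,
    and finds the eigenvector of the second-smallest eigenvalue by power
    iteration on (c*I - L) with the constant vector projected out.
    """
    n = len(points)
    # Similarity graph: closer points get weights nearer to 1.
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = math.dist(points[i], points[j]) ** 2
                weights[i][j] = math.exp(-gamma * d2)
    degree = [sum(row) for row in weights]
    c = 2.0 * max(degree)  # upper bound on the largest eigenvalue of L

    rng = random.Random(seed)
    v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    for _ in range(iterations):
        # Remove the constant component (the trivial eigenvector of L).
        mean = sum(v) / n
        v = [x - mean for x in v]
        # Multiply by (c*I - L): (c - d_i) * v_i + sum_j w_ij * v_j.
        v = [(c - degree[i]) * v[i]
             + sum(weights[i][j] * v[j] for j in range(n))
             for i in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    # The sign of each entry decides which cluster the point belongs to.
    return [1 if x > 0 else 0 for x in v]


# Hypothetical 2-D embedding with two visibly separate groups of customers.
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3),
          (3.0, 3.0), (3.2, 3.1), (3.1, 2.8)]
print(spectral_bipartition(points))
```

Which group is called 0 and which is called 1 is arbitrary; what matters is that points in the same group share a label, which is what the colours in the visualization represent.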

The illustration of UMAP and Spectral Clustering is shown below:

Part 5: Conclusion

These two cases show that machine learning has the potential to help Banking Institutions solve financial problems and to help decision-makers create new products or improve existing ones. The results, especially for supervised learning, do not yet satisfy the writer (and maybe the readers), but the writer will try to develop a better model for both use cases.

In the end, the writer hopes that readers now understand two things: the impact of combining a company’s work with machine learning to create new insights that humans may never have thought of before, and the difference between supervised and unsupervised learning in real-life problems.
