Everything You Want to Know About Machine Learning in FinTech

Published in

Armada Labs

7 min read · May 29, 2019

* The article was initially published on our blog.

Description: Drawing on its own experience, Armada Labs explains why black-box underwriting models are not inherently biased and how FinTechs use large amounts of financial data to build resilient risk models and improve their predictive capabilities with self-learning algorithms.

Access to financial data and personal information has always been treated as a matter of special custody and protection by the institutions entrusted with it. Until recently, it was nearly impossible to get at the enormous pools of bank account information and transactions without filing thousands of pages of confidentiality agreements, even when you had the customer’s consent to prove the legitimacy of your claim.

However, those days are far behind us. Today, third-party providers of financial services enjoy the full benefits of open APIs, which they use to build applications and fill the data lakes needed to train the machines. Data availability is the starting point of the journey to machine intelligence in FinTech, because without data any algorithm is useless.

Machine Learning in Credit Risk Modeling

It’s no secret that cognitive technology, specifically machine learning and artificial intelligence, has the power to change the way a business tackles any number of functions. Primarily because of their predictive nature, self-learning models have drawn keen attention from lenders, who want them to predict which borrowers will default, become delinquent, or prepay.

Let’s take an example to see how it works.

A business applies for a loan, and the lender must evaluate whether the borrower can repay the principal and interest on time and in full. Historically, lenders have used measures of profitability and leverage to assess credit risk. Given two loan applicants, one with high profitability and high leverage and the other with low profitability and low leverage, which firm will get the loan? A profitable company generates enough cash to cover the principal and interest. A highly leveraged company, on the other hand, lacks the equity needed to weather economic shocks. Quite a puzzle, isn’t it?

Here is how visualizing the data sorts things out. Figure 1 illustrates the decision tree of a lender who has to decide whether or not to fund a loan for a given borrower.
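As a rough illustration of how such a tree reaches a decision, the sketch below hard-codes a two-level tree over profitability and leverage. The class, the function name, the ratio definitions, and all thresholds are hypothetical and are not taken from Figure 1; a real tree would learn its split points from historical loan data.

```python
from dataclasses import dataclass


@dataclass
class Applicant:
    name: str
    profitability: float  # assumed definition: net income / total assets
    leverage: float       # assumed definition: total debt / total assets


# Hypothetical split points, chosen only for illustration.
MIN_PROFITABILITY = 0.05
MAX_LEVERAGE = 0.60
STRONG_PROFITABILITY = 0.15


def decide(applicant: Applicant) -> str:
    """Walk a tiny hand-written decision tree and return 'fund' or 'decline'."""
    if applicant.profitability < MIN_PROFITABILITY:
        return "decline"  # too little cash generated to service the debt
    if applicant.leverage <= MAX_LEVERAGE:
        return "fund"  # profitable enough and not over-leveraged
    # Highly leveraged: fund only if profitability is strong enough to compensate.
    if applicant.profitability >= STRONG_PROFITABILITY:
        return "fund"
    return "decline"


applicants = [
    Applicant("high profit, high leverage", profitability=0.20, leverage=0.80),
    Applicant("low profit, low leverage", profitability=0.03, leverage=0.20),
]
for a in applicants:
    print(f"{a.name}: {decide(a)}")
```

With different thresholds the same two applicants could receive the opposite decisions, which is exactly why the split points should come from data rather than from intuition.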

Clearly, nobody will underwrite a loan on the basis of only two factors, so the model must also include other financial information, such as the liquidity ratio and net cash flow; behavioral information, such as loan and trade credit payment behavior; and credit reports and scores. However, the more dimensions we add, the more complex the model becomes. Summarizing all these relevant dimensions into one score is challenging, but machine learning algorithms help achieve this goal.

The main idea behind machine learning algorithms is to teach computers how to parse data, capture patterns, learn from them, and then make a determination or prediction about new data. This is a sizable opportunity for lenders to protect their interests and capitalize on data.

Machine Learning and Statistical Learning

The ultimate goal of machine learning largely overlaps with that of its lower-profile sister field, statistical learning, as both disciplines attempt to uncover underlying relationships by using a training dataset.

Yet the difference between the two approaches lies in the way they define a pattern and learn from it. Typically, statistical learning methods assume formal relationships between variables in the form of linear mathematical equations, while machine learning methods can learn from data without requiring any rules-based programming.

As a result, machine learning techniques are more flexible: they can capture non-linear relationships that classical statistical models, with their many constraints, might miss.

Let’s take a look at Figure 2, taken from Moody’s Analytics research comparing machine learning models with the Moody’s Analytics RiskCalc model.

The first chart illustrates the actual distribution of data points with respect to X and Y, with the points in red classified as defaults.

From the second chart, we can see that a linear statistical model fails to capture the patterns in the data, as its predictions are far off for most defaults. This happens because the model cannot identify hot spots in non-monotonic data sets.

The third chart shows the prediction made by a versatile machine learning algorithm, random forest. As can be seen from the chart, the prediction almost perfectly matches the original data set, in stark contrast to the traditional model.
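The effect described above can be reproduced on a one-dimensional toy data set. The sketch below is not the Moody’s Analytics data and does not train a random forest; it uses made-up points where defaults sit in a middle band of x, and compares the best single cut-off (a monotonic rule, standing in for a linear model) with the best two-split rule (the kind of pattern a small decision tree can express). All function names are illustrative.

```python
# Toy data: defaults (label 1) occur only in a middle band of x, a "hot spot".
xs = [i / 20 for i in range(21)]
ys = [1 if 8 <= i <= 12 else 0 for i in range(21)]


def accuracy(predict, xs, ys):
    """Share of points where the rule's prediction matches the label."""
    return sum(predict(x) == y for x, y in zip(xs, ys)) / len(xs)


def recall(predict, xs, ys):
    """Share of actual defaults that the rule catches."""
    defaults = [x for x, y in zip(xs, ys) if y == 1]
    return sum(predict(x) for x in defaults) / len(defaults)


def best_single_threshold(xs, ys):
    """Best monotonic rule: flag default on one side of a single cut-off."""
    candidates = []
    for t in xs + [float("inf")]:  # inf lets the >= rule express "never flag"
        candidates.append(lambda x, t=t: int(x >= t))
        candidates.append(lambda x, t=t: int(x < t))
    return max(candidates, key=lambda rule: accuracy(rule, xs, ys))


def best_interval(xs, ys):
    """Best two-split rule: flag default inside [lo, hi], as a depth-2 tree could."""
    candidates = [
        (lambda x, lo=lo, hi=hi: int(lo <= x <= hi))
        for lo in xs for hi in xs if lo <= hi
    ]
    return max(candidates, key=lambda rule: accuracy(rule, xs, ys))


monotonic = best_single_threshold(xs, ys)
interval = best_interval(xs, ys)
print("single cut-off: accuracy", round(accuracy(monotonic, xs, ys), 3),
      "recall", recall(monotonic, xs, ys))
print("two splits:     accuracy", round(accuracy(interval, xs, ys), 3),
      "recall", recall(interval, xs, ys))
```

On this toy data the best single cut-off achieves its accuracy by never flagging a default at all, which is the one-dimensional analogue of a linear model missing the hot spot.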

Machine learning algorithms such as random forest, boosting, and artificial neural networks are therefore well suited to handling non-linear and interaction effects among explanatory variables that are too complex for classical statistics, like the ones that affect your credit score when you are extremely delinquent on your phone bill.

No wonder machine learning has taken root among lenders, who long for any detail that might be a key to their borrowers’ creditworthiness: data-driven models prove to be accurate and cost-effective, unless the model turns out to be a pig in a poke.

Black Box Underwriting Models

The black box metaphor typically refers to a system for which we can only observe the inputs and outputs, but not the internal workings.

Applied to machine learning, it works as follows: we feed raw data into a model, the model analyzes it for patterns and relationships, and finally it provides us with the corresponding predictions. By trying various combinations of data, we can observe the changes in predictions, but we cannot tell why the model produces such results.
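Probing a model by varying its inputs can be sketched in a few lines. In the example below, `black_box_score` is a hypothetical stand-in for an opaque model (its formula and coefficients are invented for illustration), and `probe` is an assumed helper that changes one input at a time while holding the others fixed.

```python
import math


def black_box_score(income: float, debt: float, late_payments: int) -> float:
    """Hypothetical stand-in for an opaque model; returns a default probability."""
    z = -1.0 - 0.00004 * income + 0.00006 * debt + 0.8 * late_payments
    return 1.0 / (1.0 + math.exp(-z))


def probe(model, baseline: dict, feature: str, values):
    """Vary one input, hold the others fixed, return (value, prediction) pairs."""
    results = []
    for value in values:
        inputs = dict(baseline, **{feature: value})  # copy; baseline is untouched
        results.append((value, model(**inputs)))
    return results


baseline = {"income": 50_000, "debt": 20_000, "late_payments": 0}
for value, p in probe(black_box_score, baseline, "late_payments", range(4)):
    print(f"late_payments={value}: predicted default probability {p:.3f}")
```

A probe like this shows how the prediction moves with each input, but it still says nothing about why the model weighs the inputs the way it does.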

It is not that we are unable to see what is inside a machine learning model. In fact, once we open the box, we see that it is made up entirely of simple components that are easily understood in isolation.

The complexity and all the blind spots emerge once we look at the whole system in action, where each individual component operates according to its own rules in response to the input. Understanding why these components do what they do in any given situation is the real black box.

Going back to machine learning in the financial industry, many are concerned that black box underwriting models powered by machine learning algorithms are uncontrollable and biased because of their hidden nature. This assumption is only partially true.

For example, suppose the machine learning model embedded in an underwriting system predicts that all Swedish women under 35 are delinquent. The underwriting system will then reject any loan application from borrowers with similar features. The system is biased, for sure.

But this is not the system’s fault. The nationality, age, and gender of a borrower are, in our case, unrelated to solvency. However, in the data sample we used to build the underwriting system, all borrowers who are Swedish, female, and under 35 are delinquent. The system’s job is to find patterns that predict delinquency, so it will see this pattern, and it will always predict delinquency for Swedish women under 35.

The problem here is that our model was not validated properly. Many of the biases attributed to black box models are probably the result of mistakes in proprietary software, where customers are not allowed to inspect the algorithm used by the company or the data it was trained on.
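One simple validation check for the situation above is to compare each borrower segment’s delinquency rate in the training sample with its rate in a holdout sample, and to flag segments that rest on very few rows. The sketch below uses made-up data; the function names and the `min_count` and `max_gap` thresholds are assumptions for illustration, not an established standard.

```python
from collections import defaultdict


def delinquency_by_segment(records):
    """Return {segment: (count, delinquency_rate)} for (segment, delinquent) pairs."""
    counts = defaultdict(lambda: [0, 0])  # segment -> [total, delinquent]
    for segment, delinquent in records:
        counts[segment][0] += 1
        counts[segment][1] += int(delinquent)
    return {seg: (total, bad / total) for seg, (total, bad) in counts.items()}


def suspicious_segments(train, holdout, min_count=30, max_gap=0.2):
    """Flag segments whose training rate rests on few rows or disagrees with holdout."""
    train_stats = delinquency_by_segment(train)
    holdout_stats = delinquency_by_segment(holdout)
    flagged = {}
    for seg, (n, rate) in train_stats.items():
        reasons = []
        if n < min_count:
            reasons.append(f"only {n} training rows")
        if seg in holdout_stats and abs(rate - holdout_stats[seg][1]) > max_gap:
            reasons.append("training and holdout rates disagree")
        if reasons:
            flagged[seg] = reasons
    return flagged


# Made-up data: in training, every borrower in segment "A" happens to be delinquent.
train = [("A", True)] * 5 + [("B", True)] * 10 + [("B", False)] * 90
holdout = ([("A", True)] * 2 + [("A", False)] * 18
           + [("B", True)] * 5 + [("B", False)] * 45)
print(suspicious_segments(train, holdout))
```

Here segment "A" plays the role of the Swedish women under 35: a pattern that looks perfect in a handful of training rows and falls apart on data the model has not seen.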

Borrowers typically do not know what gets the thumbs-up or thumbs-down, or where their personal, non-business-related credit information (FICO scores, for example) plays a role. In many cases, neither do the companies, which lack hands-on specialists to ensure the quality of their self-learning models. But once companies get involved with machine learning techniques, it is their responsibility to understand them and to make them transparent in the first place.

Takeaways

Fair lending is the result of comprehensive machine learning. There are hundreds of millions of people around the world who deserve credit but are excluded from the credit box because lenders do not have enough information on them. Machine learning is a technological attempt to solve this problem by expanding the financial capabilities of the underbanked.

Thanks to machine learning, online applications are processed and credit decisions are rendered in minutes. Real-time analysis of data means the company can raise or lower credit limits depending on the customer’s individual circumstances.

The cost of underwriting is considerably lower with a data-driven online model; thus, lenders are more willing to approve smaller loans, the type typically considered “economically unfeasible.”

The important thing about machine learning is the ability to understand whether the model is safe or not. Companies working in this space should understand that a machine learning model is not only data and algorithms; it is also the accuracy of its predictions. The ideal case is when a company can generate insights from the predictions rather than follow them blindly.

Machine learning surely has a future in the lending space. Many FIs and FinTechs today are scaling up their technology capabilities and looking for a reliable technology partner to leverage the power of data for them.

Armada Labs has over 10 years of AI underwriting experience, serving as a reliable backbone for the largest credit score companies like Experian and DataX.

Join us on your journey to data-driven underwriting models: partner with a savvy FinTech craftsman who takes care of the technology while you focus on the business.


Written by Sofiko Abeslamidze