

How Banks Can Become More Data-Driven


Data has never been more available, and everyone is talking about it. Data is the key to unlocking insights, and it has been shown time and again that data-driven organizations tend to be more profitable. The open data movement has also evolved, and more businesses and organizations are releasing data into the wild. The result is an overwhelming desire in the business community to use data to get smarter about operations and deliver results in every industry and every department. In fact, in 2018, 98.6% of firms aspired to a data-driven culture. The excitement has led many organizations to invest heavily in business intelligence tools and in the data scientists who use them to spin gold out of raw data.

For banks, being data-driven is crucial for improving CRM, mitigating transaction risk, and making lending decisions. In this article, I present a methodology that banks can use to improve cross-selling activity and predict customer behavior.

Banks manage a large and diverse group of customers, which is why I present a big data framework for tackling the case study. Big data is commonly defined by “the 3Vs”: Volume for the large amount of data, Velocity for the speed at which data is created, and Variety for the mix of structured and unstructured data types. As the technology has matured, big data has enabled data-based decision-making, also known as evidence-based decision-making. Big data analytics is the process of running analysis algorithms or models on powerful supporting platforms to uncover what is concealed in big data, such as hidden patterns or unknown correlations, with minimal processing time. To build on the bank’s current platform, we will use a combination of Hadoop, Teradata RDBMS, Tableau for dashboard visualization, and Talend Data Integration.


Hadoop is an open-source framework for distributed computing on large datasets across clusters of computers using simple programming models. Banks currently use Hadoop for big data storage through the Hadoop Distributed File System (HDFS). Big data tends to be diverse in its data types, and a data-type-agnostic file system like HDFS is a good fit for that diversity. In addition, many of the complex data types associated with big data originate in files, and a traditional database management system (DBMS) struggles with them because querying and integrating the data is time-consuming.

Teradata RDBMS

Teradata RDBMS is regarded as the first RDBMS to be linearly scalable and to support parallel processing. It is designed mainly for data warehousing and reporting use cases. Banks can gradually move data from the Oracle data warehouse to Teradata RDBMS as the new data warehouse and, in parallel, use Teradata to store analytics results for visualization in Tableau.

Aster Analytics

Aster Analytics is a massively parallel processing database designed for online analytical processing (OLAP), data warehousing, and big data tasks. It manages a cluster of commodity servers that can scale to hundreds of nodes and analyze petabytes of data, and it is reported to perform 25% to 552% better than Pig and Hive. Aster Analytics can run on top of the Hadoop execution engine, and its analysis functions are accessible from SQL and R. Aster-on-Hadoop is designed to work well with CDH and Teradata RDBMS: it has connectors to the Hadoop and Teradata platforms that ease data transfer in and out.

Talend Data Integration and Tableau Visualization

Talend Data Integration is a platform orchestration tool that lets users define data movement and the required transformation processes across many platforms. Here it is used to integrate data between the current data warehouse and Hadoop, and between Hadoop layers. The bank also uses Tableau to visualize analytics results.

Analytics Model

This article also designs the business rules of the analytics model. One example is a propensity model, a statistical scorecard used to predict customer or prospect behavior; here it is used for cross-selling. As defined by previous research, cross-selling pertains to efforts to increase the number of products or services that a customer uses within a firm. Proper implementation of cross-selling requires an information infrastructure that allows managers to offer customers products and services that match their needs. On top of the propensity model, we also apply special tagging to improve classification and the estimated likelihood that a customer takes a loan product. In this article, we use a Random Forest model as the propensity model and SAX for the special tagging, together with additional filters defined by the case study bank for regulatory and internal risk.
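To make the combination of propensity score, tagging, and bank-defined filters more concrete, here is a minimal sketch of how such a business rule might look. Everything in it is an assumption for illustration: the `Customer` fields, the risk segment labels, the 0.6 score threshold, and the "rising pattern" tagging rule are hypothetical and are not taken from the case study bank.

```python
from dataclasses import dataclass


@dataclass
class Customer:
    customer_id: str
    propensity: float   # propensity-model score in [0, 1]
    sax_tag: str        # SAX word for a behavioural time series
    risk_segment: str   # hypothetical labels: "low", "medium", "high"


def has_rising_tag(customer):
    """Hypothetical tagging rule: the SAX word ends higher than it starts."""
    tag = customer.sax_tag
    return len(tag) >= 2 and tag[-1] > tag[0]


def select_leads(customers, min_propensity=0.6, allowed_risk=("low",)):
    """Apply the risk and score filters, then rank the remaining customers."""
    eligible = [
        c for c in customers
        if c.risk_segment in allowed_risk and c.propensity >= min_propensity
    ]
    # Tagged customers come first; within each group, higher propensity first.
    return sorted(
        eligible,
        key=lambda c: (has_rising_tag(c), c.propensity),
        reverse=True,
    )


customers = [
    Customer("C1", 0.82, "abcd", "low"),
    Customer("C2", 0.91, "dcba", "low"),
    Customer("C3", 0.95, "abcd", "high"),  # dropped by the risk filter
    Customer("C4", 0.40, "abcd", "low"),   # dropped by the score threshold
    Customer("C5", 0.70, "bbcd", "low"),
]

leads = select_leads(customers)
print([c.customer_id for c in leads])  # ['C1', 'C5', 'C2']
```

In this sketch the tag only changes the ranking, while the risk criteria act as a hard filter. A bank could equally treat the tag as a filter or as an extra model feature.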

Random Forest

Photo by Luca Bravo on Unsplash

Random forest (RF) is a non-parametric statistical method that is also well suited to big data analytics. Its basic constituents are tree-structured predictors, each constructed with an injection of randomness. Unlike a standard tree, in which each node is split using the best split among all variables, a random forest splits each node using the best among a subset of predictors randomly chosen at that node. Compared with a single decision tree, AdaBoost, neural networks, SVM, and similar methods, RF often achieves higher prediction accuracy and better noise tolerance, and it is robust against overfitting. Previous research has found that a random forest algorithm can be used for large-scale imbalanced classification of business data, and that it is more suitable for product recommendation or potential customer analysis than traditional strong classifiers such as SVM and logistic regression.

Several factors should be considered when choosing a machine learning algorithm, including the size of the training dataset, the dimensionality of the feature space, linearity, feature dependency, and the required processing power. Random forest can discover complex dependencies in non-linear problems, is robust, is not affected when features are unscaled or highly correlated, and can solve binary classification problems using decision trees. Because it is a bagging algorithm, it can also handle high-dimensional data. I will not explain the algorithm in depth here; readers who want to know more can follow the Random Forest reference [1] at the end of the article.
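The two sources of randomness described above (a bootstrap sample per tree and a random subset of predictors per node) can be sketched in plain Python. This is a toy illustration under stated assumptions, not the bank's model: the feature columns (monthly income, number of products held, months on book), the data, and all function names are made up, and a real project would more likely use a library implementation such as scikit-learn's `RandomForestClassifier`.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels, feature_ids):
    """Best (score, feature, threshold) among the candidate features only."""
    best = None
    for f in feature_ids:
        for threshold in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, threshold)
    return best


def build_tree(rows, labels, n_candidates, rng, depth=0, max_depth=3):
    """Grow one tree; a node is a (feature, threshold, left, right) tuple."""
    majority = Counter(labels).most_common(1)[0][0]
    if depth >= max_depth or len(set(labels)) == 1:
        return majority
    # Randomness at each node: only a random subset of predictors is tried.
    feature_ids = rng.sample(range(len(rows[0])), n_candidates)
    split = best_split(rows, labels, feature_ids)
    if split is None:
        return majority
    _, f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (
        f, t,
        build_tree([r for r, _ in left], [y for _, y in left],
                   n_candidates, rng, depth + 1, max_depth),
        build_tree([r for r, _ in right], [y for _, y in right],
                   n_candidates, rng, depth + 1, max_depth),
    )


def predict_tree(node, row):
    """Walk down the tree until a leaf (a plain 0/1 label) is reached."""
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node


def fit_forest(rows, labels, n_trees=25, n_candidates=2, seed=0):
    """Bagging: each tree is trained on its own bootstrap sample."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(build_tree([rows[i] for i in idx],
                                 [labels[i] for i in idx], n_candidates, rng))
    return forest


def propensity(forest, row):
    """Propensity score: the share of trees that vote for class 1."""
    return sum(predict_tree(tree, row) for tree in forest) / len(forest)


# Toy data: [monthly income, products held, months on book]; 1 = took a loan.
rows = [
    [2000, 1, 6], [2500, 1, 8], [3000, 1, 12], [2200, 2, 10], [2800, 2, 5],
    [6000, 3, 36], [7000, 4, 48], [6500, 3, 40], [8000, 5, 30], [7500, 4, 55],
]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

forest = fit_forest(rows, labels)
print(propensity(forest, [9000, 6, 72]), propensity(forest, [1000, 0, 1]))
```

The vote share returned by `propensity` is one simple way to turn a forest into a scorecard-style score between 0 and 1.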

SAX (Symbolic Aggregate Approximation)


Symbolic Aggregate Approximation (SAX) is a function that transforms original time series data into symbolic strings, which are better suited to many additional types of manipulation because of their smaller size and the relative ease with which patterns can be identified and compared. A time series is a collection of observations made sequentially over time. SAX splits the data into several intervals and assigns each interval an alphabetical symbol, so the output is a string of letters that represents a pattern occurring over time. The symbols are chosen so that each occurs with equal probability, which allows the resulting strings to be compared and manipulated further with reliable accuracy. SAX has several advantages over other symbolic approaches, such as dimensionality reduction and a distance measure that lower-bounds the distance between the original series.
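The transformation can be sketched in a few lines of standard-library Python. The sketch follows the usual SAX recipe, which the text above only summarizes: z-normalize the series, average it over equal-length intervals (piecewise aggregate approximation), and map each average to a letter using breakpoints that cut the standard normal distribution into equally probable regions. The function name `sax`, the four-letter alphabet, and the sample balances are assumptions for illustration.

```python
from statistics import NormalDist, mean, pstdev


def sax(series, n_segments, alphabet="abcd"):
    """Convert a numeric time series into a SAX word of n_segments letters."""
    if not 0 < n_segments <= len(series):
        raise ValueError("n_segments must be between 1 and len(series)")

    # 1. Z-normalize so that series on different scales become comparable.
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        normalised = [0.0] * len(series)
    else:
        normalised = [(x - mu) / sigma for x in series]

    # 2. Piecewise aggregate approximation: one mean per interval.
    n = len(normalised)
    segment_means = [
        mean(normalised[i * n // n_segments:(i + 1) * n // n_segments])
        for i in range(n_segments)
    ]

    # 3. Breakpoints that split N(0, 1) into equally probable regions.
    k = len(alphabet)
    breakpoints = [NormalDist().inv_cdf(j / k) for j in range(1, k)]

    # 4. Each interval mean maps to the letter of the region it falls in.
    return "".join(
        alphabet[sum(value > b for b in breakpoints)] for value in segment_means
    )


# Hypothetical monthly balances for one customer, steadily rising.
balances = [100, 110, 120, 130, 140, 150, 160, 170]
print(sax(balances, n_segments=4))        # abcd
print(sax(balances[::-1], n_segments=4))  # dcba
```

A steadily rising series becomes `abcd` and a falling one `dcba`, so customers with similar behaviour over time end up with similar or identical words that are cheap to compare.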

Case Study Interview

The project manager of a bank planning to implement a big data project in their department was interviewed to identify current pain points in the bank’s cross-selling activity. The current cross-sell activity is described in the diagram below:

Current Cross-Sell Activity

The risk division is in charge of generating cross-sell leads every time a business unit requests potential customers to include in a campaign or sales event. It generates leads based on low-risk segment, product criteria, limit assignment, and scorecard filtering. All of this data is calculated manually by the risk division, so generating leads takes a long time, which delays the business unit’s campaign. Low intensity of communication between the business unit and the data management division is another pain point: it results in a minimal view of the business factors needed to produce the best leads based on potential business value.

Based on the interview, products from two business units will be tested in the analytics system. The first is the consumer loan unit, whose products include personal loans, small business and micro banking loans (CBC, BB, KUM), mortgages, auto loans, housing loans (KPR), and loans without collateral (KTA, KSM). This unit suffers from a take-up rate of only 1–5% of leads and a limited choice of base customers, because the only parameter is risk criteria, not potential value. The unit aims to increase the take-up rate through better campaigns and by targeting specific customers for cross-selling loans.

The second unit is the credit card business. The credit card of bank XYZ (not disclosed here) offers many features as additional benefits, such as Power Cash, which lets customers use their available credit limit as a loan; Power Bill, which pays monthly utility bills directly from the credit limit; and insurance products. However, penetration is still low, and the challenge is to identify the right customer to offer to at the right time. In this article we focus on Power Cash (PWC). The bank believes that implementing big data analytics can improve its marketing campaigns and the cross-selling of its loan products.

Big Data Analytics Design

The diagram above depicts the layers of an analytics design that the bank can use to ensure fast-moving data is processed regularly and made available as and when required. The data can be further divided into “hot data”, which needs to be used quickly, and “cold data”, which can be stored for longer periods and used later for reporting and analysis.
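As a rough illustration of the hot/cold split, the sketch below partitions records by age. The 30-day window, the `event_time` field, and the sample records are assumptions made for the example; the actual boundary between hot and cold data would depend on the bank's own requirements.

```python
from datetime import datetime, timedelta


def split_hot_cold(records, now, hot_window=timedelta(days=30)):
    """Partition records into (hot, cold) lists by the age of event_time."""
    hot, cold = [], []
    for record in records:
        age = now - record["event_time"]
        # Recent records stay "hot"; older ones go to long-term storage.
        (hot if age <= hot_window else cold).append(record)
    return hot, cold


# Hypothetical transaction records.
records = [
    {"id": "t1", "event_time": datetime(2021, 1, 25)},
    {"id": "t2", "event_time": datetime(2020, 6, 1)},
    {"id": "t3", "event_time": datetime(2021, 1, 1)},
]

hot, cold = split_hot_cold(records, now=datetime(2021, 1, 31))
print([r["id"] for r in hot], [r["id"] for r in cold])  # ['t1', 't3'] ['t2']
```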


[1] Random Forest algorithm

[2] A Novel Trend Symbolic Aggregate Approximation for Time Series


