Leveraging Open Banking through Data Science

Gustavo Martins
Marionete
Published Nov 12, 2020 · 4 min read

Overdraft prediction for pre-emptive loan offering

Open banking allows financial services companies to have a full view of clients' payments, accounts, mortgages, credit cards, savings, etc. Image by author.

Table of contents

  • Overview
  • Open banking opportunities
  • Solution approach
    º Dataset
    º Baseline model
    º Machine learning model
    º Metrics
  • Results and discussion
  • References

Overview

An approach for overdraft prediction is analysed here.

Image by author

With the increasing adoption of open banking applications by customers, new opportunities are emerging for the banking sector, spanning value-added services, cost savings, user experience and more.

A holistic view of a client's payments, credit profile and asset portfolio allows tailored product offerings and better customer segmentation.

Knowing in advance when a client's balance will go into overdraft enables the business to offer loans ahead of competitors.

Before open banking, one could argue that offering a loan before an overdraft would not benefit the bank, since an overdraft generates more revenue through heavy fees and high interest rates. In this new digital ecosystem that no longer holds true: companies now compete to be the first to provide a better service.

Open banking opportunities

The growing open banking sector is expected to generate £7.2bn by 2022 in the UK [1].

Adoption among retail customers is expected to reach 64%, and 71% of SMEs are expected to use open banking by 2022 [1].

Solution approach

Dataset

To illustrate this analysis, a banking transactions dataset was used [2].

Image by author
Image by author

Baseline model

A baseline model is a simple, explainable approach that does not require (much) parameter tuning.

The objective is to assess if a simple approach would achieve satisfactory results.

A machine learning model requires productionization, deployment, monitoring and versioning. Thus it is necessary to evaluate if a more elaborate solution is required.

The baseline model's premise is: if the difference between the moving averages (3 months) of withdrawals and deposits is larger than the balance, an overdraft is predicted.
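As a minimal sketch of this premise, assuming hypothetical daily per-account aggregates with `withdrawal`, `deposit` and `balance` columns (the synthetic data and column names here are illustrative, not the article's dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical daily aggregates for one account; values are synthetic.
rng = np.random.default_rng(0)
n = 180
df_daily = pd.DataFrame({
    "date": pd.date_range("2020-01-01", periods=n, freq="D"),
    "withdrawal": rng.gamma(2.0, 50.0, n),
    "deposit": rng.gamma(2.0, 55.0, n),
    "balance": 1000 + rng.normal(0, 200, n).cumsum(),
})

window = 90  # roughly 3 months of daily observations

# 3-month moving averages of outflows and inflows
ma_withdrawal = df_daily["withdrawal"].rolling(window, min_periods=1).mean()
ma_deposit = df_daily["deposit"].rolling(window, min_periods=1).mean()

# Baseline rule: flag an overdraft when the average net outflow
# exceeds the current balance.
df_daily["overdraft_pred"] = (ma_withdrawal - ma_deposit) > df_daily["balance"]
```

In a real setting the rolling windows would be computed per account (e.g. after a `groupby('account')`).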

Machine learning model

The df DataFrame containing the transactions was transformed for feature engineering:

import numpy as np
import pandas as pd

# Represent debits as negative amounts
only_debit = df['flow'] == 'debit'
df.loc[only_debit, 'amount'] = -df.loc[only_debit, 'amount']

# One-hot encode the flow direction (credit/debit)
df = pd.concat([df, pd.get_dummies(df['flow'])], axis=1)
df['withdrawal'] = (df['amount'] * df['debit']).abs()
df['deposit'] = df['amount'] * df['credit']

# Aggregate transactions per account and day
df_daily = df.groupby(['account', 'date']).agg(
    amount=('amount', np.sum),
    withdrawal=('withdrawal', np.sum),
    deposit=('deposit', np.sum),
    transactions=('transaction', 'count'),
    debits=('debit', np.sum),
    credits=('credit', np.sum),
).reset_index()

# Resample to a daily frequency so calendar gaps appear explicitly
df_daily = df_daily.groupby('account').resample(
    'D',
    on='date',
    closed='left',
).mean()

For oversampling the minority class, SMOTE [3] was chosen, as implemented in imbalanced-learn. XGBoost was selected as the machine learning algorithm.

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from xgboost import XGBClassifier

sampling = 0.05
resampler = SMOTE(
    sampling_strategy=sampling,
    k_neighbors=3,
    random_state=123,
)

params = {
    'max_depth': 5,
    'n_estimators': 10,
    'num_parallel_tree': 1,
    'learning_rate': 0.3,
    'reg_lambda': 1,
    'reg_alpha': 0,
    'subsample': 1,
    'min_child_weight': 300,
    'max_delta_step': 0,
    'objective': 'binary:logistic',
    'tree_method': 'hist',
    'grow_policy': 'depthwise',
    'seed': 123,
}
algo = XGBClassifier(**params)

# Note: imbalanced-learn's Pipeline (not scikit-learn's) is required,
# so that SMOTE is applied during fit only.
model = Pipeline([('res', resampler), ('algo', algo)])

Metrics

With a small number of events (i.e. when a client's balance goes from positive to negative), less than 1% of the total, this type of challenge is also known as anomaly detection.

Accuracy is therefore not an adequate metric: a model that predicts no events at all would still achieve an accuracy score above 99%.

A more suitable metric is the F-score, the weighted harmonic mean of recall (which penalizes missed events) and precision (which penalizes wrong predictions).

Image by author
Image by author

Different weights β can be chosen according to business requirements.

F3 was selected, weighting recall 3:1 higher than precision.
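A small illustration of the F3 metric, using scikit-learn's `fbeta_score` on toy labels (the label values here are made up for the example):

```python
from sklearn.metrics import fbeta_score

# Toy labels: 1 marks an overdraft event.
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 1, 0, 0, 1, 0]

# beta=3 weights recall three times as heavily as precision.
score = fbeta_score(y_true, y_pred, beta=3)
```

Here precision and recall are both 2/3 (2 true positives, 1 false positive, 1 missed event), so the F3 score is also 2/3; with unequal precision and recall, β=3 pulls the score toward recall.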

Results and discussion

The results are presented here:

Image by author

The XGBoost results are superior to those of the baseline model.

As stated previously, other weights can be applied to the F-score depending on business requirements. One approximation is the ratio between the revenue from a product sale and the cost of a product offering, multiplied by the conversion rate.

The notification window can also be adapted to product requirements:

Image by author

References

[1] PWC, The future of banking is open (2017), Open banking report

[2] P. Berka and M. Sochorova, PKDD’99 Discovery Challenge (1999), Guide to the Financial Data Set

[3] N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer, SMOTE: synthetic minority over-sampling technique (2002), Journal of Artificial Intelligence Research, 321–357.
