Detect shared bank accounts: A peer-to-peer entity matching problem

Published in PayLead · 7 min read · May 29, 2024

Introduction

PayLead processes millions of transactions daily from hundreds of thousands of distinct bank accounts. Our internal algorithms analyze each transaction to determine the potential redistribution of cashback to the user’s bank account according to specific criteria.

However, because PayLead partners with multiple financial institutions, a single bank account can be enrolled in several cashback programs. For instance, this duplication can occur when a bank customer subscribes to a cashback program both through their bank’s application and through an account aggregator that also partners with PayLead. As a result, we receive duplicate transactions that appear to originate from different bank accounts in our system but are, in fact, carried out by the same real user.

This issue also occurs with joint accounts, when two users independently enroll the same account into a cashback program.

PayLead’s Data & Machine Learning team has developed an algorithm to identify duplicate bank accounts. It prevents the same transaction from being rewarded multiple times via different accounts, which protects the integrity of our platform and ensures that cashback is attributed accurately to the legitimate accounts.

Bank account deduplication and reward attribution workflow

1/ Scalability and accuracy limitations of early detection

In the early stages of duplicate account detection, we developed an SQL-based solution that matched some transaction attributes using predefined similarity scores to determine if two accounts were duplicated. While this deterministic approach was a step in the right direction, it quickly became evident that it had significant limitations, particularly as our user database grew.

Scalability Issues

As the number of PayLead users grew, our SQL query-based solution struggled to remain cost- and time-effective. The process involved scanning vast amounts of data daily on a PostgreSQL database that is also used directly by our API. This led to substantial database locks, causing delays and inefficiencies that significantly impacted our overall system performance.

Accuracy Concerns

Another critical point was the accuracy of our initial model. The predefined similarity scores used in the SQL queries were not sufficiently robust, leading to a significant error rate requiring manual moderation using our own back-office tool. This time-consuming and labor-intensive process slowed down our operations and compromised the integrity of the cashback attribution system.

To address these challenges, we recognized the need for a more sophisticated and scalable solution. This led us to explore machine learning techniques that could handle the complexity and volume of our data more effectively while improving the accuracy of duplicate account detection.

2/ Build an ML peer-to-peer entity-matching model

To tackle this issue, our team has developed an entity-matching model. This widely used machine learning technique is designed to identify and link different records that correspond to the same entity across various data sources.

Challenges of this type of model

There are two main challenges in deploying such a model:

1) Discrepancies between different versions of the same entity

A single bank account onboarded through two different financial institutions (the bank itself and an aggregator for instance) could have only a small portion of identical transactions between its two instances.

Several factors contribute to these discrepancies:

The same transaction reported by two different financial institutions might reach us with slight variations (like delays between the purchase dates), making strict transaction matching impractical.

Some bank accounts, especially those connected to an aggregation service, can be temporarily deactivated. During this period, we do not receive any transactions, but we continue to receive all transactions through the other institution. In addition, not all financial institutions share the same transaction history with us.

These differences between data sources mean that two accounts can be identical even if fewer than 50% of their transactions coincide, which calls for detection methods that are both precise and flexible.
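To illustrate why strict matching falls short, here is a minimal sketch comparing exact equality with a date-tolerant comparison. It is not the matching logic used at PayLead (which this article does not cover): the `(date, amount_in_cents)` tuple layout, the `is_same_transaction` helper, and the two-day tolerance are all assumptions made for the example.

```python
from datetime import date

def is_same_transaction(tx_a, tx_b, max_delay_days=2):
    """Treat two (date, amount_in_cents) tuples as the same purchase when the
    amounts are equal and the dates differ by at most max_delay_days.
    The two-day default is an arbitrary, illustrative value."""
    date_a, amount_a = tx_a
    date_b, amount_b = tx_b
    return amount_a == amount_b and abs((date_a - date_b).days) <= max_delay_days

# Hypothetical example: the same purchase reported one day apart.
tx_from_bank = (date(2024, 5, 1), 1250)
tx_from_aggregator = (date(2024, 5, 2), 1250)

strict_match = tx_from_bank == tx_from_aggregator
tolerant_match = is_same_transaction(tx_from_bank, tx_from_aggregator)
```

With strict equality the two records are treated as different purchases, while the tolerant comparison pairs them.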

2) Algorithmic complexity

The number of pairwise comparisons in an entity-matching problem grows quadratically with the number of entities involved. For PayLead, which processes transactions from over a million active accounts, any account could in principle be linked with any other, so the number of candidate pairs is N(N−1)/2. For N = 10^6, this is roughly 5 × 10^11, i.e., on the order of 10^12 comparisons.
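The quadratic growth is easy to check with a short calculation. The sketch below assumes a round figure of one million accounts, used purely for illustration; `pair_count` is a hypothetical helper name.

```python
def pair_count(n: int) -> int:
    """Number of unordered pairs among n accounts: n(n-1)/2."""
    return n * (n - 1) // 2

n_accounts = 1_000_000  # assumed round figure, not an exact PayLead count
total_pairs = pair_count(n_accounts)
print(f"{total_pairs:.2e}")  # 5.00e+11
```

Doubling the number of accounts roughly quadruples the number of pairs, which is why a brute-force comparison does not scale.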

Several methods are used to reduce the complexity of an entity-matching problem. The most common involves establishing a ‘blocking key’, a characteristic that entities must strictly share to be considered identical. In our scenario, financial institutions must provide the bank name associated with the account from which transactions are sent. This allows us to compare only accounts from the same bank (one possible blocking key), reducing the number of comparisons by a factor of roughly 1,000.
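A minimal sketch of blocking, assuming each account is represented as a hypothetical `(account_id, bank_name)` tuple: candidate pairs are generated only inside each bank's block instead of across the whole population.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(accounts):
    """Yield account-id pairs that share the same blocking key (bank name).
    `accounts` is an iterable of (account_id, bank_name) tuples (assumed layout)."""
    blocks = defaultdict(list)
    for account_id, bank in accounts:
        blocks[bank].append(account_id)
    for ids in blocks.values():
        # Only accounts within the same block are ever compared.
        yield from combinations(sorted(ids), 2)

# Hypothetical toy data: five accounts spread over two banks.
accounts = [
    ("a1", "Bank A"), ("a2", "Bank A"),
    ("b1", "Bank B"), ("b2", "Bank B"), ("b3", "Bank B"),
]
pairs = list(candidate_pairs(accounts))
```

On this toy input, blocking leaves 4 candidate pairs instead of the 10 that an all-against-all comparison of five accounts would produce. The actual reduction depends on how evenly accounts are spread across banks.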

Model fitting with efficient dataset annotation

To build such a model, it is preferable to have a labeled dataset, i.e., pairs of bank accounts labeled appropriately according to whether they are identical or not. The problem is that annotating such a dataset is hugely time-consuming, as it would require significant human effort, and more than 99.99% of the annotated pairs would be true negatives, i.e., accounts that are not identical.

However, some financial institutions share the IBAN associated with a bank account. Although this information is encrypted in our system, it serves as a unique identifier for a bank account. It allows us to gather a dataset of a few thousand pairs of identical entities that we can use to test different models.
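One way such a set of positive pairs could be assembled is sketched below. It assumes that the encrypted IBAN is deterministic, so two accounts with the same IBAN carry the same token; the `(account_id, iban_token)` layout and the `positive_pairs` helper are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def positive_pairs(records):
    """Build known-duplicate pairs from (account_id, iban_token) records.
    iban_token is the encrypted IBAN, or None when the institution does not
    share it. Assumes equal IBANs always yield equal tokens."""
    by_iban = defaultdict(set)
    for account_id, iban_token in records:
        if iban_token is not None:
            by_iban[iban_token].add(account_id)
    pairs = set()
    for ids in by_iban.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

# Hypothetical toy data: a1 and a2 share a token, a4 has no IBAN.
records = [("a1", "tok_X"), ("a2", "tok_X"), ("a3", "tok_Y"), ("a4", None)]
labeled = positive_pairs(records)
```

Accounts without a shared IBAN simply stay unlabeled, which is why the resulting dataset covers only part of the account population.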

We use this annotated dataset to rigorously evaluate the performance of our model, ensuring its accuracy and reliability before deployment.

Feature engineering & similarity matrix

To model our problem, we have chosen to vectorize each bank account into several attributes. We mainly use the different transaction amounts of each account to create vectors, where each attribute corresponds to the number of transactions made for a specific amount.
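The vectorization described above can be sketched as a bag of amounts: each account becomes a sparse vector whose keys are amounts and whose values are transaction counts. The `(account_id, amount)` input layout and the conversion of amounts to integer cents are assumptions for the example, not details from the production pipeline.

```python
from collections import Counter, defaultdict

def vectorize(transactions):
    """Map each account to a sparse count vector {amount_in_cents: count}.
    `transactions` is an iterable of (account_id, amount) tuples (assumed layout)."""
    vectors = defaultdict(Counter)
    for account_id, amount in transactions:
        # Integer cents avoid float keys that differ by rounding noise.
        vectors[account_id][round(amount * 100)] += 1
    return dict(vectors)

# Hypothetical toy data.
transactions = [("a1", 12.50), ("a1", 12.50), ("a1", 9.99), ("a2", 12.50)]
vectors = vectorize(transactions)
```

Stacking these per-account vectors for one bank gives one row per account and one column per distinct amount, which is naturally stored as a sparse matrix since each account uses only a small fraction of all amounts.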

Once this process has been repeated for each active account and grouped according to the defined blocking key, we obtain several sparse matrices, one for each bank studied.

Then, to compute the similarity matrix, we conducted dozens of tests on our previously constructed dataset, varying hyperparameters of the model such as the feature normalization, the distance measure, and the similarity threshold above which we consider two accounts to be identical.
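As an illustration of how a threshold could be tuned against labeled pairs, the sketch below sweeps a few candidate thresholds and reports precision and recall. The article does not state which metrics were used, so precision and recall are an assumption here, as are the toy scores and the `precision_recall` helper.

```python
def precision_recall(scored_pairs, positives, threshold):
    """Precision and recall of 'score >= threshold' against known duplicates.
    scored_pairs: {(id_a, id_b): similarity}; positives: set of true pairs."""
    predicted = {pair for pair, score in scored_pairs.items() if score >= threshold}
    true_positives = len(predicted & positives)
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(positives) if positives else 1.0
    return precision, recall

# Hypothetical similarity scores and labels.
scored_pairs = {
    ("a1", "a2"): 0.95, ("a1", "a3"): 0.40,
    ("b1", "b2"): 0.70, ("b1", "b3"): 0.65,
}
positives = {("a1", "a2"), ("b1", "b2")}

results = {t: precision_recall(scored_pairs, positives, t) for t in (0.5, 0.68, 0.9)}
```

On this toy data, a low threshold lets a false positive through, a high one misses a true duplicate, and the middle value separates the two classes cleanly. The same loop can be repeated for each normalization and distance measure under test.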

Having ultimately opted for cosine similarity, we use sparse_dot_topn, an open-source Python package developed by engineers from ING Wholesale Banking Advanced Analytics. It is optimized for large-scale sparse matrix multiplication, notably through C++ extensions.

Detect shared bank accounts: vectorization and similarity computation.

By keeping only the similarities above a specific threshold, we generate a list of shared account pairs as the output of this model. Once this is done, another crucial task follows: detecting similar transactions within each pair of shared accounts (which is not covered in this article).
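The similarity and filtering steps can be sketched in plain Python on a small block of accounts. This is a didactic stand-in rather than the production code, which relies on optimized sparse matrix multiplication; the toy vectors, the `cosine` and `shared_pairs` helpers, and the 0.8 threshold are assumptions chosen for the example.

```python
import math
from collections import Counter
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two sparse count vectors (dict-like)."""
    dot = sum(count * v[key] for key, count in u.items() if key in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def shared_pairs(vectors, threshold):
    """Return (id_a, id_b, score) for every pair in one block whose
    cosine similarity is at least `threshold`."""
    result = []
    for a, b in combinations(sorted(vectors), 2):
        score = cosine(vectors[a], vectors[b])
        if score >= threshold:
            result.append((a, b, score))
    return result

# Hypothetical vectors for one bank: {amount_in_cents: transaction count}.
vectors = {
    "a1": Counter({1250: 2, 999: 1, 4500: 1}),
    "a2": Counter({1250: 2, 999: 1}),
    "a3": Counter({3000: 1}),
}
pairs = shared_pairs(vectors, threshold=0.8)  # 0.8 is an arbitrary example value
```

Here only a1 and a2 are returned, even though a1 has one transaction that a2 lacks. When every row is L2-normalized beforehand, cosine similarity reduces to a dot product, so the whole block can be scored as a single sparse matrix product of the feature matrix with its transpose, which is the kind of operation that libraries such as sparse_dot_topn accelerate.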

This pipeline ensures that a user who has made a single eligible transaction is rewarded only once.

3/ Technical Architecture

Thanks to the work of PayLead’s Data Engineers and Infrastructure team, we have a technical stack that enables the large-scale industrialization of this type of Machine Learning model.

A feature store database is stored in a ClickHouse Data Warehouse, where we centralize all the essential features of the various Machine Learning models. In particular, this is where we calculate the features linked to the various bank accounts based on the analysis of their transactions.

In this model, the raw data (bank transactions) is transformed into features using dbt, a data transformation tool.

The matrix calculation is done in Python, and the whole process is orchestrated using Dagster and various assets.

Shared account peer-to-peer entity linking workflow

This implementation enables us to solve an entity-matching problem requiring a billion comparisons in just a few minutes of computation.

Conclusion

In this article, we explained how PayLead ensures the reliability of its platform by using machine learning techniques and a dedicated infrastructure to detect duplicate accounts belonging to the same real user. This ensures that an eligible transaction is rewarded only once, even if the user’s bank account is enrolled in several cashback programs via multiple financial institutions.

However, this entity linking model reaches certain limits when a user onboards their own account via more than 3 of PayLead’s partner financial institutions, creating clusters of shared accounts. We will see in another article how we have adapted this pair-by-pair model to a clustering one.

May 2024, Pierre-Louis Danieau, Data Scientist.

PayLead: Fintech Seamlessly Embedding Loyalty into Financial Services.
We leverage bank transaction data to power a SaaS platform for banks and retailers across Europe that delivers seamless loyalty and engaging reward experiences when people bank, shop and pay. Join us!

Disclaimer: The content of this article is for general informational purposes exclusively. All information is provided in good faith; however, PayLead makes no representation or warranty of any kind, express or implied, regarding the accuracy, adequacy, validity, reliability or completeness of this article. PayLead excludes any responsibility arising from this article.

Intellectual property rights held by PayLead protect all information in this article. Consequently, none of this information may be reproduced, modified, redistributed, translated, commercially exploited, or reused in any way whatsoever without the prior written consent of PayLead.