How did Airtel develop a management system that smartly handles fraud and addresses deduplication — Part 1

Ashish Santuka · Airtel Digital · Apr 11, 2022

With the advent of cloud computing and large-scale data management, an intelligent management system that tackles a multiplicity of conditions to handle fraud and address customer deduplication is critical.

This article describes how Airtel implemented its fraud-management requirements and applied customer-dedupe techniques.

To discuss the solution in detail, we have divided the article into two parts:

1) Business requirements, key challenges, and list of approaches

2) Final solution and architectural learnings

In the first part, we focus on the business requirements and key challenges involved, and list the approaches undertaken to implement those requirements.

Business Requirement:

As part of the business requirement, certain pre-conditions need to be evaluated to identify whether a prospective or an existing customer should be flagged as fraudulent.

A glimpse of the conditions includes:

1) A prospective customer might be an ex-customer of another line of business (LOB) and might be a defaulter.

2) A prospective customer might be an ex-customer of the same line of business and might be a defaulter.

3) A prospective customer might be an ex-customer of the same or another LOB whose details have been flagged as fraudulent by the Fraud Management system.

4) A prospective customer might be an ex-customer of a different service provider who has been flagged as fraudulent by the Fraud Management system.

5) A prospective customer might have used the same or similar identity/address proof to own more connections than the permitted maximum, which violates the criteria.

Amongst the above, cleansing and de-duplicating addresses across lines of business was the major challenge posed to the engineering team.

Although a current system handles a subset of this, there was a need for a system that would not only accommodate the current business requirements but also address future use-cases. Hence, we needed to evaluate and understand the challenges better to arrive at a holistic solution.

Market Research:

As part of this exercise, we identified and analysed a list of systems available to handle address deduplication and matching.

We found a paper that matched the desired criteria. Below is the synopsis of the research paper.

Mining Postal Addresses:

https://www.researchgate.net/publication/220969002_Mining_Postal_Addresses

Abstract of the paper:

This paper presents FuMaS (Fuzzy Matching System), a system capable of efficient retrieval of postal addresses from noisy queries. This system has many possible applications, ranging from data warehouse de-duplication to the correction of input forms, or the integration within online street directories, etc. This paper presents the system’s architecture along with a series of experiments performed using FuMaS. The experiment results show that FuMaS is a very useful system when retrieving noisy postal addresses, being able to retrieve almost 85% of the total ones. This represents an improvement of 15% when compared with other systems tested in this set of experiments.

Besides the above-highlighted research paper, we also found a lecture series on Mining Massive Datasets from Stanford. The material can be found here:

http://infolab.stanford.edu/~ullman/mmds/ch3.pdf and http://snap.stanford.edu/class/cs246-2020/slides/04-lsh_theory.pdf

This chapter provides details on how similarity between documents can be measured and what techniques can be used to identify similarities between texts. Our solution is inspired by this lecture series.
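To make the idea concrete, here is a minimal sketch (not our production code) of MinHash, one of the techniques those chapters cover: it estimates the Jaccard similarity of two addresses from compact signatures over their character 3-gram sets.

```python
import hashlib

def shingles(text, n=3):
    """Character n-grams ("shingles") of a whitespace-normalised string."""
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def minhash_signature(items, num_hashes=64):
    """For each seeded hash function, keep the minimum hash over the set."""
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16) for s in items)
        for seed in range(num_hashes)
    ]

def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions approximates the Jaccard index."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = minhash_signature(shingles("Airtel Building, New Delhi"))
b = minhash_signature(shingles("Airtel Blg, New Dilli"))
print(estimated_jaccard(a, b))  # close to the true Jaccard of the 3-gram sets
```

Locality-sensitive hashing (covered in the second link) then buckets such signatures so that only likely matches need to be compared pairwise.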

Based on our research, we tried the following approaches:

1) Word to Vector based approach

2) Using String Similarity libraries from Postgres

3) Probabilistic Model (N-gram) and Hash key-based KV Store

Challenges Encountered:

1) Ambiguity in Indian addressing methodology: Unlike countries such as Singapore, the US, the UK, Canada, and Australia, India does not have country-wide address standardization, which leads to ambiguity.

2) Unclean data: When we started examining the existing data, we concluded that it needed to be cleaned before being loaded into the system.

Examples:

Airtel Blg, Delhi
Airtel Building, Dilli
Airtel Building, New dilli
Airtel Building, New Delhi
Airtel Rd, Delhi
Airtel Road, Delhi
Airtel St, Delhi
Airtel Street.
The above examples are just one snapshot of the data we had to tackle.
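As an illustration, here is a minimal pre-processing sketch in Python; the abbreviation map is hypothetical and the real cleaning rules are far more extensive.

```python
import re

# Hypothetical expansion/normalisation rules inferred from the examples above.
ABBREVIATIONS = {
    "blg": "building", "bldg": "building",
    "rd": "road", "st": "street",
    "dilli": "delhi",
}

def clean_address(raw):
    text = raw.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # drop punctuation
    tokens = [ABBREVIATIONS.get(t, t) for t in text.split()]
    return " ".join(tokens)

print(clean_address("Airtel Blg, Dilli"))  # -> "airtel building delhi"
print(clean_address("Airtel Rd, Delhi"))   # -> "airtel road delhi"
```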

Word to Vector based approach

[Figure: the word-to-vector process and approach]
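A minimal sketch of this pipeline, assuming gensim for the word vectors and NMSLIB for the approximate nearest-neighbour index; the sample addresses and parameters are illustrative.

```python
import numpy as np
import nmslib
from gensim.models import Word2Vec

addresses = [
    "airtel building new delhi",
    "airtel road delhi",
    "airtel street delhi",
]
tokenised = [a.split() for a in addresses]
model = Word2Vec(tokenised, vector_size=50, min_count=1, seed=42)

def embed(text):
    """Average the word vectors of the known tokens in an address."""
    vecs = [model.wv[t] for t in text.split() if t in model.wv]
    return np.mean(vecs, axis=0)

# Build an HNSW index over the address vectors and query it.
index = nmslib.init(method="hnsw", space="cosinesimil")
index.addDataPointBatch(np.array([embed(a) for a in addresses]))
index.createIndex({"M": 16, "efConstruction": 100})

ids, dists = index.knnQuery(embed("airtel building delhi"), k=2)
for i, d in zip(ids, dists):
    print(addresses[i], round(float(d), 3))
```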

Pros:

1) Python was the language of choice; hence, a variety of libraries were available to run a quick proof-of-concept.

2) NMSLIB finds similar items efficiently on indexed vectors.

Cons:

1) The FAISS library's throughput was better on GPU than on a multi-core processor, which increases the operational and initial costs.

2) We could not apply richer string-matching algorithms such as Jaccard or Jaro-Winkler on top of the vector representation.

String Similarity Libraries from Postgres:

The following steps were applied in this approach:

1) Load the address data

2) Generate trigrams on those texts using Postgres libraries

3) Leverage the pg_similarity library to run a list of string-similarity functions at the DB level and derive a similarity score (see the sketch below)
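A minimal sketch of these steps, assuming the pg_trgm and pg_similarity extensions are installed and a hypothetical addresses(address text) table.

```python
import psycopg2

conn = psycopg2.connect("dbname=dedupe user=postgres")  # assumed DSN
cur = conn.cursor()

# One-time setup: trigram extension plus a GIN index so that the
# % (trigram similarity) operator can pre-filter candidates at scale.
cur.execute("""
    CREATE EXTENSION IF NOT EXISTS pg_trgm;
    CREATE EXTENSION IF NOT EXISTS pg_similarity;
    CREATE INDEX IF NOT EXISTS idx_addr_trgm
        ON addresses USING gin (address gin_trgm_ops);
""")
conn.commit()

# Retrieve trigram-similar candidates, then score them with
# pg_similarity's Jaro-Winkler function, all inside one SQL query.
query = "airtel building new delhi"
cur.execute("""
    SELECT address, jarowinkler(address, %s) AS score
    FROM addresses
    WHERE address %% %s          -- trigram pre-filter
    ORDER BY score DESC
    LIMIT 10;
""", (query, query))
for address, score in cur.fetchall():
    print(address, score)
```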

Pros:

1) Reliable, scalable

2) The pre-defined library generates the string-similarity score and abstracts it away from the application layer

3) Throughput was better since the entire logic was executed as a SQL query

4) Replication and clustering are native to the DB, so high availability is achieved

5) The database could seamlessly load 300 million records through the language of our choice, leveraging existing ETL tools

Cons:

1) The substantial number of records retrieved, against which the string-similarity score has to be calculated, makes a query take significant time (more than 12 seconds)

2) The libraries used were not part of the standard Postgres build; as a result, getting support for them was difficult and the setup was deemed unfit for production

Probabilistic Model (N-gram) and Hash key-based KV Store:

The following steps were applied in this approach:

1) Load the address data

2) Perform pre-processing and clean the data before generating n-grams

3) Generate n-grams from the address text

4) Generate a hash key for each n-gram and store it in the hash-key value store

5) When an incoming request arrives for evaluation, pre-process it, generate n-grams, and query the appropriate DB corresponding to the first two digits of the PIN code

6) Query the respective values from the hash-key store and generate a similarity score (see the sketch below)
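A minimal sketch of the indexing and lookup flow, with a plain dictionary standing in for the real key-value store; names and data are illustrative.

```python
import hashlib
from collections import defaultdict

def ngrams(text, n=3):
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def hash_key(gram):
    return hashlib.md5(gram.encode()).hexdigest()

# Index: hashed n-gram -> ids of the addresses that contain it.
addresses = {1: "airtel building new delhi", 2: "airtel road delhi"}
store = defaultdict(set)
for addr_id, addr in addresses.items():
    for g in ngrams(addr):
        store[hash_key(g)].add(addr_id)

def score_candidates(query):
    """Jaccard similarity over hashed n-gram sets for every candidate hit."""
    q_keys = {hash_key(g) for g in ngrams(query)}
    shared = defaultdict(int)
    for key in q_keys:
        for addr_id in store.get(key, ()):
            shared[addr_id] += 1
    scores = {}
    for addr_id, inter in shared.items():
        a_keys = {hash_key(g) for g in ngrams(addresses[addr_id])}
        scores[addr_id] = inter / len(q_keys | a_keys)
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(score_candidates("airtel bldg new delhi"))  # best match first
```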

Pros:

1) Reliable, scalable

2) String-similarity scores are derived using the variety of libraries available for the purpose

3) Throughput was better since we used a graph database

Cons:

1) A custom wrapper needs to be written to build a cluster. However, this is not a concern because pre-defined libraries and patterns are available to support this architecture

2) Maintaining quick response times and high throughput was a challenge

Thus, we discussed the various approaches that could be the best fit for us at Airtel and scrutinised which of them were feasible and reliable.

In the next part of the blog, we will cover the technical details of the solution and some architectural considerations.
