How did Airtel develop a management system that deals with Fraud smartly and addresses Deduplication — Part 2
Recap:
In the previous part of the blog, we took a deep dive into the various approaches and examined which of them were feasible and reliable. In this part, we will discuss the approaches we selected and their technical details, which helped Airtel build the smart fraud management system.
Probabilistic Model (N-Gram) based Key-value Store and Hash key-based storage:
As part of the solution, we used a completely open-source stack instead of a licensed solution.
Following are the libraries and open-source packages that were used:
N-Gram Model: A probabilistic model used to predict the next item in a sequence in the form of an (n − 1)-order Markov model. Two benefits of n-gram models are simplicity and scalability: with a larger n, a model can store more context with a well-understood space-time trade-off, enabling small experiments to scale up efficiently.
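For illustration, here is a minimal sketch of character n-gram generation in Go; the trigram size and the normalisation shown are assumptions for the example, not necessarily what the production system uses:

```go
package main

import (
	"fmt"
	"strings"
)

// nGrams returns the character n-grams of a normalised string.
// A gram size of 3 (trigrams) is assumed here for illustration.
func nGrams(s string, n int) []string {
	s = strings.ToLower(strings.ReplaceAll(s, " ", ""))
	if len(s) < n {
		return []string{s}
	}
	grams := make([]string, 0, len(s)-n+1)
	for i := 0; i+n <= len(s); i++ {
		grams = append(grams, s[i:i+n])
	}
	return grams
}

func main() {
	fmt.Println(nGrams("Nelson Mandela Road", 3))
}
```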
Postgres DB: All information related to the customers is currently stored in Postgres, from which the hashing happens.
Golang Libraries: We are currently using multiple Golang libraries to implement the machine-learning models. For our use-case, Golang proved more flexible than Python and performed better than Java.
Below are the Golang libraries being used in the Solution:
Badger DB: Badger DB is used for storing the hashes generated by the system. Once a request hits the system, its n-grams are generated and matched against the stored hashes.
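A minimal sketch of this store-and-look-up pattern with Badger follows; the store path and the gram-to-contact-ID key layout are assumptions for illustration:

```go
package main

import (
	"fmt"

	badger "github.com/dgraph-io/badger/v4"
)

func main() {
	// Open the store that holds the gram -> contact-ID hashes.
	db, err := badger.Open(badger.DefaultOptions("/tmp/dedupe-hash"))
	if err != nil {
		panic(err)
	}
	defer db.Close()

	// Write one gram key during indexing.
	err = db.Update(func(txn *badger.Txn) error {
		return txn.Set([]byte("gram:nel"), []byte("C001"))
	})
	if err != nil {
		panic(err)
	}

	// Look the gram up when a request's n-grams are matched.
	err = db.View(func(txn *badger.Txn) error {
		item, err := txn.Get([]byte("gram:nel"))
		if err != nil {
			return err
		}
		return item.Value(func(val []byte) error {
			fmt.Printf("gram:nel -> %s\n", val)
			return nil
		})
	})
	if err != nil {
		panic(err)
	}
}
```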
Buffalo: Buffalo is a Golang framework for hosting the various APIs created as part of the dedupe system. Buffalo ships with its own ORM, which supports the major relational databases.
Similarity: Among the similarity algorithms available for our use-case, the Jaro-Winkler algorithm seemed to be the best fit. The similarity scores this algorithm generated, even when words or characters were jumbled, were outstanding.
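For illustration, here is a hedged example using one open-source Go implementation of Jaro-Winkler (github.com/xrash/smetrics); the library choice and inputs are assumptions for the example:

```go
package main

import (
	"fmt"

	"github.com/xrash/smetrics"
)

func main() {
	// Jaro-Winkler scores lie in [0, 1]; 1 means identical strings.
	// 0.7 boost threshold and prefix size 4 are the commonly used defaults.
	a, b := "rakesh kumar", "rakesh kumaar"
	score := smetrics.JaroWinkler(a, b, 0.7, 4)
	fmt.Printf("similarity(%q, %q) = %.3f\n", a, b, score)
}
```

Note how the score stays high even with a misspelling, which is what makes the algorithm a good fit for jumbled names and addresses.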
Process Flow:
The Process flow diagram is explained below:
API Validation: For a given request, check all required tags and validate the fields for format, allowed values, etc.
- If valid, proceed further, generating a UUID for the request if one is not already present.
- Else, return the validation error code and message.
Pre-processing: Clean up the contact name and address (remove special characters and repeated words, convert complete forms to abbreviations, etc.), as sketched below.
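A minimal sketch of this clean-up step; the abbreviation map entries are illustrative assumptions, not the production mapping:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var nonAlnum = regexp.MustCompile(`[^a-z0-9 ]+`)

// abbreviations maps a few complete forms to short forms;
// these entries are illustrative assumptions.
var abbreviations = map[string]string{
	"road":      "rd",
	"street":    "st",
	"apartment": "apt",
}

// normalise lower-cases the text, strips special characters,
// collapses repeated words, and applies the abbreviation map.
func normalise(s string) string {
	s = nonAlnum.ReplaceAllString(strings.ToLower(s), " ")
	seen := map[string]bool{}
	out := []string{}
	for _, w := range strings.Fields(s) {
		if abbr, ok := abbreviations[w]; ok {
			w = abbr
		}
		if !seen[w] {
			seen[w] = true
			out = append(out, w)
		}
	}
	return strings.Join(out, " ")
}

func main() {
	fmt.Println(normalise("Flat 4-B, Nelson   Mandela Road Road, New Delhi"))
}
```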
Get In-transit Matches: Fetch the connections from the In-transit table corresponding to:
- Proof Of Identity (POI) Id, if a POI Id exists in the request: this is an exact match against all records present in the In-transit table.
- Similar name and address combination: this is an approximate match, based on Jaro-Winkler distance, against all records present in the In-transit table.
Upsert In-transit Record: Insert or update a record in the In-transit table based on the following logic:
- If the In-transit table already contains a record with the request’s UUID, then update the record with the current request’s data.
- Else, insert a new record in the In-transit table with request data and UUID.
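This insert-or-update logic maps naturally onto Postgres's ON CONFLICT clause. A minimal sketch follows; the table name, column names, and the lib/pq driver are assumptions for illustration:

```go
package dedupe

import (
	"database/sql"

	_ "github.com/lib/pq" // assumed Postgres driver
)

// upsertSQL implements the insert-or-update logic above with
// Postgres's ON CONFLICT clause, keyed by the request UUID.
const upsertSQL = `
INSERT INTO in_transit (uuid, poi_id, name, address, updated_at)
VALUES ($1, $2, $3, $4, now())
ON CONFLICT (uuid) DO UPDATE
SET poi_id = EXCLUDED.poi_id,
    name = EXCLUDED.name,
    address = EXCLUDED.address,
    updated_at = now()`

func upsertInTransit(db *sql.DB, uuid, poiID, name, address string) error {
	_, err := db.Exec(upsertSQL, uuid, poiID, name, address)
	return err
}
```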
Get POI Matches: Fetch the connections from the Contact table that match the POI Id of the request. This is an exact match against all Active records in the Contact table.
Get Name + Address Matches: Fetch the connections from the Address and Contact tables that are similar to the name and address combination of the request.
Further Considerations:
1. Since addresses are stored in separate tables based on the first two digits of the Pincode, this matching is done on a specific address hash file.
2. The input request is pre-processed: we remove spaces and special characters and convert the text to a uniform case.
3. The n-grams are generated from the request address and do not include the City, State and Country details.
4. The grams are searched in the hash table corresponding to the first two digits of the request PIN code.
5. All matching values of the keys are fetched and stored in a map.
6. Then, using a Golang Jaro-Winkler implementation, we check the similarity between the fetched addresses and the input address (a sketch of this lookup-and-filter flow follows this list).
7. The similarity-ratio threshold is stored in the config at Circle level.
8. If the similarity factor is greater than or equal to the value present in the config file, the Contact Ids against such addresses are used for querying the Postgres DB.
9. Using the address only, identify similar records above a similarity score (a Circle-wise configurable value).
10. For records with similar addresses, match contact names with different combinations of first and last names.
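Pulling points 2 to 8 together, here is a minimal sketch of the lookup-and-filter flow. An in-memory map stands in for the Badger hash table of one PIN-code prefix, and the sample data and threshold value are assumptions for illustration:

```go
package main

import (
	"fmt"
	"strings"

	"github.com/xrash/smetrics"
)

// nGrams returns character trigrams of a normalised address.
func nGrams(s string) []string {
	s = strings.ToLower(strings.ReplaceAll(s, " ", ""))
	grams := []string{}
	for i := 0; i+3 <= len(s); i++ {
		grams = append(grams, s[i:i+3])
	}
	return grams
}

func main() {
	// gramIndex stands in for the Badger hash table of one PIN-code
	// prefix: gram -> contact IDs (illustrative data).
	gramIndex := map[string][]string{}
	stored := map[string]string{"C001": "4b nelson mandela rd"}
	for id, addr := range stored {
		for _, g := range nGrams(addr) {
			gramIndex[g] = append(gramIndex[g], id)
		}
	}

	// Points 4-5: collect candidate contact IDs for the request grams.
	request := "4-b nelsn mandela road"
	candidates := map[string]bool{}
	for _, g := range nGrams(request) {
		for _, id := range gramIndex[g] {
			candidates[id] = true
		}
	}

	// Points 6-8: keep candidates whose Jaro-Winkler similarity meets
	// the Circle-level threshold (0.85 here is an assumed value).
	const circleThreshold = 0.85
	for id := range candidates {
		score := smetrics.JaroWinkler(request, stored[id], 0.7, 4)
		if score >= circleThreshold {
			fmt.Printf("contact %s matches (score %.3f); query Postgres\n", id, score)
		}
	}
}
```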
Dedupe Result: Combine all matched records from steps 3, 5 and 6 above and remove duplicate matches. Send the details of the matched records as the response to the API. Also, store the dedupe result against the request's UUID so that the same results can be fetched in future.
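A minimal sketch of this merge-and-dedupe step; the Match struct and its fields are assumptions for illustration:

```go
package main

import "fmt"

// Match is an assumed shape for a matched record.
type Match struct {
	ContactID string
	Source    string // "in-transit", "poi", or "name+address"
}

// mergeMatches combines the match sets from the three lookups and
// keeps the first occurrence of each contact ID.
func mergeMatches(sets ...[]Match) []Match {
	seen := map[string]bool{}
	var out []Match
	for _, set := range sets {
		for _, m := range set {
			if !seen[m.ContactID] {
				seen[m.ContactID] = true
				out = append(out, m)
			}
		}
	}
	return out
}

func main() {
	inTransit := []Match{{"C001", "in-transit"}}
	poi := []Match{{"C001", "poi"}, {"C002", "poi"}}
	nameAddr := []Match{{"C003", "name+address"}}
	fmt.Println(mergeMatches(inTransit, poi, nameAddr))
}
```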
Fraud Detection: For each request received by our system, we perform deduping and fraud detection simultaneously. For fraud detection, we detect both types of fraud: financial and non-financial.
Financial Frauds: Once we get the list of matching records, we fetch the outstanding details for the inactive accounts and check for values greater than 0. The total outstanding sum goes in the header, while the details for each such account are sent at the child level.
Non-Financial Frauds: Once a customer comes under the scrutiny of the authorities due to criminal activities such as ransom, fixing, etc., the numbers and the data are shared across the service providers. After we get the list of matching records, we check for such flags; if the customer falls under such frauds, this is sent in the response and flagged as Y at the header level, along with the complete details at the child level.
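A minimal sketch of how the header/child split for both fraud types might look; the struct shapes and field names are assumptions for illustration:

```go
package main

import "fmt"

// Account is an assumed shape for a matched customer's account.
type Account struct {
	ID          string
	Active      bool
	Outstanding float64
	FraudFlag   bool // set when the number was flagged by authorities
}

// FraudSummary mirrors the header/child split described above;
// field names are illustrative assumptions.
type FraudSummary struct {
	TotalOutstanding float64   // header level
	NonFinancialFlag string    // "Y"/"N" at header level
	Details          []Account // child level
}

func summarise(accounts []Account) FraudSummary {
	s := FraudSummary{NonFinancialFlag: "N"}
	for _, a := range accounts {
		financial := !a.Active && a.Outstanding > 0
		if financial {
			s.TotalOutstanding += a.Outstanding
		}
		if a.FraudFlag {
			s.NonFinancialFlag = "Y"
		}
		if financial || a.FraudFlag {
			s.Details = append(s.Details, a)
		}
	}
	return s
}

func main() {
	fmt.Printf("%+v\n", summarise([]Account{
		{ID: "A1", Active: false, Outstanding: 450.0},
		{ID: "A2", Active: true, FraudFlag: true},
	}))
}
```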
Architectural Considerations:
Since we need high availability with an Active-Active site setup, after going through multiple architectures we implemented the architecture below for our use-case:
1. The application is not exposed to the public Internet; only the internal workgroup can access this service.
2. From the internal workgroup, the traffic is routed to a proxy.
3. Based on the Active-Active setup, the traffic is routed to an appropriate node.
4. Data is retrieved from the KV store as well as from Postgres, and the App Server node does the computation.
5. The node prepares the similarity scores, and the response is sent back to the user.
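For illustration, here is a minimal sketch of the routing idea in steps 2 and 3, using Go's standard reverse proxy with simple round-robin standing in for the real proxy layer; the node URLs and port are assumptions, and the production setup would add health checks on top of this:

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"sync/atomic"
)

// nodes are the two Active-Active app-server sites;
// the URLs are illustrative assumptions.
var nodes = []*url.URL{
	mustParse("http://site-a.internal:8080"),
	mustParse("http://site-b.internal:8080"),
}

var counter uint64

func mustParse(raw string) *url.URL {
	u, err := url.Parse(raw)
	if err != nil {
		panic(err)
	}
	return u
}

func main() {
	// Round-robin across the two sites for each incoming request.
	proxy := &httputil.ReverseProxy{
		Director: func(r *http.Request) {
			target := nodes[atomic.AddUint64(&counter, 1)%uint64(len(nodes))]
			r.URL.Scheme = target.Scheme
			r.URL.Host = target.Host
		},
	}
	log.Fatal(http.ListenAndServe(":9090", proxy))
}
```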
As part of the program, the following were our learnings:
1) Since the DB store size was high for certain two-digit Pincode prefixes, we wrote our own custom wrapper, inspired by the technical papers on the Dgraph blog, and prepared a clustering framework for Badger DB, as sketched below.
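A minimal sketch of the clustering idea, keeping one Badger store per two-digit Pincode prefix; the store paths and the open-on-demand strategy are assumptions for illustration:

```go
package main

import (
	"fmt"

	badger "github.com/dgraph-io/badger/v4"
)

// shardedStore keeps one Badger store per two-digit Pincode prefix.
type shardedStore struct {
	shards map[string]*badger.DB
}

// shardFor opens (or returns) the store holding addresses whose
// PIN codes start with the given two digits.
func (s *shardedStore) shardFor(pincode string) (*badger.DB, error) {
	if len(pincode) < 2 {
		return nil, fmt.Errorf("invalid pincode %q", pincode)
	}
	prefix := pincode[:2]
	if db, ok := s.shards[prefix]; ok {
		return db, nil
	}
	db, err := badger.Open(badger.DefaultOptions("/data/dedupe/" + prefix))
	if err != nil {
		return nil, err
	}
	s.shards[prefix] = db
	return db, nil
}

func main() {
	store := &shardedStore{shards: map[string]*badger.DB{}}
	db, err := store.shardFor("110001") // Delhi PIN -> shard "11"
	if err != nil {
		panic(err)
	}
	defer db.Close()
	fmt.Println("opened shard for prefix 11")
}
```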
Below is the architecture that is currently followed:
Business Impact:
As soon as the system went online, during the initial testing phase on production data, the business was able to identify fraudulent customers, something that was not possible in the old system.
In addition, the engineering division provided additional inputs to the business, which made it easy to flag those customer records and check whether they were genuine or not.
This is how we were able to create a smart fraud management system that also addresses deduplication.