Solving the conundrum of identifying lenders in a bank statement

Kumar Tanmay
May 1, 2019 · 6 min read

If you were a wealth manager or a lender analysing hundreds and thousands of narrations in bank statements every month, which data points would interest you most?

  • bank transfers
  • ordinary financing
  • tax payments
  • income and spending patterns
  • recurring transactions, and
  • days of non-sufficient funds (NSF)

All play a vital role in assessing the financial health of your customers.

Identifying lenders in a bank statement is just one small element of the large tapestry that is bank statement analysis. Before any analysis can begin, the bank statements must be machine-readable. Although a PDF or a scanned statement is an electronic file, such files are not machine-readable. Once such files are parsed, they must be reconciled and checked for fraud. In this blog post, we’ll focus on identifying the lenders in a bank statement and on what makes this difficult for machines.

Identifying lenders in a bank statement

One of the most sought-after data points from a bank statement analysis is the set of lenders and loans running in the account. This is where competing lender activity must be identified in order to predict net excess cash flow and ensure sufficient funds. It empowers wealth managers and lenders to answer the holy grail of all questions:

“Will my customer have enough money to support withdrawal of $$$?”

One of the quicker ways to identify lenders is through text analysis. However, this is easier said than done. Many algorithms also use time series analysis of cash flow, i.e. they ask whether a transaction is part of a pattern of transactions.
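The pattern question can be sketched in a few lines. The example below is a minimal illustration, not a production algorithm: it groups debits by amount and flags amounts that recur in several distinct months on roughly the same day of the month. The tuple layout, the thresholds `min_occurrences` and `day_tolerance`, and the sample dates are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import date

def recurring_debits(transactions, min_occurrences=3, day_tolerance=2):
    """Return debit amounts that repeat on about the same day of the month.

    `transactions` is an iterable of (date, amount_in_paise, kind) tuples,
    where kind is "debit" or "credit". This layout is hypothetical.
    """
    by_amount = defaultdict(list)
    for when, amount, kind in transactions:
        if kind == "debit":
            by_amount[amount].append(when)

    recurring = {}
    for amount, dates in by_amount.items():
        months = {(d.year, d.month) for d in dates}
        days = [d.day for d in dates]
        # Require several distinct months and a narrow spread in day-of-month.
        if len(months) >= min_occurrences and max(days) - min(days) <= day_tolerance:
            recurring[amount] = sorted(dates)
    return recurring

# Illustrative data only; amounts are integers in paise to avoid float keys.
sample = [
    (date(2019, 1, 5), 248100, "debit"),
    (date(2019, 2, 5), 248100, "debit"),
    (date(2019, 3, 5), 248100, "debit"),
    (date(2019, 2, 17), 90000, "debit"),
    (date(2019, 3, 1), 5000000, "credit"),
]
result = recurring_debits(sample)
print(result)
```

A check like this only says that a debit recurs; it does not say who the counterparty is or whether the debit is a loan instalment. It also ignores debits that straddle a month end (for example the 31st and the 1st), which a real implementation would need to handle.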

One of the easier methods to identify lending activity in a bank statement is to identify auto-debits and then perform text analysis on the party involved to identify whether it is a lender. However, we cannot rely on it. There are other issues and challenges as described below:

  • Truncation/Deletion: Names can be truncated or dropped altogether. e.g. The name of the party deducting the EMI is completely deleted from this narration. Who is the lender?
Image for post

NOTE: The narrations and the corresponding amounts are shown only to support the statements made here. The narrations have been handpicked from different statements.

  • Concatenation: Words can be concatenated arbitrarily. Algorithms need to be trained to identify lenders from umpteen combinations of concatenated words.
Image for post
The narration could be simplified if it was RBL only instead of concatenating with ‘Retail Asset Dept Of’
  • Ambiguous context: In some cases, lexical analysis is not enough. A semantic approach is required. What meaning do you derive from Bajaj Finance crediting an amount on multiple occasions?
Image for post

We have another example below. It’s understandable that the first transaction corresponding to LIC Housing is a loan, but how do you make sense of the withdrawal of Rs.2,481.00 on the 5th of every month? Is it an insurance premium or an LIC EMI?

Image for post
  • Name mangling: In some cases, names are arbitrarily concatenated and abbreviated to create text that is recognisable for humans but less so for machines. e.g. This is an ACH-mandated auto-debit for RBL Bank, but the narration of Retail Asset Department of RBL Bank is concatenated and abbreviated.
Image for post
  • Computational explosion: Another prevalent case is that the lender’s name can begin on any character of the narration line. Because of concatenation, algorithms must be trained to scan for word boundaries rather than rely on spaces. This implies a very expensive comparison of all permutations and combinations of each lender name at each character position on the line.
Image for post
  • Disbursement of personal loan: Many organisations lend to their sister organisations and offer emergency loans to their employees during bad times. Although they are not competing lenders, such loans do affect the repaying capacity of the merchant.
Image for post
Instances of disbursement of loan in different bank statements
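The concatenation and name-mangling problems above can be partly softened by normalising the narration before matching. The sketch below is a minimal illustration under stated assumptions: the alias table `LENDER_ALIASES`, its abbreviations, and the sample narrations are hypothetical, and plain substring search stands in for whatever matching technique a real system would use.

```python
import re

# Hypothetical alias table: normalised alias -> canonical lender name.
LENDER_ALIASES = {
    "BAJAJFINANCE": "Bajaj Finance",
    "BAJAJFIN": "Bajaj Finance",
    "LICHOUSING": "LIC Housing",
    "LICHSG": "LIC Housing",
    "RBLBANK": "RBL Bank",
    "RBL": "RBL Bank",
}

def normalise(narration):
    """Uppercase the narration and drop everything except letters and digits."""
    return re.sub(r"[^A-Z0-9]", "", narration.upper())

def find_lenders(narration):
    """Return canonical lender names whose alias occurs anywhere in the narration."""
    text = normalise(narration)
    hits = []
    # Check longer aliases first so the most specific alias is seen first.
    for alias in sorted(LENDER_ALIASES, key=len, reverse=True):
        name = LENDER_ALIASES[alias]
        if alias in text and name not in hits:
            hits.append(name)
    return hits

# Made-up narrations for illustration.
for narration in ["BAJAJFINEMI", "BAJAJ FINANCE EMI", "ACH D- RETAILASSETDEPTOFRBL-123"]:
    print(narration, find_lenders(narration))
```

Because spaces and punctuation are removed, BAJAJFINEMI and BAJAJ FINANCE EMI resolve to the same lender. The trade-off is over-matching: a short alias such as RBL would also fire on an unrelated word like MARBLE, which is one reason lexical matching alone is not enough. With a long lender list, checking every alias against every narration becomes expensive, and a multi-pattern algorithm such as Aho-Corasick is one option worth considering.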

How has a Dubai-based bank solved it?

The problem of identifying a lending activity could be simplified if transactions were also expressed as codes. e.g. 714 is for Online local fund transfer while 985 is for fund transfer charges in the narration below.

Image for post
Transaction code in a bank statement from Dubai
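When a statement carries such codes, categorisation reduces to a table lookup. The sketch below assumes the code appears as a standalone three-digit token in the narration; that position, the function name `describe_by_code`, and the sample narrations are hypothetical. Only the two codes mentioned above are taken from the example; a real table would come from the bank.

```python
import re

# The two codes come from the example narration; the layout is an assumption.
TRANSACTION_CODES = {
    "714": "Online local fund transfer",
    "985": "Fund transfer charges",
}

def describe_by_code(narration):
    """Return the description of the first known code in the narration, else None."""
    # Only standalone three-digit tokens are considered, so longer numbers
    # such as account numbers are not mistaken for codes.
    for token in re.findall(r"\b\d{3}\b", narration):
        if token in TRANSACTION_CODES:
            return TRANSACTION_CODES[token]
    return None

print(describe_by_code("714 TRANSFER TO AC 00123"))
print(describe_by_code("POS PURCHASE"))
```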

This problem might be tractable on a real-time basis, because a single bank statement contains only a limited amount of lending activity. But wealth managers and lenders analyse thousands of lines of transactions daily. It is impossible to analyse millions of lines of transactions every month using a primitive eyeballing approach, or even by running a crawler against a table containing a list of all lenders in the country. Don’t you think this approach would require trillions of string comparisons and humongous compute time, even in an optimistic scenario?

The name matching technology is one part of the problem. How to apply it is another part. This is why name identification in a narration is such a big problem. At Inkredo, we are combining computational elements that range from semantics and machine learning all the way to big data processing.

How is Inkredo approaching the truth?

  1. Keyword Mapping: Having a mapping of keywords and categories is a good starting point. POS can be mapped to the Shopping category while ATW can be mapped to the ATM category. Using keyword matching, we can identify the lender, for example a housing loan sanctioned by LIC. But this approach also has its drawbacks. Narrations don’t always appear this neatly. They can be abruptly abbreviated and concatenated and can appear like BAJAJFINEMI, in which case no keyword would be found. And how do you know whether BAJAJFINEMI is the same as BAJAJ FINEMI and BAJAJ FINANCE EMI?
  2. Transaction Type: Categorising solely on the basis of narration may not cover all the cases. Type of transaction can also be introduced in the categorisation process. A keyword may have different meanings when it is a part of a credit, debit or default transaction.
  3. Prioritisation: A narration can contain keywords that may belong to two different categories. In that case, how do you decide which category should be chosen? You can assign priorities.
  4. Pattern Recognition: A single instance of a transaction may not convey much, but a pattern can convey a lot. Let’s take the example of Salary transactions. If the narration contains a keyword like SALARY, SAL or SLRY, then you can categorise it with your keyword matching algorithm, but this is usually not the case. Salary credits often appear like one of the following:
Image for post

Note that there is no instance of a valid keyword. In such cases, you can look for a pattern. Multiple NEFT credits from the same involved party point to a potential source of earnings and can be categorised as Salary transactions even when no such keyword is present in the narration.
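The four steps above can be sketched together. This is a minimal illustration, not Inkredo's implementation: the rule table `RULES`, its priorities, the category names other than Shopping, ATM and Salary, and the sample narrations are all hypothetical. Each rule pairs a keyword with a transaction type, the highest priority wins when several rules match, and a simple count of NEFT credits per party acts as the pattern fallback.

```python
from collections import Counter

# Hypothetical rules: (keyword, transaction type or None for any, category, priority).
# A higher priority wins when several rules match the same narration.
RULES = [
    ("POS", "debit", "Shopping", 1),
    ("ATW", "debit", "ATM", 1),
    ("EMI", "debit", "Loan repayment", 3),
    ("LIC", "debit", "Insurance", 2),
    ("SALARY", "credit", "Salary", 3),
    ("SAL", "credit", "Salary", 2),
    ("SLRY", "credit", "Salary", 3),
]

def categorise(narration, txn_type):
    """Pick the highest-priority rule whose keyword is a token of the narration."""
    tokens = set(narration.upper().replace("/", " ").replace("-", " ").split())
    matches = [
        (priority, category)
        for keyword, rule_type, category, priority in RULES
        if keyword in tokens and rule_type in (None, txn_type)
    ]
    return max(matches)[1] if matches else "Uncategorised"

def salary_like_parties(credits, min_count=3):
    """Return parties with at least `min_count` NEFT credits.

    `credits` is a list of (party, narration) pairs; extracting the party
    from the narration is assumed to have happened elsewhere.
    """
    counts = Counter(party for party, narration in credits if "NEFT" in narration.upper())
    return {party for party, n in counts.items() if n >= min_count}

print(categorise("LIC HOUSING EMI", "debit"))
print(categorise("BAJAJFINEMI", "debit"))
credits = [("ACME PVT LTD", "NEFT-ACME PVT LTD")] * 3 + [("JOHN", "NEFT-JOHN")]
print(salary_like_parties(credits))
```

In this sketch LIC HOUSING EMI matches both the LIC and the EMI rule, and the priority decides in favour of the loan repayment. BAJAJFINEMI stays uncategorised, which shows the drawback of token-based keyword matching described in step 1.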


If you are interested in solving problems around the movement of money across the world, then we’re hiring!

Thanks to Samkit Jain for reading the initial draft and contributing to it.

Zodhana

Honest products to accelerate finance
