Developing a bank statement analyser

How we achieved accuracy of over 90% after reading 800+ transactions

6 min readDec 14, 2017

At Inkredo, we perform flow-based credit assessment to determine the monthly repaying capacity of a customer. Our customers are small & underbanked retailers who are running a bootstrapped and consistently profitable business, yet they remain excluded from formal credit. Formal institutions have shied to lend to lower-middle income group because the cost-benefit analysis of lending and collections do not offset the cost of originating and recovery. There is no cost-effective measure to monitor income/solvency and ensure timely repayment.

The assessment involves calculating useful analysis from the bank statement of our customer. This task requires copy pasting every transaction from the PDF of bank statement (containing tens of pages with hundreds of rows) to an Excel file, cleaning the copied data, and then using Excel wizardry to perform some statistical operations. Imagine using Ctrl+C and Ctrl+V almost a thousand times every other day. As you might have guessed, this involves a lot of human interaction and typically takes us a day to complete a single bank statement.

**When you just want to copy a row but it selects the whole column**

With our growing user base, a solution was required to reduce the effort and time required. A smart solution to generate insights within seconds with minimal human interaction.

Input

Bank statement in PDF

Output

Sources of earning
Merchant transactions
Operational expenses
Recurring transactions
Debt (if any)
Default (if any)

Process

The Research

Before starting with the development, we tackled the problem manually. For each bank statement (from various banks), read all the transactions, highlighted keywords and assigned appropriate labels and categories to each. Then with the generated mapping created a set of keywords for each category with priorities assigned.

The dataset

A bank statement containing transactions from over six months of a person running a business is usually more than 20 pages long with around 1,000 transactions. Columns are generally of date, particular, balance, deposit, withdrawal, etc. For a specific bank, the result is pretty consistent and easy to play with, but every bank has its format for bank statements. Count of columns, positioning of columns, separators, text format and abbreviations vary.

The columns we require are

Date of transaction
Particulars
Deposit amount
Withdrawal amount
Closing balance
Cheque/Reference number

These columns are found in every bank statement. Example,

HDFC

IDBI

ICICI

The naming convention might be different, but the purpose of every column remains the same.

Created a dictionary called BANK_DETAILS that contains the position of the required column. Example,

Reading the bank statement

Reading tables from PDF documents is not an easy task. Even copying data from tables doesn’t work properly most of the time. Thankfully, there’s an open-source library available called tabula that can extract tables from a PDF with almost accurate results. We used its Python wrapper tabula-py for the data extraction.

Making the extracted data consistent

Table header from ICICI bank statement

Every page with transactions table of the ICICI bank statement consists of this header row. This row is useless for the system as we are only targeting transactions.

Aim: Remove header rows from the list of transactions.

Solution: From reading multiple transactions from numerous bank statements we realised that the closing balance column is always the last. So, a header can be considered as rows (why plural? see next task) starting from the first row till the row where closing balance is not null. Then, go through all the rows and if the row is a part of headers, remove it. In the end, we have rows without any header.

Transaction from an HDFC bank statement without the sensitive information.

In the image, we can see that particular can be in multiple lines but belong to the same row. Tabula cannot differentiate whether multiple lines in a row belong to the same row. It will treat them as multiple rows, and as a result, we get the following output:

# First line read from HDFC statement
['22/06/17', 'IMPS-7-RAHUL-HDFC-XXXXXXXX', 'XXXX7', '22/06/17', nan, '1,000.00', '14,904.08']# Second line read from HDFC statement
[nan, '8-XXXX', nan, nan, nan, nan, nan]

Aim: Convert the same particular from multiple rows into one.

Solution: The first line of every entry contains particular, date, balance, transaction amount and cheque number. Only the particulars can be multiline. So, a multiline particular can be between two date entries.

Credit, debit and default

As we saw in the columns of various bank statements, the differentiation between credit, debit and default is based on whether the entry is in deposit column or withdrawal column or in some cases whether mentioned as CR/DR.

Aim: Classify every transaction as credit, debit or default.

Solution: Classify all deposits as credit and withdrawals as debit. An event of default is defined when a withdrawal leads to negative closing balance and then immediately followed by a deposit of the same amount.

Example of a default transaction

Categorising transactions

To perform analysis on the bank transactions, we need to categorise every bank transaction. Categorizing enables us to perform category specific operations and answer questions such as “how much does he spend on operations?” or “what are the different channels of earning?”. A category can be ATM, Shopping, IMPS, NEFT, etc.

Aim: Categorise every transaction.

Solution: For every transaction, tokenise the particular and based on the occurrence and position of keywords assign a category.

Cashflow analysis

Now that we have read, cleaned and categorised transactions from the bank statement, it’s time to generate some insights. After all, what’s data without information?

Cashflow analysis helps in

Analyzing the spending, buying and saving behaviour of the user
Checking whether the user is doing any side business
Calculating growth of the business
Checking whether user has any running loans and their payment status
Calculating repaying capacity of the user
Analyzing recurring transactions

Overall analysis

This analysis gives an overall view of the total number and amount of credits, debits and defaults in the bank statement. Also contains a categorical breakdown of cash and non-cash transactions.

Monthly analysis

This analysis is a month-wise breakdown of the overall analysis of the bank statement. Helps in calculating the growth of the business.

Defaults analysis

This analysis shows the total number and amount of defaults in the bank statement along with the details of every default.

Recurring transactions

To understand the spending behaviour of the user we need to know the most common transactions. To answer questions like, “Are there multiple NEFT transactions to/from the same person/company?”, “Is he an IRCTC agent?” etc., We used Ratcliff-Obershelp algorithm to club similar transactions with more than 85% similarity. For better results, removed numbers, special characters from the strings.

Note: Code snippets mentioned above are pseudocodes to demonstrate the idea and may not contain all the edge cases.