Analysing and Classifying Bank Transactions

Ignacio Funke
Published in Capchase Tech · 5 min read · Mar 1, 2021

Transactional data is a rich source of information. At Capchase we analyse bank transactions to better understand businesses and make informed decisions. For example, from bank transactions we can get a picture of a business's volume of revenue and operational expenses. These, in turn, enable us to compute metrics such as its cash burn rate and to estimate its runway.

However, in order to extract all this information from the data, the transactions need to be labelled appropriately, so that we can separate revenue from expenses, for example. While this is an important task, it is also a very tedious one, and we have been experimenting with a tool to help us label transactions: a transaction classifier.

In the first instance, the classifier draws on our observations from the process of labelling transactional data. For example, we have observed that investment inflows, such as equity rounds, usually have very high amounts, abnormally high compared to other types of inflows. In statistical terms these are known as outliers. In addition, investment inflows are usually not recurring, whereas client revenue often is. Finally, certain keywords that appear in a transaction, especially company names, are very indicative of the type of transaction.

To feed these indicators into a classifier we need to somehow quantify them. For outliers there are many statistical approaches, such as the interquartile range rule, where an amount that is sufficiently far away from most of the observed amounts is considered abnormal. For company names appearing in a transaction, we can resort to a variety of string matching procedures. For recurring transactions, we need to spot transactions whose descriptions are very similar to each other.
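As a minimal sketch of the outlier side, the interquartile range rule can be written in a few lines of NumPy. The sample amounts and the 1.5 multiplier below are only illustrative defaults, not our production settings.

```python
import numpy as np

def iqr_outliers(amounts, k=1.5):
    """Flag amounts lying more than k * IQR above the third quartile."""
    amounts = np.asarray(amounts, dtype=float)
    q1, q3 = np.percentile(amounts, [25, 75])
    upper_fence = q3 + k * (q3 - q1)
    return amounts > upper_fence

# A handful of ordinary monthly inflows plus one abnormally large one
inflows = [12_000, 11_500, 13_200, 12_800, 2_000_000]
print(iqr_outliers(inflows))  # [False False False False  True]
```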

An important concept arises: that of similar transactions. There are many ways to measure the similarity of two transactions, one of which is the Jaccard index, widely used in computer science. This index is defined as the ratio of words in common to the total number of distinct words across both transactions. For example, the Jaccard index for “A bank transaction description example” and “A bank transaction description a bit longer” would be 4/8, a 50% similarity (treating “A” and “a” as distinct words).
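As an illustration (not necessarily the exact implementation we use), the Jaccard index on whitespace-separated words can be computed like this; keeping tokens case-sensitive reproduces the 4/8 figure above.

```python
def jaccard(a: str, b: str) -> float:
    """Words in common divided by the distinct words across both descriptions."""
    words_a, words_b = set(a.split()), set(b.split())
    if not (words_a or words_b):
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)

print(jaccard("A bank transaction description example",
              "A bank transaction description a bit longer"))  # 0.5
```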

Once we know what similar transactions are, we can find groups of similar transactions. Forming groups of entries in a dataset is known as clustering. In the case of grouping text, a well-known technique is hierarchical agglomerative clustering. Each transaction begins in its own group and, step by step, the two most similar groups of transactions are merged. The process continues until no groups are sufficiently similar to be merged.
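A sketch of this idea with SciPy: pairwise Jaccard distances (one minus the index) are fed to average-linkage hierarchical clustering, and merging stops at a distance threshold. The descriptions and the 0.6 threshold are invented for illustration.

```python
from itertools import combinations

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

descriptions = [
    "ACME SAAS INVOICE 0231",
    "ACME SAAS INVOICE 0232",
    "STRIPE PAYOUT MARCH",
    "STRIPE PAYOUT APRIL",
    "SERIES A WIRE GLOBAL VENTURES",
]

def jaccard_distance(a: str, b: str) -> float:
    wa, wb = set(a.split()), set(b.split())
    return 1.0 - len(wa & wb) / len(wa | wb)

# Pairwise distance matrix between all transaction descriptions
n = len(descriptions)
dist = np.zeros((n, n))
for i, j in combinations(range(n), 2):
    dist[i, j] = dist[j, i] = jaccard_distance(descriptions[i], descriptions[j])

# Agglomerative clustering: keep merging the two closest groups
# (average linkage) until no pair is closer than the threshold.
tree = linkage(squareform(dist), method="average")
labels = fcluster(tree, t=0.6, criterion="distance")
print(labels)  # similar descriptions end up with the same cluster label
```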

Transaction descriptions are usually short pieces of text, so they contain little information. One approach to address this shortcoming is to enrich the transaction with external information. For instance, if we have a database of company names, and a transaction contains a company name in this database, then we can extend the transaction with the company's industry or sector, which could help the classification task. One approach to match companies in transactions is simple string matching. If we want to account for partial or abbreviated company names, then another approach is fuzzy matching.
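One possible way to do the fuzzy variant, sketched with the standard library's difflib; the company table and the 0.6 threshold are hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical reference table: company name -> industry
company_industries = {
    "ACME SOFTWARE INC": "software",
    "GLOBAL VENTURES PARTNERS": "venture capital",
}

def best_fuzzy_match(description: str, companies, threshold: float = 0.6):
    """Return the most similar company name, or None below the threshold."""
    best_name, best_score = None, 0.0
    for name in companies:
        score = SequenceMatcher(None, description.upper(), name).ratio()
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None

match = best_fuzzy_match("WIRE ACME SOFTWARE", company_industries)
if match:
    print(match, "->", company_industries[match])  # enrich with the industry
```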

In our case, we stumbled upon a limitation when trying to match against a dataset of around half a million companies: fuzzy matching was too slow. Using the so-called vector space model, transactions and company names can be transformed into vectors of 1s and 0s, where a 1 indicates the occurrence of a particular word, and grouped into a matrix of transactions and a matrix of companies. With this encoding, fuzzy matching of words can be done with matrix multiplication, and fuzzy matching at character level can be done by encoding groups of characters instead of whole words. The matrix of companies can be precomputed and, since matrix multiplication is a highly optimised operation, this arrangement yields much higher throughput.
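A rough sketch of that arrangement with scikit-learn (the names below are invented, and real code would also normalise the scores): company names and transactions are encoded as binary vectors of character trigrams, the company matrix is built once, and matching every transaction against every company becomes a single sparse matrix multiplication.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical reference list; in practice this holds ~500k company names.
companies = ["ACME SOFTWARE INC", "GLOBAL VENTURES PARTNERS", "CAPCHASE"]
transactions = ["WIRE ACME SOFTW 0231", "SERIES A GLOBAL VENTURES"]

# Encode character trigrams rather than whole words, so abbreviated or
# partial names still share features with the full company name.
vectorizer = CountVectorizer(analyzer="char_wb", ngram_range=(3, 3), binary=True)
company_matrix = vectorizer.fit_transform(companies)      # precomputed once
transaction_matrix = vectorizer.transform(transactions)

# Fuzzy matching as a single sparse matrix multiplication: entry (i, j)
# counts the character trigrams shared by transaction i and company j.
shared = transaction_matrix @ company_matrix.T
best = shared.toarray().argmax(axis=1)
for tx, idx in zip(transactions, best):
    print(tx, "->", companies[idx])
```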

Whether a transaction amount is an outlier, whether the transaction is recurring (and even for how many months it repeats), the industry of the company that appears in the transaction, or the occurrence of certain keywords are all examples of features that can serve as inputs to a classification model for bank transactions. If there are many of these features, for example if many of them are keywords, then it is important to resort to feature selection techniques. These reduce the number of features to a subset that is still discriminative enough. Examples are mutual information and the chi-square test of independence. The former measures how much information a keyword or feature provides about the categories in the training dataset. The latter tests whether a keyword or feature occurs independently of the observed categories. Both methods therefore offer insight into which features are helpful and which are not.
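Both criteria can be tried directly in scikit-learn. The tiny feature matrix below is made up (columns such as an outlier flag or a keyword indicator) purely to show the mechanics.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif

# Invented feature matrix: one row per transaction, one column per feature
# (e.g. is_outlier, is_recurring, has "aws" keyword, has "stripe" keyword).
X = np.array([
    [1, 0, 0, 0],
    [0, 1, 0, 1],
    [0, 1, 1, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 1],
    [0, 1, 1, 0],
])
y = ["investment", "revenue", "expense",
     "investment", "revenue", "expense"]

# Keep only the k features that best discriminate the categories,
# scored here with the chi-square test of independence.
selector = SelectKBest(score_func=chi2, k=2)
X_reduced = selector.fit_transform(X, y)
print(selector.get_support())  # boolean mask of the surviving features

# Mutual information offers an alternative ranking of the same features.
print(mutual_info_classif(X, y, discrete_features=True))
```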

Once a set of features has been selected, they can be passed as inputs to a classification model. Typical examples include support-vector machines, random forests and k-nearest neighbours. On a training set of transactions labelled by category, these models learn statistical associations between the selected input features and the categories. Some of these models also provide a natural confidence score, indicating the strength of the evidence behind the category chosen for a new transaction, which is useful for prioritising which transactions deserve more attention. The result is a tool that can help with the tedious task of labelling new transactions.
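To make that final step concrete, here is a hedged sketch with a random forest on the same kind of invented features as above; the predicted class probabilities double as the confidence score mentioned in the text.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Invented labelled training data with the feature layout used above:
# [is_outlier, is_recurring, has_expense_keyword, has_revenue_keyword]
X_train = np.array([
    [1, 0, 0, 0],
    [0, 1, 0, 1],
    [0, 1, 1, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 1],
    [0, 1, 1, 0],
])
y_train = ["investment", "revenue", "expense",
           "investment", "revenue", "expense"]

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)

# Score new, unlabelled transactions; low-confidence predictions are the
# ones that most deserve a human review.
X_new = np.array([[0, 1, 1, 0],
                  [1, 0, 0, 1]])
for category, probs in zip(clf.predict(X_new), clf.predict_proba(X_new)):
    print(category, "confidence:", round(probs.max(), 2))
```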
