Streamlining transaction categorisation at scale — Part 2

Jul 16, 2024

In the first part of this series, we described the user-centric approach Cheddar adopted to enhance its Personal Finance Manager (PFM): understanding user needs and designing a transaction categorisation system that maps transaction types to spending categories in a way that is intuitive and useful. In this second part, we cover data cleaning and mappings: extracting merchant names, linking them to common retailers, and mapping merchant category codes to spending categories so that transactions are categorised accurately and efficiently.

Part 2 — Data cleaning and mappings

1. Merchant Name Extraction

Developing our merchant name extraction feature was a collaborative effort between our engineers and data scientists. The data scientists analysed card payments and direct debits across various banks, identifying distinct patterns from which a merchant name can be extracted when one is available. For example, some banks record a card payment as “CARD PAYMENT TO PAYPAL * TESCO,” while others report it as “Paypal * Tesco - London, GBP.” In both cases, our system identifies the common element and extracts the merchant name: “PayPal * Tesco.” Similarly, different banks may represent a direct debit as “Direct Debit to The Trainline” or “DD - Trainline Ref 8938493,” and our unified approach extracts the same merchant name, “The Trainline.” This gives users a standardised experience across banking platforms.
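As a rough illustration of this kind of pattern-based extraction, here is a minimal Python sketch. The prefix and suffix patterns are assumptions inferred from the four example strings above, not the production rules, and the function name `extract_merchant_name` is hypothetical. Dashes are matched loosely because banks (and text editors) may use a hyphen, an en dash, or an em dash.

```python
import re

# Assumed bank-specific prefixes, inferred from the examples in the text.
_PREFIX = re.compile(
    r"^(?:card payment to|direct debit to|dd\s*[-\u2013\u2014])\s*",
    re.IGNORECASE,
)

# Assumed trailing noise: " - London, GBP" style locations or " Ref 12345".
_SUFFIX = re.compile(
    r"(?:\s+[-\u2013\u2014]\s+.*|\s+ref\s+\d+)$",
    re.IGNORECASE,
)


def extract_merchant_name(description: str) -> str:
    """Strip known prefixes and trailing details from a raw description."""
    name = _PREFIX.sub("", description.strip())
    name = _SUFFIX.sub("", name)
    return name.strip()


if __name__ == "__main__":
    for raw in (
        "CARD PAYMENT TO PAYPAL * TESCO",
        "Paypal * Tesco - London, GBP",
        "Direct Debit to The Trainline",
        "DD - Trainline Ref 8938493",
    ):
        print(repr(extract_merchant_name(raw)))
    # 'PAYPAL * TESCO'
    # 'Paypal * Tesco'
    # 'The Trainline'
    # 'Trainline'
```

Note that this sketch leaves letter case untouched and returns “Trainline” for the second direct debit format. Reconciling such variants under one name is the job of the retailer mapping step described later.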

In open banking, card payments must include a field with the merchant name. However, not all banks adhere to this. Some omit the merchant name field and instead place the information in the transaction details or in another proprietary field, often after a phrase such as “card payment to,” as in the example above. A further challenge is that the merchant name sometimes comes with a city name, address, or purchase date attached. For instance, some banks return “Wagamama Limited 229 London GBR, Transaction Date: 2024-02-02” in their transaction details. Our task was to extract “Wagamama Limited” from that field and assign it as the merchant name.
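The fallback logic described here could look something like the sketch below. The transaction is modelled as a plain dictionary, and the field names `merchant_name` and `details` are assumptions for illustration, not real open banking field names. The address heuristic (a number followed by text ending in a three-letter upper-case country code) is deliberately naive and would wrongly trim a merchant whose own name contains a number.

```python
import re
from typing import Optional

_PREFIX = re.compile(r"^card payment to\s+", re.IGNORECASE)
# Trailing ", Transaction Date: ..." fragment.
_DATE_SUFFIX = re.compile(r",?\s*transaction date:.*$", re.IGNORECASE)
# Naive address heuristic: " 229 London GBR" (number ... 3-letter country code).
_ADDRESS_SUFFIX = re.compile(r"\s+\d+\s+.*\b[A-Z]{3}$")


def merchant_from_transaction(txn: dict) -> Optional[str]:
    """Prefer the dedicated field; otherwise clean up the details field."""
    name = txn.get("merchant_name")
    if name and name.strip():
        return name.strip()

    details = txn.get("details")
    if not details:
        return None

    text = _PREFIX.sub("", details.strip())
    text = _DATE_SUFFIX.sub("", text)
    text = _ADDRESS_SUFFIX.sub("", text)
    return text.strip() or None


if __name__ == "__main__":
    txn = {
        "details": "Wagamama Limited 229 London GBR, Transaction Date: 2024-02-02"
    }
    print(merchant_from_transaction(txn))  # Wagamama Limited
```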

2. Retailer Mapping

The data scientists within our engineering team led the Retailer Mapping initiative, studying examples across various banks to improve recognition of the retailers our customers use most, such as Just Eat or Amazon. They found that a single retailer can appear under many merchant names, including “Just Eat,” “JUST EAT,” “Paypal * Justeat,” or “CRV * Justeat.” To handle this, they wrote regular expressions that identify the different variants of each retailer’s name. Given the effort involved, the team limited this work to a select group of highly frequented retailers. At the end of this step, all the different merchant names for Just Eat are consolidated into a single retailer.
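A minimal sketch of such a retailer table in Python is shown below. The patterns are illustrative guesses built from the variants listed above, and the names `RETAILER_PATTERNS` and `map_retailer` are hypothetical.

```python
import re
from typing import Optional

# Optional leading token followed by "*", e.g. "Paypal * " or "CRV * ".
_PROVIDER = r"(?:\w+\s*\*\s*)?"

# Canonical retailer name paired with an illustrative pattern.
RETAILER_PATTERNS = [
    ("Just Eat", re.compile(r"^" + _PROVIDER + r"just\s?eat\b", re.IGNORECASE)),
    ("Amazon", re.compile(r"^" + _PROVIDER + r"amazon\b", re.IGNORECASE)),
]


def map_retailer(merchant_name: str) -> Optional[str]:
    """Return the canonical retailer, or None if no pattern matches."""
    cleaned = merchant_name.strip()
    for retailer, pattern in RETAILER_PATTERNS:
        if pattern.search(cleaned):
            return retailer
    return None


if __name__ == "__main__":
    for raw in ("Just Eat", "JUST EAT", "Paypal * Justeat", "CRV * Justeat"):
        print(raw, "->", map_retailer(raw))  # each maps to "Just Eat"
```

Returning `None` for unmatched names keeps the less common retailers available for a later fallback step.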

This step proved crucial because, even after merchant names had been extracted from open banking transactions, significant variability remained. Each highly frequented retailer needed a tailored regular expression to match its name uniformly across banks. Variations included differences in letter case, parentheses, codes, and truncation length, with names cut off at anywhere from 15 to 18 characters or more depending on the bank.
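One way to cope with differing truncation lengths is to generate a pattern that accepts any prefix of the canonical name beyond a minimum length. The sketch below is an assumption about how this could be done, not a description of the team’s hand-written expressions. The helper names `compact` and `truncation_tolerant` are hypothetical, and `compact` only keeps ASCII letters and digits.

```python
import re


def compact(name: str) -> str:
    """Lower-case a name and keep only ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.casefold())


def truncation_tolerant(canonical: str, min_chars: int) -> re.Pattern:
    """Build a pattern matching the compacted name cut off at any point
    after the first `min_chars` characters."""
    key = compact(canonical)
    head, tail = key[:min_chars], key[min_chars:]
    optional = ""
    # Nest optional groups from the last character inward,
    # e.g. "line" becomes (?:l(?:i(?:n(?:e)?)?)?)?
    for ch in reversed(tail):
        optional = f"(?:{ch}{optional})?"
    return re.compile(re.escape(head) + optional + r"$")


if __name__ == "__main__":
    pattern = truncation_tolerant("The Trainline", min_chars=8)
    for raw in ("The Trainli", "THETRAINLINE", "The Trainlight", "The Tra"):
        print(raw, "->", bool(pattern.match(compact(raw))))
    # The Trainli -> True
    # THETRAINLINE -> True
    # The Trainlight -> False
    # The Tra -> False
```

Compacting both sides first removes case, spacing, and punctuation differences, so the pattern only has to deal with truncation.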

Additionally, payment providers introduced further variability. For instance, PayPal transactions for The Trainline often appeared as “PayPal * thetrainline” without spaces. Similarly, Buy Now Pay Later providers like Zilch or Klarna might truncate names, such as “Zilch * CRV The Trainli.” Despite these challenges, the regular expressions were designed to accurately identify and categorise these transactions under a unified retailer name, such as The Trainline, ensuring consistency. Finally, the data scientists collaborated with the product team led by Tariq Zaid to assign a category and subcategory to each of the frequently found retailers.
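To make the payment-provider cases concrete, here is a hedged sketch of a single tailored expression for The Trainline. The provider list and the treatment of “CRV” as an optional code are assumptions based only on the examples in this post, and the real expressions may look quite different.

```python
import re

# Optional provider prefix ("PayPal * ", "Zilch * ", "Klarna * "),
# then an optional "CRV" code, with or without its own "*".
_PROVIDER_PREFIX = r"(?:(?:paypal|zilch|klarna)\s*\*\s*)?(?:crv\s*\*?\s*)?"

# "The" is optional, spaces are optional, and the name may be cut
# off after "Trainli", "Trainlin", or "Trainline".
TRAINLINE = re.compile(
    r"^" + _PROVIDER_PREFIX + r"(?:the\s*)?trainli(?:ne?)?\b",
    re.IGNORECASE,
)


def is_trainline(merchant_name: str) -> bool:
    return TRAINLINE.search(merchant_name.strip()) is not None


if __name__ == "__main__":
    for raw in (
        "PayPal * thetrainline",
        "Zilch * CRV The Trainli",
        "The Trainline",
        "Trainlight Ltd",
    ):
        print(raw, "->", is_trainline(raw))
    # PayPal * thetrainline -> True
    # Zilch * CRV The Trainli -> True
    # The Trainline -> True
    # Trainlight Ltd -> False
```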

3. Merchant Category Code Mapping

To handle less common retailers, our data scientists took a pragmatic approach and examined the Merchant Category Code (MCC) attached to many transactions. This code, collaboratively defined by Visa and Mastercard, serves as a framework for categorising merchants that accept card payments. For instance, code 5411 signifies Grocery Stores and Supermarkets, 5611 denotes Men’s and Boys’ Clothing, and 3504 represents hotels affiliated with Hilton. By analysing the retailers linked to each MCC, our data scientists built a mapping that assigns a spend category to each MCC, so that transactions across a diverse array of merchants are categorised consistently. For example, transactions with code 5611 go to the “fashion” category with a subcategory of “clothing,” while those with code 3504 go to the “travel” category with a subcategory of “lodging.”
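In code, such a mapping can be as simple as a lookup table. The sketch below includes only the two MCC assignments stated above; the type name `SpendCategory` and the function `categorise_by_mcc` are hypothetical.

```python
from typing import NamedTuple, Optional, Union


class SpendCategory(NamedTuple):
    category: str
    subcategory: str


# Only the two assignments mentioned in the text; a real table would
# cover many more codes.
MCC_TO_CATEGORY = {
    "5611": SpendCategory("fashion", "clothing"),
    "3504": SpendCategory("travel", "lodging"),
}


def categorise_by_mcc(mcc: Union[str, int, None]) -> Optional[SpendCategory]:
    """Look up the spend category for an MCC, or None if it is unmapped."""
    if mcc is None:
        return None
    # MCCs are four digits; pad in case the code arrived as an integer.
    key = str(mcc).strip().zfill(4)
    return MCC_TO_CATEGORY.get(key)


if __name__ == "__main__":
    print(categorise_by_mcc("5611"))
    # SpendCategory(category='fashion', subcategory='clothing')
    print(categorise_by_mcc(3504))
    # SpendCategory(category='travel', subcategory='lodging')
    print(categorise_by_mcc(None))
    # None
```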

To assign categories to the merchant category codes, the data scientists again collaborated with the product team led by Tariq Zaid. The data scientists identified the most common retailers for each MCC and presented this information to the product team. After discussion, they defined a category for each MCC. This required specific knowledge of UK retailers, because the MCC standard was initially defined for the United States and does not always match how UK merchants operate. For example, charity shops in the US are seen more as donation centres, whereas in the UK some national parks or grocery shops are charity organisations. Should a lunch at a national park restaurant be categorised as a donation or as a restaurant expense? Questions like this required careful consideration and, in some cases, overriding the default MCC categorisation to fit the UK context accurately.
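The post does not say how these overrides were stored or which categories were finally chosen, so the following is only one possible shape: a small table of exceptions consulted before the default MCC mapping. Every value here is hypothetical, including the merchant name, the category labels, and the use of MCC 8398 (charitable and social service organisations), which is not mentioned in the text.

```python
from typing import Dict, Optional, Tuple

Category = Tuple[str, str]  # (category, subcategory)

# Hypothetical default reading of the MCC.
DEFAULT_MCC_CATEGORIES: Dict[str, Category] = {
    "8398": ("charity", "donations"),
}

# Hypothetical UK-specific exceptions keyed by (MCC, lower-cased merchant).
UK_OVERRIDES: Dict[Tuple[str, str], Category] = {
    ("8398", "example park cafe"): ("eating out", "restaurants"),
}


def categorise(mcc: str, merchant_name: str) -> Optional[Category]:
    """Apply a UK override if one exists, else fall back to the MCC default."""
    key = (mcc, merchant_name.strip().casefold())
    override = UK_OVERRIDES.get(key)
    if override is not None:
        return override
    return DEFAULT_MCC_CATEGORIES.get(mcc)


if __name__ == "__main__":
    print(categorise("8398", "Example Park Cafe"))  # ('eating out', 'restaurants')
    print(categorise("8398", "Some Charity Shop"))  # ('charity', 'donations')
```

Keeping the exceptions in their own table makes each UK-specific decision visible and easy to review with the product team.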

Coming Next

In the next part of this blog, we’ll discuss how we added Artificial Intelligence (AI) to enhance transaction categorisation. We’ll cover the use of numerical embeddings to handle transactions lacking MCCs, the development of a machine learning model to predict transaction categories, and the human testing process that ensured the accuracy and usability of our system.

Written by Mauriciotorob