How to Build a Machine Learning Service for Classifying Financial Transactions

Shawn Cao
Published in Nerd For Tech
5 min read · Jul 22, 2024

A Practical Guide to Using Data Science Tools like scikit-learn

Categorizing Transactions @ https://www.fina.money

Introduction

In today’s fast-paced world, efficiently managing finances is more crucial than ever. Whether you’re a small business owner, a freelancer, or someone who closely monitors personal finances, categorizing transactions can be a daunting task. Recognizing this challenge, I have developed a machine learning model to make transaction categorization effortless and lightning-fast.

I have made this API available to the public for free. Whether you’re building your own expense tracker, budgeting app, or categorizing transactions in a spreadsheet, this API can save you hours of labor.

Quick Glance — Test with Curl

curl -L -X POST \
-H "Content-Type: application/json" \
-H "x-api-key: fina-api-test" \
-d '["BOILING POINT- BELLEVUE"]' \
https://app.fina.money/api/resource/categorize

If you run this test in your command line, you will instantly receive the following result:

["restaurants & other"]

This API is incredibly simple: it takes a list of transaction names as input and returns a list of category names as output, formatted in JSON.

An equivalent TypeScript function signature would look like this:

declare function categorize(transactions: string[]): string[];
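If you would rather call the API from code than from curl, here is a minimal Python sketch. The `build_request` and `post_categorize` helpers are illustrative names, not part of the official API; the URL and test key come from the curl example above, and the live call requires the third-party `requests` library.

```python
import json

API_URL = "https://app.fina.money/api/resource/categorize"
API_KEY = "fina-api-test"  # public test key from the curl example

def build_request(transactions):
    """Assemble the URL, headers, and JSON body for a categorize call."""
    headers = {"Content-Type": "application/json", "x-api-key": API_KEY}
    body = json.dumps(transactions)
    return API_URL, headers, body

def post_categorize(transactions):
    """Perform the actual HTTP call (requires: pip install requests)."""
    import requests
    url, headers, body = build_request(transactions)
    resp = requests.post(url, headers=headers, data=body)
    resp.raise_for_status()
    return resp.json()
```

Calling `post_categorize(["BOILING POINT- BELLEVUE"])` should return `["restaurants & other"]`, matching the curl example.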

Why I developed this API

1. Simplifying financial management

Often, people use spreadsheets to process their transaction data for analysis or budgeting purposes, and they have to manually categorize each transaction one by one. This process can be time-consuming and prone to errors. Even most budget apps lack auto-categorization functions. Therefore, I developed a fast service to address this need.

2. Why not use an LLM?

While many large language models can perform this task adequately, they are often too slow for a data processing system that requires low latency, and both their response times and the structure of their output can be unreliable. In contrast, this service offers consistent latency and lightning-fast performance.

A rough benchmark: 1,000 transactions categorized in less than 1 second.
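If you want to reproduce that kind of measurement against your own deployment, a simple throughput harness might look like the sketch below. `categorize_batch` is a stand-in; swap in a real HTTP call to the endpoint you are benchmarking.

```python
import time

def categorize_batch(transactions):
    # Stand-in for a real call to the categorize endpoint;
    # replace with an HTTP request to your own deployment.
    return ["restaurants & other" for _ in transactions]

def measure_throughput(transactions):
    """Return (elapsed_seconds, transactions_per_second) for one batch."""
    start = time.perf_counter()
    categorize_batch(transactions)
    elapsed = time.perf_counter() - start
    return elapsed, len(transactions) / elapsed

if __name__ == "__main__":
    batch = ["BOILING POINT- BELLEVUE"] * 1000
    elapsed, tps = measure_throughput(batch)
    print(f"{elapsed:.3f}s for {len(batch)} transactions ({tps:.0f}/s)")
```

Note that this measures end-to-end wall-clock time, so for a remote endpoint it includes network latency, not just model inference.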

How the service was created

Now, let’s delve into the details of how this API and the underlying model were developed. You can follow this process to build your own version!

1. Training the ML model

The foundation of this API is a machine learning model trained on a vast dataset of transactions that have already been categorized by aggregators like Plaid.

By leveraging the sklearn Python library and preprocessing data with pandas, I trained a model using a CountVectorizer and RandomForestClassifier. For model serialization and deserialization, I chose joblib. In the future, I plan to improve accuracy by experimenting with different vectorizers and classifiers.

  • Import the libraries: pandas, sklearn, and joblib
import numpy as np
import pandas as pd
import json
from json import JSONEncoder

# importing sklearn libraries
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import CategoricalNB
from joblib import dump, load
  • Prepare the training dataset by pulling 24,000 transactions from various bank accounts and labeling them. The training dataset is stored in a CSV file with four columns: 1) Transaction Name, 2) Merchant Name, 3) Amount, and 4) Category.
Sample Rows of Training Data
  • Run the trainer and save the model
def train():
    # the training set is just transaction name -> category
    train_data = pd.read_csv('../data/cat_train.csv', encoding='utf-8',
                             names=['name', 'merchant', 'amount', 'category'], header=None)

    # train the classifier
    vectorizer = CountVectorizer(ngram_range=(1, 2), max_features=2500)

    # let's build a train column based on name, merchant and amount
    # train_col = train_data['name'] + ' ' + train_data['merchant'] + ' ' + train_data['amount'].map(lambda x: 'income' if x > 0 else 'expense')
    train_col = train_data['name']
    # normalize the names: lowercase and replace special characters with spaces
    names = train_col.str.lower().replace('[^a-z0-9]', ' ', regex=True)

    y_train = train_data['category']
    x_train = vectorizer.fit_transform(names.values.astype('U')).toarray()

    # train the model
    classifier = RandomForestClassifier(n_estimators=50, random_state=42)
    # classifier = CategoricalNB()
    classifier.fit(x_train, y_train)

    # save the vectorizer and classifier together as one bundle
    dump([vectorizer, classifier], '../data/categorizer.joblib')

Running this training process generates a model of roughly 500MB, which is saved to disk via the joblib library.
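Before deploying, it is worth sanity-checking that a saved bundle round-trips through joblib correctly. The sketch below does this on toy data; the names, labels, and temp path are illustrative stand-ins, not the real training set or the `../data/categorizer.joblib` artifact.

```python
import os
import tempfile
from joblib import dump, load
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier

# illustrative toy data, mirroring the [vectorizer, classifier] bundle format
names = ["boiling point bellevue", "shell oil 1234", "whole foods market"]
labels = ["restaurants & other", "gas", "groceries"]

vectorizer = CountVectorizer(ngram_range=(1, 2))
x = vectorizer.fit_transform(names).toarray()
classifier = RandomForestClassifier(n_estimators=10, random_state=42)
classifier.fit(x, labels)

# dump and reload the bundle, as the service will at startup
path = os.path.join(tempfile.mkdtemp(), "categorizer.joblib")
dump([vectorizer, classifier], path)
v2, c2 = load(path)

# the reloaded pair should produce a known category for a known input
pred = c2.predict(v2.transform(["whole foods market"]).toarray())
```

Saving the vectorizer alongside the classifier matters: the serving side must transform incoming text with the exact vocabulary the model was trained on.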

2. Serving the model

Next, we need to serve the model through a service endpoint that handles HTTP requests to fulfill the API specification. I chose the popular Flask framework and Waitress server for this purpose.

  • Import libraries to build the service: flask, waitress
import logging
from flask import Flask, jsonify, request
from waitress import serve

# import functions to serve api
from model.bank import classify
  • Serve the API endpoint
app = Flask(__name__)
log = logging.getLogger(__name__)

# categorize API accepts a json array of strings and returns a json array of strings
# test with curl:
# curl -L -X POST -H "Content-Type: application/json" -d '["ACH Debit FLAGSTAR BANK - LOAN PYMT", "Credit Dividend"]' http://localhost:9999/categorize
@app.route("/categorize/", methods=["POST"])
def categorize():
    # request.json is the array of transaction names to classify
    return jsonify(classify(request.json))

if __name__ == "__main__":
    log.info("Categorizer is listening at port 9999")
    serve(app, host="0.0.0.0", port=9999)
  • Serve the request using the trained model
loaded = False

def classify(input):
    global loaded, vectorizer, classifier
    if not loaded:
        # lazy-load the model bundle on the first request
        vectorizer, classifier = load('./src/data/categorizer.joblib')
        loaded = True

    # normalize the input the same way as the training data
    input = pd.Series(input).str.lower().replace('[^a-z0-9]', ' ', regex=True)
    x = vectorizer.transform(input.values.astype('U')).toarray()
    categories = classifier.predict(x)
    # return a plain list so jsonify can serialize it directly
    return categories.tolist()
  • Test the API!!
    As demonstrated at the beginning of this post, the API returns a list of categories for the input transactions. Now it’s your turn to give it a try!

Conclusion

In this article, I detailed the entire process of building and deploying a transaction categorization API as a fast, accurate, and scalable solution for simplifying the classification of your expenses and income data.

Using straightforward and effective data science tools, this approach harnesses the power of machine learning to streamline financial management and provide the insights needed to make smarter financial decisions.

The API is free to try and use for your personal projects. You can use the "fina-api-test" key shown in the curl example above. If you need support or have any questions, feel free to leave a comment or reach out via email. I hope you find it helpful!


Shawn Cao
Nerd For Tech

Driving toward the mission of making data science technology accessible to everyone.