Slope TransFormer: The first LLM trained to understand the language of banks.

Alex Wu
Published in Slope Stories
Nov 14, 2023

(Patent pending)

Today, we’re excited to share that we’ve developed the first Large Language Model (LLM) trained specifically to understand the language of banks: Slope TransFormer. It categorizes messy bank transaction data with speed and accuracy that surpass Plaid, ChatGPT, and humans. As the successor to SlopeGPT, it is the first LLM we’ve trained in-house.

We will share the motivation for it, the methodology used, and its results — including how it stacks up to existing solutions. We will end with some immediate applications, and how it fits into our vision of redefining how underwriting is done.

Why do we care about transactions?

First, some context. At Slope, we spend a lot of time with bank transactions. Why? Simply put, we are a payments company, and central to every payments company is risk. To that end, there is no better way to understand a business — what it’s been up to, its financial outlook, its fraud risk — than looking at every $ that flows in and out of it. The transaction, as we see it, is the atomic unit of business. It is the lifeblood.

Additionally, bank transactions have 2 critical properties:

  • Real-time. Thanks to Open Banking (e.g. Plaid), once a business connects its bank accounts, we see every new $ that flows in and out of the business in real-time.
  • Unfalsifiable. Thanks to banks, a transaction is proof of an exchange of money. One cannot fake a transaction that’s pulled directly from their bank’s records (contrast this to an income statement).

At Slope, we strive to understand our customers deeply. Doing so not only enables us to assess risk, but fundamentally to build better products for our customers: from AR automation, to payments, to financing that’s personalized to a business’s unique needs. Transaction fluency, therefore, is a fundamental problem for Slope.

However, transactions are hard to understand.

The issue is that transactions are not written in English, or even a single language, for that matter. It is a language of many dialects: a single transaction type can be expressed 10 different ways across 10 different banks:

These are all payments from Shopify.

Additionally, a transaction can be complex. It may have components that represent different counterparties, channels, and intermediaries which obscure the true flow of money. This opaqueness is only furthered by the rise of payment processors and middlemen (e.g. PayPal, Zelle, and even Slope).

Can you tell where the money is going here?

BILL.COM DES:ACCTVERIFY ID:025AYXVFMTBCRRX INDN:DAVID VAN ARCH CO ID:XXXXX634527 CCD

If you consider the combinations of (bank dialects X merchants X intermediaries) — and also that a “merchant” can be any individual or entity in the world, and that new intermediaries are spawning every day — it becomes clear that transactions cannot be solved with traditional, rules-based methods. It is a high-dimensional, long-tail problem that even specialist companies often struggle to get right.

What about existing solutions?

Plaid

As our Open Banking provider, Plaid serves us transaction data pulled directly from our customers’ bank accounts. On top of this, Plaid tags the counterparty of each transaction (e.g. Shopify). But only sometimes. We found that Plaid gives us less than 50% coverage across our customers’ transactions:

And even when tags are provided, they can be noisy. Some examples:

  1. Noisy labels for even well-known merchants:

  2. Confusing the person, Aldo, for the company, Aldo:

  3. A single description resulting in a wide range of labels:

While some of these mistakes may seem elementary, producing accurate tags on a consistent basis is a deceptively difficult task, with many hidden tradeoffs. For the most part, Plaid does a very good job. But in our application — B2B risk assessment — we have especially strict requirements when it comes to accuracy, coverage, and explainability. We cannot afford a mistake with so much on the line.

ChatGPT

What about LLMs? There are 2 promising properties of LLMs in the context of transaction tagging: 1) their ability to extract meaning from unstructured data and 2) their pre-trained knowledge of the world. Here are some of our experiments with ChatGPT:

Wrong answer & super wordy.
Better with some prompt engineering, but still wordy.

Assuming we solve for accuracy and wordiness, there are still fundamental issues with a chat-based approach: unpredictability (the same prompt asked 10x may give you 10 different responses) and scalability (slow and expensive to hit an API 1000’s of times for a single customer). Yet, we saw promise. We began to believe that in some form, LLMs held the key to our problem.

SlopeGPT

Earlier this year, we launched SlopeGPT. Using GPT embeddings, we clustered transactions by semantic similarity. This allowed us to reliably group transactions into distinct cashflows without explicitly labeling them. Additionally, as the clustering happened at the customer level, the cashflows were fit uniquely to each business.
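In spirit, the clustering step can be sketched as below. The toy 2-D vectors and the greedy threshold pass are illustrative stand-ins for the real GPT embeddings and the production clustering algorithm, which the post does not detail.

```python
import numpy as np

def cosine_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cluster_by_similarity(embeddings, threshold=0.9):
    """Greedy single-pass clustering: each transaction joins the first
    existing cluster whose centroid is similar enough, else starts a new one."""
    clusters = []   # list of lists of transaction indices
    centroids = []  # running mean embedding per cluster
    for i, emb in enumerate(embeddings):
        for c, centroid in enumerate(centroids):
            if cosine_sim(emb, centroid) >= threshold:
                clusters[c].append(i)
                members = np.array([embeddings[j] for j in clusters[c]])
                centroids[c] = members.mean(axis=0)
                break
        else:
            clusters.append([i])
            centroids.append(np.array(emb, dtype=float))
    return clusters

# Toy embeddings: two similar Shopify payouts and one unrelated charge.
embs = [np.array([1.0, 0.1]), np.array([0.95, 0.12]), np.array([0.0, 1.0])]
print(cluster_by_similarity(embs))  # [[0, 1], [2]]
```

The key property the post relies on is that no labels are needed: transactions group themselves by embedding similarity, per customer.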

The impact was massive: from raw transactions emerged a rich story. We could now see individual streams of incomes and expenses, how they changed over time, and where they were headed. It was a major leap forward in our ability to understand our customers. Still, it had limitations:

  1. The resulting clusters were unlabeled: it could tell you which transactions likely belonged to the same cashflow streams, but not what those streams were.
  2. It was not optimized for financial data. We used out-of-the-box GPT embeddings, meaning we used English semantic similarity as a proxy for transaction semantic similarity. It worked surprisingly well, but we believed we could do better.
  3. It was slow: ~500 ms/txn. This may seem fast, but a single customer may have thousands of transactions. Our SLA for underwriting is 7s.

We’re excited to say that TransFormer overcomes all these limitations.

Introducing: Slope TransFormer ⚡

Slope TransFormer is a proprietary LLM fine-tuned to extract meaning from bank transactions. It produces accurate, concise counterparty labels in an interpretable, deterministic way.

A more in-depth analysis is shared below.

Additionally, it is highly performant. TransFormer can label over 500 transactions/sec — a 250x speedup over SlopeGPT. Let’s see how that’s possible.

What makes TransFormer unique

An Efficient Base Model

We started with an efficient, open-source foundation model: OPT-125M (Open Pre-trained Transformer). Why such a “small” model? Consider the dimensionality of the language of transactions. While it is too vast to solve with rules, it is minuscule compared to English. Additionally, there is only one task at hand, and it is very simple by LLM standards. 77B parameters may be necessary for building AGI — overkill for classifying transactions.

Defining a new language.

One thing to remember is that an LLM is just a text completion bot. An autoregressive model like OPT or GPT-4 is especially simple in functionality: given a sequence of words (or tokens), it produces the next word that is most likely to appear based on the text it has seen before. It repeats this process until it outputs a special “stop” token. The newly generated tokens form the response.
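That generation loop can be sketched in a few lines. The toy "model" below is a hard-coded stand-in for real next-token logits; everything else mirrors greedy autoregressive decoding as described above.

```python
import numpy as np

def generate(next_token_logits_fn, prompt_tokens, stop_token, max_new=20):
    """Greedy autoregressive decoding: repeatedly append the most probable
    next token until the stop token is produced."""
    tokens = list(prompt_tokens)
    for _ in range(max_new):
        logits = next_token_logits_fn(tokens)
        next_tok = int(np.argmax(logits))
        if next_tok == stop_token:
            break
        tokens.append(next_tok)
    return tokens[len(prompt_tokens):]

# Toy "model": after token 0 predict 1, after 1 predict 2, after 2 stop.
def toy_logits(tokens):
    script = {0: 1, 1: 2, 2: 0}
    logits = np.zeros(3)
    logits[script[tokens[-1]]] = 10.0
    return logits

print(generate(toy_logits, [0], stop_token=0))  # [1, 2]
```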

Left: Andrej Karpathy, State of GPT [1]. Right: Google.

The problem is, we want our LLM to perform a very specific task: identifying the key counterparty of a transaction. We don’t want it to spin a story, or for it to even reply in English, necessarily. We want the merchant name, and that only.

Thankfully, there’s a trick. If you can find a special symbol that does not naturally appear in the language domain, you can inject that symbol into your text data — followed by whatever you want your model to predict. During training, this newly injected pattern (i.e. <symbol><task>) will be learned the same way any other grammatical pattern is learned (i.e. <subject><predicate>). In this way, you can trick the LLM — which is just a text generator — to perform any task you define.

For example, we can define the task of merchant labeling as such:
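Concretely, a training "sentence" pairs the raw description with its label via the => separator described below. This sketch is illustrative only; the helper name and explicit </s> handling are assumptions, not Slope's actual preprocessing.

```python
SEP = "=>"     # the special symbol that marks the start of the task
STOP = "</s>"  # the model's end-of-sequence token

def make_training_example(description: str, merchant: str) -> str:
    """Build one training sentence: <description><SEP><merchant><STOP>."""
    return f"{description}{SEP}{merchant}{STOP}"

example = make_training_example("JIFFY LUBE ROANOKE VA 10/06", "JIFFY LUBE")
print(example)  # JIFFY LUBE ROANOKE VA 10/06=>JIFFY LUBE</s>
```

At inference time, the model is prompted with `<description>=>` and simply completes the sentence with the merchant name.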

In effect, we have defined a new language. In this language, a sentence is composed of:

  • Subject: merchant (the primary counter-party)
  • Objects: intermediaries and other counter-parties
  • Verbs & adjectives: payment type, locations, IDs, and other gibberish
  • Punctuation: Our special symbol, =>, represents the end of a sentence. And it is always followed by the subject, and only the subject.

Now, we just need to teach it to our model!

Training TransFormer

Of roughly 6M transactions, Plaid was able to automatically tag 2.5M with merchant names. We cleaned and filtered these down to 66K high-quality labels. We then augmented this by hand labeling another 2K transactions.

Fine Tuning Algorithm: LoRA

For the actual training procedure, we employed a technique called LoRA (Low-Rank Adaptation of Large Language Models). LoRA enables efficient fine-tuning of foundation models by freezing the weights of the original model and training a simplified representation instead [2]. Beyond efficiency, this has the added bonus of mitigating “catastrophic forgetting” — when a model’s weights are altered so much it loses key knowledge it once learned.
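The core LoRA idea can be sketched on a single frozen weight matrix. The dimensions, rank, and scaling below are illustrative, not our production settings, and the real adapters sit inside attention layers of OPT-125M rather than a standalone matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen pretrained weight (d_out x d_in) -- stands in for one layer.
d_out, d_in, r = 8, 8, 2
W = rng.normal(size=(d_out, d_in))

# LoRA trains only a low-rank update B @ A. B starts at zero, so the
# adapted layer initially matches the pretrained layer exactly.
A = rng.normal(scale=0.01, size=(r, d_in))
B = np.zeros((d_out, r))
alpha = 16  # scaling hyperparameter

def adapted_forward(x):
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
assert np.allclose(adapted_forward(x), W @ x)  # no drift before training

# Far fewer trainable parameters than full fine-tuning:
print(A.size + B.size, "trainable vs", W.size, "frozen")
```

Because W is never touched, the pretrained knowledge is preserved, which is exactly the "catastrophic forgetting" mitigation mentioned above.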

Training Regimen

The full regimen is shown below. One important thing to note is that as designed, the curriculum allows the model to draw from general knowledge of the world (e.g. that San Jose is a place, while Roku is a tech company) and apply it to a specific task, in a very specific domain. This, in our view, is the true power of applied generative AI.

Results

After just a few hours of training, our model reaches fluency. Here are some real outputs (with sensitive info scrambled or stripped):

// Recognizes ROANOKE is a location, not a business: 
JIFFY LUBE ROANOKE VA 10/06=>JIFFY LUBE

// Tough one: multiple counterparties:
CHIPS CREDIT VIA: CITIBANK N.A./0008 B/O: POWERNODE INC. MARKHAM ON/CA REF: NBNF=ORATION CITY OF INDUSTRY CA 91748-1114 US/AC-0 00000005234 ORG=/CAHKBC51234920843 MARKHAM ON/CA OGB=HSBC BANK CANADA VANCOUVER CANADA CA BBI=/CHGS/USD0=>POWERNODE INC.

// Outperforms a real human: the labeler saw “drive thru” and
// thought EDINBURG was a burger restaurant!
PURCHASE AUTHORIZED ON 08/28 VERASDRIVETHRU EDINBURG TX S62345361234096 CARD 5932=>VERASDRIVETHRU

Across the entire test set, TransFormer is able to achieve over 72% exact match accuracy against a human expert. Plaid, by comparison, achieves just 62%:

The test set comprises 229 randomly sampled transactions that were then hand-labeled by an expert. Here, Jaccard similarity measures the overlap between the hand label and the prediction, ranging from 1 (exact match) down to 0 (no overlap). We take the mean Jaccard similarity across the test set.
Samples.
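The post doesn't spell out the exact tokenization used for the Jaccard metric; a simple word-level variant looks like this:

```python
def jaccard(label: str, prediction: str) -> float:
    """Word-level Jaccard similarity: |intersection| / |union| of word sets.
    1.0 means an exact match (as word sets), 0.0 means no overlap."""
    a = set(label.upper().split())
    b = set(prediction.upper().split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

print(jaccard("JIFFY LUBE", "JIFFY LUBE"))  # 1.0
print(jaccard("JIFFY LUBE", "JIFFY"))       # 0.5
print(jaccard("JIFFY LUBE", "SHOPIFY"))     # 0.0
```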

Note that when building the test set, we sampled only from transactions that Plaid was able to tag. In practice, we would also need to consider coverage: Plaid tags less than 50% of our customers’ transactions — TransFormer tags all of them:

Interpretability and reliability

But TransFormer is more than just accurate: it is highly consistent. To understand how, let’s walk through a specific prompt: #X01G52I5Z AMAZON.COM.

Remember, an LLM works by chunking words into subwords (tokens), and repeatedly predicting the most probable next token until it eventually predicts the stop token. Here, I’m visualizing the confidence of each token of the prediction:

The stop token: </s> is ignored in postprocessing.

Not only is the prediction correct, but it’s almost 100% confident!

Let’s dig one layer deeper. Under the hood, the model does not predict just one value for each token, but a probability distribution across its entire vocabulary. Our model has a vocabulary of 50,000 tokens. That means 50,000 possible outputs for each position, with a cumulative probability of 1.

Left: probability distribution of each output across entire vocabulary. Right: top 3 most probable tokens of each output.
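That per-position distribution is just a softmax over the model's logits. A sketch with a random 50,000-entry logit vector (the spiked index and magnitude are made up for illustration):

```python
import numpy as np

def token_distribution(logits):
    """Softmax: turn raw logits over the vocabulary into a probability
    distribution summing to 1."""
    z = logits - np.max(logits)  # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

vocab_size = 50_000
rng = np.random.default_rng(0)
logits = rng.normal(size=vocab_size)
logits[1234] = 15.0  # the model is very confident about one token

p = token_distribution(logits)
print(p.sum())                      # ~1.0 across all 50,000 tokens
print(int(np.argmax(p)) == 1234)    # the confident token dominates
```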

Notice the certainty of each output: there is only one possible answer in the model’s mind! Interestingly, the least certain output is the final one: the top candidate, </s> (the stop token), has a probability of 96%, while the next candidate, ., has a probability of 3%. Here, the model makes a choice: whether to end the sentence after ON or to continue with . and COM. It chooses (correctly) to be succinct.

In a production system, interpretability and reliability are critical. To that end, there is a major advantage to training an LLM for a specific task rather than retrofitting a chat-optimized model like ChatGPT. ChatGPT may give you 10 different responses for the same question asked 10x, whereas TransFormer, in practice, is deterministic.
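One concrete mechanism behind that difference is the decoding strategy: chat models typically sample from the token distribution, while a single-task tagger can decode greedily. A sketch with illustrative probabilities:

```python
import numpy as np

probs = np.array([0.96, 0.03, 0.01])  # e.g. </s>, ".", other

# Greedy decoding: always take the argmax -- the same answer every time.
greedy = [int(np.argmax(probs)) for _ in range(10)]

# Sampling (typical for chat): draws can differ from run to run.
rng = np.random.default_rng()
sampled = [int(rng.choice(len(probs), p=probs)) for _ in range(10)]

print(greedy)   # ten identical answers
print(sampled)  # may vary across runs
```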

Performance

Due to greater efficiency (lighter, self-hosted model), TransFormer achieves a staggering 250x speedup over SlopeGPT. This makes it productionizable for real-time underwriting!

A full comparison of TransFormer vs. its predecessor:

Impact

Thanks to its greater functionality and performance, we’ve already been able to deploy it and reap its benefits. TransFormer now powers all our live credit monitoring dashboards:

At a glance, one gets a high-resolution, piercing view into the business. This business in particular appears to be a dropshipper who sources from Alibaba and resells on Shopify. Additionally, notice that SHOPIFY is categorized separately from SHOPIFY CAPITAL. This was intentional! We labeled the two distinctly in the training data, as one is a revenue source, while the other is a form of financing. You can validate that yourself by looking at the time series: notice the steady inflows (red) compared to the chunky lump sums (purple). Additionally, notice SHOPIFY CAPITAL in the outflows (purple): we’re able to validate that the business is making steady repayments!

TransFormer gives us such a view into our entire portfolio — refreshed daily. It unlocks new, powerful signals we can use to monitor changing risks, raise alerts, and apply automated adjustments. For example:

  • Trends and seasonality of individual cash flow streams.
  • Correlations between inflow and outflow streams (e.g. identifying fixed vs. variable costs, trends in profit margin, inventory turnaround time).
  • Abnormal events (e.g. a sudden stop in loan payments or revenue source).

We are now working to deploy TransFormer to power our entire underwriting system. Stay tuned!

Our vision for SlopeAI

Financials-based underwriting has been the status quo since the invention of lending. And for good reason: it works.

But we think we can do better. Our vision is to reach the precision of financials but from the bottom up: starting with the fundamental unit of business, and constructing financials that are unfalsifiable, refreshed in real-time, and custom fit to each business. In fact, we’d like to understand our customers better than they understand themselves. TransFormer marks a major milestone towards that vision.

Finally, this goes beyond risk and underwriting. Because our goal at Slope is to digitize the world’s B2B economy. In conjunction with TransFormer, we’ve been developing AI to tackle the entire problem space: from improving the KYB process, to automating order-to-cash workflows, to building personalized financial products to help our business customers grow. More to come soon!

Ultimately, we don’t think that AI should be front-and-center of a business’s user experience with Slope. But we do believe AI will play a big role in automating workflows that, for decades, have been inefficient, manual and excruciatingly painful to manage. These inefficiencies, in our view, are the only things stopping the digitization of the $125T B2B economy. We are excited to change that.

If any of this made you excited, please reach out. We value deep tech backgrounds from all disciplines (our own backgrounds beyond finance range from aeronautics to AI research to autonomous driving). Join us on our journey!

References

[1] Andrej Karpathy. State of GPT. 2023.

[2] Edward J. Hu, Yelong Shen, et al. LoRA: Low-Rank Adaptation of Large Language Models. 2021.

A huge thanks to Lawrence Lin Murata, Alice Deng, Bryant Chen, and Jason Huang for editing and revising this blog, as well as the entire Slope team for building the incredible tech foundation that enabled this work.
