How to use document similarity to identify vendors in invoices

Published in Sage Ai

6 min read · Jan 11, 2023

Before my first contact with Sage, I had to google the name to find out what they do. When I heard it was developing enterprise resource planning and accounting software, I admit my first impression was that it sounded a bit boring. I pictured a desk with tons of papers and someone bashing numbers into one of those huge 80s calculators.

But I couldn’t have been more wrong.

I soon discovered Sage has been automating accounting processes for years. And today, this includes the use of AI and machine learning. Turns out, they created Sage AI to transform Sage into an AI-enabled technology business. Sounds more interesting, doesn’t it?

In this post, I’ll share an example of a typical problem we solve at Sage AI: the automation of invoices. This case not only makes a great success story, but also shows how much discussion, thinking, and strategy goes into achieving some of our results.

Automating vendor identification

Why identifying the vendor is important

Processing invoices is crucial in accounting, especially for the Accounts Payable (AP) department. AP refers to the payments a company owes to vendors for goods or services it has received. Identifying which vendor the customer owes money to is usually a simple task.

But if there are lots of invoices, doing this manually can become repetitive and lead to errors, such as paying the wrong vendor or losing/duplicating a payment. Automating the process massively improves its reliability, and we can do this even better with machine learning.

Issues in determining the vendor in invoices

Let’s imagine a vendor (Carles Panades Guinart Analytics) has done a wonderful job, created an invoice, and sent it to a customer (Sage UK Limited).

First, the customer will record the data in the accounting system, including the vendor’s details. It’s clear that the name of the vendor appearing in the invoice is “Carles Panades Guinart Analytics”.

However, customers often assign vendors different names in their system. These can be partial acronyms (CPG Analytics), full acronyms (CPGA), the name of the contact person (Carles), names with typos (Charles Panacea Gurnard Analytics), or even a vague name related to the industry (Data Science Company). When this happens, the name in the document might not match the assigned vendor name in the system.

Additionally, one of the most common AP automation tools for this problem is optical character recognition (OCR), which can introduce mistakes of its own. For example, OCR can misread the letter “A” as “4”. In this case, “Analytics” would become “4nalytics”, which is incorrect and complicates the matching.

So, using automation alone doesn’t always provide the best results.

Proposed approach

Similarity of invoices

For the same invoice example, let’s assume “CPG Analytics” was assigned as the vendor’s name in the customer’s accounting system. In the future, there might be more invoices, such as this one:

Figure 2: Another invoice from the same supplier as Figure 1.

This obviously looks very similar to the first invoice, and this is what we can take advantage of. We’re able to match the new invoice to the vendor with the most similar invoice from the past. Though this is a simple idea, there’s a lot of complexity in measuring similarity and executing a match accurately.

Measuring similarity: Document similarity using tf-idf

To measure similarity, we use something called document similarity, which looks at certain parts of the text to identify the vendor. It’s a deeply studied topic within natural language processing and offers a huge variety of methods to compute text similarity, such as:

Jaccard similarity and related methods, for instance Locality-Sensitive Hashing (LSH)
Term frequency-inverse document frequency (tf-idf)
Doc2vec
Bidirectional Encoder Representations from Transformers (BERT)
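As a point of comparison, here is a minimal sketch of the first method in the list, Jaccard similarity, applied to two of the vendor names from the example above. The jaccard_similarity helper and the lowercase whitespace tokenization are illustrative assumptions, not part of our solution.

```python
def jaccard_similarity(text_a: str, text_b: str) -> float:
    """Share of distinct lowercase tokens that two texts have in common."""
    tokens_a = set(text_a.lower().split())
    tokens_b = set(text_b.lower().split())
    union = tokens_a | tokens_b
    if not union:
        return 0.0  # Assumption: two empty texts count as having no overlap.
    return len(tokens_a & tokens_b) / len(union)


name_on_invoice = "Carles Panades Guinart Analytics"
name_in_system = "CPG Analytics"

# Only "analytics" is shared among the 5 distinct tokens.
score = jaccard_similarity(name_on_invoice, name_in_system)
print(score)
```

The low score for two names that refer to the same vendor shows why comparing names alone is fragile.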

For this solution, we use tf-idf because it is accurate, battle-tested in data science, and relatively easy to implement. We won’t dive into the details of tf-idf, as this is a topic in itself. Instead, we’ll explore how we used it to solve our problem.

Implementing tf-idf

The first step we took was to use the invoice in Figure 1 to train our machine learning model. This involved extracting the following text from that invoice, and using it to create a list of invoices as the input of the model:

invoice_1 = 'Carles Panades Guinart Analytics 221B Baker Street, London NW1 6XE United Kingdom Phone +44 21 7737 0934 INVOICE NO. 001984 2022-07-21 BILL TO SHIP TO INSTRUCTIONS Sage (UK) Limited C23 5 & 6 Cobalt Park Way Wallsend NE28 9EJ United Kingdom Same as recipient None QUANTITY DESCRIPTION UNIT PRICE TOTAL 1 Data Science Mumbo Jumbo 10000.00 10000.00 SUBTOTAL 10000.00 SALES TAX 21% SHIPPING & HANDLING TOTAL DUE BY DATE 12100.00 GBP Thank you for your business!'
invoices_train = [invoice_1]

Note that the vendor is known, so we also have the corresponding list of vendors:

vendors_train = ['CPG Analytics']

We then built our tf-idf vectorizer and used it to fit the nearest neighbor algorithm, which provides the most similar document with a few lines of code:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(invoices_train)
nn = NearestNeighbors(n_neighbors=1, metric='cosine')
nn.fit(X)

Finally, we can predict the vendor on the second invoice (Figure 2) after processing it in a similar way:

invoice_2 = 'Carles Panades Guinart Analytics 221B Baker Street, London NW1 6XE United Kingdom Phone +44 21 7737 0934 INVOICE NO. 001999 2022-07-25 BILL TO SHIP TO INSTRUCTIONS Sage (UK) Limited C23 5 & 6 Cobalt Park Way Wallsend NE28 9EJ United Kingdom Same as recipient None QUANTITY DESCRIPTION UNIT PRICE TOTAL 1 More Data Science Mumbo Jumbo 5000.00 5000.00 1 Machine Learning Gobbledygook 10000.00 10000.00 SUBTOTAL 15000.00 SALES TAX 21% SHIPPING & HANDLING TOTAL DUE BY DATE 18150.00 GBP Thank you for your business!'
invoices_test = [invoice_2]
invoices_test_tfidf_matrix = vectorizer.transform(invoices_test)
nn.kneighbors(invoices_test_tfidf_matrix)

This results in:

(array([[0.03479332]]), array([[0]]))

The first element of the tuple contains the cosine distances, and the second contains the indexes of the nearest neighbors. In this scenario, the closest document to the second invoice is the one at index 0, which corresponds to the first invoice in invoices_train, so the predicted vendor is vendors_train[0], or “CPG Analytics”. The score is a cosine distance (0 is the best score and 1 the worst), so the result of 0.03479332 is quite good. Note that in this example, each element of the tuple is a matrix of dimension 1 x 1, because there was only one invoice to predict and we only asked for one nearest neighbor.
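To turn that output into vendor names, each returned index can be looked up in the list of vendors. The following is a minimal sketch; the vendors_from_neighbors helper is a hypothetical name, and the distances and indexes are hard-coded from the result above so the snippet runs on its own.

```python
def vendors_from_neighbors(distances, indices, vendors):
    """Pair each query invoice with its closest vendor and cosine distance.

    `distances` and `indices` have one row per query invoice, in the shape
    returned by `nn.kneighbors`; only the first (closest) neighbor is used.
    """
    predictions = []
    for row_distances, row_indices in zip(distances, indices):
        closest = int(row_indices[0])
        predictions.append((vendors[closest], float(row_distances[0])))
    return predictions


vendors_train = ['CPG Analytics']

# Values copied from the output shown above, written as plain nested lists.
distances = [[0.03479332]]
indices = [[0]]

predictions = vendors_from_neighbors(distances, indices, vendors_train)
print(predictions)
```

The same helper should also accept the NumPy arrays returned by kneighbors directly, since it only iterates over rows and reads the first entry of each.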

Challenges in production

This method provides a great level of accuracy for identifying recurring vendors. But as the customer business grows, new vendors will keep appearing, and we can’t make predictions on those as there are no previous invoices. To account for these new vendors, we frequently retrain the model with validated data.
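Here is a minimal sketch of what such a retraining step could look like, reusing the scikit-learn classes from earlier. The fit_vendor_index and predict_vendor helpers, the shortened invoice texts, and the second vendor (“Example Office Supplies Ltd”) are all hypothetical and only serve as illustration. The vectorizer is refitted too, because TfidfVectorizer ignores words at transform time that it did not see during fitting.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors


def fit_vendor_index(invoices):
    """Fit a fresh tf-idf vectorizer and nearest-neighbor index on all invoices."""
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(invoices)
    nn = NearestNeighbors(n_neighbors=1, metric='cosine')
    nn.fit(X)
    return vectorizer, nn


def predict_vendor(invoice_text, vectorizer, nn, vendors):
    """Return the vendor of the most similar known invoice and its distance."""
    distances, indices = nn.kneighbors(vectorizer.transform([invoice_text]))
    return vendors[indices[0][0]], float(distances[0][0])


# Shortened, illustrative invoice texts (not real data).
invoices_train = ['Carles Panades Guinart Analytics 221B Baker Street London '
                  'INVOICE NO. 001984 Data Science Mumbo Jumbo']
vendors_train = ['CPG Analytics']
vectorizer, nn = fit_vendor_index(invoices_train)

# First invoice from a hypothetical vendor the model has never seen:
# the only possible answer is the single known vendor, which is wrong.
first_invoice = ('Example Office Supplies Ltd 1 Hypothetical Road '
                 'INVOICE NO. 000042 Paper Pens Staplers')
vendor_before, _ = predict_vendor(first_invoice, vectorizer, nn, vendors_train)

# After the customer validates that invoice, add it and retrain.
invoices_train.append(first_invoice)
vendors_train.append('Example Office Supplies Ltd')
vectorizer, nn = fit_vendor_index(invoices_train)

# A later invoice from the same vendor can now be matched.
later_invoice = ('Example Office Supplies Ltd 1 Hypothetical Road '
                 'INVOICE NO. 000057 Paper Folders')
vendor_after, _ = predict_vendor(later_invoice, vectorizer, nn, vendors_train)

print(vendor_before)
print(vendor_after)
```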

This is reliable, trustworthy data from customers that will serve the model well. But trust is something that works both ways, and our predictions also need to be trusted. Therefore, it’s crucial we inform customers using our services if a prediction is not good enough.

Based on the available data, an easy way to achieve this is with a threshold on the scores. Because the score is a cosine distance, where lower is better, a prediction whose score is above a certain value is tagged as potentially untrustworthy. Trust is a critical piece in this picture, and from our experience, real trust is built brick by brick.
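As a rough sketch of that idea, the snippet below flags predictions whose cosine distance exceeds a cutoff. The flag_prediction helper and the value of MAX_TRUSTED_DISTANCE are hypothetical; a real cutoff would need to be chosen by checking it against validated data.

```python
MAX_TRUSTED_DISTANCE = 0.3  # Hypothetical cutoff, chosen only for illustration.


def flag_prediction(vendor, distance, max_distance=MAX_TRUSTED_DISTANCE):
    """Attach a trust flag to a prediction based on its cosine distance."""
    return {
        'vendor': vendor,
        'distance': distance,
        # Lower distance means a closer match, so large distances are flagged.
        'trusted': distance <= max_distance,
    }


# The first distance comes from the example above; the second is made up.
confident = flag_prediction('CPG Analytics', 0.03479332)
doubtful = flag_prediction('CPG Analytics', 0.65)

print(confident['trusted'], doubtful['trusted'])
```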

Hopefully this post has taught you a little about invoice similarity using tf-idf. After exploring this complex problem from many different perspectives, we were able to come up with this relatively simple solution. The approach we used is effective, and you can try it yourself with only a few lines of Python. Our focus was on the vendor, but this method can be immediately extended to other accounting fields, such as the type of expense.


Written by Carles Panadès Guinart