Fuzzy Matching, Word Vectors and Google ML Kit: Lessons Learned in Text Analysis and Classification

Marczhan
Warwick Artificial Intelligence
14 min read · Feb 8, 2022

--

Introduction

The following details the creation of Tickit, an application for analysing shopping trends from images of receipts. Throughout our journey, we hit many roadblocks and pivoted countless times from our original objectives to achieve something more realistic and attainable. We originally set out to make an all-inclusive app that uses computer vision and classification to analyse and categorise a user’s shopping trends. We then decided to shift our focus away from the app itself, as that would involve a lot of front-end work that was no longer our main priority for this project. Instead, we thought it would be more beneficial to work solely on the pipeline that analyses and categorises shopping receipts, which we have posted on our GitHub so that people can implement our procedures however they like. Our pipeline performs three main functions: scanning a shopping receipt, organising the items on it with computer vision, and categorising them using a classification method.

Computer Vision Challenges

Seeing as we initially planned on developing an Android app to host the text classifier’s backend code, we assumed quite early on that we could simply use some Android API to perform the computer vision on the shopping receipt and extract the text before sending that to the backend.

We decided to use Google’s ML Kit, as it allows for on-device text recognition and was fairly easy to add to the app by following this guide. After running ML Kit, we would then slightly reformat the extracted text from the receipt before passing it to the Python scripts responsible for the actual text classification.

However, ML Kit turned out to be particularly bad at correctly recognising text from shopping receipts. Most of the time it would incorrectly add punctuation, scramble the text, and fail to extract item prices from the receipt.

Realistically, we should have expected that such a general-purpose Computer Vision implementation would not perform well when faced with receipts. The text is often much smaller, more difficult to read, and the receipt itself may be slightly crumpled. Moreover, some receipts are slightly see-through, resulting in ML Kit extracting text from the other side of the receipt as well. We investigated whether we could tweak the ML Kit API to improve its performance, but unfortunately it was not possible to alter the underlying Computer Vision model.

Seeing as we were unable to realistically use ML Kit’s Computer Vision API within our app even with all the preliminary solutions tested, we were forced back to the drawing board. Eventually, we found Asprise’s Real-time Receipt OCR, and upon testing it we found that not only could it accurately extract text from the receipt, it would even organise the information into a nicely formatted JSON file.

However, this led us to a new issue: the API was too sensitive. It managed to read text on the back of the receipt, and minor changes in lighting affected performance greatly. Fortunately, through the use of Gaussian blurring, we could eliminate much of the noise that was affecting image clarity, which gave us far better results with Asprise’s API.
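To illustrate the idea, here is a minimal sketch of Gaussian blurring using a hand-rolled separable filter (in practice a library routine such as OpenCV's GaussianBlur would be used; the kernel size and sigma below are illustrative, not the values from our pipeline):

```python
import numpy as np

def gaussian_kernel(size: int, sigma: float) -> np.ndarray:
    # 1-D Gaussian kernel, normalised to sum to 1.
    x = np.arange(size) - (size - 1) / 2
    k = np.exp(-x**2 / (2 * sigma**2))
    return k / k.sum()

def gaussian_blur(img: np.ndarray, size: int = 5, sigma: float = 1.0) -> np.ndarray:
    """Separable Gaussian blur: convolve each row, then each column."""
    k = gaussian_kernel(size, sigma)
    blurred = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, img)
    blurred = np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, blurred)
    return blurred

# Noisy synthetic "receipt": blurring lowers pixel-to-pixel variance,
# suppressing show-through and sensor noise before OCR.
rng = np.random.default_rng(0)
noisy = rng.normal(loc=128, scale=40, size=(64, 64))
smooth = gaussian_blur(noisy)
print(noisy.std() > smooth.std())  # True
```

Because the kernel averages each pixel with its neighbours, high-frequency noise such as faint show-through text is attenuated while larger strokes survive, which is why the blur helped the OCR.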

Thus, with unanimous agreement from the team, we decided to swap from ML Kit to Asprise’s API, resolving our computer vision issues and allowing us to focus on the actual text classification part of the project.

Initial Solutions to ML Kit’s problems: Fuzzy Matching and Word Vectors

Fuzzy logic vs traditional logic

To understand how fuzzy logic works, we first have to consider traditional logic. Traditional logic is binary: a statement is either true or false. For example, a record in one table either exactly matches a record in a second table or it does not.

Fuzzy logic, by contrast, indicates the degree to which a statement is true, as a number between 0 and 1, and then takes the best possible decision for the given input. In real life, we often come across situations where we cannot decide whether a statement is simply true or false; here, fuzzy logic offers very valuable flexibility for reasoning.
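As a toy sketch of the difference, consider a "tall" predicate: traditional logic forces a yes/no answer, while a fuzzy membership function returns a degree of truth (the thresholds here are invented purely for illustration):

```python
def tall_membership(height_cm: float) -> float:
    """Degree to which a person counts as 'tall' (toy membership function).

    Below 160 cm -> 0.0, above 190 cm -> 1.0, linear in between.
    """
    if height_cm <= 160:
        return 0.0
    if height_cm >= 190:
        return 1.0
    return (height_cm - 160) / 30

print(tall_membership(150))  # 0.0: definitely not tall
print(tall_membership(175))  # 0.5: partially tall
print(tall_membership(195))  # 1.0: definitely tall
```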

Fuzzy String Matching: An application of fuzzy logic

Fuzzy matching is a technique for identifying two pieces of text, or strings, that are approximately, but not exactly, the same. It is an application of fuzzy logic. The problem of approximate string matching is typically divided into two sub-problems: finding approximate substring matches inside a given string, and finding dictionary strings that approximately match a pattern.

A classic use case is a spell checker or typo corrector. For example, when a user types “Missisaga” into Google, a list of hits is returned along with “Showing results for mississauga”. That is, the search query returns results even if the user’s input contains additional or missing characters or other spelling errors. [1]

How to perform Fuzzy Name Matching? [2]

The closeness of a match is measured in terms of the number of primitive operations necessary to convert the string into an exact match. This number is called the edit distance between the string and the pattern. The usual primitive operations are:

  • insertion: cot → coat
  • deletion: coat → cot
  • substitution: coat → cost

Some approximate matchers also treat transposition, in which the positions of two adjacent letters in the string are swapped, as a primitive operation.

  • transposition: cost → cots

Different approximate matchers impose different constraints. Some matchers use a single global unweighted cost: the total number of primitive operations necessary to convert the match to the pattern. For example, if the pattern is “coil”, then “foil” differs by one substitution, “coils” by one insertion, “oil” by one deletion, and “foal” by two substitutions. If all operations count as a single unit of cost and the limit is set to one, “foil”, “coils” and “oil” will count as matches while “foal” will not.
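The edit distance described above can be computed with the classic dynamic-programming algorithm; the sketch below reproduces the "coil" example:

```python
def edit_distance(a: str, b: str) -> int:
    """Unweighted Levenshtein distance: minimum number of insertions,
    deletions and substitutions needed to turn a into b."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                           # delete all of a[:i]
    for j in range(n + 1):
        d[0][j] = j                           # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[m][n]

# The "coil" example, with a match limit of 1:
for word in ["foil", "coils", "oil", "foal"]:
    print(word, edit_distance("coil", word))
# foil 1, coils 1, oil 1, foal 2 -> only "foal" exceeds the limit
```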

Other matchers specify the number of operations of each type separately, while still others set a total cost but allow different weights to be assigned to different operations.

By matchers, we mean here the mathematical formulas and algorithms that can be used to perform fuzzy name matching. Some well-known examples include:

  • Levenshtein Distance
  • The Soundex Algorithm
  • The Metaphone and Double Metaphone Algorithms
  • Cosine Similarity

Word Vectors

Word vectors are simply vectors of numbers that represent the meaning of a word. Traditional approaches to NLP, such as one-hot encoding and bag-of-words models, used dummy variables to represent the presence or absence of a word in an observation, i.e. a sentence. They do not capture syntactic (structure) or semantic (meaning) relationships across collections of words and therefore represent language in a very naive way. [3]

In contrast, a word vector is a row of real-valued numbers (as opposed to dummy numbers) where each point captures a dimension of the word’s meaning and where semantically similar words have similar vectors. See the below images to understand the term dimension. This means that words such as wheel and engine should have word vectors similar to that of car (because of the similarity of their meanings), whereas the word banana should be quite distant. The beauty of representing words as vectors is that they lend themselves to mathematical operations. For example, we can add and subtract vectors; the canonical example is showing that word vectors let us determine that:

  • king - man + woman = queen

In other words, we can subtract one aspect of meaning from the word vector for “king” (maleness), add another (femaleness), and show that the resulting vector (king - man + woman) maps most closely to the word vector for “queen”.

The numbers in the word vector represent the word’s distributed weight across dimensions. In a simplified sense, each dimension represents a meaning and the word’s numerical weight on that dimension captures the closeness of its association to that meaning. Thus, the semantics of the word are embedded across the dimensions of the vector.[3]
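A toy sketch of this arithmetic, using invented two-dimensional vectors (real embeddings have hundreds of learned dimensions; the dimensions here are hand-picked as roughly "royalty" and "gender" purely for illustration):

```python
import numpy as np

# Toy "word vectors": dimension 0 ~ royalty, dimension 1 ~ gender.
# These values are invented to illustrate the arithmetic, not learned.
vectors = {
    "king":   np.array([0.9,  0.9]),
    "queen":  np.array([0.9, -0.9]),
    "man":    np.array([0.1,  0.9]),
    "woman":  np.array([0.1, -0.9]),
    "banana": np.array([-0.8, 0.0]),
}

def cosine(u, v):
    # Cosine similarity: 1 for parallel vectors, 0 for orthogonal.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# king - man + woman lands on [0.9, -0.9], exactly "queen" here.
target = vectors["king"] - vectors["man"] + vectors["woman"]
best = max((w for w in vectors if w != "king"),
           key=lambda w: cosine(target, vectors[w]))
print(best)  # queen
```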

What this meant for our project

Fuzzy Matching

Before we started the backend, we developed the ML Kit implementation of the computer vision-based text extraction. As explained earlier, that implementation caused lots of errors, mainly spelling mistakes and failed extractions, so the first task for the backend was some method of error correction. We settled on fuzzy matching to compare words against a dictionary; this was chosen because it could handle not only spelling mistakes but also some abbreviations (e.g. ‘SWT’ for ‘sweet’). We used the RapidFuzz Python library to implement this [10]. For this to work, we needed to adjust the weights of the fuzzy matching so that inserting letters was less costly than substituting or deleting them; in the end, we used weights of 1, 2 and 3 for insertion, substitution and deletion respectively. However, once the text extraction was improved by switching to the API, the fuzzy matching was mainly used for abbreviations and for checking whether words the text classifier wasn’t trained on had any close matches.
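The weighting scheme above can be sketched as an edit distance with per-operation costs (RapidFuzz's Levenshtein scorer accepts a similar weights option; this pure-Python version is only for illustration):

```python
def weighted_edit_distance(a, b, w_ins=1, w_sub=2, w_del=3):
    """Levenshtein distance with per-operation costs, so expanding an
    abbreviation (insertions) is cheaper than substituting or deleting."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * w_del                        # delete all of a[:i]
    for j in range(1, n + 1):
        d[0][j] = j * w_ins                        # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if a[i - 1] == b[j - 1] else w_sub
            d[i][j] = min(d[i - 1][j] + w_del,     # deletion
                          d[i][j - 1] + w_ins,     # insertion
                          d[i - 1][j - 1] + sub)   # substitution / match
    return d[m][n]

# The costs are asymmetric: growing a word is cheap, shrinking it is not.
print(weighted_edit_distance("cot", "coat"))   # 1: one insertion
print(weighted_edit_distance("coat", "cot"))   # 3: one deletion

# Ranking dictionary words against the abbreviation "swt":
dictionary = ["sweet", "swat", "sweat", "salt"]
print(sorted(dictionary, key=lambda w: weighted_edit_distance("swt", w)))
# ['swat', 'sweet', 'sweat', 'salt']
```

Note that even with insertion-friendly weights, 'swat' (one insertion) still beats 'sweet' (two insertions) for 'swt', which is exactly the problem the word-vector validation described next had to solve.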

Word Vectors

One issue with the fuzzy matching method of text correction is that much of the time the closest matching word isn’t related to food products. One such example was the abbreviation ‘SWT’ for ‘sweet’: using fuzzy matching alone, the closest match was actually ‘swat’, as it requires only one inserted letter compared to the two required for ‘sweet’. To remedy this, we used word vectors taken from a trained text classifier to validate the results of the fuzzy matching [9]. For single-word items, this was a simple task of taking the top n (we chose 5) closest fuzzy matches and comparing them against the validation words ‘food’, ‘product’ and ‘store’. Doing this allowed us to discard words that had no connection to items bought in a shop, leaving only the plausible matches. However, this task was harder for multi-word items: some words don’t relate to a food product by themselves but do in combination with other words. For example, in a three-word item with two correct words and one incorrect word, the most fitting correction for that combination might score lowest against the validation words when considered on its own. Because of this, we mostly used word vectors for single-word corrections and abbreviations.
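The validation step can be sketched as follows, with invented toy vectors standing in for embeddings from a trained fastText model:

```python
import numpy as np

# Toy embeddings standing in for vectors from a trained model;
# the numbers are invented purely for illustration.
emb = {
    "sweet":   np.array([0.8, 0.2]),
    "swat":    np.array([-0.7, 0.6]),
    "food":    np.array([0.9, 0.1]),
    "product": np.array([0.7, 0.3]),
    "store":   np.array([0.6, 0.4]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def food_relatedness(word, validators=("food", "product", "store")):
    # Score a fuzzy-match candidate by its best similarity to the
    # validation words.
    return max(cosine(emb[word], emb[v]) for v in validators)

candidates = ["swat", "sweet"]          # fuzzy matches for "SWT"
best = max(candidates, key=food_relatedness)
print(best)  # sweet
```

Here 'swat' scores poorly against all three validation words while 'sweet' scores highly, so the semantically sensible correction wins even though it was further away by edit distance.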

The Finished Product

Front-End

Over the course of the project, we shifted our priorities and attention away from the front-end app design, focusing instead on correctly implementing the text classification models. However, this does not mean the app was completely forgotten; rather, its purpose has been slightly altered.

Currently, the app acts more as an interface to our backend code, where data is first loaded into our text classification pipeline. The app contains a single button that, when clicked, launches the Android device’s camera. Upon taking a picture of a receipt, the image is saved locally and passed to a Python script (running within the app) that uses Asprise’s API to send the image and receive the JSON response.

We then send this JSON file to our other Python scripts, which perform the actual text classification. While waiting for the JSON response and classification, the app uses multithreading to ensure that its interface does not freeze up, instead showing a loading screen.

After all the on-device processing has been completed, the items and their classification can be displayed back to the user, giving them a useful breakdown of their shopping habits, just as our project had initially set out to do.

Backend

Originally, the goal of the backend was to receive the JSON file containing the information from the text extraction, process the data, run it through the text classifier and prepare the classified data before sending it back to the front end to render. However, as the ML Kit based implementation was replaced with an API called from Python, we shifted the text extraction into the backend as well.

Text Extraction

The backend version of the text extraction uses an API to connect to an online service by Asprise. They specialise in text scanning and Optical Character Recognition and offer multiple tools for it, such as cloud-based OCR for documents, receipts and invoices. We receive the image from the front end, upload it to their cloud via the API and receive a JSON file with the data. As the text extraction is designed specifically for receipts, it can identify the different areas on a receipt, such as date, time, total cost and merchant, and it can separate each item into quantity, description and price [4]. This is a paid service used mainly by companies, but there is a small number of free hourly uploads that can be used for testing; not enough to run the app in full, but enough for prototyping.
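The upload itself is a multipart POST of the image to Asprise's endpoint; once the JSON response arrives, pulling out the line items is straightforward. The sketch below assumes the response shape from Asprise's sample output (a top-level "receipts" list whose entries carry an "items" list with "description" and "amount" fields); check their current docs before relying on it:

```python
def extract_items(ocr_json):
    """Pull (description, price) pairs out of the OCR response.

    The JSON shape assumed here follows Asprise's sample output at the
    time; field names may differ in their current API.
    """
    items = []
    for receipt in ocr_json.get("receipts", []):
        for item in receipt.get("items", []):
            items.append((item["description"], item["amount"]))
    return items

# A hand-written stand-in for a real API response:
sample = {"receipts": [{"merchant_name": "Example Store",
                        "items": [{"description": "SWT POTATOES", "amount": 1.5},
                                  {"description": "MILK", "amount": 0.9}]}]}
print(extract_items(sample))  # [('SWT POTATOES', 1.5), ('MILK', 0.9)]
```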

Data-set

Before we could implement the classifier, we needed a dataset to train it on. Pretrained text classifiers exist, including for the library we ended up using, but those are trained on nearly all words on the internet in all kinds of contexts, whereas we wanted one that only categorises food products, so we decided to train our own. Originally we used data from an open-source product database to which people could add their own items, but that proved problematic. First, as each entry is made by a different person, people would assign slightly different categories to similar products. Secondly, this was made worse by the fact that the database didn’t have a set list of categories, instead allowing users to type categories out freely, which caused even bigger differences between entries for the same items, as well as some categories being too specific. In the end, we moved on to using data scraped from supermarket websites, removing store names from items where present.

Text Classifier

Text classification is a field of machine learning and a subfield of natural language processing that attempts to assign a set of predefined categories to open-ended text; it is also used for learning text representations such as word vectors. Text classifiers can process and categorise all types of text, ranging from simple documents to receipts, text from the internet and even medical studies. Text classification is done using a pre-processed dataset consisting of a list of pieces of text (either single words or phrases) and an output for each piece of text (i.e. a tag); in our case, a list of food products, each with a category.

The first step in an NLP-based text classifier is feature extraction, a method used to transform each piece of text into a numerical vector representation; this is also how word vectors can be created, since the data fed into the learning algorithm is already in numeric form [5]. A popular method of feature extraction on text is bag of words, where a vector represents the frequency of each word from a predefined dictionary [8]. The training data then consists of pairs of feature sets (the numeric vectors of each piece of text) and output tags (e.g. dairy). Once trained, the classifier makes predictions similarly, converting unseen text to feature sets via feature extraction and outputting the predicted tag. Popular machine learning techniques for this part of a text classifier are Naïve Bayes, Support Vector Machines and Deep Learning using either Convolutional or Recurrent Neural Networks [5].
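A minimal sketch of the bag-of-words step described above (a real pipeline would use a library vectoriser, but the idea fits in a few lines):

```python
def bag_of_words(texts):
    """Fit a vocabulary over a small corpus and return count vectors."""
    vocab = sorted({w for t in texts for w in t.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = []
    for t in texts:
        v = [0] * len(vocab)
        for w in t.lower().split():
            v[index[w]] += 1       # count each word occurrence
        vectors.append(v)
    return vocab, vectors

corpus = ["semi skimmed milk", "whole milk", "sweet potatoes"]
vocab, X = bag_of_words(corpus)
print(vocab)  # ['milk', 'potatoes', 'semi', 'skimmed', 'sweet', 'whole']
print(X[0])   # [1, 0, 1, 1, 0, 0] -- counts for "semi skimmed milk"
```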

The text classifier we chose was fastText by Meta, an open-source, lightweight library used to create text classifiers and learn text representations (such as word vectors) [6] [7]. It works on standard generic hardware, so whilst we implemented it on a server, a model can be reduced in size to fit on mobile devices and run within the app. Before running the items through the text classifier, the input data needs to be processed: lower-cased and stripped of punctuation. We then performed the fuzzy matching for abbreviation correction and unknown-word checking. As mentioned before, single-word items weren’t much of a problem: word vectors could be used to validate them, and then they could be run through the text classifier. However, the multi-word problem still existed, so we compiled every combination of the candidate words from the fuzzy matching, ran each one through the text classifier, and took the combination that yielded the highest confidence, as that is the one the text classifier is most sure about.
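The combination search at the end can be sketched as below; the classify function is a stand-in stub for the trained fastText model (a real call would be model.predict(phrase)), invented purely so the ranking logic is runnable:

```python
from itertools import product

def classify(phrase):
    """Stub for the trained fastText model: returns (label, confidence).
    It simply favours phrases containing known food words, for
    illustration only."""
    food_words = {"sweet", "potatoes", "milk"}
    words = phrase.split()
    conf = sum(w in food_words for w in words) / max(len(words), 1)
    return ("food", conf)

def best_combination(candidates_per_word):
    """Try every combination of fuzzy-match candidates and keep the one
    the classifier is most confident about."""
    best, best_conf = None, -1.0
    for combo in product(*candidates_per_word):
        phrase = " ".join(combo)
        _, conf = classify(phrase)
        if conf > best_conf:
            best, best_conf = phrase, conf
    return best

# "SWT POTATOES": fuzzy matching offers two candidates for "SWT".
print(best_combination([["swat", "sweet"], ["potatoes"]]))
# sweet potatoes
```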

The Mock-up

The team also created a mockup of what the classification pipeline would look like if we were to create an app that utilised the skills presented above. We went for a very clean, friendly user interface that can be easily navigated. The target audience is your everyday shopper who wants to keep track of their spending and see where their money goes.

To begin with, we chose a simple logo of a person holding two shopping bags and went for a blue and white colour scheme. The app then progresses to a login/register page, which enables us to store a user’s data in the back end of the app.

Once you have signed in, you are immediately greeted by an overview page. Your profile details are at the top, and your last receipt is displayed along with your monthly overview. There are options to read the full receipt, scan a new receipt and view your entire history. Furthermore, there are buttons at the bottom of the screen leading to the three other screens: the scanning page, the analytics page and the gift page.

The Scanning Page

This page allows the user to scan a receipt or upload one from their camera roll. The receipt is then analysed by our computer vision procedure and classified using the method explained above. Once the receipt has been successfully scanned, the user is immediately taken to the analytics page so they can see how their recent shop has affected their overall trend.

The Analytics Page

This page allows the user to control how they view their history and data. They can choose how far back they want to go to see their trends, as well as the type of representation they want for their data (bar chart, pie chart etc.). Furthermore, the page has a breakdown of the previous month’s trends by category, indicating whether the user has spent more or less money on a certain category than in the month before.

The Gift Page

Finally, the gift page was an idea the team thought of to further incentivise users to download the app. This is a rewards scheme we would negotiate with stores to have their punchcard available in the app. Every time you scan a receipt from a given store, the computer vision picks up the store name and adds a stamp to your punchcard for that store. When the user completes the punchcard, they get a coupon or discount on a product they regularly buy from that store. This has two effects: it spurs people to download the app, as there is now the dual purpose of tracking your spending trends and getting discounts from shops, and it benefits stores, as shoppers are more likely to come back to fill up their punchcards.

This is a quick summary and guide to the app. There is a short video tutorial and run-through of the app attached below which gives a visual presentation of what using it would look like!

Conclusion

This project was a steep learning curve for everyone in the team. On top of the countless times we pivoted our focus and redirected our effort, we learned to adapt to new challenges and quickly think of solutions that fit the project within its timeframe. Moreover, we had to quickly understand each team member’s strengths and weaknesses, having not known one another previously; this was important for proper delegation of tasks and for maximising efficiency. It was a real taste of how to build a project from scratch and of the myriad uncertainties and variables a team has to consider. We are extremely thankful to Warwick AI Society for giving us the opportunity to work on such a creative idea and for selecting us to be part of this team. We cannot wait to work on future projects with WAI!

Please find the GitHub repository for this project here: https://github.com/WarwickAI/ShoppingTicky

References:

  1. https://nanonets.com/blog/fuzzy-matching-fuzzy-logic/#:~:text=Fuzzy%20Matching%20(also%20called%20Approximate,are%20not%20exactly%20the%20same.
  2. https://en.wikipedia.org/wiki/Approximate_string_matching
  3. https://dzone.com/articles/introduction-to-word-vectors
  4. http://asprise.com/receipt-ocr-data-capture-api/extract-text-reader-scanner-index.html
  5. https://monkeylearn.com/text-classification/
  6. https://arxiv.org/pdf/1607.01759.pdf
  7. https://fasttext.cc/docs/en/supervised-tutorial.html
  8. https://fasttext.cc/docs/en/unsupervised-tutorial.html
  9. https://fasttext.cc/docs/en/unsupervised-tutorial.html
  10. https://github.com/maxbachmann/RapidFuzz
