Author: Patrick Ramsey
Dated: September 2017
For the last few months my team and I have been working to solve a problem that’s tantalized us from the early days of our company: how best should we go about extracting useful business data from the receipts that are continually being sent to us by all of our installed clients? For years, this has been the shining goal at the top of the mountain — the hidden promise of Fivestars — but we’ve so far been too busy laying the groundwork to attack this problem in earnest. I am very excited to say that we have now begun this process, and have achieved some promising early results.
The value of these documents seems clear: our client is installed on thousands of point-of-sale systems (cash registers) in small businesses and chains across the country. Early (and continuing) engineering work on that end means that most of the time we are able to detect print jobs as they are sent to the receipt printer, and to intercept text and image data from these receipts. Locked in this unstructured data are restaurant menus, clues about customer behavior, and sales trends from businesses that have hitherto mostly been out of reach for anyone attempting to learn about their behavior and performance. We are (at least in theory) in a position to detect patterns across a broad segment of small and unconnected merchants, which should put us in a better position to advise, support, and serve them than pretty much anyone else, since (unlike a payment processor) we learn about all transactions and not just credit card purchases.
The first and easiest goal, though, has always been what we call Autopoints. One problem that has dogged us from the early days has been that integrating directly with most point of sale vendors is very difficult (or even impossible) — most have no externally accessible API, and those that do often have poor to no documentation. Also, each individual direct integration involves a large expenditure of engineering time, and the number of players in that space is so large that targeting any one POS vendor for an integration is hard to justify. Without these integrations, though, cashiers end up needing to enter every transaction twice: once into the point of sale system, and once into our client to award points, which is a less than ideal experience.
Receipt data seems to offer a neat solution, though: most point-of-sale systems generate a printed receipt for every transaction, which contains all the information we need to automate our own product: namely, the fact that a transaction has just closed, and the amount of that transaction. Having at least that small part of an integration would be a huge win for us: not only would we no longer be a source of additional busy-work for cashiers, but we’d be significantly less vulnerable to fraud as well.
We have one huge advantage: the fact that receipts are mostly composed of text — and that we are intercepting printer data from the Windows print spooler and from serial port traffic — means that for the vast majority of the receipts we collect we don’t need to do any sort of OCR. Most of the receipt contents are already available as text, and can (with a little work) be extracted, cleaned up, and processed as such. The difficulty comes from the fact that while these contents are usually formatted in a way that makes their meaning clear to a human viewer, there is enough variation that trying to extract individual fields using regular expressions and rules of thumb is extremely error-prone and requires a great deal of customization per-POS vendor.
This was, in fact, our first approach: we would do everything in our power to figure out the manufacturer of the POS we were running on, and then use that fact to select a ruleset to apply to the received print jobs. While this was incredibly brittle and required constant maintenance, it ended up being reliable enough (at least for the POS vendors we chose to invest in supporting) that for the last couple of years we have been able to support Autopoints in approximately 36% of merchants running our Windows point-of-sale client. In that time, it has proven to be very popular, and has had a very significant downward effect on churn in the merchants that we’re able to support.
For the last few months, we’ve been exploring options to expand the Autopoints service to all POS clients, and to hopefully use that as a stepping stone to full receipt data extraction and all of the products that that would enable. Our current solution, however, can’t reasonably be made to scale further than it has already — we are continuously updating and adjusting regular expressions just to continue to support the vendors we already claim to support. Operating on the hunch that something more general and scalable could be built using machine learning, we’ve been experimenting with scikit-learn and TensorFlow to see how far we can get.
I’ll admit that I was biased by the fact that I work at Fivestars, and most of our stack is in Python. That said, we’re working towards an increasingly service oriented architecture and so the language choice shouldn’t really matter. There is, for instance, a very compelling case to be made for working in Scala, which would further open up the spark.ml ecosystem to us. Coming to this as I was as a complete novice in machine learning, however, the array of tools available in Python was very appealing.
Both TensorFlow and Torch (Google’s and Facebook’s deep learning frameworks, respectively) are either written in Python or have Python bindings. Meanwhile, scikit-learn provides a rich library of traditional ML classifiers, vectorizers, and utility functions. There’s even spark.ml support which, while it isn’t as rich as what’s available from Scala, should at least be sufficient to familiarize myself with that framework (should we decide to move in that direction). Jupyter is also excellent, both as a tool for quickly iterating and experimenting (and documenting my progress) and as a way to present my work. While it does now have support for multiple languages, its integration with Python and with numpy/scipy/matplotlib in particular has been very useful.
Since this began mostly as an effort to educate myself about machine learning, the breadth of the available tools (coupled with my preexisting familiarity with the language) made Python an obvious choice.
There are really two problems here: identifying when we have a closed receipt (ie, the document that’s printed at the end of the transaction that has payment information on it) and identifying the subtotals and total on that receipt. The second felt like the harder problem, so I attacked it first. And so it was that I immediately ran up against the first problem in any machine learning task:
Source: “Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Task, Survey Says,” Gil Press, Forbes: https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/
Very early on it became clear that we were not going to make it far without accurately labeled, clean training/test corpora. This is a problem that has dogged us from the start — labeling documents by type (receipt / bill / credit card slip / other) is relatively easy, but how do we go about presenting the problem of labeling prices within a document in a way that a nontechnical person can clearly understand and act on?
We ended up building an internal labeling tool to solve this. Happily, prices (at least, in dollars) tend to have fairly consistent representations, which means we’ve so far been able to get away with matching price-like substrings in each document using a regular expression. Then, it’s just a question of labeling each price as “subtotal” or “total” (with the unlabeled prices being implicitly labeled as “other”). For each file, our tool stores a whole-document label (final receipt / credit card slip / bill / other text / non-text gibberish), as well as a label for each price (represented as the label, the file ID, and the beginning of the regex match as a byte index into the file). When three labelers have submitted responses for the same file, all responses where there was consensus are inserted into another table, whose contents can in turn be exported as CSV.
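Dollar prices are regular enough that a single pattern covers most cases. The exact expression we use in production isn’t shown here, so treat this as an illustrative sketch of the idea:

```python
import re

# Illustrative pattern, not our production regex: an optional dollar
# sign, 1-3 digits, optional comma-separated thousands groups, and
# exactly two decimal places.
PRICE_RE = re.compile(r"\$?\d{1,3}(?:,\d{3})*\.\d{2}")

receipt = "Subtotal  $12.50\nTax  $1.09\nTotal  $13.59\n"

# Each match records its offset into the document, which is how the
# labeling tool identifies a particular price within a file.
matches = [(m.start(), m.group()) for m in PRICE_RE.finditer(receipt)]
```

Storing the match offset (rather than the matched text) is what lets a label survive duplicate prices in the same document.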
So far, we’ve been relying on the kindness of our fellow engineers to beta test this tool (and thereby give us our first real data). It’s a good thing we did, too — sitting down and working through the documents we have has shown us just how messy some of them are, and how much cleaning work we have ahead of us (though we’ve made significant progress on that front in the last few weeks). It’s also shown us how ambiguous the borders between some of these categories are, which has led us to write a fairly comprehensive tutorial with a test at the end for new labelers.
We now believe that our tool is polished enough to open up to non-engineers, and so we intend to bring in temps over the coming weeks in the hopes of generating a 10,000-receipt labeled test set that we trust to be accurate.
We sort of got lucky: our first stab at this problem ended up being pretty successful, despite being somewhat simplistic. To wit:
1. First, divide each document into tokens using Python Lex/Yacc (guessing that most receipts are made up of simple whitespace-delimited tokens that are individually easy to match with simple regular expressions).
2. Treat the DOLLAR_AMOUNT tokens as our inputs, and extract features from the document around each token’s position that we think are sufficient to differentiate them.
3. Train a classifier on prices that have been manually labeled as total, subtotal, or neither.
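Our actual lexer is built with PLY; as a simplified stand-in, the same idea can be sketched with a single master regex whose named groups classify each token. The token set here is hypothetical and much smaller than our real grammar:

```python
import re

# Hypothetical token types, ordered from most to least specific.
TOKEN_SPEC = [
    ("DOLLAR_AMOUNT", r"\$?\d{1,3}(?:,\d{3})*\.\d{2}"),
    ("DATE",          r"\d{1,2}/\d{1,2}/\d{2,4}"),
    ("NUMBER",        r"\d+"),
    ("WORD",          r"[A-Za-z][A-Za-z'&.-]*"),
    ("SYMBOL",        r"\S"),
]
MASTER_RE = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def tokenize(text):
    """Yield (token_type, value, offset) triples for one receipt."""
    for m in MASTER_RE.finditer(text):
        yield m.lastgroup, m.group(), m.start()

tokens = list(tokenize("Total  $13.59  09/01/2017"))
```

The DOLLAR_AMOUNT tokens produced this way are the inputs to step 2.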
After a week or two of feature engineering and playing around with different classifiers, we saw 99% accuracy (+/- 1%) on the small labeled dataset I’d produced by hand. Amazingly, so far, those numbers have held up as our labeled dataset has grown to 1,300 receipts and 13,115 labeled prices. Through guesswork and trial and error, the features we’ve settled on are:
1. The three closest token types on either side of the price, with values representing their distance from said price. For instance, if a receipt contained the text
The price after the word “Total” might have the features,
2. The character position of the price within the document, represented as a number from 0 to 1 (so if the price appears exactly halfway through the document, the value of this feature would be 0.5)
3. Position in the sorted list of all price values in the document (as extracted by regex). So, if the price was the largest in the document, the value of this feature would be 1.0.
In our case, this results in a sparse feature vector with 34 elements.
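Features 2 and 3 are cheap to compute. A sketch, with hypothetical helper and argument names (this is not our production code):

```python
def positional_features(text, prices, target):
    """Two of the positional features for one candidate price.

    `prices` is a list of (offset, value) pairs for every regex-matched
    price in the document; `target` is the pair being scored.
    """
    offset, value = target
    # Feature 2: character position in the document, scaled to [0, 1].
    doc_position = offset / len(text)
    # Feature 3: rank among all price values, 1.0 for the largest.
    ranked = sorted(p[1] for p in prices)
    rank = (ranked.index(value) + 1) / len(ranked)
    return doc_position, rank

text = "Subtotal 12.50 Tax 1.09 Total 13.59"
prices = [(9, 12.50), (19, 1.09), (30, 13.59)]
features = positional_features(text, prices, (30, 13.59))
```

The rank feature is what lets the classifier learn, for example, that totals are usually the largest price on the page without hard-coding that rule.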
We’ve achieved our best results using sklearn’s RandomForestClassifier. Bumping the number of estimators up to 30 helps our accuracy slightly, but our error bars are still just under 1% and our accuracy still just over 99%.
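Training and scoring follow the standard scikit-learn pattern. A sketch, with synthetic random data standing in for our real 34-element feature vectors and labels:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-ins for our feature vectors and the
# total / subtotal / other labels.
rng = np.random.RandomState(0)
X = rng.rand(500, 34)
y = rng.randint(0, 3, size=500)

clf = RandomForestClassifier(n_estimators=30, random_state=0)
scores = cross_val_score(clf, X, y, cv=5)
# On real labeled prices this is where we see ~99% accuracy; on random
# labels like these it will of course hover near chance.
```

Cross-validation like this is also where the error bars quoted above come from: the spread of per-fold scores.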
This was supposed to have been the easier of the two problems! There’s (at least in theory) little feature engineering work to be done, and similar problems have been solved well many times in the last few decades. It was with this problem, however, that we first ran into difficulty.
The best approach we’ve found thus far has been:
1. Generate all the character n-grams of lengths 2 through 15 in each receipt.
2. Reduce the dimensionality of the resulting (enormous) sparse vectors down to 50 using TruncatedSVD (not strictly necessary, but it doesn’t significantly harm performance and it makes data visualization easier).
3. Train our model. Again, random forests have performed best.
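These three steps map directly onto a scikit-learn Pipeline. A sketch with the parameters described above; the documents and labels here are synthetic stand-ins:

```python
import random
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier

# Synthetic receipt-ish documents and document-type labels.
random.seed(0)
words = ["total", "subtotal", "visa", "cash", "tax", "thank", "you",
         "merchant", "copy", "amount", "$12.34", "$1.09"]
docs = [" ".join(random.choices(words, k=12)) for _ in range(80)]
labels = [random.randint(0, 3) for _ in range(80)]

pipeline = Pipeline([
    # Step 1: all character n-grams of lengths 2 through 15.
    ("ngrams", CountVectorizer(analyzer="char", ngram_range=(2, 15))),
    # Step 2: squash the enormous sparse vectors down to 50 dimensions.
    ("svd", TruncatedSVD(n_components=50, random_state=0)),
    # Step 3: random forests have performed best for us so far.
    ("clf", RandomForestClassifier(n_estimators=30, random_state=0)),
])
pipeline.fit(docs, labels)
predictions = pipeline.predict(docs)
```

Packaging the steps as a Pipeline keeps the vectorizer and SVD fit only on training folds during cross-validation, which avoids leaking test data into the features.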
Unfortunately, here we are running up against the poor quality and insufficient quantity of our labeled data, since we have far fewer whole documents labeled than we have prices within them. I’ve tried multiple different approaches for generating features in the hopes that they might improve our results (including transforming our vectors with tf-idf, using the custom lexer we built for the price classification problem, and using word n-grams) but the best performance was achieved using the approach listed in step 1, which has yielded approximately 96% accuracy (+/- 2%).
The (relatively) high variance is, I suspect, largely the result of sampling bias. The poorer-than-expected accuracy, though, probably results both from messy input data and from poor labeling. One of the issues we encountered as we were labeling documents (as receipts, credit card slips, bills, and ‘other’) was that there is a great deal of overlap and ambiguity, and creating a comprehensive rule set for labelers has taken time. As a result, we expect that at least some of the labels we have are inaccurate, and are polluting our results. Also, thus far we have been dogfooding our labeling tool inside the company and relying on people to spend some of their (limited) free time labeling documents for us. In order to have any kind of dataset to work with, we’ve reduced the number of people who have to agree on each label to two.
This project has brought to light just how dirty the data we’ve been collecting has been. As previously stated, our client integrates with the Windows print spooler, which means (among other things) that we need to have a good understanding of the intermediary format that Windows uses for print jobs. EMF (the format in question) is not particularly well documented, and the free libraries that exist to parse it are sporadically maintained at best. More to the point: the text in each receipt tends to be split into separate text fields with (x, y) coordinates within the document. Reconstructing a text document that resembles the original printed receipt as closely as possible has been a somewhat obnoxious problem, and is one area where we continue to iterate and improve.
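Reconstructing a plausible text layout from positioned text records boils down to sorting and grouping. A much-simplified sketch (field names hypothetical; real EMF jobs also need font metrics, scaling, and so on):

```python
def reconstruct(fields, line_tolerance=2):
    """Rebuild receipt text from (x, y, text) records pulled from a
    print job. Records whose y coordinates fall within `line_tolerance`
    of each other are treated as the same line.
    """
    lines = []  # list of (line_y, [(x, text), ...])
    for x, y, text in sorted(fields, key=lambda f: (f[1], f[0])):
        if lines and abs(y - lines[-1][0]) <= line_tolerance:
            lines[-1][1].append((x, text))
        else:
            lines.append((y, [(x, text)]))
    return "\n".join(
        " ".join(t for _, t in sorted(parts)) for _, parts in lines
    )

fields = [(120, 10, "$13.59"), (0, 10, "Total"), (0, 0, "Joe's Diner")]
layout = reconstruct(fields)
```

The line-tolerance grouping matters because real print jobs rarely place fields on exactly the same baseline.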
Our client is also able to install a custom serial port driver that allows us to monitor serial traffic (for serial printers that bypass the print spooler). Extracting text here is much easier (since typically the client software is transmitting ASCII text directly to the printer), but the stream almost always contains escape sequences that are proprietary to that printer (e.g., to instruct the printer to move its print head, insert symbols, embed binary data like images, etc.). Historically, we’ve mostly ignored these (since our regexes can be sculpted to mostly deal with them), but they provide enough noise to confuse our model.
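Filtering these sequences out is largely regex work. Real command tables are proprietary and vary per printer model, so the two-byte ESC/GS pattern below is illustrative only:

```python
import re

# Many thermal printers prefix commands with ESC (0x1b) or GS (0x1d)
# followed by a command byte; parameter bytes vary per command, so a
# real filter needs per-printer knowledge. This sketch just drops the
# two-byte commands and any orphaned control bytes.
ESC_CMD = re.compile(rb"[\x1b\x1d].", re.DOTALL)
CONTROL = re.compile(rb"[\x00-\x08\x0b-\x1f]")  # keep \t and \n

def strip_escapes(raw):
    return CONTROL.sub(b"", ESC_CMD.sub(b"", raw))

raw = b"\x1b@Total  $13.59\x1d!\x01\n"
clean = strip_escapes(raw)
```

Even a crude filter like this removes a lot of the noise that was confusing the document classifier.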
That said, visualizing our data using t-SNE shows that similar things do end up clustered together nicely, which gives me hope that our accuracy will improve with better labeling and more data.
Jupyter’s excellent matplotlib integration also makes it very easy to see what our outliers look like.
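A minimal sketch of this kind of interactive plot, with random stand-in data in place of our real TruncatedSVD vectors and hypothetical filenames:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Stand-ins for the (n_docs, 50) SVD output, the per-document labels,
# and a mapping from row index back to source file.
rng = np.random.RandomState(0)
vectors = rng.rand(40, 50)
labels = rng.randint(0, 4, size=40)
filenames = [f"receipt_{i}.txt" for i in range(40)]

embedded = TSNE(n_components=2, perplexity=10,
                random_state=0).fit_transform(vectors)

fig, ax = plt.subplots()
ax.scatter(embedded[:, 0], embedded[:, 1], c=labels, picker=True)

def onpick(event):
    # event.ind holds the row indices of the clicked point(s); print
    # the file(s) that generated them so outliers can be inspected.
    for i in event.ind:
        print(filenames[i])

fig.canvas.mpl_connect("pick_event", onpick)
plt.show()
```

In a notebook with an interactive backend, clicking a cluster dumps the offending filenames straight into the output cell.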
Note the event handler onpick(): clicking on any point in the plot shows which file(s) generated that point. Inspecting one highlighted cluster this way revealed that the one thing all of those disparate documents have in common is that every newline begins with a serial escape sequence. We’ve already made significant progress understanding and filtering these out, and so this at least gives us hope that our results will improve as new data that have been cleaned before labeling land in our training and test sets.
Well, that’s an open question! Everything we’ve done so far has been to solve the one specific case of Autopoints. Moving beyond that to the general problem of receipt data extraction will likely require a more sophisticated approach. Receipt parsing is an interesting problem, falling as it does somewhere between computer vision (receipt organization is very spatial) and natural language processing. So, is this a problem best approached with a convolutional neural network, or with a recurrent network such as an LSTM? How best should we represent our data (which have extremely variable length) as inputs to a neural net? How will we approach labeling when we need to label entire two-dimensional regions of our text documents, particularly when those regions may be nested? And what of the small percentage of receipts that do arrive at our servers as image data?
Meanwhile, the code we have thus far mostly exists in Jupyter notebooks and hastily hacked-together Python modules. What does a service built on the knowledge we’ve gained look like? Might something like spark.ml help us scale to handle the approximately 250,000 receipts per day we’re currently dumping into S3?
These are questions we’re only just now beginning to play around with, and which I look forward to solving in the coming months at Fivestars.