Working on Document Processing, early 2023

Hubert Jaouen
Alan Product and Technical Blog
May 25, 2023

At Alan, we currently receive ~17k documents every week from our members: invoices, prescriptions, documents from the French administration (especially the Sécurité Sociale), etc. This number scales directly with our member base. It represents a few million euros per year of processing by humans.

Document processing is a critical part of our operational stack. We need to understand the information in documents accurately in order to take timely action: reimburse a member, update a status somewhere in our backend…

We want to process as many of these documents as possible automatically, because automation has 2 strong advantages:

  • It’s cheaper, so it allows Alan to offer the best prices to our customers.
  • It’s faster, so it provides a superior user experience and reduces members’ mental load. It creates delight.

What we started from, mid-2022

Up to mid-2022, mainly by automating one very frequent and easy type of document (osteo invoices), we were able to automate ~33% of documents. This is our main KPI. We like that it is simple.

Another KPI is the error rate among automated documents. We are fine with some errors as long as their impact is under control. We certainly do not want to be wrong on crucial operations. For instance, it is ok if we over-reimburse members a little in some difficult parsing cases — as reducing the error rate to ~0 would end up being more expensive to Alan than allowing for these errors. We want to keep the error rate below 5%.

Our past technical approach has been detailed in a previous blog article here.

TL;DR: when we received a document:

  • We used an OCR service (Google or Amazon) to transcribe it
  • We classified it using a pruned logistic regression on the transcription
  • We applied regexes to extract information (for the document types in scope)
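
For illustration, the whole legacy flow fit in a few lines of this kind (the function, category name and regex here are hypothetical, not our production code):

```python
import re

# Hypothetical sketch of the legacy flow: the classifier is a
# scikit-learn-style pipeline (e.g. TF-IDF + pruned logistic regression)
# trained offline; the regexes are per-category extraction rules.
AMOUNT_RE = re.compile(r"(\d+[.,]\d{2})\s*(?:€|EUR)", re.IGNORECASE)

def process_legacy(ocr_text: str, classifier) -> dict:
    category = classifier.predict([ocr_text])[0]
    if category == "osteo_invoice":  # the one frequent, easy category
        match = AMOUNT_RE.search(ocr_text)
        if match:
            return {"category": category,
                    "amount_eur": float(match.group(1).replace(",", "."))}
    # Everything else fell through to human processing.
    return {"category": category, "needs_human": True}
```

Every new document type meant new regexes to write and maintain, which is what made this approach hard to scale.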

We had made the conscious decision to stop our efforts there. At Alan, prioritization is key. We considered that 33% was good enough back then, and switched our focus to another project.

End of 2022: Back to it!

In Q3 2022, new priorities arose in the company, as illustrated by the motto “Profitable by 2025, delightful always”. In the current economic context and given Alan’s maturity, we started planning more precisely for profitability in the short and medium term, while continuing to increase delight in our product. Automating document processing was identified as a main lever.

Consequently, we re-created a cross-functional team (a “crew”) to process as many documents automatically as possible. The team has 3 main contributors: 1 engineer (Nicolas Fortin), 1 ops (Lætitia Lefebvre-Naré) and 1 data scientist (Hubert Jaouen, the author of this post). Our CTO Charles Gorintin also contributed at different stages of the project.

In order to meet our cost-to-serve ambition, we set a goal of automating 60% of all documents by the end of 2023. We bet on the rise of new ML techniques to get us there, knowing that the regex approach did not scale and was too hard to maintain.

Transformers to the rescue!

Let’s dive into the new approach that the crew came up with.

Disclaimer: we’ll focus on what we have today. But keep in mind this is a constantly evolving topic, so this blog post may be outdated soon.

There are 2 kinds of documents: simple vs complex. Let’s treat them separately.

Simple documents

Simple documents are documents for which

  • Our expectations are simple: we need to extract only a few very well-defined, straightforward elements
  • The document layout is simple: the elements we are searching for can be found directly in the document, presented in an easy-to-understand and exhaustive manner.

For these, we found that simply using GPT with a good prompt works very well for the parsing part!
We perform the OCR using Google/Amazon, we categorize and subcategorize the document (using our old logistic regression, though here too we are considering switching to GPT), and depending on the outcome, we use a specific prompt to parse out the data we need.
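
Here is a minimal sketch of that parsing step against an Azure OpenAI deployment, using the pre-1.0 openai client; the deployment name, prompt wording and output fields are illustrative, not our production prompts:

```python
import json
import os

import openai

# Azure OpenAI configuration (endpoint and deployment name are illustrative)
openai.api_type = "azure"
openai.api_base = os.environ["AZURE_OPENAI_ENDPOINT"]  # West Europe instance
openai.api_key = os.environ["AZURE_OPENAI_KEY"]
openai.api_version = "2023-05-15"

PROMPT = """You will receive the OCR transcription of a contact lenses
prescription. Return a JSON object with the keys:
"prescriber_name", "prescription_date" (YYYY-MM-DD), "patient_name".
Return null for any field you cannot find. Transcription:
{transcription}"""

def parse_simple_document(transcription: str) -> dict:
    response = openai.ChatCompletion.create(
        engine="gpt-35-turbo",  # name of the Azure deployment
        temperature=0,          # deterministic extraction
        messages=[{"role": "user",
                   "content": PROMPT.format(transcription=transcription)}],
    )
    return json.loads(response["choices"][0]["message"]["content"])
```

Each document subcategory gets its own prompt, so adding a new simple document type is mostly a matter of writing and testing a new prompt.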

We managed to increase the automation rate dramatically. For example, we reached ~90% for contact lens prescriptions, ~80% for pharmacy prescriptions, ~70% for glasses prescriptions, ~75% for chiropractor invoices… All of this with minimal effort compared to any other ML method.

Of course, we leverage a version of GPT hosted on Azure in West Europe that is compliant with sensitive health data (HDS certified and GDPR compliant).

Note that handwriting sometimes remains an issue at the OCR step. Ideally, we would give the image directly to GPT along with the prompt, which should be possible with a multimodal version of the model.

Complex documents

Complex documents are another beast: we need to extract a lot of information, and not everything is simple to find, or even explicit, in the doc. We often need to make inferences.

For example, a pharmacy invoice is complex. We need to extract a lot of data. Additionally, each invoice is incomplete in its own way: some may not mention fields like tax rates (because they can be inferred from a standard code), while others won’t mention medicine codes (because they can be inferred from the medicine name), etc.

The pipeline we landed on for complex documents is the following (a code sketch follows the list):

  • OCR (Google or Amazon)
  • Categorization / Subcategorization (logistic regression & GPT)
  • Named Entity Recognition
  • Entity Relationship Extraction
  • Post Processing, implementing a lot of business logic
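
Strung together, the stages look roughly like this; it is a sketch where the stage functions are illustrative placeholders for the components detailed below:

```python
from typing import Callable

def make_pipeline(ocr: Callable, categorize: Callable, ner: Callable,
                  group: Callable, post_process: Callable) -> Callable:
    """Compose the five stages into a single document processor."""
    def process(image_bytes: bytes) -> dict:
        words, boxes = ocr(image_bytes)            # 1. OCR (Google or Amazon)
        category = categorize(" ".join(words))     # 2. (sub)categorization
        entities = ner(image_bytes, words, boxes)  # 3. NER (LayoutLM)
        care_acts = group(entities)                # 4. relationship extraction
        return post_process(category, care_acts)   # 5. business rules
    return process
```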

Named Entity Recognition

We use the LayoutLM family of models. This multimodal Transformer from Microsoft Research is available on HuggingFace. It takes as input the result of the OCR (words + bounding boxes) as well as the image of the document, and returns entities as output: prices, tax rates, drug names…
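
A minimal usage sketch with HuggingFace transformers, taking LayoutLMv3 as one member of the family; the public base checkpoint and the label count are illustrative, not our fine-tuned model:

```python
import torch
from PIL import Image
from transformers import LayoutLMv3ForTokenClassification, LayoutLMv3Processor

# Public base checkpoint shown for illustration; in production this would
# be a model fine-tuned on our labeled documents, with real entity labels.
processor = LayoutLMv3Processor.from_pretrained(
    "microsoft/layoutlmv3-base", apply_ocr=False  # we bring our own OCR
)
model = LayoutLMv3ForTokenClassification.from_pretrained(
    "microsoft/layoutlmv3-base", num_labels=5  # e.g. O, PRICE, TAX_RATE, ...
)

def predict_entities(image: Image.Image, words: list[str],
                     boxes: list[list[int]]) -> list[str]:
    """boxes are word bounding boxes normalized to the 0-1000 range."""
    encoding = processor(image, words, boxes=boxes, return_tensors="pt")
    with torch.no_grad():
        logits = model(**encoding).logits  # (1, seq_len, num_labels)
    predictions = logits.argmax(-1).squeeze(0).tolist()
    return [model.config.id2label[p] for p in predictions]
```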

To train the model, we needed labeled data. Unfortunately, our existing parsed outputs could not be used for this purpose! Indeed, our team of human operators applies “hidden” extra logic on top of the raw information extracted from the document: inferences, transformations, business rules…

We had to label data ourselves. Ops took on this task and involved external operators, which significantly sped up the process. Finding the right tool was difficult because our model is multimodal, and few tools have great multimodal annotation UX. We ended up picking Scale.ai after a timeboxed benchmark, because we felt it provided the best UX (leading to faster, cheaper annotation) as well as the best integration with our technical stack. Scale.ai was also able to adapt to our very specific and demanding data security requirements.

We use Amazon Sagemaker to handle model training, hosting and running inferences. Alan is already using AWS, so that made things easier. On top of that, the team appreciated that Sagemaker allowed us to focus on impact instead of infrastructure. Running inferences is as simple as the code below:
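
A minimal sketch with boto3, where the endpoint name and payload schema are illustrative:

```python
import json

import boto3

runtime = boto3.client("sagemaker-runtime")

def run_inference(words: list[str], boxes: list[list[int]]) -> dict:
    """Call the deployed NER model; name and payload are illustrative."""
    response = runtime.invoke_endpoint(
        EndpointName="layoutlm-ner",
        ContentType="application/json",
        Body=json.dumps({"words": words, "boxes": boxes}),
    )
    return json.loads(response["Body"].read())
```

The heavy lifting (model loading, scaling, GPU management) stays on the SageMaker side.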

Entity Relationship Extraction

Once we have identified entities in the document, we need to group together the ones related to the same action by the health professional, which we call a care act. For example, we need to know that the price “10 euros” and the tax rate “10%” relate to care act A, while the price “20 euros” and the tax rate “20%” relate to care act B.

Fortunately, related entities are almost always on the same line in all our health documents, which makes our task easier! Yet geometry is often altered when scanning a document, so straight lines can end up distorted. We found layout detection from AWS to be pretty bad at dealing with this problem.

Thus, we need to recognize lines in the document. To do so, we decided to represent lines going through our entities as quadratic functions.

We need to fit these lines optimally in the doc. It’s a simple optimization problem:
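
One way to pose it, sketched below: treat each line as a quadratic y = ax² + bx + c, assign each entity center to the curve passing closest to it, and refit each curve by least squares, alternating the two steps k-means style. The initialization and stopping rule here are assumptions, not our exact formulation:

```python
import numpy as np

def fit_lines(points: np.ndarray, n_lines: int, n_iters: int = 20) -> np.ndarray:
    """Fit n_lines quadratics y = a*x^2 + b*x + c to entity centers.

    points: (N, 2) array of (x, y) entity centers from the OCR boxes.
    Returns (n_lines, 3) polynomial coefficients, highest degree first.
    """
    x, y = points[:, 0], points[:, 1]
    # Initialize each curve as a flat line at an evenly spaced height.
    heights = np.linspace(y.min(), y.max(), n_lines)
    coeffs = np.array([[0.0, 0.0, h] for h in heights])

    for _ in range(n_iters):
        # Assignment step: each point goes to the curve with the smallest
        # vertical distance to it.
        preds = np.stack([np.polyval(c, x) for c in coeffs])  # (n_lines, N)
        assign = np.abs(preds - y).argmin(axis=0)
        # Refit step: least-squares quadratic through each curve's points.
        for k in range(n_lines):
            mask = assign == k
            if mask.sum() >= 3:  # need at least 3 points for degree 2
                coeffs[k] = np.polyfit(x[mask], y[mask], deg=2)
    return coeffs
```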

For example, for an invoice, we end up with a set of fitted curves, one per care act, each passing through the entities of that act.

Post Processing

Once we have the information from the doc, we may need to post-process it: make inferences in order to obtain all the information we need given its category, and apply business rules.

As a simple example, the pharmacy invoice described above does not mention the price including tax (TTC) for each care act, but we do require this information on our side. Therefore, we compute it by multiplying the pre-tax price (HT) by 1 + the VAT rate (TVA): for instance, 10 € HT with a 10% VAT rate gives 11 € TTC.
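
In code, with money handled as decimals; the half-up rounding rule is an assumption:

```python
from decimal import ROUND_HALF_UP, Decimal

def price_ttc(price_ht: Decimal, tva_rate: Decimal) -> Decimal:
    """Price including tax from pre-tax price and VAT rate (e.g. 0.10)."""
    return (price_ht * (1 + tva_rate)).quantize(
        Decimal("0.01"), rounding=ROUND_HALF_UP
    )

assert price_ttc(Decimal("10.00"), Decimal("0.10")) == Decimal("11.00")
```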

Sometimes, the inferences are much harder. For pharmacy invoices, we actually need to integrate with Vidal services in order to know the amount that the Sécurité Sociale reimburses for some of the drug IDs that we parse.

Results

Thanks to this method for complex documents, we currently reach an automation rate of, for example, ~50% for pharmacy invoices. We still have some optimizations in mind that could push it further.

Next steps: scale it up!

Now that these 2 methods have been validated, we need to scale them to many more document categories. Each has its own subtleties, especially regarding the post-processing.

We also have other important workstreams in parallel, like improving our categorization step, or exploring new developments in the LLM world.

We are willing to invest time exploring new ideas, even if it might lead us to deprecate recent work: things are moving fast and we need to be flexible and adapt. Calibrating our appetite for exploitation vs exploration in the crew is a key decision.

Among the ideas we find interesting: fine-tuning GPT, enhancing GPT with dynamically retrieved examples that resemble the incoming document (using embedding similarity), trying multimodal GPT-4 out of the box when available with proper security and privacy features, and many more.

If you’ve read this far, you’re passionate about using Machine Learning for concrete impact. You should reach out! We’re always hiring exceptional people.
