Automated document processing at Alan
Alan is mostly known for its digital health insurance product. The purpose of a health insurer is to cover users for their health expenses. The health insurer needs to be informed that user A went to doctor B and spent amount X. In France, this happens mostly automatically when the user swipes their social security card. Still, there are many cases where documents need to be sent by the users themselves. Alan users can send their health insurance documents via their mobile app. We receive thousands of them every day.
Behind the scenes, Alan takes care of processing the document and determining how much to reimburse the user. Our goal is to delight the user and reimburse them the correct amount as fast as possible. This provides many opportunities for automation. In particular, our data science team works on parsing the necessary information from the documents automatically.
The life of a document
Processing a health insurance document automatically can be broken down into a sequence of steps. First we have to convert the document to a text representation so that the computer can read it. This allows us to extract particular pieces of information with natural language processing techniques. Once this is done, we can determine how much to reimburse the user who uploaded the document.
Uploading the document
The first step is to ask the user to take a picture and provide them with a good UX to do so. This may seem obvious, but this has a very large impact on the quality of the document and thus our ability to parse it. We can’t do much if the user takes a bad picture. In other words: garbage in, garbage out.
The image above is a typical example of an osteopathy invoice we might receive from a user. The pieces of information that matter are: the invoice amount (55 euros), the date the care was delivered (1st of June 2021), the name of the doctor (here we’ve anonymised it), and the name of the patient (anonymised too). Of course, extracting this information is easy for a human, but it’s quite difficult for a computer!
Optical character recognition
Document processing is made up of many tasks that fall into the realm of natural language processing (NLP). But before we can do any of that, we have to convert the documents into text. For PDFs, PDFMiner does the job very well. For images, an optical character recognition (OCR) system is necessary.
Like many other companies, we use third party OCR systems, namely Amazon Textract and Google Cloud Vision. We use both because we noticed that they exhibit different strengths and weaknesses. For instance, we found that Amazon Textract was better at recognising handwritten text. Meanwhile, the OCR from Google Cloud Vision is more accurate at placing bounding boxes around words.
Relying on third-party OCRs is a sensitive topic, because the documents they process may contain personal and sensitive information. Of course, we have signed legally binding contracts to ensure these third parties cannot exploit the contents of the documents, even just for training their models. We have the ambition to build our own OCR technology one day, but at the moment it doesn't make sense for us to pour resources into such an endeavour.
Document classification
Before extracting information from a document, we need to determine what type of document we're looking at, because the document type determines which pieces of information we need to extract.
We found that a straightforward TF-IDF + multinomial logistic regression pipeline does the trick, because health insurance documents are usually easy to classify. A typical osteopathy invoice includes the words "osteopathy" or "osteopathe". A pharmacy prescription is likely to contain "mg" for "milligram". The precision of our model is well above 98%, and the recall is very high too.
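To make this concrete, here is a minimal sketch of such a pipeline with scikit-learn. The training snippets and labels below are toy examples, not our real data.

```python
# A toy TF-IDF + multinomial logistic regression pipeline.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# OCR'd text snippets with their (hypothetical) document types.
texts = [
    "facture osteopathie seance 55 euros osteopathe",
    "consultation osteopathe reglement 60 euros",
    "doliprane 1000 mg boite ordonnance pharmacie",
    "ibuprofene 400 mg pharmacie ordonnance",
]
labels = [
    "osteopathy_invoice",
    "osteopathy_invoice",
    "pharmacy_prescription",
    "pharmacy_prescription",
]

# TF-IDF features fed into a logistic regression classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

print(model.predict(["seance osteopathie 50 euros"])[0])
```

On real documents the vocabulary is of course much larger, but the pipeline structure stays this simple.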
With regards to model deployment, we opted for a simple yet often overlooked approach. Rather than introducing additional servers and APIs to serve our models, we simply load each model in memory side-by-side with our web application. To lighten the memory footprint, we prune each model of its non-significant parts. We like to keep things simple.
Information extraction
The next step is to extract specific pieces of information from the text; this task is called information extraction. We have a bunch of text and want to pull out dates, identifiers, currency amounts, etc. The text we have is noisy because OCRs make mistakes, especially on handwritten text. Moreover, we can't really use a supervised approach such as named-entity recognition, because labelling the contents of our documents would be too expensive.
We decided on an unsupervised rule-based approach. We start by defining a list of regular expressions where each expression focuses on a specific pattern. For dates, one pattern will look for DD/MM/YYYY, another will look for DD/MM and assume the year is the current year, et cetera.
Sometimes the OCR misinterprets a slash "/" as a "1" or an "l". Likewise, "rn" might become "m". These visually similar characters are called homoglyphs, and our extractors are designed to handle these kinds of mistakes. The goal is to establish a list of pattern extractors that together have a high recall. Because we have a historical database of manual parsings made by humans, we can measure the performance of our extraction process.
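The date extractors described above can be sketched as follows. The patterns and the homoglyph handling are illustrative, not our production rules.

```python
# Rule-based date extraction that tolerates common OCR homoglyphs.
import re
from datetime import date

# The OCR sometimes reads "/" as "1" or "l", so the separator class
# accepts those homoglyphs alongside "/", "." and "-".
SEP = r"[/1l.\-]"
FULL_DATE = re.compile(r"\b(\d{2})" + SEP + r"(\d{2})" + SEP + r"(\d{4})\b")
# DD/MM with no year; the lookahead skips matches that are really
# the first half of a full DD/MM/YYYY date.
DAY_MONTH = re.compile(r"\b(\d{2})" + SEP + r"(\d{2})\b(?!" + SEP + r"\d)")

def extract_dates(text: str, today: date) -> list:
    """Return every date candidate found in the OCR'd text."""
    candidates = []
    for d, m, y in FULL_DATE.findall(text):
        try:
            candidates.append(date(int(y), int(m), int(d)))
        except ValueError:  # e.g. "99/99/2021" matches the pattern
            pass
    for d, m in DAY_MONTH.findall(text):
        try:
            # Assume the current year when the year is missing.
            candidates.append(date(today.year, int(m), int(d)))
        except ValueError:
            pass
    return candidates
```

Each pattern stays small and testable, and recall comes from the union of many such extractors.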
We run many patterns to find pieces of information in the text. This generates a list of candidates. We then need some mechanism to decide which candidate is the correct one. Indeed, we might find multiple dates, and only one of them is the date where the document was issued. Likewise for currency amounts.
We use heuristics, validated empirically, to choose between candidates. For instance, we know that 95% of the time an osteopathy invoice amount is a round number between 40 and 90 euros, so amounts that do not satisfy these criteria are discarded. If multiple candidates remain, we ask a human to make the final decision. For dates, we select the one closest to the date the document was uploaded, and ask for human help if there are multiple likely dates.
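The selection heuristics above could be sketched like this, with `None` standing in for "send to a human". The tie-breaking rule for dates is our own reading of the logic, not a verbatim transcription.

```python
# Candidate-selection heuristics for osteopathy invoices.
from datetime import date
from typing import Optional

def pick_amount(candidates: list) -> Optional[float]:
    """Keep round amounts between 40 and 90 euros; defer to a human otherwise."""
    plausible = [a for a in candidates if 40 <= a <= 90 and a == int(a)]
    # Exactly one plausible amount wins; anything else goes to a human.
    return plausible[0] if len(plausible) == 1 else None

def pick_date(candidates: list, uploaded_at: date) -> Optional[date]:
    """Pick the candidate closest to the upload date; a tie means ask a human."""
    if not candidates:
        return None
    ranked = sorted(candidates, key=lambda d: abs((d - uploaded_at).days))
    if len(ranked) > 1 and abs((ranked[0] - uploaded_at).days) == abs(
        (ranked[1] - uploaded_at).days
    ):
        return None  # two equally likely dates
    return ranked[0]
```

Because the heuristics are plain functions, each one can be backtested against the historical database of manual parsings.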
In the case of osteopathy invoices, we are able to extract all the necessary information with no human intervention in 50% of cases. For the other half, the OCR struggles to decipher the doctor’s handwriting.
Reimbursing the user
Once we’ve extracted the necessary information, we can decide how much to reimburse the user. This depends not only on the amount(s) specified on the document, but also on the user’s coverage plan, their age, the kind of doctor they’ve consulted, etc. All this is determined in our claim engine. The latter spits out an amount that will be wired to the user’s bank account, and a notification will be sent to the user’s phone once this has happened.
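A deliberately simplified sketch of that last step: the plan names and caps below are invented for illustration, and the real rules involve far more inputs (age, practitioner type, the social security share, and so on).

```python
# Toy claim engine: reimburse the invoiced amount up to a plan cap.
# Hypothetical per-plan caps for an osteopathy session, in euros.
OSTEOPATHY_CAPS = {"basic": 25, "comfort": 50}

def reimbursement(invoice_amount: float, plan: str) -> float:
    """Return the amount to wire, capped by the user's plan."""
    return min(invoice_amount, OSTEOPATHY_CAPS[plan])
```

For the 55-euro invoice from earlier, the hypothetical "comfort" plan would reimburse min(55, 50) = 50 euros.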
Most of the time it takes less than a minute to go from the moment a user uploads a document to the moment they get reimbursed. This creates a wow effect that delights our users, and they don’t shy away from telling us!
Document processing is and will continue to be an important topic at Alan. There are still many types of documents that we have yet to automate; some have complex structures and require more sophistication. What's more, we are opening up our health insurance offer to more countries, which means new kinds of documents to parse.
This is a challenging technical topic, with many opportunities to apply various kinds of artificial intelligence algorithms. We made the bet to start simple and put our work into production as fast as possible, so that we could learn from real usage. So far this bet has paid off. Now we want to iterate and build a more advanced document processing system that can deal with many more kinds of documents.
We’re definitely looking to expand our expertise in this area. If this post has piqued your interest, please consider applying for a job!