Fake it until you make it

Felix Le Chevallier
Published in Lifen.Engineering
4 min read · Nov 16, 2018


Our first AI — © commitstrip.com

How we built our AI to help operations and improved it until it became good enough to be consumer-facing.

At Lifen, we want to streamline medical communication, and we are focused on fixing how medical reports are shared between healthcare professionals.

Back in 2015, the typical process for sharing medical reports would be something like this:

  1. Type your report and manually put it inside an envelope
  2. Send it to your recipients via postal mail
  3. Your recipients would then scan your document and manually put it in their Electronic Health Record (EHR)

Healthcare IT is not exactly an agile environment. The market is fragmented, with decades-old providers that are not ready to provide easy APIs for sharing structured health data. Users are not always tech-savvy, and they want (and should be able!) to spend most of their time actually healing people instead of typing on a computer.

Long story short: we found our product-market fit by relying on something every user and every EHR knows how to do: (virtual) printing.

The way it works is simple: we install a remote printer on our users' infrastructure and receive PDFs whenever they send a print job. We are then left with one task: extracting the metadata needed to send the document:

  • Recipients and their postal addresses (which we mostly use to match practitioners against our directory; 8 out of 10 documents are eventually sent via secure online health messaging systems)
  • The patient and their birthdate
  • The document type (Consultation Note, Hospital Discharge Summary, Surgical Operation Note, …)
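The metadata listed above can be sketched as a simple record. This is an illustrative model only; the field names and types are hypothetical, not Lifen's actual schema:

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional

@dataclass
class Recipient:
    name: str
    postal_address: str
    # Set when the practitioner is matched against the directory,
    # enabling delivery via secure online health messaging.
    secure_messaging_id: Optional[str] = None

@dataclass
class DocumentMetadata:
    patient_name: str
    patient_birthdate: date
    document_type: str  # e.g. "Consultation Note"
    main_recipient: Recipient
    secondary_recipients: List[Recipient] = field(default_factory=list)

meta = DocumentMetadata(
    patient_name="Jane Doe",
    patient_birthdate=date(1970, 1, 1),
    document_type="Hospital Discharge Summary",
    main_recipient=Recipient("Dr. Martin", "12 rue X, 75001 Paris"),
)
```

None of this structure survives the print-to-PDF step, which is exactly the problem described below.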

As with the old print-and-mail way of communicating, this destroys all metadata that may have been contained in the original EHR. We have to find structured information hidden in the text.

We quickly built a back office that allowed us to fill in this metadata. We started with simple account-specific heuristics, mostly regexes and positional rules. For instance, we kept track of the coordinates of every address we found in our documents: for each address, we stored the tuples {xmin, ymin} (upper-left corner) and {xmax, ymax} (bottom-right corner) of the smallest box containing the address on the A4 page. We then ran postal-code and doctor-title regexes inside the most common bounding box to find the name of the recipient.
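A minimal sketch of that positional heuristic, with made-up coordinates and deliberately simplistic regexes (real address parsing is much messier):

```python
import re
from statistics import median

# Coordinates in points on an A4 page (595 x 842), origin at the top-left.
# Hypothetical {xmin, ymin, xmax, ymax} boxes of past recipient addresses.
past_address_boxes = [
    (360, 60, 560, 140),
    (350, 55, 555, 150),
    (370, 70, 565, 145),
]

def typical_box(boxes):
    """Median of each corner coordinate: a robust 'most common' box."""
    return tuple(median(b[i] for b in boxes) for i in range(4))

POSTAL_CODE = re.compile(r"\b\d{5}\b")          # French postal codes
DOCTOR_TITLE = re.compile(r"\bDr\.?\s+[A-Z]\w+")  # toy doctor-title pattern

def find_recipient(words, box):
    """words: list of (text, x, y) tuples. Keep only words that fall inside
    the box, then look for a postal code and a doctor title among them."""
    xmin, ymin, xmax, ymax = box
    inside = " ".join(t for t, x, y in words
                      if xmin <= x <= xmax and ymin <= y <= ymax)
    if POSTAL_CODE.search(inside):
        m = DOCTOR_TITLE.search(inside)
        return m.group(0) if m else None
    return None

words = [("Dr.", 380, 80), ("Martin", 420, 80),
         ("75001", 380, 110), ("Paris", 430, 110),
         ("Report", 50, 400)]
print(find_recipient(words, typical_box(past_address_boxes)))  # → Dr. Martin
```

Heuristics like this break down as soon as a new account uses a different letter template, which is what motivated the move to learned models.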

Bounding boxes for recipient addresses in A4 PDFs. Red dots are the upper-left corners and green ones are the bottom-right corners. Most addresses naturally sit in the upper-right part of the document.

As we were experiencing strong growth (we recently reached the 1 million documents milestone!), we needed a way to scale operations beyond the few accounts where we could find reliable heuristics, so we built an AI for these tasks. After processing our first couple of thousand documents, we had a labelled database that could pave the way for supervised algorithms. Our heuristics quickly became the core features of the algorithms that detect the patient's first and last name, the document type, the main recipient, and the secondary recipients.

Fast forward 18 months: we now send the vast majority of our clients' documents without human supervision. We favour precision over recall, since we cannot afford to send a document to the wrong recipient. We are slowly raising recall while maintaining precision greater than human performance, through rigorous quality assessment and sanity checks.

Thanks to six machine learning algorithms (Extra-Trees and LightGBM, for those interested) and one NLP syntactic address parser, trained daily to keep up with our rate of new users (on-premise training, since we are dealing with highly sensitive data), we now serve a few tens of thousands of predictions in production per day.

We are in the process of deploying a new version of our app that lets users validate our predictions in real time and send their documents themselves.

It’s clearly a win-win: we reduced processing time for our users. They instantly see the information extracted from their document, and they don’t have anything to change: for the most part, they just validate the results (bonus: they are doing the QA!). On our side, we have less manual work to do, and we can focus on enhancing our product instead of doing QA.

As for our users, they can go back to focusing on healing patients :)

Thanks to

for his help building a robust on-premise CI/CD training pipeline, and to Flavien Gilles, our ex-intern and now a full-time employee, for joining the team and helping tackle this fun challenge. We have many open positions (including machine learning roles), so feel free to drop us a line if you want to help us achieve our mission!
