How to extract information from a medical prescription?

Guillaume Barthe
MonOrdo
Published Mar 2, 2021 · 6 min read

Introduction

During my final year of study at EPITECH in the “Grande Ecole” program, I joined the MonOrdo team in September 2020 for a 6-month part-time internship, 3 working days a week.

Logo MonOrdo

MonOrdo is an e-health start-up founded in 2019 by Sébastien Bonnet and Léo Pechin. MonOrdo digitizes medical prescriptions by deploying the French e-prescription among healthcare actors.

I joined the team as a Data Scientist. MonOrdo did not yet have a Data Science division, so the underlying objective of my missions was to create one. In this article, I will present the first mission I carried out.

How to achieve a seamless digital transition within the French healthcare industry?

Physical paper prescriptions are omnipresent in healthcare systems. This raises several problems:

  • How to track a patient’s record from one healthcare structure to another?
  • How to prevent wasting time when transferring a medical record?

This leads us to the following problem:

How to digitize and extract, in a simple and efficient way, the information contained in a medical prescription?

Automatically understanding a prescription allows us to retrieve, understand, and extract all the important information it contains, including patient information as well as the posologies (dosage instructions). This is one of the major stakes of the digitization of the French health system.

Model

The model is built around 3 parts. I will present them separately before explaining how they are linked together.

Introduction — Technical parts

From the start, all prescriptions are converted to “.png” format, since the model input expects an image and not a “.pdf” file.

First of all, it is important to know that no off-the-shelf tool exists to perform this task. I chose to use three existing algorithms and merge them into a new one to achieve the expected results.

Object detection

If you are not familiar with this term, it is a computer vision process for identifying an object (and its bounding-box coordinates in pixels) in an image.

The first step is to extract and differentiate chronic and acute conditions. Drugs are not treated in the same way afterwards, so it is important to separate them from the start.

I therefore obtain two categories:

  • Chronic conditions (labeled ALD: affection longue durée in French)
  • Acute conditions (labeled Base)

Each part is extracted from the image of the original prescription, so I end up with one (if there is no ALD on the prescription) or two “extracts” corresponding to blocks of labeled drugs.

The second step is to identify the drugs within the previous “extracts”. The goal is to separate the drug blocks in order to facilitate the work of the OCR and then the NER.

If the OCR is run directly on the entire prescription, it is difficult afterwards to implement an “intelligent” algorithm that precisely identifies ALDs. The same is true for identifying drug groups, which is why OCR is used in a second step. This method is slower to execute, but far more efficient and precise in its final results.

To perform this processing, the drug blocks are simply fed back into the algorithm.

I thus obtain two levels of extraction: a first level of “extracts” corresponding to the types of conditions, and within these a second level corresponding to the drugs.
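Conceptually, each detection is a labeled bounding box, and each “extract” is a crop of the image. Here is a minimal sketch, with a 2D list of pixels standing in for the image and a hypothetical `(label, box)` detection format — this is an illustration, not YOLOv5's actual output structure:

```python
def crop(image, box):
    # image: 2D list of pixel rows; box: (x1, y1, x2, y2) in pixels.
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in image[y1:y2]]

def extract_blocks(image, detections):
    # First level: one crop per detected condition type (ALD / Base).
    # The second level would re-run detection on each crop to isolate
    # the individual drug blocks inside it.
    return {label: crop(image, box) for label, box in detections}
```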

Here is a schema summarizing the role of the object detection algorithm:

Object detection pipeline

This part of the final model is trained with a transfer learning technique in order to speed up training and quickly converge to better accuracy and more relevant results. I used the architecture and weights of the YOLOv5 model.

The dataset is composed of a base of PDF medical prescriptions transformed into images. A data augmentation technique was then applied, on the one hand to increase the size of the dataset, and on the other hand to show the algorithm new cases and simulate situations where the photo or scan of the prescription is of poor quality. Since the format of prescriptions changes according to the medical software the doctor uses, this step is all the more important so that the algorithm does not overfit on a particular type of prescription.
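To give an idea, one such augmentation can be sketched as salt-and-pepper noise over a grayscale image, simulating a poor-quality scan. This is a pure-Python illustration on a 2D list of pixels; an actual pipeline would use a proper augmentation library and combine several transforms (rotation, blur, contrast, …):

```python
import random

def add_scan_noise(image, noise_prob=0.05, seed=None):
    # Randomly flips pixels to black (0) or white (255) on a grayscale
    # 2D list, mimicking dust and artifacts on a low-quality scan.
    rng = random.Random(seed)
    return [
        [rng.choice([0, 255]) if rng.random() < noise_prob else px for px in row]
        for row in image
    ]
```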

To summarize, this part takes as input an image corresponding to a medical prescription and outputs the different drugs classified by type of condition.

Optical character recognition (OCR) and Named entity recognition (NER)

The next step consists of transforming the text inside the extracts into digital format. This is where the OCR comes in: it simply retrieves each drug block in digital format.

Finally, the last part consists in extracting the following information: the name of the drug, its type (tablet, powder in drinkable solution, …), its frequency, and the duration of the associated treatment. Named entity recognition (NER) is the algorithm implemented to perform this task.
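To make the target concrete, here is a rule-based sketch of the entities the NER extracts. Regular expressions stand in for the trained model, the field names are illustrative, and the input is assumed to be normalized French text (lowercase, accents stripped, abbreviations expanded):

```python
import re

def extract_entities(block: str) -> dict:
    # Stand-in for the trained NER model: the real system learns these
    # entities; this only illustrates the target output structure.
    form = re.search(r"\b(gelule|comprime|sachet|sirop)\b", block)
    duration = re.search(r"pendant (\d+ \w+)", block)
    return {
        "name": block.split()[0],                      # drug name
        "form": form.group(1) if form else None,       # capsule, tablet, ...
        "frequency": re.findall(r"\b(matin|midi|soir)\b", block),
        "duration": duration.group(1) if duration else None,
    }
```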

This part of the model is trained on a dataset of 500 prescriptions (an average of 4 drug blocks per prescription) and validated on a dataset of roughly 100 prescriptions.

The training dataset was built with a medical prescription generator. To build this generator, I drew inspiration from several hundred prescriptions, analyzing how the blocks of the different drugs related to each other. The goal was to reproduce posologies as accurately as possible.

One of the biases encountered was the multitude of medical software programs used by doctors. These programs issue prescriptions that differ noticeably in their posology syntax. So I made the generator as general as possible, able to cover the syntactic differences of the various medical software programs.

During this last step, and in order to partially counter the previous problem, a text cleaning/normalization step is applied (text preprocessing). One of the key points of this cleaning is the expansion of medical abbreviations into full words. Here is an example:

DOLIPRANE 1000 mg Gél Plq/8 Prendre 1 gélule le matin, le midi et le soir, pendant 1 semaine

becomes …

doliprane 1000 milligramme gelule prendre 1 gelule matin midi soir pendant 1 semaine

The word “prendre” (“take”) represents an action and the word “pendant” (“for”) represents time. That is why they are kept.
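The transformation above can be sketched as follows. The abbreviation table and stop-word list here are small illustrative fragments, not the real, much larger mappings:

```python
import re
import unicodedata

# Illustrative fragments of the abbreviation and stop-word tables.
ABBREVIATIONS = {"mg": "milligramme", "gel": "gelule", "cp": "comprime"}
STOPWORDS = {"le", "la", "les", "et", "de", "du"}
# Packaging codes such as "plq/8" carry no posology information here.
PACKAGING = re.compile(r"\bplq/\d+\b")

def normalize(text: str) -> str:
    # Strip accents (Gél -> Gel) and lowercase.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(c for c in text if not unicodedata.combining(c)).lower()
    text = PACKAGING.sub(" ", text)
    text = re.sub(r"[^\w/ ]", " ", text)  # drop punctuation such as commas
    # Expand abbreviations, then remove stop words.
    words = [ABBREVIATIONS.get(w, w) for w in text.split()]
    return " ".join(w for w in words if w not in STOPWORDS)
```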

To summarize, this part takes as input the drug blocks detected via object detection and outputs their labeled entities.

Coordination between parts

All the parts have been wrapped together in a Python module and I adapted the inputs/outputs of the different components so that everything works together. Here is a schema representing the global architecture:

OCR-NER pipeline

Finally, the model has been integrated into a Python API (written with Flask) so it can be deployed and used very simply.
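As a sketch of what such an API could look like — the route name, payload, and response schema are assumptions for illustration, and `analyze_prescription` is a stub standing in for the full pipeline:

```python
import io

from flask import Flask, jsonify, request

app = Flask(__name__)

def analyze_prescription(image_bytes: bytes) -> dict:
    # Stand-in for the full pipeline (object detection -> OCR -> NER).
    # The response schema below is an assumption, not MonOrdo's actual one.
    return {"ALD": [], "Base": []}

@app.route("/prescriptions", methods=["POST"])
def prescriptions():
    # Expects a multipart upload with the prescription image under "file".
    image_bytes = request.files["file"].read()
    return jsonify(analyze_prescription(image_bytes))
```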

Go further

Due to the sensitivity of medical data, all model outputs are checked at each use. In case of error, the data is corrected. This process not only guarantees the reliability of the prescription output, but also allows the model to be corrected and improved as it is used.

If you are interested in e-health projects, have a look at MonOrdo.

Thank you for reading this article.

Guillaume Barthe
