Automated medical coding using NLP

Manas Ranjan Kar
Published in NLP Wave · Nov 1, 2015

In the US healthcare industry, insurance providers and doctors both play an important role in ensuring world-class care and timely reimbursements for patients. Quick processing of medical records also matters to patients, especially since healthcare costs can become very expensive. Often, document processing is outsourced to third-party organizations that manually read through hundreds of pages and extract medical codes.

Medical coding is a huge industry in itself, but a very fragmented market. The headcount of medical coding organizations can often run into the thousands. While the model works quite well up to a certain level, taking on additional work becomes difficult for a variety of reasons: resource costs, pricing pressure and tight deadlines. The average number of medical records processed per hour hovers between 2 and 5.

Medical coding is broadly performed on two formats — scanned PDFs/TIFFs and XMLs. The latter comes from the latest crop of electronic medical record (EMR) systems and is slowly gaining acceptance. Extracting data from XMLs is relatively simple, as there are a limited number of formats and the captured data can be parsed easily.

This brings us to the challenges with scanned documents:

  • There are numerous templates, differing by doctors and insurance providers.
  • Optical character recognition (OCR) output can be pretty noisy, so additional algorithms must be created to filter out invalid documents.
  • There are multiple ways a doctor can write a disease; for example, High Blood Pressure may also be written as HBP. Algorithms must be created to map these variations.
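The variant-mapping problem above can be sketched with a small synonym dictionary. This is purely illustrative — the dictionary entries and function name are assumptions, and a production system would draw on a much larger medical lexicon plus fuzzy matching:

```python
# Illustrative sketch: normalizing disease-name variants to a canonical
# form via a small synonym dictionary (toy entries, not a real lexicon).
SYNONYMS = {
    "hbp": "high blood pressure",
    "htn": "high blood pressure",
    "high blood pressure": "high blood pressure",
    "dm": "diabetes mellitus",
    "diabetes": "diabetes mellitus",
}

def normalize_disease(mention: str) -> str:
    """Map a raw disease mention to its canonical name, if known."""
    key = mention.strip().lower()
    return SYNONYMS.get(key, key)
```

With this in place, "HBP", "HTN" and "High Blood Pressure" all resolve to the same canonical name before code lookup.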

APPROACHING THE PROBLEM

One of our clients who operates in the industry, processing documents for insurance providers, approached us with a clear problem statement — can you increase the productivity of my coders?

The complex problem had to be broken down into multiple parts. A single approach was unlikely to work well; what was needed was an ensemble of approaches covering scan-noise reduction, OCR conversion, document validation, information extraction, disease-code extraction and a recommendation engine.

The client had an existing knowledge base mapping disease names to the corresponding ICD-9 codes. We added more to the knowledge base from the Centers for Medicare & Medicaid Services (CMS) website.
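At its simplest, such a knowledge base is a mapping from canonical disease names to ICD-9 codes. The sketch below is an assumption about its shape, not the client's actual data structure (the two codes shown are real ICD-9 codes):

```python
# Hypothetical knowledge base: canonical disease name -> ICD-9 code.
ICD9_KB = {
    "high blood pressure": "401.9",   # essential hypertension, unspecified
    "diabetes mellitus": "250.00",    # diabetes without complication
}

def lookup_icd9(canonical_name: str):
    """Return the ICD-9 code for a canonical disease name, or None."""
    return ICD9_KB.get(canonical_name.strip().lower())
```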

Multiple algorithms were prepared for the following tasks:

  1. Cleaning the scanned documents and extracting valid documents
  2. Recommendation engine for accurate code prediction
  3. Reinforcement learning for the entire workflow to increase accuracy
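To make task 2 concrete, one simple way to recommend candidate codes is fuzzy string matching against the knowledge base. This is a stand-in of my own, not the client's actual recommendation engine, using Python's standard-library `difflib`:

```python
import difflib

# Toy knowledge base of canonical disease names -> ICD-9 codes.
KB = {
    "high blood pressure": "401.9",
    "chronic kidney disease": "585.9",
    "diabetes mellitus": "250.00",
}

def recommend_codes(mention: str, n: int = 3, cutoff: float = 0.6):
    """Fuzzy-match an extracted mention against the knowledge base
    and return up to n (disease, code) candidates, best match first."""
    matches = difflib.get_close_matches(mention.lower(), KB, n=n, cutoff=cutoff)
    return [(m, KB[m]) for m in matches]
```

Even a misspelled mention like "high blod pressure" still surfaces the correct code as the top candidate, which is the kind of tolerance a coder-facing recommender needs for noisy OCR text.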

Ultimately, we prepared a proof of concept with the following features:

  • Allows both PDFs and XMLs (formats limited to the ones provided in the earlier phase)
  • Disease extraction and predicting corresponding ICD codes
  • Recommendation system built in to pre-empt codes
  • Editable values and capability to add/delete shown records

The accuracy rate for our algorithm stood at 97% on blind test sets using the built tagger and recommender system. For comparison, the gold standard for BioNLP events on ‘seen data’ is around 85%.

The overall NLP pipeline looked something like this:
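As a rough sketch of how the stages described in this post chain together — each function below is a toy stub of my own standing in for a real component (OCR cleanup, validation, extraction, coding), not the actual implementation:

```python
def clean_scan(text):
    # Reduce scan/OCR noise; here just collapse stray whitespace.
    return " ".join(text.split())

def validate_document(text):
    # Drop documents with no readable text at all.
    return text if any(c.isalpha() for c in text) else ""

def extract_diseases(text):
    # Map known disease-name variants found in the text (toy lexicon).
    lexicon = {"hbp": "high blood pressure"}
    return [lexicon[w] for w in text.lower().split() if w in lexicon]

def map_to_icd9(diseases):
    # Attach candidate ICD-9 codes from the knowledge base (toy entry).
    kb = {"high blood pressure": "401.9"}
    return [(d, kb.get(d)) for d in diseases]

def run_pipeline(raw_ocr_text):
    """Chain the stages: clean -> validate -> extract -> code."""
    return map_to_icd9(extract_diseases(validate_document(clean_scan(raw_ocr_text))))
```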

MORE WORK TO DO

We are still working on the learning algorithms and hope to stabilize the accuracy in the future. More can be done as far as intelligent data structures are concerned. Larger dictionaries will also have to be incorporated to improve the recommender system.

Have more suggestions? Please leave your comments below!
