Mining the medical record: how we made faxes and scans full-text searchable

Feridun Mert Celebi
PatientBank Engineering
Jul 10, 2017

At PatientBank, we help people gather and share medical records. As part of this effort, we recently launched a revamped version of PatientBank that makes it easier to search, annotate, and share medical records — not just request them. You can sign up right here.

Store, search, and share your medical records with PatientBank

One of the most challenging parts of this process is letting users search their documents at a granular level. This is hard because most of our medical records are PDFs that have been printed, scanned, faxed, or all of the above. That’s just the state of health information exchange.

Medical records are complex and rich with information, yet the contents remain inaccessible. Due to a fragmented EMR market and customized deployments at hospitals, there is no standard format, and records are interspersed with handwritten notes. Ultimately, we want to be able to convert PDFs into structured medical data. But that’s a lot to accomplish all at once. Our users kept asking for easier ways of finding details in their medical records, so we decided to start with full text search.

Initial spec

The user experience we want requires two things:

  1. A user’s search should reveal every match in all of their documents, including page numbers and context for the match, sorted by relevance.
  2. A click on a particular match should take the user to the specific page on the document with relevant search terms highlighted.

In this post, I’ll explain how we implemented the first version of our fully automated image-to-full-text-search pipeline using only open-source software. At the end, I’ll offer a couple of ideas about how we plan to refine it.

Optical character recognition (OCR)

Processing structured PDFs (ones that contain actual text, not images of text) is trivial. There are many open-source tools that let you extract text from structured PDFs. For unstructured PDFs, OCR is the way to extract text from an image, and there are many available tools from large companies (e.g. the Microsoft Computer Vision API) and the open-source community alike (e.g. Tesseract).

For our implementation, we decided to use Tesseract. It can detect many languages out of the box, supports flexible input and output formats, and provides tools to train your own OCR algorithms on top of the Tesseract platform. This would allow us to train Tesseract to recognize handwriting in the future, but handwritten notes in medical records are a whole other beast.
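
If you haven’t used the gem before, its simplest entry point looks roughly like this; the file name is made up for illustration:

# Minimal RTesseract sketch: plain-text extraction from a single image
require "rtesseract"

image = RTesseract.new("scanned_page.tif", lang: "eng")
puts image.to_s # => the recognized text as a single string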

Gotcha #1: If you are using a wrapper around Tesseract, such as RTesseract or ruby-tesseract, you will also have to install ImageMagick and possibly other tools. We consistently had trouble installing these tools on our development machines. Because we already used Docker in production, we started using Docker for Mac in development, since it is much easier to add a couple of RUN apt-get install … lines to our Dockerfile to install these tools. Alternatively, you can check out some of our guides about RTesseract and ImageMagick/RMagick to ease the process.
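
For reference, the relevant Dockerfile lines might look something like the excerpt below. The package names are the standard Debian ones and are an illustrative guess, not necessarily what we pin in production:

# Illustrative Dockerfile excerpt: install the native dependencies
# that RTesseract and RMagick/ImageMagick expect on the system
RUN apt-get update && apt-get install -y \
    imagemagick \
    libmagickwand-dev \
    tesseract-ocr \
    tesseract-ocr-eng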

Regardless of your OCR setup, it is important to remember that OCR is not perfect — based on the quality of your images, the accuracy of matches could be low. Luckily, there are ways to improve our OCR implementation, which I’ll explain later.

Image in, text out

When a new document is uploaded to PatientBank, it is automatically placed in a Resque queue for processing (the process currently supports PDFs).

Our Resque jobs take a document, download its contents from Amazon S3, convert its pages to TIFF files, pre-process the pages to remove noise and increase contrast, send the pages to Tesseract, and store the results.

# Sample code to parse a given blob
# Produces an array of ImageMagick pages
def parse_blob(blob)
  Magick::Image.from_blob(blob) do
    self.format = "PDF"
    self.quality = 100
    # ...
  end
end

# Sample code to preprocess a given page
def preprocess_page(page)
  page.colorspace = Magick::GRAYColorspace
  page.format = "TIFF"
  page = page.contrast(true) # sharpen
  # ...
  page
end

# Sample code to OCR a page
def ocr_page(page)
  bounding_boxes = RTesseract::Box.new(page, lang: "eng").words
  # bounding_boxes.first
  # => { :word => "PatientBank", :x_start => 152, :y_start => 100,
  #      :x_end => 329, :y_end => 122 }
  # ...
end
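
To give a sense of how these helpers fit together, here is a simplified sketch of the kind of Resque job that could drive them. The class name, queue name, and the S3 and persistence helpers are hypothetical, not our exact production code:

# Sketch of a Resque job wiring the helpers above together
# (class, queue, and helper names are hypothetical)
class ProcessDocumentJob
  extend DocumentOcrHelpers # hypothetical module containing parse_blob & co.
  @queue = :document_processing

  def self.perform(document_id)
    document = Document.find(document_id)
    blob = document.download_blob_from_s3 # hypothetical S3 helper

    parse_blob(blob).each_with_index do |page, index|
      words = ocr_page(preprocess_page(page))
      store_page_results(document, index + 1, words) # hypothetical persistence helper
    end
  end
end

A job like this would be enqueued with Resque.enqueue(ProcessDocumentJob, document.id) once an upload finishes.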

Gotcha #2: If you are using Resque in production, you may run into memory issues as you process large files. This can result in Resque throwing an esoteric Resque::DirtyExit error. We found it useful to keep Resque jobs short by processing documents page by page, and to keep memory usage low by replacing RMagick with MiniMagick.

Me typing angrily, whenever I get a Resque::DirtyExit error

Calling RTesseract with a given page produces bounding boxes for each word RTesseract recognizes. These bounding boxes include the text and the x and y coordinates of the top-left and bottom-right corners. We decided to store OCR data in two new tables: DocumentPage and DocumentWord (as seen below). Bounding box data is stored on the DocumentWord model. A concatenation of all the words on a given page is stored on the DocumentPage model as the full_text field. At a high level, each Document has many DocumentPage instances and each DocumentPage has many DocumentWord instances.

A simplified UML diagram of the schema
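
In ActiveRecord terms, the associations look roughly like the sketch below (simplified to the relevant parts, not our full schema):

# Simplified sketch of the models and their associations
class Document < ActiveRecord::Base
  has_many :document_pages
end

class DocumentPage < ActiveRecord::Base
  belongs_to :document
  has_many :document_words
  # full_text: concatenation of every word recognized on the page
end

class DocumentWord < ActiveRecord::Base
  belongs_to :document_page
  # word, x_start, y_start, x_end, y_end: bounding box from Tesseract
end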

Gotcha #3: Using RTesseract, you may be tempted to pass your PDF in directly for OCR. This can lead to lower accuracy in your matches. Tesseract can’t parse PDFs directly, so wrappers such as RTesseract convert the passed PDF to TIFF files, but do so at a lower resolution and color depth. Converting your PDFs to TIFF files yourself, at the quality level you desire, can yield better results.
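
For example, you can ask RMagick for a higher rasterization density when reading the PDF and then write the pages out as TIFFs yourself. The 300x300 DPI value below is just an illustrative choice:

# Sketch: rasterize the PDF at a higher density before OCR
pages = Magick::Image.from_blob(blob) do
  self.format = "PDF"
  self.density = "300x300" # higher resolution than the default conversion
end
pages.each { |page| page.format = "TIFF" }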

An alternative strategy would be to eliminate the concept of a DocumentWord and store all the granular coordinate information in a JSON blob, or as hOCR output (OCR results in HTML markup) from Tesseract, on the DocumentPage model. We opted against these approaches for two reasons: parsing large JSON blobs (or hOCR outputs) out of PostgreSQL is messy, and indexing them in Elasticsearch is even messier.

Indexing and Elasticsearch

We use and love Elasticsearch (and Chewy to interface with Elasticsearch) for indexing and search. Elasticsearch is an open-source tool built on top of Apache Lucene that builds an inverted index (in our case, specified by Chewy’s DSL) for faster search.

It supports full-text search with highlighting and sorting by relevance. In addition, it provides valuable tools such as token normalization, stemming, stop word filtering, and fuzzy matching (for misspellings) to deal with OCR or human errors. I won’t go into the specifics of Elasticsearch, but you can learn more about it here.

For our purposes, we use a composite Chewy index that references multiple models — Document, DocumentPage and DocumentWord. When a user searches all their documents, the full_text field on DocumentPage instances is used to return highlighted results, sorted by relevance.
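
The index definition itself is omitted here, but a simplified Chewy index along these lines illustrates the idea. The field names mirror the models above, and the analyzer choice is illustrative rather than our exact configuration:

# Simplified sketch of a composite Chewy index (illustrative, not exact)
class DocumentSearchIndex < Chewy::Index
  define_type Document do
    field :name
    field :uid
  end

  define_type DocumentPage.includes(:document) do
    field :full_text, analyzer: "english" # stemming and stop words, illustrative
    field :page_number, type: "integer"
    field :user, value: ->(page) { page.document.user } do
      field :uid
    end
  end

  define_type DocumentWord do
    field :word
    field :document, value: ->(word) { word.document_page.document } do
      field :uid
    end
  end
end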

# Query to search all DocumentPage instances for a user
# with custom highlighting
DocumentSearchIndex::DocumentPage
  .query(
    match: {
      _all: @search_query
    }
  ).filter(
    term: {
      'user.uid' => @user.uid
    }
  ).highlight(
    fields: {
      full_text: {}
    },
    order: "score",
    pre_tags: ["{{{"],
    post_tags: ["}}}"]
  )

Once a user clicks on a particular search result, the frontend queries every DocumentWord in the document. It highlights every match using the word’s bounding box from Tesseract.

# Query to search all DocumentWord instances for a given @document
DocumentSearchIndex::DocumentWord.query(
  match: {
    _all: @search_query
  }
).filter(
  term: {
    'document.uid' => @document.uid
  }
)

Gotcha #4: The biggest problem with having a full_text field on DocumentPage instances is that the search and the relevance of the results are scoped to the page, not the entire document. So, if a user searches for a phrase that spans multiple pages, the search fails to take that into account when calculating relevance. Elasticsearch recently addressed this issue with Aggregations and Field Collapsing. Similar solutions exist for other indexing technologies (such as Solr).
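
As a rough illustration, a field-collapsing request against the page type could look like the raw query body below, expressed as a plain Ruby hash rather than through Chewy’s DSL (collapsing requires document.uid to be mapped as a keyword; the field and variable names follow the examples above):

# Sketch of an Elasticsearch request body that collapses results
# so only the best-matching page per document is returned
search_body = {
  query: {
    bool: {
      must:   { match: { full_text: @search_query } },
      filter: { term:  { "user.uid" => @user.uid } }
    }
  },
  collapse:  { field: "document.uid" },
  highlight: { fields: { full_text: {} } }
}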

Displaying the search results

For full-text search, our frontend makes two separate queries: one that searches the pages of all of your documents, and one that searches the words of a single document.

For the first query, the results are grouped by document and include the document and an array of pages. Each entry in the pages array includes a page_number and matches. The matches array contains snippets with matched words wrapped in the custom {{{ and }}} pre- and post-tags specified in our Elasticsearch query.

# Sample response from the backend
# Search query: back pain
# Document: my Yale Health medical records
{
  document: {
    name: "Yale Health Medical Records",
    # ...
  },
  pages: [
    {
      page_number: 4,
      matches: [
        "... Experienced {{{back}}} {{{pain}}} in Turkey...",
        "... Carrying a heavy backpack {{{Back}}} {{{pain}}}?..."
        # ...
      ]
    }
  ]
}

The interface:

Search results for “back pain” in my Yale Health medical records

When a user clicks on the actual search result, the interface makes a second query that uses the DocumentWord model to return an array of matches. These matches include coordinates for the words which help highlight them on the page.

Highlighting actual word matches on the document
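
One detail worth noting: the page rendered in the browser is rarely the same pixel size as the TIFF we ran OCR on, so the coordinates have to be scaled. A simple option, sketched below with a hypothetical helper, is to serve them as fractions of the page dimensions so the frontend can multiply by whatever size it renders at:

# Sketch: convert absolute Tesseract pixel coordinates into
# page-relative fractions (helper name is hypothetical)
def normalized_box(word, page_width, page_height)
  {
    left:   word.x_start.to_f / page_width,
    top:    word.y_start.to_f / page_height,
    width:  (word.x_end - word.x_start).to_f / page_width,
    height: (word.y_end - word.y_start).to_f / page_height
  }
end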

Future possibilities

As mentioned before, OCR is not perfect. However, it is possible to improve OCR performance considerably.

  1. More sophisticated image pre-processing: Currently, we have a basic, one-size-fits-all approach to image pre-processing. Better, more intentional pre-processing would greatly improve OCR accuracy, since each PDF is a little different. While some documents don’t need any pre-processing, others (the ones that get scanned again and again) need heavy manipulation; see the sketch after this list.
  2. Fine-tuning Tesseract: The Tesseract technology is constantly being refined. The tool is extremely customizable, with 13 different page segmentation modes, 4 different OCR engines, and support for training your own OCR algorithms. Similar to point 1 above, it would be possible to use more intentional configurations for different types of documents for increased accuracy and/or speed.
  3. Fine-tuning Elasticsearch: For full-text search, we currently use a simple implementation of most Elasticsearch tools. Elasticsearch is extremely powerful, and the documentation offers nuanced guidance on better ways to do stemming, tokenization, and handling of misspellings.
  4. Human-machine collaboration: In the future, there could be a more active feedback loop between the users and our OCR pipeline. Allowing users to correct OCR matches could be valuable.
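
As an example of point 1, RMagick already exposes several cleanup operations that a smarter pipeline could apply selectively, document by document. Which steps actually help depends on the scan, so the combination below is purely illustrative:

# Illustrative, more aggressive cleanup pass using RMagick operations
# (whether each step helps depends on the individual document)
def aggressive_preprocess(page)
  page = page.deskew     # straighten pages scanned at an angle
  page = page.despeckle  # remove salt-and-pepper fax noise
  page = page.normalize  # stretch contrast across the full tonal range
  page.colorspace = Magick::GRAYColorspace
  page
end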

Regardless, we are excited to continue iterating on our full-text search feature. We strongly believe that making all the information hidden in medical records more accessible will be beneficial to both our users and their doctors. This is only the first of many steps towards accomplishing that goal.

P.S. If you liked this post, or have comments or feedback, I’d love to hear from you! Drop me a line at mert@patientbank.us or reach out to me on Twitter @trembleice.
