Published in Geek Culture

Extracting Data from PDFs: 4 Valuable Tips from an Expert

Looking at the results of a recent blood test got me thinking about the difficulty of extracting data from PDFs.

Here’s an Example of Tough-to-Extract (and Tough-to-Understand) Data from PDF Files

Ahead of my annual health checkup, I completed a blood test. Eager to see the results, I downloaded the test data as a PDF document from the fancy new web portal my provider offers. And wouldn’t you know, it was not easy to read.

As a data professional, I know well-labeled data when I see it, and this sure wasn’t it.

I’ve had more than 40 different blood tests performed on me over the last decade (I have high cholesterol — it’s a genetic thing — so I watch my cholesterol numbers and triglycerides closely).

And I had no idea what most of the tests were.

Taking on projects is the best way for me to learn things. And I suspect that’s true with most people.

So I gave myself a project to extract data from the bloodwork PDFs that I received. I then highlighted the values that were “out of range” (like my cholesterol if I don’t take my medication).
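The highlighting step can be sketched in a few lines once the values are out of the PDF. This is a minimal sketch, assuming each result has already been extracted along with its reference range; the `LabResult` class, the `out_of_range` helper, and the sample numbers are hypothetical and are not real lab data or clinical guidance.

```python
from dataclasses import dataclass


@dataclass
class LabResult:
    test_name: str
    value: float
    low: float   # lower bound of the reference range printed on the report
    high: float  # upper bound of the reference range printed on the report


def out_of_range(results):
    """Return the results whose value falls outside the closed range [low, high]."""
    return [r for r in results if not (r.low <= r.value <= r.high)]


# Hypothetical values for illustration only.
sample = [
    LabResult("Total cholesterol", 245.0, 0.0, 200.0),
    LabResult("Triglycerides", 130.0, 0.0, 150.0),
]

flagged = out_of_range(sample)
for r in flagged:
    print(f"{r.test_name}: {r.value} (reference {r.low}-{r.high})")
```

Keeping the range next to each value matters because reference ranges can differ from one report to the next, so a single hard-coded threshold per test would be fragile.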

By the end of a project like this, I may not know what the different tests are, but I will know my data. And I’ll see the variations of data across different years (I get two tests a year on average).

The point I’m making here is that I had to get intimate with my data in order to use it properly.

Anyway, my recent bloodwork data project got me thinking about a PDF data extraction demo I’ve been working on. Based on that, here are 4 tips:

PDF Data Extraction Tip #1: Understand the Reality of Your Data

Here’s a recent scenario I faced: a client sent me 9 documents, 125 pages in total, from an industry I’ve never worked in: insurance. And, as expected, the data was in a very unconsumable format.

My project was to create a data model to extract that difficult data from PDFs.

In every single PDF data extraction project I’ve come across, there’s a gap between the customer’s understanding of their data and the reality of their data.

That’s just how it goes, and it’s why outside eyes are a big help.

To perform this complex work, you’ll need intelligent document processing technology. But don’t lose sight of the fact that the reality of your data doesn’t change — complex data extraction will always require a more complex solution.

PDF Data Extraction Tip #2: Start with Good Business Processes

The demo I’m working on is focused on insurance estimates. To build models for data extraction, I need to start by knowing which data is required.

Thankfully, this client has their business processes fully documented and gave me a list of required fields. Knowing exactly what information is critical for downstream processes makes a proof-of-concept exercise much more valuable.

The reality of extracting data from PDFs is that some data may simply be cost-prohibitive to collect unless you clearly understand the business outcome it supports.

Is the business outcome high-value? Then more time and budget should be allocated to the solution.

PDF Data Extraction Tip #3: Use Key-Value Pairs (The Low-Hanging Fruit)

In my demo, I started off with something easy — a key-value field that looks for a “total cost” on the documents. It’s always some variation of the word “total” and a currency figure in close proximity. That’s what a key-value pair is.

So I configured a key-value pair extractor to look for this association: a particular word (or words) with currency data nearby.
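To make the idea concrete outside of any particular product, here is a minimal sketch of a key-value pair extractor written as a plain regular expression. It is not how Grooper implements extraction; the pattern, the 30-character proximity window, and the `find_total` name are assumptions chosen for illustration, and the input is assumed to be text already pulled from the PDF.

```python
import re
from decimal import Decimal

# Key: the word "total" on its own (so "Subtotal" does not count).
# Filler: up to 30 non-digit characters on the same line (labels, colons, "$").
# Value: a number with optional thousands separators and optional cents.
TOTAL_PATTERN = re.compile(
    r"\btotal\b[^\d\n]{0,30}(\d[\d,]*(?:\.\d{2})?)",
    re.IGNORECASE,
)


def find_total(text):
    """Return the first amount found shortly after the word 'total', or None."""
    match = TOTAL_PATTERN.search(text)
    if match is None:
        return None
    # Drop thousands separators before converting to an exact decimal.
    return Decimal(match.group(1).replace(",", ""))


print(find_total("Total Cost: $1,234.56"))
print(find_total("Subtotal only"))
```

A pattern this simple will misfire on lines such as "Total of 3 items: $50", where the first number after the key is not the amount. That kind of miss is exactly what stepping through real documents is meant to surface.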

I’m working in the Grooper data extraction platform, which makes it easy for me to step through each document one by one to test extraction results.

I can even click on “Test All” to test all the documents in the current batch (the nine documents with 125 pages), and I’ll get a little red flag on any document where extraction failed to produce a result.
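The same batch-test idea can be sketched in plain Python: run one extractor over every document and flag the ones that return nothing. This is only an illustration of the workflow, not Grooper's "Test All" feature; `extract_total`, `run_batch`, and the file names are hypothetical, and the documents are assumed to be plain text already.

```python
import re


def extract_total(text):
    """Toy extractor: the number that follows the word 'total', or None."""
    match = re.search(
        r"\btotal\b[^\d\n]{0,30}(\d[\d,]*(?:\.\d{2})?)", text, re.IGNORECASE
    )
    return match.group(1) if match else None


def run_batch(documents, extractor):
    """Apply the extractor to each document; return all results and the failures."""
    results = {name: extractor(text) for name, text in documents.items()}
    flagged = [name for name, value in results.items() if value is None]
    return results, flagged


# Hypothetical two-document batch for illustration.
batch = {
    "estimate_001.txt": "Total: $4,210.00",
    "estimate_002.txt": "Amount due on receipt",
}

results, flagged = run_batch(batch, extract_total)
print("Needs attention:", flagged)
```

Passing the extractor in as an argument keeps the loop reusable: each new field gets its own extractor function, and the same batch run shows which documents it fails on.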

Using a Little Data Science for Easy Results

I see results in real time and make adjustments to my data extraction model as needed.

This process is sometimes described as textual disambiguation. In practice, it means forming a hypothesis about how a value appears in the text and testing it iteratively.

That’s what Grooper facilitates.

No other product I’ve ever worked with makes this iterative process so easy to do.

After configuring around 10 fields for extraction, I have a good idea of the data set I’m working with. I’ve iterated over these documents at least 10 times (once per extractor), and now I know the data really well.

PDF Data Extraction Tip #4: Use Data to Tell The Story

I can tell a story with the data as I demonstrate the proof of concept. For example, I can say things such as:

  • “This document from XYZ Insurance Company doesn’t have any adjuster name.”
  • “This one from ABC Indemnity was a malformed PDF.”
  • “And these over here — from Intelligent Coverage Co — are in great shape. And we can extract all the data needed.”
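Statements like these fall out naturally once every document's results are checked against the list of required fields. Here is a minimal sketch, assuming extraction results are stored as one dictionary per source; the field names, the placeholder values, and the `describe` helper are hypothetical, and `None` stands in for a file that could not be parsed.

```python
# Hypothetical required-field list; a real one comes from the client's documented process.
REQUIRED_FIELDS = ["total_cost", "adjuster_name", "claim_number"]


def describe(extracted):
    """Summarize one document's extraction result in a single phrase."""
    if extracted is None:
        # None stands for a file that could not be parsed at all.
        return "could not be read"
    missing = [field for field in REQUIRED_FIELDS if not extracted.get(field)]
    if missing:
        return "missing: " + ", ".join(missing)
    return "all required fields extracted"


# Placeholder results keyed by the carrier names used in the examples above.
batch = {
    "XYZ Insurance Company": {
        "total_cost": "4,210.00", "adjuster_name": "", "claim_number": "A-1",
    },
    "ABC Indemnity": None,
    "Intelligent Coverage Co": {
        "total_cost": "980.00", "adjuster_name": "J. Doe", "claim_number": "B-2",
    },
}

report = {source: describe(result) for source, result in batch.items()}
for source, line in report.items():
    print(f"{source}: {line}")
```

A per-source summary like this is a starting point for the conversation, not a replacement for looking at the documents themselves.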

Again, I’ve gotten intimate with the data set.

This is important for potential customers to know. The process of learning someone else’s data, and becoming intimate with it, is iterative, and it takes time.

It’s why we’ve gone to a different service model that supports this kind of iterative discovery rather than trying to pin down a Scope of Work in the presales cycle. Scoping that early has, in my opinion, always been contentious.

We have to learn a customer’s data to really understand if what we’re extracting is meaningful. But more than that, we have to be able to explain the data extraction story.

Here’s Why Telling the Story of Someone’s Data Extraction is So Important

We have to be able to explain the pros and cons of what we’ve found so that the prospect has a clear path to achieve their desired business outcomes.

But this isn’t an easy story to tell because it takes work.

It’s easier to just say, “We get 99.99% of your data, guaranteed.” But anyone who says that without first really getting intimate with your data is just trying to sell you something.

And I’ll bet good money it doesn’t work.

PDF Data Extraction Problems are Complicated

Issues dealing with data extraction take time to understand, unravel, and perfect.

We’re in a strange place in the automation industry right now.

Companies are touting terms like AI and ML without really having any wood behind the arrow.

Just do a quick Google search on “Realities of AI,” and you’ll quickly see that most of what’s out there is marketing hype at best.

So we built Grooper software to solve difficult problems like PDF table extraction and classification. The goal is to help users extract data from PDF files and scanned documents with little or no manual data entry.

We use several components of AI, and we’re transparent about it. We’ve built the extraction tools you need:

  • Computer vision algorithms
  • Natural language processing (NLP) features
  • Supervised machine learning algorithms for both classification and extraction

The key to extracting accurate information from complex data is how these innovative features have been combined to get data from PDF documents.

Software Built Specifically for Extracting Complex Data

We didn’t take off-the-shelf components (which haven’t been built with PDF document processing in mind) and ram them into the product.

Our development team studies advanced data science techniques and discovers how to use them with PDF data extraction and automation projects.

We’ve built our own intellectual property into Grooper to be the best platform for extracting difficult data for demanding business processes.

If you’re tired of vendors overpromising and underdelivering, give us a call. We’d be happy to discuss how Grooper can help extract data from PDFs, if it can.

And if it can’t, we’ll tell you that right off. And we’ll even explain why.

Originally published at



Jesse Spencer-Davenport

I enjoy solving problems through business process analysis and increasing revenues through excellent content marketing.