
Parsing ICD Codes With Python

Deriving data from PDF files: What do you do when the data you need isn’t easily accessible?

I recently needed to parse more than 2,000 ICD codes from a PDF file at the start of a proof-of-concept project. I have a reasonable grasp of REGEX and more than novice experience in Python. I have parsed data and tables from PDFs before, usually with Tabula-py and with good success, but the table layout in this PDF limited my options, and I ran into multiple issues getting Tabula to parse out the data I needed. If you’re interested in the original PDF file, the link is here. The remainder of the article discusses the method I used to quickly grab this data and make it usable for this project.

I started with standard imports:

# Standard imports
import pandas as pd
import numpy as np
import re

For this project, I used PyPDF2 although there are multiple available options. You can find all the documentation needed to use PyPDF2 here.

# Import PyPDF2 to parse text from PDF
import PyPDF2

Using a simple while loop, I was able to quickly iterate over all pages and extract out all the text.

# Open pdf, iterate through pages, parse text per page to text
# variable, close file
with open('icd.pdf', 'rb') as doc:
    doc_reader = PyPDF2.PdfFileReader(doc)
    num_pages = doc_reader.numPages
    count = 0
    text = ""
    while count < num_pages:
        pageObj = doc_reader.getPage(count)
        count += 1
        text += pageObj.extractText()

text = text[:176500]
CPU times: user 4.51 s, sys: 0 ns, total: 4.51 s
Wall time: 4.5 s
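
The same page loop can also be written without a manual counter. This is only a sketch: the helper name `extract_all_text` is hypothetical, and it assumes the legacy (pre-3.0) PyPDF2 interface used in this article, `numPages`, `getPage` and `extractText`. Newer PyPDF2 releases expose the same steps as `PdfReader`, `reader.pages` and `extract_text()`.

```python
def extract_all_text(reader):
    """Concatenate the text of every page of a PyPDF2-style reader.

    Assumes the legacy interface used in this article:
    reader.numPages and reader.getPage(i).extractText().
    """
    # Build all page texts lazily and join them once, instead of growing
    # a string with += inside a while loop.
    return "".join(
        reader.getPage(i).extractText() for i in range(reader.numPages)
    )


# Hypothetical usage, mirroring the loop above:
# with open('icd.pdf', 'rb') as doc:
#     text = extract_all_text(PyPDF2.PdfFileReader(doc))
```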

Overall, the loop was reasonably fast given that the document was 78 pages long. The end of the text lists excluded ICD-9 and ICD-10 codes that I did not want, and a simple slice removed them. Next, I needed to develop a REGEX pattern to get all the data I needed out of the text.
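
The hard-coded index 176500 is tied to this one file. As a hedged alternative, the sketch below cuts the text at the first occurrence of a marker string instead. The helper name `truncate_at_marker` and the marker text are hypothetical; the real heading that introduces the exclusion list would have to be read from the PDF.

```python
def truncate_at_marker(text, marker):
    """Return the part of `text` before the first occurrence of `marker`.

    If the marker is absent, the text is returned unchanged.
    """
    head, _sep, _tail = text.partition(marker)
    return head


# Made-up sample text; "EXCLUDED CODES" is a placeholder heading.
sample = "E75.21 Q85.00 EXCLUDED CODES 255.2"
kept = truncate_at_marker(sample, "EXCLUDED CODES")
print(kept)
```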

# use REGEX to find ICD10 and ICD9 codes of interest

icd_re = re.compile(r'[QDEGKLMPOZ]\d+\.?\d+|\d{3}\.\d+')
icd = re.findall(icd_re, text)
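
To see what a pattern of this shape picks up, it helps to run it on a few lines of made-up text before trusting it on the full document. The sketch below uses a pattern that requires the leading letter for ICD-10 style codes and at least one digit after the decimal point for ICD-9 style codes. The sample string is invented for illustration and is not taken from the PDF.

```python
import re

# ICD-10 style: one letter from the set, digits, optional dot, more digits.
# ICD-9 style: three digits, a dot, then one or more digits.
icd_re = re.compile(r'[QDEGKLMPOZ]\d+\.?\d+|\d{3}\.\d+')

# Invented sample text, including a bare number that should be ignored.
sample = "E75.21 Q85.00 237.70 255.2 page 12"
found = icd_re.findall(sample)
print(found)
```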

If you aren’t experienced with REGEX, the concepts are simple but can be confusing when you are first learning. The best resources I found were the Udemy class “Complete Regular Expressions Bootcamp” by Mudassar Naseem and RegexOne. The tutorials are very straightforward and contain high-value information that was easy to adapt to my programming needs.

At this point I had all the ICD9 and ICD10 codes that I needed, but a close look at the PDF shows that there are some exclusions embedded inside the document besides the ones removed by slicing the text. Generally, a lookahead/lookbehind expression can help with this. Below is one such expression, which looks for an ICD9/ICD10 pattern following the word ‘Excluding’.

re.findall(r'(?<=Excluding)[QDEGKLMPOZ]\d+\.?\d+?|\d{3}\.\d+',text)

Unfortunately, in this case, the formatting in the table and overlap between exclusions in certain ICD9/ICD10 codes that end up becoming inclusions in other parts of the document made this approach impractical. This is an excellent reminder that as data scientists, it is imperative to understand the industry context and specifics of the problem we are working to solve. In this case, it was faster and more accurate to scan and remove exclusions manually than with lookbehind parsing.
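
For readers who want to experiment with the lookbehind route anyway, one detail is worth knowing: in the expression above, alternation binds more loosely than the lookbehind, so `(?<=Excluding)` guards only the first alternative, and it also requires the code to follow the word with no space in between. The sketch below wraps both alternatives in a non-capturing group and assumes exactly one space after ‘Excluding’ (Python lookbehinds must be fixed-width). The sample text is invented, and the spacing in the real PDF text may differ.

```python
import re

# Invented sample: two included codes and two codes after "Excluding".
sample = "E75.21 Excluding E75.22 and 237.70 Excluding 237.71"

# The lookbehind applies only to the first alternative here, so every
# ICD-9 style code matches regardless of what precedes it.
ungrouped = re.findall(
    r'(?<=Excluding)[QDEGKLMPOZ]\d+\.?\d+?|\d{3}\.\d+', sample
)

# Grouping both alternatives puts them under the same lookbehind.
# Assumption: exactly one space separates "Excluding" from the code.
grouped = re.findall(
    r'(?<=Excluding )(?:[QDEGKLMPOZ]\d+\.?\d+|\d{3}\.\d+)', sample
)

print(ungrouped)
print(grouped)
```

Even with the grouping in place, this only finds codes that directly follow the word, and it does nothing about codes that are excluded in one table row and included in another.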

Finally, I joined the list into a single string for easier entry into a SQL statement and wrote it to a .txt file for future use.

# Clean data for use in both SQL and M2
icd.sort()
icd_text_SQL = ', '.join(icd)
# Write files
with open('icd_text_sql.txt', 'w+') as f:
    f.write(icd_text_SQL)
print(icd[:10])

And here are the first 10 ICD codes after sorting:

['237.70', '237.71', '237.72', '237.73', '237.79', '255.2', '255.2', '255.2', '259.4', '259.5']
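
The sample output shows the same code more than once ('255.2' appears three times). If duplicates are not wanted, a set removes them before sorting. The sketch below also wraps each code in single quotes, on the assumption that the codes will be compared against a text column in a SQL `IN (...)` clause. The helper name `to_sql_in_list` is hypothetical.

```python
def to_sql_in_list(codes):
    """Deduplicate, sort, and single-quote codes for a SQL IN (...) list."""
    unique_codes = sorted(set(codes))
    return ', '.join(f"'{code}'" for code in unique_codes)


# The first ten codes printed above, duplicates included.
icd = ['237.70', '237.71', '237.72', '237.73', '237.79',
       '255.2', '255.2', '255.2', '259.4', '259.5']
sql_list = to_sql_in_list(icd)
print(sql_list)
```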

That was it. Below is the full code. Overall, this was a quick job that challenged my understanding of REGEX and some advanced REGEX features (lookbehind). I hope you found it helpful, and I would be interested to hear how you would have approached this problem or made it more Pythonic.

# Standard imports
import pandas as pd
import numpy as np
import re
# Import PyPDF2 to parse text from PDF
import PyPDF2
# Open pdf, iterate through pages, parse text per page to text variable, close file
with open('icd.pdf', 'rb') as doc:
    doc_reader = PyPDF2.PdfFileReader(doc)
    num_pages = doc_reader.numPages
    count = 0
    text = ""
    while count < num_pages:
        pageObj = doc_reader.getPage(count)
        count += 1
        text += pageObj.extractText()
text = text[:176500]
# use REGEX to find ICD10 and ICD9 codes of interest
icd_re = re.compile(r'[QDEGKLMPOZ]\d+\.?\d+|\d{3}\.\d+')
icd = re.findall(icd_re, text)

# Clean data for use in both SQL and M2
icd.sort()
icd_text_SQL = ', '.join(icd)
# Write files
with open('icd_text_sql.txt', 'w+') as f:
    f.write(icd_text_SQL)
print(icd[:10])

Written by

Physician Data Scientist & Pythonista
