Working with Healthcare Claims Data in Python

7 min readFeb 18, 2020

The key to successful analysis of healthcare claims is having a detailed understanding of the data structure and meaning. Yet claims data is often difficult to understand, poorly documented, and even treated as an industry secret. I’ve been working with healthcare claims data for eight years, and this post will provide an introduction to data analysis with healthcare claims. I will illustrate the examples with code in Python, but these principles apply across programming languages. I’ll cover some basic information about healthcare claims data formatting and structure as well as common pitfalls in reading in claims data and data cleaning.

Introduction to Healthcare Claims Data

You’ll quickly notice that most of this article is text, and there are only a few coding examples.

Industry knowledge is what separates a simple data summary from insightful and actionable analytics.

I can’t stress enough the importance of content knowledge to healthcare analysis in particular. So, I’m going to focus in this article on some key and unusual aspects of the data that I’ve noticed from training many programmers new to healthcare and will provide some examples.

If you are brand-new to healthcare, it’s important to do some research on how healthcare insurance works before looking into the data itself. The healthcare system in the US is complicated, as anyone who has been a patient and dealt with their insurance company knows.

To understand healthcare claims, a great place to begin is with Medicare claims data. The US government makes many resources available to learn about the data, and healthcare claims data is generally similarly structured across payers. ResDAC is a central repository of information on Medicare claims data and contains everything from introductory presentations to data dictionaries to guides on requesting data.

Claims Data Structure

The resources on ResDAC can also help you understand how claims data is structured. For many payers, different types of services are billed in different formats and stored in different files. If a patient is treated in the hospital, the bills the hospital sends to the insurer for that hospital stay may be in a separate file then the bills the physician sends to the insurer for the patient while they saw them in the same hospital. Even though the patient experiences the hospitalization as one “event”, each entity involved separately submits its own bill to the insurance company. This is a confusing point for most people at the start, but it makes sense once you understand health insurance. If different services for the same patient are split across files, you of course need to find the unique patient identifier that will allow you to associate them together.

Reading in Claims Data

To read in healthcare claims data to your system of choice, you may be looking at an unknown number of files and/or an unknown number of columns. Data is often broken into separate files due to size constraints on the storage server, but you may want to combine all the data to work with in your own systems. To address this issue, you would check the files in the directory and stack them together if needed.

In some cases, the data format can change over time. Some payers will send only the fields necessary in the data — for example, if the claim has 25 possible diagnosis codes but only 15 are used in the data cut, the data you receive will only have those 15 columns. Setting up your pipeline to dynamically handle these changes will make it much more efficient and save you debugging time later on. Here’s a sample of how to check for column lengths and format the file on read-in.

Cleaning Data

As with any data analysis, cleaning the data is crucial to getting good results and takes up about 80% of the work. This post assumes you already know the basics of cleaning data in Python.

A simple first problem to solve is removing any unusual characters that are affecting data types. This can sometimes be addressed by setting your data types upon file read-in, but this may not always be sufficient. For example, some claims may come formatted with special characters like dollar signs and commas and can crash your code. Regular expressions are your best friend in this scenario, though they can be difficult to work with. Here’s a simple example.

In the US, several different coding systems are used to document the procedures and diagnoses that are associated with healthcare services. An introduction to the different code sets can be found on this governmental website. When looking at a raw (unprocessed) healthcare claim, the procedures and diagnoses fields will be alphanumeric codes — for example, you may see a diagnosis like J03.01. To understand the diagnosis, it can be helpful to add on reference files that map the plain codes to descriptions of what the codes mean — either by merging descriptions on or creating a reference table for a relational database. Some resources have APIs, as code definitions can and do change over time. As a word of warning, always take the time to format your data types and check for any leading zero problems. For example, one healthcare procedure coding system is always 5 alphanumeric digits but can start with a leading 0 — this can easily cause issues in your analysis.

Because of the large number of codes in a given coding system, reference files are not always a simple set of every code and its description. It can be very helpful to group codes together to get a summary description. For example, you may not need to know for the purposes of your analysis that J03.01 is acute recurrent streptococcal tonsillitis and J05.10 is acute epiglottitis without obstruction — it may be sufficient to know that both are types of acute upper respiratory infections. Some code sets are conveniently created so that each digit and its order has a clear meaning, so aggregating them up is very simple. However, some code sets do not have this handy feature, and such reference files can be added on using a range merge. This is easily done in SQL, so this example uses the sqlite library.

For more info, see https://docs.python.org/3/library/sqlite3.html

There are many types of services/organizations in a given type of claims data file. In a file containing inpatient hospital billing, there may be bills from acute hospitals (for treatment of acute events like a major surgery or injury), long-term care hospitals (for treatment of acute events that require a much longer time for treatment/recovery), psychiatric hospitals, cancer hospitals, critical access hospitals…. Understanding how to identify these different types of facilities or services will greatly improve any analysis. For example, if your goal is to calculate readmission rates, that typically only means acute care hospitals — using the entire inpatient hospital billing file would give you an inflated estimate.

This example only shows a few hospital types. For more info, see https://www.resdac.org/sites/resdac.umn.edu/files/Provider%20Number%20Table.txt

Summarizing Data

If the claims data is spread across multiple files, it can be helpful to combine them chronologically to understand each patient’s whole care trajectory. This type of format is commonly called a “vignette”, and it can be invaluable for clinical review of data. Building a vignette requires a nuanced understanding of the data in each file. Typically, I first process the files into a similar structure, then stack and order them.

Wrap-Up

This article has highlighted some key aspects of working with healthcare claims data in Python. I hope you have found it useful! In the future, I will expand on this with more coding examples and content focused on Medicare data.

If you’re a seasoned data scientist and looking for a new position, we are hiring! Check out our data science posting here and other developer postings here.