An Introduction to Claims Data

Published in The Trilliant Health Tech Blog, Jun 8, 2023 (11 min read)

Welcome to a foundational, if lengthy, discussion of the essentials of healthcare claims data. The exchange of goods and services for payment is at the core of any economy, and claims are the glue that holds our healthcare system together. The entire health economy, from healthcare services and software to business models and insurance, is shaped and influenced by the concept of claims and the infrastructure surrounding them. Much as automobiles reflect the infrastructure that supports them (roads, gas stations, etc.), the healthcare economy reflects the infrastructure that supports it.

image generated by Midjourney — “watercolor of an onslaught of paper documents overwhelming doctors”

This may seem like an unglamorous and somewhat low-level topic of discussion, but I promise to keep things interesting. I will start with a brief history, then review some of the infrastructure that enables claims exchange, followed by a review of the data that is actually available. In our next article I will cover some difficulties that are common in datasets this large and then share a few use cases.

Let’s dive in…

What, exactly, is a claim in the first place?

A claim is a request for payment for services rendered. This definition applies broadly to the insurance industry, whether automotive insurance, life insurance, travel insurance, pet insurance… The list goes on. In health insurance, you, the insured party (also referred to as “beneficiary”, “enrollee”, “member”, “patient”, etc.), receive a service from a provider (also referred to as “doctor”, “physician”, “hospital”, “clinic”, etc.), and then either you or the provider submits a claim to the insurance company (also referred to as “payer”, “health plan”, “insurance carrier”, etc.) for payment. These days, it is almost always the provider that submits the claim. The insurance company then adjudicates the submitted claim, and either pays the provider (the requested amount, or some portion of it), or denies the claim.

When was this practice started?

Insurance, abstractly, is just another form of gambling, which has been around as long as humans have. One party is taking a bet that something bad will happen, and the other party is taking a bet that it won’t.

There are records of insurance-like practices dating back to the 3rd and 2nd millennia BC, mostly around maritime shipping; in America, one of the earliest insurers was co-founded by Benjamin Franklin, who helped form a fire insurance company in Philadelphia in 1752. Here is a link to the first chapter of “Health Insurance, 2nd Edition” by Michael A. Morrisey, which provides a nice summary of the history of health insurance in America. If you are interested in the background, I highly recommend reading the book (or at least the first chapter).

For our purposes, the current landscape of claims data really “started” on August 21, 1996, when the Health Insurance Portability and Accountability Act (HIPAA) was signed into law. While HIPAA is most commonly known for its privacy and security rules, it also established a set of standards for electronic healthcare transactions, including the electronic submission of claims. Before 1996, almost all claims were submitted on paper, using the forms now known as the CMS 1500 (for professional claims) and the UB-04 (for institutional claims). We’ll talk about institutional vs professional in a moment.

The deadline for providers to transition from submitting paper to electronic claims was October 16, 2003, but most providers had already adopted electronic claims submission by 2000.

Notably, the deadline applied to CMS claim submissions, but not commercial claims submissions. The transition to electronic claims submission for commercial payers was much slower and is still ongoing today. Pragmatically, that means that most claims datasets have fairly poor coverage of commercial data prior to 2017. When examining claims data pre-2017, changes in dataset volumes tend to have more to do with the shift to electronic submission than changes in utilization.

What does “claims submission” look like? How do you know if you’re capturing all of the data?

You don’t know! We get that question all the time. There is no known denominator to calculate the percentage of claims that are in a dataset versus all the claims that have been submitted. Like cloud computing, healthcare delivery is massively distributed, with thousands of payers, millions of providers, and hundreds of millions of patients. Additionally, there are hundreds of software systems built to manage the exchange of claims, many dozens of clearinghouses to move them along their electronic journey, and all sorts of “custom” integrations that route data one way or another.

Wait, what’s a clearinghouse? How do you actually get this data?

When CMS mandated the switch to electronic claim submission in 2003, clearinghouses stepped in to handle the “delivery” of those electronic records (the role the USPS had played for paper). The issue for providers is that there are thousands of payers, and providers cannot reasonably adhere to each payer’s specific submission requirements. Clearinghouses solve that for the providers by acting as a “hub” for claim submissions; providers send all their claims to one place (the clearinghouse), and the clearinghouse routes each claim to the appropriate payer according to that payer’s specification. I recommend reading this excellent summary written by JM Sculley from clearinghouses.org.
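To make the hub idea concrete, here is a minimal sketch in JavaScript. Everything in it is hypothetical: the payer IDs, the `payerSpecs` table, the endpoints, and the `routeClaim` function are invented for illustration and do not describe any real clearinghouse's system.

```javascript
// Hypothetical per-payer submission requirements held by the clearinghouse.
const payerSpecs = {
  PAYER_A: { endpoint: "sftp://payer-a.example/claims", requiredFields: ["memberId", "billingNpi"] },
  PAYER_B: { endpoint: "https://payer-b.example/edi", requiredFields: ["memberId", "billingNpi", "groupNumber"] },
};

// Route one claim: find the destination payer's spec, check the claim against it,
// and either forward the claim or reject it back to the provider.
function routeClaim(claim) {
  const spec = payerSpecs[claim.payerId];
  if (!spec) {
    return { claimId: claim.id, status: "rejected", reason: "unknown payer" };
  }
  const missing = spec.requiredFields.filter((field) => !claim[field]);
  if (missing.length > 0) {
    return { claimId: claim.id, status: "rejected", reason: "missing " + missing.join(", ") };
  }
  return { claimId: claim.id, status: "forwarded", endpoint: spec.endpoint };
}

// The provider sends every claim to one place; the hub fans them out by payer.
const claims = [
  { id: "c1", payerId: "PAYER_A", memberId: "M100", billingNpi: "1234567893" },
  { id: "c2", payerId: "PAYER_B", memberId: "M200", billingNpi: "1234567893" },
  { id: "c3", payerId: "PAYER_Z", memberId: "M300", billingNpi: "1234567893" },
];
const results = claims.map(routeClaim);
console.log(results);
```

In this toy run the first claim is forwarded, the second is rejected because the made-up spec for PAYER_B requires a group number, and the third is rejected because the hub has no spec for that payer.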

image generated by Midjourney — “a doctor trying to row a boat through an endless sea of paper in the style of Georgia O’Keeffe”

Clearinghouses provide additional value beyond claims routing: eligibility verification, electronic remittance advice, and rejection analysis are just a few. These are all compelling incentives to use a clearinghouse, despite the service fees that clearinghouses charge. However, not all providers use clearinghouses, and even a single hospital system or provider group may split its handling of claims based on the destination. For example, many large providers submit claims to CMS directly, while smaller ones might leverage a clearinghouse to do so. Similarly, if a hospital system has a large enough volume with a particular payer, it may arrange to submit to that payer directly, while using a clearinghouse for everything else.

Having aggregated billions of claims transactions, clearinghouses can generate additional revenue by licensing the aggregated datasets they have assembled, as can health plans. Additionally, several companies aggregate claims data from multiple clearinghouses, payers, and provider groups, normalize the data as best they can, and license it to other health economy stakeholders.

This is how big, national, many-billion-row claims datasets come into existence. They are not exhaustive, they are not perfect, and their “completeness” depends largely on clearinghouses’ and payers’ willingness to license access to the web of claims that flow between the providers, clearinghouses, and payers. Pragmatically, what this means is that a dataset might have an excellent representation of a specific hospital while a provider across the street is completely absent. Or, even within a single facility, a particular service line or payer might be missing while the rest of the claims are fully present.

Hopefully, this has cleared up some of the mystery surrounding where these massive, de-identified datasets come from. Now let’s dive into the data itself.

Wait — what do you mean, de-identified?

The datasets that we work with, and large claims datasets generally, are de-identified. This means that the datasets have been stripped of any information that could be used to identify a patient. We touched on this in the Care and Feeding of Healthcare Datasets article, but let’s explore it in a little more depth.

First — there are two methods of de-identifying health information that comply with HIPAA: “Safe Harbor” and “Expert Determination”.

`Safe Harbor`

This method is rule-based and deterministic. Detailed here, it is essentially a list of 18 types of identifiers that must be redacted from the data, which results in no (or extremely low) risk of identification. One requirement of this method is especially burdensome for those of us who are leveraging these datasets for analytics.

“All elements of dates (except year) for dates that are directly related to an individual, including birth date, admission date, discharge date, death date, and all ages over 89 and all elements of dates (including year) indicative of such age, except that such ages and elements may be aggregated into a single category of age 90 or older.”

Obviously, the date of service is a critical piece of information for any analysis. The age requirements are less onerous but do manifest as “weird” age distributions in the data.
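As a rough illustration of what the quoted provision does to a record, here is a minimal sketch. It covers only the date and age rule quoted above, not the rest of the Safe Harbor list, and the record shape and function names (`yearOnly`, `generalizeAge`, `safeHarborDates`) are my own assumptions.

```javascript
// Keep only the year of an ISO date string ("YYYY-MM-DD").
function yearOnly(isoDate) {
  return isoDate.slice(0, 4);
}

// Collapse all ages over 89 into a single "90+" category.
function generalizeAge(age) {
  return age > 89 ? "90+" : String(age);
}

// Apply the date and age generalization to one hypothetical record.
function safeHarborDates(record) {
  const ageCategory = generalizeAge(record.age);
  return {
    ageCategory,
    // For the 90+ bucket, even the birth year would indicate the age, so drop it.
    birthYear: ageCategory === "90+" ? null : yearOnly(record.birthDate),
    // The day and month of service are lost; only the year survives.
    serviceYear: yearOnly(record.serviceDate),
  };
}

const younger = safeHarborDates({ age: 41, birthDate: "1982-03-05", serviceDate: "2023-06-08" });
const older = safeHarborDates({ age: 93, birthDate: "1930-02-14", serviceDate: "2023-06-08" });
console.log(younger); // ageCategory "41", birthYear "1982", serviceYear "2023"
console.log(older);   // ageCategory "90+", birthYear null, serviceYear "2023"
```

The pile-up of everyone over 89 in a single bucket is one source of the “weird” age distributions, and the loss of month and day is why service dates become so hard to use.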

`Expert Determination`

This method is more complex, less clear, and based on statistical assessment rather than enumerated rules. The guidance is detailed here. Essentially, a statistician will examine a dataset and redact data fields until there is a “very small risk” that any individual could be identified. While there is certainly more effort in de-identifying a dataset through expert determination, you will generally end up with a much richer dataset (including dates!) than if you follow the safe harbor method.

Patient Tokenization

This brings us to patient tokenization. Even if you’ve chosen to de-identify a dataset through expert determination, you will still have to redact certain fields to achieve a level of risk of re-identification that qualifies as “very small”. Name, date of birth, gender and street address are among those that almost certainly must be masked. However, maybe there is a way we can have our cake and eat it too? Yes! We can concatenate a patient’s information into one long string, run that through an irreversible hashing mechanism, and now we have an identifier that is both statistically unique and deterministically generated.

What that means in plain English is that if both Big Clearing House A, and Bigger Clearing House B have data for Matthew ONeill, born 1/1/1896, they can both generate the same patient identifier, so that the datasets can be both de-identified and linked. This is a critical step in the process of aggregating claims data from multiple sources.

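// Concatenate the identifying fields in a fixed, agreed-upon order
var token1 = "Matthew" + "ONeill" + "18960101" + "M";
// value: "MatthewONeill18960101M"

// Pseudocode: HASH stands for any keyed, irreversible hash function
HASH(token1, key)
// illustrative output: 5f4dcc3b5aa765d61d8327deb882cf99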

What must be coordinated between the two parties hashing the token, however, is the “key” that is passed to the hashing mechanism. The current dominant player in this space is Datavant, whose business is organized around this concept. They have a good white paper that goes into the detail of tokenization a bit more.
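Here is a minimal runnable sketch of a keyed token, assuming HMAC-SHA256 from Node's built-in `crypto` module as the hashing mechanism. That choice is my assumption for illustration, not a description of what Datavant or any other vendor actually uses, and `sharedKey` and `patientToken` are hypothetical names.

```javascript
const crypto = require("node:crypto");

// Hypothetical shared secret; in practice the key is coordinated between the
// parties out of band and is never stored alongside the data.
const sharedKey = "example-shared-key";

function patientToken(patient, key) {
  // Both parties must use the same fields, in the same order and format.
  const plain = patient.firstName + patient.lastName + patient.birthDate + patient.gender;
  // HMAC-SHA256: a keyed, one-way hash (an assumed stand-in for HASH above).
  return crypto.createHmac("sha256", key).update(plain, "utf8").digest("hex");
}

const patient = { firstName: "Matthew", lastName: "ONeill", birthDate: "18960101", gender: "M" };
const tokenA = patientToken(patient, sharedKey); // computed at clearinghouse A
const tokenB = patientToken(patient, sharedKey); // computed at clearinghouse B
const tokenOtherKey = patientToken(patient, "a-different-key");

console.log(tokenA === tokenB);        // true: same inputs and same key
console.log(tokenA === tokenOtherKey); // false: a different key breaks the link
```

The last line is the point of the paragraph above: two parties who hash identical patient details with different keys end up with tokens that cannot be joined.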

One thing their paper doesn’t address is the problem of “splits” in the patient space. To illustrate this problem, imagine how many ways my last name can be misspelled (ONeil, O”Neill, Neill, Neil, ONeal). I have experienced all these variations, and quite a few more. Because the patient’s last name is a component of the token, if two hospitals spell my name differently (and they will) then two different tokens will be generated! This is a problem that is not easily solved and is often a source of frustration for those of us who work with these datasets extensively. The reality, however, is that the claims volume associated with these “splits” is comparatively small, making it a manageable problem once you are aware of it.
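A small sketch shows how splits arise and why simple cleanup only partly helps. The normalization step (uppercase, letters only) is my own illustrative assumption, not something taken from the white paper, and HMAC-SHA256 with a made-up key again stands in for the real hashing mechanism.

```javascript
const crypto = require("node:crypto");

// Illustrative normalization: uppercase and keep only the letters A-Z.
function normalizeName(name) {
  return name.toUpperCase().replace(/[^A-Z]/g, "");
}

// Build a token for the same person, varying only the last-name spelling.
function token(lastName, normalize) {
  const last = normalize ? normalizeName(lastName) : lastName;
  const plain = "Matthew" + last + "18960101" + "M";
  return crypto.createHmac("sha256", "example-shared-key").update(plain, "utf8").digest("hex");
}

const spellings = ["ONeill", "O'Neill", "Oneill", "ONeil", "Neill"];
const rawTokens = new Set(spellings.map((s) => token(s, false)));
const normalizedTokens = new Set(spellings.map((s) => token(s, true)));

console.log(rawTokens.size);        // 5: every spelling becomes a different "patient"
console.log(normalizedTokens.size); // 3: case and punctuation fixed, true misspellings remain
```

Normalization merges the variants that differ only in case or punctuation, but a dropped letter still produces a new token, which is why splits never fully disappear.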

Now that we know where claims come from, what is on them?

Perhaps counterintuitively, the easiest way to walk through the available data on a claim is through the lens of the paper form. The X12 837 specification is complex, and it is easy to get lost in the detail of loops, segments, and elements. The paper form, pictured below, is far more intuitive.

If you want to know more, I have included a field-by-field overview of the CMS 1500 form here. If you are new to claims data, I suggest skimming through it for a sense of what is available.

At a high level, a claim contains the following:

Patient demographics
Patient diagnoses
Rendering, referring, and billing provider information
Procedures that were performed
A site of service address
A billing address
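For readers who think in data structures, the list above could be pictured as something like the following object. This is a hypothetical, heavily simplified shape: the field names and values are invented for illustration and are not taken from the CMS 1500 form or the 837 specification.

```javascript
// Hypothetical, simplified claim; all field names and values are illustrative.
const claim = {
  patient: { token: "a1b2c3d4", birthYear: 1980, gender: "F" },
  diagnoses: ["J01.90", "R05.9"], // ICD-10-CM diagnosis codes
  providers: { rendering: "1234567893", referring: null, billing: "1234567893" }, // NPIs
  procedures: [{ code: "99213", serviceDate: "2023-06-08", charge: 150 }], // one entry per procedure
  serviceAddress: { city: "Springfield", state: "IL" },
  billingAddress: { city: "Springfield", state: "IL" },
};

// The high-level sections from the list above, as keys on the hypothetical object.
const SECTIONS = ["patient", "diagnoses", "providers", "procedures", "serviceAddress", "billingAddress"];

// Report which sections are absent or empty on a claim.
function missingSections(c) {
  return SECTIONS.filter((key) => {
    const value = c[key];
    if (value == null) return true;
    if (Array.isArray(value)) return value.length === 0;
    return Object.keys(value).length === 0;
  });
}

const complete = missingSections(claim);
const noDiagnoses = missingSections({ ...claim, diagnoses: [] });
console.log(complete);    // empty array: nothing is missing
console.log(noDiagnoses); // contains only "diagnoses"
```

A check like `missingSections` is the kind of first-pass profiling worth running on any new claims feed, since whole sections can be sparse depending on the source.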

OK, but the paper form has largely been deprecated; what is the electronic format?

Claims are submitted in ANSI ASC X12N 837 format. What does that mean? Let’s break it down:

ANSI: American National Standards Institute
ASC: Accredited Standards Committee
X12N: Insurance section of ASC X12 for the health insurance industry’s administrative transactions
837: Standard format for transmitting health care claims electronically
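To give a feel for what segments and elements look like, here is a minimal sketch that splits a made-up 837P fragment. It assumes the commonly seen delimiters (`~` ends a segment, `*` separates elements); real files declare their delimiters in the ISA header and contain many more segments and loops. The sample values are invented, and reading the first two CLM elements as the claim identifier and total charge reflects my understanding of the layout rather than a quotation from the specification.

```javascript
// Hypothetical fragment: subscriber name, claim header, diagnosis, one service line.
const fragment =
  "NM1*IL*1*DOE*JANE****MI*XYZ123~" +
  "CLM*ACCT-001*150***11:B:1*Y*A*Y*Y~" +
  "HI*ABK:J0190~" +
  "SV1*HC:99213*150*UN*1***1~";

// Split raw EDI text into segments, each with an id and its list of elements.
function parseSegments(edi, segmentTerminator = "~", elementSeparator = "*") {
  return edi
    .split(segmentTerminator)
    .map((raw) => raw.trim())
    .filter((raw) => raw.length > 0)
    .map((raw) => {
      const [id, ...elements] = raw.split(elementSeparator);
      return { id, elements };
    });
}

const segments = parseSegments(fragment);
const clm = segments.find((segment) => segment.id === "CLM");

console.log(segments.map((segment) => segment.id)); // NM1, CLM, HI, SV1
console.log(clm.elements[0]); // claim identifier in this sample: "ACCT-001"
console.log(clm.elements[1]); // total charge in this sample: "150"
```

This is nowhere near a real parser (it ignores envelopes, loops, and composite elements such as `11:B:1`), but it shows why the format is easy to tokenize and hard to interpret.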

There are oceans of ink spilled on the electronic specification; here are a few useful links:

Stedi provides a very useful reference and EDI inspector (registration required).
CMS overview of the relationship between the 837 and the CMS 1500
NUCC mapping from the 1500 form to 837 format

There are three types of 837 transactions: 837P, 837I, and 837D. The “P” stands for “professional”, the “I” stands for “institutional”, and the “D” stands for “dental”. We’ll focus on the “P” and “I” transactions, as they are the most common and most relevant to our work.

Whether an 837P or 837I is submitted is determined by the organization type, not by the service provided. If the organization or individual is a physician, practitioner, or supplier, they should submit an 837P. If the organization is an institutional provider, it should submit an 837I.

On page three of CMS’s 837I fact sheet is a list of institutional provider types that should be submitting an 837I.

That list is replicated here for convenience:

Hospitals
Skilled Nursing Facilities (SNFs)
ESRD providers
Home Health Agencies (HHAs)
Hospice Organizations
Outpatient Physical Therapy/Occupational Therapy/Speech Pathology Services
Comprehensive Outpatient Rehabilitation Facilities (CORFs)
Community Mental Health Centers (CMHCs)
Critical Access Hospitals (CAHs)
Federally Qualified Health Centers (FQHCs)
Histocompatibility Laboratories
Indian Health Service (IHS) Facilities
Organ Procurement Organizations
Religious Non-Medical Health Care Institutions (RNHCIs)
Rural Health Clinics (RHCs)

Now, you may be wondering about ASCs (ambulatory surgery centers), FSEDs (free-standing emergency departments), and UCCs (urgent care centers). These are something of a special case: their claims are often submitted as an 837P, even though they are technically institutional providers. The reason is that these facilities are often owned by physicians and frequently bill as if they were physician offices. This is a bit of a grey area, and you’ll see different payers handle it differently.
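The rule of thumb described above can be sketched as a small lookup. The provider-type codes below are shorthand I made up for this example (only a subset of the CMS list is included), and defaulting everything else to 837P is a simplification.

```javascript
// Subset of the institutional provider types listed above (hypothetical codes).
const INSTITUTIONAL_TYPES = new Set([
  "hospital", "snf", "esrd", "hha", "hospice", "corf", "cmhc", "cah", "fqhc", "rhc",
]);

// Facility types that often bill as professional despite being institutional.
const GREY_AREA_TYPES = new Set(["asc", "fsed", "ucc"]);

// Guess the transaction type from the organization type, not the service provided.
function expectedTransaction(providerType) {
  const type = providerType.toLowerCase();
  if (GREY_AREA_TYPES.has(type)) {
    return { transaction: "837P", certain: false }; // varies by payer
  }
  if (INSTITUTIONAL_TYPES.has(type)) {
    return { transaction: "837I", certain: true };
  }
  // Physicians, practitioners, and suppliers fall through to professional.
  return { transaction: "837P", certain: true };
}

console.log(expectedTransaction("SNF"));       // 837I, certain
console.log(expectedTransaction("physician")); // 837P, certain
console.log(expectedTransaction("ASC"));       // 837P, not certain
```

The `certain` flag is there as a reminder that for ASCs, FSEDs, and UCCs the claim type alone tells you little about the kind of facility that submitted it.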

image generated by Midjourney — “a surrealist paper claims form”

OK, we know what claims look like and where they come from. What can we do with them?

A lot. Claims data enables an almost endless number of analyses, especially if you are viewing it through the lens of a healthcare provider or payer, or working in life sciences. Understanding how and why a dataset comes into existence is important when thinking through what insights you can glean from it.

Your head might be swimming with all the possibilities, and that’s OK! This is a lot to digest, so in the next article we’ll cover data caveats and some introductory use cases.

Happy analyzing!