Care and Feeding of Healthcare Datasets

Published in

The Trilliant Health Tech Blog

13 min read · May 16, 2023

Healthcare data sources are so varied and diffuse that the first question to ask has nothing to do with the data at all; instead, it is vital that we understand the problem we are trying to solve. Only then can we begin to assemble a dataset that will enable us to answer the questions we may have. At Trilliant Health, our questions are fundamental: who is doing what to whom, where did it happen and how much did it cost?

image generated by Midjourney — “a watercolor of a person looking at a field of databases”

That is a question that is simultaneously specific and broad: how can I track a patient’s total cost of care, analyze a specific physician’s referrals, or determine procedure volumes at a single facility? At the same time, how can I find the median cost of care for a cohort of patients, or analyze network patterns for a group of physicians, or determine procedure volumes for all facilities in a market?

There are fundamentally two different sets of data that can be used to answer these questions: clinical data and claims data. It is important to note that neither of these datasets is the activity itself; instead, they are evidence that the activity happened. They should be viewed as a sort of fossil record or exhaust of the healthcare engine.

Clinical data is generated by an EMR system, generally at the point of care. The clinical dataset might have physician and nurse notes, lab results, imaging interpretations and other information generated by the provider. It likely has procedure codes and diagnosis codes related to a patient as well, but not necessarily billing or remittance information. Most importantly, we should understand that the clinical dataset is likely limited to a single facility or at best a large hospital system. While you may have complete visibility of what happened at that particular location, there is still no visibility into that patient’s activity at another facility or a surgeon’s case volume at another hospital.


Claims data, instead of being a by-product of the patient/provider interaction, is a by-product of the billing process: claims are submitted from a provider to a payer and remitted from the payer back to the provider. Setting aside the content variation between claims and clinical data, the fundamental difference between the two is that claims data almost always leaves the four walls of the facility, because the provider wants to be reimbursed for the care they provided. Because claims data has to flow from a provider -> payer -> provider, the datasets are not segregated by facility but are instead available at a market level.

Comparison of data attributes in claims and clinical datasets

We’ll take a much deeper dive into claims data, where it can be sourced, and the nuances of it in a later article: “What, exactly, is Claims Data?”. For now, suffice it to say that claims data helps us answer a few of our fundamental questions: who is doing what to whom, and where did it happen (we’ll talk about the “how much did it cost?” part in a later article when we get to Health Plan Price Transparency data). Claims data can help us answer those questions because providers want to be paid for the services they render. To receive payment, a provider must submit a claim to a payer, and the payer will almost always require at least four key pieces of information:

who is the beneficiary (the patient on the claim),
who provided the service (a Type 1 or Type 2 NPI; hopefully both),
what service was provided (CPT, HCPCS, ICD, etc.) and
when the service was provided (date of service).

If we’re lucky, there is more information available as well, such as a billing address, facility address, diagnosis codes, place of service codes and much more. But we can almost always rely on those four pieces of information to be present, because the provider wants to be reimbursed.
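As a rough illustration of those four core fields, here is a minimal sketch of a claim line and a completeness check. The class and field names (`ClaimLine`, `patient_token`, `rendering_npi`, `billing_npi`) are hypothetical and chosen for this example only; real claim layouts carry many more fields and vary by source.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class ClaimLine:
    """Hypothetical, simplified view of one service line on a claim."""
    patient_token: str             # who received the service (de-identified)
    rendering_npi: Optional[str]   # Type 1 NPI, if present
    billing_npi: Optional[str]     # Type 2 NPI, if present
    procedure_code: str            # what was done (CPT, HCPCS, ...)
    service_date: Optional[date]   # when it was done


def missing_core_fields(line: ClaimLine) -> List[str]:
    """Return the names of the four core fields that are absent."""
    missing = []
    if not line.patient_token:
        missing.append("patient_token")
    # Either NPI type is enough to answer "who did it?"
    if not (line.rendering_npi or line.billing_npi):
        missing.append("npi")
    if not line.procedure_code:
        missing.append("procedure_code")
    if line.service_date is None:
        missing.append("service_date")
    return missing
```

A line with an empty list returned from `missing_core_fields` has enough information to answer who, what, to whom and when.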

At Trilliant Health, the vast majority of our products are built using claims datasets. We strive to understand market and population level analyses; we want to answer questions about healthcare economics and strategy – this requires datasets that cross the boundaries of facilities and health systems.

Now that we’ve established why we use claims data, and that claims data almost always has a core set of data, let’s talk about how we can use those pieces of information to answer our fundamental questions.

Who did it?

This question is answered by the NPIs present on a claim. NPI stands for National Provider Identifier and is a unique 10-digit identifier assigned to a provider by CMS. The NPI was adopted as the standard provider identifier in 2004, under the Administrative Simplification provisions of HIPAA (which passed in 1996).

CMS provides an NPI FAQ that I would encourage you to read, but the important pieces of information are summarized below:

All healthcare providers who are HIPAA-covered entities, whether individuals or organizations, must get an NPI.
Covered healthcare providers, all health plans and healthcare clearinghouses must use NPIs in their administrative and financial transactions.

So – every provider must have an NPI, and those NPIs must be used in administrative and financial transactions (i.e., claims data). Great! So now that we have an NPI, how do we translate that into a person or an organization? CMS manages the issuance and dissemination of NPIs through NPPES, the “National Plan and Provider Enumeration System”. This system allows you to search individual NPIs (search here), or to download a dataset of all NPIs (here).
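An NPPES lookup is only useful if the NPI on the claim is well formed. The NPI standard defines the tenth digit as a Luhn check digit, computed over the nine-digit identifier with the prefix 80840 prepended, so a cheap sanity check can run before any lookup. A minimal sketch follows; the function names are ours and are not part of any CMS tooling.

```python
def luhn_checksum_ok(digits: str) -> bool:
    """Return True if a digit string passes the Luhn check."""
    total = 0
    # Walk right to left; double every second digit, starting with the
    # digit immediately left of the check digit.
    for position, char in enumerate(reversed(digits)):
        value = int(char)
        if position % 2 == 1:
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def is_valid_npi(npi: str) -> bool:
    """Check the format and check digit of a 10-digit NPI."""
    if len(npi) != 10 or not (npi.isascii() and npi.isdigit()):
        return False
    # The check digit is computed with the prefix 80840 prepended.
    return luhn_checksum_ok("80840" + npi)
```

Passing this check only means the number is well formed. It says nothing about whether the NPI has been issued or is still active, which is what NPPES is for.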

There are two categories of providers in the NPI enumeration system: Type 1 (Individual) and Type 2 (Organization). Generally, you can think of a Type 1 NPI as a person: a physician, nurse, pharmacist, or physical therapist, for example. However, a Type 1 NPI could also represent an entity formed as a Sole Proprietor, which can cause confusion at times.

Similarly, you can think of a Type 2 NPI as an organization. An organization could have a single employee (an incorporated individual) or thousands. To make things a little more complex, an organization can have “subparts”, which aren’t legal entities themselves but may conduct transactions on their own. These subparts will be assigned their own NPI. Finally, we all know that corporate structure can be as simple or complex as the corporation desires. A large, national hospital chain may end up with many separate legal entities, with many subparts within each entity. This is how you end up with a single “entity” that the general public views as a monolith (HCA, UHS, etc.) having hundreds or thousands of Type 2 NPIs associated with it. We’ll dive into this topic much deeper when we talk about the need for a Provider Directory.

So, NPPES can help us translate an NPI into a person or organization; it can also provide other pieces of information:

a Taxonomy (in short, the provider’s specialty) for that Type 1 or Type 2 NPI.
a practicing status (active or inactive).
a primary and secondary practice address.
a last updated date.

Let’s take a look at that last item, no. 4. If you read the NPI FAQ (you did, didn’t you?), you might have noticed that it was never specified that a provider or entity must update their information on a specific frequency, or at all.

For example, let’s select a random provider: Dr. Nicole S Gunasekera, NPI: 1457889909. If you take a look at her NPPES record, it hasn’t been updated since 2017. Maybe that taxonomy is still correct, and maybe Dr. Gunasekera still practices at the listed primary or secondary practice addresses, but I wouldn’t put much faith in it. In fact, I would put very little faith in it — a quick Google search shows that Dr. Gunasekera is now a Dermatologist practicing at Brigham Dermatology Associates; she completed her internship at Beth Israel Deaconess (which happens to be the address listed in her NPPES record), and she certainly does not have a primary taxonomy of “Student in an Organized Health Care Education” anymore.
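One simple defense is to treat the last updated date as a confidence signal rather than ignoring it. The sketch below flags a record as stale once it passes an age threshold. The two-year default is an arbitrary assumption for illustration, not a recommendation from CMS or a rule we are claiming to use.

```python
from datetime import date


def years_since_update(last_updated: date, as_of: date) -> float:
    """Approximate age of a record in years."""
    return (as_of - last_updated).days / 365.25


def is_stale(last_updated: date, as_of: date, max_age_years: float = 2.0) -> bool:
    """Flag a record whose last update is older than the threshold.

    max_age_years is a hypothetical cutoff; tune it to your use case.
    """
    return years_since_update(last_updated, as_of) > max_age_years
```

A record last touched in 2017 and evaluated in 2023 would be flagged, which is a prompt to corroborate the taxonomy and addresses against another source before relying on them.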

Again, we’ll talk about this more in our Provider Directory article in the future. So, now that we can gather some information about who provided care (both the individual and the organization), let’s move on to what did they do?

What did they do?

This question is answered by the procedure codes that are included on the claim submission. There are a few different datasets for “procedure” codes, and the likelihood of seeing a particular set depends on the type of service provided and the setting of care.

image created by Midjourney — “a naturalist illustration of a healthcare database”

CPT

Current Procedural Terminology (CPT) codes describe outpatient procedures. CPT is a registered trademark of the American Medical Association (AMA), which also holds the copyright on its contents. The AMA publishes data files on a quarterly basis. The data files include codes with their respective long, medium and short descriptions. If you are interested in a deeper dive, the AAPC provides a great overview of CPT; meanwhile, we’ll cover the highlights.

There are three categories of CPT codes; however, the vast majority of codes you encounter in the wild will be Category I.

Category I: commonly used by providers to report their services and procedures.
Category II: supplemental tracking codes used for performance measurement.
Category III: temporary codes used to report emerging and experimental services and procedures.

CPT codes are bucketed into ranges based on the type of service provided, as opposed to hierarchical levels like ICD-10 codes (which we’ll talk about in a minute).

Evaluation & Management (99202–99499)
Anesthesia (00100–01999)
Surgery (10021–69990)
Radiology Procedures (70010–79999)
Pathology and Laboratory Procedures (80047–89398)
Medicine Services and Procedures (90281–99607)

Temporary codes show up in category III and are depicted with four numbers and the letter “T” (e.g., 0001T). This is where you should be looking for new procedures that are being introduced into the market, and they can live as a temporary code for a number of years before the AMA moves them (or not) into a permanent category.
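The ranges and the "T" suffix above are enough for a rough classifier. The sketch below uses the section ranges exactly as listed in this article. Two details are our own assumptions: Category II codes are recognized by four digits plus the letter "F", and because the Medicine range overlaps the Evaluation & Management range numerically, E&M is checked first. Anything else (for example, other alphanumeric codes) falls through to "Unknown".

```python
import re

# (section name, low, high), in the order they should be tested.
# E&M comes first because 99202-99499 sits inside the Medicine range.
CPT_SECTIONS = [
    ("Evaluation & Management", 99202, 99499),
    ("Anesthesia", 100, 1999),
    ("Surgery", 10021, 69990),
    ("Radiology Procedures", 70010, 79999),
    ("Pathology and Laboratory Procedures", 80047, 89398),
    ("Medicine Services and Procedures", 90281, 99607),
]


def classify_cpt(code: str) -> str:
    """Return a coarse label for a CPT code based on its shape and range."""
    code = code.strip().upper()
    if re.fullmatch(r"\d{4}T", code):
        return "Category III (temporary)"
    if re.fullmatch(r"\d{4}F", code):
        return "Category II (tracking)"
    if re.fullmatch(r"\d{5}", code):
        number = int(code)
        for name, low, high in CPT_SECTIONS:
            if low <= number <= high:
                return f"Category I: {name}"
    return "Unknown"
```

This is only a bucketing aid. It does not confirm that a code exists in the current AMA release.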

CPT codes are assigned (naturally) at the procedure level, not the claim level, so you will often see multiple CPT codes on a single claim.

DRGs

DRG stands for Diagnosis Related Groups. DRG is a system of classifying a patient’s hospital stay into various groups in order to facilitate payment of services. DRG codes come from two systems, Medicare Severity (MS-DRG, managed by CMS here), and All Patient Refined (APR-DRG, managed by 3M here). MS-DRG codes are 3-digit. APR-DRG codes are 4-digit, with the first 3 digits corresponding to diagnoses/procedures (like a normal DRG code) and the final digit signifying severity in the range of 1–4 (1 = least severe, 4 = most severe). DRGs from either system are likely to show up on institutional claims only, and are assigned at the claim level, not the procedure level.
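The 4-digit APR-DRG layout described above can be split mechanically into its base code and severity. A minimal sketch, assuming the code arrives as a 4-character string with the severity in the last position; the `AprDrg` type and function name are hypothetical.

```python
from typing import NamedTuple


class AprDrg(NamedTuple):
    base_drg: str   # first three digits
    severity: int   # 1 = least severe, 4 = most severe


def parse_apr_drg(code: str) -> AprDrg:
    """Split a 4-digit APR-DRG code into base DRG and severity."""
    code = code.strip()
    if len(code) != 4 or not (code.isascii() and code.isdigit()):
        raise ValueError(f"expected a 4-digit APR-DRG code, got {code!r}")
    severity = int(code[3])
    if not 1 <= severity <= 4:
        raise ValueError(f"severity must be 1-4, got {severity}")
    return AprDrg(base_drg=code[:3], severity=severity)
```

Keeping the base code as a string preserves leading zeros, which matter when joining to a DRG reference table.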

ICD-10-PCS

ICD-10-PCS is a classification system that is used for coding procedures and services provided in the inpatient setting of hospitals. Not to be confused with ICD-10-CM diagnosis codes, these are procedure codes. The presence of an ICD-10-PCS code instead of, or in combination with, a DRG code is largely driven by the contracts between the payer and the provider, and what the payer wants to see on the claim. ICD-10-PCS codes are only used in inpatient and hospital settings.

Revenue Codes

Revenue codes (managed by the National Uniform Billing Committee, or NUBC, reference here) are also used almost exclusively in inpatient and hospital settings. These codes are used to indicate the department or place in which a procedure or treatment is performed — an emergency room, operating room, or some other department. This information can be useful when trying to determine if the claim is related to an inpatient stay, for example.

HCPCS

HCPCS codes classify outpatient medical supplies, equipment, medications and services not included in CPT. HCPCS data files come from the Centers for Medicare & Medicaid Services (CMS) on an annual basis. The data files include long/short descriptions along with internal (sub-)categories.

Who did they do it to?

The who of a claim is determined by a patient’s demographic information (name, date of birth, address, etc.). However, we deal exclusively with claims data that has been de-identified through the expert determination method defined here. If we redact that information, how can we link claims within a patient? We generate an irreversible hash (a token, or identifier) for each patient from their demographic information, and we use that. We’ll talk about that process in more detail in the “What is Claims Data?” article; for now, rest assured we have an identifier that is generally unique to a patient, so while we don’t know who a patient is, we can track their longitudinal care across providers and organizations.
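To make the idea of an irreversible token concrete, here is a generic sketch of keyed hashing over normalized demographic fields. It is not a description of the process we actually use (that comes in the later article). The choice of fields, the normalization rules and the use of HMAC-SHA256 are all assumptions made for illustration.

```python
import hashlib
import hmac
import unicodedata


def normalize(value: str) -> str:
    """Uppercase and strip accents, punctuation and whitespace."""
    decomposed = unicodedata.normalize("NFKD", value)
    return "".join(ch for ch in decomposed if ch.isalnum()).upper()


def patient_token(first_name: str, last_name: str, date_of_birth: str,
                  zip_code: str, secret_key: bytes) -> str:
    """Derive a stable token from demographics using a keyed hash.

    The same inputs and key always produce the same token, so claims for
    one patient can be linked without storing who the patient is.
    """
    parts = (first_name, last_name, date_of_birth, zip_code)
    message = "|".join(normalize(part) for part in parts)
    return hmac.new(secret_key, message.encode("utf-8"), hashlib.sha256).hexdigest()
```

Normalizing first means "Jane " and "JANE" produce the same token. The secret key means someone without it cannot rebuild tokens from a list of names and birth dates. The token is only "generally unique", as noted above: two people with identical inputs would collide, and one person whose details change would split.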

Some other information that exists on the claim can give us clues about the “who”: the claim filing indicator code and the diagnosis codes.

Claim Filing Indicator Codes

These codes can give us an idea of the type of coverage a patient has, which can give some context surrounding the claim. There aren’t many of these codes, so we’ll just list them here:

+----+-----------------------------------------------------+
|            Claim Filing Indicator Codes                  |
+----+-----------------------------------------------------+
| 09 | Self-pay                                            |
| 11 | Other Non-Federal Programs                          |
| 12 | Preferred Provider Organization (PPO)               |
| 14 | Exclusive Provider Organization (EPO)               |
| 15 | Indemnity Insurance                                 |
| 16 | Health Maintenance Organization (HMO) Medicare Risk |
| BL | Blue Cross/Blue Shield                              |
| CH | CHAMPUS                                             |
| CI | Commercial Insurance Co.                            |
| HM | Health Maintenance Organization                     |
| MA | Medicare Part A                                     |
| MB | Medicare Part B                                     |
| MC | Medicaid                                            |
| OF | Other Federal Program                               |
| VA | Veterans Affairs Plan                               |
| WC | Workers Compensation Health Claim                   |
+----+-----------------------------------------------------+
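In practice this table usually ends up as a small lookup. The sketch below covers only the codes listed above; anything else, including a missing value, is returned as "Unknown" rather than guessed. The function name is ours.

```python
from typing import Optional

CLAIM_FILING_INDICATORS = {
    "09": "Self-pay",
    "11": "Other Non-Federal Programs",
    "12": "Preferred Provider Organization (PPO)",
    "14": "Exclusive Provider Organization (EPO)",
    "15": "Indemnity Insurance",
    "16": "Health Maintenance Organization (HMO) Medicare Risk",
    "BL": "Blue Cross/Blue Shield",
    "CH": "CHAMPUS",
    "CI": "Commercial Insurance Co.",
    "HM": "Health Maintenance Organization",
    "MA": "Medicare Part A",
    "MB": "Medicare Part B",
    "MC": "Medicaid",
    "OF": "Other Federal Program",
    "VA": "Veterans Affairs Plan",
    "WC": "Workers Compensation Health Claim",
}


def describe_filing_indicator(code: Optional[str]) -> str:
    """Translate a claim filing indicator code into its description."""
    if code is None:
        return "Unknown"
    return CLAIM_FILING_INDICATORS.get(code.strip().upper(), "Unknown")
```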

ICD-10-CM codes

ICD-10-CM is the US Clinical Modification of the 10th revision of the International Classification of Diseases (ICD-10). The ICD itself is published by the World Health Organization (WHO); the clinical modification is maintained by the CDC’s National Center for Health Statistics under authorization from the WHO. There is a great FAQ on ICD-10-CM here. What is important from a dataset perspective is that ICD-10-CM is a classification of a patient’s condition or diagnosis, and not of a procedure performed on them. ICD-10-CM codes are assigned at the claim level, not the procedure level, and are generally listed in order of importance, with the first code being the primary diagnosis, the second code being the secondary diagnosis and so on.
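Because position carries meaning, it helps to keep diagnosis codes as an ordered list rather than a set. A minimal sketch, assuming the codes arrive in claim order; it also pulls out the three-character category that forms the top level of the ICD-10-CM hierarchy. The function names are hypothetical.

```python
from typing import List, Optional


def primary_diagnosis(codes: List[str]) -> Optional[str]:
    """Return the first-listed diagnosis code, or None if there are none."""
    return codes[0] if codes else None


def icd10cm_category(code: str) -> str:
    """Return the three-character category of an ICD-10-CM code.

    Codes may arrive with or without the decimal point (E11.9 or E119).
    """
    return code.replace(".", "").strip().upper()[:3]
```

For a claim with diagnoses `["E11.9", "I10"]`, the primary diagnosis is `E11.9` and its category is `E11`, which is a convenient level for grouping patients into cohorts.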

Where did they do it?

Finally, we have the “where” of a claim, which can be determined in a number of ways. If we are lucky, the claim will have both a site of service address (also called a “facility” address) and a billing address.

image created by Midjourney: “an ambulance at a hospital with a doctor in the style of Monet”

However, even if we have both of those, we cannot always trust them. Depending on the type of claim, the procedure and administrative practices, both of those addresses may be “correct”, but misleading. For example, a professional bill from a surgeon may simply list an administrative office in both address fields, but not the hospital address where the procedure was performed (you could expect that on the institutional bill). Another example is a radiologist performing a remote interpretation of a CT scan; the “site of service” might be an office location a thousand miles away from where the scan took place, but the patient was never physically there! Even setting these complications aside, the address provided may be a hospital campus or a multi-office building, which makes it hard to know which specific entity should be credited with the volume for that claim.

There are, however, a number of fields on a claim that can give you some clues:

NPIs, NPPES and Addresses

Didn’t we already talk about NPIs and an upcoming Provider Directory article? Yes, we did, but we’re going to talk about them again here because they are so important. However, what we want to point out is that in this case, using the NPI to connect a claim to an address from NPPES will guarantee you a bad time. There are two primary reasons why:

NPPES records can be incredibly out of date.
Providers can work at multiple locations.

Even if NPPES records were 100% up to date, you still cannot map from a claim to an address using an NPI, because providers often work at multiple locations. Let me state it a third time: providers change jobs as often as people in any other profession, and a single entity often has many locations at which a provider will see patients. Are there ways to solve this? Yes: see the (upcoming) Provider Directory article.
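The ambiguity is easy to see in code. The sketch below builds an index from NPI to known practice locations and only resolves a location when exactly one candidate exists. The location labels and the idea of an "affiliations" list are hypothetical stand-ins for whatever a provider directory would supply.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Set, Tuple


def locations_by_npi(affiliations: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Group (npi, location) pairs into a set of locations per NPI."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for npi, location in affiliations:
        index[npi].add(location)
    return index


def resolve_location(npi: str, index: Dict[str, Set[str]]) -> Optional[str]:
    """Return a location only when the NPI maps to exactly one."""
    candidates = index.get(npi, set())
    if len(candidates) == 1:
        return next(iter(candidates))
    return None  # unknown NPI, or ambiguous across several locations
```

Returning `None` for the ambiguous case is a deliberate choice: picking the first address silently would attribute volume to the wrong place without any warning.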

Place of Service Codes

Place of service codes are a great way to get a sense of where a service was rendered. These are managed by CMS, and there is a reference here. POS codes can be very useful in determining the setting of care (inpatient, telehealth, etc.) when other methods are inconclusive. Fair warning though: POS codes are not always present, and when they are, they are not always accurate. You will often see POS codes that are inconsistent with other information on the claim.
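A small lookup is usually the first step in turning POS codes into a setting of care. The sketch below covers only a handful of common codes from the CMS list and groups them into coarse settings. The grouping labels are our own simplification, and a missing or unrecognized code is returned as "Unknown" rather than guessed.

```python
from typing import Optional

# A deliberately partial mapping; the full CMS list is much longer.
POS_SETTINGS = {
    "02": "Telehealth",            # telehealth, other than patient's home
    "10": "Telehealth",            # telehealth, in patient's home
    "11": "Office",
    "21": "Inpatient Hospital",
    "22": "Outpatient Hospital",   # on-campus outpatient hospital
    "23": "Emergency Room",
}


def setting_of_care(pos_code: Optional[str]) -> str:
    """Map a place of service code to a coarse setting of care."""
    if pos_code is None or not pos_code.strip():
        return "Unknown"
    # Codes are two digits; restore a leading zero lost to numeric typing.
    normalized = pos_code.strip().zfill(2)
    return POS_SETTINGS.get(normalized, "Unknown")
```

Given the accuracy caveats above, the result is best treated as one clue to weigh against revenue codes and claim type, not as the final answer.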

Census data

While census data is not technically a healthcare dataset, we would be remiss if we didn’t mention its use to give geographic insights on both the patient location and the setting of care. The 2020 Census has a wealth of information on the demographics of the US population and can be linked to healthcare datasets through the geographic hierarchy. One thing to note: ZIP Codes do not equal ZIP Code Tabulation Areas (ZCTAs)! A ZIP Code is a collection of mail delivery routes, not an area with fixed boundaries, while a ZCTA is the Census Bureau’s approximation of that delivery area, built from whole census blocks. There are lots of commercial datasets that do a good job of mapping one to the other. The core census geographies (blocks, block groups, tracts, counties and states) are MECE (mutually exclusive, collectively exhaustive) and nest cleanly within one another. ZCTAs sit outside that nesting, so a ZCTA, like a ZIP Code, can cross county boundaries, and rolling either one up to a county requires a crosswalk.

Summary

Whew! That is a lot of different datasets and acronyms to wrap our heads around. Don’t worry — we’ll frequently dive into more details over this entire series of articles, so feel free to come back to this article to remind yourself of a definition or to find a link. Below is a summary of which datasets tend to lend themselves to what uses:

Categories of information for different data sources

Thank you for reading! Our next article is a deeper dive into a healthcare claim, so stick around if you’re interested in learning more.