Building a Doctor Directory

Why is it so hard to get right?

Drausin Wulsin
Oscar Tech
9 min read · Sep 15, 2017


Have you ever found a doctor on your health insurer’s website, only to call their office and be told that they don’t actually take your insurance or they no longer practice at that office? You might think that posting accurate information should be as easy as uploading a phone book-esque spreadsheet, but even after decades of publishing their doctor directories, health insurers still struggle to do it well. Why is it still so hard?

Searching for a doctor is a universal pain point in interacting with our health care system. When you search for a pediatrician, an urgent care clinic, or an orthopedic surgeon near you, you’re searching the doctor directory. An insurance company’s doctor directory is the list of all the doctors (and places like urgent care clinics and pharmacies) in its network along with details like their office locations, phone numbers, specialties, and credentials.

The doctor directory is one of Oscar’s core data assets. In addition to the search tools our members use when trying to find a doctor on our website or mobile app, it gives our Concierge Teams the details they need to help members when they call with questions. It feeds our claims system so we know which primary care doctor to pay to cover a member’s annual physical. Finally, it informs how we construct our networks so that our members can see top-notch doctors across all the specialties they might need.

Doctor directory results for a simple doctor search to treat asthma in New York City.

Building and maintaining an accurate, current, and useful directory is a complex data and technology problem that Oscar has worked to solve. Below, I discuss some of the technical aspects of how we build our directory and what makes it challenging to do well.

The data feeding the directory comes from many sources: health systems and doctor groups; national networks of facilities, labs, and urgent care clinics; and third-party vendors who curate specific data like credentials and certifications. Every week, we receive 50–100 different data files across all of these partners. Currently, these feeds come mostly from our partners in our three 2017 network states (New York, California, and Texas). As we expand in 2018 into Tennessee, Ohio, New Jersey, and new areas of Texas and California, our data integrations will grow accordingly.

These integrations, which we call “rosters” (yes, we think of our doctors as akin to star athletes!), contain basic information about each doctor, like their name, ID number, specialties, and office locations, as well as more detailed attributes like what languages they speak, their board certifications and licensing, and occasionally even their headshots. The data most often arrives as a CSV file, but it’s common for us to receive it in Excel (*.xlsx) spreadsheets, over email, or in one of a few specialized provider data formats (e.g., the fixed-width PNDS format). Sometimes the data in these files represents a simple change to the system (e.g., “add these providers to the network”), but often it represents a snapshot view of the world (e.g., “here are all the providers we have”). The CSVs are small, relatively speaking, usually in the hundreds to thousands of rows.
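
To make this concrete, here’s a minimal sketch (in Python, with hypothetical file layouts; not our actual ingestion code) of what loading one of these heterogeneous roster files can look like before any real processing begins:

```python
import csv
from pathlib import Path

def load_roster(path):
    """Load one roster file into a list of raw {column: value} dicts.

    Whether those rows are a delta ("add these providers") or a snapshot
    ("here are all the providers") has to be decided per partner, upstream
    of any parsing.
    """
    suffix = Path(path).suffix.lower()
    if suffix == ".csv":
        with open(path, newline="", encoding="utf-8") as f:
            return list(csv.DictReader(f))
    if suffix == ".xlsx":
        # openpyxl is one common way to read Excel rosters.
        from openpyxl import load_workbook
        rows = load_workbook(path, read_only=True).active.iter_rows(values_only=True)
        header = [str(h) for h in next(rows)]
        return [dict(zip(header, row)) for row in rows]
    raise ValueError(f"unsupported roster format: {suffix!r}")
```

Even this toy version has to branch on file format before it can see a single provider; the real pipeline also has to handle email attachments and fixed-width formats like PNDS.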

At this point, most readers are probably thinking, “Integrating small CSVs of doctors? Seems pretty easy.” But of course, it’s not easy. Some high-level complexities are:

  • Data messiness (nonstandardized, dirty, incomplete, ambiguous, stale).
  • Determining network status (e.g., “is doctor X in network?”).
  • Reconciling doctors (as well as offices and facilities) from multiple sources.

Data messiness

There is no single, standard data format for provider data. All components, including delivery (file, REST), form (CSV, fixed width, Avro, etc.), schema, data formats (MM/DD/YY or YYYY-MM-DD?), and even semantic codes (specialties, languages) come in different shapes and sizes. Sometimes important data like the ID number, last or first name, TIN (Taxpayer Identification Number), or specialties is missing for no clear reason. Sometimes it’s missing for a good reason; for example, it might be understood that all doctors in the roster have the same TIN or specialty. Other times, all the necessary information is included, but the file is formatted incorrectly, causing fun hiccups like a column of identifier fields being interpreted as numbers, which truncates their leading zeros and invalidates the data. We get thousands of 4-digit zip codes and 8-digit SSNs every week.
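
One small defense against the leading-zero problem is to treat every cell as a string on the way in and re-pad fields with known fixed widths. A minimal sketch, with illustrative field names and widths:

```python
import csv

# Fields we know are fixed-width digit strings; names and widths illustrative.
FIXED_WIDTH_FIELDS = {"zip_code": 5, "ssn": 9}

def repair_leading_zeros(row):
    """Re-pad digit fields that lost leading zeros (e.g., a 4-digit zip)."""
    fixed = dict(row)
    for field, width in FIXED_WIDTH_FIELDS.items():
        value = (fixed.get(field) or "").strip()
        if value.isdigit() and len(value) < width:
            fixed[field] = value.zfill(width)
    return fixed

# csv.DictReader keeps every cell a string, so values that arrive intact
# are never re-interpreted as numbers in the first place.
with open("roster.csv", newline="") as f:
    rows = [repair_leading_zeros(row) for row in csv.DictReader(f)]
```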

Other common issues include:

  • Random, unescaped newlines in a file.
  • Unescaped fields with commas, which throw off the column alignment of the remaining fields.
  • More info stuffed into a field than we expect (e.g., the value “EN, SP, FR” landing in the single “language_1” field; see the sketch after this list).
  • Free-text specialty designations like “osteopathic manipulative medicine, special proficiency” (rather than codes from the standard taxonomy).
  • Doctors accidentally being omitted from rosters (and then added back on later).
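
As one example, here’s a minimal sketch of untangling the overloaded language field from the list above (the abbreviation mapping is hypothetical; rosters rarely agree on codes):

```python
import re

# Hypothetical mapping from roster abbreviations to ISO 639-1 language codes;
# note that rosters rarely use standard codes ("SP" for Spanish, not "ES").
LANGUAGE_CODES = {"EN": "en", "SP": "es", "FR": "fr"}

def split_languages(raw):
    """Split an overloaded field like 'EN, SP, FR' into normalized codes."""
    tokens = [t for t in re.split(r"[,;/]\s*", raw.strip().upper()) if t]
    return [LANGUAGE_CODES.get(t, t.lower()) for t in tokens]

assert split_languages("EN, SP, FR") == ["en", "es", "fr"]
```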

If we built a hospital system’s data infrastructure from the ground up today, many of these issues would be easy to solve (at Oscar, we’re big fans of opinionated data formats like Thrift, Avro, and Protobufs). But many of the systems of record are decades old and thus difficult to change; their dependencies span many groups throughout a hospital organization, so even necessary changes mean tinkering with infrastructure that must remain stable. In an ideal world, there would be a single, standard format for provider rosters, but standards are often hard to wrangle. We are moving in that direction, but it’s a long way off: many different parties have to coordinate to get there.

Determining network status

In-network doctors have contracts with health insurance companies to offer their services at lower, pre-negotiated rates. Plans like Oscar’s will only cover care received from in-network doctors and facilities (except in emergencies), so it’s critical for Oscar members to have accurate and up-to-date information on whether a doctor is in network.

Even if Oscar did receive squeaky-clean data in a beautiful, fully specified form, we’d still have to determine network status for each doctor. And status isn’t strictly defined on a doctor-by-doctor basis; it’s determined by a combination of many things, including each doctor’s group and practice affiliations, specialties, office, and TIN.

To further complicate this assessment, we also need to know a doctor’s past and future network status. When we get a claim today for an appointment from January 15, 2017, we need to know whether that doctor was in network on January 15, regardless of whether they’re in network now. This network status needs to be bitemporal, and it is determined by the contractual agreements we have with our providers. These contracts are different between every health system and insurance company, so there will never be a “single source of truth” for our systems to use to pull network status.
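
Here’s a minimal sketch of what a bitemporal status lookup can look like (the record shape and names are illustrative, not our actual schema): every fact carries both a valid-time range (when it was true in the world) and a transaction time (when we learned it):

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class NetworkStatusRecord:
    valid_from: date   # when the status started applying in the real world
    valid_to: date     # exclusive end of that range
    recorded_at: date  # when our system learned this fact
    in_network: bool

def status_as_of(records, service_date, as_known_on):
    """Was the doctor in network on service_date, per what we knew on as_known_on?

    Returns True/False, or None when we had no record at all, which is a
    different answer from an explicit out-of-network record.
    """
    known = [r for r in records if r.recorded_at <= as_known_on]
    matches = [r for r in known if r.valid_from <= service_date < r.valid_to]
    if not matches:
        return None
    return max(matches, key=lambda r: r.recorded_at).in_network  # latest fact wins

# A claim arriving today for a January 15, 2017 appointment:
records = [NetworkStatusRecord(date(2016, 1, 1), date(2017, 1, 1),
                               date(2015, 12, 1), True)]
assert status_as_of(records, date(2017, 1, 15), date.today()) is None
```

Returning None rather than False matters, for exactly the reason below: “no record” and “explicitly out of network” are different answers.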

Let’s say we have a pediatrician who used to be in network in 2016 but is no longer in network in 2017. We obviously don’t want to show that doctor when members search for pediatricians near them. But if a member messages their Concierge Team, we need our team to clearly see that the doctor is explicitly out-of-network (rather than just missing) so they can better help the member find an in-network pediatrician.

Reconciling doctors from multiple sources

We may get data about a doctor from many different sources. Maybe they are employed by one hospital system and have admitting privileges at another (so we see them on both hospital systems’ rosters). We may get their board certifications from one vendor, and their licenses from a different vendor. We need to determine first whether these four data points represent the same doctor and then merge them together to present a single, consistent provider to the outside world. This is an entity resolution problem.

The NPI (National Provider Identifier), a unique identification number for covered health care providers, gets us most of the way there; it carries a Luhn check digit, which makes transcription errors detectable. But separate NPIs exist for individual providers and for groups (organizations), which causes confusion. Sometimes, systems will accidentally give the group NPI in place of the individual one. Other times, there’s a transcription error (e.g., a leading zero gets dropped somewhere along the way). It’s even trickier for networks of facilities, like urgent care centers, which may share the same NPI because they belong to the same corporate entity.
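
The check is plain Luhn, applied after prefixing the NPI with 80840 (the prefix CMS specifies for health care identifiers). A minimal validation sketch:

```python
def luhn_ok(digits):
    """Standard Luhn check: double every second digit from the right."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d = d * 2 - 9 if d * 2 > 9 else d * 2
        total += d
    return total % 10 == 0

def valid_npi(npi):
    """A 10-digit NPI is valid if Luhn passes over '80840' + NPI."""
    return len(npi) == 10 and npi.isdigit() and luhn_ok("80840" + npi)

assert valid_npi("1234567893")      # the NPI used in CMS's worked example
assert not valid_npi("1234567890")  # one transcription error away from valid
```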

But let’s say that through the NPI we’re able to determine that provider A from one source is the same as provider B from another. Awesome! Now we get to resolve all of the child entities (billing group, offices, specialties, education, etc.) of A with those of B. Is office 1 from A the same as office 1 from B? We use multiple address normalization and geocoding APIs to help us with this task, but ultimately “123 Main St., Suite 10A” is a different address from “123 Main St.”, and even geocoding algorithms can drift, such that an address’s geocode from a year ago may be very slightly different from that of the same address today. “123 Main St, Suite 9A” may have the same geocode as “123 Main St. Suite 10A”, but they are clearly different offices. Or are they? Maybe it’s just a typo. Offices are the hardest of all the entities we reconcile for this reason.
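
Matching logic therefore has to be conservative on both fronts: normalize the text, allow a little geocoder drift, and only merge when both signals agree. A rough sketch (the normalization rules and 30-meter tolerance are illustrative, not our production values):

```python
import math
import re

def normalize(addr):
    """Crude normalization: uppercase, strip punctuation, standard abbreviations."""
    addr = re.sub(r"[.,#]", " ", addr.upper())
    addr = re.sub(r"\bSTREET\b", "ST", addr)
    addr = re.sub(r"\bSUITE\b", "STE", addr)
    return re.sub(r"\s+", " ", addr).strip()

def geocodes_close(a, b, tolerance_m=30.0):
    """Treat (lat, lon) pairs within tolerance_m meters as the same point,
    allowing for small geocoder drift over time."""
    mean_lat = math.radians((a[0] + b[0]) / 2)
    dy = (a[0] - b[0]) * 111_320                  # meters per degree latitude
    dx = (a[1] - b[1]) * 111_320 * math.cos(mean_lat)
    return math.hypot(dx, dy) <= tolerance_m

def same_office(addr_a, geo_a, addr_b, geo_b):
    """Merge only when text and geocode both agree: 'Ste 9A' and 'Ste 10A'
    can share a geocode but still fail the text comparison."""
    return normalize(addr_a) == normalize(addr_b) and geocodes_close(geo_a, geo_b)
```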

Oscar Center: Floor 4 or Unit 4? Luckily, we know, but it’s not as easy to determine for other providers.

How is Oscar currently tackling these challenges in a smarter way?

We’re far from perfect, but over the past few years we’ve developed a set of internal standards to make our Doctor Directory as robust and accurate as possible. A few approaches we take are:

  • Internal roster standard: When doctor rosters come in, we immediately transform them into this standard, so all subsequent operations on them (e.g., provider reconciliation) work with a unified representation.
  • “Rebasing” fresh data into manual overrides: When fresh data comes in, we unwind the manual overrides we have, merge in the fresh data, and then re-apply the manual overrides so they don’t get trampled by the new (but often still incorrect) data (see the sketch after this list).
  • Multiple, semi-redundant stages of validation: These are both manual and automated, and help detect errors (e.g., an entire group of cardiologists appearing out of network because of a bad TIN) before they become public.
  • A service layer between the data storage and its clients: This allows us to present a well-defined data interface for our many web and mobile applications while improving and changing the internal storage.
  • Constant cross-team communication: This is our most valuable method. Engineers (myself included) are in constant contact with account managers in Network Operations, data scientists in our Risk group, and our processors in Claims Operations. We eat lunch and grab coffee together. As Oscar grows, we constantly reflect on how to keep these communication lines (both personal and data) healthy and open across teams.
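
The rebasing approach in particular maps nicely to code. A minimal sketch (field names hypothetical; the real version operates on whole provider records with audit history):

```python
def rebase_overrides(current, fresh, overrides):
    """Re-apply manual overrides on top of freshly merged roster data.

    Each argument is a {field: value} dict for one provider; `overrides`
    holds only the fields a human has explicitly corrected.
    """
    base = {k: v for k, v in current.items() if k not in overrides}  # 1. unwind
    merged = {**base, **fresh}                                       # 2. merge
    return {**merged, **overrides}                                   # 3. re-apply

current = {"phone": "212-555-0100", "suite": "10A"}   # suite was fixed by hand
fresh = {"phone": "212-555-0199", "suite": "9A"}      # feed is still wrong
overrides = {"suite": "10A"}
assert rebase_overrides(current, fresh, overrides) == {
    "phone": "212-555-0199",  # new data accepted
    "suite": "10A",           # human correction preserved
}
```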

What are we building to better solve these problems?

We have a number of plans to improve our provider data management system over the next 12 months, including:

  • Tightening our relationships with our upstream provider partners to receive cleaner, more standardized data and to more quickly report issues back upstream.
  • Moving from a weekly cycle of batch processing to a real-time, event-driven flow so we can turn many of our operations workflows from asynchronous into synchronous ones, allowing errors to be fixed and new issues to be flagged within seconds instead of days.
  • More internal standardization around errors and summary statistics from each stage in our processing pipeline means that we can build internal UIs and other tools to see these signals as data flows through our pipeline stages.
  • Moving network status determination into a separate service, separating business logic from ETL logic, so we can handle the ever-present network status challenge more flexibly and clearly.
  • First-class bitemporality for all of our internal data storage, which means clearer, simpler, and easier-to-maintain code when we need to perform complex network status operations.
  • Better ETL tooling that lets us move logic from code to configuration, which means less code to write and test, and thus more time to build cool new stuff.

We’ve learned the challenges of managing provider data through many, many hard knocks over the last four years. It’s difficult. But that’s what makes it fun. It’s also solvable. This needs to be a collective effort, meaning it won’t be any one format, library, team, or company that solves this problem, and it won’t happen in the next year — maybe not even the next five. But collectively, through a lot of hard work, we’ll get there together, and our dysfunctional health care system will run a little bit better.
