Harnessing Entity Resolution to Unveil Doctor’s Funding Secrets

Jacob Mazurkiewicz
14 min readApr 16, 2024

--

Photo by Marek Studzinski on Unsplash

In the labyrinthine of U.S. healthcare, transparency remains an elusive ideal. With millions of dollars flowing between doctors, government entities, and major medical corporations each week, untangling this web of financial transactions seems an insurmountable task. Yet, there’s a method to solve this madness: using the power of machine learning to achieve healthcare clarity through Entity Resolution.

Entity resolution is a field of machine learning that seeks to map records in different datasets to each other, identifying them as the same object or person regardless of possible variations in data [1]. Mapping databases of U.S. doctors to datasets of distributed grants can reveal which doctors received which grants, revealing how much money doctors are receiving and from which institutions. These results can explain how certain doctors may be incentivized in unethical ways (if they are receiving grants from a pharmaceutical company, for instance).

While this problem seems straightforward, it’s quite complex for computers to solve. For example, what if names of the same person in the complimentary datasets are misspelled or abbreviated?

Is “Dr. James Adams” in one dataset the same person as “Dr. Jim Adams Jr.” in another?

How about “Dr. Jacob Brown” and “Dr. Jake Brown”?

Issues in Entity Resolution arise when the same entity presents slight variations in their data within different databases. Entity resolution is, at heart, a classification problem seeking to solve the question are these two entities, or people, from different datasets the same?

Basic Entity Resolution Workflow

Scale also presents complexities. It’s computationally expensive to compare a record from one dataset to every record in the other dataset when looking for matches. If I have a dataset of grants that is 100,000 rows and a dataset of doctors that is 10,000 rows, it will take 1 billion comparisons to check every combination!

Checking every provider to grant combination is ineffective at scale.

To resolve entities in grants and doctor databases, I combined concepts across string distances, machine learning classifiers, graph theory, and vector embeddings to create a robust model that maps doctors to grants. In this post I’ll breakdown my methods and philosophies with tackling entity resolution problems. Let’s dive into it!

Reading the Data

Raw data for entity resolution projects usually requires formatting and cleaning. When scraping together online datasets and combining them, it's important to retain a standardized format for comparisons.

Data for this project came from two main sources:

  1. A record of grants given to doctors RePORT ⟩ RePORTER (nih.gov)(Grants/grantees dataset)
  2. A record of doctors obtained from NPI Files (cms.gov) (Providers/doctors dataset)

Creating datareader classes for each dataset that clean and format the raw files is a good method to create usable data frames. This allows you to reliably read in new data through a standardized pipeline that automatically handles cleaning fresh data. The datareader classes I created could:

  1. Read in the data to a Pandas data frame.
  2. Select the columns of interest (as there were dozens of columns in both datasets).
  3. Clean the data.

Cleaning

The raw data did need to be cleaned. For example, there was no “first name” and “last name” column in the grantees dataset, only a “Pi_Names” column with names in this format: [‘COOKE, BRAD’;‘WILLIAMS, CAROL’]. Names in this format cannot be matched to names in the doctors dataset which has alternative formatting.

The “Pi_Names” column also had multiple entries for different doctors under the same grant across various rows. This creates matching challenges since it’s ideal to have a single, distinct value per row.

To resolve these challenges, I used the Pandas explode() and String split() functions to create two new columns for the grantee’s first and last names. The split() function separates values into in array based on a defined delimiter (; in this case) while the explode() function breaks down a column containing array values into multiple rows, each with one value per row.

def _clean(self, df: pd.DataFrame):
"""Clean the names columns and returning new names"""
df['pi_names'] = df['pi_names'].str.split(';')
df = df.explode('pi_names')

df['is_contact'] = df['pi_names'].str.lower().str.contains('(contact)', regex=False)
df['pi_names'] = df['pi_names'].str.replace('(contact)', '')

df['both_names'] = df['pi_names'].apply(lambda x: x.split(',')[:2])
df['forename'] = df['both_names'].apply(lambda x: x[1])
df['last_name'] = df['both_names'].apply(lambda x: x[0])

return df

Ensuring I have reliable and formatted data frames is always my first step when implementing entity resolution projects. I find it useful to look through all the columns of my data prior to coding to get a feel for what cleaning techniques may be necessary. Here’s a good article to learn the fundamentals understanding your data to properly prepare it.

Setting up Databases

While a first pass for entity resolution projects can be comparing just two datasets, data for these projects often arrive in updated batches. I quickly realized that every year new rounds of grants are awarded to doctors. Over time, this quickly becomes a big data problem as millions of new doctors enter the industry and new grants are awarded.

In response, I set up a SQLite relational database to store my providers and grantees, allowing for effective data management at scale. Relational databases offer superior data management capabilities compared to CSV files. They provide advanced SQL data retrieval features, optimized performance, scalability, and enhanced security [2]. Their key-indexing also allows for efficient matching with other datasets.

Files may be lightweight and user-friendly but lack the complexity needed for managing large scale data.

SQLite is a lightweight Python library for managing relational databases. I used it to create three SQL tables from within my Python environment:

  1. A Grantee table.
  2. A Provider/Doctors table.
  3. A bridge table to match providers and grantees on a common value (like last name).
-- This query creates the provider/doctors table 
CREATE TABLE IF NOT EXISTS provider (
id INTEGER PRIMARY KEY NOT NULL,
npi INT NOT NULL,
taxonomy_code VARCHAR(25),
last_name VARCHAR(100) NOT NULL,
forename VARCHAR(100),
address VARCHAR(250),
cert_date DATETIME,
city VARCHAR(100),
state VARCHAR(100),
country VARCHAR(100)
);

To connect to a database file in python, a great library to use is SQLAlchemy. This library allows you to execute SQL commands in Python that directly alter your database.

Using SQLAlchemy to connect to my database file, I created my tables. I then used the Pandas to_sql() function to insert my data frames into the database tables, setting the parameter if_exists=’append’ to allow for new data to be added to the tables over time.

#Translating pandas dataframe to database
df.to_sql('grantee',
db.sql(),
if_exists='append',
index=False
)

To learn more about databases in python read more here. Now, I had a database with corresponding tables which could handle large quantities of data, and from which I could query data to train and test my classifier on!

Creating Distance Features

Now that I had cleaned and organized my data, I needed to engineer features to train my model on. Entity resolution projects often lack clearly defined features, unlike projects in fields like financial analysis, where features are more readily identifiable and structured. To resolve entities, there are a few options to create features that indicate if two names or entities are the same person.

  1. Leveraging Word Embeddings: By utilizing word embeddings, we can quantify similarity in a vector space. In simple terms, embeddings of names or locations that are closer together in an n-dimensional space are more likely to represent matches.
  2. Creating Binary Features: Another approach is to employ binary features based on logical rules. For instance, you can check if names in different datasets have exactly identical characters. This method simplifies comparisons, especially in structured data where exact matches are crucial.
  3. Exploring String Distance Measurements: Beyond binary features, a third approach uses string distance measurements like the Jarowinkler distance, Levenshtein Distance, or Hamming Distance. These measurements account for variations such as character substitutions, deletions, and insertions, providing a nuanced view of similarity between two strings.

For my model, I used the Jarowinkler distance and the Set distance between the forenames, cities, and states of rows (shared column types) in the two datasets to create my features. Jarowinkler distances are commonly used as features in entity resolution projects due to their simplicity and effectiveness [3].

The Jarowinkler distance is a string similarity metric that quantifies the dissimilarity between two strings based on the number of matching characters and their positional proximity [3]. The set distance measures how much sets of words overlap.

For example, the Jarowinkler distance between “Andrew Brown Jr.” and “Andrew Brown” is 0.95. The Jarowinkler distance between “Cynthia Williams” and “John Williams” is 0.71. Therefore, according to Jarowinkler features, the first comparison is more likely to be the same entity.

Choosing which features to create is a primary decision in resolving entities. When working with high dimensional data, it’s useful to consider more complex approaches like embeddings rather than simply comparing whether two values match exactly.

Sometimes, the simplest approach can perform just as well with less compute resources however. The key lies in understanding the problem context. Through experience, I’ve learned that no single feature method is universally superior; instead, the style of the data itself dictates the most effective features and techniques.

By creating a data frame of these distance features between the two datasets, I could begin work on implementing a classifier.

def jw_dist(v1: str, v2: str) -> float:
""" This function creates Jarowinkler distances """

if isinstance(v1, str) and isinstance(v2, str):

return jarowinkler.jarowinkler_similarity(v1, v2)
else:
return np.nan

#Combining data frames
comb_df = pd.concat([grantees.add_suffix('_g'), providers.add_suffix('_p')], axis=1)

#Calculating distance features between forenames
comb_df['jw_dist_forename'] = comb_df.apply(lambda row: jw_dist(row['forename_g'],
row['forename_p']),
axis=1)

Implementing an XGBoost Classifier

Selecting the appropriate classifier is crucial in Entity Resolution, as it’s a classification machine learning problem. For structured entity resolution, simple classifiers like logistic regression, decision trees, random forests, Bayesian additive regression trees, and others are commonly favored [1].

Despite the ongoing research in using deep learning for complex entity resolution tasks, I chose a reliable machine learning algorithm due to the data’s nature. I ultimately implemented XGBoost, a scalable implementation of gradient-boosted decision trees.

XGBoost performs very well when training on small sample sizes with limited parameters and has shown remarkable performance across a wide variety of machine learning tasks [4]. Due to the small parameter space and inherent class imbalance of entity resolution problems (many more non-matches than matches), I chose this model for my classifier.

Classifier Class

To do this, I created an EntityResolver classifier class that had the following functionalities:

  1. Initialize a model with an XGBoost classifier.
  2. Train the model with a train-test split on features and labels.
  3. Predict the class (match or no-match) of a grantee and doctor.
  4. Save the model artifacts.
  5. Load the model artifacts.
  6. Record relevant metadata like training time, training size, and model accuracy.
A small example of a grants-to-doctor graph. Entity resolution seeks to draw these node edges by determining which entities in one dataset match with an entity in a separate dataset.

After having created my class template, it was now time for the fun part, training!

Finding and Preparing Training Data

To train my newly created XGBoost classifier, I needed to find real matches and non-matches in the data to create labeled training data. While tedious, manually identifying matches in the data is often a required step to train an initial Entity Resolution model.

There are better methods than just scanning datasets without a plan for tackling this task, however. To find likely matches I could hand label, I used the SQL bridge table created previously.

By matching grantees and doctors on a common value like the last name, rows that are more likely to match can be extracted because grantees and doctors that share a last name are much more likely to be the same person.

-- This query pulls grantees and doctors with the same last name
SELECT DISTINCT gr.last_name
FROM grantee gr
INNER JOIN provider pr
ON gr.last_name = pr.last_name
LIMIT 100;

I exported this newly queried data to Microsoft Excel where I checked by hand for matches between entities.

Example of manually labeled matches and non-matches between the two concatenated datasets.

While obtaining ground truth in entity resolution labeling is nearly impossible, we can still draw robust conclusions. For instance, if two entities share the same first name, last name, hail from identical cities, states, organizations, and possess matching addresses, we can confidently classify them as a match within our model.

An example of this labeling process can be seen above where “Alice Summer Tang” from San Fransico in the grantee dataset was determined to be the same person as “Alice Tang” from the providers dataset, receiving a 1 label (1 for match, 0 for no match) in the Is_Match column.

If the two entities contained conflicting variables such as being located in different cities, even with the same last name, they were determined to be a non-match.

After repeating this process enough times to obtain a small sample of matches, I calculated my distance features on this new dataset. The number of matches initially needed will likely vary based on the data’s complexity and the classifier you choose.

Deploying Model using HNSW Vector Embeddings

My model achieved 100% accuracy during training on my features and the Is_Match label column! It’s important to understand the evaluation metric you’re optimizing for in the given project.

Is it more advantageous to return false positives but not miss any real matches? Then optimizing for recall is best. What if you want to correctly classify all of your predicted matches? Then focusing on precision is best. Evaluation is a major component of data projects; you can explore more here.

However, since this training was based on a small sample of matches and non-matches, the accomplishment might not be as impressive as it initially appears.

The class imbalance that frequently appears in entity resolution cannot be ignored, as the model fundamentally receives many more non-matches than matches during the training process.

To build a truly generalizable model, I needed to predict matches on unseen test data and identify new matches to iteratively train my model on. By training on additional matches, my model will enhance its performance and ability to manage edge cases.

The Blocking Problem

This brings us to a primary challenge of entity resolution, however, commonly called the “blocking problem”. As mentioned earlier, the challenge lies in comparing a row from one dataset to every other row in a different dataset, which can be incredibly computationally demanding.

The blocking problem has been around for decades, with early research from Gerald Salton and Michael J. McGill in 1987 into handling it paving the way to more sophisticated methods commonly used today [5]. While Salton and McGill focused on retrieving texts limited to defined criteria, the fundamental principle of choosing to return the most likely matches extends to entity resolution.

One simple approach to the blocking problem is to make comparisons based on a common key, like the bridge table. If we only make comparisons between doctors and grants with the same last name, we are more likely to find matches and minimize comparisons that are unlikely to be matches.

This simplistic approach struggles to perform well in high dimensional data where multiple values may match, however. In these cases, matching on one or even multiple variables may not be informative enough of a potential match.

An alternative approach is to use maximal frequent itemsets (MFI), which lend themselves well to the problem of entity resolution. MFIs are sets of tuples with maximal commonality over the set of values in a database [6]. Essentially this constrains a model to only make comparisons between sets of entities that share enough common items to exceed a set threshold.

So, what’s an effective blocking solution for this dataset? I chose to compare a record from one dataset to only the 100 closest records in the other dataset using a nearest neighbors search with vector embeddings. Hence, we are only making comparisons that have the highest likelihood to be real matches. Utilizing embedding similarity allows us to overcome common suffixes and prefixes found in medical terminology, such as “Dr.”, “MD.”, and “RN”, which can often cause matching issues.

To do this, I first embedded my providers full name using a Fasttext model. Fasttext is a word embedding model similar to word2vec but improves upon the architecture with added features like bags of n-grams that capture semantic meaning in sequences and improved efficiency [7]. Using Fasttext embeddings creates a 50-dimensional embedding from strings, which can be compared in vector space for semantic similarity.

#Embedding the fullname using fasttext
df_providers['vector_p'] = df_providers['fullname'].apply(
lambda x: ft_model.get_sentence_vector(x) if not pd.isnull(x) else None)

df_grantees['vector_g'] = df_grantees['fullname'].apply(
lambda x: ft_model.get_sentence_vector(x))

I also implemented a Hierarchical Navigable Small World graph index using hnswlib to organize and query these embeddings. HNSW is a powerful search method designed for quick nearest neighbor searches in an n-dimensional space [8]. It leverages graph structures to construct skip lists within the vector space, enabling rapid retrieval of nearest neighbors when provided with a query point. In this case, each grantee row queried the 100 closest providers.

A representation of hierarchical navigable small world graphs. Image credits: Hierarchical Navigable Small Worlds (HNSW) | Pinecone

This embedding approach spares the computational expense of comparing thousands of doctors and grants that are unlikely to match, offering an efficient solution to the blocking problem.

An example of comparing grantees to only the most similar doctors

The goal of this entity resolution model is to track the flow of money from grants to doctors. Being able to trace where providers are receiving their money can uncover potential biases, as doctors receiving large grants from organizations like big pharmaceutical companies can be evaluated in a new light with this knowledge. Knowing which doctors are well-funded and from which grants can also help identify key researchers and influential players in specific areas of medicine.

I plan to upgrade my model by training it on newly discovered matches and using past grant datasets to improve the size and quality of the data. As new grants are awarded, the model will only grow more robust and capable of identifying hard-to-detect edge cases. And with the infrastructure in place (datareader classes, databases, classifier architectures) to handle continuous, large data streams, the effort in iterative training will be minimal.

Entity resolution is an exciting field of machine learning that promises tremendous breakthroughs by defining, condensing, and validating large sets of data. These new datasets can reveal previously undiscovered secrets and insights into industry, medicine, and science.

Please check out my repository to see exactly how I tackled this problem with all the accompanying code: jakemaz66/EntityResolution (github.com)

I hope you learned something valuable, reach out to my email mazurkiewiczj@duq.edu or leave a comment for any questions!

References

[1] Binette, O., & Steorts, R. C. (2022). (almost) all of entity resolution. Science Advances, 8(12). https://doi.org/10.1126/sciadv.abi8021

[2] Relational database vs flat file (differences & similarities). DatabaseTown. (2023, January 24). https://databasetown.com/relational-database-vs-flat-file-differences-similarities/#:~:text=Relational%20databases%20are%20more%20powerful%20and%20flexible%20than,not%20designed%20to%20handle%20complex%20data%20management%20tasks.

[3] Wang, Y., Qin, J., & Wang, W. (2017). Efficient approximate entity matching using Jaro-Winkler distance. Lecture Notes in Computer Science, 231–239. https://doi.org/10.1007/978-3-319-68783-4_16

[4] Chen, T., & Guestrin, C. (2016). XGBoost. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. https://doi.org/10.1145/2939672.2939785

[5] Salton, G., & McGill, M. J. (1987). Introduction to modern information retrieval. McGraw-Hill Intern.

[6] Kenig, B., & Gal, A. (2013). MFIBlocks: An effective blocking algorithm for entity resolution. Information Systems, 38(6), 908–926. https://doi.org/10.1016/j.is.2012.11.008

[7] Joulin, A., Grave, E., Bojanowski, P., & Mikolov, T. (2017). Bag of tricks for efficient text classification. Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers. https://doi.org/10.18653/v1/e17-2068

[8] Malkov, Y. A., & Yashunin, D. A. (2020). Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(4), 824–836. https://doi.org/10.1109/tpami.2018.2889473

--

--

Jacob Mazurkiewicz

Data Science Student at Duquesne University. Data Science Intern, Undergraduate Machine Learning Researcher, and AI Ethics Fellow