Embedding medical journeys with machine learning to improve member health at CVS Health

Piero Ferrante
CVS Health Tech Blog
8 min read · Jul 19, 2022


By: Matthew Churgin and Jai Bansal (Matt and Jai are data scientists in the Analytics & Behavior Change department at CVS Health)

Embedding representations for categorical data on medical claims

Medical claims are one of the key data sources we use to understand health journeys. Claims are the data artifacts that result from our members’ interactions with the healthcare system. Analyzing claims data is complex. To understand why, it helps to know more about how health information is represented on a claim.

Claims contain data like the amount the provider billed, the place of service, and provider specialty. The primary medical information in a claim is represented in codes that indicate the diagnoses, procedures, or drugs for which a member was billed. These codes typically follow national standards. For example, our claims use International Classification of Diseases (ICD) codes to represent medical diagnoses, Current Procedural Terminology (CPT) codes to represent medical procedures, and Generic Product Identifier (GPI) codes to represent pharmaceuticals. These codes give us a semi-structured view into the medical reason for each claim and contain rich information about members’ health journeys. However, since the codes themselves are categorical and high-dimensional (>10K cardinality), it’s challenging to extract insight or predictive power directly from the raw codes on a claim.

To transform claim codes into a more useful format for machine learning, we turned to the concept of embeddings. Embeddings are used broadly in deep learning to represent various forms of unstructured data, such as images, audio, or categorical sequences like text or amino acid chains. Here we use word embeddings as a motivating example, since the task of learning claim code embeddings is highly analogous to that of learning word embeddings. Word embeddings are widely used in natural language processing to provide numeric vector representations of individual words. For example, embeddings let us quantify the notion that the words ‘man’ and ‘woman’ are more similar to each other than either is to the word ‘tree’. In addition, embeddings make analogies quantitative, like the classic relation that king is to queen as man is to woman.
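As a quick illustration of these properties (using publicly available word vectors rather than our claims data), here is a minimal sketch with Gensim’s pretrained GloVe vectors; the specific model name is just one convenient choice, not something we use in our pipeline.

```python
import gensim.downloader as api

# Load a small set of pretrained word vectors (illustrative model choice).
vectors = api.load("glove-wiki-gigaword-50")

# 'man' and 'woman' are closer to each other than either is to 'tree'.
print(vectors.similarity("man", "woman"))   # relatively high
print(vectors.similarity("man", "tree"))    # relatively low

# The classic analogy: king - man + woman is closest to queen.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```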

We use a similar approach with our claims data. We treat each claim code as a word or token and use embedding algorithms to learn lower-dimensional vector representations that preserve the original high-dimensional semantic meaning. While we could represent each code as a one-hot encoded vector with length equal to the total number of codes (dimension >10K), compressing the data into dense vectors (embeddings) reduces dimensionality, enables distance computations, and facilitates modeling. The embedding vector dimension is a hyperparameter that can be chosen depending on the number of codes. We experimented with different embedding dimensions, but in practice found little difference in performance when varying this parameter.

Training embeddings from medical claims

Many algorithms can be used to learn embeddings, but the general approach is to randomly initialize an embedding matrix and then convert sequences of words (or codes) into a prediction task that can be optimized with gradient descent. This process converts the categorical features into dense numeric representations. Details of these methods are beyond the scope of this post; for a friendly introduction, we recommend Jay Alammar’s “Illustrated word2vec” blog post on one of the earliest and most popular embedding algorithms.

In our case, we use sequences of anonymized member claim diagnosis, procedure, and drug codes as training data. As illustrated in Figure 1 for ICD codes, we take a member’s sequence of claim codes as a string of tokens and feed these sequences into existing embedding algorithms. We independently learn embeddings for each of ICD, CPT, and GPI codes based on millions of member claim sequences. As shown in Figure 1, the result of the training process is a static lookup table where each code is uniquely mapped to a dense embedding vector that can be used for downstream tasks.

Figure 1: Training data for ICD code claim embeddings. A similar process is used to train CPT and GPI code embeddings.
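The sketch below illustrates this data preparation step under simplifying assumptions: the claims table, its column names, and the ICD-10 codes are purely illustrative, not our actual schema or data.

```python
import pandas as pd

# Hypothetical claims table: one row per (member, claim, diagnosis code).
claims = pd.DataFrame({
    "member_id": ["A", "A", "A", "B", "B"],
    "service_date": pd.to_datetime(
        ["2021-01-05", "2021-02-10", "2021-06-01", "2021-03-15", "2021-03-20"]
    ),
    "icd_code": ["Z3A.32", "O09.519", "O80", "E11.9", "I10"],
})

# Order each member's codes chronologically and treat the resulting sequence
# of codes as one "sentence" for the embedding algorithm.
sequences = (
    claims.sort_values(["member_id", "service_date"])
    .groupby("member_id")["icd_code"]
    .apply(list)
    .tolist()
)
print(sequences)
# [['Z3A.32', 'O09.519', 'O80'], ['E11.9', 'I10']]
```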

In terms of the training process itself, we experimented with both the word2vec and GloVe algorithms to learn embeddings for each type of claim code. As we did not observe significant differences between algorithms, we ultimately settled on using word2vec, as implemented by the Gensim library.
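As a rough sketch of what that training step can look like with Gensim, using the toy sequences from the previous sketch (the hyperparameter values here are illustrative, not the ones we used):

```python
from gensim.models import Word2Vec

# 'sequences' is a list of per-member code lists, as sketched above.
model = Word2Vec(
    sentences=sequences,
    vector_size=100,   # embedding dimension (a tunable hyperparameter)
    window=5,          # context window over the code sequence
    min_count=1,       # keep rare codes for illustration
    sg=1,              # skip-gram variant
    workers=4,
    epochs=10,
)

# The trained model is effectively a static lookup table: code -> dense vector.
embedding_table = {code: model.wv[code] for code in model.wv.index_to_key}
print(embedding_table["O80"].shape)  # (100,)
```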

Evaluation

After training, the next step is evaluation. Training embeddings is an unsupervised learning process, so there’s no ground truth to check each code’s learned embedding against. We evaluated our embeddings from three different perspectives: 1) claim code representation, 2) member representation, and 3) predictive performance.

First, we asked whether the learned embeddings meaningfully represented the semantic information present in claim codes. To check this, we began by projecting embeddings into two dimensions and visualizing how codes with known labels were distributed in embedding space. This method let us subjectively evaluate if distances between codes made sense given our semantic understanding of the relationships between them.

In Figure 2 below, we’ve used the dimensionality reduction technique UMAP to reduce diagnosis code embeddings to two dimensions. Codes are color-coded by diagnosis group, a high-level grouping for related diagnoses such as those related to neurological disorders, burns, or dermatology.

We can see pregnancy-related code embeddings clustered together on the left. To the lower right of the ‘Pregnancy’ cluster, we see codes related to gynecology, which makes sense since both groups relate to women. To the lower right of the ‘Gynecologic’ cluster, we see a ‘Breast cancer’ cluster.

Moving counter-clockwise around the plot, we encounter more cancer-related codes before transitioning to brain-related diseases. This plot shows that the embeddings have preserved our intuitive sense of relationships between diagnosis codes, which means they have captured quantitatively what we previously knew only qualitatively. That gives us confidence in using embeddings for other analysis and modeling tasks.

Figure 2: Continuous representation of diagnosis claim codes. Axes represent each code’s embedding vector in two dimensions after dimensionality reduction with UMAP. Points are colored by ICD group.
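For readers who want to reproduce this kind of view on their own data, here is a minimal sketch using the umap-learn library. It assumes an embedding_table lookup like the one from the training sketch above (over a full vocabulary of codes), and the code-to-group mapping is a hypothetical stand-in for a real diagnosis-group lookup.

```python
import numpy as np
import umap  # pip install umap-learn
import matplotlib.pyplot as plt

# Stack the learned code embeddings into a matrix (codes x dimensions).
codes = list(embedding_table.keys())
X = np.vstack([embedding_table[c] for c in codes])

# Reduce to two dimensions for visualization.
reducer = umap.UMAP(n_components=2, random_state=42)
X_2d = reducer.fit_transform(X)

# Hypothetical mapping from each code to its high-level diagnosis group
# (in practice this comes from a reference table, e.g. {"O80": "Pregnancy", ...}).
code_to_group = {c: "Pregnancy" if c.startswith(("O", "Z3A")) else "Other" for c in codes}
group_ids = {g: i for i, g in enumerate(sorted(set(code_to_group.values())))}
colors = [group_ids[code_to_group[c]] for c in codes]

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=colors, cmap="tab20", s=5)
plt.xlabel("UMAP dimension 1")
plt.ylabel("UMAP dimension 2")
plt.show()
```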

The second way we evaluated the trained embeddings was to explore how they represented our members and ask whether the results conformed to our qualitative understanding of differences in health journeys based on demographics. For example, we know qualitatively that 1) members in different age groups typically experience different diagnoses and procedures and 2) men and women largely undergo similar procedures (besides obvious exceptions like pregnancy or certain types of cancer). To test these ideas, we took members’ claims in a given time period, retrieved embeddings for each code on those claims, and computed the average of those embeddings. Figure 3 shows this process for a fictional member with three claims. The resulting average embedding is a holistic vector representation of a member’s medical journey. After doing this for all members, we used the same dimensionality reduction process as in Figure 2 to plot each member in two dimensions.
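A minimal sketch of this averaging step, assuming the code-to-vector lookup table (embedding_table) from the training sketch above; the member’s codes are illustrative.

```python
import numpy as np

def member_embedding(member_codes, embedding_table):
    """Average the embeddings of all codes on a member's claims to get a single
    vector representing their medical journey over the time window."""
    vectors = [embedding_table[c] for c in member_codes if c in embedding_table]
    if not vectors:
        return None  # member has no embeddable codes in the period
    return np.mean(vectors, axis=0)

# Hypothetical member with three claims in the time window.
avg_vec = member_embedding(["Z3A.32", "O09.519", "O80"], embedding_table)
print(avg_vec.shape)  # same dimension as the individual code embeddings
```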

Results are shown in Figure 4. In this case, each point represents a member’s aggregated procedure codes (not diagnosis codes as in Figure 2). Members are color-coded by age (Figure 4, left) or gender (Figure 4, right). We see that members of different ages are clustered in different regions of the embedding space, supporting our hypothesis that the average member health journey changes with age.

We see from the gender-labeled plot that males and females are mostly interspersed in their average embeddings, with a few notable exceptions. This indicates that men and women experience similar health journeys over the time frame analyzed. One exception is the blue island of female members towards the top of Figure 4. Looking at this island on the age-labeled plot, we see these members are aged 18–35, so they are likely on maternity journeys, which explains why they are all women (we inspected the individual procedure codes to confirm this is indeed the case). This is another example of how visualizing embeddings lets us quantify our qualitative understanding of member health journeys. As before, these visualizations showed trends that we expected, building confidence that the embeddings capture useful semantic relationships present in claims data.

Figure 3: Compute a member’s average embedding.
Figure 4: Average member procedure embeddings labeled by age (left) or gender (right). The red arrow highlights a cluster of members experiencing primarily pregnancy-related journeys.

Finally, we evaluated our embeddings by asking how well they predicted future healthcare-related behavior compared to a simpler baseline. The baseline was created by counting the number of claims per member that fell into each ICD or CPT group in the past year. ICD and CPT groups are high-level groupings of related codes, allowing us to represent claim history in a few hundred dimensions. Figure 5 displays health event predictions using either member claim ICD and CPT group counts (green) or members’ average ICD and CPT embeddings (orange) as features. After training logistic regression models to predict each health event, we see an average area under the curve (AUC) of 0.67 for models trained with embeddings compared to 0.63 for models trained with group counts, indicating that embeddings generally improve model performance over the simple baseline.

Why do embeddings outperform group counts in most cases? It may be that rolling all claim codes within a group into a single count does not capture the same nuances as embeddings, which learn a unique representation for every specific claim code, and these nuances may be necessary to predict future health events. It’s also interesting to note that certain events, like Caesarean section, are easier to predict overall than others, like emergency room visits, likely due to the inherently random (unpredictable) nature of certain events.

Figure 5: Claim-based member embeddings predict future health events better than group counts in most cases.
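To make the comparison concrete, here is a hedged sketch of this style of evaluation with scikit-learn. The feature matrices and labels are random placeholders so the snippet runs end to end; in practice they would be per-member ICD/CPT group counts, averaged claim embeddings, and real future-event labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

def evaluate_features(X, y):
    """Train a logistic regression on one feature set and report held-out AUC."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

# Random placeholders standing in for the real features and labels.
rng = np.random.default_rng(0)
n_members = 5000
X_group_counts = rng.poisson(1.0, size=(n_members, 300))  # baseline group-count features
X_embeddings = rng.normal(size=(n_members, 100))          # averaged embedding features
y_event = rng.integers(0, 2, size=n_members)              # binary future health event label

print("group-count AUC:", evaluate_features(X_group_counts, y_event))
print("embedding AUC:  ", evaluate_features(X_embeddings, y_event))
```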

The outcome of this evaluation is particularly exciting. Feature engineering is a critical part of modeling and requires substantial time and effort. Our results show that embedding features are relevant for multiple, often seemingly unrelated, predictive tasks. They therefore provide a way to quickly and automatically generate features, which saves data scientists effort and gets baseline predictive models up and running quickly.

Next steps

Our roadmap includes automating embedding re-training to ensure continuous inclusion of new codes, experimenting with embeddings for other entities like providers or lab codes, and exploring graph-based embedding approaches to represent multiple entities like claim codes, members, and providers in a shared space. These workflows will hopefully help us continue to improve the fidelity of representing complex, unstructured data in a structured format to better support our members’ health journeys and CVS Health’s mission.

Using data science to drive value for members

The Analytics and Behavior Change (A&BC) department drives the analytics and insights underpinning CVS Health’s transformation of the healthcare industry. A&BC contains many teams and hundreds of data scientists and engineers. These colleagues work on a wide portfolio spanning health insurance, front-store/retail analytics, pharmacy benefits management, retail pharmacy, and other CVS Health business areas. The authors’ team works on health-insurance-specific topics with the overall goals of improving our insured members’ health, reducing their out-of-pocket spend, and optimizing the company’s medical cost spending.

Our data science work typically falls into three buckets: 1) analyzing patterns in member journeys to identify pain points and new opportunities, 2) developing data products to improve member experience, and 3) building next-best-action campaigns that help keep members healthy. All of this requires data scientists to analyze member healthcare data and build models to generate insights that guide the business.

Healthcare data is particularly complex, making it both challenging and rewarding to understand how we can best support members on their health journeys. This post describes an internal data product called Member Embeddings that facilitates modeling of member medical journeys with machine learning.
