How to evaluate Electronic Health Record representations for Machine Learning models?

Keren Ofek Granov
5 min read · Apr 27, 2023

In this post I share several approaches to representing electronic health records (EHR) data so that it can be used as an input for a variety of machine learning algorithms. Since there are many ways to represent the data, this post focuses on how to compare the representations, evaluate them, and choose the right approach.

Background:

With the expansion and advancement of technology in healthcare systems, we can gather EHR data from various clinical settings and frameworks. EHR data is far more available today than it was in the past.

As mentioned in previous research, representations in the EHR domain have great potential to help answer questions regarding patients' health status and the efficacy of medications. Proper analysis of the available information can lead to significant discoveries, such as understanding correlations between diseases and predicting the results of medication use.

The purpose of my research was to find a lean data representation with which we can run different models, classifications, and regressions. The idea is to find a representation that captures the initial data while focusing on the most significant signals and losing as little significant information as possible. With the right representation, the input to different machine learning models is focused on the significant elements of the data, its essence.

Comparing the different representations is not trivial, and there are multiple ways to compare and evaluate them.


The Data:

So we understand the great potential, but what does this data actually look like? What is available as part of electronic health records?

An electronic health record (EHR) is a digital version of a patient’s paper chart.

It is a collection of medical codes, where each type of code is stored as tabular data in a text file.

However, this type of data has several problems that make it difficult to use as an input for machine learning algorithms and still get relevant results: sparse, noisy, and biased data poses many complex challenges.
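To make the sparseness concrete, here is a minimal sketch of turning a table of (patient, code) rows into a patient-by-code matrix; the file name, column names, and binary encoding are illustrative assumptions, not the actual pipeline used in the research.

```python
# Illustrative sketch: turn a table of (patient_id, medical_code) rows into a
# sparse patient-by-code matrix. File and column names are assumptions.
import pandas as pd
from scipy.sparse import csr_matrix

diagnoses = pd.read_csv("diagnoses.txt", sep="\t")  # hypothetical tabular text file

patients = diagnoses["patient_id"].astype("category")
codes = diagnoses["code"].astype("category")

# One row per patient, one column per distinct medical code, 1 if the code appears.
matrix = csr_matrix(
    ([1] * len(diagnoses), (patients.cat.codes, codes.cat.codes)),
    shape=(patients.cat.categories.size, codes.cat.categories.size),
)

print(f"{matrix.shape[0]} patients x {matrix.shape[1]} codes, "
      f"{matrix.nnz / (matrix.shape[0] * matrix.shape[1]):.4%} non-zero")
```

Even on modest datasets, the non-zero fraction of such a matrix tends to be tiny, which is exactly the sparseness problem described above.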

The data in this research consists of the EHRs of Covid-19 patients, covering the period before their Covid-19 infection. Since EHR data is not very accessible, particularly due to patient privacy, the research was done in collaboration with TriNetX, a global health research network that connects pharmaceutical companies, researchers, and academics to the world of drug research and development by sharing real-world data.

The representations:

There were five alternative approaches for representing the EHR data, and this article focuses on how to compare them. For the representations, I adopt principles that were proven in other domains (recommendation systems, NLP) and apply them to the healthcare domain.
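The post does not go into the five representations themselves, but as one illustration of borrowing an NLP principle, a patient's sequence of medical codes can be treated like a sentence and embedded with a skip-gram model. The sketch below uses gensim's Word2Vec on hypothetical code sequences; it is only an example of the idea, not necessarily one of the five representations that were compared.

```python
# Illustrative sketch (not necessarily one of the representations compared here):
# treat each patient's chronologically ordered codes as a "sentence" and learn
# code embeddings with a skip-gram model, as is common in NLP.
from gensim.models import Word2Vec
import numpy as np

# Hypothetical input: one list of medical codes per patient, in time order.
patient_code_sequences = [
    ["E11.9", "I10", "J45.909"],
    ["J45.909", "R05", "J06.9"],
]

model = Word2Vec(sentences=patient_code_sequences, vector_size=64,
                 window=5, min_count=1, sg=1, epochs=20)

def patient_vector(codes, model):
    # A simple patient-level representation: the average of its code vectors.
    vectors = [model.wv[c] for c in codes if c in model.wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(model.vector_size)

X = np.vstack([patient_vector(seq, model) for seq in patient_code_sequences])
```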

The evaluation:

When trying to evaluate the quality of the representations, we run into the difficulty of deciding what counts as a good representation and how to compare one representation against another. For this challenge, I use three approaches: a supervised evaluation, an unsupervised evaluation, and a mixed approach (unsupervised and supervised).

1. Supervised approach

For the supervised approach I had to define the labels for the evaluation task, use different classifiers, and measure the accuracy.

I define the label of each patient as their medical diagnosis during a relevant time frame of their EHR in which they have only one diagnosis. Since the representations are not the only factor that affects the accuracy (the chosen models matter as well), I ran six different models with the same representations (Nearest Neighbors, linear SVM, RBF SVM, Random Forest, Neural Net, and Naive Bayes). For the vectorized representations, the accuracy can also be measured via the networks created from them.

A representation that leads to high accuracy in at least one model is considered a good representation.

I chose to divide the population into two classes based on their pre-existing medical conditions:

  1. Patients with a pre-existing "Asthma" diagnosis
  2. Patients with a pre-existing "diabetes type 2" diagnosis

This population was chosen because of the available data on Covid-19 patients, and based on clinical studies that showed a correlation between these two types of background illness and how Covid-19 evolves.

In order to compare the same data in different approaches, I use this population for the other approaches as well.
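Below is a minimal sketch of this evaluation loop with scikit-learn; the representation matrix X, the labels y, and the classifier hyperparameters are placeholders rather than the exact configuration used in the research.

```python
# Minimal sketch of the supervised evaluation: train the six classifier
# families on a given representation and compare held-out accuracy.
# X (one row per patient) and y (the diagnosis label) are placeholders.
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB

classifiers = {
    "Nearest Neighbors": KNeighborsClassifier(),
    "Linear SVM": SVC(kernel="linear"),
    "RBF SVM": SVC(kernel="rbf"),
    "Random Forest": RandomForestClassifier(),
    "Neural Net": MLPClassifier(max_iter=500),
    "Naive Bayes": GaussianNB(),
}

def evaluate_representation(X, y):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0)
    scores = {}
    for name, clf in classifiers.items():
        clf.fit(X_train, y_train)
        scores[name] = accuracy_score(y_test, clf.predict(X_test))
    return scores  # a representation is "good" if at least one score is high
```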

Accuracy of different models when using the classical embedding

2. Unsupervised approach

In our case, the original task is not a classification task. Therefore, for our purposes a representation with lower accuracy may still have its own advantages. I believe that accuracy on the defined labels is not the only approach to use, and for that reason I also use other measurements, stemming from the unsupervised setting.

I first use an unsupervised model for clustering, and then I measure the Elbow and Silhouette. We consider a representation good if it produces a clear elbow graph, as this reflects the ability to separate the samples into classes. Although I classified the patients' data with two labels, we know that the population can be divided in many different ways and can be classified into more classes. The unsupervised approach can reflect this idea.

The Elbow graph of the VAE representation. The decrease is very sharp; the sum of squared distances between data points and their assigned centroids is low.
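A minimal sketch of how the elbow (k-means inertia) and silhouette values can be computed with scikit-learn; the range of cluster counts and the representation matrix X are illustrative assumptions.

```python
# Sketch of the unsupervised evaluation: cluster a representation with k-means
# and record the elbow (inertia) and silhouette values for a range of k.
# X is a placeholder for the patient representation matrix.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def elbow_and_silhouette(X, k_values=range(2, 11)):
    results = []
    for k in k_values:
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        results.append({
            "k": k,
            "inertia": km.inertia_,          # sum of squared distances to centroids
            "silhouette": silhouette_score(X, km.labels_),
        })
    return results
```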

3. Mixed approach (Unsupervised and Supervised)

The third approach combines the supervised and unsupervised approaches: I use both the clusters and the labels. The measurement is purity. The more samples in each cluster that belong to the same label, the better the representation.
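Purity is not a single built-in scikit-learn metric, but it can be computed from the contingency matrix of labels versus cluster assignments; the sketch below shows one common way to do it, not necessarily the exact computation used in the research.

```python
# Sketch of the purity measurement: for each cluster, count the most common
# label, sum these counts over clusters, and divide by the number of samples.
from sklearn.metrics.cluster import contingency_matrix

def purity(labels_true, labels_pred):
    cm = contingency_matrix(labels_true, labels_pred)  # rows: labels, cols: clusters
    return cm.max(axis=0).sum() / cm.sum()
```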

Summary:

To sum up, I compare the different representations’ quality with 4 known measurements:

  1. Accuracy
  2. Elbow
  3. Silhouette
  4. Purity

The main conclusion is that a single type of representation cannot properly fit all future tasks; the representation should fit the relevant task. Whether a representation is considered good and valuable depends on the task. A variety of options, pre-built for multiple tasks, can make a significant contribution to the healthcare domain.


Keren Ofek Granov

Head of Anti-Fraud @ MoonActive, developing large-scale data models and processes for preventing different types of fraud