Linking healthcare registries to improve medical research

Apr 20, 2023

A healthcare registry holds information about a specific patient population with a particular health condition. Typically, researchers analyze the information collected in a registry for various purposes, such as evaluating the effectiveness of healthcare treatments, monitoring disease prevalence and more.

Allowing the collection of data across different healthcare registries can give significant benefits to medical researchers and the quality of their research. Firstly, aggregating data from multiple data sources (some of which may even be cross-border) increases the size of the researcher’s sample set, improving results of statistical analysis. Additionally, having access to different types of registries can enable researchers to discover correlations in the gathered data which would not be found otherwise.

However, it is challenging to link disparate registries. Difficulties may arise due to a variety of legal and technical reasons, such as registries using incompatible data schemas, data exchange limitations, data privacy requirements, and security concerns.

As part of the European Union-sponsored Horizon2020 HEIR project (www.heir2020.eu), we developed a prototype that demonstrates a privacy-aware framework (PAF) capable of allowing researchers and patients to query data across an aggregation of healthcare registries while preserving privacy requirements through policy-driven transformations. Specifically, we created a Kubernetes-based framework that enables users to obtain all their medical observation records from simulated versions of Diabetes, Prescription, and Cardiovascular Disease registries, and allows researchers to obtain anonymized-by-policy versions of these records linked across all registries.

This post gives an overview of how we created our prototype, looking at the technical decisions that ultimately shaped our solution.

Handling FHIR

FHIR, or Fast Healthcare Interoperability Resources (https://www.hl7.org/fhir/), is an emerging standard from the Health Level Seven (HL7) standards organization. It is designed around Internet communication protocols and creates a standard model for the representation and transfer of healthcare data. Currently, FHIR defines over 140 healthcare environment models called Resources, such as Observation, Patient, and Consent. FHIR has gained rapid acceptance in the medical world, and the US government has made it a key component of its interoperability initiatives, including the 21st Century Cures Act and the Office of the National Coordinator for Health Information Technology’s interoperability standards.

Therefore, prototyping our registries as stores of FHIR data is not only a future-looking healthcare direction, but also overcomes potential schema incompatibility issues between different, non-FHIR based registries.

FHIR resources can be expressed in JSON format. In our other healthcare use case work in HEIR (described in https://medium.com/fybrik/using-fybrik-to-create-a-privacy-aware-framework-to-access-fhir-data-245aa1a4a6a4), we stored FHIR data in a FHIR server and used the FHIR query language to extract information. It would certainly be reasonable to implement the healthcare registries using FHIR servers here as well. There are nonetheless disadvantages to this approach for a use case that links the backend data stores. In particular, the FHIR query language does not support SQL-style joins across resources, which is a major drawback. We therefore decided to build our prototype registries using the open source, relational PostgreSQL database (https://www.postgresql.org/), which has native support for JSON and gives us SQL JOINs.

Using the Synthea patient generator (https://synthea.mitre.org), we obtained synthetic patient information in the form of FHIR bundles, each of which is a collection of FHIR resources. While a FHIR server can import a FHIR bundle directly, importing a bundle into PostgreSQL takes considerable effort: SQL tables need to be created for the FHIR resources, and the resources in the bundle need to be separated out and reformulated as separate SQL import commands. Here we were helped by Fhirbase (https://www.health-samurai.io/fhirbase), which did this heavy lifting for us, ultimately creating three separate, populated databases that emulate three different healthcare registries.
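To give a feel for the work that Fhirbase took off our hands, here is a minimal Python sketch of the first step only: separating the resources in a bundle by type so that each group could be loaded into its own table. The sample bundle and the function name split_bundle are illustrative assumptions, and this is not how Fhirbase itself is implemented.

```python
from collections import defaultdict


def split_bundle(bundle):
    """Group the resources in a FHIR bundle by their resourceType."""
    by_type = defaultdict(list)
    for entry in bundle.get("entry", []):
        resource = entry.get("resource")
        if resource is None:
            continue  # skip entries that carry no resource
        by_type[resource["resourceType"]].append(resource)
    return dict(by_type)


# A tiny hand-written bundle (illustrative data, not real Synthea output)
bundle = {
    "resourceType": "Bundle",
    "type": "transaction",
    "entry": [
        {"resource": {"resourceType": "Patient", "id": "p1"}},
        {"resource": {"resourceType": "Observation", "id": "o1",
                      "subject": {"reference": "Patient/p1"}}},
        {"resource": {"resourceType": "Observation", "id": "o2",
                      "subject": {"reference": "Patient/p1"}}},
    ],
}

tables = split_bundle(bundle)
for resource_type, resources in tables.items():
    print(resource_type, len(resources))
```

Each group would then still have to be turned into table definitions and insert statements, which is the part that makes a hand-rolled import tedious.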

Linking the registries

Once we had our FHIR data stored in separate databases, we needed a technology to link, or federate, the separate databases, ideally allowing us to perform a single SQL SELECT query (for JSON data!) across all tables in all registries. We also wanted to build an open source solution that would be scalable, high-performing, and flexible enough to accommodate other forms of backend data stores in future enhancements of our use case.

We selected Presto (www.prestodb.io) for this purpose. Built as a distributed SQL query engine, Presto features connectors to many datastores, including SQL databases, NoSQL databases, data warehouses etc.
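What makes a single query across registries possible is that Presto addresses every table as catalog.schema.table, so tables from different backend databases can appear in the same statement. The sketch below builds such a statement as a string. The catalog names (diabetes, cardio), the public schema, the id column, and the assumption that subject.reference holds "Patient/<id>" are all hypothetical choices for illustration, not the prototype's actual configuration.

```python
def cross_registry_join(patient_catalog, observation_catalog, schema="public"):
    """Build a SELECT joining patients in one registry to observations in another.

    Assumes each registry is mounted in Presto as its own catalog and that the
    FHIR JSON lives in a column called "resource".
    """
    return (
        "SELECT p.id AS patient_id, o.resource AS observation\n"
        f"FROM {patient_catalog}.{schema}.patient AS p\n"
        f"JOIN {observation_catalog}.{schema}.observation AS o\n"
        "  ON json_extract_scalar(o.resource, '$.subject.reference')"
        " = concat('Patient/', p.id)"
    )


# Hypothetical catalog names, one per simulated registry
query = cross_registry_join("diabetes", "cardio")
print(query)
```

The resulting text could be submitted through any Presto client; how the JSON column is typed on the Presto side depends on the connector configuration, so the join condition may need adjusting.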

Fybrik — policy-driven data protection

Having tackled storing FHIR data and enabling SQL queries against this distributed data, we needed a way to enforce policy governing the sharing of healthcare data.

Together with our healthcare partners in the HEIR project, we defined the goal of supporting a policy that allows data owners (e.g. patients) access to all of their records, while requiring transformation or redaction of sensitive data before it is released to authorized third parties (e.g. authorized researchers). Further, we wanted to be able to define and operate on sensitive data at the FHIR attribute level, which is a much finer-grained resolution than the resource-level data access currently offered by the FHIR standard.

To accomplish these objectives, we employed Fybrik, an Open Source framework developed by IBM (www.fybrik.io), built on top of Kubernetes. I have previously written about our use of Fybrik to enforce policies for FHIR data (https://medium.com/fybrik/using-fybrik-to-create-a-privacy-aware-framework-to-access-fhir-data-245aa1a4a6a4), so I will just provide a brief overview of it here.

The conceptual architecture of our privacy-aware framework is shown in Fig. 1. A Fybrik module (https://fybrik.io/dev/concepts/modules/) functions as a mediator between the data requester and the data source: it not only connects to the data source but also transforms or redacts the requested data to comply with privacy requirements. We use Open Policy Agent (https://www.openpolicyagent.org/) to implement a Policy Decision Point (PDP) which, when queried, returns to the Fybrik module an action that defines what needs to be done to the data before it can be returned to the requester. The Fybrik module then acts as a Policy Enforcement Point and carries out the required action.
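The split between deciding and enforcing can be sketched in a few lines of Python. This is only an in-process stand-in: in the framework the decision comes from Open Policy Agent, not from a Python function, and the action names, the PII field list, and the sample record are assumptions made for the example.

```python
# Assumed PII classification; in practice this would come from a data catalog
PII_FIELDS = {"name", "birthDate", "address"}


def decide(role):
    """Policy Decision Point stand-in: map a requester role to an action."""
    if role == "Patient":
        return {"action": "Allow"}
    if role == "Researcher":
        return {"action": "Redact", "fields": sorted(PII_FIELDS)}
    return {"action": "Deny"}


def enforce(decision, record):
    """Policy Enforcement Point stand-in: carry out the action on one record."""
    if decision["action"] == "Allow":
        return dict(record)
    if decision["action"] == "Redact":
        return {key: ("XXXXX" if key in decision["fields"] else value)
                for key, value in record.items()}
    raise PermissionError("request denied by policy")


record = {"id": "p1", "name": "Example", "gender": "female"}
print(enforce(decide("Researcher"), record))
print(enforce(decide("Patient"), record))
```

Keeping the decision separate from the enforcement is what allows the rules to change without touching the code that handles the data.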

Parameters required by the PDP, such as the requester role, are provided by a cryptographically signed token called a JSON Web Token (JWT), as we will discuss later.

Enforcing data privacy at the FHIR attribute level

In our use case, we want to allow patients to obtain all their Observation records from the linked registries, and to allow researchers to obtain all Observation records, but with Personally Identifiable Information (PII) redacted.

Typically, a Data Privacy Officer will classify in a catalog which data should be considered PII, and this classification is then made available to the Fybrik module.

In our previous work (https://medium.com/fybrik/using-fybrik-to-create-a-privacy-aware-framework-to-access-fhir-data-245aa1a4a6a4), we showed how to achieve attribute-level redactions using the Privacy-Aware Framework in conjunction with a FHIR server. In that work, patient records were stored in a FHIR server, and the data requester accessed the data using the FHIR query language. Consequently, when PAF intercepted the returned data, it was straightforward for the Fybrik module to identify the PII fields in the returned data and redact them if required, since the data was returned in the prescribed format of the FHIR resource schema.

However, things are more complicated when the FHIR JSON data can be accessed through SQL queries. As a simple example, suppose we want the family name in the following Patient resource snippet to be redacted (see Fig. 2).

When using a FHIR query for Patient information, we can be confident that the “name.family” structure will be preserved in the returned data. In this case, our Data Privacy Officer just needs to classify “name.family” in the Patient resource as PII in order for PAF policy enforcement to work.
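As a rough illustration of why a fixed structure makes enforcement easy, the sketch below masks whatever sits at a classified dotted path such as “name.family”. The Patient snippet is a hand-written stand-in for the one in Fig. 2 (only the family name is taken from this post), and redact_path is a hypothetical helper, not part of Fybrik.

```python
import copy


def redact_path(node, path, mask="XXXXX"):
    """Mask the value at a dotted path, descending into lists along the way."""
    key, _, rest = path.partition(".")
    if isinstance(node, list):
        # FHIR elements such as Patient.name repeat, so visit every item
        for item in node:
            redact_path(item, path, mask)
    elif isinstance(node, dict) and key in node:
        if rest:
            redact_path(node[key], rest, mask)
        else:
            node[key] = mask
    return node


# Stand-in for the Patient snippet; the given name is a placeholder
patient = {
    "resourceType": "Patient",
    "name": [{"family": "Salant", "given": ["Pat"]}],
}

# Work on a copy so the original record is left untouched
redacted = redact_path(copy.deepcopy(patient), "name.family")
print(redacted)
```

Because the FHIR server always returns the resource in its schema-defined shape, the same classified path can be applied to every response.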

However, a PostgreSQL query can reach the family name through queries at different levels in the JSON hierarchy. For example, the following queries will all return information containing the family name, although it will sit at different levels of the data hierarchy:

SELECT resource FROM patient;  will return the whole resource

SELECT resource->'name' FROM patient;  will return only what is under “name”

SELECT resource->'name'->0->'family' FROM patient;  will return only “Salant”

It is now hard for the Fybrik module to figure out what needs to be redacted in the returned data, since there will no longer be a one-to-one match between the returned data and format of the PII classification.
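The mismatch can be made concrete with a small check. The three values below imitate the shapes returned by the three queries above (hand-written, not actual query output), and path_resolves is a hypothetical helper that reports whether the classified path “name.family” can still be located in each shape.

```python
def path_resolves(node, path):
    """Return True if the dotted path can be followed inside node."""
    key, _, rest = path.partition(".")
    if isinstance(node, list):
        return any(path_resolves(item, path) for item in node)
    if isinstance(node, dict) and key in node:
        return path_resolves(node[key], rest) if rest else True
    return False


shapes = {
    "whole resource": {"resourceType": "Patient", "name": [{"family": "Salant"}]},
    "name only": [{"family": "Salant"}],
    "family only": "Salant",
}

results = {label: path_resolves(shape, "name.family")
           for label, shape in shapes.items()}
print(results)
# {'whole resource': True, 'name only': False, 'family only': False}
```

All three results contain the family name, but only the first still matches the classification, so a path-based redactor would let the other two through unmasked.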

Along with developing a solution that enables the Fybrik module to intercept sensitive data, we needed to protect the datastores. This means that the keys or usernames/passwords for the registry databases need to be securely handled by the Privacy-Aware Framework and never exposed to the data requester. It also quickly became obvious that we needed a solution that would spare the data requester from formulating the complex SQL queries needed to extract the distributed JSON data.

Taking all of this into account, we built a solution where the Fybrik module exposes a number of REST endpoints, each of which returns a different slice of the FHIR data. The syntax for the REST calls to these endpoints is very similar to the FHIR query syntax (for simple queries). In fact, without going too deeply into details, our Kubernetes-based solution can be configured so that the exposed REST endpoints look like endpoints into a FHIR server, while behind the scenes data requests are redirected to the Privacy-Aware Framework.

When a data request is received, the Fybrik module collects the information and keys required to link the registries, and then formulates and executes a unique SQL SELECT query for each endpoint. For example, there is an endpoint to return all Observation records which uses an SQL JOIN across all registries and another endpoint to return all Observation records corresponding to a given patient id.
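One way to picture the mapping from endpoints to fixed queries is the sketch below. It is a simplification under several assumptions: the catalog names, the “patientrecords” endpoint name, and the layout of subject.reference are hypothetical, and the rows are gathered with UNION ALL for brevity, which is not necessarily how the prototype's own query is written.

```python
import re

# Hypothetical Presto catalog names, one per simulated registry
CATALOGS = ["diabetes", "prescription", "cardio"]

# FHIR ids are limited to letters, digits, "-" and ".", up to 64 characters
FHIR_ID = re.compile(r"[A-Za-z0-9\-.]{1,64}")


def observations_sql(patient_id=None):
    """Build the Observation query, optionally restricted to one patient."""
    where = ""
    if patient_id is not None:
        # Validate before interpolating so a request cannot inject SQL
        if not FHIR_ID.fullmatch(patient_id):
            raise ValueError("invalid patient id")
        where = (" WHERE json_extract_scalar(resource, '$.subject.reference')"
                 f" = 'Patient/{patient_id}'")
    return "\nUNION ALL\n".join(
        f"SELECT resource FROM {catalog}.public.observation{where}"
        for catalog in CATALOGS
    )


# Each endpoint owns exactly one query shape
ENDPOINTS = {
    "allrecords": lambda params: observations_sql(),
    "patientrecords": lambda params: observations_sql(params["patient_id"]),
}


def query_for(endpoint, params=None):
    """Look up the endpoint and return the SQL it would execute."""
    return ENDPOINTS[endpoint](params or {})


print(query_for("patientrecords", {"patient_id": "p1"}))
```

Since the requester only ever names an endpoint and its parameters, the SQL text and the database credentials stay inside the module.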

Since both patients and researchers can freely send requests for data to the REST endpoints, the data privacy policy needs to be evaluated by PAF’s PDP at runtime. Additionally, we need a way to make sure that data requesters are who they claim to be, to prevent them from masquerading as different identities in order to access data they are not entitled to. How user authentication is done won’t be discussed here; instead, we will start with the cryptographically signed token, called a JSON Web Token (JWT), that authentication produces.

As mentioned earlier, to support runtime policy decisions, every REST request to the endpoints needs to be accompanied by this JWT, which encodes the user credentials required by the data access policy, such as the role (“Researcher” or “Patient”), the patient id (for the “Patient” role), the organizational affiliation (for the “Researcher” role), etc. The PAF is therefore immediately able to block an unqualified user from accessing an endpoint. For example, a request without a role of “Researcher” in its JWT will not be able to access the “allrecords” endpoint, and a request from a Researcher will not be able to access the endpoint for records corresponding to a specific patient id.
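The first level of enforcement can be sketched with nothing but the Python standard library. The example signs and verifies an HS256 token and then checks the role claim against the endpoint. The claim names (role, org), the shared secret, and the “patientrecords” endpoint name are assumptions for illustration; a real deployment would more likely rely on an identity provider with asymmetric keys and a vetted JWT library, and would also check claims such as expiry.

```python
import base64
import hashlib
import hmac
import json


def _b64url(data):
    """Base64url-encode bytes without padding, as JWTs require."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text):
    """Decode base64url text, restoring the stripped padding first."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign_jwt(claims, secret):
    """Create a compact HS256-signed token for the given claims."""
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    signature = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return signing_input + "." + _b64url(signature)


def verify_jwt(token, secret):
    """Return the claims if the signature is valid, otherwise raise ValueError."""
    header_b64, payload_b64, signature_b64 = token.split(".")
    if json.loads(_b64url_decode(header_b64)).get("alg") != "HS256":
        raise ValueError("unexpected algorithm")
    expected = hmac.new(secret, f"{header_b64}.{payload_b64}".encode("ascii"),
                        hashlib.sha256).digest()
    if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
        raise ValueError("bad signature")
    return json.loads(_b64url_decode(payload_b64))


# Which role each endpoint is open to (hypothetical endpoint names)
ENDPOINT_ROLES = {"allrecords": "Researcher", "patientrecords": "Patient"}


def may_access(endpoint, claims):
    """First-level check: does the role in the token match the endpoint?"""
    return ENDPOINT_ROLES.get(endpoint) == claims.get("role")


SECRET = b"demo-secret-not-for-production"
token = sign_jwt({"role": "Researcher", "org": "example-lab"}, SECRET)
claims = verify_jwt(token, SECRET)
print(may_access("allrecords", claims), may_access("patientrecords", claims))
```

Because the role is read from a verified token and not from the request itself, a requester cannot simply claim a different role to reach another endpoint.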

Since the Fybrik module constructed the query necessary to build each endpoint, it has control and knowledge over the format of the data to be returned — avoiding the problem with locating the data to be redacted in unrestricted queries which we discussed earlier.

Once a data request passes the first level of enforcement (“Is the requester in the JWT allowed to access this endpoint?”), the Fybrik module then applies the policy action to the data slice returned by the endpoint. For example, a request to the “allrecords” endpoint whose JWT specifies the “Researcher” role will cause the Fybrik module to redact all PII in the returned patient records.

For a researcher wishing to do further analysis of the data, it is straightforward to query for data from an environment like a Jupyter notebook and then load the results into a data analysis structure, such as a Pandas dataframe in Python.
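As a sketch of that last step, the function below flattens Observation resources into plain rows. It assumes the endpoint returns a JSON list of Observation resources and that values are carried in valueQuantity; the two sample records are hand-written, not output from the prototype.

```python
def observation_rows(observations):
    """Flatten FHIR Observation resources into one flat dict per record."""
    rows = []
    for obs in observations:
        # Take the first coding if present, otherwise fall back to an empty one
        coding = (obs.get("code", {}).get("coding") or [{}])[0]
        quantity = obs.get("valueQuantity", {})
        rows.append({
            "patient": obs.get("subject", {}).get("reference"),
            "code": coding.get("code"),
            "display": coding.get("display"),
            "value": quantity.get("value"),
            "unit": quantity.get("unit"),
            "effective": obs.get("effectiveDateTime"),
        })
    return rows


# Hand-written sample records standing in for an endpoint response
sample = [
    {"resourceType": "Observation",
     "subject": {"reference": "Patient/XXXXX"},
     "code": {"coding": [{"system": "http://loinc.org", "code": "4548-4",
                          "display": "Hemoglobin A1c/Hemoglobin.total in Blood"}]},
     "valueQuantity": {"value": 6.3, "unit": "%"},
     "effectiveDateTime": "2022-05-01T09:30:00Z"},
    {"resourceType": "Observation",
     "subject": {"reference": "Patient/XXXXX"},
     "code": {"coding": []}},
]

rows = observation_rows(sample)
print(rows[0])
```

With Pandas installed, pandas.DataFrame(rows) turns these rows into a dataframe ready for analysis.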

Summary

We have created a conceptual solution to show how siloed healthcare registries can be transparently linked to allow for more powerful results from querying patient records.

We assume that the data will be stored in FHIR, since we believe that this is the future for healthcare registries. However, it would be possible for the Privacy-Aware Framework to handle schema translations behind the scenes for registries that use other data formats.

We have demonstrated a simple data protection policy. However, the definition of policy rules is independent of the executable PAF code, meaning that a much more sophisticated and realistic set of rules could be declaratively configured for this use case without requiring recoding or a reconfiguration of the developed solution.

The research leading to these results has received funding from the European Community’s Horizon 2020 Research and Innovation Programme under grant agreement n° 883275.