Identifying heart failure patients at high risk using MPC

Marie Beth van Egmond
The Sugar Beet: Applied MPC
6 min read · Jul 1, 2020

Machine learning algorithms are widely used to improve health care, for example to identify risk factors for a disease. Results can be used for the development of new treatments. For training these algorithms, a lot of data are needed, often divided over different data sources. In practice, however, combining these data sources is both legally and technically challenging.

When scientific researchers want to use medical data that can be traced back to individuals for machine learning algorithms, this must comply with the General Data Protection Regulation (GDPR). Local data can often be used for scientific research purposes, but it becomes challenging when data from different sources need to be combined. First, the GDPR focuses on using as little data as possible, while machine learning thrives on large datasets. Second, consent from the patient is often needed, which is time-consuming and causes practical problems, for example because the hospital is no longer in contact with the patients.

This is where Secure Multi-Party Computation (MPC) solutions come in. In this blog we describe our solution for training a machine learning algorithm on data distributed over multiple parties.

Although we do not use real patient data, the set-up of our MPC solution is inspired by the following real-life situation. In Rotterdam, there is a group of patients who are insured by insurance company Zilveren Kruis and also took part in a program by hospital Erasmus MC. On one side, Erasmus MC has data on the lifestyle of these patients, for example their exercising behaviour. On the other side, Zilveren Kruis has data on different attributes such as hospitalization days and health care usage outside the hospital. These datasets, once combined, could be used to train a prediction model that identifies high-risk heart failure patients. However, concerns about privacy and consent (to name a few) mean that these parties cannot simply share their data to allow for a straightforward analysis.

That is why, in 2018, the Netherlands Organisation for Applied Scientific Research (TNO), together with Erasmus MC and Zilveren Kruis, started a pilot within the H2020 Project BigMedilytics to develop a secure algorithm to predict the number of hospitalization days for heart failure patients. Note that we only use synthetic (though realistic) data during this MPC demo. With this data, we use MPC to calculate the relation between the number of hospitalization days and possibly important factors such as lifestyle. We focus on securely training the algorithm, with a prediction model as output; see Figure 1 for a simplified example.

Figure 1: A simplified example of the involved data. Zilveren Kruis and Erasmus MC have data on the same patients (blue dots), but the former only on hospitalization days and the latter only on exercising. The data cannot be combined as in this picture, but with the MPC solution, the prediction model (red line) can be calculated.

Once the algorithm is trained, the resulting coefficients are revealed to the participating parties. These coefficients can then be used in plaintext to apply the model to a single patient. In that case, getting consent from this one patient is much more straightforward. The data are held by Zilveren Kruis and Erasmus MC, and in theory they could directly execute an MPC protocol together. However, healthcare intermediation company ZorgTTP is also involved in the protocol at the request of both Zilveren Kruis and Erasmus MC. Technically this also has an advantage, because the involvement of the third party makes the implementation faster; furthermore, the secure regression protocol that we use requires at least three parties.

Our solution consists of two phases. It starts with a Secure Inner Join, which is needed in preparation for the Lasso regression.

Secure Inner Join

In this artificial set-up, Zilveren Kruis and Erasmus MC both have synthetic data on non-existing patients. The attributes of these datasets are different, but some of the patients are present in both data sources, i.e. the datasets are vertically distributed. A first challenge lies in combining the two datasets in such a way that the parties involved do not learn which patients are in the inner join. To this end, we developed a protocol for Secure Inner Join (SIJ). The input of this protocol consists of the two datasets from Zilveren Kruis and Erasmus MC. The output is a secret-shared version of the combined database, containing the associated attributes of the patients that are in both initial datasets.
In the SIJ protocol, patients are matched by an identifier (ID, based on birth date and postal code). First, both parties use a keyed hash to pseudonymise the IDs and homomorphically encrypt their attributes (see step 1 in Figure 2). Both parties send this encrypted dataset to ZorgTTP (step 2). The third party ZorgTTP can match the hashed IDs (step 3) without seeing the original IDs. Once the intersection is determined, an interactive protocol takes place to convert the encrypted intersection into plaintext secret shares (step 4). Finally, ZorgTTP sends the shares to Zilveren Kruis and Erasmus MC. The result of the secure inner join is an additive (2-out-of-2) secret sharing, which can easily be converted to a different linear secret sharing. In this case, we convert to Shamir secret sharing as implemented in MPyC, the library that we use to build the secure regression.

Figure 2: Secure Inner Join protocol.
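To make the matching and sharing steps more concrete, here is a minimal plaintext sketch in Python. It is not the SIJ protocol itself: the homomorphic encryption of the attributes and the interactive conversion in step 4 are left out, so in this sketch the matching code sees the attribute values, which the real protocol prevents. The choice of HMAC-SHA256 as the keyed hash, the modulus, the ID format and all names and records are assumptions made for illustration.

```python
import hashlib
import hmac
import secrets

MODULUS = 2**61 - 1  # assumed modulus for the additive shares


def keyed_hash(key: bytes, patient_id: str) -> str:
    """Pseudonymise an ID with HMAC-SHA256 under a key known only to the data holders."""
    return hmac.new(key, patient_id.encode("utf-8"), hashlib.sha256).hexdigest()


def pseudonymise(key, records):
    """Turn {patient_id: attributes} into {keyed_hash(patient_id): attributes}."""
    return {keyed_hash(key, pid): attrs for pid, attrs in records.items()}


def match(hashed_a, hashed_b):
    """Third-party step: intersect the hashed IDs without knowing the original IDs."""
    return sorted(hashed_a.keys() & hashed_b.keys())


def additive_share(value):
    """Split a value into two random-looking shares that sum to it modulo MODULUS."""
    share_1 = secrets.randbelow(MODULUS)
    share_2 = (value - share_1) % MODULUS
    return share_1, share_2


def reconstruct(share_1, share_2):
    """Recombine a 2-out-of-2 additive sharing."""
    return (share_1 + share_2) % MODULUS


# Made-up records: ID is "birth date|postal code", followed by one attribute each.
key = secrets.token_bytes(32)
insurer = {"1980-01-31|3011AA": [12], "1975-06-02|3062PA": [3]}      # hospitalization days
hospital = {"1980-01-31|3011AA": [150], "1990-11-20|3015GD": [40]}   # exercise minutes

hashed_insurer = pseudonymise(key, insurer)
hashed_hospital = pseudonymise(key, hospital)
common = match(hashed_insurer, hashed_hospital)

# Inner join on the hashed IDs, then secret-share every attribute value.
joined = {h: hashed_insurer[h] + hashed_hospital[h] for h in common}
shared = {h: [additive_share(v) for v in row] for h, row in joined.items()}
```

Each individual share is uniformly random, so neither share on its own reveals anything about the attribute; only the sum of both shares does.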

Lasso regression

Once the intersection with all attributes is secret-shared, the secure regression can start. In this project we implemented the Lasso (Least Absolute Shrinkage and Selection Operator) regression in MPyC, a Python library for MPC based on Shamir Secret Sharing.
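The conversion from additive shares to Shamir shares can be illustrated with a short stand-alone sketch. It does not use MPyC; it only shows the underlying idea that, because both schemes are linear, each holder of an additive share can Shamir-share that share and the parties can then add up what they receive. The prime modulus, the three parties with threshold 1 and all function names are assumptions for this example.

```python
import secrets

PRIME = 2**61 - 1  # assumed field modulus (a Mersenne prime)


def shamir_share(secret, n_parties=3, threshold=1):
    """Evaluate a random degree-`threshold` polynomial with f(0) = secret at x = 1..n."""
    coeffs = [secret % PRIME] + [secrets.randbelow(PRIME) for _ in range(threshold)]
    shares = []
    for x in range(1, n_parties + 1):
        y = 0
        for c in reversed(coeffs):  # Horner evaluation
            y = (y * x + c) % PRIME
        shares.append(y)
    return shares


def shamir_reconstruct(points):
    """Lagrange interpolation at x = 0 from a list of (x, y) pairs."""
    secret = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret


def additive_to_shamir(additive_shares, n_parties=3, threshold=1):
    """Each holder Shamir-shares its additive share; every party sums what it receives."""
    totals = [0] * n_parties
    for a in additive_shares:
        for k, s in enumerate(shamir_share(a, n_parties, threshold)):
            totals[k] = (totals[k] + s) % PRIME
    return totals


# Made-up value, first shared additively between two parties.
value = 12
a1 = secrets.randbelow(PRIME)
a2 = (value - a1) % PRIME

shares = additive_to_shamir([a1, a2])
# With threshold 1, any two of the three Shamir shares recover the value.
recovered = shamir_reconstruct([(1, shares[0]), (3, shares[2])])
```

The same linearity is what lets the parties add secret-shared values locally, without any communication.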
The input of the secure regression is the secret-shared intersection of both datasets. The output consists of the plaintext coefficients of the regression and the goodness of fit.
We chose Lasso regression because this method results in a sparse model with few coefficients; some coefficients can become zero and are eliminated from the model. This is convenient because it tells us which features matter the most in the model. The optimal coefficients are found by a Gradient Descent (GD) algorithm, an iterative optimization algorithm. The secure version of the GD algorithm could also be used as a building block on which different prediction algorithms can be built, such as the classification model Support Vector Machine.
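As an illustration of how a gradient-based method can drive coefficients to exactly zero, here is a small plaintext sketch. It is not our secure implementation: it works on ordinary floats rather than secret shares, and it uses proximal gradient descent (a gradient step followed by soft-thresholding), which is one common way to handle the Lasso penalty and is an assumption on our part for this example. The objective, the step size, the penalty weight and the toy data are likewise chosen only for illustration.

```python
def soft_threshold(value, amount):
    """Shrink a value towards zero by `amount`; values within [-amount, amount] become 0."""
    if value > amount:
        return value - amount
    if value < -amount:
        return value + amount
    return 0.0


def lasso_gd(X, y, lam=0.1, step=0.1, iterations=2000):
    """Minimise (1 / 2n) * ||Xw - y||^2 + lam * ||w||_1 by proximal gradient descent.

    No intercept is fitted, so the data are assumed to be centred.
    """
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(iterations):
        # Residuals of the current model on every row.
        residuals = [sum(wj * xj for wj, xj in zip(w, row)) - yi
                     for row, yi in zip(X, y)]
        # Gradient of the squared-error part of the objective.
        grad = [sum(r * row[j] for r, row in zip(residuals, X)) / n
                for j in range(d)]
        # Gradient step, then soft-thresholding for the L1 penalty.
        w = [soft_threshold(wj - step * gj, step * lam)
             for wj, gj in zip(w, grad)]
    return w


# Made-up, centred toy data: the target depends only on the first feature.
X = [[-1.5, 1.0], [-0.5, -1.0], [0.5, -1.0], [1.5, 1.0]]
y = [-3.0, -1.0, 1.0, 3.0]

coefficients = lasso_gd(X, y)
```

On this toy data the coefficient of the second feature ends up at exactly zero, while the first is shrunk slightly below the value 2 used to generate the data. This is the sparsity effect described above.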

As a result of the protocol, Zilveren Kruis and Erasmus MC receive the coefficients of the regression, trained on their combined synthetic input data (e.g. the red line in Figure 1). In a real-life setup, the result of this protocol would be that if one of the parties wants to predict the number of hospitalization days of a new patient, they only need the consent of this one patient for getting the data needed to do the prediction.

Demonstration

On 1 July 2020, we ran a demonstration of our solution with synthetic data between the three organisations, Zilveren Kruis, Erasmus MC and ZorgTTP. Each party has a Linux VM (in Azure in Western Europe, Rotterdam and Frankfurt, respectively) that communicates with the other VMs over the internet; connections are secured with TLS. Figure 3 shows a screenshot of the demonstration.

Figure 3: Screenshot of the demonstration of the secure solution, from the perspective of Erasmus MC. It shows the different steps in the protocol.

We tested the performance and scalability using artificial data, by measuring the running time and complexity over various phases of the protocol. Figure 4 shows the performance results. Performing the secure regression on a dataset with 10,000 patients and 10 features takes a bit over half an hour.

Figure 4: Performance results of secure Lasso run on three servers.

Conclusion

In this project, we found that our MPC solution has the potential to obviate the current complicated process of data coupling. It can provide a solution to the GDPR discussion on data minimization when combining data. The mathematical guarantees for the patients' privacy make it possible to train accurate prediction models without sacrificing privacy. Therefore, applying MPC can make more data available for analysis, and hence lead to more trustworthy results.
Altogether, the first results of our demo look very promising. With synthetic data we tested that we can run a regression on 10,000 patients and 10 features in a bit over half an hour. In the future, it is essential to investigate what hurdles need to be overcome within an organisation to start using MPC in this way, such as legal and compliance aspects. We are confident that such steps will result in an MPC pilot with real medical data in the near future.

The BigMedilytics project has received funding from the European Union’s Horizon 2020 research and innovation program under grant agreement No 780495.
