The Ongoing Quest: Addressing Open Issues in Pseudonymization within Federated Learning Frameworks

Melanie Buechler
Published in ELCA IT
7 min read · Dec 21, 2023

In today’s data-driven world, personal data is essential for informed decision-making. Protecting this sensitive information is imperative, more so under stringent privacy laws such as the European Union’s General Data Protection Regulation (GDPR) or Switzerland’s Federal Act on Data Protection (FADP).

This article is based on the efforts undertaken at SUMEX AG, a leading provider of ERP and claims management software for the Swiss healthcare system and part of the ELCA group. SUMEX uses machine learning to identify errors and anomalies in medical invoices. The goal of its next-generation solution is to train machine learning models using data from multiple insurers. However, this process faces challenges due to the sensitive nature of the processed data. Insurance companies may be hesitant or restricted from sharing data, either among themselves or with SUMEX, due to competitive and regulatory considerations.

© author, generated by DALL·E

These challenges prompted our exploration of innovative solutions, guiding us to the fascinating world of federated learning. Federated learning, a decentralized form of machine learning, enables collaborative model training while complying with privacy regulations. However, federated learning alone may not suffice, and other privacy-enhancing technologies have to be integrated. A popular privacy-enhancing technology is pseudonymization. While the combination of federated learning and pseudonymization offers great benefits, it introduces new challenges, as we discovered at ELCA. This article discusses these challenges and proposes strategies for addressing them.

Understanding the Technologies

Federated learning is a form of distributed machine learning in which each party trains a local model using its private data. Only the model's parameters, not the raw data, are shared with a central server. The server aggregates these parameters to create a global model. The strength of this innovative technology is its ability to facilitate collaborative training while preserving data privacy, since only model parameters rather than training data are exchanged. This collaborative process creates global models trained on larger datasets, which may be more robust and accurate. Additionally, the global model can then be tuned to the specific needs of each client. However, recent research has shown that there are certain vulnerabilities, such as model inversion or inference attacks [1]. Therefore, additional privacy-enhancing technologies are needed to prevent the leakage of sensitive data. To learn more about the different federated learning frameworks in Python, check out the Medium article written by fellow ELCA colleague Alex Braungardt.

Federated Learning — schema by author
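
To make the aggregation step concrete, here is a minimal sketch of federated averaging (FedAvg) in plain NumPy. The helper name and the weighting by local dataset size are illustrative choices, not the API of any particular federated learning framework.

```python
import numpy as np

def aggregate(client_params, client_sizes):
    """Federated averaging: weight each client's parameters by its dataset size."""
    total = sum(client_sizes)
    return sum(p * (n / total) for p, n in zip(client_params, client_sizes))

# One illustrative round: three parties train locally and send only parameters.
local_params = [np.array([0.9, 1.1]), np.array([1.0, 0.8]), np.array([1.2, 1.0])]
local_sizes = [1000, 400, 600]

global_params = aggregate(local_params, local_sizes)
print(global_params)  # the server never sees any raw training data
```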

Pseudonymization, on the other hand, is a privacy-enhancing technique. It describes the replacement of personally identifiable information (PII) with pseudonyms such that it can no longer be attributed to a specific data subject without additional information. This adds an extra layer of privacy protection by replacing sensitive personal data (e.g., names or addresses) with hashes or encrypted values that preserve identifiers and relationships. Hashing is a one-way operation and is generally not reversible. However, any party in possession of the original PII and the parameters of the hash function can re-compute the pseudonym and thus map it to its original value. Encryption transforms the PII into pseudonyms that can be decrypted with a secret key. Pseudonymization is valued for its simplicity and ease of implementation. It ensures that PII can only be interpreted by parties that have access to the necessary secrets, while everyone else can still see relationships and perform aggregations. In this way, it balances data privacy and utility, albeit with important limitations.

From a legal perspective, pseudonymized data remains personal data, with all the legal consequences that come with it. Additionally, measures must be taken to ensure that pseudonymized data cannot be re-identified by unauthorized parties, whether by cross-referencing it with other available information or by gaining access to hash tables or decryption keys. Therefore, access must be secured and additional privacy-enhancing technologies may be necessary [2].
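
As a rough illustration of keyed hashing as a pseudonymization mechanism (a simplified sketch, not the exact setup used at SUMEX), the snippet below replaces a patient name with an HMAC-SHA256 pseudonym: anyone holding the same secret can re-compute and match it, while everyone else only sees an opaque identifier.

```python
import hmac
import hashlib

SECRET = b"party-local-pseudonymization-key"  # must be protected like any other credential

def pseudonymize(pii: str, secret: bytes = SECRET) -> str:
    """Replace a PII value with a keyed-hash (HMAC-SHA256) pseudonym."""
    return hmac.new(secret, pii.encode("utf-8"), hashlib.sha256).hexdigest()

record = {"patient": "Max Muster", "diagnosis": "J06.9", "amount": 120.50}
record["patient"] = pseudonymize(record["patient"])
print(record)  # diagnosis and amount stay usable for analysis; the identity does not
```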

Nevertheless, under the right circumstances, pseudonymization protects sensitive information, which makes it very attractive for machine learning and statistics use cases.

Challenges of Pseudonymized Data in a Federated Setup

We consider a federated framework that involves:

  • K parties: These parties collectively train a model using their respective local private data.
  • Server: This central entity facilitates communication across the network and performs model aggregation.

Combining the data sharing inherent to federated learning with pseudonymization leads to new challenges. In particular, one advantage of pseudonymization is the ability to aggregate and join data based on pseudonyms. But in a multi-party setting where each party provides individually pseudonymized data, this no longer works.

To illustrate, consider a model trained on data from multiple healthcare providers (parties). For federated model training, each party pseudonymizes its data using a different pseudonymization secret (e.g., symmetric encryption with different secret keys). This makes it difficult for the server to recognize that a patient has records in the data sets of several parties, because the pseudonyms do not match.

Federated Setup — schema by author

For example, if Hospital 2 and Doctor 2 share a common patient named Max, the server cannot recognize this, because Max from Hospital 2 and Max from Doctor 2 are not pseudonymized in the same way. This is called non-uniform pseudonymization, and it prevents the recognition of common identifiers across data sets. Therefore, the benefit of having a bigger dataset may be lost, because the server cannot aggregate the information correctly.

Pseudonymization by Different Parties — schema by author
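
The effect is easy to reproduce with the keyed-hashing sketch from above (the party names and keys are purely illustrative): the same patient yields different pseudonyms under different secrets, so the server cannot link the two records.

```python
import hmac
import hashlib

def pseudonymize(pii: str, secret: bytes) -> str:
    return hmac.new(secret, pii.encode("utf-8"), hashlib.sha256).hexdigest()

hospital2_key = b"hospital-2-secret"  # each party keeps its own pseudonymization secret
doctor2_key = b"doctor-2-secret"

p_hospital = pseudonymize("Max", hospital2_key)
p_doctor = pseudonymize("Max", doctor2_key)

print(p_hospital == p_doctor)  # False: the pseudonyms do not match across parties
```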

This issue extends beyond healthcare and can affect the training of models on many types of company data.

Uniform Pseudonymization

Before exploring potential solutions, we must keep our privacy objectives in mind.

  1. Intermediate results exchanged between parties and the server should not leak personal information to other parties
  2. The global model should not allow any party to learn new specific PII from the private data of other parties

The obvious solution to the challenge above is for all parties to use common pseudonyms, e.g., derived from a common shared secret key provided by a trusted third party, a concept termed “uniform pseudonymization”. However, a drawback arises: the parties gain the ability to de-pseudonymize the data of other parties, which conflicts with the second privacy objective while still adhering to the first one. In general, this is only a drawback when the parties get access to new pseudonyms (through the shared final model or model attacks). Therefore, this approach is only suitable if the final model is never shared (in the case of encryption) and, additionally, the PII does not come from a closed set of values that a party could exhaustively re-hash (in the case of hashing).

Overview of Different “Uniform Pseudonymization” Scenarios — schema by author
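
To see both the benefit and the drawback, here is a minimal sketch of uniform pseudonymization with keyed hashing (the shared key and candidate names are illustrative): pseudonyms now match across parties, but any party holding the shared secret can de-pseudonymize values drawn from a closed set simply by re-hashing candidates.

```python
import hmac
import hashlib

SHARED_SECRET = b"key-from-trusted-third-party"  # the same key is distributed to all parties

def pseudonymize(pii: str, secret: bytes = SHARED_SECRET) -> str:
    return hmac.new(secret, pii.encode("utf-8"), hashlib.sha256).hexdigest()

# Records from different parties now join on the pseudonym ...
print(pseudonymize("Max") == pseudonymize("Max"))  # True

# ... but any party can reverse pseudonyms from a closed set by re-hashing candidates.
candidates = ["Anna", "Max", "Mia"]
lookup = {pseudonymize(name): name for name in candidates}
print(lookup[pseudonymize("Max")])  # recovers "Max"
```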

However, most of the time the personal data in the model is exactly what clients care about. Consider a use case such as fraud detection, where insurance companies want to identify who exactly committed fraud: the information about a data subject showing fraudulent behavior is exactly what we are interested in. In such scenarios, maintaining data privacy against other parties while ensuring uniform pseudonymization becomes especially challenging. Despite recent research efforts to enhance privacy preservation in machine learning, enforcing uniform pseudonyms without allowing for de-pseudonymization by parties remains an open challenge. In cases where avoiding de-pseudonymization is crucial, alternative methods should be considered. For instance, in fraud detection, a clustering algorithm could be a viable solution, the underlying idea being that individuals who commit fraud at one insurer are likely to do so at another. Each party would use its own pseudonymization algorithm, thus satisfying both privacy criteria. The server clusters the data based on non-PII attributes, i.e., based on behavior. Closely clustered points can be considered as the same person, thereby enabling aggregation. This approach offers a privacy-preserving alternative without the need for uniform pseudonymization.
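
A rough sketch of this clustering idea, with made-up behavioral features and DBSCAN as one possible choice of algorithm: each party keeps its own pseudonyms, and the server links records purely through behavioral similarity.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Behavioral (non-PII) features per record, e.g. claim amount and claims per month.
# The rows come from different parties; their pseudonyms differ and cannot be joined.
features = np.array([
    [4800.0, 9.0],   # party A, pseudonym "a41f..."
    [4750.0, 8.0],   # party B, pseudonym "9c2e..." (behaviorally close to the row above)
    [120.0, 1.0],    # party A, ordinary claim
    [140.0, 1.0],    # party B, ordinary claim
])

# Standardize the features so both dimensions contribute comparably, then cluster.
scaled = (features - features.mean(axis=0)) / features.std(axis=0)
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(scaled)
print(labels)  # rows sharing a cluster label can be treated as the same individual
```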

Conclusion

In summary, the integration of pseudonymization into federated learning emerges as an elegant solution for balancing data utility and privacy. The advantages are clear; however, our exploration has shed light on persistent challenges surrounding the uniformity of pseudonyms and the associated privacy concerns. Acknowledging these hurdles, we underscore the critical need for innovative solutions.

For SUMEX, these findings bear direct relevance, affecting the way data privacy and collaborative learning are approached. As federated learning gains popularity, the need for practical strategies to navigate pseudonymized data challenges becomes increasingly urgent. Looking ahead, a strategic focus on overcoming these challenges, possibly through the adoption of innovative encryption methods, will be pivotal. This focused approach is poised to unlock the full potential of this powerful combination, contributing significantly to the evolving landscape of data privacy and advancing SUMEX’s commitment to cutting-edge, responsible data management practices.

References

[1] J. Zhang et al., “Security and Privacy Threats to Federated Learning: Issues, Methods, and Challenges”, Hindawi Security and Communication Networks, 2022.

[2] G.A. Kaissis et al., “Secure, privacy-preserving and federated machine learning in medical imaging”, Nature Machine Intelligence, Vol. 2, 305–311, 2020.
