How to Develop Machine Learning while Maintaining Data Privacy?

Machine learning without sharing data, or what Federated Learning is about.

Published in Axioma AI Journal · 4 min read · Jun 29, 2022

Large technology companies currently collect customer data and use it to build intricate systems that can predict people’s behavior. But the knowledge extracted by ML methods, mainly deep learning, does not always serve the customers’ interests, which is why it is so important to keep our data confidential. But is that possible?

1. Basics

First of all, it is necessary to understand how machine learning works in general. Any machine learning model learns from a sample of data, and the quality of the resulting model depends directly on the quality of the input data. Every machine learning approach can be described with the following steps:

Data Collection (First you need to collect data from somewhere. It often has to be gathered from different sources, and because those sources use different formats, the raw data turns out to be messy and ambiguous.)
Data Cleaning (The next step is to clean this data so that a model can be trained on it. This step includes data analysis, data manipulation, feature engineering, and more.)
Model Training/Testing/Evaluation (Finally, the most technical part is developing a mathematical algorithm that can not only analyze data but also make predictions. This step draws on mathematics, computer science, knowledge of the application domain, and more.)
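The three steps above can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: the two data sources, their formats, and the function names (collect, clean, train, evaluate) are hypothetical, and a simple least-squares line stands in for a real model.

```python
def collect():
    """Step 1: gather raw records from two hypothetical sources with different formats."""
    source_a = [{"x": "1.0", "y": "2.1"}, {"x": "2.0", "y": "3.9"}, {"x": None, "y": "5.0"}]
    source_b = [(3.0, 6.2), (4.0, 8.1), (5.0, 9.8), (6.0, 12.1)]
    return source_a, source_b


def clean(source_a, source_b):
    """Step 2: drop incomplete records and convert everything to (x, y) float pairs."""
    rows = []
    for record in source_a:
        if record["x"] is None or record["y"] is None:
            continue  # skip records with missing values
        rows.append((float(record["x"]), float(record["y"])))
    rows.extend((float(x), float(y)) for x, y in source_b)
    return rows


def train(rows):
    """Step 3a: fit y = w * x + b with closed-form ordinary least squares."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    var = sum((x - mean_x) ** 2 for x, _ in rows)
    w = cov / var
    b = mean_y - w * mean_x
    return w, b


def evaluate(model, rows):
    """Step 3b: mean squared error of the model on held-out rows."""
    w, b = model
    return sum((w * x + b - y) ** 2 for x, y in rows) / len(rows)


rows = clean(*collect())
train_rows, test_rows = rows[:4], rows[4:]  # simple hold-out split
model = train(train_rows)
test_mse = evaluate(model, test_rows)
print("model (w, b):", model)
print("test MSE:", test_mse)
```

Note that every function here sees the raw data directly, which is exactly the situation the next section describes.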

2. Problem definition

In the classical machine learning approach, all data is processed on a single server. Every stage takes place in one data center, so the owner of the computing equipment has access to everything. The problem lies in potential attackers who can steal the data or the results, from which far-reaching and often costly conclusions about people can be drawn.

It is one thing when consumers suffer only financially, for example by paying 20 percent more for another microwave oven because of targeted advertising. But when it comes to personal data as sensitive as private photos, correspondence, or even your DNA, the consequences can be far worse.

3. Solution

To solve this problem, a machine learning technique called Federated Learning has been developed. Federated learning (also known as collaborative learning) is a machine learning method that trains an algorithm across multiple decentralized devices or servers, each holding its own local data samples, without exchanging those samples. As a result, the system effectively learns from all of the data, so the model benefits from the largest possible training set, while the data itself is not disclosed and remains private.

This approach differs from traditional centralized machine learning methods, where all local datasets are uploaded to a single server, as well as from more classical decentralized approaches, which often assume that local data samples are identically distributed.

To see the difference clearly, let’s assume that several hospitals want to develop a shared machine learning model for predicting breast cancer. With federated learning, each hospital trains a local model on its own server rather than sending raw data to a centralized server, and thereby preserves the privacy of its patients. Hospitals act as remote clients and regularly interact with the central server to learn the global model. In each iteration, the hospitals send their local models to the server; the server computes the global model from them and sends it back. This process is repeated until the model converges or some stopping criterion is reached. That is, the model is trained on all the data, but the data itself never travels over the network and is never transmitted to a central server.
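The hospital loop described above can be sketched as follows. This is a minimal illustration and not a production protocol: the hospital names and data are made up (synthetic points on the line y = 2x + 1), a tiny linear model stands in for a cancer classifier, and the server combines local models by size-weighted averaging of parameters (commonly called federated averaging), which is an assumption here since the aggregation rule is not specified in the text.

```python
def local_update(global_model, data, lr=0.02, epochs=5):
    """Client side: start from the global model and run a few gradient steps on local data only."""
    w, b = global_model
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def federated_average(local_models, sizes):
    """Server side: average the parameters, weighting each client by its number of samples."""
    total = sum(sizes)
    w = sum(m[0] * n for m, n in zip(local_models, sizes)) / total
    b = sum(m[1] * n for m, n in zip(local_models, sizes)) / total
    return w, b


# Hypothetical private datasets; each one stays with its hospital.
hospitals = {
    "hospital_a": [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)],
    "hospital_b": [(3.0, 7.0), (4.0, 9.0)],
    "hospital_c": [(1.5, 4.0), (2.5, 6.0), (3.5, 8.0), (4.5, 10.0)],
}

global_model = (0.0, 0.0)
for _ in range(500):  # fixed number of rounds as a simple stopping criterion
    # Each hospital trains locally; only model parameters are sent to the server.
    local_models = [local_update(global_model, data) for data in hospitals.values()]
    sizes = [len(data) for data in hospitals.values()]
    # The server aggregates the local models and sends the global model back.
    global_model = federated_average(local_models, sizes)

print("global model (w, b):", global_model)
```

Only the pairs (w, b) cross the network in this sketch; the (x, y) records never leave the dictionary entry of the hospital that owns them. Real deployments typically add further protections on top, since model parameters themselves can leak information.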

But it is worth noting that any change has both good and bad sides. For example, federated learning introduces new time constraints, since clients and the server have to communicate repeatedly, it requires the data to follow a specific structure, and more. If you are interested in learning more about this type of machine learning, let me know so that I can write an extended article.

Finally

I want to say that data as specific and unique as a DNA sequence should remain confidential and not be disseminated. But to advance our science and technology we need this very data, and we must somehow gain access to it. Federated learning is a great concept for this purpose, as it can provide two very important things: privacy and scalability. In the future, we will see more than one commercial project built on these algorithms and be able to judge from real results how effective this type of machine learning is!

Be human, do science 🕊
