PySyft for secure and private Deep Learning

Quentin Bacuet
Published in ELCA IT
May 28, 2021

With increasing digitalization and automation across many domains, industry relies more and more on artificial intelligence. However, following recent data privacy scandals (the Facebook–Cambridge Analytica scandal, for instance), there is a trend toward increased data protection and traceability, illustrated by the GDPR (General Data Protection Regulation) recently introduced in the EU and the EEA.

This demand is particularly critical for sectors such as banking, healthcare, or insurance, where data privacy is essential.

Photo by Zan on Unsplash

To take advantage of the exponential growth of data while preserving the privacy and security of users' personal data, a set of techniques combining machine learning algorithms and cryptography has been developed. These methods allow for the training of ML models that:

  • Ensure that the private data is not moved out of the user's device.
  • Are secure in the sense that no one can extract personal information from them.

PySyft is a Python library that helps achieve these objectives.

Classic centralized learning

The classic way of learning with data from multiple users is not compliant with the privacy and security requirements mentioned above. It relies on a central model server that first gathers the data from each user (A.) and trains a model on the server (B.). Finally, when a user needs to use the model, they send a query; the server runs the model on the user's input and sends back the result (C.).

This training process (A. and B.) is repeated as often as needed.
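For contrast with what follows, here is a minimal sketch of this centralized flow in plain PyTorch (the toy model and data are illustrative, not from the original article):

import torch
import torch.nn as nn

# A. the server gathers every user's data in one place
user_data = [(torch.tensor([[1.0]]), torch.tensor([[2.0]])),
             (torch.tensor([[2.0]]), torch.tensor([[4.0]]))]
inputs = torch.cat([x for x, _ in user_data])
targets = torch.cat([y for _, y in user_data])

# B. the model is trained centrally, on the server
model = nn.Linear(1, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
for _ in range(100):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(inputs), targets)
    loss.backward()
    opt.step()

# C. the server answers a user's query with the trained model
print(model(torch.tensor([[3.0]])))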

Federated learning

In federated learning with data from multiple users, the model is first pushed to each user (A.). Each device trains the model locally, personalizing it based on the user's data (B.). Finally, the different users' updates are merged (C.) and pushed back to the server to form an update (D.) to the shared model, after which the procedure is repeated (a sketch of this merge step follows the list below).

Each user can then use their local version of the model.

There are 3 main advantages compared to classic learning:

  • The user data is never transmitted to the server.
  • The personalized model can improve the usability for the user.
  • No need to store data on the server, potentially reducing costs.
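As a minimal sketch of the merge step (C. and D.), here is federated averaging written in plain PyTorch; simple weight averaging is an assumption chosen for illustration, and PySyft automates this plumbing:

import copy
import torch
import torch.nn as nn

def train_locally(model, data, target, epochs=1):
    # B. each user trains a copy of the shared model on local data
    local = copy.deepcopy(model)
    opt = torch.optim.SGD(local.parameters(), lr=0.1)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(local(data), target)
        loss.backward()
        opt.step()
    return local.state_dict()

def merge(updates):
    # C./D. average the users' weights into a new shared model state
    avg = copy.deepcopy(updates[0])
    for key in avg:
        avg[key] = torch.stack([u[key] for u in updates]).mean(dim=0)
    return avg

server_model = nn.Linear(1, 1)
users = [(torch.tensor([[1.0]]), torch.tensor([[2.0]])),
         (torch.tensor([[2.0]]), torch.tensor([[4.0]]))]

# A. push the model to each user, B. train locally, C./D. merge the updates
updates = [train_locally(server_model, x, y) for x, y in users]
server_model.load_state_dict(merge(updates))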

The simple example below shows how to send a PyTorch tensor to a user's device using PySyft (version 0.2.x). For more complex examples, interesting tutorials are available in the PySyft repo.

import torch
import syft as sy

hook = sy.TorchHook(torch)  # hook torch with syft
user1 = sy.VirtualWorker(hook=hook, id='user1')  # create a virtual worker
a = torch.tensor([1, 2]).send(user1)  # send a tensor to the user
b = torch.tensor([3, 4]).send(user1)  # send a tensor to the user
c = a + b  # add the two tensors on the user device
res = c.get()  # get back the resulting tensor
print(res)
>>> tensor([4, 6])

Federated learning is easy to use through PySyft. But this learning paradigm still has an issue: the users of the model could potentially reverse engineer it to extract personal data. To prevent this, PySyft implements encrypted computation algorithms.

Encrypted computation

It is possible to perform arithmetic operations (addition and multiplication) on encrypted data such that:

  • x + y = decode(encode(x) + encode(y))
  • x * y = decode(encode(x) * encode(y))

This technique can be used for encrypted optimization of ML models. For instance, in the case of classic feedforward neural networks, the model can be reduced to multiplications and additions (complex functions can be approximated by higher-order polynomials), so the backpropagation algorithm can be applied to the encrypted weights of the model.
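To illustrate the approximation step, here is a small sketch (the particular Taylor polynomial is our choice for illustration, not necessarily the one PySyft uses internally) replacing the sigmoid activation with a polynomial so that only additions and multiplications remain:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_poly(x):
    # Taylor expansion of sigmoid around 0: uses only + and *,
    # so it can be evaluated on encrypted values
    return 0.5 + x / 4 - x**3 / 48 + x**5 / 480

x = np.linspace(-2, 2, 9)
print(np.max(np.abs(sigmoid(x) - sigmoid_poly(x))))  # ~0.02: accurate near 0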

SPDZ protocol

SPDZ is one of the protocols available in PySyft (SecureNN and Function Secret Sharing are also available) that achieve encrypted computation. Each value is split into multiple “shares”, one per user, each of which acts like a private key. All the shares must be brought together to decrypt the initial value.

An in-depth explanation of the protocol is available here. An example of a “simplified” SPDZ protocol for 3 users is shown below:

from random import randint

def encrypt(num):
    a = randint(-100, 100)
    b = randint(-100, 100)
    return [a, b, num - a - b]

def decrypt(shares):
    return sum(shares)

# Each user has a share of the number
user1_share, user2_share, user3_share = encrypt(4)
# You need all the users' shares to reconstruct the initial number
print(decrypt([user1_share, user2_share, user3_share]))  # 4

In short, randomness makes it possible to hide the original value. To decrypt the number, all the users must agree to send their share.
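Addition on encrypted values then works share by share: each user adds their own shares locally, and no single user ever sees the plaintext operands. Reusing the encrypt and decrypt helpers above:

x_shares = encrypt(4)
y_shares = encrypt(7)

# each user locally adds their share of x to their share of y
sum_shares = [a + b for a, b in zip(x_shares, y_shares)]

print(decrypt(sum_shares))  # 11, i.e. 4 + 7, revealed only at the end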

Here is one way of training an ML model:

  • The model and the personal data are encrypted and shared between the users. Note that even though the encrypted personal data is shared between the users, it is never decrypted.
  • The model is trained on each device. The user cannot access the weights of the model as they are encrypted.
  • The model is then aggregated and decrypted on the main trusted server.

The users’ personal data and the model thus remain encrypted at all times during the training process.

The simple example below shows how to encrypt and decrypt a PyTorch tensor using PySyft (version 0.2.x). For more complex examples, interesting tutorials are available in the PySyft repo.

import torch
import syft as sy

hook = sy.TorchHook(torch)
bob = sy.VirtualWorker(hook, id="bob")
alice = sy.VirtualWorker(hook, id="alice")
x = torch.tensor([25]).share(bob, alice)
y = torch.tensor([5]).share(bob, alice)
print("Bob share of x (==25): ", list(bob._tensors.values())[0])  # inspect the encrypted values
print("Alice share of x (==25): ", list(alice._tensors.values())[0])  # inspect the encrypted values
print("Adding both shares of x (==25): ", list(bob._tensors.values())[0] + list(alice._tensors.values())[0])  # manually decrypt x (25) by adding the two users' shares
# manually add 2 encrypted numbers
share_bob = list(bob._tensors.values())[0] + list(bob._tensors.values())[1]
share_alice = list(alice._tensors.values())[0] + list(alice._tensors.values())[1]
print("Bob share of x+y (==30): ", share_bob)
print("Alice share of x+y (==30): ", share_alice)
print("Manual reconstruction of x + y: ", share_bob + share_alice)
# add two encrypted numbers using PySyft
z = x + y
print("Automatic reconstruction of x + y: ", z.get())
>>> Bob share of x (==25): tensor([3254507251145357904])
>>> Alice share of x (==25): tensor([-3254507251145357879])
>>> Adding both shares of x (==25): tensor([25])
>>> Bob share of x+y (==30): tensor([8327705636188370707])
>>> Alice share of x+y (==30): tensor([-8327705636188370677])
>>> Manual reconstruction of x + y: tensor([30])
>>> Automatic reconstruction of x + y: tensor([30])
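Going one step further, a model itself can be secret-shared and run on secret-shared data, as described in the bullet points above. The rough sketch below assumes the PySyft 0.2.x API used in the official tutorials (fix_precision encoding followed by share, with an extra worker providing the randomness SPDZ needs for multiplication); the worker names and toy model are illustrative:

import torch
import torch.nn as nn
import syft as sy

hook = sy.TorchHook(torch)
bob = sy.VirtualWorker(hook, id="bob")
alice = sy.VirtualWorker(hook, id="alice")
crypto_provider = sy.VirtualWorker(hook, id="crypto_provider")  # supplies SPDZ randomness

model = nn.Linear(2, 1)
data = torch.tensor([[1.0, 2.0]])

# encode floats as fixed-precision integers (SPDZ works on integers),
# then split the model and the data into shares held by bob and alice
model_enc = model.fix_precision().share(bob, alice, crypto_provider=crypto_provider)
data_enc = data.fix_precision().share(bob, alice, crypto_provider=crypto_provider)

pred_enc = model_enc(data_enc)  # the forward pass runs on shares only
pred = pred_enc.get().float_precision()  # reconstruct and decode the result
print(pred)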

Why PySyft?

Firstly, PySyft provides a simple interface for secure and private deep learning using federated learning and the SPDZ (pronounced “Speedz”) protocol (see above). It is an extension of the well-known machine learning (ML) library PyTorch.

Furthermore, PySyft is one of the very few packages available that allow us to securely train models.

Finally, it is an open-source Python library that is regularly updated.

Conclusion

PySyft allows the training of complex ML models in a decentralized way while keeping the model's information private. The main drawback of such an architecture is the computational overhead: using it almost doubles the time needed to train a model compared to a standard PyTorch execution. Still, we believe that this new ML paradigm will most likely become more and more prevalent in the industry of tomorrow.
