Various Approaches towards Data Privacy with Python: Differential Privacy and Homomorphic Encryption

Adriena Wong
SFU Professional Computer Science
9 min read · Feb 4, 2020

Matt Canute, Young-Min Kim, Donggu Lee, Suraj Swaroop, Adriena Wong

This blog is written and maintained by students in the Professional Master’s Program in the School of Computing Science at Simon Fraser University as part of their course credit. To learn more about this unique program, please visit here.

What is Data Privacy? Why is this important for data scientists or data engineers?

First, it’s important to note the difference between data privacy and data security, because the two terms are often used interchangeably. While data security refers to the overall process of protecting data from unauthorized access, data privacy is a branch of data security that deals with the collection, distribution, and use of data. Consider a company that encrypts user data and protects it from external attackers: it has preserved data security, but it may still be violating data privacy if the data was collected without proper consent. On the other hand, a company that respects data privacy but does not secure user data reliably will end up violating both the moment it is successfully attacked. You can have data protection without data privacy, but not data privacy without data protection.

Table 1: Features of Data Security versus Data Privacy

Regarding data privacy, data scientists and data engineers often have a role in designing the data collection process for various applications, so that the right information is stored for effective decision analysis and model-building.

However, one trend to be wary of is that the design of today’s data-intensive applications seems to cause a chilling effect in the way we interact with them. Users of a tool may not feel free to search or click whatever they want, knowing that their information is logged forever in a database somewhere. For instance, user behaviour may have been affected by Facebook’s collection of user data from Android devices and/or the Snowden leaks. This is especially the case now that data security systems appear more fragile, given the frequent news reports of hacked databases at large companies, such as the recent LifeLabs breach, so it feels inevitable that our search histories won’t stay private for much longer.

Due to these concerns, we’re seeing various levels of policy implementations to address them, as many different countries attempt to deal with their interpretation of data privacy, data rights, and data security, such as the General Data Protection Regulation (GDPR) in the European Union (EU). Here are some other examples of data regulation laws:

  1. Australia : The Notifiable Data Breach (NDB)
  2. Brazil : Lei Geral de Protecao de Dados (LGPD)
  3. Canada : The Personal Information Protection and Electronic Documents Act (PIPEDA)

So, we have two issues:

  1. The behavioural data these tools collect is probably becoming less useful: users who know their actions are being logged on a server somewhere, one that could eventually be hacked and leaked, may change how they behave, so the overall patterns the tools record end up skewed in a misleading way.
  2. The growing number of regulations mentioned above makes it difficult to start building an application without knowing all the protocols to follow. For example, under GDPR a citizen has the right to be forgotten from a database or a model, but this becomes tricky when the model is a black-box stack of neural networks and it’s difficult to know whether their personal data has accidentally been ‘memorized’ by the model for making recommendations.

It turns out that these can both be addressed to an extent using the technique of differential privacy.

Differential Privacy

Differential privacy is a method for sharing information about a population in a statistical database or dataset while withholding private information about individuals, by adding noise through certain mechanisms. The reason it is called differential privacy is that it allows you to mask your dataset against a ‘differential attack,’ which involves combining data from multiple sources to derive information that wasn’t included in the original dataset.

Addressing issue 1: Differential privacy’s technique of adding noise to improve data collection has roots in a survey method used since the 1960s to eliminate evasive-answer bias, which arises when respondents face sensitive questions and have an incentive to lie under ordinary circumstances. For example, consider conducting a survey to determine the true proportion of data scientists who have felt pressure to “massage the data” in order to reach a certain conclusion. Instead of asking the question outright, you’d ask each respondent to flip a coin twice. If the first flip landed on heads, they answer the question truthfully. If it landed on tails, they use the second flip to determine what to say: heads for yes, tails for no. As a result, people are more relaxed, and hence more truthful, because they can rest assured that if the survey data is ever leaked they have plausible deniability: they may have answered ‘yes’ purely because they flipped a tails and then a heads. Since half the respondents answer truthfully and the other half say ‘yes’ with probability one half, the expected ‘yes’ rate is 0.5p + 0.25, which can be inverted to recover the true proportion p of people who have “massaged the data”, given a large enough sample.
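Here’s a minimal simulation of that randomized-response survey in Python; the 30% “true” proportion and the sample size are made-up values for illustration:

```python
import random

def randomized_response(true_answer: bool) -> bool:
    """Warner-style randomized response using two fair coin flips."""
    if random.random() < 0.5:        # first flip: heads -> answer truthfully
        return true_answer
    return random.random() < 0.5     # tails -> second flip decides: heads = yes, tails = no

def estimate_true_proportion(responses):
    """Invert E[yes] = 0.5 * p + 0.25 to recover the underlying proportion p."""
    observed_yes_rate = sum(responses) / len(responses)
    return (observed_yes_rate - 0.25) / 0.5

# Simulate 100,000 respondents, 30% of whom would truthfully answer "yes".
population = [random.random() < 0.30 for _ in range(100_000)]
responses = [randomized_response(answer) for answer in population]
print(f"Estimated true proportion: {estimate_true_proportion(responses):.3f}")  # close to 0.30
```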

Addressing issue 2: Incorporating differential privacy mechanisms into machine learning algorithms can help prevent personal information from being accidentally baked into a model. For example, even if you remove the social insurance number and credit card number columns from a training set, credit card numbers could still leak into the model through a text feature’s bag-of-words representation. Using differential privacy reduces the chance that a model will overfit to individual entries in a dataset. In 2018, a team improved a model that had previously been leaking social security numbers, showing that under the differentially private version no such leakage occurred while accuracy remained nearly as good [1].

Experimenting with Differential Privacy applications in Python:

Here’s how this works.

The key to achieving differential privacy is to add random noise to the ground truth. There is always a fundamental trade-off between privacy and accuracy in a differential privacy algorithm. For example, a dataset consisting purely of random noise will obviously maintain complete privacy, but it won’t be useful. Epsilon (ε) is the metric of privacy loss under a differential change, such as adding or removing a single datapoint. A smaller ε gives stronger privacy protection but lower accuracy, where accuracy is the degree of similarity between the output of the differential privacy algorithm and the true, un-noised output.
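To get a feel for that trade-off, here is a small NumPy sketch (the counting query, the true count of 1,000, and the ε values are arbitrary choices for illustration) that adds Laplace noise, the mechanism described below, at several privacy levels and measures the typical error:

```python
import numpy as np

rng = np.random.default_rng(0)
true_count = 1000      # ground-truth answer to a counting query, e.g. "how many users clicked?"
sensitivity = 1        # adding or removing one person changes the count by at most 1

for epsilon in (0.1, 1.0, 10.0):
    # Laplace noise scaled to sensitivity / epsilon; smaller epsilon means more noise.
    noisy_answers = true_count + rng.laplace(scale=sensitivity / epsilon, size=10_000)
    typical_error = np.mean(np.abs(noisy_answers - true_count))
    print(f"epsilon = {epsilon:>4}: typical error ≈ {typical_error:.2f}")
```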

There are two primary noise mechanisms in differential privacy, namely the Laplace mechanism and the exponential mechanism:

  1. Laplace Mechanism:

The Laplace mechanism is used when the output is numerical. It works with a function f (the query) that maps a database to k real numbers, and perturbs each of those numbers with noise drawn from the Laplace distribution Lap(x|b), a symmetric version of the exponential distribution, where b is a scale parameter. More specifically, the noise is scaled to the sensitivity of f divided by ε, denoted ∆f/ε. The mechanism, denoted ML(), is defined by

ML(x, f, ε) = f(x) + (Y₁, …, Yₖ)

where the Yi are independently and identically distributed random variables drawn from Lap(∆f/ε) [2].
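A minimal sketch of this mechanism in Python, assuming a histogram query whose L1 sensitivity is 1 (adding or removing one record changes exactly one bin by one):

```python
import numpy as np

def laplace_mechanism(x, f, sensitivity, epsilon, rng=None):
    """M_L(x, f, eps) = f(x) + (Y_1, ..., Y_k), with each Y_i ~ Lap(sensitivity / eps)."""
    rng = rng or np.random.default_rng()
    true_answer = np.asarray(f(x), dtype=float)
    noise = rng.laplace(scale=sensitivity / epsilon, size=true_answer.shape)
    return true_answer + noise

# Example: a histogram query over ages. Adding or removing one person changes
# exactly one bin by one, so the L1 sensitivity of f is 1.
ages = np.array([23, 25, 31, 37, 41, 44, 52, 58, 63, 70])
f = lambda db: np.histogram(db, bins=[0, 30, 50, 100])[0]
print(laplace_mechanism(ages, f, sensitivity=1, epsilon=0.5))
```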

  2. Exponential Mechanism:

The exponential mechanism is another way to achieve differential privacy when the output is categorical. Rather than simply returning the response with the maximum utility score, it selects a response from an abstract, non-numeric range R at random, favouring responses with higher utility. This is based on a utility function u() that maps database–output pairs (x, r), with r ∈ R, to a utility score; more specifically, u(x, r) represents how good the output r is for database x. The exponential mechanism outputs each r ∈ R with probability proportional to exp(εu(x, r)/(2∆u)), where ∆u is the sensitivity of u with respect to the database, defined by

∆u = max over r ∈ R of max over neighbouring x, y of |u(x, r) − u(y, r)|

where x and y are databases differing in at most one entry [2].
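A sketch of the exponential mechanism in Python, assuming a toy database of purchases and a counting-based utility function whose sensitivity is 1:

```python
import numpy as np

def exponential_mechanism(x, candidates, utility, sensitivity, epsilon, rng=None):
    """Select one candidate r with probability proportional to exp(eps * u(x, r) / (2 * sensitivity))."""
    rng = rng or np.random.default_rng()
    scores = np.array([utility(x, r) for r in candidates], dtype=float)
    logits = epsilon * scores / (2.0 * sensitivity)
    probs = np.exp(logits - logits.max())   # subtract the max for numerical stability
    probs /= probs.sum()
    return rng.choice(candidates, p=probs)

# Example: privately report the most popular item in a toy database of purchases.
database = ["apple"] * 60 + ["banana"] * 30 + ["cherry"] * 10

def count_utility(db, item):
    # Adding or removing one record changes any count by at most 1, so sensitivity = 1.
    return db.count(item)

print(exponential_mechanism(database, ["apple", "banana", "cherry"],
                            count_utility, sensitivity=1, epsilon=0.1))
```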

Thus, the privacy loss incurred by observing an output ξ from a mechanism M run on two neighbouring databases x and y can be represented by [2]:

L(ξ) = ln( Pr[M(x) = ξ] / Pr[M(y) = ξ] )

and M is ε-differentially private when this loss stays within ±ε for every output ξ and every pair of neighbouring databases.

Some examples of applications that you probably use today that use differential privacy:

Apple’s iOS 10 and later use a differential privacy implementation to help protect the privacy of users’ activity while gaining insights that improve the usability of features such as emoji suggestions and lookup hints.

Similar to Apple’s technique, Google uses a differential privacy tool called Randomized Aggregatable Privacy-Preserving Ordinal Response (RAPPOR) in Google Chrome to prevent sensitive information, such as a user’s browsing history, from being traced. As Chrome is one of the most widely used browsers, it is susceptible to attacks from malicious software. Google collects usage statistics from users who have agreed to share their data and uses RAPPOR to draw overall insights while maintaining each individual user’s privacy.

Coming soon: A Current Development in Data Security: Homomorphic Encryption

Today’s encryption algorithms are very reliable because the amount of processing power required to crack them makes attacks both highly costly and time-consuming. One popular method is the RSA (Rivest–Shamir–Adleman) cryptosystem, which uses a public encryption key and a private decryption key and relies on the difficulty of factoring the product of two very large prime numbers. Unfortunately, in order to perform any data processing or analysis, the data normally has to be decrypted first, which may jeopardize private information such as sensitive financial or medical records. One solution is fully homomorphic encryption, an encryption method whose algebraic structure allows encrypted data to be manipulated and processed without decrypting it first. The key advantage is that we can still effectively query encrypted databases for analysis: when the encrypted result is decrypted, it matches the result of performing the same computation on the unencrypted data. This is extremely useful because it allows a third party to perform computations directly on encrypted data without ever seeing its contents, keeping the data private.
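As a minimal illustration of the idea, here is textbook RSA with tiny, deliberately insecure parameters; note that RSA is only partially homomorphic, supporting multiplication of ciphertexts rather than arbitrary computation:

```python
# A toy illustration (NOT secure: tiny primes, textbook RSA) of the homomorphic idea.
# RSA is multiplicatively homomorphic: E(m1) * E(m2) mod n decrypts to m1 * m2 mod n,
# so a third party can multiply values it cannot read.

p, q = 61, 53                  # small primes, for illustration only
n = p * q                      # public modulus (3233)
phi = (p - 1) * (q - 1)        # 3120
e = 17                         # public exponent, coprime with phi
d = pow(e, -1, phi)            # private exponent (modular inverse, Python 3.8+)

def encrypt(m: int) -> int:
    return pow(m, e, n)

def decrypt(c: int) -> int:
    return pow(c, d, n)

m1, m2 = 7, 11
c_product = (encrypt(m1) * encrypt(m2)) % n      # computed on ciphertexts only
assert decrypt(c_product) == (m1 * m2) % n       # decrypts to the product of the plaintexts
print(decrypt(c_product))                        # 77
```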

Figure 1: The general flow of data when using homomorphic encryption. In the end, decrypting the encrypted answer should reveal the same result as computing directly on the unencrypted data.
Table 2: The different types of homomorphic encryption (HE)

Unfortunately, fully homomorphic encryption is still being developed and is currently impractically slow due to its large computational overhead. It also supports only addition and multiplication; other operations, such as sorting or regular-expression matching, are too complex for current implementations.

Although the first fully homomorphic encryption scheme was created in 2009 by IBM researcher Craig Gentry, it wasn’t until 2016 that IBM released the first version of HElib, an open-source homomorphic encryption library in C++, which unfortunately performed significantly slower than plaintext operations.

In 2017, Microsoft Research released Microsoft SEAL (Simple Encrypted Arithmetic Library), a C++ library for storing and computing on encrypted data using homomorphic encryption. In an effort to make the library more practical for data scientists and other engineers, Lab41 and B.Next created PySEAL, a port of Microsoft SEAL that can be easily imported into Python projects. This link shows an example of how PySEAL works.

In 2019, Google also rolled out an open-source Private Join and Compute functionality that uses homomorphic encryption to allow two parties to privately perform certain aggregate calculations on their joined datasets without learning the contents of each other’s inputs.

Conclusion

As data privacy becomes an increasingly prominent topic in the news and within the field of data science, companies will need to consider safeguards that keep their clients’ personal data safe in the event of a data breach. We have introduced two methods of protecting private data while still maintaining the ability to draw analyses from it: differential privacy and homomorphic encryption.

References

[1] Rachel Cummings, Deven Desai. The Role of Differential Privacy in GDPR Compliance. https://piret.gitlab.io/fatrec2018/program/fatrec2018-cummings.pdf, 2018.

[2] Cynthia Dwork, Aaron Roth. The Algorithmic Foundations of Differential Privacy. Now Publishers Inc, 2014.
