Data Privacy and Machine Learning — The clash of Titans

Podcast series on Machine Learning and Data Privacy

Episode 1

Hi friends,

Welcome to the new podcast series on Data Privacy for Machine Learning, powered by the MLOps community and sponsored by YData. YData helps adopters of AI to improve and generate high-quality data so they can become tomorrow’s industry leaders. My name is Fabiana Clemente and I’ll be your host throughout the series. I promise you great guests who will help me explain some of the hottest topics around Data Privacy and AI, such as Differential Privacy, Synthetic Data, and what regulations say about data privacy for machine learning.

And if you’re not yet a member of our amazing MLOps community, I have no idea what you’ve been doing! Join our Slack and follow us for some amazing content and webinars with spectacular speakers every Wednesday, on topics that will step up your ML game.

Let me also introduce myself: I am a Co-Founder at YData, where I lead the data science team. We provide the first privacy-by-design DataOps platform for Data Scientists to work with synthetic and high-quality data. As a data scientist myself, currently working on data synthesis, I’ve experienced many challenges while working in this field, from poor-quality data due to collection issues to blocked access due to privacy constraints. And I bet some of you have experienced, or are experiencing, them too!

Let’s jump into today’s episode: The clash of Titans, the impact of Machine Learning on data privacy.

It’s not surprising to see both of these concepts in the same sentence. In fact, for most of us who are used to working with data, this is pretty common.

But first things first: how can we define data privacy?

In a nutshell, data privacy is the right of individuals to have control over how their personal information is collected and used. Due to the increasing amount of collected data and the growing sophistication of the systems that process it, many privacy laws have been approved in recent years. The European GDPR is probably one of the most restrictive, and it changed the relationship between citizens, their data, and companies.

I’m not going to bore you by going into too much detail about privacy regulations. I’ll just list the most relevant or famous ones: the GDPR, or General Data Protection Regulation, applicable in Europe since May 25th, 2018; HIPAA, the Health Insurance Portability and Accountability Act, signed by President Bill Clinton in 1996; and the newer CCPA, also in the USA, in effect since the beginning of 2020. Brazil has just launched its LGPD, which is similar to the GDPR, and some other countries are currently working on their own versions of a privacy regulation. Fun fact about the European GDPR: there’s a website, enforcementtracker.com, that keeps track of the fines applied, with British Airways leading the ranking with a fine of around 200 million euros announced in 2019.

OK, I understand data privacy, but what does it have to do with Machine Learning?

Sophistication was the key to the birth of this beautiful relationship! Nowadays, we can count on systems that allow us not only to store huge volumes of data but also to analyze it and extract information from it. Machine Learning has been playing a massive role, leading to innovation and development in many business areas.

Right, but I have all the sensitive information masked already, so it’s impossible to have data leaks, isn’t it?

I don’t know about you guys, but I have heard this one over and over again, and that’s alarming! At least for me, as I’m a consumer as well, and hearing that my data is probably not that “private” makes me think twice before giving my consent to have my data stored and used.

No, I’m not talking about the Cambridge Analytica scandal, although if you haven’t heard about it, where the hell have you been lately!? Moving on … I’m talking about incidents such as the Netflix prize or the Strava military data leakage.

Yep … that’s right! You, who have just been doing Netflix and chill and using Strava to show your friends how much your running has improved since outdoor exercise was allowed during the pandemic, are probably eroding your privacy!!

Just kidding, we don’t need to be so drastic! Let me explain better what I meant, and the link with Machine Learning. And please don’t stop using those apps because of these incidents; since then, the brands have taken precautions and reinforced their security and data privacy policies.

The Netflix Prize was a particularly popular contest launched by Netflix, with the aim of challenging Data Scientists to improve the accuracy of its recommendation system. Well, the prize was quite high and juicy, which attracted a lot of attention, even from lawsuits :D! It turned out that even with no personal information disclosed, and with as little as the movie ratings, the rating dates, and a unique ID number for each subscriber, it was possible to re-identify a lot of the users in the database. And how was this possible? Simply by leveraging Machine Learning and an open external database, which, in this case, was the Internet Movie Database, better known as IMDb.
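To make the idea of re-identification by linkage more concrete, here is a minimal sketch. It is not the method used in the actual incident, just a plain matching rule over made-up data: all records, usernames, thresholds, and the `link_records` helper are hypothetical. It shows how ratings and dates alone can tie an “anonymous” subscriber ID to a public profile.

```python
from datetime import date

# Hypothetical "anonymized" ratings: (subscriber_id, movie, stars, rating_date)
anonymized_ratings = [
    (101, "Movie A", 5, date(2005, 3, 1)),
    (101, "Movie B", 2, date(2005, 3, 4)),
    (101, "Movie C", 4, date(2005, 4, 10)),
    (202, "Movie A", 3, date(2005, 3, 2)),
    (202, "Movie D", 5, date(2005, 5, 20)),
    (202, "Movie E", 1, date(2005, 6, 1)),
]

# Hypothetical public reviews: (username, movie, stars, review_date)
public_reviews = [
    ("alice_on_the_web", "Movie A", 5, date(2005, 3, 2)),
    ("alice_on_the_web", "Movie B", 2, date(2005, 3, 5)),
    ("alice_on_the_web", "Movie C", 4, date(2005, 4, 11)),
    ("bob_reviews", "Movie D", 4, date(2005, 7, 30)),
]


def link_records(anonymized, public, max_days=3, max_star_gap=0, min_matches=3):
    """Return {subscriber_id: username} for pairs whose ratings line up."""
    matches = {}  # (subscriber_id, username) -> number of agreeing ratings
    for sub_id, movie, stars, rated_on in anonymized:
        for user, pub_movie, pub_stars, reviewed_on in public:
            same_movie = movie == pub_movie
            close_stars = abs(stars - pub_stars) <= max_star_gap
            close_dates = abs((rated_on - reviewed_on).days) <= max_days
            if same_movie and close_stars and close_dates:
                key = (sub_id, user)
                matches[key] = matches.get(key, 0) + 1
    # Keep only pairs that agree on enough ratings to be a plausible match.
    return {sub: user for (sub, user), n in matches.items() if n >= min_matches}


reidentified = link_records(anonymized_ratings, public_reviews)
print(reidentified)
```

Notice that no name, email, or address appears in the “anonymized” table. A handful of ratings with approximate dates is enough to act as a fingerprint.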

The Strava PR nightmare is a bit more recent. To be honest, I’ve never been in the habit of using an app like this to track my performance while exercising, but I know that many of my friends do! Surprisingly enough, these apps can tell a lot about us and our surroundings: things like where we live or where we usually go for outdoor activities. They can even be used to map a lot of areas without the need for a satellite or a car.

In 2017, Strava released a searchable heatmap of publicly logged activities. It’s not surprising that, not long after its release, researchers were able to reveal the location of sensitive sites such as US military bases in countries such as Afghanistan and Syria, as well as the exercise routines of their occupants! The moral of the story: location data, even without any other sensitive information, combined with Machine Learning can expose a lot about us, from where we live to our points of interest, routines, and habits!
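Here is a minimal sketch of why an aggregated heatmap can still give places away. It uses plain counting rather than Machine Learning, and the coordinates, the cell size, and the `build_heatmap` helper are all made up for illustration. The point is that repeated activity in one small area stands out as soon as the points are binned into a grid.

```python
import math
from collections import Counter

# Hypothetical GPS samples (latitude, longitude) from publicly logged runs.
gps_points = [
    (10.5001, 20.2002), (10.5003, 20.2004), (10.5002, 20.2001),
    (10.5004, 20.2003), (10.5002, 20.2005),  # repeated laps in one spot
    (10.6105, 20.3505),                      # a single stray activity
]


def build_heatmap(points, cell_size=0.001):
    """Count how many points fall into each grid cell of `cell_size` degrees."""
    heatmap = Counter()
    for lat, lon in points:
        cell = (math.floor(lat / cell_size), math.floor(lon / cell_size))
        heatmap[cell] += 1
    return heatmap


heatmap = build_heatmap(gps_points)

# The busiest cell is the place people keep coming back to.
hotspot, visits = heatmap.most_common(1)[0]
print(hotspot, visits)
```

No individual track is shown to anyone here, and yet the busiest cell points straight at the place where people keep exercising.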

In a nutshell, ML & AI are able to compute scenarios that combine hundreds of input dimensions/features in extremely complex ways, which can compromise privacy indirectly and in many unexpected ways. There are other amazing examples that I could mention, just to prove to you that Machine Learning and Big Data have definitely come to shake up the concept of data privacy and protection that we were so used to!

In fact, you might be surprised by how AI and ML can be leveraged to attack data and systems.

Adaptive malware

Just like security features such as authentication have become ‘adaptive’ via the use of AI & ML, so have the cyber attacks. Malware that has machine learning capabilities at its disposal can ‘learn’ what might and might not work in an environment and morph itself to get past defenses and infiltrate systems.

Machine Learning can break crypto

Much of information security today is built on a foundation of cryptography. Many cryptographic primitives are based on algorithms designed by cryptographers with the goal of achieving ‘good enough’ security, not perfect security. In theory, cryptosystems rely on randomness; in practice, however, this is not entirely true, and nowadays we have ML models backing attacks on cryptosystems, speeding up brute-force attempts.

Right, but I have ensured that my data is fully secure. Isn’t that enough?

This is another one that is commonly heard across different organizations. It is very common to see the concepts of security and privacy combined, but are they synonyms or do they stand for slightly different concepts? Although they have things in common, they are indeed different. Let me elaborate:

For example, privacy is sometimes confused with security. A data leak is a security breach, and it becomes a privacy breach if sensitive information is involved.

  1. Data privacy is related to the proper handling of data: how it is collected, how it is used, and how it is kept in compliance.
  2. Data security, on the other hand, is all about protecting data from unauthorized access through different methods, such as encryption, key management, or authentication.

This means that even if all the data from your production environments is secure, you haven’t necessarily ensured its privacy, especially when you’re looking to leverage Machine Learning techniques to extract new information from your data!

Does this mean that Machine Learning and AI are incompatible with data protection? Can regulations dictate the end of AI?

Well, that’s too drastic and, I dare say, not true! It’s true that Machine Learning amplifies Data Privacy issues, but it is possible to overcome some of them and keep benefiting from the wonders of Artificial Intelligence. The solutions that make this possible have a name: Privacy Enhancing Technologies! But are they a myth? Do they promise more than they deliver? Is there any solution that solves it all? How can I benefit from them, and when should I apply one or the other?

Stay tuned, as all of these questions will be answered in our next episode — Are privacy-enhancing technologies a myth for Machine Learning?

But because I know you’re really anxious to know the answer, here is a spoiler: with privacy-enhancing technologies, you CAN leverage machine learning without compromising data privacy. In the next episode, I’m going to dig into each one of the most common techniques, so it is an episode you can’t miss!

We’re reaching the end of today’s episode, but before saying goodbye, I bring you an extra.

Personally, I like to keep up with the most amazing projects that are being developed, which led me to create this section of our videocast: The open mic.

Today I bring you the TensorFlow Privacy package, which fits perfectly with the topic we’ve discussed today.

TensorFlow Privacy

Thinking about training on private data when your model needs to be deployed in the public domain? Worried about the possibility that your model might memorize some of the specifics of the training data, so that you end up publicizing things that were supposed to be private? Modern machine learning is increasingly applied to create new technologies and user experiences, many of which involve training machines to learn responsibly from sensitive data, such as personal photos, email, or text conversations. Ideally, the parameters of trained machine-learning models should encode general patterns rather than memorize specific training examples. Nevertheless, there’s always the risk of exposing unwanted information.

That’s when tools such as TensorFlow Privacy kick in. TensorFlow Privacy lets you design and build your Deep Learning models while protecting privacy at the level of the model weights, as it provides implementations of differentially private stochastic gradient descent (DP-SGD) optimizers. Learning with differential privacy provides provable guarantees of privacy, mitigating the risk of exposing sensitive training data in machine learning.
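To give a feel for what a differentially private gradient step does under the hood, here is a minimal sketch in plain Python. It is not TensorFlow Privacy’s API: the `dp_sgd_step` function is hypothetical, and its parameter names and default values are assumptions chosen for illustration. It only shows the two core ideas of DP-SGD: clip each example’s gradient, then add calibrated Gaussian noise before updating the weights.

```python
import math
import random


def dp_sgd_step(weights, per_example_grads, l2_norm_clip=1.0,
                noise_multiplier=1.1, learning_rate=0.1, rng=None):
    """One differentially private SGD step on a flat list of weights."""
    rng = rng or random.Random()
    summed = [0.0] * len(weights)
    for grad in per_example_grads:
        # 1. Clip each example's gradient so no single record dominates.
        norm = math.sqrt(sum(g * g for g in grad))
        scale = min(1.0, l2_norm_clip / norm) if norm > 0 else 1.0
        for i, g in enumerate(grad):
            summed[i] += g * scale
    # 2. Add Gaussian noise calibrated to the clipping norm.
    stddev = noise_multiplier * l2_norm_clip
    noisy = [s + rng.gauss(0.0, stddev) for s in summed]
    # 3. Average over the batch and take an ordinary gradient step.
    batch_size = len(per_example_grads)
    return [w - learning_rate * n / batch_size for w, n in zip(weights, noisy)]


# Hypothetical per-example gradients for a model with two weights.
weights = [0.0, 0.0]
grads = [[3.0, 4.0], [0.3, 0.4]]
new_weights = dp_sgd_step(weights, grads, rng=random.Random(0))
print(new_weights)
```

The first gradient has norm 5 and gets scaled down to norm 1, while the second is left untouched. Together with the noise, that clipping is what limits how much any single training example can influence, and therefore leak through, the final weights.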

In case you’re wondering how you can use TF Privacy, I’ve linked a full tutorial on how to build a Machine Learning model in the description box.

Concluding remarks and next session

That’s it for today! Now that we are aware of the challenges that AI has brought to Data Privacy, in the next episodes we will be covering topics from the impact of privacy regulations on AI to the benefits of Privacy Enhancing Tech and how companies are leveraging them today! I hope you enjoyed it, and don’t forget to subscribe to the podcast and to join our community Slack!
