Machine Learning and privacy

Giuliano Giacaglia
7 min read · Feb 5, 2019


Arguing that you don’t care about the right to privacy because you have nothing to hide is no different than saying you don’t care about free speech because you have nothing to say
― Edward Snowden

In 2014, Tim received a request on his Facebook app to take a personality quiz called “This Is Your Digital Life.” He was offered a small amount of money and had to answer just a few questions about his personality. Tim was excited to be paid for such a seemingly easy and harmless task. Within five minutes of receiving the request on his phone, he accepted the invitation and logged in to the app, giving the company behind the quiz access to his public profile and all of his friends’ public profiles. He completed the quiz in 10 minutes, and a UK research firm collected the data. Tim then continued with his mundane day as a law clerk at one of the biggest law firms in Pennsylvania.

What Tim did not know was that he had just shared his data, and all of his friends’ data, with Cambridge Analytica. The company used Tim’s data, along with data from 50 million other people, to target political ads based on psychographic profiles. Unlike demographic information such as age, income, and gender, psychographic profiles explain why people make purchases. Tim had passively taken part in one of the biggest political scandals to date involving the use of personal data at such a scale.

Data has become an essential part of Deep Learning algorithms¹. Large corporations now store vast amounts of data from their users because it has become central to building better models and improving their products. For Google, users’ data is essential to developing the best search algorithms. But as companies gather and keep all this data, it becomes a liability. If a person has pictures on their phone that they do not want anyone else to see, and Apple or Google collects those pictures, then employees of those companies could access and abuse the data. And even if the companies protect the data from their own employees, a privacy breach could still occur, handing hackers access to people’s private data.

Hacks that expose users’ data are common, and every year a breach seems to affect a larger number of people; one Yahoo hack compromised 3 billion accounts². All the data these companies hold becomes a burden. At other times, data is handed to researchers on the assumption that they will act in good faith, but researchers are not always careful when handling it. That was the case with the Cambridge Analytica scandal. In that instance, Facebook gave researchers access to information about users and their friends, drawn mainly from their public profiles, including names, birthdays, and interests³. This private company then used the data and sold it to political campaigns to target people with personalized ads based on their information.

Differential Privacy

Keeping sensitive data, or handing it directly to researchers to build better algorithms, is dangerous. Personal data should be private and stay that way. As far back as 2006, researchers at Microsoft were concerned about users’ data privacy and created a breakthrough technique called Differential Privacy, but the company never used it in its products. Ten years later, Apple shipped features on the iPhone that use this same method.

How Differential Privacy works

Differential Privacy is a way of obtaining statistics from a pool of data contributed by many people without revealing what any individual provided. Apple implements one of the most private variants, called the local model: noise is added to the data on the device itself, before anything is sent to Apple’s servers. That way, Apple never touches the raw data, and no one but the user can access it. Researchers can analyze trends across people’s data but can never see the details⁴. Differential Privacy lets companies gather statistics over large datasets with a mathematical guarantee that no one can learn about any single individual⁵. It does not merely try to make users’ data anonymous.

Imagine that Apple wanted to collect the average height of its users. Anna is 5 feet 6 inches, Bob is 5 feet 8 inches, and Clark is 5 feet 5 inches. Instead of collecting each height directly, Apple collects each height plus or minus a random number: 5 feet 6 inches plus 1 inch for Anna, 5 feet 8 inches plus 2 inches for Bob, and 5 feet 5 inches minus 3 inches for Clark. That works out to 5 feet 7 inches for Anna, 5 feet 10 inches for Bob, and 5 feet 2 inches for Clark. Apple then averages these heights without attaching the users’ names.

The average is the same before and after the noise is added, about 5 feet 6 inches, yet Apple never collects anyone’s actual height, so each individual’s information remains secret. (In this toy example the random numbers happen to cancel out exactly; with real random noise they only cancel on average, across many users.) That allows Apple and other companies to build smart models without collecting personal information from their users, protecting their privacy. The same technique can produce models about the images on people’s phones or any other kind of information.
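To make the mechanism concrete, below is a minimal Python sketch of the local model using Laplace noise, a standard choice for this kind of perturbation. The heights, noise scale, and function names are illustrative only, not Apple’s actual implementation:

```python
import numpy as np

def local_dp_report(value, sensitivity=1.0, epsilon=1.0):
    """Add Laplace noise to one user's value on the device itself,
    so the server never sees the true number."""
    scale = sensitivity / epsilon
    return value + np.random.laplace(loc=0.0, scale=scale)

# Heights in inches for Anna, Bob, and Clark, as in the example above.
true_heights = [66, 68, 65]

# Each device perturbs its own value before sending it to the server.
reports = [local_dp_report(h) for h in true_heights]

print("true mean:", np.mean(true_heights))   # 66.33 inches
print("estimated mean:", np.mean(reports))   # noisy estimate of the mean
```

With only three users the estimate is quite noisy; the noise cancels out reliably only across very large numbers of reports, which is why this approach fits companies collecting data from millions of devices.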

Differential Privacy, that is, keeping users’ data private, is quite different from anonymization. Anonymization does not guarantee that the information a user holds, such as a picture, will not leak, or that the individual cannot be traced back from the data. An example of anonymization is sending a pseudonym instead of a person’s name while still transmitting their height. Anonymization tends to fail. In 2007, Netflix released 10 million movie ratings from its viewers so that researchers could build a better recommendation algorithm, publishing only the ratings and removing all identifying details⁶. Researchers, however, matched this data with public data on the Internet Movie Database (IMDb)⁷: by matching patterns of ratings, they attached names back to the supposedly anonymous data. That is why Differential Privacy is essential; it is designed to prevent users’ data from being re-identified or leaked in this way.

Figure: Emoji usage across different languages

The figure above shows each emoji’s share of total emoji usage, with the data collected using Differential Privacy. The distribution of emoji usage in English-speaking countries differs from that in French-speaking countries, which might reveal underlying cultural differences that carry over into how people use language. Here, the interesting signal is how frequently each emoji is used.

Apple started using Differential Privacy to improve its predictive keyboard⁸, Spotlight, and the Photos app, and it was able to improve these products without obtaining any specific user’s data. For Apple, privacy is a core principle: Tim Cook, Apple’s CEO, has called time and again for better data privacy regulation⁹. The same data and algorithms that can enhance people’s lives can also be used as a weapon by bad actors.

Apple has been using the data it collects with Differential Privacy to improve the predictive capabilities of its keyboard, which suggests, above the keys, the next word its models expect in the text. Apple has also built models of what is inside the pictures on people’s iPhones without ever collecting the actual pictures, so users can search their own photos for specific items like ‘mountains,’ ‘chairs,’ and ‘cars.’ All of this is served by models developed using Differential Privacy. And Apple is not the only one using Differential Privacy in its products: in 2014, Google released a system for its Chrome web browser that figures out users’ preferences without invading their privacy.
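The classic building block behind this kind of collection is randomized response: each user flips a coin and, depending on the outcome, either answers truthfully or answers at random, which gives every individual plausible deniability while leaving population statistics recoverable. Google’s Chrome system elaborates on this idea; the Python sketch below shows only the core trick and is not Google’s actual implementation:

```python
import random

def randomized_response(truth: bool) -> bool:
    """Flip a coin: on heads, report the truth; on tails, report at random."""
    if random.random() < 0.5:          # heads: answer honestly
        return truth
    return random.random() < 0.5       # tails: answer at random

def estimate_true_rate(reports):
    """Invert the randomization: P(report True) = 0.5*p + 0.25,
    so the true rate is p = 2*(observed rate) - 0.5."""
    observed = sum(reports) / len(reports)
    return 2 * observed - 0.5

# Example: 10,000 users, 30% of whom have some sensitive attribute.
users = [random.random() < 0.3 for _ in range(10_000)]
reports = [randomized_response(u) for u in users]
print("estimated rate:", estimate_true_rate(reports))   # close to 0.30
```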

But Google has also been working on other technologies to produce better models while keeping users’ data private. It has been developing a technique called Federated Learning¹⁰. Instead of collecting statistics from users, Google trains an initial model in-house and deploys it to users’ computers, phones, and applications. The model is then trained further on the data each device generates or already holds. For example, if Google wants a neural network that identifies objects in pictures, and its model knows what ‘cats’ look like but not ‘dogs,’ the network is sent to a phone that contains many pictures of dogs. There it learns what dogs look like by updating its weights, and it summarizes all of those changes as a small, focused update. Only that update is sent to the cloud, where it is averaged with other users’ updates to improve the shared model. Everyone’s data advances the model, yet the data itself never leaves the device.
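A toy version of this averaging loop may help; the Python sketch below is a simplified illustration of the idea using a linear model and made-up client data, not Google’s production system:

```python
import numpy as np

def local_update(weights, X, y, lr=0.1):
    """One round of on-device training for a linear least-squares model.
    Only the resulting weight delta leaves the device, never X or y."""
    grad = 2 * X.T @ (X @ weights - y) / len(y)
    return -lr * grad                         # the small, focused update

def federated_round(weights, clients):
    """Server side: average the clients' updates into the shared model."""
    updates = [local_update(weights, X, y) for X, y in clients]
    return weights + np.mean(updates, axis=0)

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])

# Three simulated devices, each holding its own private data.
clients = []
for _ in range(3):
    X = rng.normal(size=(50, 2))
    y = X @ true_w + rng.normal(scale=0.1, size=50)
    clients.append((X, y))

weights = np.zeros(2)
for _ in range(100):                          # many rounds of federated averaging
    weights = federated_round(weights, clients)
print(weights)                                # approaches [2.0, -1.0]
```

In a real deployment each round would involve many devices and a deep network rather than two weights, but the flow is the same: train locally, send back only the update, and average the updates on the server.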

Federated Learning¹¹ thus works without the need to store user data in the cloud, but Google is not stopping there. It has developed a Secure Aggregation protocol that uses cryptographic techniques so the server can decrypt only the average update, and only after hundreds or thousands of users have participated¹². That guarantees that no individual phone’s update can be inspected before it is averaged, guarding people’s privacy. Google already uses this technique in some of its products, including the Google keyboard that predicts what the user will type, a product well-suited to the method since users type a lot of sensitive information into their phones. The technique keeps that data private.
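One way to build intuition for Secure Aggregation is additive masking: every pair of users agrees on a random mask that one adds to its update and the other subtracts, so each masked report looks random on its own while the masks cancel in the sum. The sketch below illustrates only that core idea; the real protocol¹² derives the masks cryptographically and handles users who drop out mid-round:

```python
import numpy as np

rng = np.random.default_rng(1)

# Each user's private model update (here, a tiny weight vector).
updates = {u: rng.normal(size=3) for u in ["anna", "bob", "clark"]}
users = list(updates)

# Every pair of users shares a random mask; one adds it, the other subtracts it.
pair_masks = {(a, b): rng.normal(size=3)
              for i, a in enumerate(users) for b in users[i + 1:]}

def masked_update(user):
    masked = updates[user].copy()
    for (a, b), mask in pair_masks.items():
        if user == a:
            masked += mask
        elif user == b:
            masked -= mask
    return masked

# The server sees only masked updates, each of which looks random on its own...
server_view = [masked_update(u) for u in users]
# ...yet their sum equals the sum of the true updates, so the average is exact.
print(np.allclose(sum(server_view), sum(updates.values())))   # True
```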

This field is relatively new, but it is already clear that companies do not need to hoard users’ data to create better and more refined Deep Learning algorithms. In the years to come, more hacks will happen, and user data stored to improve these models will end up in the hands of hackers and other parties. But that does not need to be the norm: privacy does not have to be traded away for better Machine Learning models. Both can coexist.

  1. https://www.technologyreview.com/s/602046/apples-new-privacy-technology-may-pressure-competitors-to-better-protect-our-data/
  2. https://money.cnn.com/2017/10/03/technology/business/yahoo-breach-3-billion-accounts/index.html
  3. https://www.businessinsider.com/what-data-did-cambridge-analytica-have-access-to-from-facebook-2018-3
  4. https://www.technologyreview.com/s/602046/apples-new-privacy-technology-may-pressure-competitors-to-better-protect-our-data/
  5. https://www.wired.com/2016/06/apples-differential-privacy-collecting-data/
  6. https://www.reddit.com/r/privacy/comments/4nynci/apples_differential_privacy_is_about_collecting/
  7. https://www.wired.com/politics/security/commentary/securitymatters/2007/12/securitymatters_1213
  8. https://www.technologyreview.com/s/601688/apple-rolls-out-privacy-sensitive-artificial-intelligence/
  9. https://www.theverge.com/2018/10/24/18017842/tim-cook-data-privacy-laws-us-speech-brussels
  10. https://www.theverge.com/2017/4/10/15241492/google-ai-user-data-federated-learning
  11. https://research.googleblog.com/2017/04/federated-learning-collaborative.html
  12. http://eprint.iacr.org/2017/281
