What is Privacy?

Solomon Rousseau · The Privacy Point · May 29, 2017

This is a series of articles that examines what privacy is and how state-of-the-art computer science research defines it. To begin, we first examine what is not private.

What is NOT privacy-preserving?

We begin by describing a data release mechanism that might at first glance appear private, yet is easy to defeat.

Let’s say we have a database of n individuals. Each row corresponds to one individual and holds a binary value, Yes or No, indicating whether that individual has the attribute in question. For example, suppose the question is whether or not the individual has a particular drug habit.

Suppose an analyst is able to interact with the database by issuing a query that returns an aggregate result. The analyst is also able to ask a polynomial number of non-adaptive queries. For example, the analyst might ask a query over rows 1 to (n-3), then 1 to (n-2), 1 to (n-1), and 1 to n.

If each individual responds truthfully and the truthful answer is stored in the database, an analyst can easily deduce the value of any individual: to learn the value of individual n, ask two queries, one over rows 1 to n and one over rows 1 to (n-1), and subtract the two results. The same trick recovers any individual i from queries over 1 to i and 1 to (i-1).
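
To make the attack concrete, here is a minimal sketch in Python. The database contents and the count_query helper are hypothetical, invented purely for illustration.

    import random

    # Hypothetical database: one binary attribute per individual
    # (1 = Yes, has the drug habit; 0 = No).
    n = 100
    db = [random.randint(0, 1) for _ in range(n)]

    def count_query(i, j):
        """Aggregate query: how many of rows i..j (1-indexed) answered Yes?"""
        return sum(db[i - 1:j])

    # Differencing attack on individual n: ask 1..n and 1..(n-1), subtract.
    print(count_query(1, n) - count_query(1, n - 1))  # exactly db[n-1]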

Privacy Takeaways

We can see that the following must be a part of any private data release.

  1. Randomization of the aggregate output, to protect against simply asking the difference of two queries (see the sketch after this list).
  2. An acceptable error threshold for the aggregate output, since the randomization adds some “noise”.
  3. Anonymizing the data or rows is not sufficient against a determined attacker. It is still possible to query 1 to n and 1 to (n-1), so an individual whose row is known can be singled out by differencing.
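
As a rough sketch of takeaway 1, here is what a randomized aggregate might look like, using Laplace noise in the style of the differential privacy literature. The noise scale and helper names are illustrative assumptions, not a calibrated mechanism.

    import numpy as np

    rng = np.random.default_rng()
    n = 100
    db = rng.integers(0, 2, size=n)  # hypothetical Yes/No attributes

    def noisy_count(i, j, scale=1.0):
        """True count over rows i..j (1-indexed) plus Laplace noise.
        A larger scale means more noise: less accuracy, more privacy."""
        return int(db[i - 1:j].sum()) + rng.laplace(0.0, scale)

    # The differencing attack now yields individual n's bit plus the
    # sum of two independent noise draws, not the exact value.
    print(noisy_count(1, n) - noisy_count(1, n - 1))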

That is, returning exact answers to queries allows an adversary to mount privacy attacks, so perfect privacy with exact answers is impossible. Thus, there must be some trade-off between privacy and accuracy.

In the literature, the approach of anonymizing rows is formalized as k-anonymity: every record in the release must be indistinguishable from at least k-1 other records on its quasi-identifying attributes. While k-anonymity showed initial promise, it quickly ran into scaling issues. As the number of dimensions (attributes) grows, achieving k-anonymity requires so much generalization that it becomes difficult to protect privacy without destroying the data’s utility.
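
For a sense of what the definition requires, here is a small sketch that checks k-anonymity over a chosen set of quasi-identifier columns. The records, column names, and generalizations are made up for this example.

    from collections import Counter

    def is_k_anonymous(rows, quasi_identifiers, k):
        """True if every combination of quasi-identifier values
        appears in at least k rows of the table."""
        groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
        return all(count >= k for count in groups.values())

    # Hypothetical generalized records: ZIP prefix and age range.
    rows = [
        {"zip": "021*", "age": "20-29", "habit": "Yes"},
        {"zip": "021*", "age": "20-29", "habit": "No"},
        {"zip": "150*", "age": "30-39", "habit": "No"},
        {"zip": "150*", "age": "30-39", "habit": "Yes"},
    ]
    print(is_k_anonymous(rows, ["zip", "age"], k=2))  # True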

Perhaps the most serious and well-known privacy breach related to k-anonymity led to the cancellation of the Netflix Prize sequel. The Netflix Prize awarded one million dollars to the team with the best machine learning algorithm for recommending movies to watch. To run the contest, Netflix released training data in which subscribers’ names were replaced with random identifiers, which unfortunately did not prevent linkability. Using publicly available ratings on IMDb, researchers were able to re-identify 99% of the Netflix records given as few as 8 movie ratings. Thus, a stronger definition than k-anonymity was required.
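
The linkage idea itself is simple enough to sketch: match an anonymized record to the public profile sharing the most (movie, rating) pairs. Everything below, the data, names, and overlap threshold, is fabricated for illustration; the actual attack also exploited rating dates and tolerated mismatches.

    def best_match(anon_ratings, public_profiles, min_overlap=8):
        """Link an anonymized set of (movie, rating) pairs to the
        public profile with the largest overlap, if distinctive enough."""
        best_name, best_score = None, 0
        for name, ratings in public_profiles.items():
            score = len(anon_ratings & ratings)
            if score > best_score:
                best_name, best_score = name, score
        return best_name if best_score >= min_overlap else None

    # Hypothetical data: "alice" rated the same movies the same way.
    anon = {(f"Movie {i}", 4) for i in range(10)}
    public_profiles = {
        "alice": {(f"Movie {i}", 4) for i in range(10)},
        "bob": {(f"Movie {i}", 2) for i in range(10)},
    }
    print(best_match(anon, public_profiles))  # alice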

In the next article, we will examine a more formal definition of privacy.
