Privacy is dead, but not because Scott McNealy said so.

tl;dr The idea of privacy has driven much of our concerns over data for the last few years, and has been the driver of extremely successful research efforts, most notably in differential privacy.

What we’re seeing now though is a pivot away from the undifferentiated and problematic notion of privacy in data towards a more subtle and nuanced notion of how data is actually used, and how it should be used.

Privacy is dead. Get over it. — Scott McNealy
The ambitious genetics startup 23andMe has received a first payment of $10 million from biotech company Genentech in exchange for genetic data donated by customers of 23andMe’s personal genome service. — The Verge

Once we started collecting data on people to monetize it, privacy (especially online) became a major concern: the wave of deanonymization attacks that started with Latanya Sweeney’s pioneering work illustrated the power that we could wield with data, and the dangers of it.
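To make the mechanics of such deanonymization concrete, here is a minimal sketch (with entirely made-up data, not Sweeney's actual records) of a linkage attack: an "anonymized" medical table still carries quasi-identifiers (ZIP code, birth date, sex) that a public voter roll also contains, and a simple join on that triple re-identifies individuals.

```python
# Hypothetical data illustrating a linkage attack in the spirit of Sweeney's work.
medical = [  # names removed, but quasi-identifiers retained
    {"zip": "02138", "dob": "1945-07-31", "sex": "F", "diagnosis": "hypertension"},
    {"zip": "02139", "dob": "1972-03-12", "sex": "M", "diagnosis": "asthma"},
]

voters = [  # public record: names present alongside the same quasi-identifiers
    {"name": "Alice", "zip": "02138", "dob": "1945-07-31", "sex": "F"},
    {"name": "Bob",   "zip": "02139", "dob": "1972-03-12", "sex": "M"},
    {"name": "Carol", "zip": "02139", "dob": "1980-01-01", "sex": "F"},
]

def reidentify(medical, voters):
    """Join the two tables on the (zip, dob, sex) quasi-identifier triple."""
    out = {}
    for m in medical:
        key = (m["zip"], m["dob"], m["sex"])
        matches = [v["name"] for v in voters
                   if (v["zip"], v["dob"], v["sex"]) == key]
        if len(matches) == 1:  # a unique match means the record is re-identified
            out[matches[0]] = m["diagnosis"]
    return out

print(reidentify(medical, voters))  # {'Alice': 'hypertension', 'Bob': 'asthma'}
```

No names were ever "leaked" here; the join across datasets is what does the damage, which is exactly why stripping identifiers alone is not anonymization.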

While privacy as a research enterprise has been a major activity in all aspects of data analysis for a very long time, Sweeney’s work kicked off a new wave of interest in the topic. Whether it was private information retrieval, or secure multiparty computation, or even differential privacy, the idea of keeping one’s information private while living online became a major theme of research in the area.
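For readers unfamiliar with differential privacy, the core trick can be sketched in a few lines. The standard Laplace mechanism releases a count after adding noise scaled to the query's sensitivity (a count changes by at most 1 when one person is added or removed), so the released value is epsilon-differentially private. This is a toy sketch of that one mechanism, not a substitute for a vetted DP library.

```python
import random

def laplace_noise(scale):
    # A Laplace(0, scale) variate is the difference of two Exponential(1/scale) variates.
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def private_count(records, predicate, epsilon):
    """Release a count of records matching `predicate` with epsilon-DP.

    A counting query has sensitivity 1, so Laplace noise with scale
    1/epsilon suffices.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)
```

Smaller epsilon means more noise and stronger privacy; the noisy answers remain useful in aggregate because the noise has mean zero.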

Research is all well and good. But (data) privacy is one of those things that it’s hard to get the general public up in arms about. Every time Facebook changes their privacy policies there’s a hue and cry among tech journalists, but it’s also quite clear that the average Facebook user just can’t bring themselves to care.

Why is this? Privacy protection is not just something that we should leave to the EFF and the ACLU to worry about. Contrary to what Eric Schmidt and the NSA might think, privacy isn’t something to worry about only when we do something wrong. For a casual take on this, see the ACLU’s Invasion of the Data Snatchers video, and for an excellent in-depth look at the issue of privacy rendered in the form of a graphic novel, look at Terms and Conditions.

But a 2014 Pew Research survey on public perception of privacy in the post-Snowden era indicates that apathy is not the issue here. Rather, the public has a far more nuanced understanding of privacy and its tradeoff with utility than perhaps we are given credit for. For example,

  • 61% of adults “disagree” or “strongly disagree” with the statement: “I appreciate that online services are more efficient because of the increased access they have to my personal data.”
  • At the same time, 55% “agree” or “strongly agree” with the statement: “I am willing to share some information about myself with companies in order to use online services for free.”

I think web users have picked up unconsciously on something that should be made explicit.

Privacy (especially data privacy) is a concept that needs unbundling

There are many “contexts” that surround any question of privacy, and depending on the context, a “leak of data” wouldn’t necessarily be construed as a “violation of privacy”. As Dave Winer once put it, “perfectly targeted ads are just information”.

None of this is new in the legal space: the existence (or not) of a right to privacy and what that means in specific contexts is being debated endlessly in legal circles and in the court system.

But because of the new opportunities (or should I say “challenges”?) presented by the mountains of data available and being collected online, what we are now seeing is the unbundling of the notion of data privacy and a shift in focus away from “did someone learn about my private information” towards “did someone do something inappropriate/unethical/illegal with my information”.

I should say: this shift is not a replacement. I’m not saying that we don’t care about protecting our data. But the binary question “is my data safe or not” no longer seems to be a useful question or even the right question.

My understanding of this shift started with a paper by Benjamin Wittes of the Brookings Institution back in 2011. In it, he coined the term ‘Databuse’:

The relevant concept is not, in my judgment, protecting some elusive positive right of user privacy but, rather, protecting a negative right — a right against the unjustified deployment of user data in a fashion adverse to the user’s interests

The Wittes article is worth reading even just for his explanation of the taxonomy of privacy notions first developed by Solove. He illustrates the many different (and often subtly different) concepts swirling around when we talk about privacy and argues convincingly that these notions are both legally incoherent and inappropriate for talking about data.

In a followup article, Wittes and Wells C. Bennett take on the task of constructing a set of policy recommendations out of the concept of databuse. They identify three different categories of data use:

  • a first where both the data provider (the consumer) and the agent using the data (typically a corporation) benefit from the sharing of data (when Amazon gives me targeted recommendations, for example).
  • a second where the corporation benefits and the consumer incurs no harm. An example of this is advertising targeted at a user based on profile information.
  • a third where the user incurs direct harm. They argue that this is the easiest case in which to argue for government intervention.

There’s a quibble one could make right here. Category II data sharing via advertising might be benign (for example if ads are targeted to you based on your preferences). But it can also veer into the territory of abuse. Consider two examples (both of which I heard of during talks at FATML):

  • You’re presented with targeted ads based on your Google ads profile. If you’re male and appear to be looking for jobs, you’re presented with ads for high paying careers or opportunities to prepare for such careers. If you’re female, you’re not.
  • You’re placed in a certain racial category by an automatic system that infers race based on your first name, and then you’re presented with ads that are biased based on that categorization (this is the famous new work by Latanya Sweeney).

In both cases, one might try to argue that there’s direct harm to the user and that this is therefore a case of category III, but really this is targeted advertising that falls squarely in the realm of category II. The problem here is that data is being linked before recommendations are made: this linkage is what causes problems.

There’s another problem arising in the category II data sharing scenarios. Wittes and Bennett argue that perceived privacy violations are driven by “privacy as sentiment”, where a consumer might feel unease or distaste over the use of their data. They reject this premise as a basis for policy: to use modern vernacular, what they object to is legislation or regulation based on “teh feels”, or the emotional reaction to data sharing.

Which brings me to another important work in this area that tries to unpack the notion of privacy. Helen Nissenbaum is a philosopher and communications theorist at NYU who has developed the fascinating idea of ‘privacy contexts’. At the risk of summarizing a complex web of ideas, what I distill from her work is that we inhabit various contexts in our social interactions. Data sharing within a context is not considered a breach of one’s privacy, but sharing outside the context in a way that’s out of our control is what causes us to balk.

For example, you wouldn’t object to sharing personal medical information with your doctor. But if you now meet that doctor at a party and they share that information in public, that’s not going to make you very happy.

The problem this poses for standard notions of privacy preservation is that you voluntarily released the sensitive information yourself, so there’s no issue of anonymizing the data or ensuring differential privacy. The problem is that the data has shifted contexts, and that’s where the databuse arises.

In a sense, Nissenbaum’s work provides an answer to a complaint posed by Wittes and Bennett. They object to claims of privacy violation based on sentiment and reject the idea that we should try to examine these violations more closely. Her work provides a working definition of what it means to incur such privacy violations in terms of an out-of-context use of data and suggests that maybe the way we share data should be contained within contexts with rules about how these contexts can leak data to each other.
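One way to see what "rules about how contexts can leak data to each other" might mean operationally is to model contextual integrity as a flow policy: data tagged with its originating context may only move along explicitly permitted context-to-context edges. This is a hypothetical sketch of the idea (the context names and `ALLOWED_FLOWS` table are mine, not Nissenbaum's), not a claim about how her framework is formalized.

```python
# Hypothetical contexts and permitted flows, illustrating the doctor example above.
ALLOWED_FLOWS = {  # (from_context, to_context) pairs deemed appropriate
    ("patient", "doctor"),
    ("doctor", "specialist"),
}

def flow_permitted(from_ctx, to_ctx):
    """A data flow is appropriate only if its context edge is on the whitelist."""
    return (from_ctx, to_ctx) in ALLOWED_FLOWS

# Sharing medical details with your doctor is fine...
assert flow_permitted("patient", "doctor")
# ...but the doctor repeating them at a party is an out-of-context flow.
assert not flow_permitted("doctor", "party")
```

The point of the sketch is that the violation is defined by the edge the data travels, not by whether the data was secret in the first place.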

The shift in focus from privacy (“keep my data secret”) to more general questions of databuse (“don’t do bad things with data”) is embodied by work that Cynthia Dwork has been doing. She is of course one of the founders of differential privacy, but she’s also been heavily involved in the shift away from privacy and towards issues of fairness and accountability in data mining. Apart from the work on fairness through awareness that I’ve mentioned before, she’s also written a position piece with Deirdre Mulligan for the Stanford Law Review titled “It’s not privacy, and it’s not fair”, outlining this evolution.

There’s a strain of “if you don’t want to play, go home” in responses to concerns about privacy and (now) databuse. The sense is that if you’re willing to share information about yourself for your benefit, then you deserve whatever else happens after that. If you post pictures of your late night bacchanalia on Facebook, don’t be surprised if a future employer looks askance at you, and so on.

This is in my view an untenable position to take. Unless you plan to become the digital Amish and to move to the town in West Virginia where wifi doesn’t work, there are too many ways in which shared data is woven into the fabric of your day to make “going off the grid” a practical solution to concerns about databuse. That cat is well and truly out of the bag.

And don’t get me wrong. I like it that Google remembers my recent requests for directions and that Google Now immediately pops up my commute. Amazon recommendations are immensely useful. But there’s no reason to have to live with “all-or-nothing” solutions. There is a wealth of new questions to ask and answer that require us to think very carefully about the ways in which we share data, how it can be (mis)used, and how we can prevent this from happening.

Postscript: I sent this article out for comment prior to publishing it. One reaction that I got was “well that’s all well and good, but what do you want to do about it?”.

It’s a good point. And I don’t think there’s even one concrete action we can take to prevent databuse right now: the problem is large and multifaceted.

But there’s interest in a number of questions that would help us understand how to build systems, policies and algorithms to regulate and monitor the use of our data. And maybe that’s the topic of another post.

— Suresh Venkat