Privacy, Big Data, and the Internet

Setting the stage for this week’s Columbia SIPA conference on Internet Governance and Cybersecurity


Since last fall, I’ve been a (very) part-time senior fellow in Internet governance and cybersecurity at Columbia University’s School of International and Public Affairs. It’s a remarkable place, stacked with first-rate scholars and practitioners, home to energetic and engaged students from across the globe, and an emerging center of gravity for cross-disciplinary work on international technology policy issues. I’ve been able to spend time with faculty and researchers from across Columbia, like Matthew Waxman and Tim Wu at the Law School, Steve Bellovin at the Department of Computer Science, Eli Noam at the Business School and the Institute for Tele-Information, Agnès Callamard of the Project on Global Freedom of Expression and Information, and Emily Bell of the Journalism School and the Tow Center, as well as visiting scholars like Herbert Lin and Martin Varsavsky. I was hugely impressed by the range and depth of tech-relevant work by Columbia faculty and graduate students, most notably at the Data on a Mission Summit organized by the Data Science Institute, at which I delivered a closing keynote on ethical and policy issues triggered by big data, cheap computing power, service abstraction, and global, cross-border connectivity.

I’m grateful to Dean Merit Janow for affording me the opportunity for those interactions, the product of which will be a series of essays on Internet governance and cybersecurity that I’ll be publishing starting this week.

Reflecting its determination to make serious contributions to the field of technology and Internet policy, SIPA is hosting a major international conference later this week on Internet Governance and Cybersecurity, in collaboration with the Global Commission on Internet Governance. The line-up for the event is truly fantastic, including legendary Internet architects like Vint Cerf, policymakers like Larry Strickling, Michael Chertoff, and Christopher Painter, legislators like Marietje Schaake, administrators like Fadi Chehadé, activists like Ronaldo Lemos, Rebecca MacKinnon, Carolina Rossini, and Nuala O’Connor, investors like Brad Burnham, and business leaders and technologists from companies like Microsoft, Cloudflare, FireEye, MasterCard, J. P. Morgan, Citigroup, Verizon, and NASDAQ. All that, plus an amazing range of faculty and scholars from across Columbia and beyond.


Panel and Panelists

On Thursday, I’ll be moderating a session on “Privacy, Big Data, and the Internet.” Joining me for the discussion will be:

  • Nuala O’Connor, President & CEO of the Center for Democracy and Technology. She’s been a leading policymaker, advocate, and executive focused on privacy at places like the U.S. Departments of Commerce and Homeland Security, DoubleClick, Amazon, and GE. She’s written, spoken, and testified so much that it’s hard to know what to recommend, but for purposes of this week’s conference, a good starting point is her essay “Encryption Makes Us All Safer,” or, if you’re more of a visual learner, her appearance on the PBS NewsHour discussing “Can the tech industry strike the privacy/safety balance?”. She’s on Twitter.

The panelists and I have sketched out a basic framework for our conversation on privacy, big data, and the Internet:

  • Boundaries and tensions created by Big Data among privacy, security, and freedom of expression and transaction: What are the varying approaches being taken in the world’s major economies, and what are their intended and unintended consequences? For example, can governments protect citizens and/or enforce policy preferences by requiring localization of data and the infrastructure by which it is stored and analyzed?
  • Balances vs. Trade-offs: Can privacy effectively be protected by a government without stifling innovation or killing the potential of Big Data to power improvements in health care, scientific research, transportation, environmental protection, and so forth?
  • Public vs. Private: Is there a meaningful distinction between government and corporate data sets? As they operate across borders, what are the responsibilities of corporations to protect privacy?
  • Equity vs. Neutrality: Are Big Data and algorithmic decision-making in areas like insurance, banking, hiring, evaluation, admissions, and criminal sentencing producing disparate or adverse impacts on the poor, the marginalized, the disadvantaged?

For my part, I thought it might be useful to lay out some background thinking on the issues.


Backgrounder on Privacy, Big Data, and the Internet

What do we mean by “Big Data”?

Though somewhat fuzzy — much like “cloud computing” — “big data” typically refers to large-scale, connected database infrastructures that are (1) big in terms of the number of data points collected, and/or (2) big in terms of the analytics that can be performed.

Edd Dumbill has written a solid working definition:

Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the strictures of your database architectures.

As Edd elaborates in his excellent piece, big data is commonly characterized through the three Vs: volume, velocity, and variety — the sheer size of the datasets, the speed at which data are collected and analyzed, and the wide range of data sources, formats, and file types supported.

For purposes of a discussion around privacy, I also find useful a complementary definition proposed by Kord Davis and Doug Patterson in their 2012 book “Ethics of Big Data: Balancing Risk and Innovation”:

…big data is data big enough to raise practical rather than merely theoretical concerns about the effectiveness of anonymization.

This definition crisply identifies the central source of anxiety about big data: that it facilitates irreversible privacy-related harms by distant actors that are neither transparent nor accountable.
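To make that anxiety concrete, here’s a minimal sketch of a “linkage attack,” the standard way ostensibly anonymized records get re-identified. The example is in Python with pandas; the datasets, column names, and records are all invented for illustration, not drawn from any real study.

```python
# Toy "linkage attack": re-identifying an "anonymized" dataset by joining it
# to a public dataset on shared quasi-identifiers. All data here is invented.
import pandas as pd

# "Anonymized" medical records: names removed, quasi-identifiers retained.
medical = pd.DataFrame([
    {"zip": "10027", "birth_date": "1980-03-14", "sex": "F", "diagnosis": "asthma"},
    {"zip": "10027", "birth_date": "1975-07-02", "sex": "M", "diagnosis": "diabetes"},
])

# A public dataset (say, a voter roll) with names and the same quasi-identifiers.
voters = pd.DataFrame([
    {"name": "Jane Roe", "zip": "10027", "birth_date": "1980-03-14", "sex": "F"},
    {"name": "John Doe", "zip": "10027", "birth_date": "1975-07-02", "sex": "M"},
])

# Joining on the quasi-identifiers re-attaches names to diagnoses.
reidentified = medical.merge(voters, on=["zip", "birth_date", "sex"])
print(reidentified[["name", "diagnosis"]])
```

When a combination of quasi-identifiers is rare in the population (and with enough columns it usually is), the join is unambiguous. That is why simply stripping names out of a large dataset rarely delivers the anonymity it promises.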

For example, the Wall Street Journal has reported that a couple of big data firms were able to determine that the length of an employee’s commute correlates to the likelihood the worker will quit, especially in call-center and fast-food jobs. Should employers be able to make hiring decisions based on commute-based attrition risk? What if (as appears to be the case) black and Latino workers have longer average commutes, reflecting long-standing patterns of residential segregation and business siting?

Likewise, those big data firms have determined that workers who have moved frequently present greater attrition risk. Should employers be able to reverse-engineer an applicant’s past residential moves, and factor that into the hiring decision?
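To see how such a screen works mechanically, and why it can produce the disparate impact just described, here’s a hypothetical sketch in Python. It is not the vendors’ actual model (which isn’t public); the group labels, commute distributions, and quit rates are synthetic numbers chosen only to illustrate the logic.

```python
# Toy illustration: a commute-based "attrition risk" screen and its disparate impact.
# All numbers are synthetic; this is not any vendor's actual model.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Invented setup: group B has longer average commutes (e.g., reflecting
# residential segregation), and longer commutes modestly raise quit rates.
group = rng.choice(["A", "B"], size=n)
commute_min = np.where(group == "A", rng.normal(25, 8, n), rng.normal(45, 10, n))
quit_prob = np.clip(0.10 + 0.004 * (commute_min - 30), 0.01, 0.9)
quit = rng.random(n) < quit_prob

# A naive screen: reject applicants whose commute exceeds a cutoff chosen
# from the observed commute/attrition correlation.
print("correlation(commute, quit):", round(np.corrcoef(commute_min, quit)[0, 1], 3))
cutoff = 40  # minutes
for g in ("A", "B"):
    rejected = np.mean(commute_min[group == g] > cutoff)
    print(f"group {g}: share screened out = {rejected:.0%}")
```

The screen never looks at group membership, yet it excludes a far larger share of the group with longer average commutes, which is exactly the worry about “objective” correlations standing in for protected characteristics.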

A similar example of the fears underlying big data analysis relates to companies that calculate credit scores: If big data demonstrates that an applicant’s creditworthiness strongly correlates with the credit scores of his/her neighbors, should the company be able to assign a score on that basis? Is it fair to penalize (or reward) people based on the (objectively correlated) behaviors of their neighbors? Should it be legal?

For a cogent argument against letting big data go wild, see Alvaro Bedoya’s “Big Data and the Underground Railroad.”

The Conventional Frame Around Big Data and Privacy

Looking around the landscape of privacy debates circa 2015, we can discern a conventional — and, I will argue, tired — frame that positions big data privacy policy as a choice between restrictions on the collection of data and restrictions on the use of data.

On the one side, the common account goes, we have privacy advocacy groups, the European Commission, and a number of European national data protection regulators. They contend that it’s essential to limit the very collection of data — that restricting the use of collected data alone isn’t enough. They point to the lessons of recent history: the horrendous consequences that flowed from the pervasive surveillance apparatus and meticulous record-keeping and dossier-building of Eastern Europe’s Communist regimes and Germany’s Nazis. In the hands of the Stasi or the Gestapo, even seemingly innocuous information could later be used to betray or condemn someone. Even here in the United States, we saw our post-Pearl Harbor Congress lift prior statutory restrictions on the use of 1940 Census records in order to identify Americans of Japanese descent for wartime internment. In later decades, we saw purportedly confidential records held by doctors, chaplains, and psychologists used to identify and take action against gay and lesbian military servicemembers. Once data is collected, they argue, use restrictions may prove ephemeral.

On the other side stand Silicon Valley and the business community, joined recently by the Obama Administration, who argue that the benefits of unfettered data collection are so astounding and important, and the harms sufficiently small and manageable, that we would be foolish to place broad, prophylactic limits on collection; instead, we should impose use restrictions wherever necessary to vindicate important public policy objectives like protecting individual privacy and preventing illegal discrimination. They point to big data-driven advances in health care, transportation, and education. For example, they note how the analysis of massive databases of anonymized medical records can enable a doctor to determine which medication is the best treatment for a given patient, based on observed symptoms and personal characteristics. They acknowledge that individual patient records are hard to anonymize, easy to de-anonymize, and can later be used to humiliate a victim through public disclosure, but argue that collection restrictions would be overkill. They contend that targeted use restrictions better preserve the benefits of big data while affording a meaningful level of protection for privacy.

The “use restrictions only” side has also argued that the benefits of big data collection will redound specifically to disadvantaged groups. For example, one big data firm has “calculate[d] that employees with criminal backgrounds are 1 to 1.5 percent more productive on the job than people without criminal records.” They point out that those with the least resources have the most to gain from big data in health care diagnosis and treatment, in reduced commute times, in effective teaching.

This latter position was well summed up in a 2014 study of big data and privacy by the President’s Council of Advisors on Science and Technology (PCAST):

The beneficial uses of near-ubiquitous data collection are large, and they fuel an increasingly important set of economic activities. … [A] policy focus on limiting data collection will not be a broadly applicable or scalable strategy — nor one likely to achieve the right balance between beneficial results and unintended negative consequences (such as inhibiting economic growth).

Specific Concerns

In this conventional framing of current privacy debates, we can divide big data into that which is “born digital,” like email messages, and that which is “born analog,” like the recordings of security and traffic cameras, microphones, automobile GPS devices, and other sensors.

The concerns with “born digital” data are, broadly, twofold:

  • Overcollection, whereby a corporation gathers way more data than it needs to provide its service, and individuals are unaware of the nature or extent of data that is being collected. For example, the smartphone flashlight app that constantly records the user’s precise geographic location.
  • Data fusion, meaning the combination of disparate data sources into profiles and tracking records that, thanks to the analytic power of big data, enable the identification of specific people with specific activities over time (a brief sketch follows below).
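As a concrete illustration of what data fusion looks like in practice, here’s a small hypothetical sketch: records from unrelated sources that share nothing but an advertising identifier are stitched into a single timeline of one person’s day. The sources, identifiers, and events are invented.

```python
# Toy "data fusion": stitching disparate data sources, keyed on a shared
# identifier (here, a hypothetical advertising ID), into one timeline.
import pandas as pd

location_pings = pd.DataFrame([
    {"ad_id": "abc123", "ts": "2015-04-10 08:05", "source": "flashlight_app", "detail": "40.807,-73.962"},
    {"ad_id": "abc123", "ts": "2015-04-10 12:30", "source": "weather_app", "detail": "40.758,-73.985"},
])
purchases = pd.DataFrame([
    {"ad_id": "abc123", "ts": "2015-04-10 12:41", "source": "retailer", "detail": "pharmacy purchase"},
])

# Concatenate and sort: a per-identifier activity profile over time.
profile = (pd.concat([location_pings, purchases])
             .sort_values("ts")
             .groupby("ad_id"))
for ad_id, events in profile:
    print(ad_id)
    print(events[["ts", "source", "detail"]].to_string(index=False))
```

Neither source is especially sensitive on its own; fused and ordered in time, they start to read like a diary, which is the profile-building concern described above.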

The concerns with “born analog” data are somewhat different. By definition, there is a ton of overcollection — a lot of noise with the signal. The retail security camera, for example, has to record and store video constantly, capturing the faces of every customer, in order to get the 4-second clip that allows identification of the shoplifter. A central concern here is that ever-cheaper, ever-more-powerful computing and vastly improved optical processing now allow for highly effective image recognition, killing any hope that we can maintain privacy of activity, of movement, of travel, of meeting, of assembly.

History and The Shift

Of course, the debate over collection vs. use restrictions isn’t new. Dating back to the 1970s, privacy laws — even in the U.S. — attempted to place limits on the collection AND use of data: the Fair Credit Reporting Act of 1970, the Family Educational Rights and Privacy Act of 1974, the HIPAA Privacy Rule, the Federal Trade Commission’s Fair Information Practice Principles.

But the weight of opinion among policymakers in Washington and beyond (though notably not in Europe) has shifted dramatically in recent years, toward the view that it’s time to give up efforts to regulate data collection, and instead place the government’s focus squarely on formulating and enforcing sensible limitations on the use of collected data.

Why? That’ll be a question for the panel. But it seems clear that policymakers, working from the conventional frame, have accepted that the benefits of ubiquitous data collection are so great, we can’t afford to impose crude, ill-fitting restrictions that are rooted in purely theoretical privacy harms. Moreover, they appear to have concluded that, practically, there’s no way to enforce collection limits in the age of ubiquitously connected smartphones with their mobile apps, social networks, location services, messaging tools, cameras, etc. That represents a vast shift from the prevailing view of an earlier technological age, which held that there was no practical way to enforce use limits, given the lack of visibility into what happens to data once they have entered a database: Sold? Combined with other data? Who can tell?

Backdrop

The debate over big data and privacy in the Internet era implicates a host of related policy issues, the inter-relationships among which I’ll be provoking the panel to address. They include:

  • Censorship, political and otherwise,
  • State and non-state surveillance, including efforts to amend or reform the Wiretap Act (Title III of the Omnibus Crime Control and Safe Streets Act), the Foreign Intelligence Surveillance Act, and the USA PATRIOT Act,
  • Conflicts among rules and interests across jurisdictions,
  • The relative obligations of governments and corporations in an environment in which virtually all of the communications infrastructure in question is privately owned and operated.

See you on Thursday!


Andrew McLaughlin is CEO of Digg, a partner at betaworks, and a senior fellow in Internet governance and cybersecurity at Columbia SIPA. He has previously served as Deputy Chief Technology Officer of the U.S. under President Obama, director of global public policy at Google, and Chief Policy Officer of the Internet Corporation for Assigned Names and Numbers.