Digesting the White House Big Data Reports

The Consensus of Disruption

VJ Kapur
May 8, 2014

Last week, two Big Data reports were delivered to the President in response to his call for a 90-day study:

Big Data: Seizing Opportunities, Preserving Values (announcement, fact sheet, full report [PDF]) focuses on the privacy impacts of data collection and analysis. The report was prepared by White House counselor John Podesta in cooperation with the White House’s Office of Science and Technology Policy (OSTP), the National Economic Council, and the Departments of Commerce and Energy.

Big Data and Privacy: A Technological Perspective (announcement, fact sheet [PDF], full report [PDF]) focuses more squarely on the commercial technology driving the Big Data movement. The report was prepared by the President’s Council of Advisors on Science and Technology (PCAST), an advisory group of scientists and engineers from across academia and industry.

I should start by saying that there are a lot of things I liked about these reports (is that foreshadowing that I’m about to dwell on disappointment?). The reports do a great job of outlining the complex economic, social, and legal implications of growing technology adoption with due consideration for our mixed history, spectrum of values, and future outlook. The Podesta report also includes a comprehensive look at some of this administration’s underappreciated data dissemination initiatives and poignantly reiterates 2012's Consumer Privacy Bill of Rights [PDF]. Even the Electronic Frontier Foundation had some positive reactions.

Perhaps most importantly, the reports are an aggressively researched reflection on whatever possible commercial+governmental+academic consensus exists on what “Big Data” is and can do. Unfortunately, as I’ve snarkily insisted, this consensus leaves something to be desired.

Classically, the reports have their share of allusions to the three V’s and a catch-all juxtaposition to the “traditional modes” that Big Data diverges from:

For purposes of this study, the review group focused on data that is so large in volume, so diverse in variety or moving with such velocity, that traditional modes of data capture and analysis are insufficient… (Podesta, pg. 4)

The PCAST report goes a little further by incorporating an incomplete list of non-“traditional modes,” some close to 20 years old at this point:

…while computer scientists reviewing multiple definitions offer the more technical, “a term describing the storage and analysis of large and/or complex data sets using a series of techniques including, but not limited to, NoSQL, MapReduce, and machine learning.”[5] (PCAST, pg. 2)
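
For the record, the MapReduce in that list is less sorcery than branding: map every record to key/value pairs, group by key, then reduce each group. Here’s a toy word count in plain Python, a sketch of the idea rather than anything from the reports:

    from collections import defaultdict

    # Toy sketch of the MapReduce idea (my example, not the reports'):
    # map each record to (key, value) pairs, group by key, reduce each group.
    documents = [
        "big data is big",
        "data about data",
    ]

    # Map: emit (word, 1) for every word in every document.
    mapped = [(word, 1) for doc in documents for word in doc.split()]

    # Shuffle: group emitted counts by word.
    grouped = defaultdict(list)
    for word, count in mapped:
        grouped[word].append(count)

    # Reduce: collapse each group to a single value.
    word_counts = {word: sum(counts) for word, counts in grouped.items()}
    print(word_counts)  # {'big': 2, 'data': 3, 'is': 1, 'about': 1}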

As is in vogue, they tenuously associate all that data with loosely related topics of more certain value:

Big data is big in two different senses. It is big in the quantity and variety of data that are available to be processed. And, it is big in the scale of analysis (termed “analytics”) that can be applied to those data, ultimately to make inferences and draw conclusions. (PCAST, pg. x)

All well and good, but allow me to summarize the “Big Data” issue in five bullet points:

  • data is useful, but can be harmful; this has been the case since, presumably, proto-humans etched mammoth migrations on cave walls to strategically schedule mammoth hunts, and leaks prompted other proto-humans to plan cave raids around the same time
  • since then, we’ve had various periods of cultural/scientific progress wherein we’ve collected and/or communicated significantly more information as a species than in times before; there have always been governance hurdles in maximizing the rewards and minimizing the risks of this progress
  • with more and more digital devices always on, always on us, adorned with sensors, and invisibly linked at all times to virtually infinite data repositories that losslessly capture and persist all of our thoughts and experiences, we are probably within the most significant of these periods so far, continually for the foreseeable future
  • we can apply math to this data in less time at lower cost than we previously could, due to innovations that were pretty much a waste product of all the straight-up sorcery mentioned in that last bullet point
  • we’re not entirely sure what kind of rewards and risks we’re talking about at this point because the sheer magnitude of this math is more than we can readily comprehend; we have some preliminary ideas drawn from anecdotes of data usage and a long, tragic history of mistreating people based on superficial knowledge about them, but we’re really only scratching the surface.

No, really, that’s right about where we are with this. I propose we call this notion the data singularity.

…one can never know what information may later be extracted from any particular collection of big data, both because that information may result only from the combination of seemingly unrelated data sets, and because the algorithm for revealing the new information may not even have been invented at the time of collection. (PCAST, pg. ix)

And it’s important not to let our old-fashioned notions of hypothesis testing get in the way here:

In the past, searching large datasets required both rationally organized data and a specific research question… big data analytics enable data scientists to amass lots of data, including unstructured data, and find anomalies or patterns… in order to find the needle, you have to have a haystack. (Podesta, pgs. 6-7)

Did you get that? Presumably, we’ll figure out some questions after burying ourselves in a sufficiently large collection of answers.

How large is sufficiently large, you ask?

Shut up. That’s how large.

source: http://cheezburger.com/3415087616
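
To be fair, the needle-finding pitch does map onto something concrete: unsupervised anomaly detection, where you flag records that sit far from the bulk of the data without stating a hypothesis first. A minimal z-score version (mine, not the reports’):

    import statistics

    # Minimal sketch of hypothesis-free needle finding: flag values that sit
    # far from the bulk of the haystack. Data and threshold are invented.
    haystack = [10.2, 9.8, 10.1, 9.9, 10.0, 10.3, 9.7, 42.0, 10.1, 9.95]

    mean = statistics.mean(haystack)
    stdev = statistics.stdev(haystack)

    # Anything more than two standard deviations from the mean is a candidate needle.
    needles = [x for x in haystack if abs(x - mean) / stdev > 2]
    print(needles)  # [42.0]

Whether a flagged needle means anything is, of course, a separate question.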

There are three assumptions underlying this mentality, none of which I’m sure I can confirm or deny at the moment, but all of which I’m skeptical of:

  • there is an “inflection point” in data volume past which conclusions spontaneously appear (admittedly, they had an anecdote [PPT] for this)
  • even past the inflection point, when evidence is firmly in hand, adding data to the pool always clarifies this evidence
  • we won’t know if there are diminishing returns on this pursuit of more data, because diminishing returns today may be a shortcoming of our techniques that we’re likely to overcome down the line; potentially, we will only overcome those shortcomings if we collect the data to develop those techniques in the first place

The question of whether more data is always better tends to draw concurrence from startup execs and skepticism from statisticians, a debate these reports didn’t really get into.
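
To make the statisticians’ side concrete, here’s a toy illustration of diminishing returns, mine and not the reports’, assuming the quantity of interest is something as simple as a mean: precision improves roughly with the square root of the sample size, so each tenfold increase in data buys only about a threefold improvement.

    import random
    import statistics

    # Toy illustration: estimate a known population mean from ever-larger
    # samples and watch the error shrink more and more slowly.
    random.seed(42)
    true_mean = 10.0

    for n in [100, 1_000, 10_000, 100_000, 1_000_000]:
        sample = [random.gauss(true_mean, 5.0) for _ in range(n)]
        error = abs(statistics.mean(sample) - true_mean)
        # Standard error falls roughly as 1/sqrt(n): 10x more data,
        # only ~3x more precision on this simple estimate.
        print(f"n={n:>9,}  |estimate - truth| = {error:.4f}")

None of this settles the question for more complex models, but it’s the kind of analysis the reports could have engaged with.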

As part of its 90-day study, the Podesta commission met directly with industry stakeholders, held academic workshops, and solicited wide written participation from the world at large through a Request for Information; a comprehensive list of participants can be seen in Appendices B through E of that report. You may (but probably won’t) notice Intrical listed at the bottom of page 77. Our RFI response addressed a few different topics, and many of our thoughts were consistent with the presented consensus. However, the reports overlooked one of our major concerns, and may very well have exacerbated it with nods to a data singularity. Excerpting our RFI response:

Big Data’s biggest impact on privacy may be cultural rather than technological. Increasingly, data logged by routine processes such as digital communication, digital account management, digital media consumption, etc. is seen as a commodity for acquisition and ingestion for purposes including scientific research, commercial research, consumer targeting, law enforcement, and national security. This culture has produced an arms race[1] with higher volume seen as a success metric[2], which creates a decoupling between the drive to acquire/digest this data and the measurable utilization of this data. This, in turn, leads to an indiscriminate spread of data across organizations. Unfortunately, consumers are left with little control over this data, or appropriate awareness of its implications. Vendor privacy policies are rarely well considered or well understood. Meanwhile, national security initiatives are necessarily opaque to minimize potential for malicious subversion.

Cases where Big Data has made marked contributions in application are scarce. Most stories of Big Data success are anecdotes with little hard evidence that the Big nature of the Data was inherent to the value produced[5][6][7]. For this reason, we believe developing a rigorous method for measuring the success of Big Data programs should be a major near-term goal. Big Data projects that possess any privacy concerns, but lack appropriate analysis to justify introducing those concerns, should be heavily scrutinized by public policy.

The Government should follow precedent set by the Belmont Report and establish ethical principles and guidelines for public and private institutions utilizing Big Data. Guidelines developed in consultation with industry experts would serve not only to protect consumers, but also to establish best practices and a quality standard by which Big Data solutions are evaluated.

We recommended developing a system of tiers featuring differing levels of guardedness/scrutiny to justify both collection and investigation, taking into account factors such as the following (a toy sketch of what such a scoring scheme might look like follows this list):

  • the nature of the investigation, i.e. scientific/medical, commercial, or national security-related
  • the amount and nature of consent given by individuals described in any way by the candidate data
  • the nature of the attributes in the data, i.e. biographical data, location data, consumption history, medical information, personal communication metadata, or personal communication content
  • the provenance of the data, as determined by the nature of the source
  • the number and nature of de-identification processes the data has been through (though the reports consider de-identification an imperfect solution, given the theoretical ease of re-identification)
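
To make that less abstract, here’s the toy sketch promised above. Every factor name, weight, and threshold is invented for illustration; none of it comes from the reports or from our RFI text (provenance is omitted for brevity):

    from dataclasses import dataclass
    from typing import List

    # Hypothetical scoring scheme for the tiered approach described above.
    # All names, weights, and thresholds are invented for illustration.
    PURPOSE_WEIGHT = {"scientific": 0, "commercial": 2, "national_security": 3}
    ATTRIBUTE_WEIGHT = {
        "biographical": 1,
        "location": 2,
        "consumption_history": 2,
        "medical": 3,
        "communication_metadata": 3,
        "communication_content": 4,
    }

    @dataclass
    class Candidate:
        purpose: str                   # nature of the investigation
        consent: bool                  # meaningful consent from the people described
        attributes: List[str]          # kinds of data involved
        deidentification_passes: int   # imperfect mitigation, but counted

    def scrutiny_tier(c: Candidate) -> str:
        score = PURPOSE_WEIGHT[c.purpose]
        score += sum(ATTRIBUTE_WEIGHT[a] for a in c.attributes)
        score += 0 if c.consent else 3
        score -= min(c.deidentification_passes, 2)  # capped: re-identification risk remains
        if score <= 3:
            return "low scrutiny"
        if score <= 7:
            return "moderate scrutiny"
        return "high scrutiny"

    print(scrutiny_tier(Candidate("scientific", True, ["medical"], 1)))
    # -> low scrutiny
    print(scrutiny_tier(Candidate("commercial", False,
                                  ["location", "communication_metadata"], 0)))
    # -> high scrutiny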

The general public’s shaky grasp of self-regulated privacy policies was certainly addressed by both reports, replete with tangible policies and guidance for their technical implementation. However, they stopped short of making recommendations on focus and discipline in data collection and analytic investigation, or on standards against which we may justify collection and investigation across the multi-dimensional spectrum of data-driven pursuits. The topic is broached in the Podesta report’s concluding question, and only two of the above factors are considered:

Should there be an agreed-upon taxonomy that distinguishes information that you do not collect or use under any circumstances, information that you can collect or use without obtaining consent, and information that you collect and use only with consent? How should this taxonomy be different for a medical researcher trying to cure cancer and a marketer targeting ads for consumer products? (Podesta, pg. 56)

Leaving this as an open question is better than not acknowledging the spectrum at all. Hopefully, there will be progress on such a framework now that these reports are out there. Ideally, we will research and debate which postures of collection and investigation are fruitful and which are a waste of our still-limited resources, and consider the societal value of various types of investigative pursuits. Such a disciplined approach wouldn’t only be better for privacy; it could also result in better science.

--

VJ Kapur

computer scientist, musician; Principal Engineer at Intrical (http://intrical.us), drummer/composer in Strange Victories (http://strangevictories.com)