An Obligatory Post on Big Data

Because Disruption

VJ Kapur
5 min read · Apr 15, 2014



As a competent technologist, finding substantive content on the topic of “big data” is… difficult. Immerse yourself in the “literature,” and you’ll be bombarded by:

Pseudo-substantiated anecdotes. Big data saves lives, predicts elections, and also determines elections! Except that it doesn’t, doesn’t, and doesn’t. At least not conclusively to date.

Revolution! Disruption! Never mind that data management, data mining, and nonlinear system identification, among many other topics in play here, have been studied for a very long time. Even business intelligence professionals can attest that most of this stuff is old hat. And for that matter, there’s…

Caveatless conflation with other topics. If we’re talking about data mining, are we necessarily talking Big Data? The converse? Can we talk Big Data without also discussing distributed computing? Have we considered that Big Data summaries are presented with the same charts and graphs as little data, making the phrase “big data visualization” a little murky?

A nod to log files. Did you know that there’s a lot of data in our modern world? Your phone creates data all the time. Also, your iPad. And all the sites you use. You’re a walking machine, producing an endless string of line items, and whoever parses them all will probably control your soul.

Death proclamations, because the term “Big Data” is so 2010. These, I’m sorta sympathetic to, but there’s (probably) more to that discussion than semantics.

Definitions that strive for obviated wins. We all know Big Data is big data. Most of us have probably also heard Gartner’s long-standing definition for what specifically (read: only slightly less generally) makes big data Big Data: the three V’s of volume, velocity, and variety. These are pretty agreeable, mostly because they straightforwardly describe data, and not anything extracted from the data. Dig deep enough, and you’ll encounter many candidates for a fourth V: value, veracity, or other Wanna-V’s. The fourth V betrays the biggest problem with the Big Data obsession: the struggle to describe Big Data as something intrinsically useful.

What you won’t find much of, however, is:

Computer Science. The harbinger of the Big Data “revolution” was likely a paper about Google-proprietary methods for processing large amounts of data on large, flexibly allocated clusters of relatively cheap computers. The eponymous MapReduce paradigm was subsequently implemented in Java and released to the open-source community as Apache Hadoop. There were other papers, and other companies releasing other projects in paper, code, and proprietary API form (e.g. Facebook, Yahoo, Amazon, the NSA), but most roads lead back to Hadoop.
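The paradigm that paper describes is compact enough to sketch in a few lines. This is a toy, single-process version of the canonical word-count example — the function names are illustrative, not from Hadoop or any real framework — showing the map, shuffle, and reduce phases:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework would
    between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts emitted for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data is big", "data about data"]
counts = reduce_phase(shuffle(map_phase(docs)))
```

The interesting engineering is everything this sketch omits: the map and reduce workers run on different machines, and the shuffle step moves data across the network between them.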

Pull at this thread a little more, and there’s an endless string of good questions that, were there a push for consensus, could really help pinpoint what is and isn’t exciting about Big Data on a theoretical level. Does MapReduce necessarily imply Big Data? Vice versa? What are exceptions to this equivalence? What do competing flavors of BigTable-inspired NoSQL database provide that a sharded MongoDB instance doesn’t? What do those competing flavors each even bring to the table amongst themselves? Is non-trivial NoSQL implementation intrinsically “Big Data”? Can relational databases ever be “Big Data”? What if they’re being run on really, really awesome hardware? If I pull a data management application off a Hadoop instance running on commodity hardware onto the Oracle 12c install on a decked-out Oracle appliance without losing any functionality, have I proven the former to not be “Big Data” after all?*

Is the critical component of Big Data in the data ingestion? In the storage? In the indexing? In the speed or complexity of the processing? In the visualization of the results? All of the above? Some of the above?

I could go on. You could argue that none of these things are relevant: the mechanics, regardless of whether we discuss them in terms of code or computational theory, are less important than the data-related functionality they facilitate. Maybe so, but then that functionality is in the…

Statistics. Perhaps the most dubious claim implicit in the Big Data rhetoric is that harvesting oversized datasets is guaranteed to deliver useful results. Minus any qualitative narrowing, straight correlative operations may find just about anything. Big Data apologists would do well to remember that correlation does not imply causation, and statistical frequentism is not hot among contemporary statisticians (the jokes are vicious).
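The “correlate enough things and something will turn up” problem is easy to demonstrate: generate one random series, correlate it against hundreds of other random series, and the best match will look impressive despite being pure noise. A minimal sketch (the parameters here are arbitrary, chosen only for illustration):

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient, computed directly."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

random.seed(42)
n_points, n_series = 20, 500

# One "target" series and 500 candidate series -- all independent noise.
target = [random.gauss(0, 1) for _ in range(n_points)]
noise = [[random.gauss(0, 1) for _ in range(n_points)]
         for _ in range(n_series)]

# The strongest correlation found among 500 unrelated series is
# typically well above |r| = 0.5 at this sample size: an
# impressive-looking, and entirely meaningless, headline number.
best = max(abs(pearson(target, series)) for series in noise)
```

This is the multiple-comparisons problem in miniature: the more hypotheses you test against the same data, the more certain you are to find a spurious winner.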

There’s also reductionism inherent in a Big Data solution. In order to whittle down an enormous dataset into something workable, there’s likely a categorization process, and a good deal of error invited along with it. A one-size-fits-all approach towards Big Data engineering (i.e. Big Data Solutionism) carries serious risks, and they’re rarely mentioned — relegated to an afterthought at best.
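That reduction is mechanical: to make a billion rows workable, you bin values into coarse categories, and every bin assignment discards precision. A toy sketch (the bucket scheme is hypothetical):

```python
def categorize(value, bucket_size=10):
    """Collapse a value to the midpoint of its bucket."""
    return (value // bucket_size) * bucket_size + bucket_size / 2

readings = [3, 17, 22, 48, 51, 99]
binned = [categorize(r) for r in readings]

# Each reading is now wrong by up to half a bucket; any aggregate
# statistic computed downstream quietly inherits that error.
errors = [abs(r - b) for r, b in zip(readings, binned)]
```

Whether half a bucket of error matters depends entirely on the question being asked — which is exactly the kind of context a one-size-fits-all pipeline ignores.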

The Government

A lot of public money has been invested in Big Data initiatives. Much more stands poised for investment in the near- and long-term. Many of these initiatives have a lot of potential to impact the public good, but, without a healthy dose of skepticism and a bit more focus, they’re also ripe for predatory consulting and waste.

The Government’s use of Big Data is also the subject of a lot of apprehension and uncertainty**. A lack of adequate transparency is a major driver of this, but how can there be transparency when the nuts-and-bolts understanding of the technology (distinct from the operational methods) is confined to individuals who are few and far between? Transparency and comprehension are closely related, and a culture of vague notions and obfuscated mechanics hinders both.

The Shameless Plug

Intrical can help you cut through the rhetoric, whatever your (initial) understanding of the topic and the technology. Skilled in the software and math generally associated with the Big Data movement, we also maintain a strong understanding of policy and research domains looking to data for answers. We develop solutions that are effective, cost-efficient, and contextually appropriate. Above all, we seek to be a critical, thorough, and incorruptible force in the government contracting arena. Learn more about us at

*If it were crucial to my point to understand what I’m talking about here, I would slow down and/or elaborate. But it probably isn’t, so I won’t.
**This is a dense topic with many nuanced policy considerations that I’m not getting into right now; maybe I’ll tackle it in a future blog post.



VJ Kapur

computer scientist, musician; Principal Engineer at Intrical; drummer/composer in Strange Victories