Big Data is Not What It Was
In 2001, Doug Laney wrote a paper that represents, and continues to be recognized as, the conceptual underpinnings of “big data.” It should: Laney articulated the contours of our modern experience with data in ways that have fared remarkably well against the test of time. Some might tinker with his definition, but few question it.
The funny thing is that during the intervening 14-plus years, our language (and, in some cases, our thinking) has not evolved at the same pace as the reality of the data we use, the tools we use to help make sense of it, or the interfaces that we use to understand and explore it.
“Big data” isn’t really “big” anymore: it’s just data.
Increasingly, the data that we think about and use comes in from social applications, financial transactions, and mobile devices and, just as increasingly, everything seems to be (in some way, shape, or form) a social application, a financial transaction, or a mobile device. The data we generate often says something about our location; movement; proximity to others; relationships; personal beliefs, preferences, and tastes; patterns of life; and a myriad of other facets that emanate from (and are reflective of) our modern, connected, Web-based lives.
There are legal and ethical issues surrounding this data, sure, but there is no denying that the data exists and that we play a role in generating and using it: this data is absolutely massive in terms of its size and mind-boggling in terms of its diversity.
As analysts, though, we tend not to give a second thought to how many tables, columns, or rows we are dealing with: we are focused on the questions we are trying to answer or on the problems we are trying to gain insight into.
So if the scale of the data is essentially a moot point, what matters?
I think that the first thing that matters is the (analytic) user and the technologies they use to do their jobs.
I posit (and am happy to get shouted down for writing) that the most widely used analytic tool in the world is the spreadsheet. Admittedly, I am not sure how frequently anyone uses a spreadsheet as a spreadsheet anymore: they’re databases, models, planners, forms…
The spreadsheet has evolved into the Swiss Army knife of productivity tools.
When talking with people, though, I tend to use the term “analytic fabric.” By this I mean an interface that allows the user to see all the data in one place and lets them manipulate, envision, and explore that data in ways that the good, smart people who design purpose-built applications might not have been able to imagine.
Black-box solutions are arguably worse.
As an analyst, I need to have confidence in the data I am using: my job depends on it. Transparency is the path to confidence.
So, back to the analytic fabric that is the humble spreadsheet: its intuitive interface and inherent flexibility all but eliminate the barrier to entry for even the most basic users when it comes to playing with and experiencing the data for themselves.
Now contrast that with many of the technologies and tools used by data scientists.
Ideally, everyone in every analytic shop would be a data scientist armed with the statistical and programming skills needed to determine which tool is most likely to solve an analytic problem and how best to work their way through that problem. It is this combination of analytic and imaginative thinking — and the developer skills to support it — that makes their work so incredibly powerful and valuable.
Unfortunately, data scientists, as they are currently conceived, trained, and branded, are a scarce resource…and are likely to remain so for some time to come. I might try to work my way through the Khan Academy and Codecademy and try to become more like a data scientist…but the learning curve is justifiably steep (even as those sites and others like them do a laudable job of helping aspirants ascend those learning curves).
One can also talk about machine learning — and deep learning is absolutely fascinating — but it is still, to most organizations, years off from being a routine part of their analytic processes.
Even saying something as simple as “All the analyst has to do is write a couple of lines of code” dramatically increases the cost of curiosity and creates a formidable, if not impenetrable, barrier to entry for many analysts.
So, if we can stop thinking about “big data” and just think of it as “data,” I suggest that we stop thinking of information technologies as such and start thinking of them as cognitive aids: cognitive aids that need to support the full range of analytic users, be they quants or quals.
For better or worse, the lingua franca that is understood by both is the analytic fabric commonly known as a spreadsheet.
System performance certainly matters…as do storage capacity and computational power. Given how inexpensive storage and compute power are, however, I tend to assume those things away…unless a system is slow; budgets force data into some odd bureaucratic trade space (“Do you really need all that data? Can we get rid of some of it?”); or it takes so long to arrive at an “answer” that the analyst either has forgotten what sparked the original question or has been deterred from asking a follow-up. If those problems exist, a more fundamental conversation needs to take place.
The user experience, however, cannot be assumed away: regardless of their technical sophistication, analysts have to discern the interesting and the important (and the potentially or conditionally important) in large and diverse data sets. What happens upstream and downstream of the analytic user is important, but if we think of those stacks of technology as contributing to a larger, seamlessly integrated cognitive aid, then the analytic user becomes the driving force behind all architectural, design, acquisition, and implementation decisions…and most analytic users are not technologists or data scientists.
Dennis J. Gleeson, Jr., is a former Director of Strategy in the Central Intelligence Agency’s Directorate of Analysis. The views and opinions expressed in this piece are his alone and should not be construed as being those of either the Central Intelligence Agency or of the US Government.