The Big Data Revolution

Big promise, big problems

When scientists develop hypotheses, they often begin with an assumption or a hunch — usually one that grows out of personal experience — before devising experiments that might prove or disprove their points. But is that the best way to get at a truth or solve a problem?

In their new book, Big Data: A Revolution That Will Transform How We Live, Work, and Think, Viktor Mayer-Schonberger and Kenneth Cukier point to a better method: examining a vast pool of data points. The authors write about how innovators have begun analyzing huge amounts of information in order to discover patterns that will help answer important questions like how a disease spreads through a continent; how to predict the safest routes for travel in war-torn regions; or how the ebb and flow of traffic in different areas can predict the demise of local economies. But they also point to the dark side of big data: Now that it is so cheap and easy to track every run-of-the-mill Google search, every phone call, governments will be all the more tempted to become Big Brothers, recording citizens’ every move. Until recently, the costs of collecting the kind of information that prompted former NSA contractor Edward Snowden’s leak, as well as the physical requirements for storing it, were so extreme as to make it nearly unthinkable. No longer.

Big data, like other tools of progress, is poised to change every facet of our lives: “The computer didn’t just make calculations easier; the printing press didn’t just make reproducing bibles easier,” Mayer-Schonberger and Cukier told me in an email. “Big data will similarly affect everything: creating information, automating processes, and expanding the scope of literacy and knowledge in society.” They added, “More subtly, it will affect how people think about the world and their place in it.”

During a recent exchange, the authors and I discussed how these changes will take root, and why.

It requires a certain degree of humility to appreciate that all that is within one’s field of observation and experience isn’t all there is.

Q: Big data scientists seem able to abstain from making hypotheses before they’ve collected their data; only after all the information is in do they begin to search for patterns. How can the rest of us follow their lead—clear our heads, so to speak, so that we are not swayed so much by anecdote when trying to solve a problem?

A: Anecdotes and experiences can help illustrate a point, even though they rarely can prove it. It requires a certain degree of humility to appreciate that all that is within one’s field of observation and experience isn’t all that there is. The day will come fairly soon when the default view will be to learn from data and to temper our individual observation with what we can see from aggregating lots of information. Yes, there may be a small minority who resist this — just as there are people who believe the Earth is flat because, from their individual observation, that’s what it looks like. But society advances.

Q: So will the day come when relying on the observations of individuals will become obsolete? When we’ll rule out anecdote or hunches, feeling they muddy the waters too much?

A: If we see the world only as data, then we run the risk of fetishizing the data, of imbuing it with reason and meaning that it does not have. We need to be vigilant that we are not beguiled by data or lured by the false charms of quantifying every problem.

In the book, we tell the tragic story of Robert McNamara, America’s defense secretary during the Vietnam War, and his use of the “body count” [the number of Viet Cong fighters who were being killed] to understand the progress of the war, when the situation was obviously far, far more complex.

Just as it would be foolish to ignore data, it would be foolish to suspend one’s common sense and place blind trust in a number simply for a number’s sake.

At a point where everyone is learning from data, what will characterize originality and innovation [is] going beyond what the data says — a spark of insight and a taste for risk-taking.

Big data is helping doctors spot infections before symptoms appear.

Q: What was one of the most striking examples you found of how big data is currently being put to use?

A: Health researchers in Canada are doing something extraordinary. Working with premature babies, they’re collecting and processing real-time streams of vital signs like heartbeat, respiration and blood-oxygen level — more than 1,000 data points a second. By analyzing this data, they are developing a way to spot infections 24 hours before full-blown symptoms appear. This lets them intervene sooner and more effectively.

Fascinatingly, one of their early findings is this: a predictor that a baby may be developing an infection is not that the vitals go haywire, but that they stabilize. It’s not what medical professionals in the past might have thought. But big data tells us this.
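The counterintuitive signal the researchers describe — vitals that become *too* steady — can be illustrated with a toy rolling-variability check. This is a hypothetical sketch, not the Canadian team’s actual method: the function name, window size, threshold, and sample readings are all invented for illustration.

```python
# Toy illustration: flag moments when a vital sign's short-term
# variability collapses (i.e., the readings "stabilize"), which the
# interview describes as a possible early predictor of infection.
# Window size and threshold here are invented, not clinical values.
from statistics import stdev

def flag_stabilization(heart_rates, window=10, threshold=1.5):
    """Return indices where the rolling standard deviation of the
    last `window` readings drops below `threshold`."""
    flags = []
    for i in range(window, len(heart_rates) + 1):
        if stdev(heart_rates[i - window:i]) < threshold:
            flags.append(i - 1)
    return flags

# Normal, noisy variability followed by an abruptly "too steady" stretch:
readings = [140, 152, 138, 149, 143, 155, 141, 150, 146, 153,
            144, 144, 145, 144, 145, 144, 145, 144, 145, 144]
print(flag_stabilization(readings))  # flags the end of the flat stretch: [19]
```

Real systems would of course work over continuous multi-signal streams at far higher rates, but the underlying idea — watch for a loss of variability rather than a spike — is the same.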

In the past, information like this was thrown away because the cost of gathering it was so expensive — but that’s no longer the case.

Q: Was big data used in the search for the Boston bombers? If so, how?

A: Yes and no. Arguably for the first time, the public sent a huge amount of data to law enforcement to use, especially digital images and video. The volume of this data was unprecedented, and it would have given the police a very rich data source to utilize. In the end it wasn’t necessary to use this data, however, because the bombers gave themselves away by stealing a car and shooting a police officer.

Q: In your book, you talk a bit about 33-year-old Luis von Ahn, who was just 22 when he designed CAPTCHA — the system that generates those prove-you’re-human typing tests that involve distorted letters. He went on to create reCAPTCHA, a similar system that has real-world usefulness: It helps with the process of digitizing old books. The words that Internet users are asked to type are ones that optical character recognition programs could not identify; once they’re entered, reCAPTCHA sends them on to digitization teams like the one that’s archiving the New York Times. Thanks to reCAPTCHA, more than 40 million words per day are transcribed. What spurred Luis to improve on his creation — a system that was both ingenious and already accomplishing exactly what it was supposed to accomplish?

A: He’s a really funny, charming and understated fellow. Around 2006, he did a back-of-the-envelope calculation and realized that there were around 200 million CAPTCHAs being done each day. At ten seconds each, that amounted to more than 500,000 hours a day of “lost” productivity. Most people wouldn’t have seen it that way, but he did, since computer science engineers are obsessed not only with increasing productivity whenever possible, but with doing so in small increments. They think in “MIPS,” or “millions of instructions per second” [a unit of measurement that describes how fast a computer can process information].
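Worked out, von Ahn’s back-of-the-envelope arithmetic looks like this:

```python
# Back-of-the-envelope: daily human time spent solving CAPTCHAs,
# using the rough figures from the interview.
captchas_per_day = 200_000_000  # roughly 200 million CAPTCHAs per day
seconds_each = 10               # about ten seconds to solve one

total_hours = captchas_per_day * seconds_each / 3600
print(round(total_hours))  # ~555,556 hours of "lost" typing per day
```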

Luis was aghast. Ever the engineer, he set his mind to thinking about how that time could be better spent typing in things that mattered. He quickly came up with the idea of having people identify the letters that computer systems couldn’t read clearly when digitizing texts. It was his great luck that Google needed this service for its book-scanning project. He’s always looking for these sorts of “two-fers” — ways that he can take something generated for one thing and reuse it for something else, to achieve a “twin-win,” so to speak.

Q: What does Snowden’s leak tell us about big data?

A: Snowden’s actions reflect the uneasiness of young digerati about the power of big data, especially in the hands of government. The public debate about data collection and big-data use by governments — not just the US government, but all governments — that Snowden’s actions have spurred is needed so citizens can decide what governments can do, what safeguards we need, and what dark sides of big data we must avert.