Some Rampant Speculation on Big Data

aka “flaming pile of garbage in, flaming pile of garbage out”.

That pile of garbage statement is a coding adage I like to use whenever somebody doesn’t want to talk about what data we should be gathering and how to determine whether the quality is actually there. For some reason people have decided that all data is worth gathering and analyzing, even though this is quite simply a dumb idea.

This may be a personal bias fermented in university physics labs shielded from the waves that provide cell service, wifi, and the proselytizing of companies selling their enormous vats of data swept up from the companies quietly violating any reasonable expectation of privacy we once had.

But I’ll let you decide on the ultimate ratio of bias v. truth.

Let me explain a bit more.

I spent years of my life performing, then teaching, the idealized experiments that our physics forefathers used to discover the fundamental truths that modern physics builds on. Things like the mass of an electron, the lifetime of a muon, or simply measuring gravity on Earth. These experiments had complete guides, and we relied on a pretty standard format to report methodology and results.

The thing is, though, these simple experiments took at least several tries and often weeks to do right. I want to say that again because it is the basis for this rampant speculation: we were performing an experiment with instructions, and we knew the proper answer, but we still spent weeks getting the correct data.

You see, that is the thing about experiments and the data they provide. Even the simplest of them is difficult to perform well, and every time you perform one you introduce a whole set of circumstantial variables that lead to error.

At one point I was even ready to spin a theory that gravity in the NYU physics building was being affected by secret experiments in the condensed matter physics labs (they do literally play with optical tweezers/tractor beams up there). That lasted until, somewhere between the 10th and 15th try, I figured out the problem was my elbow on the desk messing up the leveling of the instruments being used for measurement.
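For the code-minded, here is a minimal sketch of why the elbow matters more than the number of tries. Everything in it is made up for illustration: the noise level, the size of the bias, and the function names are assumptions, not numbers from my actual lab notebooks. It simulates a small careful run and a huge run with a constant offset standing in for the elbow.

```python
import random
import statistics

TRUE_G = 9.81  # m/s^2, the textbook answer we are trying to recover


def measure_g(n_trials, noise_sd, bias, seed=0):
    """Simulate n_trials measurements of g.

    noise_sd is random scatter that averages out over many trials.
    bias is a constant offset (the hypothetical elbow on the desk)
    that shifts every single measurement the same way.
    """
    rng = random.Random(seed)
    return [TRUE_G + bias + rng.gauss(0.0, noise_sd) for _ in range(n_trials)]


def mean_error(samples):
    """Absolute distance between the sample mean and the true value."""
    return abs(statistics.fmean(samples) - TRUE_G)


# Ten careful measurements, nobody leaning on the table.
clean_small = measure_g(10, noise_sd=0.05, bias=0.0)

# One hundred thousand measurements, all taken with the elbow down.
biased_big = measure_g(100_000, noise_sd=0.05, bias=0.3)

print(f"10 clean trials,       error in mean: {mean_error(clean_small):.4f}")
print(f"100,000 biased trials, error in mean: {mean_error(biased_big):.4f}")
```

The big run settles very precisely on the wrong answer. More data shrinks the random scatter, but it does nothing about an error baked into every measurement.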

This is why it cost somewhere around $13.25 billion, according to Alex Knapp at Forbes, to find the Higgs boson, the particle tied to the field that gives elementary particles their mass. Something as fundamental as proving that the source of that mass does in fact exist took decades, thousands of scientists, and billions of dollars.

That was the preamble. Now I will get to the point.

We have these companies focused exclusively on new ways to gather, combine, and analyze data that can be swept up from your use of virtually any consumer service. But there isn’t very much talk among investors about the quality of that data, or the danger that comes from combining data from multiple sources.

The thing we seem to be forgetting is that none of the analysis and insights will matter if you don’t make sure that nobody has their elbow on the table. Otherwise you will find yourself spinning outlandish theories in order to make your data match the few truths you do know about the world.

The reality is that most of these companies are pulling data from a bunch of very mediocre experiments. They are investing in a flaming pile of garbage and selling it as gold. Sadly, they and their investors will get out of their efforts exactly what they put in: flaming piles of garbage.

A quick and important caveat.

This isn’t to say that flaming piles of garbage can’t be used for anything. Companies like Google pretty much sort and recycle what they can, then burn the rest for electricity. But this doesn’t apply to most new kids on the big data block: for one, they aren’t Google, and for two, they aren’t willing to recognize that what they have to work with is a flaming pile of garbage.

Big data isn’t here yet, sorry folks.

That is what all of this means, and it isn’t a bad thing unless you are one of the many unlucky souls standing over a pile of flaming garbage, looking your investors in the eye and saying “see, it smells like roses”.

The big data revolution won’t come from the side of the brokers, at least not yet. It will come from companies that perform better experiments and increase the quality and rigor of their data.

Your phone may store the locations you frequent most often, and with some inference you can usually figure out, reasonably accurately, where somebody works, eats, and sleeps. But the true value is in understanding why and how someone chooses these things. Those answers come from actual products. All of the companies out there selling the idea that facts about people can be imbued with meaning through analysis alone are trying to create gold out of flaming garbage.

I personally will let those data sweepers clean up as much of the flaming garbage as they can. Let’s get it out of here. Then let’s start working where the data revolution actually is: in the products that are working to better understand their customers. Those are the places where users indicate their preferences and, in the process, give us tiny glimpses of the truly valuable information we need to make big data a reality.