Once upon a time, in a land of rainbow-colored sponge cakes with bubbles, Big-IT companies realized they had a problem: their customers were becoming increasingly reluctant to keep pumping money into the IT stuffs they were peddling. Simply put, FORTUNE500 companies were no longer responding to words such as “infrastructure” or “integration”. For one thing, they had already spent way too much money on both; for another, they had probably started to realize that both ideas served the interests of the IT-companies rather than their own. So the IT-companies did what any respectable purveyor of stuffs does: they sent an urgent message to their friends at the Big-Consultancy.
Being master brewers, the Big-Consultancy came up with the perfect mix of emotion and logic, something so irresistible that it would light the world on fire. Not only would it make all the mindless past spending on IT stuffs seem meaningful, but it would also justify even wilder spending going forward. In fact, it would make perpetually growing spend seem not just like a good idea, but a necessity.
Indeed, if such a thing of marvel as a perpetual kool-aid brewing machine was possible, the Big-Consultancy was sure this must be it. They decided to call it “Big Data.” Big-IT was over the moon for this new marvelous contraption.
What the (Big Data Kool-aid) Package Labeling Forgot to Mention
The basic idea of Big Data has its roots in a world where data was scarce. In that light, the proposition of “big” data seems to make sense: if you have a lot of something that is scarce, you will have some kind of competitive advantage as a result.
The other idea at the heart of “Big Data” is the just-in-case mentality, an idea that more developed industries such as car manufacturing, clothing, and paper had already proven to be detrimental to business. In the case of car manufacturing, the moment car sales started to slow down, Toyota invented the just-in-time manufacturing principle. Today most goods are produced just-in-time as opposed to just-in-case.
When it comes to data, the difference between ‘just-in-case’ and ‘just-enough’ (the equivalent of just-in-time for physical goods) is staggering:
- Storing data for one day instead of one second = 86,400 times as long
- Storing data for one year instead of one second = 31,536,000 times as long
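The multipliers above are plain unit arithmetic. Here is a minimal Python sketch of it, assuming a 365-day year and data arriving at a constant rate, so that the volume kept scales with the retention window. The helper name `retention_multiplier` is made up for illustration.

```python
# Assumption: a 365-day year and a constant inflow of data, so that the
# amount retained is proportional to how long each record is kept.
SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY


def retention_multiplier(retention_seconds, baseline_seconds=1):
    """Return how many times longer data is kept than under the baseline."""
    return retention_seconds / baseline_seconds


print(f"day vs. second:  {retention_multiplier(SECONDS_PER_DAY):,.0f}")   # 86,400
print(f"year vs. second: {retention_multiplier(SECONDS_PER_YEAR):,.0f}")  # 31,536,000
```

The same helper works for any pair of windows, for example a year against a day, by passing `baseline_seconds=SECONDS_PER_DAY`.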
It’s quite typical for analytics platforms to store data for a year or more, mostly so that it is available just-in-case for displaying vanity metrics or for some other mundane purpose.
Data as Cost
Actually, data in itself is nothing but cost. Unless you do something to extract value out of it and make it worth something, data is nothing but an expense. In order to prove this, we can perform two simple experiments:
a) as a non-profit entity, spin up an AWS instance, put some data on it, and let it sit without doing anything to it
b) set up a company, spin up an AWS instance, put some data on it, and let it sit without doing anything to it
The first thing you will find is that in both scenarios you will accumulate cost without generating any value.
The second thing you will find is that if you try to sell what you have set up to someone, as is, nobody will buy it.
The third thing you will find is that if you try to convince the IRS that what you have in your setup is an asset, they will not see it like that.
The second and third points show how having data is starkly different from having oil or minerals in the ground.
The third point is slightly different for an individual, who is not required to adhere to what is referred to as the double-entry bookkeeping method. For organizations, we first have to consider that some things count as negatives and others as positives. Negatives come in two kinds: operational expenses and capital expenses. Capital expenses are generally preferred because they are not “just” an expense, but investments that create assets on the other side of the books (in the positives). The setup in our experiment will be considered an operational expense. It is money going out, without leaving any value behind. At least that is how it plays out in accounting.
The most pertinent of all the catchphrases that came with this new kool-aid, one we’ve all heard to death, is that data is the new oil. Well, it’s not.
Let’s go back to our experiment, and now imagine a scenario where you have a piece of land and can somehow prove there is oil deposited there. The land you are buying is a capital investment, and it goes on the CAPEX side of your books. That means it is treated as an asset on the positive side of your books. As long as there is as much as a whiff of oil, you’ll have offers to buy the land off you coming in as fast as you can turn them down. Even the IRS will not debate this.
So yeah. No, it’s not.
There is a more recent version, which says that data is the new oil, but it needs to be refined. Well, then it’s clearly not oil in terms of value, because oil does not need to be refined to have value. It just has more value when it’s already refined. Data, on the other hand, is just cost when it’s not refined, and cannot be considered an asset.
So yeah. No again. Also note the slide, how we have a leading PR firm quoting a notable hype firm. Back in the snake oil times of the late 19th century, before Edward Bernays introduced propaganda to the US, the equivalent of this would have been a snake oil seller quoting another snake oil seller. In retrospect, that does not make too much sense. Perhaps, in the not so distant future, we will look back on this era and its communication (regarding Big Data) with similar amusement.
Don’t worry, we’re moving on; we’re not doing one more “data is the new oil”. One could only wish the big data peddlers would have more fun with their memes.
An Age of Automated Decision Making
Some people like to freak out about machine intelligence. But history actually shows that we should not be afraid of automation. Take building the pyramids, for example: a lot of people died doing it, and there had to be a lot of slaves for the whole project to make sense in the first place. In Dubai they built a building almost one kilometer high, and not that many people died building it. Some of the most iconic constructions of today are built without anybody dying. Also, instead of slaves, migrant workers from very poor countries earn a living that feeds multiple generations of their families back home.
Over the past 500 years or so, automation has developed in an easy to understand way:
- the steam engine automated power generation
- production lines automated mechanical labor
- the computer automated data processing
- communications technologies automated access to information
Next, machine intelligence will automate access to decisions. This is potentially great news as, frankly speaking, we humans are not wired for making decisions. As a result, as was the case with each of the previous major developments in the automation “tech tree”, human capital will be more available for making contributions that only humans can make. Contributions we can’t even imagine yet. For example, we might finally be able to put our attention, and our incredible ability for pattern detection and pattern making, into solving the mother of all problems: problem solving itself.
I have written about this in Solving the Problem of Problem Solving.
In that sense, nothing will change: human ingenuity will be as valuable as before, no, more valuable. In this regard, it is essential that J. C. R. Licklider’s seminal work on Man-Computer Symbiosis gets more attention, and that researchers consider his vision in the modern-day context. What we today call a “data scientist” is the kind of individual who has the potential for creating a symbiotic relationship with the computer. Amidst all the hype surrounding the topic, and to avoid doubt about what “data science” actually is, the I-COM Data Science Board, which I have chaired since its inception, came up with the following description for data science:
Data Science involves the theoretical and practical approach to extracting value, which is knowledge and insights, from data.
The practice of data science can be broken down into four components: the data, the algorithms that are used for processing it, the systems where the data is stored, and the people who operate those systems. Put another way, the breakdown is into two: data, algorithms, and storage/management are the computer, and people are the man. The key point is the symbiosis of the two.
Actually, the idea of man-computer symbiosis was showcased for the first time in a significant way almost 80 years ago, by none other than Alan Turing. In fact, he was the first true data scientist, in that he developed a theory based on patterns, and then developed an instrument around his theory using computer technology. Working symbiotically with his device, he could do what he did in cracking the Nazi code, in some part helping the UK win its war against Hitler.
The essential point here is to understand the significance of humans, and to understand that the more technology evolves, the more that significance will be highlighted. The less human capital is bogged down in grunt work, the more it can thrive. In this light, it is very important to understand that any piece of technology that is widely available with a low barrier to entry is a commodity. Actually, the most advanced deep learning platforms of today (Keras, TensorFlow and PyTorch) will seem as impressive to people in the not so distant future as a piece of rock seems to us today.
Big Cake is Nice Only if You Like Cake
Because unadulterated data is nothing but cost, it seems accurate to say that “data” is actually a problem. Which means “big data” is a “big problem”. The fact that there is so much data is not an opportunity; it is an obstacle.
To deal with that obstacle, we have commodity tools in the form of algorithms, and tools that make managing those algorithms and data more straightforward. To use a comparison with building, if data are the nails, then algos are the hammer, and the various management tools are the toolbox. Even if there is a robot of some sort that hits the nails, there is a human who designs the structure. If there is a deep learning system that designs the structure, there is a human who designs the deep learning system. If there is some sort of an advanced machine intelligence that designs the deep learning system, then there is a human who designs that… ad infinitum. This seems to be the essential point of automation: just as we humans came from the earth and can’t escape that, the machine came from us humans. This way, the true value never shifts away from human ingenuity; it just gets channelled in ever subtler ways. This is great news for us humans. Further, it’s not just a question of humans designing these systems, but also of operating them. Somewhere, high enough in the abstraction construct, there is always a human pulling some sort of a lever. Again, it is just a question of subtlety. We’ve already come a long way from the grossness of the lever pulling back in the pyramid building time.
Fundamentally, “science” refers to the human ability for pattern detection, for structuring ideas based on it, and for articulating them to others. Once you add “science” to data, it stops being an obstacle and becomes an opportunity. Broadly speaking, Data Science is the opportunity associated with data.
It has to be better understood that it is not so much data we should invest in as data science. Meaning that we should invest in people more than anything. Most importantly, we must invest in completely changing the early education system, particularly our math education. As Paul Lockhart put it so eloquently:
“If I had to design a mechanism for the express purpose of destroying a child’s natural curiosity and love of pattern-making, I simply wouldn’t have the imagination to come up with the kind of senseless, soul-crushing ideas that constitute contemporary mathematics education.”
Proof is in the Pudding They Say
Let’s close off by bringing everything down to a very simple line of reasoning.
When you are making an investment in data, and none in data science, you have no value at all. Even if you weren’t making any investment in data, you could still invest in science, create theories, and find value.
Similarly we can prove data as a problem or cost. When you have no data, you have no cost. When you have data but no data science (to extract value), you have just cost.
If we all agree that technology and algos are (or will soon be) commodities, this clearly shows that data science is the value, because with it in the equation we can negate data’s problematic, cost-generating nature and turn data into value.
In this light, it seems fair to suggest that the data scientist is to the decision age what the engineer was to the industrial age, or the programmer to the information age.
You might also like my commentary on data visualization:
Five Essential Points on Data Visualization
The goal of data visualization is to act as a catalyst for some sort of behavior change, yet mostly practitioners focus…
…or Decoding Intelligence an interview I did recently covering the topic of creativity and machine intelligence.
Finally, a 5-minute video on Secret History of Data Science, my final public talk covering the topic of data and technology.