On Our Current Obsession with Data-Driven Everything
We are entering the age of big data. With the proliferation of technology into all aspects of our lives, it has become much easier to collect information on every single interaction. Large amounts of data are analyzed to discover patterns that will guide our future decisions and the decisions of future machines.
Big data, machine learning, and AI are all big buzzwords now. Data will have a profound impact on the next wave of innovation and progress. It will make many aspects of our lives easier and more efficient.
But not all that glitters is gold. Not everything that is data-driven is correct, and not all that is not data-driven can be dismissed. Big data is still better than no data, but it is not a panacea.
There are many pitfalls that we should consider when approaching big data.
Fundamentally, big data decision making amounts to collecting information on the previous behavior of a system and hoping that the system will behave similarly in the future. It assumes that the available data set is an accurate and complete representation of the system at the time of decision making.
1. This assumption only holds if the system changes more slowly than the rate at which the data is collected.
2. The assumption also does not hold if the system goes through abrupt and drastic changes.
3. Another inherent assumption is that we are able to capture all dimensions of the system. Sometimes there are hidden dimensions that cannot be captured easily by data.
4. And lastly, it assumes that data collection is accurate and remains representative of the real system, which is not easy to achieve.
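The first assumption can be illustrated with a toy sketch (not from the article; the "system" and its daily cycle are made up for illustration): if a system's behavior changes faster than we sample it, the collected data set can look deceptively static.

```python
import math

def system_state(t_hours):
    # Hypothetical system whose behavior cycles once per day (period = 24 h).
    return math.sin(2 * math.pi * t_hours / 24)

# Sampling once every 24 hours: slower than the system's own dynamics,
# so every sample lands at the same phase and the data set looks flat.
daily_samples = [system_state(t) for t in range(0, 240, 24)]

# Sampling every 3 hours actually captures the oscillation.
dense_samples = [system_state(t) for t in range(0, 240, 3)]

spread_daily = max(daily_samples) - min(daily_samples)  # ~0: misleading
spread_dense = max(dense_samples) - min(dense_samples)  # ~2: real dynamics
```

The slowly sampled data set would support the (wrong) conclusion that the system never changes, which is exactly the failure mode assumption 1 warns about.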
Big data is used to make decisions about complex, real-world systems whose structures are too difficult to understand. Since we don’t fully understand the systems, there can be many scenarios in the real world where the above assumptions do not apply.
Since the world itself changes, most real-world systems are dynamic and change their behavior with time. Depending on how many disruptive phenomena are occurring simultaneously, the pace at which the world changes also varies with time, and that pace impacts everything around it.
Let’s look at a few examples where the data-driven approach was not very successful.
1. Inaccurate data
The most recent presidential election was a “big data failure”. Many people were surprised by the outcome because almost every pre-election poll was wrong. In this case the data available was not representative of the real system. No matter how much data one has, if the process of data collection is wrong, the data will not help.
Having data sometimes provides a false sense of security because few of us stop to consider that the data could be wrong. From my personal experience as a science researcher, collecting accurate data is pretty hard.
2. Drastically changing system
The past financial crisis and the collapse of the housing market can conceptually be viewed as “big data failures”. Large financial institutions made their decisions based on the previous behavior of the housing market. They assumed that the housing market would not change its dominant behavioral dynamics (since it never had before). Therefore they misjudged some of the risks in their models, since the models were based on past data. I am sure many people said that everything would be fine because that’s what their data showed.
This is an example where a system went through gradual changes until it reached a tipping point where the dynamics changed drastically. In these cases, unless we have data collection frequent enough to catch these drastic changes in a timely manner, big data can lead to a false sense of security and wrong conclusions.
3. Capturing psychological value with data
Google, arguably one of the most data driven companies in the world, makes most of its decisions using data. Yet we can all recall Google Plus, Google Glass or Google Wave. These can all be viewed as data failures. It is difficult to know which aspect of the process failed without detailed information about the type of data that was collected and analyzed.
What was common to all these products was that they all had a psychological value component. Utilitarian human decisions are much easier to predict and replicate, so big data is easier to apply to them. This is one reason why most of Google’s successful internal innovations are mainly utilitarian in nature.
But when it comes to psychological value, the dynamics are less regular and also change with time. Sometimes there are significant behavioral changes that occur when society is introduced to new ideas. When it comes to human behavior involving psychological value, large changes occur as our societal value system changes with time.
Apple’s CEO Steve Jobs, who worked on products with large psychological value components, was not a fan of data. He made many decisions intuitively.
I have also heard that Snapchat’s CEO Evan Spiegel, who also works on products that provide significant psychological value, does not give that much weight to data.
4. Hidden data dimensions
When Leicester City won the Premier League in 2016, it was a 5000-to-1 underdog. These odds show that people were betting using previous data patterns to determine the likelihood of the team winning. But somehow the previous behavior of the Premier League could not account for such an outcome.
This could be because they were only able to collect data on tangible factors like budget, past results, or athletic ability, but not on less tangible factors like team spirit, motivation, etc. Sometimes we are unable to collect data on all dimensions of a phenomenon and therefore have an incomplete data set, which then leads to wrong conclusions.
5. Widespread disruption
When people were trying to understand how Facebook would evolve and impact society, they used previous data points from similar companies like Google or MySpace. Yet all these models underestimated the impact Facebook would have. In reality it behaved more drastically because it disrupted many aspects of the social fabric. We call some companies disruptors, and these true disruptors behave like nothing before them.
Data is less useful when it comes to disruption because by nature the previous data is not representative of disruption.
6. Highly dynamic systems
Sometimes people use big data to analyze the financial markets. But using big data to analyze many fast-growing companies is not that easy. For example, companies like Netflix or Amazon change so fast in their essence that at certain points in their history it is not meaningful to compare their past and present behavior. When Netflix changed from a DVD-by-mail provider to a content producer and online distributor, its previous behavior had no significance for its future one. When Amazon changed from an online bookstore to a content producer and web infrastructure provider, it became something completely different. I like to say that the past does not predict the future; it just lets us understand the dynamics of the past system.
7. Drastic innovation
Big data is not a good approach if one is looking to innovate drastically. Companies that want to disrupt and create something drastically different should not use big data for their innovation decisions. By nature, if something is highly innovative and disruptive, it cannot be captured in past data.
Of course, analyzing data is always meaningful, and one can use it to assess patterns of the old system to better understand the system one is disrupting. But the innovation process cannot be automated with data. At least not at our current level of data-driven decision making. Maybe a future AI that is much more intelligent will be able to do this.
There are many more examples where big data or aspects similar to big data would be unsuccessful and would lead to wrong conclusions.
In addition to the challenge of collecting accurate data, another big challenge is determining which data should be included or excluded from the data set and which data points should be bundled. For a system that changes at a fast pace, including data points that were collected in the past may lead to wrong conclusions.
Most systems change their behavior continuously, some faster and some slower. There are also systems which evolve gradually with time and then change their behavior abruptly and drastically; in some sense they tip at a certain point. This happens because some hidden dimension of the system has tipped. If we look at the world as a continuous-dimensional system, then there are many such outcomes.
Since we cannot easily capture every data dimension that describes a system, we may think we are including all dimensions but we are not. We can imagine a system that has a hidden dimension, which has a binary component to it. When we analyze the system’s data we will assume that all that data is comparable but in reality it should be two different sets.
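A minimal sketch of that hidden binary dimension (the numbers and regimes are invented for illustration, not taken from the article): if measurements actually come from two regimes that the collector never recorded, pooled statistics describe neither regime.

```python
# Toy measurements with a hidden binary dimension (e.g. two regimes
# the data collector did not record). All values are hypothetical.
group_a = [10.0, 11.0, 9.5, 10.5]   # regime A
group_b = [2.0, 1.5, 2.5, 2.0]      # regime B

# The analyst sees only one pooled data set and treats it as comparable.
pooled = group_a + group_b
pooled_mean = sum(pooled) / len(pooled)   # 6.125: describes neither regime

# Split by the hidden dimension, the two regimes look nothing alike.
mean_a = sum(group_a) / len(group_a)      # 10.25
mean_b = sum(group_b) / len(group_b)      # 2.0
```

The pooled mean sits far from both regime means, so any decision based on it would be wrong for every individual case in the data set.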
Correct bundling of data is also very important when we process data. When we do data analysis, in essence we take previous unique experiments and view them as a set of comparable, similar experiments. This works well for experiments in controlled, repeatable environments like physics and chemistry labs, or for some web interactions. But most real-world experiments are not comparable or repeatable; they are unique. Every moment in life happens only once.
Every time a real-world interaction happens it occurs under different circumstances and initial conditions. So by using big data in the real world we naturally introduce some error into our analysis and it is important for a data scientist to assess how large that error is in each specific case.
And lastly, if we look at big data processing algorithms without going into too much detail, the most common algorithms nowadays use some form of gradient descent. Therefore they are influenced by the starting point of the optimization process, especially in a continuous-dimensional space.
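That starting-point sensitivity can be shown with a small sketch (the objective function, learning rate, and step count are assumptions chosen for illustration, not a real big data pipeline): on a non-convex objective, plain gradient descent from two different starting points settles into two different local minima.

```python
def f(x):
    # Hypothetical non-convex objective with two local minima,
    # at x = -1 and x = 2.
    return (x + 1) ** 2 * (x - 2) ** 2

def grad(x, h=1e-6):
    # Numerical gradient via central difference.
    return (f(x + h) - f(x - h)) / (2 * h)

def gradient_descent(x0, lr=0.01, steps=5000):
    # Plain gradient descent from starting point x0.
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

# Identical algorithm, identical objective; only the start differs.
left = gradient_descent(-3.0)   # converges near x = -1
right = gradient_descent(3.0)   # converges near x = 2
```

Both runs report a "best" answer, yet they disagree, which is exactly why the starting point of the optimization matters in continuous spaces.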
Why am I voicing these words of caution? In our society, when something new and transformative comes out and turns into a buzzword, we have a tendency to idealize it and view it as flawless. After a while people become complacent, skip even the common sanity checks, and always take it as true. This can lead to drastic outcomes.
We can think of the Internet bubble as a phenomenon where society became overly excited about a new development and people thought that everything related to the Internet was amazing.
What makes big data even more dangerous is that it involves decisions. If we become accustomed to letting computer-processed data completely guide all decisions and turn off our own thinking, the outcome could be much more drastic.
Another undesirable aspect of the obsession with data-driven decision making is that people start to dismiss anything that is not strictly “data-driven”. There are many phenomena in the world that we don’t have the right means to collect data about. But that does not mean there is no value in developing ideas or theories without lots of data. Sometimes ideas can make us think in new directions and stimulate new thoughts.
Most of the theories of the psychoanalyst Sigmund Freud were not data driven. Although he had many ideas that can easily be discounted, he also had some interesting thoughts that shaped our later understanding of the human psyche. Most of his theories are not considered meaningful by today’s experimentally minded approach to psychology because they cannot be experimentally verified. But if we become accustomed to discounting all non-data-driven theories, we could lose out on interesting thoughts that stimulate our thinking and lead to improved understanding in many fields.
Sometimes intuition can lead to very interesting and useful thoughts. For example, many dyslexics (Steve Jobs was one of them) tend to reason by insight because they are better at capturing complex systems. Insight is not that drastically different from data-driven reasoning. It is the result of lifelong experiences that provide data points, which lead to a fundamental understanding of a phenomenon or the structure of a system. Just because these data points are not readily available in a computer file does not mean that they do not exist.
So when it comes to data-driven decisions or big data, we should proceed with caution: we should not automatically assume that all data-driven arguments are correct, and we should keep in mind that non-data-driven arguments should not be automatically dismissed.
Since I am a dyslexic, I am prone to spelling and grammar mistakes. Hopefully it does not distract from the substance of the article.
Thank you for reading this article.