Data-Driven Decision Making? Meh
Not entirely what it’s made out to be
Introduction
In the era of exponential data growth and ubiquitous machine learning applications, it is easy to become convinced that all decision making will eventually be solved by intelligent systems that feed on data. When in doubt, adding more data will solve the problem. As the superintelligent Multivac computer in Isaac Asimov’s short story The Last Question put it when asked whether any question can remain unanswerable in the face of infinite data:
NO PROBLEM IS INSOLUBLE IN ALL CONCEIVABLE CIRCUMSTANCES.
This drives some people to try to emulate this model of decision making: purely based on data, without emotion, without “bias”. Indeed, there is a yearning among executives for this idealised form of management: in a NewVantage Partners survey from 2018, 98.6% of executives aspired towards a data-driven culture. Yet it remains ever beyond reach: in the same survey, only 32.4% reported success in achieving it. The blame usually falls on cultural resistance and failures of leadership, not on the possibility that the vision itself is unrealistic.
Richard Dawkins, when asked about the slow pace at which science displaces religion, states:
If our current scientific explanations are not adequate to do the job, then we need better ones. We need to work more. […] We need better science, we need more science.
This represents the Enlightenment-era ideal of science, reason and measurement conquering all problems known to mankind, including how to run a business or a country. I’ll try to refute the usefulness of such an approach in real-life decision making environments.
Some examples of practical decision makers operating under uncertainty (and their sources of data) are:
- Product managers in web-facing technology companies (data scientists/analysts)
- Military generals (intelligence analysts)
- Government officials (advisers)
- Fund managers (quants)
- Mom-and-pop style investments (newspaper)
Note that the decision making discussed here does not refer to sterile environments with clearly defined risk measures and empirical distributions, such as:
- insurance
- gambling
- optimisation of manufacturing production lines
- estimating short-term effects in randomised trials
- deciding whether an image contains a cat or a dog (using a cleaned up dataset)
but rather real life environments with a lot of non-quantifiable (Knightian) uncertainty.
Fundamental limits of learning from data
There are at least three large problems with learning from data:
- inferring causality
- inferring long-term effects
- forecasting the future
Inferring causality
The trite saying that correlation does not imply causation is especially familiar to economists, who are supposed to give policy advice, i.e. tell us how to guide actions to change outcomes for the better. This is quite difficult to do by just looking at the data (correlations).
Consider the task of bringing down crime rates through the optimal allocation of police stations. You observe that the number of police stations is positively correlated with crime across districts. A pure prediction model would imply that increasing law enforcement results in higher crime rates, leading to extremely erroneous decisions. What we actually need is a model of the impact of a change in police numbers on the crime rate: we want to know the treatment effect or uplift of a change in policy, i.e. how much better the policy is than no policy.
This requires either a randomised experiment set up by the policy maker or (more feasibly) clever researchers coming up with a natural experiment that quasi-randomly changes the number of police stations in some districts. The treatment effect can then be recovered by measuring crime rates at two time points, before and after the “treatment”, and comparing the average change between treated and untreated districts (this approach is called a difference-in-differences model).
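To make this concrete, here is a minimal sketch of a difference-in-differences calculation on made-up district-level numbers (the data and column names are purely illustrative):

```python
import pandas as pd

# Hypothetical district-level crime rates before and after some districts
# receive extra police stations (the "treatment"); all numbers are made up.
df = pd.DataFrame({
    "district":     ["A", "B", "C", "D"],
    "treated":      [1, 1, 0, 0],
    "crime_before": [80.0, 95.0, 40.0, 55.0],
    "crime_after":  [70.0, 88.0, 42.0, 58.0],
})

df["change"] = df["crime_after"] - df["crime_before"]

# Difference in differences: (average change in treated districts)
# minus (average change in untreated districts).
did = (df.loc[df["treated"] == 1, "change"].mean()
       - df.loc[df["treated"] == 0, "change"].mean())
print(f"Estimated treatment effect on the crime rate: {did:+.1f}")

# A naive cross-sectional look would instead compare levels, where the
# treated (high-crime) districts still look worse despite the policy helping.
```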
Note the general implications of this: no matter how many petabytes of observational data on police stations and crime rates you gather, without either running a costly experiment or an ad hoc human-defined analysis, you can’t recover the treatment estimate (the true “knowledge” you are after). Scaling the amount of data alone does little to improve your decision making ability.
Knowledge is slow
How beneficial is a vegan diet to your body? This is a question I came across a while ago while trying to optimise my nutrition from a health perspective. After skimming through many articles on PubMed, I came to realise this is a very hard question. To estimate the true effect, you would have to randomly assign sufficiently large populations to a vegan diet (strictly defined) and a control diet (either what they would otherwise eat or some defined average diet) and enforce this for a long period, preferably a few years, but possibly for multiple generations (if you are afraid that the diet might have genotype-altering effects, as a vegan skeptic pointed out to me at a dinner party). In practice, such intervention studies are only run for a few weeks due to the prohibitive cost, and our limited knowledge is based on that.
There is a fundamental tradeoff between attaining confidence in knowledge and time/resource cost. Let’s assume the best case scenario: you are a Google search engine developer and want to test two separate ad placement options:
- A — large and annoying
- B — small and forgettable
You can randomly split incoming sessions between A and B (using a technique called A/B testing, discussed in more detail below). By the simple clickthrough rate metric, A might be preferred to B. However, an obvious consequence of being exposed to annoying ad A is decreased retention: the user will choose an alternative search engine next time or install software that blocks ads. Clearly Google cares more about long-term retention than instantaneous clickthrough rate. In order to properly test for this, the developer should split the A/B test on users instead of sessions and run it for a longer period of time, to see in which group users start dropping off like flies. Ideally, the test would be run for a very long time, up to the average lifetime of the user (which is prohibitively long for Google users). In practice, the developer might opt for a 2-week test, designing some arbitrary compound metric based on clickthrough rate and retention (how many users keep coming back over the 2 weeks), and base her decision on this.
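As a side note on mechanics, “splitting on users instead of sessions” usually comes down to what you hash when assigning variants. A minimal sketch, with a hypothetical experiment name and IDs:

```python
import hashlib

def assign_variant(unit_id: str, experiment: str, variants=("A", "B")) -> str:
    """Deterministically assign a randomisation unit (session id or user id)
    to a variant by hashing it together with the experiment name."""
    digest = hashlib.sha256(f"{experiment}:{unit_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

# Session-level split: the same person may see both ad variants across visits,
# which is fine for clickthrough rate but useless for measuring retention.
print(assign_variant("session-8f3a", "ad_size_test"))

# User-level split: a given user stays in one variant for the whole test,
# so you can follow whether they keep coming back over the test period.
print(assign_variant("user-42", "ad_size_test"))
```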
Most metrics that we care about in real life are long term:
- companies try to maximise lifetime profits/revenue
- in life, you try to maximise the sum of happiness over time
- governments (perhaps) try to maximise re-election probability
Testing the treatment effect then becomes tricky: your actions (for example, smoking) might benefit the short term while hurting the long term. Learning long-term effects is also difficult for computers: OpenAI Five, celebrated for employing strategic thinking in the game of Dota 2 (looking at the state of the game and determining which actions have the highest treatment effect towards winning the game), had to train on 180 years’ worth of games per day, a luxury not available if the data generating process runs in real time.
To further exemplify how difficult it is to incorporate long term effects into decision making, consider the following anecdotes:
- Smoking, today understood to be the largest contributor to premature deaths, required decades of research and a 1964 review of more than 7,000 articles by Surgeon General Luther L. Terry’s committee to reach a definitive stage in US decision making. Even then, there’s a special section in the report highlighting that while most studies show an association between disease and smoking, it’s much more difficult to prove a causal link (that smoking actually causes disease). Using the epidemiological method, which essentially means imposing hopeful assumptions on observational data, the report suggested a causal effect for lung cancer (for men only) and bronchitis, but not for cardiovascular diseases.
- Lead has been known since antiquity to be a health hazard. Yet it was introduced into petrol in the 1920s. Industry lobby groups suppressed research into the harmful effects for 50 years, until Herbert Needleman provided sufficient evidence that lead exposure lowers children’s intelligence. Despite attempts to discredit these findings as well, the industry was forced into decline after the EPA phased out leaded fuel between 1976 and 1996, reducing blood lead levels by 78% and crime rates by 34%. It took until 2011 for the UN to be able to declare the phaseout of leaded petrol worldwide, almost 100 years after the product was introduced!
- The long-term health impact of the Chernobyl disaster is still widely debated today. While there seems to be a strong link with increased thyroid cancer, the estimate for the number of excess deaths ranges from 62 (UNSCEAR 2008) to 200,000 (Greenpeace 2006, not peer reviewed). The difference comes from the estimation of the long-term effect: UNSCEAR puts it at zero, while Greenpeace puts it high because they oppose nuclear energy. The matter is entirely subjective: there’s no scientific way to estimate a treatment effect that is not statistically significant.
- 5G conspiracy theorists undermine the technology by pointing out that there have been no long-term studies on its health impact. But there are no long-term studies for many modern changes in lifestyle either, for example the mental health effects of reading “alternative” news outlets.
Prediction is hard, especially about the future
Enough has been written on the subject: if you still believe that your pension fund manager can predict the market, that your local government official can foresee the direction of the economy, or that the geopolitical pundit on the evening news can foretell the course of future conflicts with any degree of confidence, you are not only wrong but should quickly revise your beliefs and stop placing trust in unwarranted places.
Any hope that big data and machine intelligence will solve forecasting and reduce the uncertainty of the future should be briskly wiped away. On the contrary, the networked structure of the globalised economy and the increasing speed of innovation hint at the opposite: the future is becoming less and less foreseeable. Contrast this with the relative stability of peasant life in medieval Europe.
A known result of chaos theory is that it is quite useless to predict the weather more than a week or two ahead, due to the complex nature of the partial differential equations governing its evolution. We understand weather dynamics fairly well; it’s just the mathematical obstacle of instability that haunts our forecasts. The implications for something such as the economy, a war or product-market fit should be clear: complexity-wise they have the same properties as weather, but instead of a physical model governing the internal dynamics we have no clue. To quote the later convicted investor Mark Hanna, immortalised by Matthew McConaughey in the movie The Wolf of Wall Street:
Number one rule of Wall Street: I don’t care if you’re Warren Buffett or if you’re Jimmy Buffet, nobody knows if a stock is gonna go up, down, sideways or in circles, least of all stock-brokers.
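To see why instability alone ruins forecasts, here is a toy sketch using the Lorenz system (a textbook chaotic model, not an actual weather model): two trajectories that start almost identically drift hopelessly apart.

```python
import numpy as np

def lorenz_step(state, dt=0.01, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """One (crude) Euler step of the Lorenz system, a toy chaotic model."""
    x, y, z = state
    deriv = np.array([sigma * (y - x), x * (rho - z) - y, x * y - beta * z])
    return state + dt * deriv

a = np.array([1.0, 1.0, 1.0])
b = a + np.array([1e-9, 0.0, 0.0])   # almost identical starting conditions

for step in range(1, 3001):
    a, b = lorenz_step(a), lorenz_step(b)
    if step % 1000 == 0:
        print(f"t = {step * 0.01:4.0f}   separation = {np.linalg.norm(a - b):.3g}")

# The billionth-of-a-unit initial difference grows exponentially: beyond some
# horizon the forecast is worthless even though the model is known exactly.
```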
A new hope: A/B tests
Chances are that as a reader of this article you know what A/B tests are. Nonetheless, a primer: large web-facing companies can randomly split incoming traffic into groups and show each group a different variant of their product. This provides a straightforward way to estimate how an outcome business metric (say, clickthrough rate) differs between the groups and to choose the product version which achieves the best result.
This is what science calls a randomised experiment, and it provides the gold standard for estimating the treatment effect. In this regard, large web-facing companies are the most scientific in their approach to product development. By the way, findings from companies like Netflix, Microsoft and Google show that 50–90% of the product ideas deemed beneficial by human intuition are actually not: a great justification for such a tool!
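For illustration, reading out such an experiment on the clickthrough rate metric can be as simple as a two-proportion z-test; the counts below are invented:

```python
from math import sqrt
from scipy.stats import norm

# Invented A/B test results: clicks out of sessions shown each variant.
clicks_a, sessions_a = 5_200, 100_000
clicks_b, sessions_b = 5_000, 100_000

ctr_a, ctr_b = clicks_a / sessions_a, clicks_b / sessions_b
pooled = (clicks_a + clicks_b) / (sessions_a + sessions_b)
se = sqrt(pooled * (1 - pooled) * (1 / sessions_a + 1 / sessions_b))
z = (ctr_a - ctr_b) / se
p_value = 2 * (1 - norm.cdf(abs(z)))          # two-sided two-proportion z-test

print(f"CTR A = {ctr_a:.2%}, CTR B = {ctr_b:.2%}, z = {z:.2f}, p = {p_value:.3f}")

# A small p-value says the clickthrough difference is unlikely to be noise;
# it says nothing about the long-term retention we actually care about.
```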
Randomised experiments are great for solving the causality estimation issue, if you can do them. Unfortunately, we can’t A/B test different ways of administering tax returns or different outcomes of invading a country. In most real-life domains, running experiments is too costly or infeasible to be of practical value. This is why economists are obsessed with “natural” experiments, where they get the experimental setup for free.
Another drawback comes from the nature of statistics itself: say you work at Facebook and devise a clever algorithm that increases revenue by 0.01%. At their scale, this translates to $7M a year. However, depending on the experiment setup and the natural variation in the data, the effect size could be well below what any experiment can detect as statistically significant. For $7M you can hire a whole team of engineers and statisticians in the Valley, yet that team would be unable to justify the purpose of their work, because proving the effect is fundamentally impossible.
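A back-of-the-envelope sketch of why this happens, with entirely hypothetical revenue numbers: the sample size needed grows with the square of (noise / effect size), and a 0.01% lift drowns in the natural variation of revenue per user.

```python
from scipy.stats import norm

def users_needed_per_group(delta, sigma, alpha=0.05, power=0.8):
    """Approximate sample size per group for a two-sample comparison of means."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return 2 * (z * sigma / delta) ** 2

# Hypothetical numbers: mean revenue per user of $10 with a standard deviation
# of $50 (revenue is heavily skewed), and a true lift of 0.01% of the mean.
mean_revenue, sigma = 10.0, 50.0
delta = 0.0001 * mean_revenue        # a $0.001 absolute lift per user

n = users_needed_per_group(delta, sigma)
print(f"Users needed per group: {n:,.0f}")   # roughly 4e10, more than exist

# The $7M effect is real, yet no feasible experiment can distinguish it
# from zero, so the team behind it can never "prove" their worth.
```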
Yet another drawback appears on platforms that combine multiple market sides, for example Uber bringing together drivers and riders, or Airbnb hosts and tourists. It’s no longer possible to isolate the treatment effect: say you give a 50% discount to one Uber rider and nothing to another to see its effect on ordering a ride. As they are both ordering from the same driver pool, the behaviour of the first rider affects the second one indirectly through network effects, rendering the key assumption of independence invalid.
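A tiny simulation (all numbers invented) illustrates the interference: treated riders drain the shared driver pool, so the control group no longer behaves as it would in a world without the discount, and the naive A/B comparison is biased.

```python
import random

random.seed(0)

def simulate(discount_share, drivers=100, riders=200):
    """Each rider may request a ride (a discount raises that probability),
    but all riders compete for the same limited driver pool."""
    capacity = drivers
    served = {True: [], False: []}            # outcomes, keyed by treatment
    for _ in range(riders):
        treated = random.random() < discount_share
        wants_ride = random.random() < (0.9 if treated else 0.5)
        got_ride = wants_ride and capacity > 0
        capacity -= int(got_ride)
        served[treated].append(got_ride)
    rate = lambda group: sum(group) / max(1, len(group))
    return rate(served[True]), rate(served[False])

treated_rate, control_rate = simulate(0.5)    # naive 50/50 rider-level A/B test
_, no_discount_rate = simulate(0.0)           # counterfactual: no discounts at all

print(f"A/B readout: treated = {treated_rate:.2f}, control = {control_rate:.2f}")
print(f"Ride rate with no discounts anywhere: {no_discount_rate:.2f}")

# Treated demand crowds control riders out of the shared driver pool, so the
# control group no longer represents the no-discount world: the naive
# treated-minus-control difference is biased by the interference.
```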
Imagine an evolutionary search algorithm which evolves several variants of a website, A/B tests their performance, chooses the best one, evolves that one further, tests again… for multiple iterations. Theoretically, if all goes well, that should yield a website superior to any human effort. Such an approach has been used in very well-defined, specific research environments: NASA designed an antenna that way. However, even today’s tech unicorns do not use this approach for product enhancement, due to the three problems listed above: you might infer the wrong causality, you can’t measure the true long-term effects we care about, and you don’t know if the results will hold in the future. It’s just too risky to do it in a purely “data-driven” way. A/B tests are the best tool we have for estimating treatment effects, yet we have to make decisions even when they are not suitable.
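For concreteness, the loop described above might look like the toy sketch below, where the “website” is just a parameter vector and the A/B test is reduced to a noisy one-number fitness; it only works because that number is cheap, instant and stable, which is exactly what real products lack.

```python
import random

random.seed(1)

def ab_test_fitness(design):
    """Stand-in for an A/B test: returns the design's 'true' quality plus
    noise, and by construction knows nothing about long-term effects."""
    true_quality = -sum((x - 0.7) ** 2 for x in design)
    return true_quality + random.gauss(0, 0.05)

def mutate(design, scale=0.1):
    return [x + random.gauss(0, scale) for x in design]

# Evolve a "website" described by three abstract design parameters.
best = [random.random() for _ in range(3)]
for generation in range(50):
    candidates = [best] + [mutate(best) for _ in range(4)]
    best = max(candidates, key=ab_test_fitness)     # keep the A/B test winner

print("Evolved design:", [round(x, 2) for x in best])

# This converges only because the fitness signal is cheap, instant and
# stationary; a real product metric is none of those, which is why the loop
# is not how tech unicorns actually build products.
```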
How do we cope with these difficulties?
We have shown that quality data is costly to come by, regardless of overall data growth. In the push to become data-driven and leave “bias” aside, we use a few simplifying strategies.
Go for the easy metrics
This point is simple enough that it’s taught in business schools: if you measure the wrong stuff, you achieve the wrong stuff. This can mean substituting immediate metrics for long-term ones, or simply favouring revenue-related metrics over ones deemed too “qualitative”. Taking your quarterly OKRs too seriously is a great way to make bad decisions. Perhaps the most notorious example comes from the Vietnam War, where, unable to prove progress in the war to their superiors (because there was none), the military reverted to simple body counts, which may have contributed to the perverse incentives leading up to atrocities.
Fragile optimisation
Consider the following game:
You have to get from point S to G. There are two possible routes, one (red) directly on the edge of a cliff, the other further away from the cliff. You can move one square per turn. Every turn you get a reward of -1 (say you have sprained your ankle and every move hurts). The game ends either by getting to G or by falling off the cliff (reward -100).
Clearly the optimal path is the shorter one, as you get to the goal quicker. Now consider whether you would actually do this. In real life, the environment is not entirely deterministic. Walking with a sprained ankle you might slip, a rock might break, a gust of wind might push you off balance. Most sensible people would choose the blue path in any circumstance. Indeed, if the negative payoff of failure is large enough (say, death), it becomes optimal to take the safe route no matter how small the chance of perturbation (slipping) is.
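A back-of-the-envelope calculation (path lengths and probabilities invented) shows how quickly the safe route wins once slipping is possible at all:

```python
def expected_reward(steps, slip_prob, fall_penalty=-100, step_cost=-1):
    """Expected total reward of a path: every attempted step costs -1 and,
    with probability slip_prob, ends the walk with an extra fall penalty."""
    total, alive = 0.0, 1.0
    for _ in range(steps):
        total += alive * step_cost                 # pay for attempting the step
        total += alive * slip_prob * fall_penalty  # chance of falling right here
        alive *= 1 - slip_prob                     # probability of still walking
    return total

safe = expected_reward(steps=14, slip_prob=0.0)    # longer route, away from the cliff
for p in (0.0, 0.001, 0.005, 0.05):
    risky = expected_reward(steps=10, slip_prob=p) # shorter route, along the edge
    print(f"slip probability {p:>5}: risky = {risky:7.2f}, safe = {safe:7.2f}")

# With no slip risk the short path wins (-10 vs -14), but the expected rewards
# cross over at a slip chance of well under 1% per step; the larger the fall
# penalty, the smaller the perturbation needed to make the safe route optimal.
```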
Unfortunately, we are notoriously bad at estimating small probabilities; we often discount them entirely. In this case, assuming away the risk, we would be tempted to take the red route because “nothing can go wrong”. At a high level, this is at least part of the reason why we:
- optimise the economy for GDP growth, increasing systematic climate risk
- have nuclear proliferation
- experience the planning fallacy
- experience boom and bust cycles in markets — euphoria and tunnel vision are essentially a form of ignoring small risks
- build supply chains without redundancy that fall apart in case of unforeseen pandemics or a political embargo
To give a concrete example of how fragile optimisation plays out in a product development context, imagine you are running an e-commerce platform. One day you realise that you can increase the average revenue per session by indulging in shady business practices (which you would not want to explain to a journalist) but which are not strictly illegal (so you won’t get punished). One example might be showing fake discounts by propping up the original price, thereby making customers feel as if they were getting a great bargain.
If you were entirely atheoretical (and amoral), this might seem like a great optimisation opportunity: you run an A/B test, show the fake discounts to some users and see an increase in their spend. Perhaps you even run a user-based test for a month. If the discounts are subtle enough, the scheme might not show up in user churn behaviour either: overall, a great business win! Then, one night, a diligent consumer discovers the scheme and posts about it on online forums, and the story makes it to the pages of the local technology press. Overnight, you become known as the shady platform, losing 50% of your revenue and perhaps never recovering. By chasing a fragile optimisation goal, you’ve exposed yourself to large downside risk.
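In expected-value terms the scheme can easily be a net loss once the small chance of exposure is priced in, even though every A/B test reads positive. All numbers below are made up for illustration:

```python
# Hypothetical numbers for the fake-discount scheme.
baseline_revenue = 10_000_000           # yearly revenue without the scheme
measured_uplift = 0.03                  # +3% revenue, as read from the A/B test
p_scandal = 0.10                        # yearly chance a journalist notices
scandal_loss = 0.5 * baseline_revenue   # revenue lost if the story breaks

expected_gain = measured_uplift * baseline_revenue
expected_loss = p_scandal * scandal_loss

print(f"expected gain:  {expected_gain:>12,.0f}")
print(f"expected loss:  {expected_loss:>12,.0f}")
print(f"net per year:   {expected_gain - expected_loss:>12,.0f}")

# The A/B test only ever sees the +3%; the tail event never shows up in a
# two-week (or even month-long) experiment, yet it dominates the sum.
```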
What to do
It appears that making real-life decisions in the manner of an analyst or a scientist working in their well-behaved, contained domain is a fool’s errand. What’s the alternative?
Humility
Firstly, we have to accept that there are some things we will never be able to know, no matter how advanced our data infrastructure becomes. This should humble our rationalist hopes of science conquering everything. Especially in the social sciences, we are actually much closer to knowing nothing and being unable to forecast anything. There are some professions that make a living out of signalling the exact opposite. No government official will admit they are unsure whether the proposed stimulus package actually helps the economy. Take this with a grain of salt: it’s all for show, because being uncertain is seen as a sign of weakness in society at large. Nonetheless, when surrounded by peers, don’t be afraid to say “I don’t know”. There’s nothing worse than a room full of important-looking suits who pretend to know what they are talking about in order to maintain status.
Minimax risk management strategies
In (Knightian) uncertain environments, we can never be sure about the probability of different contingencies. An easier quantity to estimate is your exposure to uncertainty: for example, if you invest 10% of your wealth into the stock market (non-leveraged), then the worst-case scenario is that you lose the whole 10%. A minimax strategy is all about insuring yourself against downside risk, making sure that your worst-case exposure is containable.
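As a minimal sketch of the idea, with invented scenarios and payoffs: instead of ranking options by a guessed expected value, rank them by their worst case and keep that worst case survivable.

```python
# Invented payoffs (as % of wealth) under three unknowable future scenarios.
options = {
    "all-in on crypto":       {"boom": +300, "stagnation": -40, "crash": -95},
    "10% stocks, 90% cash":   {"boom": +5,   "stagnation": 0,   "crash": -10},
    "diversified portfolio":  {"boom": +30,  "stagnation": +2,  "crash": -35},
}

# Under Knightian uncertainty we refuse to attach probabilities to the
# scenarios and simply pick the option whose worst case is most survivable
# (maximin on payoffs, i.e. minimax on losses).
for name, payoffs in options.items():
    print(f"{name:<24} worst case: {min(payoffs.values()):+d}%")

choice = max(options, key=lambda name: min(options[name].values()))
print("Minimax choice:", choice)
```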
It provides surprisingly actionable insights for real life:
- The Surgeon General’s smoking report in 1964 stated:
Although the causative role of cigarette smoking in the deaths from coronary disease is not proven, the Committee considers it more prudent from the public health viewpoint to assume that the established association has causative meaning than to suspend judgment until no uncertainty remains.
Despite the lack of scientific certainty on the matter, the cost of delaying judgement was measurable in human lives, while the upside was only the tobacco industry’s profits. With such an asymmetric tradeoff, it’s always better to limit the exposure, even if you are uncertain of the truth.
- Even though the UNSCEAR report found no long-term impact on death rates, don’t buy a house next to a nuclear plant. In fact, support renewable energy sources (with lower downside risk) over nuclear.
- Take anthropogenic existential risks seriously, no matter how small you think the probability.
- Don’t invest your whole portfolio into speculative cryptocurrencies.
- In general, build less quantified expectations of the future and take a Stoic mindset of accepting what happens and making the most of it.
Common sense / heuristics
While the data-driven rationalist worldview demonises gut feeling for its bias, I feel this might be slightly unfair. In fact, many of the decision mechanisms preferred above are actually ingrained in “simple” thought processes. Consider the things deemed suboptimal from a naive optimisation perspective:
- using rules of thumb — heuristics are easily understandable, which reduces the risk of misimplementation and misconfiguration.
- doing things as they have always been done — having stood the test of time implies that the current approach is at least somewhat successful.
- doing things as others are doing — applied smartly, without anchoring entirely on others but amalgamating crowd opinion with your own, this can achieve wisdom-of-the-crowds effects.
- better safe than sorry — building redundancies into the system is very beneficial from a long-term risk perspective. Evolution creates a lot of redundancy.
- doing things based on gut feeling — while there is a large risk of bias here, in the absence of data intuition is still better than nothing. Baseball managers before the Moneyball era managed to build winning teams, perhaps not optimally but surely better than random.
Ethics / empathy
In general, in all areas of decision making it helps to be mindful of those affected by your decisions and to employ some kind of moral compass. As a product designer, you don’t need to run extensive A/B tests to determine that cheating your customers is bad. Taking on corporate social responsibility as an owner might not just feel good but actually improve your brand image and bottom line.
In general, try not to do bad things. If you are entirely cynical, think of this in terms of PR exposure: there’s a vast territory between things that you would not want to explain to a journalist and things that will land you in jail. Yet the court of public opinion has deposed rulers, bankrupted companies and created pariahs — a significant downside risk.
Conclusion
I’m a great fan of automation where it can be applied. The above is by no means an attack on rationality or the use of statistics and data in general, but rather a highlighting of their limitations. There’s far too much FOMO in the area of data science, envisioning that someone somewhere is getting the edge by having metamorphosed to a higher level of human rationality, making objective decisions and either leveraging machine intelligence or letting the machines call the shots themselves. That’s not the case if you dive deep into any of the actual cutting-edge data-driven solutions such as self-driving cars or the YouTube recommendation system. Instead, they consist of patches upon patches of heuristics based on intuition, the hard work of engineers, common sense fallbacks and some machine learning modules tackling well-defined tasks. If this is the state of the art in products, the implication for processes (with a larger degree of uncertainty) is clear: our monkey brains are not obsolete yet, and decision making is very hard to automate based on quantifiable data alone.