Delusive Extrapolation and A/B Testing

Cautionary Tales of Complexity and the Dangers of Jumping to Conclusions

Staffan Nöteberg
The Pragmatic Programmers

--

Illustration by Anni Nöteberg

A/B testing is becoming increasingly popular in digital product development. Early feedback on the choice of direction may save you a lot of time and money. However, be careful about drawing too far-reaching conclusions from your test results.

Imagine that for one day we let 5% of our retail site’s online customers test a new feature. We notice that those who have the new feature (the test group) spend on average 20% more money on our site than the other 95% (the control group). This is an A/B test: we’re arranging context A (the new feature) for one group and context B (no new feature) for another group. We then analyze whether behaviors differ between the groups as a consequence. From this analysis, we may extrapolate new theories, for example, that if all users gain access to the new feature tomorrow, we will increase total sales by 20%.
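To make the arithmetic concrete, here is a minimal Python sketch. The spend figures are invented for illustration (they are not data from any real site), and the function names are hypothetical. It computes the 20% lift and then uses a simple bootstrap to suggest how wide the plausible range around that figure can be when the test group is small and a single large order dominates it.

```python
import random
from statistics import mean


def relative_lift(test_spend, control_spend):
    """Relative difference in mean spend; 0.20 means the test group spent 20% more."""
    return mean(test_spend) / mean(control_spend) - 1.0


def bootstrap_lift_interval(test_spend, control_spend, rounds=2000, seed=1):
    """Rough 95% percentile interval for the lift, resampling each group with replacement."""
    rng = random.Random(seed)
    lifts = []
    for _ in range(rounds):
        t = rng.choices(test_spend, k=len(test_spend))
        c = rng.choices(control_spend, k=len(control_spend))
        lifts.append(relative_lift(t, c))
    lifts.sort()
    return lifts[int(0.025 * rounds)], lifts[int(0.975 * rounds) - 1]


# Hypothetical one-day data for 1,000 visitors: most spend nothing, a few spend a lot.
control = [0.0] * 843 + [50.0] * 102 + [500.0] * 5   # 950 visitors, mean spend 8.0
test = [0.0] * 44 + [50.0] * 5 + [230.0] * 1         # 50 visitors, mean spend 9.6

lift = relative_lift(test, control)
low, high = bootstrap_lift_interval(test, control)
print(f"observed lift: {lift:.0%}, rough 95% interval: {low:.0%} to {high:.0%}")
```

In this invented dataset the observed lift is 20%, yet the interval spans both negative and positive values, so the same data would also be compatible with the feature reducing sales.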

A/B Testing and Extrapolation

Extrapolation is when we use a number of observations within a context as input to the process of guessing what observations we will make outside that context. In the example above, we extrapolated the relationship between sales and time: what happened today shows how it will be tomorrow. We also extrapolated sales’ relationship to the user population: the sales increase we saw in the test group shows how sales will be for the control group when the latter gets access to the new feature. However, a quote attributed to an unknown Danish politician is relevant here:

“It’s difficult to make predictions, especially about the future.”

At the turn of the 20th century, the American advertiser Claude Hopkins was faced with a difficult decision. He had sketched two promising campaigns for his product, but he could not decide which one to launch. He decided to start both campaigns and compare the results using the return rate of promotional coupons, as he described in 1923 in his seminal book Scientific Advertising:

“We learn the principles and prove them by repeated tests. This is done through keyed advertising, by traced returns, largely by the use of coupons. We compare one way with many others, backward and forward, and record the results. When one method invariably proves best, that method becomes a fixed principle.”

Although Claude Hopkins did not use concepts like statistical significance or the null hypothesis, this may have been the first A/B test.

As the Internet became faster and more widespread, new ways of changing context for subsets of the user population emerged. For example, Google has tested varying its logo, its link colors, and the number of links in its search results. Data-driven decisions have since become highly regarded, as they are perceived to be based in fact.

However, products and markets are complex, and their relationships are even more intricate. Conclusions extrapolated from A/B testing and other types of early market observations are in danger of being simplistic.

Feedback Loops and Nonlinearities May Alter What Events Entail

For example, the test group is not always as randomly selected as was intended. Google Glass initially generated a lot of positive attention. However, when all the tech enthusiasts and early adopters had become consumers, there were not many curious people left. This mistake is an example of sampling bias. A volunteering test group does not always fairly represent the majority.

Another flaw might be the size of the test data. Assume we are conducting multiple tests on a small dataset of hit songs to find correlations between various factors, such as tempo, key, and lyrics. However, these factors might not be predictive of success. Testing multiple variables on the same test group may result in data dredging. The consequence is that we mistakenly believe that we have found a recipe for success.
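The data dredging effect can be illustrated with a small simulation. The sketch below is an assumption-laden toy, not an analysis of real songs: both the "success" scores and the 200 candidate "factors" are pure random noise, and the function names are hypothetical. Even so, with only 20 songs, the strongest of the 200 correlations tends to look impressive.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def strongest_chance_correlation(n_songs=20, n_factors=200, seed=7):
    """Largest |correlation| between random 'success' and many unrelated random factors."""
    rng = random.Random(seed)
    success = [rng.gauss(0, 1) for _ in range(n_songs)]
    best = 0.0
    for _ in range(n_factors):
        factor = [rng.gauss(0, 1) for _ in range(n_songs)]  # noise, unrelated to success
        best = max(best, abs(pearson(factor, success)))
    return best


best = strongest_chance_correlation()
print(f"strongest correlation found among pure noise: {best:.2f}")
```

Nothing in this data predicts anything, so whatever "recipe" the search turns up is an artifact of testing many variables on few observations.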

What Causes What?

Complexity makes it hard to immediately pinpoint the direction of causation. Procter & Gamble spent decades developing Olestra, a zero-calorie substitute for fat. Health concerns — fat in snacks — initially seemed to cause the success of Olestra. Then reverse causation occurred. The success caused health concerns due to the unpleasant side effects of Olestra, such as gastric cramps and diarrhea. Olestra never delivered the return on investment that Procter & Gamble had initially estimated.

Sometimes the cause is a third variable that we didn’t initially think of. Suppose that statistics suggest that the consumption of energy drinks leads to more sports-related injuries. However, risk-tolerant people might be overrepresented both among heavy consumers of energy drinks and among practitioners of extreme sports. Risk tolerance is the common cause of both energy drink consumption and injuries in this example.
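A short simulation can show how such a third variable produces a misleading association. In the sketch below, all probabilities are invented assumptions and the function names are hypothetical. Energy drinks have no effect on injuries at all; risk tolerance alone drives both. The raw comparison still makes drinkers look more injury-prone, while the difference largely disappears once we compare people with the same risk tolerance.

```python
import random


def simulate(n=100_000, seed=3):
    """Each person is (risk_tolerant, drinks, injured); drinks never influences injured."""
    rng = random.Random(seed)
    people = []
    for _ in range(n):
        risk_tolerant = rng.random() < 0.5
        drinks = rng.random() < (0.7 if risk_tolerant else 0.2)
        injured = rng.random() < (0.30 if risk_tolerant else 0.05)
        people.append((risk_tolerant, drinks, injured))
    return people


def injury_rate(people, drinks, risk_tolerant=None):
    """Share of injured people among (non-)drinkers, optionally within one risk group."""
    group = [p for p in people
             if p[1] == drinks and (risk_tolerant is None or p[0] == risk_tolerant)]
    return sum(p[2] for p in group) / len(group)


def drink_effect(people, risk_tolerant=None):
    """Injury rate of drinkers minus injury rate of non-drinkers."""
    return injury_rate(people, True, risk_tolerant) - injury_rate(people, False, risk_tolerant)


people = simulate()
print(f"all people:         {drink_effect(people):+.3f}")
print(f"risk-tolerant only: {drink_effect(people, True):+.3f}")
print(f"cautious only:      {drink_effect(people, False):+.3f}")
```

Comparing within groups like this is only possible when the third variable has been identified and measured, which is exactly what is often missing in early observations.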

History is also full of examples in which immediate product success resulted from two or more parameters that happened to coincide. For example, a sleek design, an intuitive interface, and powerful computing capabilities were three of many factors that contributed in a multicollinear way to the iPhone’s market success. Apple might have given up early if its A/B tests had taken only one of those factors into account, since the combination of many factors was the necessary condition for the iPhone’s outstanding attraction.

The context in which a test is conducted might also be different from the extrapolated context. Coca-Cola experienced the McNamara effect when the sweet New Coke met with a negative reception from the market upon its launch in 1985. This happened despite extensive market research indicating that a significant proportion of consumers preferred the sweeter taste of Pepsi over Coke. Coca-Cola focused too heavily on quantitative data and failed to take into account customers’ emotional attachment to their product’s properties. A Pepsi/Coke blind test is not the same context as when a customer chooses a product in a store.

Time Matters

In addition, history doesn’t repeat itself as often as we would like. The popularity of tulip bulbs in the 17th century in the Netherlands was largely driven by their novelty and uniqueness. In February 1637, tulip bulbs were sold for more than an artisan would earn in ten years. After a while, however, the novelty effect disappeared, and so did the market demand for tulip bulbs. Many people who had invested heavily were ruined. Extrapolating the relationship between market price and time might not take into account that the excitement caused by newness is temporary.

An extrapolated context also involves a larger ecosystem than the product and the market. Hybrid cars initially attracted buyers with environmental concerns and a desire for fuel efficiency. Once a critical number of cars had been sold, however, additional factors caused a non-linear relationship between hybrid car sales figures and time. The market increased exponentially due to government incentives and social pressure. Extrapolation based on early evidence does not show exponential potential of this kind.

Early disappointing observations can also have a negative impact on our diligence. Electric cars are finally becoming popular, despite the technology having been available for decades. One reason might be that it took a long time to establish the infrastructure necessary for their success. If a car manufacturer had extrapolated from the low initial interest, it would have risked missing large future returns. Time lag is at play.

What about the new site feature that increased sales by 20%, where we started this story? In this context, we might be victims of survivorship bias. What is not seen is the impact the new feature had on those test group members who did not buy anything during the test period. They might be customers who shop infrequently but spend large sums when they do, and who will now never return because they did not like the change.
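A worked example with invented numbers shows how this could play out. The sketch below assumes 100 users per group: 90 frequent buyers who spend a little every day, and 10 infrequent buyers who each make one large purchase around day 30. It further assumes that the new feature raises the frequent buyers' daily spend by 20% but drives all the infrequent buyers away. All figures and function names are hypothetical.

```python
def revenue(days, daily_spend, frequent_buyers, infrequent_buyers_retained,
            big_purchase=3000.0, purchase_day=30):
    """Total revenue over `days` for one group of users (all figures hypothetical)."""
    regular = days * daily_spend * frequent_buyers
    # Infrequent buyers each make one large purchase, but only if the
    # measurement window is long enough to include it and they are still around.
    occasional = infrequent_buyers_retained * big_purchase if days >= purchase_day else 0.0
    return regular + occasional


def lift(days):
    """Relative revenue difference between test and control over `days`."""
    control = revenue(days, daily_spend=10.0, frequent_buyers=90, infrequent_buyers_retained=10)
    test = revenue(days, daily_spend=12.0, frequent_buyers=90, infrequent_buyers_retained=0)
    return test / control - 1.0


print(f"lift measured over 1 day:   {lift(1):+.0%}")
print(f"lift measured over 90 days: {lift(90):+.0%}")
```

Under these assumptions the one-day test shows a 20% gain, while the 90-day view shows revenue about 12% lower, because the one-day window never observes the customers who were lost.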

Executive Summary

The complexity of products and ever-changing markets tells us to be cautious with far-reaching conclusions based on A/B tests and other early observations.

A/B testing may lead to generalization. At worst, the generalization becomes an over-generalization, a type of cognitive distortion where you transfer conclusions from one event to all other events, regardless of whether these events occur in a comparable context. The risk is that you miss exceptions, counterexamples, or alternative perspectives. This, in turn, can lead to incorrect conclusions.

Always think critically about the assumptions or implications of the test conclusions and see if they are reasonable, sufficient and logical. Look for alternative sources of information or viewpoints that may offer other insights.

And never forget the impact of timing. What is true at one point in time may later be false, because things we mistakenly thought were not relevant have changed.

For that matter, don’t overgeneralize the message of this essay either.

Staffan Nöteberg is the author of The Pomodoro Technique Illustrated, published by The Pragmatic Bookshelf.

You can also read The Pomodoro Technique Illustrated on Medium.
