The Role of Statistical Significance in Growth Experiments

The concepts of experimental design and hypothesis testing originate from the work of Ronald Fisher in the early 20th century, where they were developed as tools for rigorously testing the validity of a hypothesis. Consider the classic lady tasting tea experiment that Fisher posed. A lady claimed that she could tell whether milk or tea had been poured into a cup first, so Fisher designed a test with 8 cups of tea: 4 had milk added first and the remaining 4 had tea added first, and the lady was asked to select the 4 cups where milk had been added first. The null hypothesis was that the lady could not differentiate between the cups. Fisher was willing to reject the null hypothesis only if the lady identified all 4 cups correctly, since there are 70 ways to choose 4 cups out of 8 and therefore only a 1 in 70 (about 1.4%) chance of doing this by random guessing. A perfect result would thus be highly unlikely under pure chance and would suggest that she likely had the ability she claimed to have.
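To make the 1.4% figure concrete, here is a minimal Python sketch of the counting argument. The function name `p_at_least` is my own, chosen for illustration, and the sketch assumes the lady picks exactly 4 of the 8 cups purely at random.

```python
from math import comb


def p_at_least(k: int, cups: int = 8, milk_first: int = 4) -> float:
    """Chance of getting at least k milk-first cups right by pure guessing.

    Assumes the taster selects exactly `milk_first` cups at random.
    """
    total = comb(cups, milk_first)  # all equally likely selections
    ways = sum(
        # i correct picks from the milk-first cups, the rest from the others
        comb(milk_first, i) * comb(cups - milk_first, milk_first - i)
        for i in range(k, milk_first + 1)
    )
    return ways / total


print(f"all 4 correct:      {p_at_least(4):.4f}")
print(f"at least 3 correct: {p_at_least(3):.4f}")
```

Getting all 4 right has a 1 in 70 chance, while getting at least 3 right happens 17 times out of 70 (roughly 24%) by guessing alone, which is why anything short of a perfect score would not have been convincing.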

In this case, the experiment was geared towards one thing: deducing whether or not there was truth to the claim the lady made. Statisticians, including Fisher, continued to iterate on these concepts and refined the statistical procedures into more sophisticated and mature tools. Naturally, this sort of thinking and testing became incredibly powerful in the sciences, since it allowed scientists to run tests and clearly articulate which results were likely legitimate and which were likely the result of random chance. Over time the methodology spread beyond academic disciplines into industries such as medicine and manufacturing. Recently it has begun to be used by people doing “growth hacking,” which effectively boils down to using experimentation to drive online marketing and web product development.

The experimentation process is a natural fit for this kind of work for three reasons: many large web services can easily get the volume needed to hit statistical significance, starting an experiment and producing variations has a low marginal cost, and the analysis of basic experiments can be automated. Naturally, the predominant paradigm in this field has been to borrow the methodologies classically used for running experiments (standard frequentist inference techniques), along with the benchmarks for success passed down from classical statistical lore, even where those benchmarks don’t make sense. However, the underlying goal of experiments in these contexts differs from the underlying goals of traditional experiments. In most growth contexts, there are fundamentally two types of experiments: those geared towards learning and those geared towards optimization.

Examples of experiments geared towards learning include figuring out whether your users respond better to altruistic or selfish calls to action, or which user personas respond best to your marketing message. These “learning” experiments follow the traditional statistical inference playbook, and rightfully so: their purpose is to validate fundamental claims about the business and the direction it should head in, which ought to be correct with high probability. Experiments geared towards pure optimization, however, are more interesting and, at least from anecdotal observation, more prevalent in growth hacking contexts. Examples of optimization experiments include figuring out which button color optimizes conversion rate on a landing page or which text on the signup page optimizes signup rate. These experiments are fundamentally different from the original lady tasting tea experiment because they don’t aim to verify a claim for the sake of getting to the truth. Instead, they change the inputs to a function (landing page layout, target online advertising market, etc.) in order to maximize a certain outcome, which makes them more of a stochastic optimization problem and less a problem of pure statistical inference. Note, however, that most experiments don’t fall strictly into one category or the other but are usually a mixture of the two.

These optimization experiments tend to be treated as though they were learning experiments, even though they have a different underlying purpose. Consider a basic A/B test where we test a control with a blue signup button against a variant with a red signup button. The click-through rate (CTR) of this landing page is defined by a function that’s effectively CTR = f(relevant features). By modifying the button color we’re positing that button color is one of the relevant features and that, holding everything else constant, making this button red will positively impact CTR (note that this sounds an awful lot like taking a step of gradient ascent…). We continue to run these experiments, each time modifying some set of parameters with a pure optimization goal in mind (e.g. maximize CTR or minimize churn). However, the tools we use for assessing these situations are the standard tools of statistical inference, usually with a significance level of 0.05 or 0.10. The story behind the dogma of p < 0.05 is interesting and its predominance in experiments geared towards learning is potentially explainable, but blindly applying it to optimization experiments doesn’t seem wholly appropriate.
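For reference, the standard analysis of a blue-versus-red button test can be sketched in a few lines of Python. This is one common frequentist approach (a pooled two-proportion z-test), not necessarily what any particular experimentation tool does, and the function name and the conversion counts below are made up for illustration.

```python
from math import erfc, sqrt


def two_proportion_p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value for H0: both variants share one conversion rate.

    Uses the normal approximation with a pooled standard error, so it
    assumes reasonably large samples in both arms.
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided tail area of the standard normal: 2 * (1 - Phi(|z|))
    return erfc(abs(z) / sqrt(2))


# Hypothetical counts: blue (control) vs. red (variant) signup button
p = two_proportion_p_value(conv_a=200, n_a=2000, conv_b=240, n_b=2000)
print(f"p-value: {p:.4f}")
```

With these made-up counts the p-value lands just under 0.05, so the very same data would read as a clear win or as a borderline result depending on which conventional threshold the team happened to adopt.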

Putting aside the debate over whether or not p-values are in fact error probabilities, suppose for now that they are. Then we can loosely interpret 1 - p as the probability that the function value is (insert variant CTR) and use this going forward to iterate on our broader optimization problem, instead of focusing on whether or not we should reject our null hypothesis. In addition, for most optimization experiments in the context of “traditional” growth hacking, the cost of being wrong and having to revert your changes is usually rather minimal, apart from the growth you lost by making the bad decision. Compare this to medicine, where an incorrectly interpreted experiment could lead to deaths, or to science, where it could lead to misinformation. If collecting more data or doing additional analyses leads you to believe that you picked the wrong button color or the wrong landing page design, it’s easy to revert to the old button color or the old landing page. However, I must note that it is always better to err on the safer side, especially if you have the luxury of a high enough traffic volume or of waiting for your experiment to reach statistical significance at the standard levels.

One major issue with using the standard rules of statistical significance is that a growth team wants to iterate as quickly as possible and reach optimal values for different parts of the application as soon as possible, so that the team can hit its (likely) ambitious growth goals. A fast iteration cycle is also highly valuable for smaller companies, and even for larger companies dealing with new features, where low traffic volume means that reaching conventional levels of statistical significance can take weeks while the business demands a much tighter timetable.

Unfortunately, I don’t have a magical answer: there is no Goldilocks “just right” level of statistical significance that lets you iterate quickly without getting burned by random variation. However, I believe it’s important for growth teams to treat the level of statistical significance as a variable they control, and to understand the risks of accepting a less stringent level. Even for experiments geared more towards learning, it’s possible to take on additional risk by relaxing the significance threshold in order to iterate faster. It’s also important to understand, when designing an experiment, whether it leans more towards “learning” or “optimization,” and to treat the former like a classic experiment in the manner of Fisher and the latter more like a stochastic optimization problem.
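One way to see the significance level as a dial rather than a rule is to look at how the required traffic changes with it. The sketch below uses a common textbook approximation for the sample size of a two-sided, two-proportion test. The function name, the 10% baseline rate, the 2 percentage point lift, and the 80% power are all assumptions of mine for illustration and are not numbers from any real experiment.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(
    base_rate: float, lift: float, alpha: float, power: float = 0.8
) -> int:
    """Approximate visitors needed per variant to detect an absolute `lift`.

    Normal-approximation formula for a two-sided two-proportion test;
    `power` (chance of detecting a real lift) is an assumed input.
    """
    z = NormalDist().inv_cdf  # standard normal quantile function
    p1 = base_rate
    p2 = base_rate + lift
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z(1 - alpha / 2) + z(power)) ** 2 * variance / lift ** 2
    return ceil(n)


# Hypothetical scenario: 10% baseline conversion, hoping to detect +2 points
for alpha in (0.05, 0.10, 0.20):
    n = sample_size_per_variant(base_rate=0.10, lift=0.02, alpha=alpha)
    print(f"alpha={alpha:.2f}: about {n} visitors per variant")
```

Under these assumptions, relaxing the threshold from 0.05 to 0.20 cuts the required traffic by roughly 40%. The price is a higher chance of shipping a change that only looked better because of random variation, which is exactly the risk a team needs to weigh against the cost of reverting.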

These are just broad musings that I have gleaned from working as a growth engineer on Sidekick at HubSpot.