Beyond the p-value

The idea of “doing science” may evoke ideas of lab coats and beakers, but we actually spend a lot of time thinking about the scientific process here at HomeAway.

Like most technology companies, we are constantly seeking the answers to questions such as “Which page design most encourages customers to make a purchase?” or “How should we price this service to optimize revenue?”. The answers to these questions often come in the form of A/B testing.

A/B testing is based on statistical hypothesis testing, which helps us test and learn in a disciplined way that is less vulnerable to human bias. A/B testing is so well suited to the tech world because it is generally easier to get a large sample of subjects on the web than in fields like medicine and psychology, where the experiments take place in person. Believe it or not, if you spend any time browsing social media or shopping online, you have almost certainly been part of many of these tests.

Why We Test

To mitigate risk

We experiment to avoid implementing a change that could make things worse or have other unintended consequences. Implementing an unsuccessful change without testing first could lead to conversions tanking, wasted implementation resources, and even angry customers who vow to never use our service again.

People often rely on intuition and anecdotal experience in decision making. That often works quite well, but it is probably not ideal for risky ventures like changing the way that users register or make purchases on your website. After all, A/B testing can sometimes have surprising results.

The software startup Zapier saw, to its surprise, an increase in conversion when users had to scroll down the page to find out what the company does and who its target audience is. The team’s assumption was that the initial version of the page, which had all of the information at the top to draw customers in, would perform better. This underscores the point that while it is tempting to play amateur psychologist, it is usually hard to predict how humans will behave.

To learn something

A/B testing doubles as a way to learn more about our customers and how they interact with our products, which is always valuable to any business. This type of testing allows us a clear view into the way customers think and what they value, which on top of just providing insight into how to design the website, can also help guide our entire philosophy and business model.

…and to innovate faster

In addition to teaching us something, scientific testing gives us a disciplined approach to designing and implementing new features. Once we remove the influence of personal opinions and anecdotal evidence, we learn faster what works and what doesn’t, and we speed up the process of creating new ideas.

What is a P-value?

There are a multitude of ways to do A/B testing, but I want to focus on the one encountered most frequently at HomeAway and many other tech companies: traditional (frequentist) A/B testing.

In this type of A/B test, we choose the metric (or metrics) we want to optimize. Common metrics include things like clickthrough rate and conversion rate. The next step is to randomly assign members of the experiment to either the control group or the treatment group. Typically, the users in the control group see no change in your product, while the users in the treatment group are exposed to a new feature or a changed user interface. The random assignment is important because if the treatment group differs systematically from the control group in some way, that difference gets mixed up with the effect of the treatment. A significant change that we identify may not be “real”, and a real effect may be masked.
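As a rough illustration of the random assignment step, here is a minimal sketch in Python. The function name `assign_groups`, the 50/50 split, and the fixed seed are assumptions made for this example; production systems often assign users on the fly rather than from a complete list.

```python
import random


def assign_groups(user_ids, seed=42):
    """Randomly split users into equally sized control and treatment groups.

    Hypothetical helper for illustration only. The seed makes the split
    reproducible; it is not needed for the assignment to be random.
    """
    rng = random.Random(seed)
    shuffled = list(user_ids)
    rng.shuffle(shuffled)  # random order, so neither group is hand-picked
    half = len(shuffled) // 2
    return {"control": shuffled[:half], "treatment": shuffled[half:]}


groups = assign_groups(range(1000))
print(len(groups["control"]), len(groups["treatment"]))
```

Because the order is shuffled before the split, any trait of the users (new versus returning, mobile versus desktop) should be spread roughly evenly across both groups.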

The usual metric used to judge the performance of an A/B test is the p-value. The p-value is the probability that, assuming there is actually no difference between the treatment and control, you would observe results as extreme as or more extreme than what you observed in your experiment. A small p-value (by convention, anything less than 0.05 is considered good enough or “significant”) signifies that results like ours would be very unlikely if chance alone were at work. Note that this is not the same as the probability that the effect we observed was due to chance. Seeing a small p-value helps us separate the signal from the noise and suggests that we have strong evidence of a difference between the two groups.
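One way to make this definition concrete is to simulate it. The sketch below is an illustration rather than how p-values are normally computed: it assumes both groups share one true conversion rate (`base_rate`, which in practice we would have to estimate), replays the experiment many times, and counts how often chance alone produces a difference at least as large as the one observed. All names here are hypothetical.

```python
import random


def simulated_p_value(observed_diff, base_rate, n_per_group, n_sims=1000, seed=0):
    """Estimate a two-sided p-value by simulating experiments with no real effect."""
    rng = random.Random(seed)
    as_extreme = 0
    for _ in range(n_sims):
        # Both groups convert at the same rate: the "no difference" assumption.
        control = sum(rng.random() < base_rate for _ in range(n_per_group))
        treatment = sum(rng.random() < base_rate for _ in range(n_per_group))
        diff = (treatment - control) / n_per_group
        if abs(diff) >= abs(observed_diff):
            as_extreme += 1
    # Fraction of no-effect experiments that look at least as extreme as ours.
    return as_extreme / n_sims


# Example: we observed a 3 percentage point difference with 500 users per group.
print(simulated_p_value(observed_diff=0.03, base_rate=0.10, n_per_group=500))
```

The returned fraction approximates the p-value: the smaller it is, the harder it is to explain the observed difference by random variation alone.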

In basic A/B testing, calculation of the p-value is based on the normal distribution and the Central Limit Theorem. If you are interested in digging into the math, this is a good resource.
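For a conversion-rate metric, the normal approximation usually takes the form of a two-proportion z-test. The following is a minimal sketch using only the Python standard library; the function name and the example counts are made up for illustration, and it assumes samples large enough for the Central Limit Theorem to apply.

```python
import math


def two_proportion_p_value(conv_control, n_control, conv_treatment, n_treatment):
    """Two-sided p-value for a difference in conversion rates (pooled z-test)."""
    rate_c = conv_control / n_control
    rate_t = conv_treatment / n_treatment
    # Under "no difference", both groups share one pooled conversion rate.
    pooled = (conv_control + conv_treatment) / (n_control + n_treatment)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_control + 1 / n_treatment))
    z = (rate_t - rate_c) / std_err
    # Two-sided tail area of the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 200 of 10,000 control users convert vs. 250 of 10,000 treated.
z, p = two_proportion_p_value(200, 10_000, 250, 10_000)
print(f"z = {z:.2f}, p-value = {p:.4f}")
```

The sketch does not guard against a pooled rate of exactly 0 or 1, where the standard error is zero and the test is undefined.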

What It Doesn’t Tell Us

Although the p-value can be an excellent tool to help us quantify whether an effect is “real” or just the result of looking at a sample that is too small or too variable, there is a lot that it doesn’t tell us.

For example:

  • The probability that the treatment has no effect in general: the p-value is tied to your pre-designed experiment ONLY and does not incorporate any outside information (maybe your competitor already had success with this change) or correct for errors in experiment design (maybe your sample size was too small to detect a change).
  • The long-term result of the change: perhaps a change will appear to have a negative impact initially, but will become profitable over time as users get used to a new feature and industry trends develop. Remember the saying often attributed to Henry Ford: “If I had asked people what they wanted, they would have said faster horses”.
  • Whether implementing the change is worth the cost in money, time, infrastructure, etc.

Strong Evidence of a Very, Very Small Effect

When we think about making a change to our site like altering the flow of the booking process or even something as minor as changing the color of a button, there are a ton of considerations that evade capture by the crisp, simple p-value.

An important one is what the short and long-term impact of the change will be. Looking at the effect size (a measure of the difference in performance between the treatment and control) can help us here. The p-value is sometimes misread as a measure of how large the impact of the change is, but in reality a highly significant result can often be attached to a tiny effect size when you have a large sample size.

Therefore, it is important to consider both the p-value and the effect size when deciding whether to go forward with the results of a test. It may come to light that your change is highly significant but only leads to a 0.0001% lift in conversion rate. It’s up to you to decide whether it is worth the engineering resources or not.
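To see how sample size alone can turn a tiny effect into a “significant” one, here is a small sketch that reports the lift next to the p-value. The counts are invented for illustration (10.00% versus 10.05% conversion in both scenarios), and the p-value comes from a pooled two-proportion z-test, which is an assumption about the test being used.

```python
import math


def summarize_test(conv_control, n_control, conv_treatment, n_treatment):
    """Return (absolute lift, relative lift, two-sided p-value)."""
    rate_c = conv_control / n_control
    rate_t = conv_treatment / n_treatment
    absolute_lift = rate_t - rate_c
    relative_lift = absolute_lift / rate_c
    pooled = (conv_control + conv_treatment) / (n_control + n_treatment)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_control + 1 / n_treatment))
    p_value = math.erfc(abs(absolute_lift / std_err) / math.sqrt(2))
    return absolute_lift, relative_lift, p_value


# Identical conversion rates, very different sample sizes.
small = summarize_test(10_000, 100_000, 10_050, 100_000)
large = summarize_test(10_000_000, 100_000_000, 10_050_000, 100_000_000)

for label, (abs_lift, rel_lift, p) in (("100k per group", small),
                                       ("100M per group", large)):
    print(f"{label}: absolute lift = {abs_lift:.4%}, "
          f"relative lift = {rel_lift:.2%}, p-value = {p:.3g}")
```

The lift is the same in both scenarios. Only the p-value changes: the smaller experiment is not significant at the 0.05 level, while the larger one is overwhelmingly so. Whether a lift of that size justifies the work is a separate question that the p-value cannot answer.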

Another facet of this problem is that there are real-world risks and rewards that come as a result of your decision. The p-value can help us avoid dedicating valuable resources to a false positive, but it isn’t a substitute for the experience of domain experts. In the case where there is strong prior evidence of an eventual positive outcome, it may be worth it to proceed with the change anyway or rely more heavily on other methods of testing. On the other hand, when a decision is incredibly risky and the costs of making a mistake are high, a highly significant p-value may not be enough.

Learning Resources

If you want to learn more about the science of A/B testing, experimental design, and the pitfalls involved in working with p-values, check out these resources!

Practical Guide to Controlled Experiments on the Web

Lessons from Running Thousands of A/B Tests

A/B Testing Statistics Crash Course: Ignorant No More