A Data-Driven Marketers’ Guide to Calculating Statistical Significance

Mark Rogers
SpyFu’s Testing Ground
9 min read · Mar 23, 2020

Data-driven marketers live by the mantra, “always be testing.” Landing pages, email subject lines, ad copy, blog images, and anything else with a variable should be A/B tested, and then A/B tested again.

When marketers run A/B tests, though, they often overlook one important check: statistical significance.

Statistical significance will tell you if the results you got from an A/B test are actually indicative of future performance or if those results could be an anomaly.

Say you’re running an A/B test of a landing page. Both versions of the landing page had 1,000 visitors. Landing page A had 90 conversions, giving it a 9% conversion rate. Landing page B had 111 conversions, giving it an 11.1% conversion rate.

Just looking at the conversion rate, it seems like landing page B is the winner. Unfortunately, that result isn’t actually statistically significant. If you were to run this experiment again, you might get a completely different set of results. There’s a chance this one test was a fluke because it didn’t have enough data for you to make a statistically significant decision.

That could be a big problem if you make your decision based on this one test. You might end up with a worse result after running a full campaign using landing page B.

Those are mistakes marketers can’t afford to make. In a world where we live and die by the numbers we produce, it’s important to make sure the data behind our decisions is trustworthy.

What is statistical significance?

Statistical significance is commonly defined as “the likelihood that a relationship between two or more variables is caused by something other than chance.” To put it more simply, statistical significance is a measure of how confident you can be in the results of your study.

A helpful, if simplified, way to think about statistical significance is to imagine running the exact same experiment 100 times. If you would see the same result in 95 of those 100 runs, your result is statistically significant at the commonly used 95% confidence level.

Of course, you don’t want to run the same experiment 100 times to figure out statistical significance. That would take too much time. That’s why there’s a formula to calculate statistical significance.

The math behind statistical significance is complex. It’s built on the sample standard deviation, s = √( ∑(xᵢ − x̄)² / (N − 1) ), which feeds into a test statistic and, from there, a p-value.

Fortunately, you don’t actually need to know that formula to calculate statistical significance. There are a number of statistical significance calculators that make the process easy. My go-to is VWO’s statistical significance calculator built in Google Sheets (you can copy it to make your own version).
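
If you’d rather see what’s happening under the hood than trust a black box, here is a minimal Python sketch of the test these calculators typically run for conversion rates: a two-proportion z-test. It reports a one-tailed p-value, which matches the p-values quoted later in this article; a two-tailed test would give a larger number. The function name is my own, and the numbers are the landing-page example from above.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conversions_a, visitors_a, conversions_b, visitors_b):
    """One-tailed two-proportion z-test for an A/B conversion-rate comparison."""
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b
    # Pool the conversion rate across both variations to estimate the standard error
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    std_error = sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = (rate_b - rate_a) / std_error
    # One-tailed p-value: probability of a lift this large if A and B truly perform the same
    p_value = 1 - NormalDist().cdf(z)
    return z, p_value

# Landing-page example from above: 90/1,000 vs. 111/1,000 conversions
z, p = two_proportion_z_test(90, 1_000, 111, 1_000)
print(f"z = {z:.2f}, p = {p:.3f}")   # p is about 0.059, above 0.05, so not significant
```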

For an experiment to be considered statistically significant, you need to set a benchmark before you start. In most cases that benchmark is a confidence level of at least 95%, which corresponds to a p-value below 0.05.

Once you hit that level, you can be confident the results of your test aren’t due to chance, meaning that if you ran the experiment again, you’d very likely see the same outcome. With that much confidence, you can make a sound decision based on the data.

Understanding the p-value

The p-value is what makes or breaks statistical significance. It’s the number you use to set the bar, and it determines whether you can say, “yes, I’m 95% confident that I’d get the same results if I tested this again.”

There are really two p-values in play: the benchmark you set before the test (formally called the significance level, or alpha) and the p-value calculated from your results. How high you set the benchmark depends on how much margin for error you’re willing to build into the test.

In most academic research settings, the benchmark is a p-value below 0.05. In scientific experiments where a wrong decision can be life or death, researchers use a much smaller threshold, often below 0.001.

For marketers, a threshold that low requires a massive sample size, usually one so large that it would either take months to collect or never be reached at all.

Here’s an example: a landing page A/B test where the marketer set the benchmark p-value at 0.001. Reaching that level of statistical significance required 220,361 visitors across the two pages. Unless you’re an enterprise company, collecting that much traffic could take months, if it happens at all.
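
To get a feel for why stricter thresholds demand so much traffic, here is a rough sketch of the standard sample-size approximation for comparing two proportions. The 2.5% baseline and 2.8% target CTR are hypothetical, and real calculators may assume two-tailed tests or different statistical power, so treat the output as a ballpark, not a quote.

```python
from statistics import NormalDist

def visitors_per_variation(rate_a, rate_b, alpha=0.05, power=0.80):
    """Approximate visitors needed per variation for a one-tailed
    two-proportion test to detect rate_a vs. rate_b."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    z_power = NormalDist().inv_cdf(power)
    variance = rate_a * (1 - rate_a) + rate_b * (1 - rate_b)
    return (z_alpha + z_power) ** 2 * variance / (rate_a - rate_b) ** 2

# Hypothetical test: 2.5% baseline CTR vs. a hoped-for 2.8%
# Required traffic balloons as the alpha threshold shrinks
for alpha in (0.05, 0.01, 0.001):
    print(alpha, round(visitors_per_variation(0.025, 0.028, alpha)))
```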

Fortunately, since marketers don’t deal in life-or-death decisions, we don’t need thresholds that strict. We can set our benchmark p-value as high as 0.08, which means a smaller sample size and a shorter testing period to reach statistical significance.

The p-value is easier to understand if you convert it to a percentage, because it is, at its core, a probability. If you set your benchmark at 0.05, that’s the same as saying, “I’m willing to give this experiment a 5% chance of being wrong.”

If the results of your experiment give you a calculated p-value of 0.046, that equals 4.6%, which is statistically significant in an academic research setting. That means you can say that you’re 95.4% confident that you’ll get the same results if tested again.

Why marketers should care about statistical significance

Statistical significance matters for marketers because we base so many decisions on data. If you base a decision on results that aren’t statistically significant, there’s a real chance it’s the wrong call, because those results could turn out to be an anomaly. Run the same experiment again and you might get a different answer.

Basing decisions on data that isn’t statistically significant can have a big impact on your bottom line. Let’s say you’re running an ad campaign on Facebook. You have two different versions of the ad, and you let both versions run for 48 hours.

Here are your results:

Ad A:

  • Impressions: 5,984
  • Clicks: 147
  • CTR: 2.46%

Ad B:

  • Impressions: 5,697
  • Clicks: 161
  • CTR: 2.83%

It looks like Ad B is the easy winner. Ad B had more clicks on fewer impressions than Ad A, but those results aren’t statistically significant because the calculated p-value is 0.106.
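
As a check, running those ad numbers through the two_proportion_z_test sketch from earlier gives the same answer:

```python
# Same two_proportion_z_test sketch from earlier, fed the ad results
z, p = two_proportion_z_test(147, 5_984, 161, 5_697)
print(round(p, 3))   # about 0.106, above the 0.05 benchmark
```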

But suppose you don’t take the time to calculate that and decide to move forward with Ad B anyway.

Ad B runs for two weeks and spends $10,000. Since you’re busy and figure the A/B test already did its job, you don’t check in on the campaign until the end. It turns out the results from your original A/B test were an anomaly, and the rest of your campaign underperformed:

  • Impressions: 377,581
  • Clicks: 7,518
  • CTR: 1.99%

Had you just stuck with Ad A at its CTR of 2.46%, you would’ve gained approximately an additional 1,772 clicks. That’s not an insignificant number of clicks. All because you didn’t check for statistical significance during your initial A/B test.
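
Here’s the back-of-the-envelope math behind that number, using the rounded 2.46% CTR; the small gap from the ~1,772 quoted above comes down to rounding:

```python
# What Ad A's CTR would have delivered on the same impressions
impressions = 377_581
ctr_ad_a = 0.0246            # Ad A's CTR from the original test, rounded
actual_clicks_ad_b = 7_518
expected_clicks_ad_a = impressions * ctr_ad_a
print(round(expected_clicks_ad_a - actual_clicks_ad_b))   # roughly 1,770 lost clicks
```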

If your A/B test isn’t reaching statistical significance, you have two options: keep the test running until you’ve collected enough data to prove significance, or take your A/B test in a different direction altogether.

How to use statistical significance in marketing

To be a true data-driven marketer, you need to start using statistical significance to make your decisions.

Here’s a full example, so you can see this in action:

Say you want to A/B test headline copy on a landing page.

  • Landing Page A’s headline is “Use Data to Skyrocket Your Facebook Ads Results”
  • Landing Page B’s headline is “Join 10,000 Other Marketers Using Data to Skyrocket Their Facebook Ads Results”

Landing Page A’s headline is what you’ve been using for the past three months. So you’re trying the B split to see if you can get a higher conversion rate with it.

You note your hypothesis, which is: “If I introduce social proof into the headline, then conversion rates will increase.”

Now, you can set your p-value. You choose 0.05 because you want to be very confident that your results will be accurate.

At this point, you’re ready to start your test. You start sending traffic to both pages, and after seven days, you have these results:

Landing Page A:

  • Visitors: 10,247
  • Conversions: 103
  • Conversion Rate: 1.01%

Landing Page B:

  • Visitors: 10,114
  • Conversions: 127
  • Conversion Rate: 1.26%

It looks like Landing Page B is the better performer based on the conversion rate, but let’s test for statistical significance.

Using a statistical significance calculator, you find that your calculated p-value is 0.0455, which is below your benchmark of 0.05. That means you can be roughly 95.5% confident that you’d get the same result if you ran this test again, so the test is statistically significant and your hypothesis is supported.
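
As a sanity check, the same two_proportion_z_test sketch from earlier lands on essentially the same number:

```python
# Same two_proportion_z_test sketch from earlier, fed the headline test results
z, p = two_proportion_z_test(103, 10_247, 127, 10_114)
print(round(p, 3))   # about 0.045, just under the 0.05 benchmark
```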

You can now confidently say that you’ve made the right decision by introducing social proof into the headline. It’s time to turn off the traffic going to Landing Page A and direct it all to Landing Page B.

What if you never find statistical significance?

There’s a definite possibility that you’ll run an A/B test where one version looks like it’s performing better, but according to your statistical significance calculation, your results aren’t significant. So you keep running the test, but after a few more weeks, your results still aren’t significant.

What are you supposed to do?

You have a few options here. There’s a chance your two variations are simply too similar to ever produce a significant difference. If you suspect that’s the case, restart the test with variations that differ more substantially.

Your second option is to lean on practical significance. Practical significance applies when the difference is large enough to matter in the real world, even if it doesn’t clear the bar for statistical significance.

For example, let’s say you’re running a Facebook ads A/B test. After a few days, these are your results:

Ad A:

  • Impressions: 5,984
  • Clicks: 147
  • CTR: 2.46%

Ad B:

  • Impressions: 5,697
  • Clicks: 161
  • CTR: 2.83%

Ad B looks like the winner based on CTR, but the p-value is 0.106, meaning the result isn’t statistically significant.

So you continue to run the A/B test for another week. After another week’s worth of data is collected, your results are similar. Ad B continues to have a higher CTR, but performance has slipped on it a little bit:

Ad A:

  • Impressions: 59,840
  • Clicks: 1,470
  • CTR: 2.46%

Ad B:

  • Impressions: 56,970
  • Clicks: 1,465
  • CTR: 2.57%

Your p-value is now 0.104, so you’re a little closer to statistical significance, but still not close enough to call it significant.
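
For what it’s worth, the two_proportion_z_test sketch from earlier lands in the same neighborhood; any small difference from the calculator’s 0.104 comes down to the exact method the calculator uses.

```python
# Same two_proportion_z_test sketch from earlier, fed the week-two ad numbers
z, p = two_proportion_z_test(1_470, 59_840, 1_465, 56_970)
print(round(p, 3))   # about 0.105, still well above the 0.05 benchmark
```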

This is where practical significance can come into play. If you convert that p-value to a percentage, it gives you 10.4%. That means there’s an 89.6% chance that you’ll get the same results if you repeat the experiment.

In this case, you can use practical significance to say that Ad B is the higher performer. It’s not statistically significant, but in a practical sense, you have a high enough sample size to be relatively confident that your results will stay the same.

For practical significance to work, you need a large sample size and a long period of testing. It’s rare that you’ll have to make a decision based on practical significance, but it is something to keep in mind.

Use statistical significance to be data-driven

The biggest step to being a data-driven marketer isn’t collecting more data. It’s actually learning how to use data to make accurate decisions.

Making accurate decisions means using statistical significance to confirm that the patterns in your data are real and will hold up time after time.
