A/B Test Gotchas

Maria Gullickson
Snagajob Engineering
5 min read · Jul 15, 2016


I am someone who is interested in data and in making decisions based on data. So obviously, I’m a firm believer in A/B tests. I believe most changes to our website should be A/B tested to verify that they actually have a positive impact. Over time, I’ve found there are some “gotchas” to keep in mind when A/B testing.

Some of these are issues with the testing tool being used, and some are just issues with how the test is set up and run. At Snagajob we have used various testing tools over time, and currently have two in use. One of these was built internally, so with that one we aren’t stuck dealing with tool-related issues; we can just update the tool.

So here are some things to keep in mind when running tests. If you’ve got any more to add, please share them in the comments.

  • Make sure to look at statistical significance, not just change rates. You might be super hyped about your test because it increased the metric you were targeting by 25%. But was that really statistically significant? If it increased from 4 to 5, the numbers are probably too small to be meaningful. If there’s huge variance in the numbers, even a jump from 1000 to 1250 might not be something you can rely on. You should be using a testing tool that tells you how confident you can be that a metric really has changed, and have an internal standard for what confidence you require. Would you make a change to your site if you were 75% confident that it improved conversion? 95%? 99.9%? (See the significance-check sketch after this list.)
  • Avoid noise in your user population. Make sure you are only assigning users to your test (variant or control) if they actually encounter the behavior being tested. It’s easy to fall into the trap of assigning users to test groups as soon as they land on your site. But if you are testing a change on a page that only a fraction of your users ever see, you should only assign users to the control or variant when they actually land on that page. Otherwise you will have so much noise in your data that it will be hard to identify any statistically significant change. (See the exposure-point sketch after this list.)
  • Don’t count bots and other outliers in your test results. And remember that not all bots behave well. Googlebot can be identified fairly easily and excluded from the test. The scraping tool that someone wrote using Selenium, however, might just look like another user hitting your site in Firefox. So be sure to look for, and ignore, statistical outliers in addition to clearly identifiable bots when analyzing your results. Otherwise you might end up thinking page views are way up in the variant because a bot happened to land in that group and hammer your site. (See the outlier-filtering sketch after this list.)
  • Look out for novelty effects. Let’s say you’ve got a blue button on your site, and you want to test whether it gets more clicks if you change it to red. The button is on your home page and gets a lot of traffic. You put up the change, and 12 hours later you’ve got a ton of data showing that people seeing the red button clearly click more. You are good to go, and make that change permanent, right? Maybe not. People might not be clicking on it because it’s red, but rather because it’s a different color than it has been the last 20 times they hit the site. It’s new, so it jumps out at them. Maybe the best thing to do really is to show a new button color every day, not to show red all the time. You can account for something like this in a couple of different ways. One is to run the test for a long time. Does that high conversion rate stay high a week later? How about a month later? Another option is to avoid the problem altogether in user selection. If you only assign new site users to the control and variant, they don’t know anything about what color the button was yesterday; they are only responding to what they see right now.
  • Look out for bias in how you assign users. If your site has a decent number of users who don’t have JavaScript enabled, but your testing framework is JavaScript-based, your tests will never include a representative sample of your users. If you redirect users to a different starting page depending on whether they are logged in or not, and you assign them to a test group only on the landing page for unrecognized users, you won’t have a representative sample. Depending on the test, this may be okay. If you are testing a new feature that will only be accessible to users with JavaScript, or to unrecognized users, then that’s totally appropriate. But make sure you are aware of any limitations in your user base and how they might impact the validity of your tests.
  • Make sure to look at the complete impact of your test, not just the metrics you expect to impact. For example, you might make a change that you think will increase page views. More page views are better. You run the test, you look at page views, and they are up 10%. Everything is peachy, right? Maybe not. What if you also decreased sales in the process? Or new user registrations? It may not be worth the trade-off. It’s important to track all your key metrics across each group, so you can see these unintended effects, not just the changes to that one metric you are trying to target. A good testing tool will do this automatically.
  • Make sure you define a test “user” appropriately. In the employer suite of products, Snagajob has individual users who are part of organizations (i.e., companies that employ hourly workers). If I am testing a UI change, it might make sense to assign people to groups as individuals. But if I’m testing a new feature, it might make more sense to assign them at the organization level. If half the people in an organization can use a new feature and the other half can’t, the feature might not work as expected or benefit them in the same way it would if they were all using it. (See the randomization-unit sketch after this list.)
  • Not everything can be A/B tested. If I believe making a change to the homepage will improve SEO, that can’t be A/B tested. All that matters is the view the crawler gets, and if you give it an inconsistent view, it definitely won’t work well. You’ll need to look into other solutions for that. (Note: if you want to A/B test something like SEO for item detail pages, there are ways to do that. Check out this article for a good description of how.)
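
For the significance check in the first bullet, here is a minimal sketch of a two-proportion z-test in Python. This is not our actual tooling, the traffic numbers are invented, and it assumes the metric is a simple conversion rate; a high-variance metric like revenue per user needs a different test.

```python
# A minimal sketch of the kind of significance check a testing tool runs
# under the hood. Sample sizes below are made up for illustration.
import math

def two_proportion_z_test(conversions_a, users_a, conversions_b, users_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    rate_a = conversions_a / users_a
    rate_b = conversions_b / users_b
    pooled = (conversions_a + conversions_b) / (users_a + users_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / users_a + 1 / users_b))
    z = (rate_b - rate_a) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided, via the normal CDF
    return z, p_value

# A 25% lift from 4 to 5 conversions on tiny traffic: not significant at all.
print(two_proportion_z_test(4, 200, 5, 200))            # p ≈ 0.74
# The same 25% lift from 1000 to 1250 on large, equal traffic: very strong.
print(two_proportion_z_test(1000, 20000, 1250, 20000))  # p < 0.001
```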
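
For the exposure-point idea, here is a hypothetical sketch of where assignment should happen. The bucket() helper, the page, and the test name are all invented; the point is just that the assignment call lives on the page under test, not in a site-wide hook.

```python
# A hypothetical sketch of assigning users at the point of exposure.
import hashlib

def bucket(unit_id: str, test_name: str) -> str:
    """Deterministically assign an ID to 'control' or 'variant' (50/50)."""
    digest = hashlib.sha256(f"{test_name}:{unit_id}".encode()).hexdigest()
    return "variant" if int(digest, 16) % 2 else "control"

def render_application_page(user_id: str) -> str:
    # Assign here, when the user actually reaches the page under test...
    if bucket(user_id, "one-click-apply") == "variant":
        return "new one-click apply flow"
    return "existing apply flow"

# ...not in a site-wide hook like this, where users who never reach the page
# get assigned anyway and just dilute the results:
#
# def on_any_page_view(user_id):
#     bucket(user_id, "one-click-apply")   # too early; adds noise

print(render_application_page("user-1234"))
```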
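
For the outlier filtering, here is one simple sketch using an interquartile-range cutoff on per-user page views. The cutoff rule and the numbers are illustrative only; a fixed percentile works too.

```python
# A minimal sketch of trimming extreme outliers before analysis, assuming you
# have per-user page-view counts for a test group.
from statistics import quantiles

def drop_outliers(values):
    """Drop values far above the interquartile range."""
    q1, _, q3 = quantiles(values, n=4)
    cutoff = q3 + 3 * (q3 - q1)
    return [v for v in values if v <= cutoff]

# Page views per user in the variant, with one Selenium scraper mixed in.
variant = [3, 5, 2, 4, 6, 3, 4, 5, 2, 4, 9000]
print(sum(variant) / len(variant))      # ≈ 822 page views per "user"
cleaned = drop_outliers(variant)
print(sum(cleaned) / len(cleaned))      # 3.8, the real picture
```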
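
And for the randomization unit, here is a hypothetical sketch of the only real difference between per-user and per-organization assignment: which ID you hash. The helper mirrors the bucket() function above, and the test names and IDs are made up.

```python
# A hypothetical sketch of choosing the randomization unit.
import hashlib

def bucket(unit_id: str, test_name: str) -> str:
    digest = hashlib.sha256(f"{test_name}:{unit_id}".encode()).hexdigest()
    return "variant" if int(digest, 16) % 2 else "control"

# UI tweak: randomize per individual user.
print(bucket("user-1234", "new-nav-layout"))

# Collaborative feature: randomize per organization, so everyone at the same
# company lands in the same group and can use the feature together.
print(bucket("org-987", "shared-hiring-notes"))
```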
