3 things I learned about A/B testing

Matthias Suter · doodle-tech · Feb 7, 2019 · 4 min read

A/B testing has become a crucial tool in our process, and I want to take a moment to reflect on the things I have learned along the way.

I take it as a given that A/B testing is an excellent way to test new features. This post is therefore not about where A/B testing makes sense and where it does not; instead, I will focus on the three most important lessons I learned on my journey to becoming data informed.

Size matters

It is important to put some thought into what exactly should be tested. While about 100k polls are created each day, we only get around 50 subscriptions per day on average. Traffic to test with can become a scarce good, even more so as we run more and more tests and want to move even faster (did I hear lightspeed?).
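
To make this scarcity concrete, here is a rough back-of-the-envelope calculation using the standard two-proportion sample-size formula; the baseline conversion rate and the uplift we hope to detect are made-up numbers, not our real ones:

```python
from scipy.stats import norm

def sample_size_per_variant(p_base, uplift, alpha=0.05, power=0.8):
    """Classic two-proportion sample-size approximation."""
    p_var = p_base * (1 + uplift)       # conversion rate we hope the variation reaches
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided significance threshold
    z_power = norm.ppf(power)           # desired statistical power
    variance = p_base * (1 - p_base) + p_var * (1 - p_var)
    return (z_alpha + z_power) ** 2 * variance / (p_var - p_base) ** 2

# Hypothetical numbers: 0.5% baseline subscription conversion, hoping for a +20% uplift.
n = sample_size_per_variant(p_base=0.005, uplift=0.20)
print(f"~{n:,.0f} visitors needed per variant")  # roughly 86k per variant
```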

It is best to have a crisp hypothesis and a variation which addresses exactly and only the things that should be tested. This sounds logical enough but can easily go wrong. One time we tested a completely overhauled pricing page, only to find out that, overall, it did not perform better than the old one. The main problem was that this was also the only real outcome: we did not gain any additional knowledge from this test (except that the new page did not perform better in our test). In another experiment, we only tested where the user expects a new feature instead of understanding what the feature should be. In retrospect, spending two weeks running this experiment was too costly, especially since it did not give any real insight.

It is good practice to let tests run for at least two weeks to avoid selection bias. This is true even if statistical significance is reached before that time. If you let a test run for less than one cycle (which in our case is one week), you risk collecting data that is not representative due to insufficient sampling. Doodle’s traffic and user base, for example, differ on weekends from weekdays. For that reason, we usually let tests run for at least two weeks to cover our cycle twice. This helps to flatten “special” days and other anomalies.
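
Putting the two rules together, the runtime of a test can be derived from the required sample size and then rounded up to whole weekly cycles; the numbers in this sketch are invented:

```python
import math

def test_duration_days(required_per_variant, variants, daily_visitors,
                       cycle_days=7, min_cycles=2):
    """Round the runtime up to whole weekly cycles, and run at least two of them."""
    raw_days = required_per_variant * variants / daily_visitors
    cycles = max(min_cycles, math.ceil(raw_days / cycle_days))
    return cycles * cycle_days

# Hypothetical: the tested page gets 20,000 visitors per day, we test two variants,
# and need ~86,000 visitors per variant (from the sketch above).
print(test_duration_days(86_000, variants=2, daily_visitors=20_000))  # -> 14 days
```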

Usually, we also do not run tests for much longer than two weeks. Longer tests can become non-representative due to other changes such as marketing campaigns and seasonal effects. Also, the chance that a user deletes their cookies, accesses the page again and ends up seeing the other version gets higher. More importantly, if there is no winner after two weeks even though there is enough traffic, most likely the change in the experiment simply does not lead to different behavior. In these cases, it is better to go back to the drawing board and test either another hypothesis or the same hypothesis with different variations.
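
For completeness, this is roughly what such a “winner or no winner” check looks like as a plain two-proportion z-test (the conversion counts below are invented):

```python
from math import sqrt
from scipy.stats import norm

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * norm.sf(abs(z))
    return z, p_value

# Invented numbers after two weeks: control vs. variation.
z, p = two_proportion_z_test(conv_a=640, n_a=140_000, conv_b=702, n_b=140_000)
print(f"z={z:.2f}, p={p:.3f}")  # p >= 0.05 -> no winner, back to the drawing board
```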

Go for a win

A/B testing should not be a replacement for making a decision. This was my biggest realization and is, in my opinion, also a common misconception. It is true that A/B testing (or testing in general) is a great way to avoid lengthy and pointless discussions in which people mostly just exchange opinions. I also believe that most of the time it is easier and more efficient “just to test” something instead of building big concepts on top of a lot of untested assumptions.

But I do not think that A/B testing should be used to avoid decisions in the design process by simply testing every possible variation one can think of. I see it more as a way to design and implement one (possibly small) change which the team believes in, in order to either prove or disprove an assumption. In most cases, it is faster and less work to test one well-thought-out variation instead of implementing a bunch of different variations and seeing which one wins (I call this the no-Darwinism-theorem).

Be patient

As already outlined in the first point, A/B tests should run for at least two weeks. Therefore, it also does not make sense to check the results every day and start drawing conclusions based on non-significant data. Otherwise, you start to sound stupid by talking about “trends” and “first insights” (been there, done that, and will most likely do it again…). The best approach would be to simply ignore the results for two weeks, but…

… that said, you should still check your A/B tests on a regular basis, mainly to spot problems early and to get a feeling for whether you are likely to end up with a result at the end of the test phase. We always start experiments with low traffic (it depends a bit, but usually 10% of the traffic) and let them run for 24 hours to see what goes wrong. If nothing goes wrong (or we do not realize that we have broken things, which has also happened), we go to 100% traffic.
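
For illustration (a minimal sketch, not our production code), a deterministic hash of the user id is one way to keep assignments stable while the traffic share is ramped from 10% to 100%:

```python
import hashlib

def _bucket(key: str) -> float:
    """Map a string to a stable value in [0, 1)."""
    digest = hashlib.sha256(key.encode()).hexdigest()
    return int(digest, 16) % 10_000 / 10_000

def assign_variant(user_id: str, experiment: str, traffic_share: float):
    """Decide deterministically whether a user is in the test and which variant they see."""
    if _bucket(f"{experiment}:traffic:{user_id}") >= traffic_share:
        return None                                   # user is not in the experiment (yet)
    # Variant assignment uses an independent hash, so ramping the traffic share
    # from 10% to 100% never moves a user from control to variation or back.
    return "variation" if _bucket(f"{experiment}:variant:{user_id}") < 0.5 else "control"

print(assign_variant("user-42", "new-pricing-page", traffic_share=0.10))  # maybe None at 10%
print(assign_variant("user-42", "new-pricing-page", traffic_share=1.00))  # always assigned at 100%
```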

At the moment we are a bit optimistic about running different tests at the same time. We make sure not to run obviously conflicting tests, but there is still room to improve. Ideally, we would make sure that every user sees exactly one experiment. But to achieve that, we need to get smarter about segmenting users and allocating the minimal traffic needed to each experiment. I hope this will be the topic of my next blog post (as soon as we have figured out how to do it…).
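
To sketch just the general idea (one possible direction, not how we actually do it at Doodle; the experiment names and slice sizes are made up): split the hash space into disjoint slices so that each user can only ever fall into one experiment.

```python
import hashlib

# Hypothetical layout: each experiment owns a disjoint slice of the [0, 1) hash space,
# sized to the minimal traffic it needs, so no user ever sees two experiments.
EXPERIMENT_SLICES = {
    "new-pricing-page": (0.0, 0.2),   # 20% of all users
    "simplified-signup": (0.2, 0.3),  # 10% of all users
    # 0.3 .. 1.0 stays untouched as a reserve for future tests
}

def experiment_for(user_id: str):
    """Return the single experiment (if any) this user participates in."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest, 16) % 10_000 / 10_000
    for name, (low, high) in EXPERIMENT_SLICES.items():
        if low <= bucket < high:
            return name
    return None

print(experiment_for("user-42"))
```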

I hope you enjoyed the read even though it did not contain any cat pics. Please use the comment section to express disagreement or contribute your own insights!

Matthias Suter is a Senior Product Manager at Doodle, responsible for the web version (www.doodle.com) with over 24 million monthly active users.