Do Warm or Cold Filters in Your Pictures Drive More Clicks? — A Machine Learning A/B Test

Data Ninja
Solving the Human Problem
8 min read · Aug 17, 2020
Photo by Robert Anasch on Unsplash

A/B testing ads is the art of choosing the advertisement that best optimizes your goal (number of clicks, likes, etc.). For example, changing something as simple as the filter on your pictures may drive more traffic to your links.

In this post we will see how to evaluate for yourself the efficacy of your own A/B tests using Python and Bayesian statistics.

Article TL;DR for the Busy Data Scientist

  1. I ran two Twitter ads. The one with a cold filter outperformed the one with a warm filter.
  2. Skip to the section “The Code (starting from the end)” if you just want to copy and paste some Python code to try your own A/B test.
  3. Skip to the section “Putting it all together” if you want to see my final result.
  4. Read it through if you want to learn how to run A/B tests using Bayesian statistics, and why they work.

The Problem

Given two ads (A and B), which one has the highest Click-Through-Rate (CTR)? From a statistical point of view, the problem is to learn the unobservable TRUE CTR parameter from the observed data of impressions (ad views) and clicks. Just to avoid confusion, remember that CTR is calculated as the number of clicks divided by the number of impressions:

CTR = (number of clicks) / (number of impressions)

Click-Through-Rate formula.

From this simple formula you might be thinking:

“If the equation is really just a simple division, the only thing I need to do is get the data from the ad performance report, do the division, and the higher number is the best ad. DONE!?”

Well, not really.

Say you have only one impression and one click; that is a CTR of 100%. But you should not assume that your TRUE CTR is actually 100% from a single view/impression (it most likely is not, by the way). To put it simply, the observed CTR alone cannot tell us the performance of an ad. We need more data, and we need some Bayesian statistics. But first, let me set up our ad campaign.

The Setup

Let us explore the problem a little bit further. For that matter, I ran two real ads on Twitter with real money (I spent $20 USD, in case you are curious). The only difference between the two ads was the filter on the images: one image had a cold filter and the other had a warm filter (default filters on my smartphone). The ad was for an Amazon Affiliate link to the book ‘Designing Data-Intensive Applications’ (https://amzn.to/3iycLi6).

The question is ‘Which filter maximizes the Click-Through-Rate?’, i.e., which image makes the Amazon link more likely to be clicked? Here are the ads side by side:

Almost identical ads I ran on Twitter. Both ads are the same except for the filter applied to the pictures. The left (ad A) has a cold filter, while the right (ad B) has a warm filter. Which one do you think performed better?

After running both ads for a few hours I got the following impressions and clicks:

Ad A (cold filter): 190 impressions, 13 clicks, CTR 0.068 (6.8%)

Ad B (warm filter): 143 impressions, 9 clicks, CTR 0.063 (6.3%)

From the data, we can see that A’s observed CTR is higher than B’s observed CTR (6.8% > 6.3%). In the remainder of this post we will answer the following two questions:

  1. Can we conclude that A’s real CTR is higher than B’s real CTR?
  2. If we were to accept it is (or it is not), what is the probability that we are right or wrong?

One way to stretch this question to the extreme is the following hypothetical situation: assume that ad A had 1 view and 0 clicks while ad B had 1 view and 1 click. Their observed CTRs are 0% and 100%, but no one would say that ad B performs better than ad A from these data points alone. Let’s see how to estimate our CTRs while our ad campaign is running.

The Code (starting from the end)

It will be easier to understand where we are going if we start at the end. So, let’s have a look at some code. Do not worry if you do not understand every line yet; the main point now is the plot the code produces, and I will walk you through the code later.
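Below is a minimal sketch of such a script, assuming scipy and matplotlib (the variable names are my own):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Observed data: impressions (N) and clicks (z) for each ad
impressions_a, clicks_a = 190, 13   # Ad A (cold filter)
impressions_b, clicks_b = 143, 9    # Ad B (warm filter)

# Prior: beta(1, 1), i.e., every CTR value is equally likely a priori
prior_a = prior_b = 1

# Posterior: beta(prior + clicks, prior + impressions - clicks)
posterior_a = stats.beta(prior_a + clicks_a, prior_a + impressions_a - clicks_a)
posterior_b = stats.beta(prior_b + clicks_b, prior_b + impressions_b - clicks_b)

theta = np.linspace(0, 0.2, 500)
plt.plot(theta, posterior_a.pdf(theta), label="Ad A (cold)")
plt.plot(theta, posterior_b.pdf(theta), label="Ad B (warm)")

# Shade a 95% interval for each posterior (equal-tailed here,
# a reasonable stand-in for the HDI when the distribution is unimodal)
for posterior in (posterior_a, posterior_b):
    lo, hi = posterior.interval(0.95)
    mask = (theta >= lo) & (theta <= hi)
    plt.fill_between(theta[mask], posterior.pdf(theta[mask]), alpha=0.3)

plt.xlabel("CTR (θ)")
plt.ylabel("Probability density")
plt.legend()
plt.show()
```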

Running this code in your notebook, you should get the following plot:

A/B test from our first ads. The blue line is our A ad (cold) while the orange is our B ad (warm). The filled region under each curve is what we call the ‘Highest Density Interval (HDI).’ It is the area that contains 95% of the distribution of the CTR. Also, note that the averages differ from simply clicks divided by impressions. We discuss this later in the post.

This plot shows the most likely values for each CTR given the observed data. Since the regions with the most likely values (the Highest Density Intervals above) overlap, we cannot say that one ad is better than the other, at least not with only this data. For comparison, see the same code if it were run with the following FAKE data:

Example data for an A/B test. See that the HDIs do not intersect. In this example one could be confident in accepting that ad A (blue line, cold) has a higher CTR than ad B (orange line, warm).

We will perform our A/B test in these 3 simple steps. We will:

  1. Gather the number of impressions and clicks of both ads;
  2. Plot the distribution of the most likely values for A and B’s CTR;
  3. Compare their highest density intervals and decide depending on the intervals’ overlap (a quick numeric version of this check is sketched below).
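As a complement to eyeballing the HDI overlap in step 3, here is a quick sketch (my own addition, using scipy) that samples from both posteriors and estimates the probability that ad A’s real CTR beats ad B’s:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_samples = 100_000

# Posteriors from the observed data, with a beta(1, 1) prior
samples_a = stats.beta(1 + 13, 1 + 190 - 13).rvs(n_samples, random_state=rng)
samples_b = stats.beta(1 + 9, 1 + 143 - 9).rvs(n_samples, random_state=rng)

# Fraction of posterior draws in which ad A's CTR exceeds ad B's
print((samples_a > samples_b).mean())  # roughly 0.56 with this data: far from conclusive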

If you paid attention to the Python code above (no judgment if you did not), you saw that we used a few keywords, most importantly prior, posterior, and beta. In the next section we will jump into the math and explain why the previous Python code works. So buckle up: enter Bayesian statistics.

The Math behind A/B Test

The Prior

The goal here is to estimate the real CTR of an ad given the observed data. Since observing the REAL CTR would imply serving the ad to every single user of Twitter, it is monetarily (and practically) impossible to measure directly. So we will need to make our 20 bucks worth it. The estimate will come with an uncertainty that we will be able to quantify, which is also important for our decision making when it comes to stopping an ad and increasing the budget on others.

Let’s call θ the parameter we want to estimate (in this case the CTR of an ad) and p(θ) the probability distribution for θ. In a previous article, I talked about p(θ) being discrete or sometimes uniform, but here we will see that using the beta distribution to describe p(θ) makes sense and is convenient for computation.

One way to think about p(θ) is as our belief about the possible values of θ. For example, say we are dealing with a problem for which we have no prior knowledge, and all possible values of θ are equally likely. In this case, note that beta(1,1) describes our knowledge (or lack thereof) of the possible values of θ. By the same token, if we were to use θ to describe our prior belief about the probability of a coin flip landing heads, the distribution beta(25,25) would be a good candidate, as it is centered around 50% but still allows some room for small biases (its HDI spans roughly 0.4~0.6). See the figure below:

Beta distribution examples. The left distribution is an example of a problem with a total lack of prior knowledge, where all possible θs are equally likely. The right is an example of a coin-toss prior, where our prior knowledge assigns higher probability to 50% but allows some bias within 40%~60%.

The parameters a and b can seem a little artificial, but a way to describe a particular prior for a problem is via the prior mean m and the number of events n observed in the past. With m and n, one can find the parameters a and b as follows:

a = m·n, b = (1−m)·n

Formula to find the parameters a and b of a prior beta distribution given the prior mean m and the number of events n. Note that, given a and b, we can also recover the mean of the beta distribution: m = a/(a+b).
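As a small sketch of this conversion in code (the helper name is my own):

```python
def beta_params(m, n):
    """Convert a prior mean m and a prior event count n into beta(a, b) parameters."""
    a = m * n
    b = (1 - m) * n
    return a, b

# A coin-flip prior centered at 50% with the weight of 50 past flips
print(beta_params(0.5, 50))   # (25.0, 25.0), i.e., the beta(25, 25) from above

# Recover the mean from a and b: m = a / (a + b)
a, b = beta_params(0.5, 50)
print(a / (a + b))            # 0.5
```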

It is important to note that a beta distribution might not be the best fit for the problem; to check that, we could use model selection via Bayes factors, a topic way beyond this article. Historically, the beta distribution is used because it is very easy to compute with, especially when we combine it with Bayes’ rule given some new data. Let’s have a look at that next.

Beta distribution with different values for the parameters a and b. (Source: Wikipedia)

The Posterior

The way we will do our A/B test is by assuming the prior distribution of our parameter θ (the real click-through rate of the ad) to be beta(1,1), i.e., we will assume a total lack of prior knowledge and set every possible value of θ to be equally likely. Then, we will update our p(θ) given the observed data; for that we use Bayes’ rule (for an introduction to Bayes’ rule see my previous article here):

p(θ|D) = p(D|θ)·p(θ) / p(D)

Bayes’ rule. For an introduction to Bayes’ rule, see my previous article here.

Our goal will be to calculate the posterior distribution p(θ|D) for our advertisement given the number of impressions and clicks the ad received.

First, note that we are assuming that, for a given ad, the probability of it being clicked when seen is θ and the probability of it not being clicked is 1−θ; that is what we call a Bernoulli distribution.
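In other words, if impressions are independent, the probability of observing z clicks in N impressions is p(D|θ) = θ^z·(1−θ)^(N−z), up to a constant that does not depend on θ.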

Result: after N impressions and z clicks, if the probability of the data (N, z) follows a Bernoulli distribution, and if the prior distribution of θ is a beta(a,b) distribution, then the following holds:

p(θ | N, z) = beta(z+a, N−z+b)

If the distribution of the data is given by a binomial distribution and the prior distribution is given by a beta distribution, the posterior distribution is also a beta distribution. When that happens, we say that the binomial and beta distributions are conjugate, and the update from the prior distribution to the posterior becomes a simple arithmetic calculation.

For a detailed proof of this result I suggest John Kruschke’s “Doing Bayesian Data Analysis” (https://amzn.to/345ouR2), or ask me on Twitter (@solvingthehuman).

Putting it all together

With the equation from the last section we can say that after N impressions and z clicks on an ad, the updated distribution of the ad’s CTR (the θ parameter) is given by:

beta(1+z, N-z+1)
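For example, plugging in ad A’s numbers (N = 190, z = 13) gives beta(14, 178), and ad B’s (N = 143, z = 9) gives beta(10, 135), exactly the distributions used in the code earlier.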

Let’s see how this looks with some real data, along with the complete Python code so you can do your own A/B tests:
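Here is a sketch of the kind of script behind the plots below; the daily cumulative counts are hypothetical placeholders, so substitute your own running totals:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Cumulative (impressions, clicks) snapshots for each ad over the campaign.
# These counts are hypothetical placeholders for illustration only.
snapshots = [
    ("A few hours in", (190, 13), (143, 9)),
    ("Day 1", (600, 45), (580, 35)),
    ("Day 2", (1200, 95), (1150, 70)),
]

theta = np.linspace(0, 0.15, 500)
fig, axes = plt.subplots(1, len(snapshots), figsize=(12, 3), sharey=True)
for ax, (day, (n_a, z_a), (n_b, z_b)) in zip(axes, snapshots):
    # Posterior update from a beta(1, 1) prior: beta(1 + z, 1 + N - z)
    ax.plot(theta, stats.beta(1 + z_a, 1 + n_a - z_a).pdf(theta), label="A (cold)")
    ax.plot(theta, stats.beta(1 + z_b, 1 + n_b - z_b).pdf(theta), label="B (warm)")
    ax.set_title(day)
    ax.set_xlabel("CTR (θ)")
axes[0].set_ylabel("Probability density")
axes[0].legend()
plt.tight_layout()
plt.show()
```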

Real-time A/B test for the cold and warm filters. The plots are days apart, with the third one taken two days after the start of the ad campaign. In the beginning we do not have enough data to make up our minds about any specific value of ad A or B’s real CTR; moreover, we cannot say whether A or B is performing better. By the end of the campaign there is almost a 95% chance that the ad with the cold filter is performing better than the ad with the warm filter.

Conclusion

In this article we saw how to use the beta distribution and Python to quickly decide which ad is performing better in an A/B test campaign.

Also, we saw that if you are planning to sell a data science book on Twitter, you might be better off using a cold filter on your pictures. I am very curious to know whether this result holds up for other people.

Let me know about your own A/B tests! Did you get similar results to mine? I am on Twitter: @solvingthehuman.

More?

If you want to read:

  • An introduction to Bayes’ rule — Go here
  • The advantages of Bayesian methods over others — Go here
