Do Warm or Cold Filters in Your Pictures Drive More Clicks? — A Machine Learning A/B Testing
A/B test on ads is the art of choosing the best advertisement that optimizes your goal (number of clicks, likes, etc). For example, if you change a simple thing like a filter in your pictures you will drive more traffic to your links.
In this post we will see how to do evaluate by yourself the efficacy of your own A/B test using Python and Bayesian Statistics.
Article TL;DR for the Busy Data Scientist
- I ran two twitter ads. The one with a cold filter outperformed the one with warm filter.
- Skip to section “The Code (starting from the end)” if you just want to copy and past some Python code to try your own A/B Test.
- Skip to the section “Putting all together” if you want to see my final result.
- Read it through if you want to learn how to do and why it works A/B Tests using Bayesian statistics.
The Problem
Given two ads (A and B), which one has the highest Click-Through-Rate (CTR)? From a statistical point of view, the problem is to learn the unobservable TRUE CTR parameter from the observed data of impressions (ad views) and clicks. Just to avoid confusion, remember that CTR is calculated as the number of clicks divided by the number of impressions:
From this simple formula you might be thinking:
“If the equation is really just a simple division the only thing I need to do is to get the data from the ad performance , do the simple division and the higher number is the best Ad. DONE!?”
Well, not really.
Say you have only one impression and one click, that is a CTR of 100%, but you should not be assuming that your TRUE CTR is actually 100% from one single view/impression. It most likely is not 100%, by the way. To put it simply, the observed CTR alone cannot tell us the performance of an ad. We need more, we need more data and we still need some Bayesian statistics. But before let me setup our our ad campaign.
The Setup
Let us explorer the problem a little bit further. For that matter, I ran two real ads in Twitter with real money (I spent $20 USD in case you are curious). The difference between both ads was only the filter in the images. One image had a cold filter and the other had a warm filter (default filters in my smart phone). The ad was for an Affiliate Amazon link to the book ‘Designing Data Intensive Applications ’(https://amzn.to/3iycLi6).
The question is ‘Which filter maximizes my the Click-Through-Rate?’, i.e., which image makes the Amazon link most likely to be clicked? Here are the ads side by side:
After running both ads for a few hours I got the following impressions and clicks:
Ad A (cold filter): 190 impressions, 13 clicks, CTR 0.068 (6.8%)
Ad B(warm filter): 143 impressions, 9 clicks, CTR 0.062 (6.0%)
From the data, we can see that A’s observed CTR is higher than B’s observed CTR (6.8% > 6.0%). The remaining of this post we will answer the following two questions:
- Can we conclude that A’s real CTR is higher than B’s real CTR?
- If we were to accept it is (or it is not), what is the probability that we are right or wrong?
One way to stretch this question to the extreme would be the following hypothetical situation: Assume that ad A had 1 view and 0 clicks while ad B had 1 view and 1 click. Their observed CTR are 0% and 100%, but no one would say that ad B performs better than ad A just from these data points. Let’s see how to estimate our CTRs while our ad campaign is running.
The Code (starting from the end)
It will be easier to understand where we are going if we start at the end. So, let’s have a look at some code. Do not worry if you do not understand every line yet, the main point now is the plot the code produces, also I will walk you through the code later.
Running this code in your notebook, you should get the following plot:
This plot shows the most likely values for each CTR given the observed data. Since the regions with the most likely values (the Highest Density intervals above) we cannot say that one ad is better than the other only, at least not with only this data. For comparison, see the same code if it were ran with the following FAKE data:
We will perform our A/B test in these 3 simple steps. We will:
- Gather the number of impressions and clicks of both ads;
- Plot the distribution of the most likely values for A and B’s CTR;
- Compare their highest density intervals and decide depending on the intervals’ overlap.
If you paid attention to the Python code above (no judgement if you did not), you saw that we used a few keywords, most importantly prior, posterior and beta. In the next section we will jump to the math and explain why the previous Python code works. So buckle up, enter the Bayesian statistics.
The Math behind A/B Test
The Prior
The goal here is to estimate the real CTR for an ad given the observed data. Since seeing the REAL CTR would imply serving an ad to every single user of Twitter, it is monetarily (and often practically) impossible to do that. So we will need to make our 20 bucks worth it. The estimation will come with an uncertainty that we will be able to quantify, that is also important for our decision making when it comes to stop an ad and increase the budget on others.
Let’s call θ the parameter we want to estimate (in this case the CTR of an ad) and p(θ) the probability distribution for θ. In a previous article, I talked about p(θ) being discrete or sometimes uniform, but here we will see that using the beta distribution to describe p(θ) makes sense and it convenient for computation.
One way to think about p(θ) is to think as if it is our belief on the possible values of θ. For example, say we are talking about a problem that we have no prior knowledge, and all possible values of θ are equally likely. In this case, note that beta(1,1) describes our knowledge (or lack of) for possible values of θ. By the same token, if we were to use θ to describe our prior belief on the probability of a coin flip to be be heads, the distribution beta(25,25) would be a good candidate, as it is centered around 50% , but we would still allow some room for small biases (HDI spans within 0.4~0.6). See figure below:
The parameters a and b can seem a little artificial, but away to describe a particular prior of a problem is via the prior mean m and the number of events n observed in the past. With m and n one can find the parameters a and b as follows:
It is important to note that a beta distribution might not be the best fit for the problem, for that we could use model selection via Bayes Factor, a topic way beyond this article. Historically, the beta distribution is used because it is very easy to compute, specially when we use it with the bayesian rule given some new data. Let’s have a look on that next.
The Posterior
The way we will do our A/B test will be assuming the prior distribution for our parameter θ (the real click-through-rate of the ad) to be beta(1,1), i.e., we will assume total lack of prior knowledge and set any possible value of θ to be equally likely. Then, we will update our p(θ) given our observed data, for that we use bayes rule (For an introduction to the Bayes’ Rule see my previous Article here):
Our goal will be to calculate the posterior distribution p(θ|D) for our advertisement given the number of impressions and clicks the ad received.
First, note that we are assuming that given an ad, the probably of clicking it when it is seen is equal to θ and the probability of not being clicked is equal to 1-θ, that is what we call a Bernoulli distribution.
Result: After N impressions and z clicks, if the probability of the data (N, z) is a Bernoulli distribution, and if the prior distribution of θ is a beta(a,b) distribution, then the following holds to be true:
For a detailed proof of this result I suggest John Kruschke’s “Doing Bayesian Data Analysis” (https://amzn.to/345ouR2), or ask me in Twitter (@solvingthehuman).
Putting all together
With the equation from last section we can say that after N impressions and z clicks in an ad, the updated distribution of the ad’s CTR (the θ parameter) is given by:
beta(1+z, N-z+1)
Let’s see how this look like with some real data along with the complete Python code so you can do your own A/B tests:
Conclusion
In this article we saw how to use the beta distribution and Python to quickly decide which ad is performing better in an A/B test campaign.
Also, we saw that if you are planning to sell data science book on Twitter you might be better off using the cold filter for your pictures. I am very curious to know if this result holds up to other people.
Let me know of your own A/B tests! Did you have similar results as mine? I am in Twitter @solvingthehuman.
More?
If you want to read: