Bayesian A/B Testing WhatsApp Messages
In this article, I will walk you through the process of performing an A/B testing analysis from the perspective of Bayesian statistics to determine the best WhatsApp message template to send to customers.
To be fair, it makes no difference whether this is about a website landing page, a product feature or WhatsApp messages. The logic of A/B testing applies to any experiment. However, because I work at Sirena, I find it appropriate to use the example of WhatsApp messages.
Context about the problem
Let’s set up the case for some context. You are an online retail business operating through WhatsApp in the modern era. For the purpose of this article, let’s say you get 10K messages per working week, most of them during the day, so your incoming WhatsApp traffic might look something like this:
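Such a pattern could be simulated roughly like the sketch below; the day and hour numbers are illustrative assumptions, not the actual dataset:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Simulate 10K incoming messages over a working week (Mon-Fri),
# concentrated around business hours -- illustrative numbers only
n_messages = 10_000
day = rng.integers(0, 5, size=n_messages)                        # 0 = Monday
hour = np.clip(rng.normal(loc=14, scale=3, size=n_messages), 8, 21)

traffic = pd.DataFrame({"day": day, "hour": hour.round().astype(int)})

# Hourly volume per day: this is the shape the traffic plot would show
print(traffic.groupby(["day", "hour"]).size().head(10))
```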
Once your contacts reach out to you, a Sirena Bot contacts them right away. After your bot talks to them, they get transferred to one of your sales reps and the sales process begins. The question is, which message should your sales representatives send first?
Ideally, you already have a bunch of message templates (highly structured messages, or HSMs) that your sales representatives can use. They are probably alternating between a few options, without really knowing which one works best.
Let’s find out!
Presenting the data
Obviously, answering this question will require us to dig into our data. For the purpose of this article, I have generated a random sample of data that simulates 10K incoming WhatsApp messages. The dataset is publicly available on my GitHub: FakeData/WhatsAppMessages
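Loading it could look something like this; the file name whatsapp_messages.csv and the column names used throughout are placeholders, so check the repo for the actual schema:

```python
import pandas as pd

# Placeholder file name -- see the FakeData/WhatsAppMessages repo
# for the actual file(s) and schema
df = pd.read_csv("whatsapp_messages.csv")

print(df.shape)   # expecting one row per incoming message (~10K rows)
print(df.head())
```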
In the data, we can also find the result of sending a given HSM (message template) in response to an incoming message. The result is binary: either the contact converts or they don’t. The exact meaning of conversion here is not very relevant; you can think of it as something simple, such as getting a reply, or something more elaborate, such as a positive reply as determined by a sentiment analysis model. In any case, the result boils down to a 1 (converted) or a 0 (not converted).
What makes this dataset flexible is that we can choose which message template to “send” to each incoming message, and it will tell us what the result was. Pretty neat!
Assigning templates to messages
If you are not familiar with A/B testing, the basic idea is to randomly choose which treatment (A or B) to give to each individual. It’s a form of controlled experiment where some fraction of the people get one treatment and the rest get another. In the end, the objective is to determine which of the treatments is the “best”.
In our case, our treatments are the different message templates (HSMs), and because we want to start simple, we will randomly assign each incoming message to one of the templates in a uniform manner. This means that, in the end, roughly one third of the incoming messages should be answered with HSM₁, another third with HSM₂ and the final third with HSM₃. If you know anything about multi-armed bandits, you might recognise that this is not optimal. For now, bear with me; I’ll write about multi-armed bandits in another post.
Let’s assign those message templates:
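A minimal sketch of the uniform assignment, assuming the dataframe df from before (the column name assigned_hsm is my own):

```python
import numpy as np

rng = np.random.default_rng(0)

# Assign each incoming message to one of the three templates
# with equal probability (~1/3 each)
df["assigned_hsm"] = rng.choice([1, 2, 3], size=len(df))

# Sanity check: the shares should come out roughly uniform
print(df["assigned_hsm"].value_counts(normalize=True))
```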
If you recall, the dataset has data on how an incoming message would react to each message template:
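Assuming one 0/1 outcome column per template (the names converted_hsm1 to converted_hsm3 are hypothetical), picking out the result of the template we actually “sent” could look like this:

```python
# Hypothetical outcome columns: one 0/1 result per template
outcome_cols = {1: "converted_hsm1", 2: "converted_hsm2", 3: "converted_hsm3"}

# For each message, keep only the outcome of the template we assigned to it
df["converted"] = df.apply(
    lambda row: row[outcome_cols[row["assigned_hsm"]]], axis=1
)
```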
Now we are ready to explore our data.
Exploratory data analysis
Now that we have assigned HSMs to every message, we can take a look at the data and explore the results. I won’t focus too much on this part since the point of this article is the Bayesian analysis itself. However, it’s important to check what the empirical conversion rates are for our sample:
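Using the hypothetical columns from before, those rates can be computed like so:

```python
print(f"Total conversions: {df['converted'].sum()}")
print(f"Overall conversion rate: {df['converted'].mean():.4f}")

# Empirical conversion rate per template
for hsm, group in df.groupby("assigned_hsm"):
    print(f"Conversion rate HSM {hsm}: {group['converted'].mean():.4f}")
```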
Total Conversions: 681
Overall conversion rate: 0.0681
Conversion rate HSM 1: 0.0336
Conversion rate HSM 2: 0.101
Conversion rate HSM 3: 0.0774
We can see that in our sample, the best template was HSM₂, with a conversion rate of about 10%. The worst was HSM₁, with a conversion rate of roughly 3.4%. The difference between these two templates is striking, but if we take a look at HSM₃, with a conversion rate of 7.7%, one might wonder whether there is really any difference between HSM₂ and HSM₃.
Bayesian analysis
Now let’s move to the fun part. We want to understand how different the conversion rates for our templates actually are. The question is: is HSM₂ really better than HSM₃? Or was this just by chance and an artefact of our sample?
To analyse this, I’ll take a Bayesian approach. I recognise most people will be more familiar with the frequentist approach (such as hypothesis testing), and I’ll probably write about it in another article. However, for the last couple of weeks, I’ve been obsessing over Bayesian statistics, so that’s what we’ll use. Specifically, I will use PyMC3 to model the problem and obtain posterior distributions of the conversion rates for each HSM.
To start, we need to create a model for our conversion rates. The beautiful thing about Bayesian statistics is that it lets you incorporate prior knowledge or domain expertise into your analysis through priors. In this case, let’s not make any bold claims about our conversion rates: we’ll use uniform distributions for the two conversion rates to represent that we have no idea what those rates are.
The next step is to define our delta, the difference between the first conversion rate and the second one. This can be interpreted as the gain in conversion rate that HSM₂ gives us compared to HSM₃. Last but not least, we use the conversion rates drawn from our uniform priors to model the conversions we observed. We’ll model these conversions with a Bernoulli distribution, which takes the value 1 (converted) with a certain probability (our conversion rate) and 0 (not converted) otherwise.
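A minimal sketch of that model in PyMC3, again assuming the hypothetical df columns from before:

```python
import pymc3 as pm

# Observed 0/1 outcomes for the messages assigned to HSM2 and HSM3
obs_2 = df.loc[df["assigned_hsm"] == 2, "converted"].values
obs_3 = df.loc[df["assigned_hsm"] == 3, "converted"].values

with pm.Model() as model:
    # Flat priors: no prior knowledge about either conversion rate
    p_2 = pm.Uniform("p_2", lower=0, upper=1)
    p_3 = pm.Uniform("p_3", lower=0, upper=1)

    # How much HSM2 gains over HSM3
    delta = pm.Deterministic("delta", p_2 - p_3)

    # Bernoulli likelihood: each message either converts (1) or doesn't (0)
    pm.Bernoulli("obs_2", p=p_2, observed=obs_2)
    pm.Bernoulli("obs_3", p=p_3, observed=obs_3)
```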
After sampling from our model, we can plot some diagnostics:
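For example, sampling with PyMC3’s default sampler and plotting the trace (the draw and tuning counts below are arbitrary choices):

```python
with model:
    # 2,000 posterior draws after 1,000 tuning steps
    trace = pm.sample(2000, tune=1000)

# Posterior histograms on the left, sampling chains on the right
pm.traceplot(trace)
```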
From these plots, we can see that, a posteriori, HSM₂ has a conversion rate of about 9.6% and HSM₃ a conversion rate of about 7.7%. Additionally, we have the posterior distribution of delta, that is, the distribution of how much better HSM₂ is compared to HSM₃. All in all, it appears our model has converged and is looking good. Now, let’s explore those posterior distributions.
Notice that our Bayesian analysis has yielded distributions as a result instead of point estimates. These distributions capture our uncertainty about the estimates and are a perfect way to communicate results to business stakeholders. For example, looking at the distributions, we can see that the conversion rate of HSM₂ lies somewhere between 8.7% and 11.2%, while the conversion rate of HSM₃ lies between 6.2% and 8.7%.
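Those ranges can be read straight off the posterior samples, for instance as 95% percentile intervals:

```python
import numpy as np

# 95% credible intervals from the posterior samples
for name in ["p_2", "p_3"]:
    lo, hi = np.percentile(trace[name], [2.5, 97.5])
    print(f"{name}: [{lo:.3f}, {hi:.3f}]")
```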
At the same time, we can analyse the distribution of delta (the difference in conversion rates). We see that the difference lies between 0% and 5%, which means that, in the worst-case scenario, HSM₂ converts no better than HSM₃. However, there is a high probability (86%) that HSM₂ increases our conversion rate by 1.25 percentage points or more!
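Since delta is just another set of posterior samples, these probabilities are simple one-liners:

```python
delta_samples = trace["delta"]

# Probability that HSM2 beats HSM3 at all, and by at least 1.25 percentage points
print(f"P(delta > 0):      {(delta_samples > 0).mean():.2f}")
print(f"P(delta > 0.0125): {(delta_samples > 0.0125).mean():.2f}")
```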
Conclusion
In this article, I’ve shown you how to perform A/B testing using a Bayesian approach to determine which WhatsApp message template is the best. The concepts discussed here apply equally to any other A/B testing problem: landing pages, product features, CTAs, emails, and so much more! However, there are plenty of things I did not discuss, such as my choice of priors, conjugate priors, multi-armed bandits and more. I intend to cover all these interesting topics in future articles.