A/B testing and hypothesis testing I

Qiang Chen
Machine Learning and Math
8 min read · Nov 18, 2018

Foreword

Figure: two buttons in different colors; an A/B test is used to find out which color improves the metric.

One use of A/B testing is to verify, in the real user environment, whether a newly proposed algorithm is better than the previous one. It is also often used for UI tests, such as testing the different button colors in the figure above to see which color improves some metric.

Before running an online A/B test, we usually compare the two algorithms using offline metrics, but these metrics rest on assumptions, for example that the online distribution of user preferences is consistent with the offline data set. Offline metrics therefore reflect performance in the real user environment only to a certain extent. To avoid mistakes, it is usually necessary to run an A/B test on the new algorithm. Typically, a certain percentage of online users are served the original algorithm and are called group A, while the same percentage of users are served the new algorithm and are called group B. If group B performs better than group A, the new algorithm is better than the original one, and it can be rolled out to a larger share of users and gradually to all users.

The question then is how to judge whether group B performs better than group A. Suppose user performance is measured by the time spent in the app. If groups A and B each have 100,000 users and group B users stay on average one hour longer than group A users, the new algorithm used for group B is evidently better. But what if the difference is half an hour, 15 minutes, or 1 minute? What is the rule for making the judgment?
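To get a feel for why a small difference is hard to judge, here is a minimal simulation sketch. It assumes, purely for illustration, that stay time is normally distributed with a mean of 60 minutes and a standard deviation of 30 minutes, and that both groups are served the same algorithm. The function name, the parameter values, and the smaller group size are all hypothetical choices, not figures from a real experiment.

```python
import random
import statistics


def null_mean_differences(n_users, n_trials, mean=60.0, sd=30.0, seed=0):
    """Simulate the B-minus-A difference in average stay time (minutes)
    when both groups are drawn from the SAME normal distribution."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_trials):
        group_a = [rng.gauss(mean, sd) for _ in range(n_users)]
        group_b = [rng.gauss(mean, sd) for _ in range(n_users)]
        diffs.append(statistics.fmean(group_b) - statistics.fmean(group_a))
    return diffs


if __name__ == "__main__":
    # Assumed, illustrative settings: 1,000 users per group, 200 repetitions.
    diffs = null_mean_differences(n_users=1000, n_trials=200)
    print(f"spread of chance differences: {statistics.stdev(diffs):.2f} minutes")
    print(f"largest chance difference: {max(abs(d) for d in diffs):.2f} minutes")
```

Even with no real difference between the groups, the observed averages differ by chance, so a rule is needed to decide when an observed difference is large enough to count as evidence.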

This article explains the relationship between A/B testing and hypothesis testing and works through the main line of the logic, but it does not resolve every problem. Later articles will address the remaining ones.

Prerequisites

I wanted to write this article a few weeks ago. I had learned and understood much of the key material in a graduate mathematical statistics class, but had since forgotten it, so I spent a week or two reviewing. I have not yet found the textbook I used at the time, so I have been reading Chen Xiru’s Mathematical Statistics Tutorial, which is a very good book.

The goal of hypothesis testing: a random variable follows some distribution. For example, the time a user spends in the app can be assumed to follow a normal distribution. We have some observations of the random variable, for example the stay durations of 10,000 users in group A and of 10,000 users in group B, and from these observations we want to judge a hypothesis about the distribution, for example that the mean of the normal distribution for group B is larger than that for group A. The process of going from the sample observations to a judgment on whether the hypothesis holds is called a test. A test does not logically prove that the hypothesis is right or wrong. Rather, it judges, on the evidence of the sample observations, whether the hypothesis is likely to be correct or wrong.

Hypothesis testing has developed a series of theories and methods; here we focus on the likelihood ratio test. Before that, you need to understand maximum likelihood estimation, which helps in understanding the likelihood ratio test. The article on classification and probability theory in this column is a good starting point for maximum likelihood estimation.

Likelihood ratio hypothesis test

Question 1: Let the sample X have probability function f(x, 𝜃), 𝜃 ∈ Θ, and let Θ𝙷 be a non-empty proper subset of Θ. The proposition H: 𝜃 ∈ Θ𝙷 is called the null hypothesis. Its exact meaning is that there exists some 𝜃₀ ∈ Θ𝙷 such that the probability function of X is f(x, 𝜃₀). Let Θ𝙺 = Θ \ Θ𝙷; the proposition K: 𝜃 ∈ Θ𝙺 is called the alternative hypothesis to H.

H: 𝜃 ∈ Θ𝙷 ↔ K: 𝜃 ∈ Θ𝙺 is called a hypothesis testing problem. The goal is to decide, based on the observed value of X, whether H is correct, that is, to choose between H and K.

Likelihood ratio hypothesis test

Let the sample X have probability function f(x, 𝜃), 𝜃 ∈ Θ, and let Θ𝙷 be a non-empty proper subset of Θ. For Question 1, consider the likelihood ratio statistic LR(x) and the test function φ(x) built from it. (A statistic is a quantity whose value depends only on the observed value of X.)

Here C is a threshold that you must choose. When LR(x) is large, the denominator (the likelihood under H) is small, so H should be rejected: φ(x) = 1 means rejecting H, and {LR(x) > C} is the rejection region. When LR(x) is small, the denominator is large and H should be accepted, that is, φ(x) = 0.
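The statistic and test function described above can be written in a standard form, given here as a sketch. Taking the supremum in the numerator over Θ𝙺 is an assumption; some texts take it over all of Θ, which leads to the same kind of rejection region.

```latex
\mathrm{LR}(x) = \frac{\sup_{\theta \in \Theta_K} f(x, \theta)}{\sup_{\theta \in \Theta_H} f(x, \theta)},
\qquad
\varphi(x) =
\begin{cases}
1, & \mathrm{LR}(x) > C \quad \text{(reject } H\text{)} \\
0, & \mathrm{LR}(x) \le C \quad \text{(accept } H\text{)}
\end{cases}
```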

For example, take a test problem in which 𝜃₀ is a given value. Writing down the likelihood and applying maximum likelihood estimation gives an explicit expression for LR(x).

LR(x) is a monotonically increasing function of |\bar{x}-𝜃₀|, which gives the rejection region {|\bar{x}-𝜃₀|>C}; the corresponding acceptance region is {|\bar{x}-𝜃₀|≤C}. Choosing the value of C is a longer story. In general, the larger C is, the more credible a rejection of H becomes. More precisely, the value of C depends on the required test level and on the distribution of \bar{X}. A later article will explain the relationship between the value of C and that distribution.
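As a sketch of how an example of this kind can be worked out, assume the sample x_0, ..., x_{n-1} is i.i.d. N(θ, σ²) with σ² known, and test H: θ = θ₀ against K: θ ≠ θ₀. Both the normal model and the known variance are assumptions made here for illustration.

```latex
% Assumed setup: x_0, ..., x_{n-1} i.i.d. N(\theta, \sigma^2), \sigma^2 known,
% H: \theta = \theta_0  versus  K: \theta \ne \theta_0.
f(x, \theta) = (2\pi\sigma^2)^{-n/2}
  \exp\!\Big( -\frac{1}{2\sigma^2} \sum_{i=0}^{n-1} (x_i - \theta)^2 \Big),
\qquad
\hat{\theta} = \bar{x} = \frac{1}{n} \sum_{i=0}^{n-1} x_i

\mathrm{LR}(x) = \frac{f(x, \bar{x})}{f(x, \theta_0)}
  = \exp\!\Big( \frac{1}{2\sigma^2} \Big[ \sum_{i=0}^{n-1} (x_i - \theta_0)^2
      - \sum_{i=0}^{n-1} (x_i - \bar{x})^2 \Big] \Big)
  = \exp\!\Big( \frac{n\,(\bar{x} - \theta_0)^2}{2\sigma^2} \Big)
```

Because the exponential function is increasing, LR(x) exceeding a threshold is equivalent to |\bar{x}-𝜃₀| exceeding a corresponding constant.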

A/B test

Example 1:

There are 100 plots of land. Each plot is divided into two parts of equal area, one planted with the original rice seeds and the other with the improved rice seeds. For each plot we compute the difference between the yield of the improved seeds and the yield of the original seeds: the difference is positive if the improved seeds yield more and negative if they yield less. This gives 100 numbers, written 𝑥₀, 𝑥₁, …, 𝑥₉₉. If their mean is positive, that suggests the improved seeds have the higher yield. How can hypothesis testing be used to settle this question?

This problem can be translated into:

The sample X follows N(𝜃, σ²), where 𝜃 and σ are both unknown, and the hypothesis testing problem is H: 𝜃 ≤ 0 ↔ K: 𝜃 > 0.

Test procedure:

Given the sample X₀, X₁, …, Xₙ₋₁, write down the likelihood function and form the likelihood ratio LR, in which σ is replaced by its maximum likelihood estimate \hat{σ}.
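For the normal model of Example 1, the likelihood and the resulting likelihood ratio can be sketched as follows. This assumes LR is the ratio of the supremum over K to the supremum over H; the symbols t and s are introduced here for illustration.

```latex
L(\theta, \sigma) = (2\pi\sigma^2)^{-n/2}
  \exp\!\Big( -\frac{1}{2\sigma^2} \sum_{i=0}^{n-1} (x_i - \theta)^2 \Big)

% For fixed \theta, the likelihood is maximized over \sigma at
\hat{\sigma}^2(\theta) = \frac{1}{n} \sum_{i=0}^{n-1} (x_i - \theta)^2,
\qquad
\sup_{\sigma} L(\theta, \sigma) = \big( 2\pi e\, \hat{\sigma}^2(\theta) \big)^{-n/2}

% With t = \sqrt{n}\,\bar{x} / s and s^2 = \frac{1}{n-1} \sum_{i=0}^{n-1} (x_i - \bar{x})^2:
\mathrm{LR}(x) =
\begin{cases}
\big( 1 + \tfrac{t^2}{n-1} \big)^{n/2},  & \bar{x} > 0 \\
\big( 1 + \tfrac{t^2}{n-1} \big)^{-n/2}, & \bar{x} \le 0
\end{cases}
```

In both cases LR(x) is an increasing function of t, so rejecting H when LR(x) > C amounts to rejecting H when t is large, which is the one-sided one-sample t-test.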

Given the expression for LR, the number C is determined by the required test level and by the distribution of the statistic that appears in LR.

If φ(x) = 1, reject H; if φ(x) = 0, accept H.
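The procedure for Example 1 can be sketched in code. For this problem the likelihood ratio test reduces to a one-sided one-sample t-test on the yield differences, which is what the sketch computes. The function name `one_sided_t_test`, the level `alpha=0.05`, and the synthetic data in the usage part are assumptions for illustration; they are not real yield measurements.

```python
import math
import random

from scipy import stats


def one_sided_t_test(diffs, alpha=0.05):
    """Test H: theta <= 0 against K: theta > 0 for i.i.d. normal differences.

    Returns (t_statistic, p_value, reject_H).
    """
    n = len(diffs)
    mean = sum(diffs) / n
    # Unbiased sample variance (divides by n - 1).
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    t_stat = mean / math.sqrt(var / n)
    # Upper-tail probability of a t distribution with n - 1 degrees of freedom.
    p_value = float(stats.t.sf(t_stat, df=n - 1))
    return t_stat, p_value, p_value < alpha


if __name__ == "__main__":
    # Synthetic differences for 100 plots (made-up mean and spread).
    rng = random.Random(0)
    diffs = [rng.gauss(0.5, 2.0) for _ in range(100)]
    t_stat, p_value, reject = one_sided_t_test(diffs)
    print(f"t = {t_stat:.3f}, p = {p_value:.4f}, reject H: {reject}")
```

Rejecting H when the p-value is below `alpha` is the same as rejecting when t exceeds the threshold that corresponds to C at that test level.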

Example 2:

If groups A and B have different sample sizes in the A/B test, the observations cannot be paired, so the problem cannot be reduced to a single random variable (the difference) as in Example 1.

Problem: There are samples X₀, X₁, …, Xₘ₋₁ ~ N(a, σ²) and Y₀, Y₁, …, Yₙ₋₁ ~ N(b, σ²). Here we assume that the distributions of group A and group B have the same variance.

Test procedure:

Combine the two samples into one, (X₀, X₁, …, Xₘ₋₁, Y₀, Y₁, …, Yₙ₋₁), and write down its joint likelihood function in terms of a, b and σ.
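Under the equal-variance normal model stated above, the joint likelihood of the combined sample and its maximum likelihood estimates can be sketched as:

```latex
L(a, b, \sigma) = (2\pi\sigma^2)^{-(m+n)/2}
  \exp\!\Big( -\frac{1}{2\sigma^2} \Big[ \sum_{i=0}^{m-1} (x_i - a)^2
      + \sum_{j=0}^{n-1} (y_j - b)^2 \Big] \Big)

\hat{a} = \bar{x}, \qquad \hat{b} = \bar{y}, \qquad
\hat{\sigma}^2 = \frac{1}{m+n} \Big[ \sum_{i=0}^{m-1} (x_i - \bar{x})^2
      + \sum_{j=0}^{n-1} (y_j - \bar{y})^2 \Big]
```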

Then LR is derived in the same way, and the number C is determined by the required test level and by the distribution of the statistic that appears in LR. The derivation is rather tedious, so I skip it here. If you are interested, see Chen Xiru’s Mathematical Statistics Tutorial, Section 3.6, Test of Normal Distribution Parameters (4) Two-sample t-test.
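The two-sample t-test named in that reference can be sketched as below, under the equal-variance assumption of Example 2. The hypothesis direction (H: b ≤ a against K: b > a, that is, group B is better), the function name, and the level `alpha=0.05` are assumptions for illustration, since the text above does not fix them.

```python
import math

from scipy import stats


def pooled_two_sample_t_test(x, y, alpha=0.05):
    """Test H: b <= a against K: b > a, where x ~ N(a, s^2), y ~ N(b, s^2).

    Returns (t_statistic, degrees_of_freedom, p_value, reject_H).
    """
    m, n = len(x), len(y)
    mean_x = sum(x) / m
    mean_y = sum(y) / n
    ss_x = sum((v - mean_x) ** 2 for v in x)
    ss_y = sum((v - mean_y) ** 2 for v in y)
    df = m + n - 2
    # Pooled variance estimate, valid under the equal-variance assumption.
    pooled_var = (ss_x + ss_y) / df
    t_stat = (mean_y - mean_x) / math.sqrt(pooled_var * (1.0 / m + 1.0 / n))
    # Upper-tail probability of a t distribution with m + n - 2 degrees of freedom.
    p_value = float(stats.t.sf(t_stat, df=df))
    return t_stat, df, p_value, p_value < alpha


if __name__ == "__main__":
    # Tiny made-up samples of unequal size, only to show the call.
    group_a = [1.0, 2.0, 3.0, 4.0, 5.0]
    group_b = [3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
    print(pooled_two_sample_t_test(group_a, group_b))
```

The two groups may have different sizes m and n; only the common variance is shared between them.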

Summary

So far we have given a preliminary explanation of the relationship between A/B testing and hypothesis testing. The overall logic of how the likelihood ratio test supports an A/B test should now be relatively clear, but the specific calculations and some details have not been covered.

What remains to be explained includes:

  • What is the test level?
  • How should the number C be chosen, and how is it related to the test level?
  • The number C is also related to the distribution of the statistic in LR. What is that relationship?
  • Here we assumed that the samples of group A and group B follow normal distributions with equal variance. In practice the variances may differ. How should that case be handled?

These questions will be answered one by one in the following articles.

Many articles in this column solve some problems only to a certain extent, so I encourage everyone to keep researching the topics that interest them in more depth, using the references listed in the articles.

Reference

  1. 人话版 Hypothesis Testing(假设检验) [Hypothesis Testing in Plain Language]
  2. Mathematical Statistics Tutorial by Chen Xiru
  3. Welch’s t-test — Wikipedia
