A/B testing and hypothesis testing II -- the significance level

Qiang Chen
Machine Learning and Math
9 min read · Nov 24, 2018

Foreword

A previous article, A/B testing and hypothesis testing I, covered the reasons for A/B testing and how the theory of hypothesis testing provides a basis for it. However, for reasons of space, some details and steps were left unexplained:

  • What is the significance level?
  • How should the number C be chosen, and how does it relate to the significance level?
  • The number C is also related to the distribution of the statistic in LR. What is that relationship?
  • The samples of group A and group B were assumed to follow normal distributions with equal variance. In practice the variances are not equal, so how should that be handled?

This article is the sequel to A/B testing and hypothesis testing I and discusses the questions above one by one. Its content centers on the significance level, hence the title: A/B testing and hypothesis testing II, the significance level.

Review

Question 1: Let the sample X have probability function f(x, 𝜃), 𝜃 ∈ Θ, and let Θ𝙷 be a non-empty proper subset of Θ. The proposition H: 𝜃 ∈ Θ𝙷 is called the null hypothesis; its exact meaning is that there exists 𝜃₀ ∈ Θ𝙷 such that the probability function of X is f(x, 𝜃₀). Let Θ𝙺 = Θ \ Θ𝙷; then the proposition K: 𝜃 ∈ Θ𝙺 is called the alternative (opposite) hypothesis of H.

H: 𝜃 ∈ Θ𝙷 ↔ K: 𝜃 ∈ Θ𝙺 is called a hypothesis testing problem. The goal is to decide, based on the observed value of X, whether H is correct, or equivalently to choose between H and K.

Likelihood ratio hypothesis test

Let the sample X have probability function f(x, 𝜃), 𝜃 ∈ Θ, and let Θ𝙷 be a non-empty proper subset of Θ. For Question 1, consider the likelihood ratio statistic LR(x) (a statistic is a quantity whose value depends only on the observed value of X) and the test φ that rejects H when LR(x) > C.

Here C is a threshold that you must choose. When LR(x) is large, the denominator (the likelihood under H) is small, so H should be rejected: φ(x) = 1, and {LR(x) > C} is the rejection region. When LR(x) is small, the denominator is large and H should be accepted.
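Written out, the statistic and the test described here take the following form. This is a sketch of the standard likelihood ratio test; taking the supremum in the numerator over all of Θ (rather than over Θ𝙺 only) is an assumption made for this sketch.

```latex
\mathrm{LR}(x) = \frac{\sup_{\theta \in \Theta} f(x, \theta)}{\sup_{\theta \in \Theta_H} f(x, \theta)},
\qquad
\varphi(x) =
\begin{cases}
1, & \mathrm{LR}(x) > C \quad (\text{reject } H) \\
0, & \mathrm{LR}(x) \le C \quad (\text{accept } H)
\end{cases}
```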

For example, let

X₁, …, X𝚗 be i.i.d. N(𝜃, 1), with H: 𝜃 = 𝜃₀ ↔ K: 𝜃 ≠ 𝜃₀,

where 𝜃₀ is given.

We have

f(x, 𝜃) = (2π)^{-n/2} exp( -(1/2) Σ (x𝚒 - 𝜃)² ).

Maximum likelihood estimation shows that the supremum over 𝜃 is attained at 𝜃 = \bar{x}:

sup f(x, 𝜃) = (2π)^{-n/2} exp( -(1/2) Σ (x𝚒 - \bar{x})² ).

Then

LR(x) = sup f(x, 𝜃) / f(x, 𝜃₀) = exp( (n/2) (\bar{x} - 𝜃₀)² ).

This is a monotonically increasing function of |\bar{x}-𝜃₀|, which gives the rejection region {|\bar{x}-𝜃₀| > C}; the acceptance region is correspondingly {|\bar{x}-𝜃₀| ≤ C}. Choosing C is a longer story. In general, the more credible you want a rejection of H to be, the larger C should be. More precisely, the value of C depends on the required test level and on the distribution of \bar{X}. How C relates to that distribution is the subject of this article.
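The monotone relationship can be checked numerically. The sketch below assumes i.i.d. N(𝜃, 1) observations and H: 𝜃 = 𝜃₀, computes the log likelihood ratio directly from the densities, and compares it with the closed form (n/2)(\bar{x} - 𝜃₀)². The function names and the simulated data are illustrative, not part of the original article.

```python
import math
import random

def log_likelihood(xs, theta):
    """Log-density of i.i.d. N(theta, 1) observations xs."""
    n = len(xs)
    return -0.5 * n * math.log(2 * math.pi) - 0.5 * sum((x - theta) ** 2 for x in xs)

def log_lr(xs, theta0):
    """Log LR: the maximized log-likelihood (at the sample mean) minus the value at theta0."""
    x_bar = sum(xs) / len(xs)
    return log_likelihood(xs, x_bar) - log_likelihood(xs, theta0)

def log_lr_closed_form(xs, theta0):
    """Closed form (n / 2) * (x_bar - theta0)^2 for the same quantity."""
    n = len(xs)
    x_bar = sum(xs) / n
    return 0.5 * n * (x_bar - theta0) ** 2

random.seed(0)  # fixed seed so the run is repeatable
theta0 = 0.0
xs = [random.gauss(0.3, 1.0) for _ in range(50)]  # made-up sample

print(log_lr(xs, theta0), log_lr_closed_form(xs, theta0))
```

The two printed values should agree up to floating-point rounding, and both grow as \bar{x} moves away from 𝜃₀.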

Significance level and what does it have to do with the number C?

\bar{x} is an observation of T = (X₁ + X₂ + … + X𝚗) / n. Since X₁, …, X𝚗 ~ N(𝜃, 1) independently, T ~ N(𝜃, 1/n); that is, \bar{x} can be regarded as a single draw from the distribution of T, even though it is computed as the average of a series of data. If H: 𝜃 = 𝜃₀ is true, then \bar{x} is a draw from N(𝜃₀, 1/n), so there is still some probability that \bar{x} deviates far from 𝜃₀. In other words, with a certain probability |\bar{x}-𝜃₀| > C occurs and H is rejected even though it is true; H is then wrongly rejected. Similarly, when H does not hold, H may still fail to be rejected. These two kinds of mistakes are called the type I error and the type II error, respectively.
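Both kinds of error can be seen in a small simulation. The sketch below assumes N(𝜃, 1) data, a hypothetical threshold C = 1.96 / √n (roughly the 5% level), and a hypothetical true mean of 𝜃₀ + 0.5 for the case where H is false. None of these numbers come from the article.

```python
import math
import random

def rejection_rate(theta_true, theta0, n, c, trials, rng):
    """Fraction of simulated samples whose mean falls in {|x_bar - theta0| > c}."""
    rejections = 0
    for _ in range(trials):
        x_bar = sum(rng.gauss(theta_true, 1.0) for _ in range(n)) / n
        if abs(x_bar - theta0) > c:
            rejections += 1
    return rejections / trials

rng = random.Random(0)      # fixed seed so the run is repeatable
theta0, n, trials = 0.0, 25, 10000
c = 1.96 / math.sqrt(n)     # assumed threshold, roughly the 5% level

# Type I error: H is true (theta = theta0) but H is rejected.
type_1 = rejection_rate(theta0, theta0, n, c, trials, rng)
# Type II error: H is false (theta = theta0 + 0.5) but H is not rejected.
type_2 = 1.0 - rejection_rate(theta0 + 0.5, theta0, n, c, trials, rng)

print(f"type I error rate  ~ {type_1:.3f}")
print(f"type II error rate ~ {type_2:.3f}")
```

With these settings the simulated type I error rate should land near 0.05, while the type II error rate depends on how far the true mean is from 𝜃₀.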

type I error and type II error
precision and recall

Precision and recall from binary classification are used here to better understand the type I error and the type II error.

Suppose we have a trained classifier that judges whether a picture contains a hot dog. For any given picture it outputs the probability that the picture contains a hot dog, a number between 0 and 1. We set a threshold B: if the probability is greater than B, the picture is judged to contain a hot dog; if it is less than or equal to B, it is judged not to.

Figure: the four combinations of judgment result and actual result (TP, FP, FN, TN)

For a data set of N items, once B is set there are four combinations of judgment result and actual result. In the above figure, TP, FP, FN and TN denote the number of items in each combination, and clearly TP + FP + FN + TN = N. TP / (TP + FP) is called precision: the proportion of pictures that truly contain a hot dog among all pictures judged to contain one. TP / (TP + FN) is called recall: the proportion of all pictures that truly contain a hot dog that are judged to contain one.

For example, a person who wants to see hot dog photos without seeing too many pictures that are not hot dogs is hoping for high precision. A person who wants to see as many of the hot dog photos in the gallery as possible, and does not mind seeing some that are not hot dogs, wants high recall. In general, raising B increases precision but lowers recall, while lowering B increases recall but reduces precision. Of course, if the classifier is very well trained, both recall and precision remain high over a wide range of B.

1 - precision can be loosely compared to the type I error: both count cases that are wrongly flagged. Strictly speaking, the type I error rate is measured among the cases where H is true, so it corresponds to FP / (FP + TN) rather than FP / (TP + FP). Controlling the type I error protects the credibility of K, that is, the credibility of a decision to reject the null hypothesis H. The smaller the type I error, the more credible a rejection of the null hypothesis.
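A minimal sketch of the threshold B and the resulting counts, using made-up scores and labels (hypothetical data, not from the article). Recall is computed as TP / (TP + FN), the standard definition.

```python
def confusion_counts(scores, labels, threshold):
    """Count TP, FP, FN, TN when a score above `threshold` is judged positive."""
    tp = fp = fn = tn = 0
    for score, label in zip(scores, labels):
        predicted = score > threshold
        if predicted and label:
            tp += 1
        elif predicted and not label:
            fp += 1
        elif not predicted and label:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def precision_recall(scores, labels, threshold):
    """Precision TP / (TP + FP) and recall TP / (TP + FN); NaN if a denominator is zero."""
    tp, fp, fn, tn = confusion_counts(scores, labels, threshold)
    precision = tp / (tp + fp) if tp + fp else float("nan")
    recall = tp / (tp + fn) if tp + fn else float("nan")
    return precision, recall

# Hypothetical classifier outputs and true labels (1 = contains a hot dog).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20]
labels = [1, 1, 0, 1, 1, 0, 1, 0]

for b in (0.25, 0.50, 0.85):
    p, r = precision_recall(scores, labels, b)
    print(f"B={b:.2f}  precision={p:.2f}  recall={r:.2f}")
```

On this toy data, raising B from 0.25 to 0.85 moves precision up and recall down, which is the trade-off described above.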

The probability of the type I error is the significance level, and C can be adjusted to achieve different significance levels. Commonly used significance levels are 0.001, 0.005, 0.01 and 0.05. The level bounds how often H is rejected when it is in fact true, so a smaller level makes a rejection more credible, much as a stricter threshold B makes a positive judgment more trustworthy.

For example, suppose 10,000 square meters are planted with a new crop, and the yield per square meter is normally distributed. We sample 100 square meters to see whether the yield is higher than the previous level 𝜃₀. Here H: 𝜃 ≤ 𝜃₀ means the crop does not meet expectations, and K: 𝜃 > 𝜃₀ means it does. If the type I error probability is 0.01, that is, the significance level is 0.01, then when the crop in fact does not meet expectations, there is at most a 1% probability of rejecting H and wrongly concluding that it does. Put differently, if 100 such experiments were run on crops that truly do not meet expectations, on average at most about one would be wrongly judged as meeting expectations. Note that the undesired outcome must be assigned to H and the hoped-for outcome to the alternative K; otherwise this guarantee does not apply to the conclusion we care about. When H (also called the null hypothesis or the original hypothesis) is rejected, the result is said to be significant.
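The crop example can be sketched as a one-sided test. The code below assumes the standard deviation of the yield is known, and every number in it (previous yield 5.0, standard deviation 1.0, the two sample means) is hypothetical. It uses `statistics.NormalDist` from the Python standard library for the normal quantile.

```python
import math
from statistics import NormalDist

def one_sided_threshold(theta0, sigma, n, alpha):
    """Boundary above which the sample mean rejects H: theta <= theta0 at level alpha.

    Assumes the yield is normal with known standard deviation sigma.
    """
    z = NormalDist().inv_cdf(1.0 - alpha)
    return theta0 + z * sigma / math.sqrt(n)

def reject_h(x_bar, theta0, sigma, n, alpha):
    """True if the observed mean x_bar lies in the rejection region."""
    return x_bar > one_sided_threshold(theta0, sigma, n, alpha)

# Hypothetical numbers: previous yield 5.0 per square meter, sigma 1.0, 100 plots sampled.
theta0, sigma, n, alpha = 5.0, 1.0, 100, 0.01

print(one_sided_threshold(theta0, sigma, n, alpha))
print(reject_h(5.30, theta0, sigma, n, alpha))
print(reject_h(5.20, theta0, sigma, n, alpha))
```

Under these assumptions the boundary sits a little above 5.23, so a sample mean of 5.30 rejects H while 5.20 does not.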

The analysis of significance level, number C and LR

In the above example,

X₁, …, X𝚗 are i.i.d. N(𝜃, 1), with H: 𝜃 = 𝜃₀ ↔ K: 𝜃 ≠ 𝜃₀,

where 𝜃₀ is given. Then

LR(x) = exp( (n/2) (\bar{x} - 𝜃₀)² ).

This is a monotonically increasing function of |\bar{x}-𝜃₀|, which gives the rejection region {|\bar{x}-𝜃₀| > C}; the acceptance region is correspondingly {|\bar{x}-𝜃₀| ≤ C}.

The probability of the type I error is

P(reject H | H holds).

Rejecting H means |\bar{x}-𝜃₀| > C, and H holding means 𝜃 = 𝜃₀, so a significance level of 0.05 means

ℬ = P(|\bar{x}-𝜃₀| > C | 𝜃 = 𝜃₀) = 0.05

Since 𝜃 = 𝜃₀, we have \bar{x} ~ N(𝜃₀, 1/n) and therefore \bar{x} - 𝜃₀ ~ N(0, 1/n).
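From this distribution the threshold can be derived in closed form. Here Φ denotes the standard normal distribution function and z_p its p-quantile; this notation is introduced for the sketch and is not used in the original.

```latex
\mathcal{B}
= P\left(|\bar{x} - \theta_0| > C \mid \theta = \theta_0\right)
= P\left(\sqrt{n}\,|\bar{x} - \theta_0| > \sqrt{n}\,C \mid \theta = \theta_0\right)
= 2\left(1 - \Phi(\sqrt{n}\,C)\right)
\quad\Longrightarrow\quad
C = \frac{z_{1 - \mathcal{B}/2}}{\sqrt{n}}
```

For ℬ = 0.05 the quantile is z_{0.975} ≈ 1.96, so C ≈ 1.96 / √n.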

Figure: how to choose C to meet the significance level of 0.05

As shown above, \bar{x} - 𝜃₀ has the density of N(0, 1/n), where n is the number of observations x𝚒. From this density and the test level 0.05, a suitable C can be found: it is the value for which the two tail areas beyond ±C (the pink regions) add up to 0.05. As the figure shows, if the test level needs to be smaller, that is, if the probability of a type I error must be smaller, C has to be increased. The rejection region {|\bar{x} - 𝜃₀| > C} then shrinks, making it harder to reject the null hypothesis. If \bar{x} stays the same while the number of observations n grows and the test level is unchanged, the density becomes more concentrated around 0, so the area beyond the current C (the pink region) becomes smaller than the test level. C can therefore be reduced as n grows, which allows the same \bar{x} to meet the standard for rejecting the null hypothesis.
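The two effects described here (a smaller level needs a larger C, a larger n allows a smaller C) can be tabulated. The sketch below assumes \bar{x} - 𝜃₀ ~ N(0, 1/n) under H and uses `statistics.NormalDist` from the Python standard library. The function names and the sample sizes are illustrative choices.

```python
import math
from statistics import NormalDist

def critical_c(alpha, n):
    """C with P(|x_bar - theta0| > C | theta = theta0) = alpha when x_bar - theta0 ~ N(0, 1/n)."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return z / math.sqrt(n)

def type_1_probability(c, n):
    """Two-sided tail probability of N(0, 1/n) outside [-c, c]."""
    dist = NormalDist(mu=0.0, sigma=1.0 / math.sqrt(n))
    return 2.0 * (1.0 - dist.cdf(c))

for n in (25, 100, 400):
    row = ", ".join(f"level {a}: C={critical_c(a, n):.3f}" for a in (0.001, 0.005, 0.01, 0.05))
    print(f"n={n:3d}  {row}")
```

Reading across a row, C grows as the level shrinks; reading down a column, C shrinks as n grows.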

Figure: the relationship between LR, the significance level ℬ and C

Summary

To better understand the type I error, the binary classification concepts of precision and recall were introduced. The probability of the type I error, also known as the significance level, controls how trustworthy a decision to reject the null hypothesis is; we want a rejection to be something we can rely on. In an A/B test, the null hypothesis H is typically that there is no difference between group A and group B, or that group B does not improve the user experience, while K states that group B is better than group A and improves the user experience. When the null hypothesis is rejected at a given significance level, this indicates that group B does bring an improvement, with a controlled risk of error. If the significance level is 0.05, then when group B in fact brings no improvement, the test wrongly declares an improvement at most 5% of the time. Loosely speaking, this plays the role that precision plays for a classifier, although the two quantities are not identical.

Of the questions left open earlier:

  • What is the significance level?
  • How should the number C be chosen, and how does it relate to the significance level?
  • The number C is also related to the distribution of the statistic in LR. What is that relationship?
  • The samples of group A and group B were assumed to follow normal distributions with equal variance. In practice the variances are not equal, so how should that be handled?

This article addresses the first three questions. The last one, the analysis of A/B tests in a real industrial setting, is omitted for reasons of length and will be explained in a later article.

References

  1. Mathematical Statistics Tutorial by Chen Xiru
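  2. A/B Testing and Hypothesis Testing (A/B测试与假设检验)
  3. Comzyh: Is the Sum of Normally Distributed Random Variables Still Normally Distributed? (正态分布随机变量的和还是正态分布吗?)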
  4. Precision and recall
