How to apply hypothesis tests to marketing data

James Wu
8 min read · Jan 21, 2019

--

A statistical way to draw a conclusion from data

Drawing conclusions from a data set is an essential data-analysis skill in every industry, and even more so in the marketing world. When you get your hands on a set of marketing data, how do you know whether the differences you discovered truly reflect the entire population rather than just the samples you have? When you are analyzing the results of an A/B test, how confident are you in the performance lift, knowing it might be due to random chance? These are situations we constantly run into, and the hypothesis test offers a great scientific tool to help us evaluate the results and instill confidence in our conclusions.

A hypothesis test typically works like this:

  • State the hypotheses. Every hypothesis test involves a null and an alternative hypothesis, which are mutually exclusive. e.g.: null hypothesis — the average exam scores of the two classes are equal: x1 = x2; alternative hypothesis — the average exam scores of the two classes are not equal: x1 != x2.
  • Formulate an analysis plan. This step involves picking a test method: z-test, t-test, chi-square, etc. Then pick a significance level, α. This is the threshold at or below which a probability is considered statistically unlikely, so you can reject the null hypothesis and accept the alternative hypothesis. Typically α=0.05* (<5% probability), but I’ll put an asterisk on this because it is loosely followed in our industry. If you are in another industry like aerospace, you may want to use an even smaller α.
  • Analyze sample data. This step involves calculating a test statistic and a p-value, which is just recalibrating the numbers onto a distribution curve with one sample group fixed at zero and the other projected onto the distribution. The observation is associated with a p-value, which is the probability of getting an observation equally or more extreme when the null hypothesis is true. The smaller the p-value, the farther the observation is from zero. Typically, if the p-value is below the α level, the null hypothesis is rejected.
  • Draw conclusions.
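The four steps can be sketched end to end in Python with scipy. The exam-score numbers below are made up purely for illustration:

```python
import numpy as np
from scipy import stats

# Hypothetical exam scores for the two classes (made-up data)
class_1 = np.array([72, 85, 90, 68, 77, 81, 74, 88])
class_2 = np.array([65, 70, 79, 62, 74, 68, 71, 66])

# Step 2: pick a significance level
alpha = 0.05

# Step 3: two-sample t-test (two-tailed by default)
t_stat, p_value = stats.ttest_ind(class_1, class_2)

# Step 4: draw a conclusion
if p_value < alpha:
    print(f"p = {p_value:.4f} < {alpha}: reject the null hypothesis")
else:
    print(f"p = {p_value:.4f} >= {alpha}: fail to reject the null hypothesis")
```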

Out of all the test methods, I personally like t-tests the most. 1) The t-test is based on sample statistics, which means it is not limited to population parameters like the z-test. In the real world, it is simply impractical to obtain population metrics, so we rarely get a chance to use a z-test unless the sample size is really big. It is worth noting that the bigger the sample size, the closer the t-test and z-test become, which intuitively makes sense. 2) Another reason I like the t-test is its range of applications: regardless of the sample size or the sample data type (categorical or numeric), you can always formulate two sample groups to compare. 3) The t-test is easy to explain: it is a hypothesis-test method comparing the mean values of two sample groups. When one sample mean is statistically much greater than the other, it is intuitive to reject the null hypothesis and conclude the two sample means are not equal. This is unlike methods such as the F-test or ANOVA, which compare sample variances. Those are more complicated to set up and, most importantly, good luck trying to explain variance to your management team.

So without further ado, let’s go through a t-test and see how it works. Note this example is a two-sample independent t-test: independent because the two sample groups are unrelated. A dependent or paired t-test is when you are testing different conditions on the same sample group. The dependent t-test is easier, and I will provide the equations and use cases after this exercise.

Let’s say you have a set of event data with information like the number of staff on-site, the number of consumers reached (the KPI), etc. Let’s assume all other attributes are the same, and we want to analyze whether the number of staff on-site (one versus many) has an impact on the performance KPI. Here are some sample statistics:

Step 1: state the hypothesis

Null hypothesis: x1 = x2

Alternative hypothesis: x1 != x2

Because we are testing for “not equal”, x1 can be greater than or less than x2. This is called a two-tailed test. If you are only looking at one tail, you can set up the hypotheses like this: null hypothesis x1 >= x2, alternative hypothesis x1 < x2. See this visually in the diagram below.

One-tailed vs. two-tailed test

I wouldn’t worry too much about choosing a one-tailed or two-tailed test at this point, because as we go through this example you will see that we can quickly change a two-tailed test into a one-tailed test by dividing the p-value in half, or vice versa.

Step 2: Formulate an analysis plan.

We are already going with a t-test, and let’s just set α=0.05 for now.

Step 3: Analyze sample data

This is the step where we calculate the t-statistic. Here is the equation:

Independent t-test equation

If you look online, there are several versions of this equation. I like this version because it works in any circumstance. Notice you have to calculate the combined variance first before solving for t, and the degrees of freedom are used at the end to compute the p-value. You can either estimate p using a t-table or feed the numbers into Python or another p-value calculator. Here is a Python example.

from scipy import stats
import numpy as np

# t is the t-statistic and df the degrees of freedom computed above
pval_1 = stats.t.sf(np.abs(t), df)      # one-tailed test
pval_2 = stats.t.sf(np.abs(t), df) * 2  # two-tailed test
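The t-statistic and degrees of freedom fed into the call above can be computed from the equation directly. Here is a sketch of a pooled-variance form (the group samples below are made up, and the function name is my own):

```python
import numpy as np
from scipy import stats

def independent_t_test(x1, x2):
    """Two-sample independent t-test with pooled (combined) variance."""
    n1, n2 = len(x1), len(x2)
    # The combined variance is computed first, as noted above
    sp2 = ((n1 - 1) * np.var(x1, ddof=1) + (n2 - 1) * np.var(x2, ddof=1)) / (n1 + n2 - 2)
    t = (np.mean(x1) - np.mean(x2)) / np.sqrt(sp2 * (1 / n1 + 1 / n2))
    df = n1 + n2 - 2  # degrees of freedom, used to look up the p-value
    return t, df

# Made-up KPI samples for the two staffing groups
group_a = [12, 15, 11, 14, 13, 16]
group_b = [10, 9, 12, 8, 11, 10]
t, df = independent_t_test(group_a, group_b)
pval_2 = stats.t.sf(np.abs(t), df) * 2  # two-tailed p-value
```

You can cross-check the result against scipy’s built-in stats.ttest_ind, which uses the same pooled-variance formula by default.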

Here are the calculated results. As we stated earlier, a one-tailed p-value is just the two-tailed p-value divided by two.

Step 4: Draw a conclusion

At this point, I hope you still remember our null hypothesis: we believe group A (one staff) and group B (more than one staff) have the same KPI performance, x1 = x2. I purposely set up the example this way because the one-tailed and two-tailed p-values yield different conclusions at α=0.05. The one-tailed test says we should reject the null hypothesis, while the two-tailed test suggests we don’t have enough evidence. This is where you have to use your own judgment. If it were up to me, I would conclude the result is statistically significant and reject the null hypothesis, because the problem is really about whether group B truly has a higher average, so a one-tailed test makes more sense. Plus, 0.0564 is already close enough: it sits just outside the 94.4% probability range, and this is an example of the α=0.05 rule being loosely followed.

Now let me quickly show you the dependent t-test. It is applied when measurements are taken on the same set of samples, for example patients’ IQ test scores before and after taking a performance-enhancing drug. The two data sets are based on the same set of patients, and this is when a dependent t-test should be applied. Here are the equations, and the procedure is the same as above.

Dependent or paired t-test
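As a sketch, the paired test reduces to a one-sample t-test on the per-patient differences. The before/after scores below are made up for illustration:

```python
import numpy as np
from scipy import stats

# Hypothetical IQ scores for the same patients before and after (made-up data)
before = np.array([100, 95, 110, 102, 98, 105])
after = np.array([104, 97, 113, 101, 103, 110])

# A paired t-test is a one-sample t-test on the per-patient differences
d = after - before
t = np.mean(d) / (np.std(d, ddof=1) / np.sqrt(len(d)))
df = len(d) - 1
p_two = stats.t.sf(np.abs(t), df) * 2  # two-tailed p-value

# scipy gives the same result directly
t_scipy, p_scipy = stats.ttest_rel(after, before)
```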

Lastly, I want to share another test method I frequently use, the chi-square test. It is used when evaluating nominal (categorical) values. The rest of the concepts are the same, so I will just quickly go through a real-life example.

I was given the task of evaluating the outcome of two campaign creatives based on the click-through performance of a recent A/B test. The results are shown below:

numbers are made up

Notice we have to use the chi-square test because we are comparing counts of a nominal variable (clicks) instead of numeric sample data sets.

Step 1: state the hypothesis

We want to know if there is a significant difference between Creative A and Creative B on the click-through rates.

null hypothesis: no difference

alternative hypothesis: yes there is a difference

Step 2: Formulate an analysis plan.

We are already going with a chi-square test, and let’s still use α=0.05.

Step 3: Analyze sample data

To calculate chi-square, we first need the expected value in each of the 4 cells when everything is left completely to chance. To do this, first sum up the rows and columns as shown in the table below. These sums give the ratio of each attribute. For example, there are 60 total clicks in this test; based on the impression ratio between Creative A and B (25:75), if everything is left to chance, Creative A is expected to get 1/4 of the clicks, which is 15 (60 × 25/(25+75)). All the expected values are shown in red below.

chi-square equation

Then the rest is easy: take the squared difference between the observed and expected values, divide by the expected value, and sum up the results. Again, you can estimate the p-value by looking it up in a chi-square table or feeding the number into any p-value calculator. Python code and results are provided below:

from scipy import stats

# x2 is the chi-square statistic and df the degrees of freedom
pval = 1 - stats.chi2.cdf(x2, df)
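scipy can also run the whole test from the observed table, including the expected-value step. The counts below are hypothetical (the article’s actual table is in the image above), but they keep the 25:75 impression split and 60 total clicks, so Creative A’s expected clicks come out to 15:

```python
import numpy as np
from scipy import stats

# Hypothetical 2x2 table: rows = Creative A / B, columns = [clicks, no-clicks]
observed = np.array([[12, 238],    # Creative A: 250 impressions
                     [48, 702]])   # Creative B: 750 impressions

# correction=False gives the plain (uncorrected) chi-square described above
chi2, pval, df, expected = stats.chi2_contingency(observed, correction=False)
```

With a 2x2 table, the degrees of freedom are (2-1) × (2-1) = 1, and `expected` holds the same red expected values computed by hand above.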

Step 4: Draw a conclusion

0.157 is significantly greater than α=0.05, therefore we cannot reject the null hypothesis. Even though creative B got a lot more clicks, it was also shown a lot more frequently than creative A. Therefore we don’t have enough evidence to believe one was better than the other on click-through rate, and more A/B testing is needed to collect more data.
