A Guide on How to Simulate and Visualize Hypothesis Tests Using Python Code

Hamid Omar
Published in Analytics Vidhya · Jun 30, 2021 · 9 min read

We will be using the Credit Card Churn dataset to demonstrate the use of Monte Carlo simulation

Hypothesis tests, if conducted properly, help in separating signal from noise in our data. However, calculating the metrics for hypothesis tests by hand can be quite dull. Online tools and spreadsheets let us perform hypothesis tests easily, but they hide the computations that produce the results. Given that, it becomes easy for data science learners to dismiss hypothesis testing, especially next to cooler, fancier-sounding statistical modelling techniques in machine learning and deep learning.

In this blog, I will demonstrate an alternative approach to hypothesis testing that is easier to follow, fun to implement, and can be adapted to a ridiculously wide range of scenarios.

Monte Carlo Simulations

Monte Carlo simulations are used to model various probabilistic events. The idea is to capture the multitude of possible outcomes by repeating an event in a probabilistically simulated manner until many observations have been collected. This method has the advantage of solving complex problems in a much simpler way, with easily interpretable results.
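As a toy illustration (my own sketch, not code from the original post), the essence of the method fits in a few lines of Python: we estimate the probability of heads for a fair coin simply by simulating many flips and counting.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate 100,000 flips of a fair coin: 0 = tails, 1 = heads.
flips = rng.integers(0, 2, size=100_000)

# The empirical frequency approaches the true probability (0.5)
# as the number of simulated trials grows.
print(flips.mean())
```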

The method was named after the popular gambling destination in Monaco. Coincidentally (or not), when Stanislaw Ulam devised the Monte Carlo method, he was trying to map out the possible outcomes of a game of solitaire.

The Monte Carlo method was devised by Stanislaw Ulam while he tried to model the possible outcomes of a game of solitaire

You will find Monte Carlo simulations used in a broad range of domains, including (but not limited to) finance, supply chain, project management, and engineering.

In this blog, I will talk about how we can use this method to better understand, interpret and present conclusions in the context of inferential analytics.

Data and Problem Statement

For this experiment, we are using the Churn Modelling dataset from Kaggle. Churn modelling is a problem where we try to predict whether a customer is going to stop using a service (in this case, their credit card) within a given period of time. Churn is also known as attrition.

Beyond prediction, it is often useful to know whether an independent variable has any impact on the response variable, and hypothesis tests are a common way to check this. For example, an executive of a company may want to know whether discounts (independent variable) have any impact on sales (response variable).

To explain how hypothesis tests work, we will use the Churn Modelling data to see whether the gender of a customer (independent variable) has any impact on whether they choose to leave the credit card company within a year (response variable).

The Hypothesis

The hypothesis forms the base of our experiment, on which the rest of the steps are built. Here, we define the null and alternate hypotheses for our experiment.

Ho: There is no significant difference between the proportion of men who churned and the proportion of women who churned. OR p1 = p2.

H1: There is a significant difference between the proportion of men who churned and the proportion of women who churned. OR p1 != p2.

Let us assume the alpha value to be 0.05 (if you don't know what is meant by the alpha value, don't worry, we will discuss it later in the blog).

Our aim is to check whether there is enough evidence to reject the null hypothesis. Please note that, at various points in this blog, I will refer to these proportions as sample statistics. We use this term because we don't know the actual proportions of male and female customers who left; rather, we have to make do with sample data taken from the population.

The Data

To find out whether a particular gender has a higher likelihood of churning, we start by visualizing the dataset.
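The original plotting code is not shown here, so below is a minimal sketch of how such a graph could be produced, assuming the Kaggle file is named Churn_Modelling.csv and contains the Gender and Exited (0/1) columns:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("Churn_Modelling.csv")  # assumed file name

# Count customers and churners per gender, then compute churn proportions.
counts = df.groupby("Gender")["Exited"].agg(total="count", churned="sum")
counts["proportion"] = counts["churned"] / counts["total"]
print(counts)

counts["proportion"].plot(kind="bar")
plt.ylabel("Proportion of customers who exited")
plt.title("Churn proportion by gender")
plt.show()
```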

The proportion of females is clearly higher, but is it high enough?

The above graph visualizes the proportions of men and women who stopped using their credit cards. It shows that out of 5,457 men, about 900 left the credit card company; for women, the corresponding figures are about 4,543 and 1,140.

Though it is clear that the proportion of women is higher, without further analysis, it is difficult to say whether the difference in proportion is significant or not.

Naturally, the question arises as to what statistical techniques can be used to check for significance.

Bootstrapping

Bootstrapping, or bootstrap sampling, is a resampling technique in statistics that is used to estimate population metrics.

To perform bootstrapping, we pick data points from the sample dataset with replacement; the chosen data points then form our new dataset. Because of the sampling with replacement, the same observation may occur multiple times in the new dataset (which does not create any issue for the statistical analysis).

If you still feel confused, please refer to the links I have added below for more material on bootstrapping.

We draw 200 bootstrapped datasets from the male customers, each containing 500 samples taken from the original data, and we do the same for the female customers. For each of the datasets thus created, we calculate the proportion of people who left the service. This gives us a sampling distribution for each gender, built from 200 sample statistics.
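Here is a sketch of that bootstrapping step, continuing from the df loaded earlier (the column names are, again, my assumption about the Kaggle file):

```python
import numpy as np

rng = np.random.default_rng(0)

# 0/1 churn flags for each gender.
male = df.loc[df["Gender"] == "Male", "Exited"].to_numpy()
female = df.loc[df["Gender"] == "Female", "Exited"].to_numpy()

def bootstrap_proportions(data, n_datasets=200, n_samples=500):
    """Draw `n_datasets` resamples of size `n_samples` with replacement
    and return the churn proportion of each resample."""
    resamples = rng.choice(data, size=(n_datasets, n_samples), replace=True)
    return resamples.mean(axis=1)

male_props = bootstrap_proportions(male)
female_props = bootstrap_proportions(female)
```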

We now have two sampling distributions, one for female customers and one for male customers. If we treat each of the proportions as a random variable (which we can, as they fit the definition), we can view the sampling distribution as a histogram showing the values the proportion can take as well as how frequently those values occur.

If the histograms plotted for the two genders overlap significantly, we can be confident that there is no significant difference between the two proportions (and vice versa).

Having said that, let’s check out the results…
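A minimal sketch of the plot, continuing from the male_props and female_props arrays built above:

```python
import matplotlib.pyplot as plt

# Overlay the two sampling distributions of bootstrapped churn proportions.
plt.hist(male_props, bins=20, alpha=0.6, label="Men")
plt.hist(female_props, bins=20, alpha=0.6, label="Women")
plt.xlabel("Bootstrapped churn proportion")
plt.ylabel("Frequency")
plt.legend()
plt.show()
```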

Distribution of the proportions of men (blue) and women who exited

The following couple of paragraphs may take some time to comprehend, so don’t worry if you don’t get it right away…

The above results were achieved by creating 200 datasets, each with 500 bootstrapped samples (in other words, we recreated the experiment with random variations 200 times, each time obtaining 500 samples in our data). In doing so, we built a probability distribution that illustrates the potential variability of the sample proportions: it shows all the values the sample proportions can take once random variation is accounted for, as well as the likelihood of each of those values.

This leads us to two potential outcomes. First, the variability may not account for the difference in the observed proportions; in that case, Ho can be rejected. Alternatively, after taking variability into account, we may observe that the values of the two sample proportions greatly overlap; in that case, we declare that we haven't found sufficient evidence to reject Ho.

We can see that the number of proportions that overlap between the two samples is extremely small. Thus, even without further calculation, we can conclude that there is a significant difference between the two proportions. However, what if the results were not so stark? Or what if we wanted to quantify our results as a single number for a report?

P-value

The p-value is the probability of observing a value equal to or more extreme than the one observed, assuming the null hypothesis is true. So in our case, we would expect a large overlap in the sampling distributions of the two genders to correspond to a high p-value, and vice versa.

P-value = number of sample statistics which intersect / total number of observed sample statistics
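The post does not show its exact counting rule, so the sketch below is one plausible interpretation: count the bootstrapped statistics from each group that fall within the range of the other group. The exact number it prints will vary with the random seed.

```python
import numpy as np

def overlap_p_value(props_a, props_b):
    """Count the sample statistics from each group that fall inside the
    other group's range, divided by the total number of statistics."""
    a_in_b = np.sum((props_a >= props_b.min()) & (props_a <= props_b.max()))
    b_in_a = np.sum((props_b >= props_a.min()) & (props_b <= props_a.max()))
    return (a_in_b + b_in_a) / (len(props_a) + len(props_b))

print(overlap_p_value(male_props, female_props))
```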

Usually, the threshold is set at 0.05. In most circumstances, if the p-value is below 0.05, we can claim that our data does not support the null hypothesis.

Now would be a good time to emphasize that this threshold (a.k.a. the alpha value) is stated before the statistical analysis, i.e. while stating the null and alternate hypotheses. This practice ensures that alpha isn't tweaked after obtaining the p-value to achieve favorable results (which, in most cases, happens to be rejection of Ho).

So, what is the p-value we obtained for our experiment, you ask? Well, it turns out to be 0.0018. That means that if we assume the two proportions to be equivalent, there is only an 18 in 10,000 chance that we would get the sampling distributions that we got. In other words, our data suggests that the gender of a customer has a significant impact on whether they will leave the company.

This result makes sense as it is consistent with the graph that we plotted.

Alternative, more computationally efficient methods

Bootstrapping takes a really long time and is computationally intensive. Instead, we can implement a Monte Carlo simulation by assigning weighted probabilities to each outcome. Weighted probabilities determine how likely an outcome is: for example, if the outcome of tails on an unfair coin is weighted twice that of heads, then in the long run we expect tails to come up twice as often as heads. In our case, the weight of getting a 1 (denoting people who exited) during the simulation in Python is the proportion of people who exited the credit card service in the original dataset (refer to the first bar graph).

In this method, we generate a string of ones and zeroes (signifying whether a customer exited or did not exit), with the probability of each outcome weighted by its occurrence in the original dataset. If you give it some thought, you will realize that the result of this method is equivalent to bootstrapping. Once we have several such samples, we can calculate the proportion of people who exited in each of the datasets.

We generate 1,000 observations in a single sample and repeat the process 10,000 times. It's as if we had 10,000 different possible datasets, each with 1,000 observations! This is significantly more than what we achieved with the previous approach, but in a fraction of the time, all thanks to the efficiency of NumPy.
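Below is a sketch of this weighted simulation in NumPy, reusing the male and female arrays and the overlap_p_value helper defined earlier (both of which are my constructions, not the original code):

```python
import numpy as np

rng = np.random.default_rng(1)

# Observed churn rates in the original data serve as the outcome weights.
p_male = male.mean()
p_female = female.mean()

n_datasets, n_samples = 10_000, 1_000

# Each row is one simulated dataset of 1,000 customers; a 1 (exited) is
# drawn with probability equal to the observed churn rate.
sim_male = rng.choice([0, 1], size=(n_datasets, n_samples), p=[1 - p_male, p_male])
sim_female = rng.choice([0, 1], size=(n_datasets, n_samples), p=[1 - p_female, p_female])

sim_male_props = sim_male.mean(axis=1)
sim_female_props = sim_female.mean(axis=1)

print(overlap_p_value(sim_male_props, sim_female_props))
```

(An equivalent, slightly more direct alternative would be to draw the weighted 0/1 outcomes with rng.binomial(1, p, size=...).)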

From there, we follow similar steps to what we did earlier: we find the sample proportion for each simulated dataset and plot the probability distribution (see below).

We can then make the final conclusion using the p-value and comparing it to our significance level.

This method is more efficient than bootstrapping, which allows us to increase the size of our samples; and if you have taken a STATS101 course, you will know that larger sample sizes mean smaller margins of error, which lead to more certainty. Now, it doesn't take a genius to figure out that more certainty is always good!

Histogram obtained from the second approach

The alternative method gives a p-value of 0.0022, which is consistent with our visualization as well as our previous result.

Summary

We conducted hypothesis tests to find out whether the proportion of men who exited was significantly different from the proportion of women who did the same.

We used bootstrap sampling to perform our analysis, and we then repeated the experiment with an alternative, equivalent, but more efficient technique. Both methods are different ways of implementing what is technically the exact same experimental setup.

We learnt about the p-value and how to calculate it for our use case. Based on the given data, we were able to derive conclusions which can then be reported to the relevant stakeholders for decision making or further analysis.

Hamid Omar
Final year CS student, interested in all things to do with Data Science!