5 Useful Statistical Tests in Data Science 📊

Salman Hossain
3 min read · Sep 10, 2022


Word Cloud Generated from https://monkeylearn.com/word-cloud

Two Sample Z Test 2️⃣

Do the means of two samples differ?

The density function of two normally distributed samples with the mean being shown with the dotted vertical line.

Null Hypothesis: The two samples come from populations with the same mean.

Use the Z test when you want to compare the mean of a continuous variable between two groups. It adds some rigor when you need to justify whether the two means really differ or the gap is just sampling noise. The test is meant for large samples; with small samples a t-test is the usual choice.

Code for graphing the density function of two normally distributed samples. The Z test itself is available in statsmodels: https://www.statsmodels.org/dev/generated/statsmodels.stats.weightstats.ztest.html
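As a rough sketch of the test itself (not the plotting code), the two-sided Z statistic can be computed by hand with NumPy and SciPy. The helper name `two_sample_z_test`, the seed, and the sample parameters are assumptions made for illustration. The statsmodels `ztest` function serves the same purpose, but it uses a pooled variance by default, so its numbers may differ slightly from this unpooled version.

```python
import numpy as np
from scipy import stats


def two_sample_z_test(x, y):
    """Two-sided two-sample Z test for a difference in means (unpooled variance)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Standard error of the difference between the two sample means.
    se = np.sqrt(x.var(ddof=1) / x.size + y.var(ddof=1) / y.size)
    z = (x.mean() - y.mean()) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = 2 * stats.norm.sf(abs(z))
    return z, p_value


# Hypothetical data: two normal samples whose true means differ by 0.5.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=5.0, scale=3.0, size=1000)
group_b = rng.normal(loc=5.5, scale=3.0, size=1000)

z_stat, z_p = two_sample_z_test(group_a, group_b)
print(f"z = {z_stat:.3f}, p-value = {z_p:.4f}")
if z_p < 0.05:
    print("Reject the null: the means appear to differ.")
else:
    print("Fail to reject the null: no evidence the means differ.")
```

A p-value below the chosen significance level (0.05 here) is read as evidence against the null hypothesis of equal means.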

Rank Sum Test (Mann-Whitney) 💁

Are the distributions of these two samples the same?

Normal distribution with mean=5, std=3, samples=1000 & Gamma Distribution with alpha=2 and default beta

Null Hypothesis: The distribution of the two samples is the same.

Example: Are the heights of mountains in the US distributed similarly to mountains 🗻 in Japan?

When looking at two samples, a natural question is whether they come from the same distribution. In this example, the red sample is drawn from a normal distribution and the blue sample from a gamma distribution, so the test should yield a p-value below 0.05 and reject the null hypothesis.

Generates two separate samples, one from a gamma distribution and the other from a normal distribution, and plots the density function of each.
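A minimal sketch of running the rank sum test on samples like the ones described above, using `scipy.stats.mannwhitneyu`. The seed and the size of the gamma sample are assumptions, and "default beta" is taken here to mean NumPy's default `scale=1.0`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Normal sample: mean=5, std=3, 1000 draws (as in the figure caption).
normal_sample = rng.normal(loc=5.0, scale=3.0, size=1000)
# Gamma sample: alpha (shape)=2, scale left at 1.0; size is an assumption.
gamma_sample = rng.gamma(shape=2.0, scale=1.0, size=1000)

u_stat, u_p = stats.mannwhitneyu(
    normal_sample, gamma_sample, alternative="two-sided"
)
print(f"U = {u_stat:.1f}, p-value = {u_p:.3g}")
if u_p < 0.05:
    print("Reject the null: the two distributions appear to differ.")
else:
    print("Fail to reject the null.")
```

Because the test works on ranks rather than raw values, it does not require either sample to be normally distributed.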

Pearson’s Chi-Square test 🍵

Does the frequency distribution between the two samples differ?

The formula for calculating the chi-squared test statistic. [4]
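For reference, the standard form of Pearson's chi-squared statistic is written out below. The symbols are the usual textbook ones, not taken from the original figure: O_i is the observed count in category i, E_i is the count expected under the null hypothesis, and k is the number of categories.

```latex
\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}
```

The larger the gap between observed and expected counts, the larger the statistic and the smaller the resulting p-value.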

Null Hypothesis: The frequency distribution is the same for both samples.

A statistical test that evaluates whether certain events occur more frequently in one sample than in another³. It is used quite often in biology to examine whether two categorical variables are independent of each other.

Example: Determine whether different categories of consumers purchase product A at different frequencies. (The consumers can first be grouped into categories by a clustering algorithm such as K-Means.)
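The consumer example could be sketched with `scipy.stats.chi2_contingency`. The counts below are made up for illustration, and the three segments stand in for whatever clusters a clustering step would produce.

```python
import numpy as np
from scipy import stats

# Hypothetical counts. Rows are consumer segments; columns are
# [bought product A, did not buy product A].
observed = np.array([
    [45, 55],  # segment 1
    [30, 70],  # segment 2
    [20, 80],  # segment 3
])

chi2_stat, chi2_p, dof, expected = stats.chi2_contingency(observed)
print(f"chi2 = {chi2_stat:.2f}, dof = {dof}, p-value = {chi2_p:.4f}")
print("Expected counts under the null:")
print(expected.round(2))
if chi2_p < 0.05:
    print("Reject the null: purchase frequency differs across segments.")
else:
    print("Fail to reject the null.")
```

The `expected` array holds the counts each cell would have if purchasing were independent of segment, which makes it easy to see where the observed data departs from the null.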

Binomial Test 👬

Does the success rate differ between the control and the treatment group?

Null Hypothesis: The success rate is the same for both samples.

Example: Is the treatment group that saw the promotional advertisement more likely to buy Starbucks coffee ☕️ than the control group that didn’t?

Note: The binomial test is similar to the chi-square test, except that the binomial test only deals with two classes, i.e., the control and treatment groups, whereas the chi-square test can handle many. Furthermore, the binomial test is typically used with smaller samples rather than large ones.
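A hedged sketch of the coffee example using `scipy.stats.binomtest` (available in SciPy 1.7 and later). That function compares an observed success count against a fixed hypothesized probability, so this sketch treats the control group's purchase rate as that fixed probability. This is a simplifying assumption, and all of the numbers are made up.

```python
from scipy import stats

# Hypothetical numbers for illustration only.
control_rate = 0.25    # assumed purchase rate observed in the control group
treatment_buyers = 18  # treatment-group customers who bought coffee
treatment_size = 40    # total customers in the treatment group

# One-sided test: is the treatment success rate greater than the control rate?
result = stats.binomtest(
    treatment_buyers, treatment_size, p=control_rate, alternative="greater"
)
binom_p = result.pvalue
print(f"observed rate = {treatment_buyers / treatment_size:.2f}, "
      f"p-value = {binom_p:.4f}")
if binom_p < 0.05:
    print("Reject the null: the treatment group buys more often.")
else:
    print("Fail to reject the null.")
```

The `alternative="greater"` argument matches the one-directional question in the example; use `"two-sided"` when a difference in either direction matters.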

Shapiro-Wilk Test 💠

Is the sample normally distributed?

Null Hypothesis: The sample is normally distributed.

This is the Ben Shapiro-.. um, I mean Shapiro-Wilk Test. It was published by Samuel Shapiro and Martin Wilk in 1965¹.

As with other hypothesis tests, if the p-value is below the chosen significance level (0.05 here), we reject the null. For the Shapiro-Wilk test, this means the sample is unlikely to have come from a normal distribution.

The W statistic ranges from 0 to 1 and indicates how close the sample is to normal: values near 1 are consistent with normality, while smaller values indicate a departure from it².

How to conduct a Shapiro-Wilk test using SciPy
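A minimal sketch with `scipy.stats.shapiro`, run on one normal and one gamma sample so the two outcomes can be compared. The seed, sample sizes, and distribution parameters are assumptions for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Hypothetical samples: one normal, one clearly skewed.
samples = {
    "normal": rng.normal(loc=5.0, scale=3.0, size=500),
    "gamma": rng.gamma(shape=2.0, scale=1.0, size=500),
}

results = {}
for name, sample in samples.items():
    w_stat, p_value = stats.shapiro(sample)
    results[name] = (w_stat, p_value)
    verdict = "not normal" if p_value < 0.05 else "no evidence against normality"
    print(f"{name}: W = {w_stat:.4f}, p-value = {p_value:.4g} -> {verdict}")
```

Note that failing to reject the null does not prove the sample is normal; it only means the test found no evidence against normality.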

Made it all the way to the 🔚? Be sure to follow me if you want to see more of my posts 😄.

Citations

[1] https://en.wikipedia.org/wiki/Shapiro%E2%80%93Wilk_test

[2] https://www.itl.nist.gov/div898/handbook/prc/section2/prc213.htm

[3] https://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test

[4] https://homes.cs.washington.edu/~suinlee/genome560/lecture7.pdf
