5 Useful Statistical Tests in Data Science 📊
Two Sample Z Test 2️⃣
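Do the means of two samples differ?
Null Hypothesis: The two samples come from populations with the same mean.
Use this test when you want to compare the mean of a continuous value across two different groups. The Z test gives you some rigor when justifying whether the means of two samples are similar or not. It assumes fairly large samples (a common rule of thumb is 30 or more observations per group) or known population variances; with small samples, a t-test is the usual choice.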
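Here is a minimal sketch of a two-sample Z test using only the Python standard library. The function name `two_sample_z_test`, the two groups, and their parameters (means of 170 and 173, standard deviation of 5, 200 observations each) are made up for illustration, and the sample variances stand in for the population variances, which is only reasonable for large samples.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean, variance


def two_sample_z_test(a, b):
    """Return (z, two-sided p-value) for H0: both population means are equal."""
    # Standard error of the difference in means, using sample variances
    # as a large-sample stand-in for the population variances.
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    z = (fmean(a) - fmean(b)) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical data: two groups whose true means differ by 3 units.
rng = random.Random(0)
group_a = [rng.gauss(170, 5) for _ in range(200)]
group_b = [rng.gauss(173, 5) for _ in range(200)]

z, p_value = two_sample_z_test(group_a, group_b)
print(f"z = {z:.2f}, p = {p_value:.3g}")
```

A p-value under 0.05 here would lead us to reject the null hypothesis that the two groups share the same mean.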
Rank Sum Test (Mann-Whitney) 💁
Are the distributions of these two samples the same?
Null Hypothesis: The distribution of the two samples is the same.
Example: Are the heights of mountains in the US distributed similarly to mountains 🗻 in Japan?
When looking at different samples, one would naturally ask: do these come from the same distribution? In this example, the red sample is from a normal distribution and the blue sample is from a gamma distribution, so the test should yield a p-value < 0.05, rejecting the null hypothesis.
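The normal-versus-gamma comparison can be sketched with SciPy's `mannwhitneyu`. The distribution parameters, sample sizes, and seed below are assumptions picked for illustration; they are not the values behind the original example.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Hypothetical samples: parameters and sizes are illustrative assumptions.
rng = np.random.default_rng(42)
normal_sample = rng.normal(loc=0.0, scale=1.0, size=500)
gamma_sample = rng.gamma(shape=2.0, scale=2.0, size=500)

# Two-sided rank sum test: H0 says neither sample tends to have larger values.
stat, p_value = mannwhitneyu(normal_sample, gamma_sample, alternative="two-sided")
print(f"U = {stat:.1f}, p = {p_value:.3g}")
```

Keep in mind that the test works on ranks, so it mainly picks up one sample tending to have larger values than the other. Two distributions with different shapes but a similar location may not be flagged.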
Pearson’s Chi-Square test 🍵
Does the frequency distribution differ between the two samples?
Null Hypothesis: The frequencies are the same in both samples.
A statistical test to evaluate whether certain events occur more frequently in one sample than in another³. This test is used quite often in biology to examine whether two variables are independent of each other.
Example: Determine whether different categories of consumers purchase product A at different frequencies. (The consumer categories could initially come from a clustering algorithm such as K-Means.)
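As a rough sketch of the consumer example, SciPy's `chi2_contingency` takes a table of observed counts. The three segments and all of the counts below are invented for illustration.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts. Rows: consumer segments (e.g. from clustering).
# Columns: [bought product A, did not buy product A].
observed = np.array([
    [30, 70],
    [50, 50],
    [80, 20],
])

# H0: purchase frequency is independent of the consumer segment.
chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.3g}")
```

The `expected` array holds the counts we would expect under the null hypothesis, which is handy for seeing which segments deviate the most.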
Binomial Test 👬
Does the success rate differ between the control and the treatment group?
Null Hypothesis: The success rate is the same for both samples.
Example: Is the treatment group that saw the promotional advertisement more likely to buy Starbucks coffee ☕️ than the control group that didn’t?
Note: The binomial test is similar to the chi-square test, except that the binomial test only deals with 2 classes, i.e., the control and treatment groups, whereas the chi-square test can deal with many. Furthermore, because it is an exact test, the binomial test is typically used with smaller samples rather than large ones.
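One way to sketch the coffee example is to treat the control group's purchase rate as a known baseline and ask whether the treatment group's count is surprisingly high under it. That framing, the helper `binomial_test_greater`, and every number below are assumptions for illustration. SciPy's `scipy.stats.binomtest` covers the same kind of calculation if you would rather not write it by hand.

```python
from math import comb


def binomial_test_greater(successes, trials, p0):
    """One-sided exact p-value: P(X >= successes) for X ~ Binomial(trials, p0)."""
    return sum(
        comb(trials, k) * p0**k * (1 - p0) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Hypothetical numbers: 10% of the control group bought coffee (baseline),
# while 18 of the 100 people who saw the ad bought coffee.
baseline_rate = 0.10
treatment_buyers = 18
treatment_size = 100

p_value = binomial_test_greater(treatment_buyers, treatment_size, baseline_rate)
print(f"p = {p_value:.4f}")
```

Note that this treats the baseline rate as fixed and ignores the sampling noise in the control group itself, which is a simplification.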
Shapiro-Wilk Test 💠
Is the sample normally distributed?
Null Hypothesis: The sample is normally distributed.
This is the Ben Shapiro-.. um, I mean Shapiro-Wilk test. It was published by Samuel Shapiro and Martin Wilk in 1965¹.
As with most hypothesis tests, if the p-value is under 0.05 (the conventional threshold), we reject the null. In the case of the Shapiro-Wilk test, this means we conclude that the sample is not normally distributed.
The W statistic ranges from 0 to 1 and indicates the sample's normality: values close to 1 are consistent with a normal distribution, while smaller values point to a departure from it².
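To see the W statistic and p-value in action, here is a small sketch with SciPy's `shapiro`, run on one sample drawn from a normal distribution and one from a skewed (exponential) distribution. Both samples, their parameters, and the seed are assumptions for illustration.

```python
import numpy as np
from scipy.stats import shapiro

# Hypothetical samples: one normal, one clearly skewed.
rng = np.random.default_rng(0)
normal_sample = rng.normal(loc=50, scale=5, size=200)
skewed_sample = rng.exponential(scale=5, size=200)

# shapiro returns the W statistic and the p-value for H0: the sample is normal.
w_normal, p_normal = shapiro(normal_sample)
w_skewed, p_skewed = shapiro(skewed_sample)

print(f"normal sample: W = {w_normal:.3f}, p = {p_normal:.3g}")
print(f"skewed sample: W = {w_skewed:.3f}, p = {p_skewed:.3g}")
```

The skewed sample should produce a noticeably lower W and a tiny p-value, while the normal sample should keep W close to 1.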
Citations
[1] https://en.wikipedia.org/wiki/Shapiro%E2%80%93Wilk_test
[2] https://www.itl.nist.gov/div898/handbook/prc/section2/prc213.htm
[3] https://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test
[4] https://homes.cs.washington.edu/~suinlee/genome560/lecture7.pdf