Statistics for Data Analysts: Inferential Statistics with Python

Margaret Awojide · Published in CodeX · Sep 14, 2022 · 6 min read

Introduction


In data analysis, statistics is essential for understanding data, discovering trends, and drawing conclusions efficiently, which is precisely the purpose of data analysis. Statistics is divided into two broad areas based on purpose: Descriptive Statistics and Inferential Statistics.

This article is the second in the Statistics for Data Analysts series and covers only Inferential Statistics using Python. Click here for the previous article on Descriptive Statistics with Python.

Inferential Statistics

Inferential statistics involves drawing deductions and/or making predictions about a population. In most cases, inferences are made about a population using a sample. Unlike descriptive statistics, where known sample or population data is described, inferential statistics uses sample data to draw conclusions about the population.


Sampling and Sampling Techniques

Gathering information about an entire population can be very difficult and, in some cases, impossible. Due to this limitation, a smaller fraction of the population, known as the sample, is analyzed, and inferences about the population are made using the sample data collected. It should be noted that the sample has to be representative of the population for the deductions to be valid. Usually, this depends on factors such as the sample size and the sampling technique used.

Sampling Techniques

Generally, there are two sampling categories: random/probability sampling and non-probability sampling. For the former, sampling is done at random and is not biased. For non-probability sampling, however, elements are selected by deliberate choice. For example, you might want to select the best students to represent a school in a competition instead of selecting students at random. Under these two broad categories lie several sampling techniques.

Click here for more on Sampling Techniques

Simple Random Sampling is the simplest and most common technique. Here, every element in the population has an equal chance of being selected. Another popular probability sampling procedure is stratified sampling. In this case, the population is divided into groups of related elements called strata, and samples are then collected from each stratum. For example, data might be collected from the population in strata of different age groups instead of completely at random.

In Python, the random.sample() function is typically used to select a simple random sample from a population; the population and the number of elements to be selected are passed as arguments.

Click here for the Python Documentation
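Here is a minimal sketch of both techniques. The student IDs, age groups, and sample sizes below are made up purely for illustration, and the stratified part assumes pandas 1.1 or later for groupby(...).sample():

```python
import random

import pandas as pd

random.seed(42)  # make the samples reproducible

# Hypothetical population: 100 student IDs (made up for illustration)
population = list(range(1, 101))

# Simple random sampling: every student has an equal chance of selection
simple_sample = random.sample(population, k=10)
print(simple_sample)

# Stratified sampling sketch: draw 2 students from each (made-up) age group
df = pd.DataFrame({
    "student": population,
    "age_group": ["under 18", "18-25", "26-35", "over 35"] * 25,
})
stratified_sample = df.groupby("age_group").sample(n=2, random_state=42)
print(stratified_sample)
```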

Hypothesis Testing

Hypothesis testing is a statistical inference technique used to support or refute statements made about a population using sample data. We can think of hypothesis testing as an experiment: a hypothesis is stated before the experiment starts, and after experimentation, we check whether the results agree with the statement or not.


Hypothesis testing is one of the most significant aspects of inferential statistics. Several tests are used in hypothesis testing, and the appropriate one depends on the data and the purpose of the test. There are a number of hypothesis tests you will need to be familiar with on your journey as a data analyst. This article covers the following:

  • Z-Test & T-Test
  • Correlation Test
  • Chi-Square Tests

Z-Test & T-Test

[Figure: the normal distribution. Image by Simply Psychology]

The Z-Test is a hypothesis test typically used to determine whether the means of two populations are significantly different, or whether the mean of a population is greater than, less than, or equal to a specific value. This test is used when the variance(s) of the population(s) is/are known, and it applies when the data follows a normal distribution. When the sample size is large (commonly n ≥ 30), the Central Limit Theorem lets us treat the sample mean as approximately normally distributed, so the test can still be applied.

Check here for more on Z-Test

Using a case study of the performance of students in two classes, the Z-Test can be used to ascertain whether there is a significant difference in scores. In this scenario, the null hypothesis is that the mean scores of the two classes are equal. The hypothesis test enables us to support or refute this claim. Usually, a 5% level of significance is applied, and the null hypothesis is rejected if the p-value produced is less than the level of significance.

Check the documentation here
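Here is a minimal sketch of the two-class scenario using the ztest function from statsmodels; the class scores below are simulated purely for illustration:

```python
import numpy as np
from statsmodels.stats.weightstats import ztest

# Simulated scores for two classes (illustration only)
rng = np.random.default_rng(0)
class_a = rng.normal(loc=68, scale=10, size=100)
class_b = rng.normal(loc=65, scale=10, size=100)

# H0: the mean scores of the two classes are equal
z_stat, p_value = ztest(class_a, class_b, value=0)
print(f"z = {z_stat:.3f}, p = {p_value:.4f}")

if p_value < 0.05:
    print("Reject H0: the mean scores differ significantly.")
else:
    print("Fail to reject H0: no significant difference detected.")
```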

The T-Test serves a similar purpose to the Z-Test. However, it is applied when the population standard deviation is not known, or for samples with small sample sizes (n < 30).

Check here for more on T-Test

Let us paint another scenario: a coach trains junior athletes to run a 100-meter race. The coach believes that her athletes' average time is 10 seconds. To confirm this, she selects 10 athletes, records their times, and applies a one-sample T-Test.

Check the documentation here
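Here is a minimal sketch of the coach's scenario using SciPy's ttest_1samp function; the ten race times below are made up for illustration:

```python
import numpy as np
from scipy import stats

# Hypothetical 100 m times in seconds for the 10 selected athletes
times = np.array([10.2, 9.8, 10.5, 10.1, 9.9, 10.4, 10.0, 10.3, 9.7, 10.6])

# H0: the athletes' mean time is 10 seconds
t_stat, p_value = stats.ttest_1samp(times, popmean=10)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")

if p_value < 0.05:
    print("Reject H0: the mean time differs from 10 seconds.")
else:
    print("Fail to reject H0: the data is consistent with a 10-second mean.")
```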

Correlation Test

Correlation describes the degree of relationship between two (or more) variables. For example, there might be a positive relationship between hours of practice and overall performance: “The more you practice, the better your results in an examination will be”.

A correlation test checks whether the relationship between these variables is statistically significant. The Pearson correlation coefficient is a popular coefficient that measures the linear relationship between two variables.


For instance, the relationship between test scores and exam scores can be examined using Pearson correlation. The pearsonr function in SciPy returns the correlation coefficient along with a p-value for testing whether the correlation is significant. The null hypothesis for the correlation test is that there is no correlation between the variables.

Click here for the documentation
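Here is a minimal sketch using SciPy's pearsonr; the test and exam scores below are invented for illustration:

```python
import numpy as np
from scipy import stats

# Invented test and exam scores for 12 students (illustration only)
test_scores = np.array([55, 60, 65, 70, 72, 75, 78, 80, 85, 88, 90, 95])
exam_scores = np.array([58, 62, 63, 71, 70, 74, 80, 79, 84, 90, 89, 96])

# H0: there is no linear correlation between the two variables
r, p_value = stats.pearsonr(test_scores, exam_scores)
print(f"r = {r:.3f}, p = {p_value:.4g}")

if p_value < 0.05:
    print("Reject H0: the correlation is statistically significant.")
```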

Chi-Square Tests


There are 3 types of Chi-Square Tests:

  • Chi-Square Test of Independence
  • Chi-Square Goodness of Fit Test
  • Chi-Square Test of Homogeneity

The most popular of these tests are the Chi-Square Test of Independence and the Goodness of Fit Test.

The Chi-Square Goodness of Fit test is mostly used to ascertain whether the frequency distribution observed in a sample matches the distribution expected in the population. The Chi-Square Test of Independence, on the other hand, is used to determine whether the relationship between two categorical variables is significant. It differs from the correlation test because, unlike the correlation test, which focuses on quantitative variables, this chi-square test deals with categorical variables.

SciPy’s stats.chisquare function is used to compute the goodness of fit test, while the stats.chi2_contingency function is used to compute the chi-square test of independence.
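Here is a minimal sketch of both tests; the die-roll counts and the contingency table below are made up for illustration:

```python
import numpy as np
from scipy import stats

# Goodness of fit — H0: a six-sided die is fair (made-up roll counts)
observed = np.array([18, 22, 16, 14, 12, 18])
expected = np.full(6, observed.sum() / 6)
chi2, p = stats.chisquare(f_obs=observed, f_exp=expected)
print(f"Goodness of fit: chi2 = {chi2:.3f}, p = {p:.4f}")

# Test of independence — H0: the two categorical variables are independent
# (hypothetical 2x2 contingency table of counts)
table = np.array([[30, 10],
                  [20, 25]])
chi2, p, dof, expected_counts = stats.chi2_contingency(table)
print(f"Independence: chi2 = {chi2:.3f}, p = {p:.4f}, dof = {dof}")
```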

Inferential Statistics is an extremely valuable tool for every aspiring data analyst, from applying sampling techniques in the data collection process to applying hypothesis tests to draw conclusions from data. It is worth mentioning that this article does not exhaust all the sampling techniques and hypothesis tests that exist; however, it covers some important and widely used ones that you will come across.

Thank you for reading to the end of this article ☺! Kindly click on the linked texts if you want more details about a concept. Don’t forget to share and clap if you gained something from this. Merci!
