#1 | What is Statistics? | 7-Days of Statistics for Data Science

Statistics is an essential part of Data Science. This article covers the importance of statistics in data science and the types of statistics in detail, so that you can understand data using Python and the Pandas library.

Madhuri Patil
8 min read · Jun 29, 2023

Mathematics and Statistics are the foundation of Data Science and Machine Learning. Nowadays every company is data-driven and has a massive amount of data.

Data is nothing but a formless stream of bytes; it becomes information only after raw data is processed and transformed. That is why organizations are always looking for data professionals who can extract meaningful insights from raw data to solve complex problems or to build intelligent systems, which ultimately helps grow business revenue.

Today, data is cheap and available everywhere. Asking the right question, however, is expensive. Defining the problem is the first step in any machine-learning project, and it is one of the critical skills in the Machine Learning or Data Science domain.

If you are unable to define a problem, that is, you have the data but no idea how to utilize it for the business, then how could you build a system that helps the business succeed?

Statistics and mathematics can help you understand historical data so you can ask the right questions. Machine learning is centered around statistics; statistics provides the tools and techniques to understand and manipulate data and to infer meaningful information from it. If you are interested in data science, you must have a strong understanding of these subjects.

Welcome to the 7-Days of Statistics for Data Science series. In this series, we will learn the basics of statistics for Data Science with practical implementations using Python and its libraries.

In this first article, we will cover the basic concepts of statistics that every data scientist should know, such as what statistics is and its main types. We will also see how to apply them in Python with the Pandas library.

Let’s start!

What is Statistics?

Statistics is a branch of mathematics concerned with the collection, analysis, and interpretation of data to make informed decisions.

Often in statistics, we collect data to find answers to our questions about a population. For example, what is the average height of a woman in a particular country?

The population represents every possible individual or object that we are interested in measuring. However, gathering data for the entire population is usually not feasible, so researchers collect samples of data. A sample is a subset of a population. We can then generalize the inferences and insights from the sample to the larger population.

For example, suppose you want to answer a question such as: what is the average weight of students in a particular school of 5,000 students? In this scenario, the population is the weight of every student in the school. Surveying all 5,000 students might take too long, so we might instead collect data from 100 randomly selected students and ask them their weight.

Here, the 5,000 students represent the population, and the 100 randomly selected students represent a sample. We can then generalize the results from the sample to the entire population.

A sample must be representative of the population. For example, if the population of 5,000 students contains 60% boys and 40% girls, then a random sample of 100 students from that population should also contain roughly 60% boys and 40% girls; only then can we generalize the results from the sample to the overall population.
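As an illustration, pandas makes it easy to draw such a sample. The snippet below is only a minimal sketch with a hypothetical DataFrame and made-up column names; it draws a simple random sample and a stratified sample that preserves the 60/40 gender split by sampling within each group.

# Import pandas library
import pandas as pd

# Hypothetical population of 5,000 students (illustrative column names)
students = pd.DataFrame({
    "gender": ["boy"] * 3000 + ["girl"] * 2000,
    "weight": 60,  # placeholder weight values
})

# Simple random sample of 100 students
random_sample = students.sample(n=100, random_state=42)

# Stratified sample: 2% from each gender group keeps the 60/40 split
stratified_sample = students.groupby("gender", group_keys=False).apply(
    lambda g: g.sample(frac=0.02, random_state=42)
)

print(stratified_sample["gender"].value_counts())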

There are two main statistical methods used in data analysis:

  • Descriptive Statistics
  • Inferential Statistics

Descriptive Statistics

Descriptive statistics are summary statistics that describe or summarize a set of data, measuring the main characteristics of its features using graphs, tables, or other data visualization methods.

It helps us to understand the data in more depth. Descriptive statistics provide insight into what has happened in the past, before attempting to explain why it happened or to predict what will happen in the future.

It commonly uses measures of central tendency, such as the mean, median, and mode, and measures of variability or dispersion, such as the standard deviation or variance, the minimum and maximum values of the variables, and the kurtosis and skewness of the data, to give you an idea of the distribution of your data.
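Most of these measures are available directly in pandas. The snippet below is a minimal sketch on a small hypothetical series of exam scores; the values are made up purely for illustration.

import pandas as pd

# Hypothetical exam scores
scores = pd.Series([55, 62, 66, 66, 70, 74, 81, 90])

# Measures of central tendency
print("mean:  ", scores.mean())
print("median:", scores.median())
print("mode:  ", scores.mode().tolist())

# Measures of variability or dispersion
print("std:     ", scores.std())
print("variance:", scores.var())
print("min/max: ", scores.min(), scores.max())
print("skewness:", scores.skew())
print("kurtosis:", scores.kurt())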

Characteristics of the data can also be represented graphically using histograms. Descriptive statistics may also be used to understand the relationship between two variables, using a contingency table or graphical representations such as box plots or scatter plots.
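A contingency table, for instance, can be built with pd.crosstab. The example below is a small sketch on hypothetical categorical data; a real dataset is introduced in the next section.

import pandas as pd

# Hypothetical categorical data
df = pd.DataFrame({
    "gender": ["female", "male", "female", "male", "female", "male"],
    "lunch": ["standard", "free/reduced", "standard",
              "standard", "free/reduced", "standard"],
})

# Contingency table: counts of lunch type by gender
print(pd.crosstab(df["gender"], df["lunch"]))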

We commonly use two types of analysis while doing exploratory data analysis in Machine Learning:

  • Univariate Analysis
  • Bivariate Analysis

Univariate Analysis

Univariate analysis provides summary statistics of a single variable using measures of central tendency and measures of variability. It helps us study the distribution of each feature in depth, so that we can use this analysis later for feature selection and model optimization.

Let’s see how we can use Python and Pandas to perform univariate analysis on data. For this, I have used the Students Performance dataset from Kaggle.

# Import pandas library
import pandas as pd

# Read Data
data = pd.read_csv("StudentsPerformance.csv")

We can print basic metadata about the data using the pandas.DataFrame.info() method. It shows the total number of rows and columns in the dataset, the datatype of each feature along with its number of non-null values, and the total memory usage of the data.

# Metadata Information
>>> data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column                       Non-Null Count  Dtype
---  ------                       --------------  -----
 0   gender                       1000 non-null   object
 1   race/ethnicity               1000 non-null   object
 2   parental level of education  1000 non-null   object
 3   lunch                        1000 non-null   object
 4   test preparation course      1000 non-null   object
 5   math score                   1000 non-null   int64
 6   reading score                1000 non-null   int64
 7   writing score                1000 non-null   int64
dtypes: int64(3), object(5)
memory usage: 62.6+ KB

To get quick summary statistics of the data, we can use the pandas.DataFrame.describe() method. It returns a data frame with the total count, mean, standard deviation, and the five-number summary (minimum, 25th, 50th, and 75th percentiles, and maximum) for each numerical feature.

For categorical features, we can use the same method with the parameter include='all'; it then returns summary statistics for every feature, including categorical columns. You can also explicitly pass include a list of the column datatypes you want a summary description for.

# Summary statistics of numerical features
>>> data.describe()

        math score  reading score  writing score
count   1000.00000    1000.000000    1000.000000
mean      66.08900      69.169000      68.054000
std       15.16308      14.600192      15.195657
min        0.00000      17.000000      10.000000
25%       57.00000      59.000000      57.750000
50%       66.00000      70.000000      69.000000
75%       77.00000      79.000000      79.000000
max      100.00000     100.000000     100.000000


# Summary statistics for categorical features with `Object` type.
>>> data.describe(include=['O'])
        gender race/ethnicity parental level of education     lunch  \
count     1000           1000                        1000      1000
unique       2              5                           6         2
top     female        group C                some college  standard
freq       518            319                         226       645

       test preparation course
count                     1000
unique                       2
top                       none
freq                       642

# Summary statistics for all features.
>>> data.describe(include='all')
        gender race/ethnicity parental level of education     lunch  \
count     1000           1000                        1000      1000
unique       2              5                           6         2
top     female        group C                some college  standard
freq       518            319                         226       645
mean       NaN            NaN                         NaN       NaN
std        NaN            NaN                         NaN       NaN
min        NaN            NaN                         NaN       NaN
25%        NaN            NaN                         NaN       NaN
50%        NaN            NaN                         NaN       NaN
75%        NaN            NaN                         NaN       NaN
max        NaN            NaN                         NaN       NaN

       test preparation course  math score  reading score  writing score
count                     1000  1000.00000    1000.000000    1000.000000
unique                       2         NaN            NaN            NaN
top                       none         NaN            NaN            NaN
freq                       642         NaN            NaN            NaN
mean                       NaN    66.08900      69.169000      68.054000
std                        NaN    15.16308      14.600192      15.195657
min                        NaN     0.00000      17.000000      10.000000
25%                        NaN    57.00000      59.000000      57.750000
50%                        NaN    66.00000      70.000000      69.000000
75%                        NaN    77.00000      79.000000      79.000000
max                        NaN   100.00000     100.000000     100.000000

When you print the description for all features, it returns NaN for inapplicable values. For example, for the gender column there is no mean or standard deviation of values such as female or male. For categorical variables, it returns the total number of observations, the number of unique values present in the column, and the most frequent value along with its count.

You can further evaluate each individual feature with a histogram plot or, for categorical features, with pandas methods such as data[column_name].value_counts(), which counts the occurrences of each unique value, as shown below.
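For example, continuing with the same dataset (the column names below come from the info() output above), we can count the categories of a categorical feature and plot a histogram of a numerical one:

# Unique value counts for a categorical feature
print(data['gender'].value_counts())

# Histogram of a numerical feature
import matplotlib.pyplot as plt

data['math score'].hist(bins=20)
plt.xlabel("Math Score")
plt.ylabel("Frequency")
plt.title("Distribution of Math Scores")
plt.show()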

Bivariate Analysis

Bivariate analysis examines the relationship between two variables. It helps you determine whether the variables are correlated or not. Visualizations are often used together with the numerical analysis as a more intuitive way of presenting the results. It provides insights into the problem and helps develop ideas or hypotheses for further quantitative research.

Using the pandas.DataFrame.groupby() method and plots, we can examine the relationship between two variables, as well as between the features and the response variable, as shown below.
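As a quick example with the same dataset, we can group by a categorical feature and compare the average scores of each group (the column names come from the dataset loaded above):

# Average scores grouped by test preparation course
print(data.groupby('test preparation course')[['math score',
                                               'reading score',
                                               'writing score']].mean())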

Let’s examine the relationship between the math score and writing score of students using a scatter plot from the matplotlib library.

# Import matplotlib
import matplotlib.pyplot as plt

# Scatter plot
plt.scatter(data['math score'], data['writing score'])

# Set labels
plt.xlabel("Math Score")
plt.ylabel("Writing Score")
plt.title("Scatter Plot")

plt.grid(ls='--', c='#000', alpha=0.2)
plt.show()

From the above scatter plot, we can say that there is a positive correlation between these two scores.
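We can also quantify this relationship with the correlation coefficient; a value close to +1 indicates a strong positive linear relationship.

# Pearson correlation between math score and writing score
print(data['math score'].corr(data['writing score']))

# Correlation matrix for all numerical score columns
print(data[['math score', 'reading score', 'writing score']].corr())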

Inferential Statistics

In statistics, inferential analysis tries to learn about a population from a sample, assuming the sample represents the larger population, using a statistical model; for example, testing a hypothesis and drawing conclusions about the population from sampled data. However, the sample might not provide a perfect estimate of the population every time. To measure this uncertainty, we can construct a confidence interval.

A confidence interval is a range of values that is likely to contain a population parameter with a certain level of confidence. For example, we might produce a 95% confidence interval of [61.5, 64.5] which says that we are 95% confident that the average weight of students from a certain school is between 61.5 kg and 64.5 kg.
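The snippet below is a minimal sketch of how such an interval could be computed, using a hypothetical sample of student weights and the scipy.stats module; the numbers it produces are illustrative, not the ones quoted above.

import numpy as np
from scipy import stats

# Hypothetical sample of 100 student weights in kg (made-up data)
rng = np.random.default_rng(0)
weights = rng.normal(loc=63, scale=8, size=100)

# 95% confidence interval for the population mean (t-distribution)
mean = weights.mean()
sem = stats.sem(weights)  # standard error of the mean
ci_low, ci_high = stats.t.interval(0.95, df=len(weights) - 1,
                                   loc=mean, scale=sem)

print(f"Sample mean: {mean:.1f} kg")
print(f"95% confidence interval: [{ci_low:.1f}, {ci_high:.1f}] kg")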

In Machine Learning, the term inference sometimes means making a prediction using a previously trained model, while inferring the properties of the model itself is referred to as training or learning. Even though machine learning and statistics are closely related fields in terms of methods, their principal goals are different: machine learning finds generalizable predictive functions based on historical data that can be used to predict future outcomes.

In summary, descriptive statistics are used to understand the characteristics or distribution of the features of a dataset, and inferential statistics are used to draw conclusions about populations based on sample data using hypothesis tests or confidence intervals. The ultimate goal of data analysis is to provide insight. These insights from the statistical analysis can be utilized in further processes such as feature selection or feature engineering.

I hope this article helps you to understand statistics and its types.

Thank you so much for reading! 😊🙏
