#1 | What is Statistics? | 7-Days of Statistics for Data Science

Statistics is an essential part of Data Science. This article covers the importance of statistics in data science and the types of statistics in detail, so that you can understand data using Python and the Pandas library.

Madhuri Patil
8 min read · Jun 29, 2023

Mathematics and Statistics are the foundation of Data Science and Machine Learning. Nowadays every company is data-driven and has a massive amount of data.

Data is nothing but a formless stream of bytes; it becomes information only after raw data is processed and transformed. That is why organizations are always looking for data professionals who can extract meaningful insights from raw data to solve complex problems or to build intelligent systems, which ultimately helps grow business revenue.

Today, data is cheap and available everywhere. Asking the right question, however, is expensive. Defining the problem is the first step in any machine-learning project, and it is one of the critical skills in the Machine Learning or Data Science domain.

If you are unable to define a problem, that is, you have the data but no idea how to utilize it for the business, then how could you build a system that helps the business succeed?

Statistics and mathematics can help you understand historical data so you can ask the right questions. Machine learning is centered around statistics; statistics provides the tools and techniques to understand and manipulate data and to infer meaningful information from it. If you are interested in data science, you must have a strong understanding of these subjects.

Welcome to the 7-Days of Statistics for Data Science series. In this series, we will learn the basics of statistics for Data Science with practical implementations using Python and its libraries.

In this first article, we will cover the basic concepts of statistics that every data scientist should know, such as what statistics is and its main types. We will also see how to apply them in Python with the Pandas library.

Let’s start!

What is Statistics?

Statistics is a branch of mathematics concerned with the collection, analysis, and interpretation of data to make informed decisions.

Often in statistics, we collect data to find answers to our questions about a population. For example, what is the average height of a woman in a particular country?

The population represents every possible individual or object that we are interested in measuring. However, gathering data for the entire population is usually not feasible, so researchers collect samples of data. A sample is a subset of a population. We can then generalize the inferences and insights from the sample to the larger population.

For example, suppose you want to answer a question such as: what is the average weight of students in a particular school of 5,000 students? In this scenario, the population is the weight of every student in the school. Surveying all 5,000 students might take too long, so we might instead collect data from 100 randomly selected students and ask them their weight.

Here, the 5,000 students represent the population, and the 100 randomly selected students represent a sample. We can then generalize the results from the sample to the entire population.

A sample must be representative of the population. For example, if the population of 5,000 students contains 60% boys and 40% girls, then a random sample of 100 students from that population should also contain roughly 60% boys and 40% girls; only then can we generalize the results from the sample to the overall population.
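As an illustration, pandas makes it easy to draw such a sample. The snippet below is only a minimal sketch with a hypothetical DataFrame and made-up column names; it draws a simple random sample and a stratified sample that preserves the 60/40 gender split by sampling within each group.

# Import pandas library
import pandas as pd

# Hypothetical population of 5,000 students (illustrative column names)
students = pd.DataFrame({
    "gender": ["boy"] * 3000 + ["girl"] * 2000,
    "weight": 60,  # placeholder weight values
})

# Simple random sample of 100 students
random_sample = students.sample(n=100, random_state=42)

# Stratified sample: 2% from each gender group keeps the 60/40 split
stratified_sample = students.groupby("gender", group_keys=False).apply(
    lambda g: g.sample(frac=0.02, random_state=42)
)

print(stratified_sample["gender"].value_counts())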

There are two main statistical methods used in data analysis:

  • Descriptive Statistics
  • Inferential Statistics

Descriptive Statistics

Descriptive statistics are summary statistics that describe or summarize a set of data, measuring the main characteristics of its features using graphs, tables, or other data visualization methods.

It helps us to understand the data in more depth. Descriptive statistics provide insight into what has happened in the past, before attempting to explain why it happened or to predict what will happen in the future.

It commonly uses measures of central tendency, such as the mean, median, and mode, and measures of variability or dispersion, such as the standard deviation or variance, the minimum and maximum values of the variables, and the kurtosis and skewness of the data, to give you an idea of the distribution of your data.
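Most of these measures are available directly in pandas. The snippet below is a minimal sketch on a small hypothetical series of exam scores; the values are made up purely for illustration.

import pandas as pd

# Hypothetical exam scores
scores = pd.Series([55, 62, 66, 66, 70, 74, 81, 90])

# Measures of central tendency
print("mean:  ", scores.mean())
print("median:", scores.median())
print("mode:  ", scores.mode().tolist())

# Measures of variability or dispersion
print("std:     ", scores.std())
print("variance:", scores.var())
print("min/max: ", scores.min(), scores.max())
print("skewness:", scores.skew())
print("kurtosis:", scores.kurt())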

Characteristics of the data can also be represented graphically using histograms. Descriptive statistics may also be used to understand the relationship between two variables, using a contingency table or graphical representations such as box plots or scatter plots.
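A contingency table, for instance, can be built with pd.crosstab. The example below is a small sketch on hypothetical categorical data; a real dataset is introduced in the next section.

import pandas as pd

# Hypothetical categorical data
df = pd.DataFrame({
    "gender": ["female", "male", "female", "male", "female", "male"],
    "lunch": ["standard", "free/reduced", "standard",
              "standard", "free/reduced", "standard"],
})

# Contingency table: counts of lunch type by gender
print(pd.crosstab(df["gender"], df["lunch"]))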

We commonly use two types of analysis while doing exploratory data analysis in Machine Learning:

  • Univariate Analysis
  • Bivariate Analysis

Univariate Analysis

Univariate analysis provides summary statistics of a single variable using measures of central tendency and measures of variability. It helps us study the distribution of each feature in depth, so that we can use this analysis later for feature selection and model optimization.

Let’s see how we can use Python and Pandas to perform univariate analysis on data. For this, I have used the Students Performance dataset from Kaggle.

# Import pandas library
import pandas as pd

# Read Data
data = pd.read_csv("StudentsPerformance.csv")

We can print basic metadata about the data using the pandas.DataFrame.info() method. It shows the total number of rows and columns in the dataset, the datatype of each feature along with its number of non-null values, and the total memory usage of the data.

# Metadata Information
>>> data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column                       Non-Null Count  Dtype
---  ------                       --------------  -----
 0   gender                       1000 non-null   object
 1   race/ethnicity               1000 non-null   object
 2   parental level of education  1000 non-null   object
 3   lunch                        1000 non-null   object
 4   test preparation course      1000 non-null   object
 5   math score                   1000 non-null   int64
 6   reading score                1000 non-null   int64
 7   writing score                1000 non-null   int64
dtypes: int64(3), object(5)
memory usage: 62.6+ KB

To get quick summary statistics of the data, we can use the pandas.DataFrame.describe() method. It returns a data frame with the total count, mean, standard deviation, and the five-number summary (minimum, 25th, 50th, and 75th percentiles, and maximum) for each numerical feature.

For categorical features, we can use the same method with the parameter include='all'; it then returns summary statistics for every feature, including categorical columns. You can also explicitly pass include a list of the column datatypes you want a summary description for.

# Summary statistics of numerical features
>>> data.describe()

        math score  reading score  writing score
count   1000.00000    1000.000000    1000.000000
mean      66.08900      69.169000      68.054000
std       15.16308      14.600192      15.195657
min        0.00000      17.000000      10.000000
25%       57.00000      59.000000      57.750000
50%       66.00000      70.000000      69.000000
75%       77.00000      79.000000      79.000000
max      100.00000     100.000000     100.000000


# Summary statistics for categorical features with `Object` type.
>>> data.describe(include=['O'])
        gender race/ethnicity parental level of education     lunch  \
count     1000           1000                        1000      1000
unique       2              5                           6         2
top     female        group C                some college  standard
freq       518            319                         226       645

       test preparation course
count                     1000
unique                       2
top                       none
freq                       642

# Summary statistics for all features.
>>> data.describe(include='all')
        gender race/ethnicity parental level of education     lunch  \
count     1000           1000                        1000      1000
unique       2              5                           6         2
top     female        group C                some college  standard
freq       518            319                         226       645
mean       NaN            NaN                         NaN       NaN
std        NaN            NaN                         NaN       NaN
min        NaN            NaN                         NaN       NaN
25%        NaN            NaN                         NaN       NaN
50%        NaN            NaN                         NaN       NaN
75%        NaN            NaN                         NaN       NaN
max        NaN            NaN                         NaN       NaN

       test preparation course  math score  reading score  writing score
count                     1000  1000.00000    1000.000000    1000.000000
unique                       2         NaN            NaN            NaN
top                       none         NaN            NaN            NaN
freq                       642         NaN            NaN            NaN
mean                       NaN    66.08900      69.169000      68.054000
std                        NaN    15.16308      14.600192      15.195657
min                        NaN     0.00000      17.000000      10.000000
25%                        NaN    57.00000      59.000000      57.750000
50%                        NaN    66.00000      70.000000      69.000000
75%                        NaN    77.00000      79.000000      79.000000
max                        NaN   100.00000     100.000000     100.000000

When you print the description for all features, it returns NaN for inapplicable values. For example, for the gender column there is no mean or standard deviation of values such as female or male. For categorical variables, it returns the total number of observations, the number of unique values present in the column, and the most frequent value along with its count.

You can further evaluate each individual feature with a histogram plot or, for categorical features, with pandas methods such as data[column_name].value_counts(), which counts the occurrences of each unique value, as shown below.
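For example, continuing with the same dataset (the column names below come from the info() output above), we can count the categories of a categorical feature and plot a histogram of a numerical one:

# Unique value counts for a categorical feature
print(data['gender'].value_counts())

# Histogram of a numerical feature
import matplotlib.pyplot as plt

data['math score'].hist(bins=20)
plt.xlabel("Math Score")
plt.ylabel("Frequency")
plt.title("Distribution of Math Scores")
plt.show()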

Bivariate Analysis

Bivariate analysis examines the relationship between two variables. It helps you determine whether the variables are correlated or not. Visualizations are often used together with the numerical analysis as a more intuitive way of presenting the results. It provides insights into the problem and helps develop ideas or hypotheses for further quantitative research.

Using the pandas.DataFrame.groupby() method and plots, we can examine the relationship between two variables, as well as between the features and the response variable, as shown below.
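As a quick example with the same dataset, we can group by a categorical feature and compare the average scores of each group (the column names come from the dataset loaded above):

# Average scores grouped by test preparation course
print(data.groupby('test preparation course')[['math score',
                                               'reading score',
                                               'writing score']].mean())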

Let’s examine the relationship between the math score and writing score of students using a scatter plot from the matplotlib library.

# Import matplotlib
import matplotlib.pyplot as plt

# Scatter plot
plt.scatter(data['math score'], data['writing score'])

# Set labels
plt.xlabel("Math Score")
plt.ylabel("Writing Score")
plt.title("Scatter Plot")

plt.grid(ls='--', c='#000', alpha=0.2)
plt.show()

From the above scatter plot, we can say that there is a positive correlation between these two scores.
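We can also quantify this relationship with the correlation coefficient; a value close to +1 indicates a strong positive linear relationship.

# Pearson correlation between math score and writing score
print(data['math score'].corr(data['writing score']))

# Correlation matrix for all numerical score columns
print(data[['math score', 'reading score', 'writing score']].corr())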

Inferential Statistics

In statistics, inferential analysis tries to learn about a population from a sample, assuming the sample represents the larger population, using a statistical model; for example, testing a hypothesis and drawing conclusions about the population from sampled data. However, the sample might not provide a perfect estimate of the population every time. To measure this uncertainty, we can construct a confidence interval.

A confidence interval is a range of values that is likely to contain a population parameter with a certain level of confidence. For example, we might produce a 95% confidence interval of [61.5, 64.5] which says that we are 95% confident that the average weight of students from a certain school is between 61.5 kg and 64.5 kg.
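The snippet below is a minimal sketch of how such an interval could be computed, using a hypothetical sample of student weights and the scipy.stats module; the numbers it produces are illustrative, not the ones quoted above.

import numpy as np
from scipy import stats

# Hypothetical sample of 100 student weights in kg (made-up data)
rng = np.random.default_rng(0)
weights = rng.normal(loc=63, scale=8, size=100)

# 95% confidence interval for the population mean (t-distribution)
mean = weights.mean()
sem = stats.sem(weights)  # standard error of the mean
ci_low, ci_high = stats.t.interval(0.95, df=len(weights) - 1,
                                   loc=mean, scale=sem)

print(f"Sample mean: {mean:.1f} kg")
print(f"95% confidence interval: [{ci_low:.1f}, {ci_high:.1f}] kg")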

In Machine Learning, the term inference sometimes means making a prediction using a previously trained model, while inferring the properties of the model itself is referred to as training or learning. Even though machine learning and statistics are closely related fields in terms of methods, their principal goals are different: machine learning finds generalizable predictive functions based on historical data that can be used to predict future outcomes.

In summary, descriptive statistics are used to understand the characteristics or distribution of the features of a dataset, and inferential statistics are used to draw conclusions about populations based on sample data using hypothesis tests or confidence intervals. The ultimate goal of data analysis is to provide insight. These insights from the statistical analysis can be utilized in further processes such as feature selection or feature engineering.

I hope this article helps you to understand statistics and its types.

Thank you so much for reading! 😊🙏
