Statistics: an ancient tool to unearth present-day Data Science

Rajesh Sharma
Published in Geek Culture · 8 min read · Jun 30, 2021
Image by Freepik

We are witnessing an era of fusion where computer systems are getting educated on top of extensively available data. And, Statistics is one field that helps these systems to discover insights.

Here is a quote from George Edward Box:

“Statistics is about the catalysis of the scientific method itself.”

This means that Statistics is the essence of any scientific field. And, this blog is devoted to excavating and explaining the underlying details of a few statistical techniques predominantly used in data science.

Also, we will implement them from scratch and compare the results of the self-written functions with open-source libraries like Matplotlib, Seaborn, and others.

Top-level view

2-Way relationship b/w Population and Sample

Population: This term represents the group of people of interest.

For example, imagine yourself as the head of product manufacturing at a beverages company that wants to know how many people in India prefer ‘jellies over cakes’, in order to increase profit. Here, the group of people of interest is the entire population of India, also denoted as ‘N’.

Sample: This term represents the subset of the population that is considered as the actual image of it.

Let’s continue with the above example: rather than going to every person, you randomly select some ‘n’ number of people (say 1000) from the population. These 1000 people represent the sample, where ‘n’ << ‘N’.
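As a quick illustration, drawing a simple random sample of ‘n’ people from a population can be sketched in a few lines with Python’s random.sample (the population here is just a hypothetical range of person IDs):

```python
import random

# Hypothetical setting: a population of 1,000,000 people, each labelled by an ID.
N = 1_000_000
population = range(N)

# Draw a simple random sample of n = 1000 distinct people, so that n << N.
n = 1000
sample = random.sample(population, n)
```

random.sample draws without replacement, so no person is surveyed twice.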

Here, a question arises: why do we use a sample?

  • To speed up the analysis
  • Cost-effective
  • Speed up the decision making

Imagine if you were asked to reach out to every person in India to ask whether he/she likes jelly or cake; such a survey would take huge effort and time to complete. One thing we need to make sure of is that the sample must be a true image, or representative, of the population.

Continuing the above example, imagine that in the sample you mainly surveyed elderly people. Such a sample would not be a true image of the population, and analysis carried out on such a subset would not provide accurate, or near-accurate, results. Ideally, a sample should consist of people from all age groups, genders, ethnicities, etc.

What are ‘Population Parameter’ and ‘Sample Statistic’?

Numerical Descriptor terms of Population & Sample

Population Parameter: It represents the numerical descriptions of a population characteristic.

For example, the mean height of all females in the world is 5 feet 5 inches. Here, population characteristic is the ‘height of all females’, and the numerical descriptor is mean height i.e. 5 feet 5 inches.

Sample Statistic: It represents the numerical descriptions of a sample characteristic.

For example, out of 100 females who visited a nearby store, 45% dislike chocolate cakes. Here, the sample characteristic is ‘dislike chocolate cakes’, and the numerical descriptor is 45%.

Branches of Statistics

Descriptive Statistics is responsible for the below-mentioned points:

  • A process that involves summarizing or describing the data.
  • Carrying out statistical analysis, e.g. finding the central tendency of the data.
  • Extensive data exploration and visualization techniques to find hidden trends and patterns.

Inferential Statistics is responsible for the below-mentioned points:

  • Mapping the statistical analysis carried out on the sample, via descriptive statistics, to come up with the best estimate of a population parameter.
  • Conducting various tests to validate the sample statistic against established standards.

How to generate Freq Histogram, Relative Freq Histogram, Cumulative Freq Histogram, and Probability Densities?

Imagine we have the below list of marks scored by 14 students out of a total of 50:

[11,11,12,13,14,11,16,17,18,20,30,40,45,50]

Now, in descriptive statistics we might try to find the answers to the below questions:

Q1. How many students scored 11 marks?

Q2. How many students scored less than the minimum passing marks say 20?

Q3. What percentage of students scored more than 20?

Q4. What percentage of students scored 20, 30, or 40?

Q5. How can we visualize the overall student scores?

In the above small dataset it is quite easy to find the answers manually, but with a large dataset, plotting a bar graph, histogram, relative frequency histogram, or cumulative frequency histogram will make our life much easier. So, let’s work that out step by step:
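Before plotting anything, the first four questions above can be answered directly in Python; a small sketch over the same marks list:

```python
marks = [11, 11, 12, 13, 14, 11, 16, 17, 18, 20, 30, 40, 45, 50]
n = len(marks)

# Q1: how many students scored 11 marks?
q1 = marks.count(11)

# Q2: how many scored less than the minimum passing marks (20)?
q2 = sum(1 for m in marks if m < 20)

# Q3: what percentage scored more than 20?
q3 = 100 * sum(1 for m in marks if m > 20) / n

# Q4: what percentage scored 20, 30, or 40?
q4 = 100 * sum(1 for m in marks if m in (20, 30, 40)) / n

print(q1, q2, q3, q4)
```

Q5 is exactly what the histograms in the following steps answer.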

Step-1: Generate frequency distribution table from the data

The frequency distribution table contains the frequencies or number of occurrences of a dataset value within a specified closed range(called a class).

Frequency distribution table

Here, we divided the dataset into classes of a fixed width, i.e. 5, and hence ended up with 8 classes. For example, 11–15 is class 1, 16–20 is class 2, and so on up to 46–50, which is class 8.

Class width Formula

And, class width = (Upper value - Lower value) + 1. For example, in class 1 the upper value is 15 and the lower value is 11, so the class width is (15 - 11) + 1 = 5.
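The frequency distribution table can be rebuilt from scratch; a minimal sketch assuming the fixed class width of 5 used above:

```python
marks = [11, 11, 12, 13, 14, 11, 16, 17, 18, 20, 30, 40, 45, 50]
class_width = 5

# Walk from the minimum to the maximum value in steps of the class width,
# counting how many marks fall in each closed range [lower, upper].
freq_table = {}
lower = min(marks)
while lower <= max(marks):
    upper = lower + class_width - 1          # width = (upper - lower) + 1
    freq_table[(lower, upper)] = sum(1 for m in marks if lower <= m <= upper)
    lower = upper + 1

print(freq_table)
```

This reproduces the 8 classes of the table, with class 1 (11–15) holding 6 values.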

Step-2: Calculate the class boundaries

Class boundaries are required to plot the histogram and are calculated using the below formula:

How class boundary is calculated?
Frequency distribution table with class boundaries
Function reporting class, relative freq’s, and prob density in a distribution table

Output of the above plot_hist() function:

hist_data_results, bins_intervals, bins_prob_density = plot_hist(hist_data, number_of_bins=10)
hist_data_results
Class frequency distribution table

Here, you can see the dataset is divided into fixed-width classes (generally the width is the same, though it can also vary among classes). Each class has a count of the values lying in that particular class.

For example, between 11 and 14.901 we have 11, 11, 12, 13, 14, 11; thus the ‘Frequency’ for ‘Class [11–14.901]’ is 6.

In the above function, we also calculated the Relative Frequency and Probability Density for every class, which we will now work through.

Step-3: Calculate Relative Frequency

Relative Frequency formula

Hence, the relative frequency for class-1 [11–14.901] is 6 / (6+3+1+0+1+0+0+1+1+1) = 6/14 ≈ 0.429
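In code, the relative frequencies follow directly from the class counts; a sketch using the 10-bin frequencies listed above:

```python
# Class frequencies for the 10-bin split used above.
freqs = [6, 3, 1, 0, 1, 0, 0, 1, 1, 1]
total = sum(freqs)                      # 14 students in all

rel_freqs = [f / total for f in freqs]
# class-1: 6 / 14 ≈ 0.429; relative frequencies always sum to 1.
print(rel_freqs)
```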

Step-4: Calculate Probability Density

For probability density we need two things:

  • number of bins or intervals(represented as n_h)

There are various ways by which we can find n_h:

  1. sqrt(n)
  2. log2(n)
  3. 2n^(1/3)

Here, n is the number of records in the dataset.

  • bins width(represented as h)

h = (max value — min value) / n_h

For example, in the above-used dataset the min and max values are (11, 50) and n_h = 10. Hence, h = (50 - 11) / 10 = 3.9
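The three rules of thumb for n_h and the bin-width formula can be checked quickly; a sketch for the n = 14 marks dataset:

```python
import math

n = 14                        # records in the marks dataset

# The three rules of thumb for the number of bins n_h:
rule_sqrt = math.sqrt(n)      # sqrt(n)    ≈ 3.74
rule_log2 = math.log2(n)      # log2(n)    ≈ 3.81
rule_rice = 2 * n ** (1 / 3)  # 2n^(1/3)   ≈ 4.82 (round up in practice)

# With n_h fixed at 10 (as in the post), the bin width h is:
n_h = 10
h = (50 - 11) / n_h           # (max - min) / n_h = 3.9
```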

Now,

How Probability Density gets calculated?

Therefore, for class-1 the probability density is 6 / (14 * 3.9) ≈ 0.1099
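The per-class probability densities then follow from the same formula; note that the bars of a density histogram always enclose a total area of 1:

```python
freqs = [6, 3, 1, 0, 1, 0, 0, 1, 1, 1]  # 10-bin class frequencies
n, h = 14, 3.9                           # record count and bin width

# density = frequency / (n * h); class-1 gives 6 / (14 * 3.9) ≈ 0.1099.
densities = [f / (n * h) for f in freqs]
print(densities)
```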

Below are some plot-by-plot comparisons:

Comparison b/w self-written function and MatplotLib generated prob densities

Cumulative Frequency

  • It is the sum of the frequency for a class & all the previous classes.
def cum_sum(inp_data):
    """
    Description: This function calculates the cumulative sum.
    """
    cum_sum = []
    cum_sum.append(inp_data[0])
    for i in range(1, len(inp_data)):
        cum_sum.append(cum_sum[i-1] + inp_data[i])
    return cum_sum
## Self calculated cumulative relative frequencies
hist_data_results['Cum_Rel_Freq'] = cum_sum(hist_data_results['Relative_Freq'])
hist_data_results['Cum_Rel_Freq']
Class wise Cumulative Frequencies
Comparison b/w self-written function and MatplotLib generated cumulative relative freqs
Seaborn and self-generated plots
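As a sanity check, the hand-written cum_sum() can be cross-checked against the standard library’s itertools.accumulate, which computes the same running total (the function is restated here so the snippet is self-contained):

```python
from itertools import accumulate

def cum_sum(inp_data):
    """Running total of a list, as defined above."""
    out = [inp_data[0]]
    for x in inp_data[1:]:
        out.append(out[-1] + x)
    return out

rel_freqs = [6/14, 3/14, 1/14, 0.0, 1/14, 0.0, 0.0, 1/14, 1/14, 1/14]

# Both versions add the terms left to right, so the results match exactly.
print(cum_sum(rel_freqs) == list(accumulate(rel_freqs)))  # True
```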

The only difference here is that I have used bar_plot() in the plot_hist() for comparing the graphs.

How to generate the probability distribution function using Kernel Density Estimator(KDE)?

We can smooth the histogram to create a probability distribution, and one way of achieving this is KDE. In KDE, a Gaussian kernel is created for every data point, and the overlapping kernels are summed to create the overall PDF.

Courtesy: Wikipedia

The above figure shows a comparison of the histogram (left) and the kernel density estimate (right) constructed from the same data. The six individual Gaussian kernels are the red dashed curves, and the kernel density estimate is the blue curve. The data points are the rug plot on the horizontal axis.

Let’s try to implement this:

KDE with Gaussian Kernels
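Since the implementation above is shown as an image, here is a minimal from-scratch sketch of the same idea: one Gaussian kernel per data point, averaged and scaled by the bandwidth h:

```python
import math

def gaussian_kernel(u):
    """Standard normal density."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def kde(x, data, h):
    """Place one Gaussian of bandwidth h on every data point and average."""
    return sum(gaussian_kernel((x - xi) / h) for xi in data) / (len(data) * h)

marks = [11, 11, 12, 13, 14, 11, 16, 17, 18, 20, 30, 40, 45, 50]

# Evaluate the smoothed PDF on a fine grid from -10 to 79.9 in steps of 0.1.
curve = [kde(x / 10, marks, 3.9) for x in range(-100, 800)]
```

Because each kernel integrates to 1 and the sum is divided by n, the resulting curve is a valid PDF with total area 1.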

Discrete Variable: Probability Mass Function

Discrete variable plot

PMF comparisons

Self-written function results
MatplotLib generated PMF

Here, we got the equiprobable plot.

Seaborn generated KDE plot
Self-generated KDE plot
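For reference, the PMF of an equiprobable discrete variable can be computed with collections.Counter; a sketch with a hypothetical sample of die faces, each observed equally often:

```python
from collections import Counter

# Hypothetical discrete sample: 6 die faces, each observed 10 times.
rolls = [1, 2, 3, 4, 5, 6] * 10

counts = Counter(rolls)
pmf = {face: c / len(rolls) for face, c in counts.items()}
# Every face gets probability 10/60 = 1/6 -- an equiprobable PMF.
print(pmf)
```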

Continuous Random Variable(Gaussian): PDF

Histogram of continuous variable

PDF comparisons

Probability Densities: Self-implementation
MatplotLib generated PDF

KDE Comparisons

Seaborn generated KDE
KDE: Self-implementation

The above plot shows the normal bell-shaped curve.

What is the effect of a lower or higher bandwidth value h on KDE?

Case-I: KDE with a smaller value of h

Squiggly KDE plot with h=0.5

Case-II: KDE with a higher value of h

Flat KDE plot with h=25

A smaller value of h produces a squiggly, over-fitted curve, while a larger value of h gives a flat, over-smoothed KDE plot.
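This bandwidth effect can be verified numerically; reusing the same from-scratch KDE (redefined here so the snippet is self-contained), a small h yields a much taller peak than a large h over the same data:

```python
import math

def kde(x, data, h):
    """Gaussian-kernel density estimate at x with bandwidth h."""
    return sum(math.exp(-0.5 * ((x - xi) / h) ** 2) / math.sqrt(2 * math.pi)
               for xi in data) / (len(data) * h)

marks = [11, 11, 12, 13, 14, 11, 16, 17, 18, 20, 30, 40, 45, 50]

# A small bandwidth keeps a tall, spiky peak per point; a large one flattens it.
peak_small = max(kde(x, marks, 0.5) for x in range(10, 51))
peak_large = max(kde(x, marks, 25) for x in range(10, 51))
print(peak_small > peak_large)  # True
```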

Hurray, you have reached the end of this blog, and I hope you enjoyed it. Don’t forget to tap some claps.

