All About Statistics for Data Science (Part 1)

Sumaya Bai
4 min read · Mar 8, 2022


Image Source : https://www.edx.org/course/statistics-unlocking-the-world-of-data

How much statistics is too much statistics? This post tries to answer that question.
Through this post I will try to cover the key statistics topics for data science.

As we have all read many times, a data analyst or data scientist is said to spend about 80% of their time preprocessing data in order to build an accurate machine learning model. One should not underestimate the importance of statistics here. When we dig insights out of data, we are using statistical knowledge to uncover the patterns hidden in it, and that knowledge is essential in the world of data.

I’ll be splitting this story into two parts: descriptive statistics and inferential statistics.
In this post, I’ll concentrate on descriptive statistics.

The basic difference between descriptive and inferential statistics is this: descriptive statistics describes and summarizes the data we have, using summary numbers and visuals such as bar diagrams and pie charts, whereas inferential statistics uses a sample of data to draw conclusions about the larger population it came from, giving insights that help in decision making.

Descriptive statistics :

There are three major dimensions of descriptive statistics.
1) Measure of Central Tendency,
2) Measure of Dispersion and
3) Shape of the data.

1) Measure of Central Tendency :
Image source : https://365datascience.com/tutorials/statistics-tutorials/measures-central-tendency/

This dimension focuses on where the data is located, and typically tries to find the center of the distribution of the data.
There are three common measures:
a) Mean : The mean is also known as the average of the data. It is the sum of all the elements divided by the number of elements in the dataset.
Mean = (X1 + X2 + X3 + … + Xn) / n
For example, if a grocery store owner wants to know the average daily sales for the last month, he can use the mean.
b) Median : The median is the center point of the dataset when it is arranged in ascending or descending order. It divides the dataset into two halves.
If the dataset has an odd number of observations, the median is the middle observation, whereas if the dataset has an even number of observations, the median is the average of the two middle observations.
For example, if the same grocery store owner wants a typical daily sales figure for one week that is not distorted by one unusually busy or slow day, he could use the median.
c) Mode : The mode is the most frequently occurring observation in a dataset. There can be more than one mode in a dataset.
For example, if the grocery store owner wants to know which sales volume occurred on the most days of the week, he can use the mode.
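To make the three measures concrete, here is a minimal Python sketch using the standard library's statistics module. The daily_sales numbers are made up for the grocery store example and are not from any real dataset.

```python
import statistics

# Hypothetical daily sales for one week (made-up numbers for illustration)
daily_sales = [120, 150, 150, 90, 200, 150, 400]

# Mean: sum of all values divided by the number of values
mean_sales = statistics.mean(daily_sales)

# Median: middle value once the data is sorted
median_sales = statistics.median(daily_sales)

# Mode(s): multimode returns a list, since a dataset can have more than one mode
modes = statistics.multimode(daily_sales)

print("Mean:", mean_sales)
print("Median:", median_sales)
print("Mode(s):", modes)
```

With these numbers the mean comes out at 180 while the median is 150, because the single 400 day pulls the mean upward but leaves the median alone.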

2) Measure of Dispersion :

Image source : https://protonstalk.com/statistics/measures-of-dispersion/

Dispersion tells us how far the data points are spread out.
There are a few ways to measure dispersion:
a) Range : The range is the difference between the maximum value and the minimum value in the dataset.

We shouldn’t rely on the range when either the maximum or the minimum value is an outlier, because that single value then decides the whole measure.
b) Variance : Variance measures how far the data points are spread out from the mean. It is the average of the squared differences between each data point and the mean. A high variance tells us that the data points are spread far away from the mean, and a small variance tells us that the data points are closer to the mean of the dataset.

c) Standard Deviation : The standard deviation is the square root of the variance, so it is expressed in the same units as the data.

d) Quartiles : Quartiles are the three values (Q1, Q2 and Q3) that divide the sorted dataset into four equal parts. Q2 is the median.
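The four measures of dispersion above can be sketched in a few lines of Python. This is only an illustration: the daily_sales numbers are made up, and I am assuming the population versions of variance and standard deviation (dividing by n). For a sample you would usually use statistics.variance and statistics.stdev instead. Quartile values also depend on the interpolation method; this sketch uses the module's default.

```python
import statistics

# Hypothetical daily sales, sorted, with one unusually busy day (made-up numbers)
daily_sales = [90, 120, 150, 150, 150, 200, 400]

# Range: maximum value minus minimum value
data_range = max(daily_sales) - min(daily_sales)

# The same range with the outlier (400) left out, to show how sensitive it is
without_outlier = daily_sales[:-1]
range_without_outlier = max(without_outlier) - min(without_outlier)

# Population variance: average squared distance from the mean (assumption: divide by n)
variance = statistics.pvariance(daily_sales)

# Population standard deviation: square root of the variance
std_dev = statistics.pstdev(daily_sales)

# Quartiles: three cut points that split the sorted data into four parts
q1, q2, q3 = statistics.quantiles(daily_sales, n=4)

print("Range:", data_range, "| without outlier:", range_without_outlier)
print("Variance:", variance)
print("Standard deviation:", std_dev)
print("Quartiles:", q1, q2, q3)
```

Here the range drops from 310 to 110 once the 400 day is removed, which is why the range is a poor choice when the extremes are outliers.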

3) Shape of the data :

The shape of the data is important because it tells us how the values are distributed, which helps when reasoning about how likely different values are.
There are mainly two cases:
a) Symmetric : In this shape, the data is distributed the same way on both sides of the center.
b) Skewed : Most of the time data isn’t symmetric. It is skewed either to the right or to the left, i.e., positively skewed or negatively skewed.
i) Positively skewed : This is the case when the tail on the right side of the curve is longer than the tail on the left side. For these distributions, the mean is typically greater than the mode.
ii) Negatively skewed : This is the case when the tail on the left side of the curve is longer than the tail on the right side. For these distributions, the mean is typically smaller than the mode.
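One way to put a number on the shape is a skewness coefficient. The sketch below is my own illustration, not something from a specific library: the skewness helper is a hypothetical function implementing the moment-based formula (the average cubed distance from the mean, in standard deviation units), and the three small datasets are made up.

```python
import statistics

def skewness(values):
    """Moment-based skewness: mean of the cubed standardized deviations.

    Hypothetical helper for illustration; assumes the values are not all
    identical (otherwise the standard deviation is zero).
    """
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return sum(((x - mean) / std) ** 3 for x in values) / len(values)

# Made-up datasets for illustration
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]     # long tail on the right
left_skewed = [-x for x in right_skewed]     # mirror image, long tail on the left
symmetric = [1, 2, 3, 4, 5]

for name, data in [("right", right_skewed), ("left", left_skewed), ("symmetric", symmetric)]:
    print(name,
          "| skewness:", round(skewness(data), 3),
          "| mean:", statistics.mean(data),
          "| mode:", statistics.mode(data))
```

A positive result points to a longer right tail, a negative result to a longer left tail, and a value near zero to a roughly symmetric shape. In the right-skewed example the mean (4.75) sits above the mode (3), matching the rule of thumb above.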

That’s all about descriptive statistics. This post will be continued in part 2, where I'll explain inferential statistics for data science.

Happy Reading Y’all :)
