Important Statistics Concepts Every Data Scientist/Machine Learning Engineer Should Know: Part 1

Raushan Joshi
3 min read · Jan 22, 2023


Introduction

By now, anyone coming from a non-STEM background and exploring Data Science or Data Analytics must have realized that statistics will always be part of their journey. It is therefore recommended to build a strong foundation before moving forward.

Statistics is a branch of mathematics that deals with the collection, interpretation, analysis, and presentation of data. It provides the tools and methods for analyzing data and drawing inferences from it.

Let’s start with basic concepts

In statistics, the mean is the average of a dataset, calculated by adding all the values in the dataset and dividing by the number of values. The median is the middle value of a dataset when the values are ordered from smallest to largest. The mode is the most frequently occurring value in a dataset.

For example, consider the following dataset of numbers: 3, 7, 8, 5, 7, 4, 1

The mean of this dataset is (3 + 7 + 8 + 5 + 7 + 4 + 1) / 7 = 35 / 7 = 5

The median of this dataset is the middle element after ordering the values from smallest to largest. With an odd number of elements, as here (1, 3, 4, 5, 7, 7, 8), the median is 5. With an even number of elements, the median is the average of the two middle values: for example, if 9 were added to the dataset (1, 3, 4, 5, 7, 7, 8, 9), the median would be (5 + 7) / 2 = 6.

The mode of this dataset is 7, as it is the most frequently occurring value. A dataset can be described by its number of modes: unimodal (one mode), bimodal (two), trimodal (three), or more generally multimodal (two or more). When a dataset contains only unique elements, it is said to have no mode at all.
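As a quick check, the three measures can be computed with Python's standard `statistics` module. This is only a minimal sketch; the variable names (`data`, `data_even`, and so on) are illustrative and not part of the original example.

```python
import statistics

# Dataset from the example above
data = [3, 7, 8, 5, 7, 4, 1]

mean_value = statistics.mean(data)      # (3+7+8+5+7+4+1) / 7 = 5
median_value = statistics.median(data)  # middle of 1, 3, 4, 5, 7, 7, 8 is 5

# Even number of elements: the median is the average of the two middle values
data_even = data + [9]
median_even = statistics.median(data_even)  # (5 + 7) / 2 = 6.0

# multimode returns every value tied for the highest count
modes = statistics.multimode(data)  # [7]

# With only unique values, every value is tied, so no single mode stands out
unique_modes = statistics.multimode([1, 2, 3])  # [1, 2, 3]

print(mean_value, median_value, median_even, modes, unique_modes)
```

Note that `statistics.multimode` (Python 3.8+) returns all tied values, so a result as long as the dataset itself corresponds to the "no mode" case.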

Note: The mean, median, and mode are all measures of central tendency, which are used to describe the “center” of a dataset. The mean is the most commonly used of the three, but the median and mode can be more informative in certain situations, particularly when the dataset has outliers (extreme values) or is not normally distributed.
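To illustrate why the median can be the better choice when outliers are present, the sketch below appends one extreme value to the example dataset and compares the two measures. The value 100 is made up purely for illustration.

```python
import statistics

data = [3, 7, 8, 5, 7, 4, 1]
with_outlier = data + [100]  # 100 is a hypothetical extreme value

mean_before = statistics.mean(data)              # 5
mean_after = statistics.mean(with_outlier)       # 135 / 8 = 16.875

median_before = statistics.median(data)          # 5
median_after = statistics.median(with_outlier)   # (5 + 7) / 2 = 6.0

print("mean:  ", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)
```

A single extreme value more than triples the mean, while the median moves by only one unit.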

The standard deviation (σ, sigma) of a dataset is a measure of the spread or variability of the data. It quantifies how far the individual data points are from the mean (average) of the dataset. Mathematically, it is written as:

σ = sqrt(σ²) = sqrt(Σ(x_i - μ)² / N), where μ = mean, x_i = data points, N = total number of data points

σ of the above dataset (3, 7, 8, 5, 7, 4, 1), with μ = 5: sqrt((4 + 4 + 9 + 0 + 4 + 1 + 16) / 7) = sqrt(38 / 7) ≈ 2.33

Similarly, the variance is the square of the standard deviation (σ).
Variance (σ²) of the above dataset = 38 / 7 ≈ 5.43

In general, a dataset with a low standard deviation has data points that are close to the mean, while a dataset with a high standard deviation has data points that are more spread out.
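The formula can be applied step by step in a few lines of Python. This is a sketch under the assumption that the dataset is treated as a whole population (dividing by N, as in the formula above); the sample version, which divides by N - 1, is shown at the end for comparison.

```python
import math
import statistics

data = [3, 7, 8, 5, 7, 4, 1]
n = len(data)
mu = sum(data) / n  # 5.0

# Squared distance of each point from the mean: 4, 4, 9, 0, 4, 1, 16
squared_deviations = [(x - mu) ** 2 for x in data]

variance = sum(squared_deviations) / n  # 38 / 7, about 5.43
sigma = math.sqrt(variance)             # about 2.33

# The standard library's population versions use the same definition
lib_variance = statistics.pvariance(data)
lib_sigma = statistics.pstdev(data)

# Sample standard deviation divides by N - 1 instead of N
sample_sigma = statistics.stdev(data)   # sqrt(38 / 6), about 2.52

print(round(variance, 2), round(sigma, 2), round(sample_sigma, 2))
```

Which version to use depends on whether the data is the entire population or a sample drawn from a larger one.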

Note: When a dataset is skewed, the mean and standard deviation may not be good indicators of the central tendency and spread of the data. In such cases, other measures such as the median and the interquartile range can be used.

Order statistics are the values of a dataset ranked in order. They are often used in statistical analysis to describe the characteristics of a dataset. For example, in the above dataset (3, 7, 8, 5, 7, 4, 1), the first order statistic is the smallest value (1), the second order statistic is the second smallest value (3), and so on.

The minimum and maximum values (also known as the first and last order statistics) are commonly used to describe the range of a dataset.

Other commonly used order statistics include the quartiles (which divide a dataset into quarters), the deciles (which divide a dataset into tenths), and the percentiles (which divide a dataset into hundredths).
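The order statistics described above can be sketched with sorting and the standard `statistics.quantiles` function. Quantile conventions differ between tools, so the quartile values below are what Python's default ("exclusive") method gives; other libraries may return slightly different cut points.

```python
import statistics

data = [3, 7, 8, 5, 7, 4, 1]
ordered = sorted(data)  # [1, 3, 4, 5, 7, 7, 8]

first_order = ordered[0]   # smallest value: 1
second_order = ordered[1]  # second smallest value: 3

minimum, maximum = ordered[0], ordered[-1]
data_range = maximum - minimum  # 8 - 1 = 7

# Three cut points split the data into four quarters
q1, q2, q3 = statistics.quantiles(data, n=4)  # 3.0, 5.0, 7.0

# Nine cut points split the data into ten parts
deciles = statistics.quantiles(data, n=10)

print(ordered, data_range, (q1, q2, q3), len(deciles))
```

The second quartile (`q2`) is the median, which ties the quartiles back to the measures of central tendency discussed earlier.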

Note: Order statistics are often used to identify outliers (extreme values). In such cases, it can be helpful to summarize the distribution of a dataset and visualize the data using a box plot.

Thanks for reading. I hope the points above gave you a good understanding of the basic concepts. More concepts will be explained in Part 2, so please have a look.
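One common convention, and the one most box plots use for their whiskers, flags a value as an outlier when it lies more than 1.5 times the interquartile range (IQR) beyond the first or third quartile. The sketch below assumes that rule; the helper name `iqr_outliers`, the factor `k`, and the extra value 30 are all illustrative choices rather than anything prescribed by the article.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1                 # interquartile range
    lower = q1 - k * iqr          # lower fence
    upper = q3 + k * iqr          # upper fence
    return [v for v in values if v < lower or v > upper]

data = [3, 7, 8, 5, 7, 4, 1]

no_outliers = iqr_outliers(data)            # []
one_outlier = iqr_outliers(data + [30])     # [30]; 30 is a made-up extreme value

print(no_outliers, one_outlier)
```

For the original dataset the fences are -3 and 13, so nothing is flagged; after adding 30 the upper fence is 14.5 and only the new value falls outside it.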
