Important Statistics Concepts Every Data Scientist/Machine Learning Engineer Should Know: Part 2

Raushan Joshi
5 min read · Jan 26, 2023



Introduction

Great! Continuing forward to cover more Statistics concepts for Data Science/Machine Learning, it is also important to understand their practical use cases in a real Machine Learning life cycle. In this part, I will explain some of those concepts. It assumes prior understanding of a few concepts that you can refer to here.

In Machine Learning, experimentation and regular updating are key to sustaining the best model performance over long periods. Since the behaviour of the data that fuels machine learning models keeps changing, monitoring both the model’s performance and the data itself is essential.

Here, a common statistical concept called moments is used to compare different data distributions and to quantify that change.

What are moments in Statistics? How are they used?

Moments are ways of quantifying and describing the shape of a probability distribution. Several types of moments can be calculated in order to compare different probability distributions.

*The Mean or first moment*
It is the expected value, E(X), of a random variable and measures the center of a probability distribution. When data values change in scale (e.g. are doubled), we can see the change in the mean of the data distribution.
For example: imagine a Machine Learning model that predicts income levels for job profiles. Data collected 10 years ago is no longer appropriate because income levels for the same job profile have scaled up since then. Thus, the mean income for a job profile has changed, which can be detected using the first moment.

Note: The mean is considered the most widely used moment among monitoring metrics. In real-world data, the mean of most features drifts over time.
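Here is a minimal sketch of this idea using NumPy. The income figures and the 10% drift threshold are made-up illustrative assumptions, not fixed rules:

```python
import numpy as np

# Hypothetical income samples (in thousands) for the same job profile,
# drawn from ten-year-old reference data vs. freshly collected data.
rng = np.random.default_rng(42)
income_then = rng.normal(loc=50, scale=8, size=1_000)   # old reference data
income_now = rng.normal(loc=95, scale=8, size=1_000)    # current data

# First moment: the mean. A large relative shift suggests the
# reference data no longer reflects reality.
mean_then, mean_now = income_then.mean(), income_now.mean()
relative_shift = abs(mean_now - mean_then) / abs(mean_then)
print(f"mean then: {mean_then:.1f}, mean now: {mean_now:.1f}")

# Flag drift if the mean moved by more than an (arbitrary) 10% threshold.
if relative_shift > 0.10:
    print(f"Mean shifted by {relative_shift:.0%} -- consider retraining.")
```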

*The Variance or second central moment*
It measures the spread of a distribution around the mean. Sometimes data values change, but not every data point changes in the same proportion. This produces a different variation in the observed distribution, a shrinking or expanding effect.
For example: imagine a Machine Learning model that predicts investment portfolios. Five-year-old data shows people investing most of their money in large- and mid-cap portfolios, but after strong economic growth last year, people are putting a larger share of their money into small caps than before.

Thus, even if the mean amount of money invested is the same, the values are spread differently.

Note: Variance is also widely used as a monitoring metric in ML problems with ever more diverse data points, often with focus only on the top X% of values around the mean.
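Below is a small illustration with synthetic allocation numbers (purely hypothetical) showing how two samples with the same mean can have very different variances:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical portfolio allocations (% of money per investor) five years
# ago vs. today: same average allocation, but spread out very differently.
alloc_then = rng.normal(loc=30, scale=5, size=1_000)    # tightly clustered
alloc_now = rng.normal(loc=30, scale=15, size=1_000)    # much more diverse

# Second central moment: the variance. ddof=1 gives the sample variance.
print(f"mean then: {alloc_then.mean():.1f}, now: {alloc_now.mean():.1f}")
print(f"variance then: {alloc_then.var(ddof=1):.1f}, "
      f"now: {alloc_now.var(ddof=1):.1f}")
# The means are nearly equal, but the much larger variance reveals the
# expanding spread that the mean alone would miss.
```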

*The Skewness or third central moment*
It measures the asymmetry of a probability distribution compared to the Normal distribution. Asymmetry arises when data values are concentrated either below or above the mean. In ML this can affect model performance, and it is also termed data skew. The moment-based definition is skewness(X) = E[(X-E(X))³] / Variance^(3/2); a common practical approximation is Pearson’s Median Skewness:
skewness ≈ 3 * (Mean - Median) / sqrt(Variance)

Skewness is defined into two types:
1. Positively Skewed: When the distribution has a long tail on the right side and the majority of the values are concentrated on the left.
Here, Mean > Median > Mode and skewness > 0
2. Negatively Skewed: When the distribution has a long tail on the left side and the majority of the values are concentrated on the right.
Here, Mean < Median < Mode and skewness < 0

Note: It’s worth noting that skewness is a very useful metric for understanding customer behaviour in a given environment or around a major event. It is widely used as a monitoring metric in customer-facing ML models such as recommendation systems.
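A short sketch using synthetic purchase-amount data (an assumed example) to compute both the moment-based skewness via scipy.stats.skew and Pearson’s Median Skewness from the formula above:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# A positively skewed sample: long right tail (e.g. purchase amounts).
x = rng.exponential(scale=100, size=10_000)

# Moment-based skewness: E[(X - mu)^3] / sigma^3.
print(f"moment skewness: {stats.skew(x):.2f}")          # > 0, right tail

# Pearson's Median Skewness from the formula above.
pearson = 3 * (x.mean() - np.median(x)) / x.std(ddof=1)
print(f"Pearson's median skewness: {pearson:.2f}")      # also > 0
```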

*The Kurtosis or fourth central moment*
It measures the “peakedness” of a probability distribution compared to the Normal distribution, quantifying how much of the probability mass is concentrated in the tails. It is very helpful in detecting outliers. The kurtosis of a Normal distribution is 3.
Thus, kurtosis for any distribution is usually reported in excess terms.
kurtosis(X) = E[(X-E(X))⁴] / (E[(X-E(X))²])²
Excess kurtosis = kurtosis(X) - 3

1. Leptokurtic: positive excess kurtosis; sharper peak and heavier tails than the Normal.
2. Platykurtic: negative excess kurtosis; flatter peak and lighter tails than the Normal.
3. Mesokurtic: excess kurtosis close to zero, like the Normal distribution.

Image from SciPy Docs

Note: It’s worth noting that kurtosis is not always easy to interpret, and estimating it reliably can be computationally intensive. Even so, it is often used in financial modelling, where “peakedness” and “tailedness” are key metrics.
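The sketch below compares the excess kurtosis of three standard distributions; note that scipy.stats.kurtosis returns excess kurtosis by default (the Fisher definition), so the Normal’s value of 3 is already subtracted:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 100_000

samples = {
    "laplace (leptokurtic)": rng.laplace(size=n),   # heavy tails
    "normal (mesokurtic)": rng.normal(size=n),      # baseline
    "uniform (platykurtic)": rng.uniform(size=n),   # light tails
}

for name, x in samples.items():
    print(f"{name}: excess kurtosis = {stats.kurtosis(x):.2f}")
# Expected roughly: laplace ~ +3, normal ~ 0, uniform ~ -1.2
```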

In the field of Machine Learning/Data Science, it is very important to understand data distribution graphs and analyse them using the concepts discussed here and in the previous part. These concepts are used early in the life cycle, during Data Exploration (before model building), and again after model building, while deploying/serving the model in production.
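As a rough illustration of such monitoring, here is a hypothetical moment_report helper (the function name and synthetic data are my own, for illustration) that compares all four moments of a training sample against a serving sample:

```python
import numpy as np
from scipy import stats

def moment_report(reference, current):
    """Compare the four moments of a reference (training) sample
    against a current (production) sample of the same feature."""
    for name, fn in [
        ("mean", np.mean),
        ("variance", lambda a: np.var(a, ddof=1)),
        ("skewness", stats.skew),
        ("excess kurtosis", stats.kurtosis),
    ]:
        print(f"{name:>15}: train={fn(reference):8.3f} "
              f"serve={fn(current):8.3f}")

# Example with synthetic data standing in for one model feature.
rng = np.random.default_rng(3)
moment_report(rng.normal(0, 1, 5_000), rng.exponential(1, 5_000))
```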

Thanks for reading. I hope the above concepts help you in your Machine Learning/Data Science learning journey. In future posts, I will try to explain other useful practical concepts. Keep reading and learning!
P.S.: Please give claps if this blog helped you. Also, follow me
:)
