Statistics in Machine Learning

Sameer Kumar
8 min read · Sep 12, 2020


Introduction

“If you torture the data long enough, it will confess to absolutely anything.”

— Ronald Coase, British economist

Data plays a huge role in today’s world, as it is the main ingredient across fields like finance, healthcare, and sports. Since the breakthrough of the Internet several years back, tons of data are generated every day and used by data analysts across fields to find patterns and arrive at meaningful conclusions.

Statistics plays a huge role in data analysis, as it provides a set of mathematical tools that help us make observations on a particular data set and arrive at a solution.

So let us see the importance of statistics while creating an end-to-end machine learning project and how the concepts are actually applied there. Let me first introduce some terminology:

Population Mean and Sample Mean

Population VS Sample

Suppose we have a data set of the salaries of the people of India, and our task is to find the average salary across the entire data set.

Population refers to the entire group of objects from which we have to derive meaningful conclusions by performing mathematical operations. Here the population refers to the people of India, so in this case the average salary of all the people is the population mean. Since a population data set contains a large number of data points, it gets difficult to perform operations on that kind of data.

So we extract a few data points from the original data set and perform the operations only on that subset. Say we have 10,000 original data points; we take 1,000 data points and perform operations only on those. That extracted set is called a Sample, and the mean of that set is called the sample mean, which is used to estimate characteristics of the whole population.

This process of selecting subsets from the population is called Sampling.
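As a quick illustration, here is a sketch with simulated salaries (the figures are made up, since the original data set is hypothetical) showing that the mean of a random sample tracks the population mean:

```python
import random

# Hypothetical population: 10,000 simulated "salaries".
random.seed(0)
population = [random.gauss(50_000, 12_000) for _ in range(10_000)]

# Sampling: draw 1,000 data points without replacement.
sample = random.sample(population, 1_000)

population_mean = sum(population) / len(population)
sample_mean = sum(sample) / len(sample)

# The sample mean is an estimate of the population mean.
print(round(population_mean), round(sample_mean))
```

The two numbers printed are close to each other, which is exactly why working on a sample is a practical stand-in for the full population.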

Random Variables and their types

A random variable is simply a quantity recorded in our data set. Say we have to predict the selling price of a car; then quantities like kilometres driven, age in years, and fuel type (petrol/diesel) are called features, random variables, or independent variables, and the quantity we have to predict is called the label, which is our dependent variable.

Feature and Label

There are basically two types of Random Variable:

  1. Discrete Random Variable: a variable that takes only whole-number (countable) values, e.g. number of bank accounts, number of siblings.
  2. Continuous Random Variable: a variable that lies within a range of values and can take any value in between, e.g. height, weight.
Type of data

Some basic mathematical operations on a data set

  1. Mean: the average of all the data points in a distribution. Suppose we have a data set of the heights of ten people and the mean comes out to be 161; this 161 is a measure of central tendency, which suggests that most of the data points lie around this value.
Mean

The mean actually plays an important role during feature engineering, as it can be used to replace the NaN values in a column with the mean of that entire column or feature. This is one of the most common applications of the mean in an ML use case.
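For example, with pandas (using a small hypothetical column of heights), mean imputation is a one-liner:

```python
import pandas as pd

# Hypothetical data set with one missing height (NaN).
df = pd.DataFrame({"height": [160.0, 158.0, None, 165.0, 161.0]})

# Replace the NaN with the mean of the column (mean imputation).
df["height"] = df["height"].fillna(df["height"].mean())

print(df["height"].tolist())  # → [160.0, 158.0, 161.0, 165.0, 161.0]
```

The mean of the four observed values is 161.0, so that is what fills the gap.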

2) Variance: Variance measures the spread of the data set and indicates how the data points vary with respect to the mean of the sample.

3) Standard Deviation: Standard deviation is the square root of variance; both quantities describe the distribution of the data and how each element is located with respect to the mean.

Standard Deviation
Variance
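As a quick sketch with NumPy (the heights below are made up), all three quantities can be computed directly, and the square-root relationship can be checked:

```python
import numpy as np

# Hypothetical heights of ten people (in cm).
heights = np.array([150, 155, 160, 161, 162, 165, 170, 158, 163, 166],
                   dtype=float)

mean = heights.mean()
variance = heights.var(ddof=1)  # sample variance (divides by n - 1)
std_dev = heights.std(ddof=1)   # sample standard deviation

# Standard deviation is the square root of variance.
print(mean, variance, std_dev)  # mean → 161.0
```

Note the `ddof=1` argument: NumPy divides by n by default (population variance), while `ddof=1` gives the sample variance discussed here.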

Gaussian Distribution/Normal Distribution

A random variable is said to follow the Gaussian/normal distribution when its values, plotted on a graph, form a bell-shaped curve that is fully described by its mean and standard deviation. The distribution is symmetric about the mean, showing that data near the mean are more frequent than data far from it.

Gaussian Distribution

So a random variable X with a given mean and standard deviation that follows the Gaussian distribution satisfies three properties known as the empirical rule:

  1. P(mean − 1 std dev < x < mean + 1 std dev) ≈ 68%

This states that the probability of a data point lying within one standard deviation of the mean is about 68.2%. It is also visible in the above graph that around 68% of the mass lies in this range, split across the positive and negative regions.

2. P(mean − 2 std dev < x < mean + 2 std dev) ≈ 95%

This states that the probability of a data point lying within two standard deviations of the mean is about 95%, which is also visible in the above graph.

3. P(mean − 3 std dev < x < mean + 3 std dev) ≈ 99.7%

This states that the probability of a data point lying within three standard deviations of the mean is about 99.7%, which is again visible in the above graph.
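The empirical rule is easy to check numerically: the sketch below draws a large simulated normal sample and measures how much of it falls within one, two, and three standard deviations of the mean.

```python
import numpy as np

# Simulate a large standard-normal sample.
rng = np.random.default_rng(42)
x = rng.normal(loc=0.0, scale=1.0, size=100_000)

mean, std = x.mean(), x.std()

# Fraction of points within k standard deviations of the mean.
for k, expected in [(1, 0.682), (2, 0.954), (3, 0.997)]:
    frac = np.mean(np.abs(x - mean) < k * std)
    print(f"within {k} std dev: {frac:.3f} (empirical rule: ~{expected})")
```

With 100,000 draws the measured fractions land very close to 68%, 95%, and 99.7%.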

Chebyshev Inequality

In the previous case, our random variable followed the Gaussian distribution and thus obeyed those three properties. If our random variable does not follow the Gaussian distribution, we use the Chebyshev inequality.

The Chebyshev inequality states that the probability of finding points of the random variable within k standard deviations of the mean is always greater than 1 − 1/k².

Let m be the mean, s the standard deviation, and k the number of standard deviations for which we have to find the probability of data points being present, with x belonging to the random variable X.

P(m − ks < x < m + ks) > 1 − 1/k²

Putting k = 2, we get P > 3/4, or 75%.

It means that even when my random variable does not follow the Gaussian distribution, more than 75% of its data points will lie within two standard deviations of the mean.
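The bound itself is a one-liner; a small sketch:

```python
def chebyshev_bound(k: float) -> float:
    """Chebyshev lower bound on the probability mass within
    k standard deviations of the mean (for any distribution)."""
    return 1 - 1 / (k * k)

print(chebyshev_bound(2))  # → 0.75: at least 75% within 2 std devs
print(chebyshev_bound(3))  # ≈ 0.889: at least ~89% within 3 std devs
```

Compare this with the Gaussian case above: 75% (Chebyshev, any distribution) is a much weaker guarantee than 95% (normal distribution), because it has to hold no matter how the data is shaped.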

This concept helps in detecting outliers present in our data set. Outliers are data points that are distant from all other observations.

Covariance

Covariance is one of the most important topics in data pre-processing, as it helps to find the relationship between the features present in our data set.

Covariance helps to quantify that relationship: it gives the direction of the relationship between two features, i.e. whether they are directly proportional or inversely proportional.

Consider the size and price of a shoe as two features: when the size increases, the price increases too, so they are directly proportional. Here we were able to establish the direction of the relationship, but covariance fails to tell us its strength.

The main disadvantage of covariance is that it does not tell the strength of the relationship, so there is the concept of the correlation coefficient, which overcomes this problem.

Pearson Correlation

Along with determining the direction of the relationship between any two features, the Pearson correlation coefficient also determines the strength of that relationship: it is the covariance of the two features divided by the product of their standard deviations.

PCC

The value of r ranges between −1 and 1, where a value of 1 indicates a perfectly positive correlation and a value of −1 indicates a perfectly negative correlation.
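A short NumPy sketch, using made-up shoe sizes and prices as in the example above, shows covariance giving the direction and Pearson's r adding the strength:

```python
import numpy as np

# Hypothetical shoe sizes and prices.
size = np.array([6, 7, 8, 9, 10], dtype=float)
price = np.array([40, 45, 52, 60, 68], dtype=float)

cov = np.cov(size, price)[0, 1]       # sample covariance (off-diagonal entry)
r = np.corrcoef(size, price)[0, 1]    # Pearson correlation coefficient

# r = cov / (std_size * std_price), always between -1 and 1.
print(cov, r)
```

Here the covariance is positive (direction: directly proportional) and r comes out very close to 1, telling us the relationship is not just positive but also nearly perfectly linear.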

Covariance and Pearson correlation have huge importance in feature engineering. If two features in our data set are very highly correlated with each other, we can drop one of them, as both carry essentially the same information; this helps mitigate the curse of dimensionality. This process is part of Feature Selection.

Identifying outliers in data set

An outlier is a data point that is distant from all other observations and lies outside the overall distribution of the data set.

outlier

In the above graph, the red dot, which is far away from all other observations, is considered an outlier.

Impact of outlier

The presence of an outlier in a data set has a significant impact on the mean and standard deviation. It is for this reason that, when outliers are present, using the mean to replace missing values in a numerical data set is not recommended.

But there are a few ML use cases where outliers play a crucial role. In credit card fraud detection, for example, the fraudulent transactions are the rare ones; they act as outliers and are therefore considered important.

One technique to identify outliers is called the Z-score method.

Z score

The Z-score tells us whether a particular data point lies within three standard deviations of the mean of a standard normal distribution.

We calculate the Z-score for each data point in the data set; if the value lies beyond three standard deviations, the point is considered an outlier, and if it lies within three standard deviations, it is not.

A lower Z-score means the point is closer to the mean, and a higher Z-score means the point is located farther from the mean.

A positive Z-score means the value is on the right side of the mean, and a negative Z-score means the point is on the left side of the mean.

A Z-score of 0 indicates that the data point is identical to the mean, whereas a Z-score of 1 indicates that the data point lies one standard deviation from the mean.

Here is some Python code showing one way to find outliers in a random sample:
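A minimal sketch of such a check, using NumPy on a hypothetical 1-D sample (the function name and threshold are illustrative choices):

```python
import numpy as np

def find_outliers(data, threshold=3.0):
    """Return the values whose |Z-score| exceeds the threshold."""
    data = np.asarray(data, dtype=float)
    z = (data - data.mean()) / data.std()  # Z-score of each point
    return data[np.abs(z) > threshold].tolist()

# Hypothetical sample with one extreme value.
sample = [10, 12, 12, 13, 12, 11, 14, 13, 15, 10, 12, 14, 13, 100]
print(find_outliers(sample))  # → [100.0]
```

The extreme value 100 sits more than three standard deviations from the mean, so it is flagged, while every other point has a |Z-score| well below 1.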

Conclusion

So we have seen how statistics plays a huge role in any machine learning use case. Steps like feature selection and feature engineering depend a lot on these statistical computations, as they help to identify outliers, replace missing values in the data set, and so on. It is fascinating to see the real-life application of mathematics and statistics in the field of data science, and the importance of data in today’s world. Looking forward to continuing to learn and explore the field of machine learning!


Sameer Kumar

AI Research intern at SCAAI || Kaggle 2x Expert || Machine Learning || Deep Learning || NLP || Python