Intern Diaries: Statistics

Tanya Gupta
Published in Analytics Vidhya
Jan 6, 2021 · 9 min read

In this blog we learn to deal with numbers using statistics. It serves as a quick refresher on concepts already learnt in high-school mathematics and also introduces new topics in the context of Machine Learning.


Why do we need to study statistics?

In this day and age, we have a plethora of data that needs to be processed and understood. But how does one even start performing operations on a dataset? It starts when one has enough information about the given data to analyze and process it. This is where statistics comes in. Statistics is the practice of collecting large amounts of information and inferring useful details from it for later use. Below are some of the applications of statistics in machine learning:

  • cleaning and preparing the dataset
  • aiding in model selection
  • extracting information from a set of observations and drawing inferences from them
  • finding correlations between different variables in the data

Quick refresher for basic topics

  • Percentiles: If a number in a sequence falls at the nth percentile, that number is greater than n% of the numbers in that sequence. We find a percentile by first sorting the sequence in ascending order.

Then, we calculate it using the formula n = (P/100) · N,

where n is the rank in the sorted sequence, P is the percentile to be calculated and N is the number of values in the dataset or sequence.
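As a quick sanity check, here is a minimal sketch (the values in `data` are made up purely for illustration) that applies the rank formula and compares it with NumPy's built-in np.percentile():

```python
import numpy as np

data = sorted([15, 20, 35, 40, 50])   # step 1: sort the sequence in ascending order
P = 40                                # the percentile we want

# rank formula: n = (P / 100) * N
N = len(data)
rank = (P / 100) * N                  # 2.0 -> roughly the 2nd value of the sorted list

# NumPy computes percentiles directly (it interpolates between neighbouring ranks,
# so its answer can differ slightly from the simple rank formula)
print(rank, np.percentile(data, P))
```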

  • Quartiles: a common kind of quantile; the 25th, 50th, 75th and 100th percentiles of your dataset, which split the sorted data into four equal parts.
  • Mean: It’s the sum of values divided by the number of values of the dataset.
  • Median: In some cases the measure of central tendency given by the mean gets pulled away a lot, which leads to wrong inferences about the data. To curb this, we sort the given data and pick the middle value (or the average of the two middle values if the number of values is even). This middle term is our median, and it reduces the effect of outliers. For example, in [1, 2, 5, 6, 45] the value 45 is the outlier whose effect we want to reduce, so we choose 5 as our measure of central tendency.
  • Mode: The most frequent value of the dataset.
  • Standard deviation and Variance: Variance is calculated as,

var(x) = ∑ᵢ₌₁ⁿ (xᵢ − μₓ)(xᵢ − μₓ) / n

where n is the number of values in the dataset and μₓ is the mean of the data values. The standard deviation is simply the square root of the variance.
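A minimal sketch of all of these measures using NumPy and the standard-library statistics module (the sample values are made up, reusing the median example above):

```python
import numpy as np
from statistics import mode

data = [1, 2, 5, 6, 45]          # 45 is the outlier from the median example above

print(np.mean(data))             # (1 + 2 + 5 + 6 + 45) / 5 = 11.8
print(np.median(data))           # middle value of the sorted data = 5
print(mode([1, 2, 2, 5, 6]))     # most frequent value = 2 (mode needs repeated values to be meaningful)
print(np.var(data))              # population variance: sum((x - mean)^2) / n
print(np.std(data))              # standard deviation: square root of the variance
```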

Important Concepts

  • Population: It is a pool of similar items. The number of such items in a population is normally so large that making a dataset out of all of them becomes a tedious job. For example, if we want to collect data on the number of people affected by COVID-19 in a certain state, then the people of that state become the population.
  • Sample: Similar to a population, it is a set of similar items, but much smaller in size. Thus, a sample is a subset of the population. Being smaller, it is easier for us to collect and compute on. For the previous example, we could take one or two random areas of that state as the sample data.
  • Random Variable: It denotes an outcome of some random event. It can be of two types: discrete and continuous. A discrete random variable can take only countable values, such as whole numbers; for example, a random person has a whole number of children. A continuous random variable can take any real value in a given range; for example, a random house has a price that is a real number.

Gaussian or Normal Distribution:

(Figure: bell-shaped normal distribution curve. Source: hyperphysics.phy-astr.gsu.edu)

It is a bell-shaped probability distribution curve for a continuous random variable, centered around the mean value. It gives us the probability density for the possible values of that random variable. The properties of a Gaussian distribution with mean μ and standard deviation σ include:

  1. 68% of the values assumed by the continuous random variable lie within one standard deviation of the mean (between μ − σ and μ + σ).
  2. 95% of the values lie within two standard deviations of the mean (between μ − 2σ and μ + 2σ).
  3. 99.7% of the values lie within three standard deviations of the mean (between μ − 3σ and μ + 3σ).
The function involved in the Gaussian distribution curve (its probability density) is f(x) = (1/(σ√(2π))) · e^(−(x − μ)² / (2σ²)).

This kind of curve can be seen in datasets containing height or blood pressure measurements, to name a few.
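As a rough check of the 68–95–99.7 rule above, here is a small sketch that draws simulated values from a normal distribution (the "height-like" parameters μ = 170 and σ = 10 are just an assumption for illustration) and counts how many fall within 1, 2 and 3 standard deviations of the mean:

```python
import numpy as np

rng = np.random.default_rng(0)
heights = rng.normal(loc=170, scale=10, size=100_000)   # simulated "heights", μ=170, σ=10

mu, sigma = heights.mean(), heights.std()
for k in (1, 2, 3):
    share = np.mean(np.abs(heights - mu) <= k * sigma)  # fraction within k standard deviations
    print(f"within {k} sigma: {share:.3f}")              # ~0.683, ~0.954, ~0.997
```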

Log-normal distribution curve:

(Figure: right-skewed log-normal distribution curve. Source: wiki.analytica.com)

It has a shape similar to the Gaussian distribution curve but it is right-skewed, i.e., its right tail gets longer and longer. For a continuous random variable, this distribution can be defined through the normal distribution of the log of all the values assumed by that variable, or:

given a random variable X, X is log-normally distributed if log(x) follows a normal distribution curve for some given μ and σ.

An example of such a curve can be observed in a dataset concerned with people's incomes. The logic behind this is that there are relatively few people with very low incomes; as the income increases, incomes that are neither low nor high (middle-class) become the most common. Then, as we increase the income further, the curve flattens out, because fewer and fewer people have an income that high.
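A small sketch of this definition (the parameters passed to rng.lognormal are arbitrary, chosen only to produce income-like, right-skewed values): taking the log of log-normal samples should give back a roughly normal distribution with the parameters we sampled with.

```python
import numpy as np

rng = np.random.default_rng(0)
incomes = rng.lognormal(mean=10, sigma=0.5, size=100_000)  # right-skewed, income-like values

log_incomes = np.log(incomes)
# by definition, the log of a log-normal variable is normally distributed,
# so its mean and standard deviation recover the parameters we sampled with
print(log_incomes.mean(), log_incomes.std())               # ≈ 10 and ≈ 0.5
```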

Standard Normal Distribution:

(Figure: standard normal distribution with μ = 0 and σ = 1. Source: spss-tutorials.com)

It is a normal distribution with μ = 0 and σ = 1. When we have different features with drastically different ranges of values, we scale them by converting their respective probability distributions to the standard normal distribution. Scaling features in this way often helps improve the accuracy of the model.
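A minimal sketch of this kind of scaling with plain NumPy (the feature values are made up; in practice a library scaler would typically be used instead):

```python
import numpy as np

feature = np.array([120.0, 135.0, 150.0, 180.0, 210.0])   # e.g. blood pressure readings

scaled = (feature - feature.mean()) / feature.std()        # z = (x - μ) / σ
print(scaled.mean(), scaled.std())                          # ≈ 0 (up to floating-point error) and 1
```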

Covariance:

It provides a measure through which we can gauge the relationship between two random variables. It can be defined mathematically as,

cov(x,y) = ∑ (xᵢ-μₓ)(yᵢ-μᵧ)/n

where μₓ and μᵧ are the means of the values assumed by the random variables x and y, and n is the number of data values. The sum runs from i = 1 to i = n.

Looking at the above formula, we can see some similarities between variance and covariance. While covariance deals with two random variables, variance deals with only one; thus, instead of (yᵢ−μᵧ) we have a second (xᵢ−μₓ) in the variance formula:

var(x) = cov(x, x) = ∑ (xᵢ−μₓ)(xᵢ−μₓ)/n
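A minimal sketch comparing the formula above with NumPy's np.cov() (the x and y values are made up for illustration; bias=True makes np.cov divide by n, matching the formula):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 6])

# population covariance straight from the formula: mean of (x - μx)(y - μy)
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))

# np.cov divides by n-1 by default; bias=True makes it divide by n, matching the formula
print(cov_xy, np.cov(x, y, bias=True)[0][1])
```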

Direction of covariance:

  • If cov(x, y) is positive, then x and y tend to move in the same direction: as one grows, so does the other.
  • If cov(x, y) is negative, then x and y tend to move in opposite directions.

Limitation: The covariance of x and y doesn't tell us the strength of the relationship between the random variables, since its magnitude depends on the scale of the data. It only tells us the relative direction.

Pearson Correlation Coefficient (ρ(x,y)):

In order to overcome the above limitation, the Pearson correlation coefficient was introduced. It tells us not only the direction of the relationship between two random variables, but also its strength. It is computed as:

ρ(x,y) = cov(x,y)/(σₓ . σᵧ)

where σₓ and σᵧ are the standard deviations of the respective variables. The value of ρ(x, y) always lies between −1 and +1 (both inclusive). Below we look at different values of ρ(x, y) and how to interpret them.

One way of finding the Pearson correlation coefficient is from its definition, using NumPy's np.cov() and np.std() functions; NumPy's corrcoef() function gives the same value directly.

corrcoef() returns a 2-by-2 matrix in which positions [0][0] and [1][1] hold the correlation of x with itself and of y with itself (always 1), while positions [0][1] and [1][0] hold the correlation between x and y. A small sketch of both approaches is given below.
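Here is a minimal sketch of both approaches on made-up x and y values (not the original data from the blog's figures):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 6])

# Pearson coefficient straight from its definition: cov(x, y) / (σx * σy)
cov_xy = np.cov(x, y, bias=True)[0][1]
rho = cov_xy / (np.std(x) * np.std(y))

# np.corrcoef returns the 2x2 matrix described above; [0][1] is the x-y correlation
print(rho, np.corrcoef(x, y)[0][1])
```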

Interpretation:

  • The closer the Pearson coefficient is to +1, the stronger the correlation: the variables are more or less directly proportional, and the points lie close to a straight line with positive slope.
  • A smaller positive coefficient means more randomness in the scatter: x and y still tend to increase together, but less consistently than in the previous case.
  • A coefficient close to −1 suggests that as x increases, y decreases; the points almost align along a line with negative slope, the variables are inversely proportional and there is very little randomness.
  • The closer the coefficient is to zero, the more randomness is seen in the graph, and there is no real linear correlation between the two random variables.

Spearman’s rank correlation coefficient:

The Pearson coefficient works well, but it still has a limitation: it only captures the extent of linear correlation between the two random variables. Even if the actual relationship in the data is strong, the Pearson coefficient may come out low when that relationship is not linear.

Spearman's rank coefficient deals with this problem. Instead of using the values of x and y directly, it applies the Pearson coefficient formula to the ranks of the x and y values:

ρ = cov(Rₓ , Rᵧ)/(σᵣₓ . σᵣᵧ)

where Rₓ and Rᵧ are the ranks of the elements of the random variables x and y, and σᵣₓ and σᵣᵧ are the standard deviations of the ranks of x and y.

This coefficient is used in heatmaps and exploratory data analysis.

Procedure to find Spearman’s rank coefficient:

  • First sort the column x in ascending order and assign a rank to each element in that column starting from 1.
  • Then we assign ranks to the corresponding y values, again starting from 1 (i.e., sort the dataframe on the "X" column and then rank both X and Y).
  • Then, for each row, we find the difference dᵢ between the rank of x and the rank of y and square it.
  • Find the sum of these squares, ∑(dᵢ)², and plug it into the formula given below:

ρ = 1 − (6 · ∑(dᵢ)²) / (n · (n² − 1))

The above shortcut formula is used only if all n ranks are distinct (no ties).

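Here is a minimal sketch of the whole procedure on a small made-up dataframe (not the blog's original table), compared against pandas' built-in Spearman correlation:

```python
import pandas as pd

df = pd.DataFrame({"X": [35, 23, 47, 17, 10],
                   "Y": [30, 33, 45, 23, 8]})    # small made-up example

df = df.sort_values("X")                         # step 1: sort by the "X" column
rx = df["X"].rank()                              # step 2: rank X and Y, starting from 1
ry = df["Y"].rank()

d_squared = ((rx - ry) ** 2).sum()               # steps 3-4: sum of squared rank differences
n = len(df)
rho = 1 - (6 * d_squared) / (n * (n ** 2 - 1))   # shortcut formula (all ranks distinct)

# pandas computes the same coefficient directly
print(rho, df["X"].corr(df["Y"], method="spearman"))
```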

Note: finding correlations helps us identify features that may depend on each other. If several such highly correlated features exist, all but one can be removed, since the rest provide little new information about the dataset.

Finding outliers in a dataset:


  • Outliers: these are data points that are far away from the other data points. Outliers can be caused by human error during data collection, or by genuine extreme variation in the dataset; sometimes they are even added as dummies to test detection methods. Outliers can cause unwanted shifts in the mean and standard deviation.
  • Z-scores: On a probability distribution graph, the z-score tells us how far a data point lies from the mean of the dataset, measured in standard deviations. If the z-score of a data point falls outside the third-standard-deviation range, it is considered an outlier. The z-score formula, which also converts the dataset into standard normal form, is:

Z = (xᵢ − μ)/σ for i = 1 to n, where n is the size of the dataset (a point with |Z| ≤ 3 is not treated as an outlier).

A sketch for finding outliers with z-scores is given after this list; there, limit is the number of standard deviations outside of which a data point is considered an outlier.
  • IQR: the interquartile range is the difference between the 75th and 25th percentiles of the dataset (how to calculate percentiles was covered in the refresher section). In this method, if a data point lies between q1 − 1.5·IQR (lower bound) and q3 + 1.5·IQR (upper bound), it is NOT an outlier. Here, q1 and q3 are the 25th and 75th percentiles respectively.
These lower and upper bounds are then checked in a loop for every data point: if a point lies between them it is fine, otherwise it is an outlier (see the sketch after this list).
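Here is a minimal sketch of both detection methods on a made-up list of values (with 108 planted as an obvious outlier); `limit` plays the role described above:

```python
import numpy as np

data = np.array([11, 12, 12, 13, 12, 11, 14, 13, 15, 12, 14, 13, 11, 12, 13, 108])

# --- z-score method ---
mu, sigma = data.mean(), data.std()
limit = 3                                                  # points beyond 3 standard deviations
z_outliers = [x for x in data if abs((x - mu) / sigma) > limit]

# --- IQR method ---
q1, q3 = np.percentile(data, 25), np.percentile(data, 75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = [x for x in data if x < lower or x > upper]

print(z_outliers, iqr_outliers)                            # both flag 108
```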

Other than these two methods, we can also spot outliers visually with scatter plots and box plots: the points lying far from the main cluster (or beyond the box plot's whiskers) are the outliers.

That’s it for today, thank you very much for reading my blog!!!
