Statistics at a Glance

Published in Analytics Vidhya, Apr 6, 2020

Statistics is at the heart of every machine learning model. We therefore need to know its basic terminology in order to understand our data and to analyse and manipulate it effectively.

So, naturally a question arises: What can we learn from looking at a group of numbers?

In machine learning (and in mathematics), three values often interest us: the mean, the median and the mode.

Mean, Median and Mode

Let us take a dataset instance:

speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]

Mean: the average value of the dataset. To calculate it, add up all the values and divide the sum by the number of values in the dataset. NumPy's mean() function does this for us. Example:

import numpy as np
speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]
x = np.mean(speed)
print(x)
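The same result can be reproduced from the definition without NumPy. This is a minimal sketch in plain Python; the variable names total and count are only illustrative.

```python
speed = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]

total = sum(speed)    # sum of all values
count = len(speed)    # number of values in the dataset
mean = total / count  # the mean is the sum divided by the count

print(total, count)    # 1167 13
print(round(mean, 2))  # 89.77
```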

Median: the middle value of the dataset once all the values are sorted (in either ascending or descending order). Note: if there are two numbers in the middle, add them and divide the sum by two to get the median. NumPy's median() function finds it for us. Example:

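import numpy as np
# 111 is left out here, so the list has 12 values and two numbers sit in the middle
speed = [99,86,87,88,86,103,87,94,78,77,85,86]
x = np.median(speed)
print(x)  # 86.5, the average of the two middle values 86 and 87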
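To make the sorting step and the "two numbers in the middle" rule concrete, here is a small sketch in plain Python. The helper name median is our own and is not part of NumPy.

```python
def median(values):
    """Return the middle value of a list of numbers (illustrative helper)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd number of values: there is a single middle value.
        return ordered[mid]
    # Even number of values: average the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2

odd = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]  # 13 values
even = [99, 86, 87, 88, 86, 103, 87, 94, 78, 77, 85, 86]      # 12 values

print(median(odd))   # 87
print(median(even))  # 86.5
```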

Mode: the most frequently occurring value in the dataset, that is, the value that appears the greatest number of times. For example, in the set of values

99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86

the mode is 86, because it appears 3 times.

We can also find the mode with the SciPy module. Example:

from scipy import stats
speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]
x = stats.mode(speed)
print(x)

The mode() function returns a ModeResult object that contains the mode (86) and its count (3), that is, the number of times the mode appears.
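If SciPy is not available, the counting can be sketched with the standard library instead. This example uses collections.Counter; the names mode_value and mode_count are illustrative. Note that when several values tie for the highest count, most_common(1) reports only one of them.

```python
from collections import Counter

speed = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]

counts = Counter(speed)  # maps each value to how often it appears
mode_value, mode_count = counts.most_common(1)[0]

print(mode_value, mode_count)  # 86 3
```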

Does all data have a mean, median and mode?

Yes and no. Continuous data has a mean, a median and a mode. Strictly speaking, however, ordinal data has only a median and a mode, and nominal data has only a mode. That said, statisticians have not reached a consensus on whether the mean can be used with ordinal data, and research papers often report a mean for Likert data.
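A small sketch may help to show the difference between the data types. The colours and survey answers below are made-up example data, and ranking the ordinal levels by their position in a list is just one simple way to find the middle answer.

```python
import statistics

# Nominal data (hypothetical car colours): only the mode is meaningful.
colours = ["red", "blue", "red", "green", "red", "blue"]
mode_colour = statistics.mode(colours)
print(mode_colour)  # red

# Ordinal data (hypothetical survey answers): the levels have an order,
# so we can rank the answers and pick the middle one.
levels = ["low", "medium", "high"]
answers = ["low", "high", "medium", "medium", "high"]
ranks = sorted(levels.index(a) for a in answers)
median_answer = levels[statistics.median_low(ranks)]
print(median_answer)  # medium
```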

Standard deviation

Standard deviation is a number that describes how spread out the values are.

A low standard deviation means that most of the numbers are close to the mean (average) value.

A high standard deviation means that the values are spread out over a wider range.

Example: This time we have registered the speed of 7 cars:

speed = [86,87,88,86,87,85,86]

The standard deviation is: 0.9

This means that most of the values lie within 0.9 of the mean value, which is 86.4.

Let us do the same with a selection of numbers with a wider range:

speed = [32,111,138,28,59,77,97]

The standard deviation is: 37.85

This means that most of the values lie within 37.85 of the mean value, which is 77.4.

As you can see, a higher standard deviation indicates that the values are spread out over a wider range.
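The figures above can be reproduced with a short sketch. It uses statistics.pstdev from the standard library, which computes the population standard deviation (dividing by the number of values); this is an assumption about how the quoted numbers were obtained, but it matches them. NumPy's np.std() uses the same population formula by default. The names narrow and wide are illustrative.

```python
import statistics

narrow = [86, 87, 88, 86, 87, 85, 86]
wide = [32, 111, 138, 28, 59, 77, 97]

narrow_mean = statistics.mean(narrow)
narrow_std = statistics.pstdev(narrow)  # population standard deviation
wide_mean = statistics.mean(wide)
wide_std = statistics.pstdev(wide)

print(round(narrow_mean, 1), round(narrow_std, 2))  # 86.4 0.9
print(round(wide_mean, 1), round(wide_std, 2))      # 77.4 37.85
```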

Variance

Variance is the expectation of the squared deviation of a random variable from its mean. In other words, it measures how far a set of numbers is spread out from its average value. The standard deviation is the square root of the variance.

Uses: Analysis of variance (ANOVA) uses variance to assess whether the means of two or more groups differ, by comparing the variation between the groups with the variation within them.
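As a rough illustration of how ANOVA relies on variance, the sketch below computes the one-way ANOVA F statistic: the variation between group means divided by the variation within the groups. The helper f_statistic and the three groups are made up for this example. In practice, scipy.stats.f_oneway performs this test and also returns a p-value.

```python
def f_statistic(groups):
    """One-way ANOVA F statistic for a list of groups (illustrative helper)."""
    all_values = [x for g in groups for x in g]
    grand_mean = sum(all_values) / len(all_values)
    means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the values around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]  # hypothetical measurements
f = f_statistic(groups)
print(round(f, 2))  # 13.0
```

A large F value suggests that the group means differ by more than the spread within the groups would explain.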

Covariance

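Covariance measures how two random variables vary together. A positive value means they tend to increase together, and a negative value means one tends to decrease as the other increases. Its magnitude depends on the units of the variables, so on its own it does not tell us how strong the relationship is.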
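A minimal sketch of the calculation, assuming the sample covariance (dividing by n - 1, which is also the default of NumPy's np.cov()). The helper covariance and the hours and score lists are hypothetical example data.

```python
def covariance(xs, ys):
    """Sample covariance of two equally long lists (illustrative helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

hours = [1, 2, 3, 4, 5]
score = [2, 4, 6, 8, 10]

print(covariance(hours, score))                    # 5.0   (rise together)
print(covariance(hours, score[::-1]))              # -5.0  (one rises, one falls)
print(covariance(hours, [s * 100 for s in score])) # 500.0 (same data, other units)
```

The last line shows that rescaling one variable changes the covariance even though the relationship between the two is unchanged.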

Correlation

The most familiar measure of dependence between two quantities is the Pearson product-moment correlation coefficient, or “Pearson’s correlation coefficient”, commonly called simply “the correlation coefficient”. It is obtained by dividing the covariance of the two variables by the product of their standard deviations.
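The formula can be sketched directly in plain Python: compute the covariance, then divide by the product of the two standard deviations. The helper pearson and the data are illustrative; NumPy's np.corrcoef() offers a ready-made way to get the same coefficient.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists (illustrative helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    return cov / (std_x * std_y)

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

print(round(pearson(x, y), 6))        # 1.0  (perfect positive relationship)
print(round(pearson(x, y[::-1]), 6))  # -1.0 (perfect negative relationship)
```

Unlike the covariance, the result always lies between -1 and 1 and does not depend on the units of the variables.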