Descriptive Statistics

Published in

Probablity and Statistics for Data Science.

5 min read · Aug 28, 2018

Statistics can be defined as the branch of mathematics that deals with the collection, analysis, interpretation, and presentation of data.
It provides methods for drawing conclusions from a given set of data.

Descriptive statistics, as the name suggests, are the statistics used to describe data. They organize the data so that it can be easily understood and so that patterns can emerge. No probability theory is involved in descriptive statistics.

Types

Descriptive statistics are generally divided into two categories: measures of central tendency and measures of spread (dispersion).

Measure of Central Tendency

A measure of central tendency is a single value used to describe the data: a value that lies at the center of the data.
The mean, median, and mode are the measures of central tendency.

Mean (Arithmetic)

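The mean is equal to the sum of all numeric values in the data divided by the total number of values. It is the value around which the data is spread. It can be used with both discrete and continuous data.
Let there be n values in a data set, x1, x2, …, xn. The sample mean, usually denoted by $\bar{x}$, is:

$$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

To indicate that we are calculating the population mean and not the sample mean, we use “mu”, denoted as µ:

$$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$$

where N is the number of values in the population.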
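As a quick illustration of the definition, here is a minimal Python sketch that computes the mean by hand and with the standard library. The values in `data` are made up for this example and do not come from any real data set.

```python
from statistics import mean

# Hypothetical values, chosen only for illustration.
data = [2, 4, 4, 5, 10]

# Sum of all values divided by the number of values.
manual_mean = sum(data) / len(data)

# The standard library's statistics.mean should agree.
library_mean = mean(data)

print(manual_mean, library_mean)
```

The arithmetic is the same whether the data are treated as a sample or as a whole population; only the notation changes.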

Mode

The mode is the value that occurs most often in the data set, that is, the term with the highest frequency.
It is mostly used for categorical data, to find which category occurs most often.

If two values appear more often than the rest, the data set is bimodal. If three values appear more often than the rest, the data set is trimodal. In general, a data set with several modes is called multimodal.
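The idea can be sketched in Python with `statistics.multimode` (available from Python 3.8), which returns every value that shares the highest frequency. The lists `colors` and `scores` are hypothetical examples.

```python
from statistics import multimode

# Hypothetical categorical data with a single mode.
colors = ["red", "blue", "red", "green", "blue", "red"]
color_modes = multimode(colors)

# Hypothetical numeric data where two values tie for the highest frequency.
scores = [1, 2, 2, 3, 3, 4]
score_modes = multimode(scores)

print(color_modes)  # one mode: unimodal
print(score_modes)  # two modes: bimodal
```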

Median

The median is the middle value of a data set sorted in either ascending or descending order.
It is the value that divides the data set into two equal parts.
If the number of terms is odd, the median is the middle term; if the number of terms is even, it is the average of the two middle terms.
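The odd and even cases can be sketched as follows. The helper `median_by_hand` and the sample lists are hypothetical names introduced for this example; `statistics.median` is used only as a cross-check.

```python
from statistics import median

def median_by_hand(values):
    """Return the median following the odd/even rule described above."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd number of terms: take the single middle term.
        return ordered[mid]
    # Even number of terms: average the two middle terms.
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_data = [7, 1, 5, 3, 9]   # sorted: 1, 3, 5, 7, 9
even_data = [7, 1, 5, 3]     # sorted: 1, 3, 5, 7

print(median_by_hand(odd_data), median(odd_data))
print(median_by_hand(even_data), median(even_data))
```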

Measure of Spread / Dispersion

A measure of spread describes the variability within the data set. The most popular measures of variability are the range, variance, standard deviation, and interquartile range (IQR).

Range

The range is the difference between the largest and smallest values of a data set.

Note
The range can be misleading, because it depends only on the two extreme values and is therefore very sensitive to outliers.
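A small sketch of the range, and of one way it can mislead: a single extreme value changes it drastically. The function name `value_range` and both lists are hypothetical, made up for illustration.

```python
def value_range(values):
    """Difference between the largest and smallest value."""
    return max(values) - min(values)

# Hypothetical measurements that sit close together.
typical = [10, 12, 11, 13, 12]

# The same measurements with one extreme value added.
with_outlier = typical + [95]

print(value_range(typical))
print(value_range(with_outlier))
```

Most of the values are unchanged in the second list, yet the range grows many times over, which is why the range is usually reported alongside other measures of spread.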

Variance

Variance measures how far the data are spread around the mean. It is calculated as the sum of the squared differences between each data point and the mean, divided by the total number of data points in the data set.

The problem with variance is that, because of the squaring, it is not in the same unit of measurement as the original data.
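The calculation described above can be sketched step by step in Python. This follows the text and divides by the total number of data points (the population variance); note that `statistics.variance` instead divides by n − 1, so `statistics.pvariance` is the matching library function. The values in `data` are hypothetical.

```python
from statistics import pvariance

# Hypothetical values, chosen so the arithmetic is easy to follow.
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Step 1: the mean of the data.
data_mean = sum(data) / len(data)

# Step 2: squared difference between each data point and the mean.
squared_diffs = [(x - data_mean) ** 2 for x in data]

# Step 3: divide the sum of squared differences by the number of points.
manual_variance = sum(squared_diffs) / len(data)

print(manual_variance, pvariance(data))
```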

Standard Deviation

Standard deviation is a measure of the dispersion of observations within a data set. It is the root mean square deviation from the mean.
Standard deviation is the square root of the variance, so it is in the same unit of measurement as the original data, which makes the spread easier to interpret.
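As a minimal sketch, the standard deviation can be obtained by taking the square root of the variance. This assumes the population form (dividing by the number of data points), for which `statistics.pstdev` is the matching library function. The values in `data` are hypothetical.

```python
import math
from statistics import pstdev

# Hypothetical values, chosen only for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]

data_mean = sum(data) / len(data)
variance = sum((x - data_mean) ** 2 for x in data) / len(data)

# Standard deviation is the square root of the variance,
# which brings the result back to the units of the data.
std_dev = math.sqrt(variance)

print(std_dev, pstdev(data))
```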

Interquartile Range

The interquartile range (IQR) is a measure of statistical dispersion equal to the difference between the 3rd quartile (75th percentile) and the 1st quartile (25th percentile) of the data set.
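A rough sketch of the IQR using `statistics.quantiles` (Python 3.8+). Several conventions exist for computing quartiles and they can give slightly different answers; the `method="inclusive"` choice here is an assumption made for this example, not something the article specifies. The values in `data` are hypothetical.

```python
from statistics import quantiles

# Hypothetical values, chosen only for illustration.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# n=4 splits the data into quartiles and returns the three cut points.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")

# IQR is the distance between the 3rd and 1st quartiles.
iqr = q3 - q1

print(q1, q3, iqr)
```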

Skewness

Skewness is a measure of the lack of symmetry. A distribution, or data set, is symmetric if it looks the same to the left and right of the center point.
The skewness value can be positive, zero, negative, or undefined.

For univariate data Y1, Y2, …, YN, the formula for skewness is:

$$g_1 = \frac{\sum_{i=1}^{N} (Y_i - \bar{Y})^3 / N}{s^3}$$

where $\bar{Y}$ is the mean, s is the standard deviation, and N is the number of data points.

The skewness for a normal distribution is zero, and any symmetric data should have a skewness near zero. Negative values for the skewness indicate data that are skewed left (left tail is long relative to the right tail) and positive values for the skewness indicate data that are skewed right (the right tail is long relative to the left tail).
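The sign behaviour described above can be checked with a small sketch. It assumes the moment-based form of skewness (the average cubed deviation divided by the cube of the standard deviation, with the standard deviation computed using N in the denominator); other software may apply a sample-size correction. The function name and the data lists are hypothetical.

```python
import math

def skewness(values):
    """Moment-based skewness: mean cubed deviation divided by s cubed."""
    n = len(values)
    mean = sum(values) / n
    # Standard deviation with N in the denominator (an assumption of this sketch).
    s = math.sqrt(sum((y - mean) ** 2 for y in values) / n)
    third_moment = sum((y - mean) ** 3 for y in values) / n
    return third_moment / s ** 3

symmetric = [1, 2, 3, 4, 5]            # same shape left and right of the center
right_skewed = [1, 1, 2, 2, 3, 10]     # long right tail
left_skewed = [-v for v in right_skewed]  # mirror image: long left tail

print(skewness(symmetric))
print(skewness(right_skewed))
print(skewness(left_skewed))
```

Mirroring the data flips the sign of the skewness without changing its magnitude.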

Kurtosis

Kurtosis is a measure of the degree of peakedness or flatness of a distribution. It indicates whether the data are heavy-tailed (more outliers) or light-tailed (lack of outliers) relative to a normal distribution. One common form of the formula is:

$$\text{kurtosis} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^4 / n}{s^4}$$

where n is the sample size, $X_i$ is the ith X value, $\bar{X}$ is the average, and s is the sample standard deviation.

There are three types of kurtosis:

The normal curve is called a mesokurtic curve. If the curve of a distribution is more peaked than a normal (mesokurtic) curve, it is referred to as a leptokurtic curve. If a curve is less peaked than a normal curve, it is called a platykurtic curve.
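The three types can be illustrated with a short sketch. It assumes the moment-based definition of kurtosis (the average fourth-power deviation divided by the squared variance, with n in the denominator), under which a normal distribution has a kurtosis of 3; some libraries report "excess" kurtosis, which subtracts 3. The function names, the tolerance, and the data lists are hypothetical.

```python
def kurtosis(values):
    """Moment-based kurtosis: mean fourth-power deviation over variance squared."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / n
    fourth_moment = sum((x - mean) ** 4 for x in values) / n
    return fourth_moment / variance ** 2

def kurtosis_type(value, tolerance=0.1):
    """Classify against the normal-distribution reference value of 3."""
    if value > 3 + tolerance:
        return "leptokurtic"
    if value < 3 - tolerance:
        return "platykurtic"
    return "mesokurtic"

# Hypothetical evenly spread data: no heavy tails.
flat = [1, 2, 3, 4, 5]

# Hypothetical data concentrated at the center with two extreme values.
heavy_tailed = [0, 0, 0, 0, 0, 0, 0, 0, -10, 10]

print(kurtosis(flat), kurtosis_type(kurtosis(flat)))
print(kurtosis(heavy_tailed), kurtosis_type(kurtosis(heavy_tailed)))
```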

Thanks for reading.

If you liked this post, give it some claps for motivation. You can also share it on Facebook, Twitter, or LinkedIn, so that someone who needs it can come across it.

You can reach me at: linkedin.com/in/anant-jaiswal-b0a151129/