Descriptive Statistics All In One Cheat Sheet (Part-2)

About quantiles, five-number summary, boxplot, covariance, correlation, skewness, kurtosis

Data Striver
7 min read · Mar 15, 2023

Introduction

This is a continuation of the first part. Here I discuss two groups of descriptive statistics measures: Measures of Position, which cover Quantiles, the Five Number Summary, and the Box Plot, and Measures of Shape, which cover Skewness and Kurtosis. I also discuss Covariance and Correlation.

Quantiles

Quantiles are cut points that divide a sorted set of numerical data into groups, with each group containing an equal (or as nearly equal as possible) number of observations.

There are several types of quantiles used in statistical analysis, including:

  • Quartiles: Divide the data into four equal parts, Q1 (25th percentile), Q2
    (50th percentile or median), and Q3 (75th percentile).
  • Deciles: Divide the data into ten equal parts, D1 (10th percentile), D2
    (20th percentile), …, D9 (90th percentile).
  • Percentiles: Divide the data into 100 equal parts, P1 (1st percentile), P2
    (2nd percentile), …, P99 (99th percentile).
  • Quintiles: Divide the data into five equal parts.

Things to remember while calculating these measures:

    1. Data should be sorted from low to high.
    2. You are basically finding the location of an observation.
    3. They are not necessarily actual values in the data.
    4. All other quantiles can be easily derived from percentiles.
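As a rough illustration of the last point, the sketch below derives quartiles, quintiles, and deciles alongside percentiles using Python's standard `statistics.quantiles`. The choice of `method="exclusive"` is an assumption made here because it places the p-th percentile at position p*(N+1)/100 and interpolates linearly; other tools may use slightly different conventions and give slightly different numbers.

```python
import statistics

# Ten sorted sample values (the dataset used in this article's percentile example).
data = [45, 53, 57, 60, 68, 72, 75, 80, 85, 90]

# method="exclusive" locates the p-th percentile at position p*(N+1)/100
# and interpolates linearly between the two neighbouring values.
percentiles = statistics.quantiles(data, n=100, method="exclusive")  # P1..P99
quartiles = statistics.quantiles(data, n=4, method="exclusive")      # Q1..Q3
quintiles = statistics.quantiles(data, n=5, method="exclusive")      # 4 cut points
deciles = statistics.quantiles(data, n=10, method="exclusive")       # D1..D9

# percentiles[k - 1] holds the k-th percentile.
print(quartiles)                         # [56.0, 70.0, 81.25]
print(quartiles[0] == percentiles[24])   # Q1 is the 25th percentile -> True
print(quintiles[0] == percentiles[19])   # first quintile is the 20th percentile -> True
print(deciles[4] == percentiles[49])     # D5 is the 50th percentile (median) -> True
```

Every quartile, quintile, and decile here is simply one of the 99 percentile cut points, which is why percentiles are the most general of these measures.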

Percentile:

A percentile is a statistical measure that represents the percentage of observations in a dataset that fall below a particular value. For example, the 75th percentile is the value below which 75% of the observations in the dataset fall.

Formula to calculate the percentile value:
PL = (p)*(N+1)/100

where:
• PL = Percentile location in the dataset i.e position in sorted data
• N = The total number of observations in the dataset
• p = Desired percentile

e.g. dataset: 45, 53, 57, 60, 68, 72, 75, 80, 85, 90 (already sorted)
PL(75) = 75*(10+1)/100 = 8.25

This means that the 75th percentile is the value that falls at the 8.25th position in the sorted data. Since this position is not a whole number, we take the average of the values at the 8th and 9th positions: 80 and 85.

Therefore, the 75th percentile of this dataset is (80 + 85)/2 = 82.5. (Linear interpolation, which weights the two values by the fractional part 0.25, gives 80 + 0.25*(85 - 80) = 81.25 instead.)
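The location formula can be sketched in a few lines of Python. The function names `percentile_location` and `percentile_value` are hypothetical helpers written for this illustration, and the `interpolate` flag is an added assumption that switches between simple averaging of the two neighbouring values and linear interpolation.

```python
def percentile_location(p, n):
    """1-based position of the p-th percentile among n sorted observations."""
    return p * (n + 1) / 100


def percentile_value(sorted_data, p, interpolate=False):
    """Value at the p-th percentile of already-sorted data.

    If the location is fractional, either average the two neighbouring
    values (default) or interpolate linearly between them.
    """
    n = len(sorted_data)
    loc = percentile_location(p, n)
    loc = min(max(loc, 1), n)      # keep the position inside the data
    lower = int(loc)               # whole part of the position
    frac = loc - lower             # fractional part of the position
    if frac == 0:
        return float(sorted_data[lower - 1])
    below, above = sorted_data[lower - 1], sorted_data[lower]
    if interpolate:
        return below + frac * (above - below)
    return (below + above) / 2


data = [45, 53, 57, 60, 68, 72, 75, 80, 85, 90]
print(percentile_location(75, len(data)))            # 8.25
print(percentile_value(data, 75))                    # 82.5  (average of 80 and 85)
print(percentile_value(data, 75, interpolate=True))  # 81.25
```

For this dataset the 8th and 9th sorted values are 80 and 85, so the two conventions differ only in how they combine those two numbers.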

Five Number Summary

The five-number summary is a descriptive statistic that provides a summary of a dataset. It consists of five values that mark off the sorted dataset into four equal parts; the three interior cut points are known as quartiles. The five-number summary includes the following values:

1. Minimum value: The smallest value in the dataset.

2. First quartile (Q1): The value that separates the lowest 25% of the data from the rest of the dataset.

3. Median (Q2): The value that separates the lowest 50% from the highest 50% of the data.

4. Third quartile (Q3): The value that separates the lowest 75% of the data from the highest 25% of the data.

5. Maximum value: The largest value in the dataset.

The five-number summary is often represented visually using a box plot, which displays the range of the dataset, the median, and the quartiles.
The five-number summary is a useful way to quickly summarize the central tendency, variability, and distribution of a dataset.
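A minimal sketch of computing the five values in Python follows. The helper name `five_number_summary` is hypothetical, and the quartiles are computed with `statistics.quantiles(..., method="exclusive")`, which is one of several quartile conventions, so other tools may report slightly different Q1 and Q3.

```python
import statistics


def five_number_summary(values):
    """Return min, Q1, median, Q3 and max of a list of numbers."""
    data = sorted(values)  # quantiles are defined on sorted data
    q1, median, q3 = statistics.quantiles(data, n=4, method="exclusive")
    return {"min": data[0], "q1": q1, "median": median, "q3": q3, "max": data[-1]}


# Unsorted input is fine: the helper sorts it first.
summary = five_number_summary([72, 45, 90, 60, 53, 85, 57, 80, 68, 75])
print(summary)
# {'min': 45, 'q1': 56.0, 'median': 70.0, 'q3': 81.25, 'max': 90}
```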

Box Plot:

It is an extension of the Five Number Summary, where Q1, Q2, and Q3 represent the 25th, 50th, and 75th percentiles respectively.

  • IQR stands for Interquartile Range and is calculated as the difference between Q3 and Q1.
  • “Minimum” here is not the minimum value in the dataset; rather, it is calculated as follows:
    Minimum = Q1 - 1.5*IQR
    If this value is not in the dataset, then the closest data value greater than it is taken as the minimum.
  • “Maximum” here is not the maximum value in the dataset; rather, it is calculated as follows:
    Maximum = Q3 + 1.5*IQR
    If this value is not in the dataset, then the closest data value smaller than it is taken as the maximum.
  • Values below the Minimum or above the Maximum in a box plot are considered outliers.
  • If the median line in a box plot is exactly in the middle of the box, it suggests that the middle half of the data is roughly symmetric. If the median line is not in the middle of the box, the distribution is skewed. If the median line is closer to the lower end of the box, the distribution is positively (right) skewed, and if it is closer to the upper end of the box, the distribution is negatively (left) skewed.
  • So a box plot helps to identify outliers, skewness, and the spread of the data.
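The box plot rules can be sketched as a small Python helper. The function name `boxplot_stats` and the eleven-value dataset (the earlier ten values plus an extra 150) are made up for illustration, and the quartile method is the same "exclusive" assumption as before. Note that the upper limit is computed from Q3, not Q1.

```python
import statistics


def boxplot_stats(values):
    """Quartiles, IQR, 1.5*IQR limits, whisker ends and outliers."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="exclusive")
    iqr = q3 - q1
    lower_limit = q1 - 1.5 * iqr   # the box plot "Minimum"
    upper_limit = q3 + 1.5 * iqr   # the box plot "Maximum"
    inside = [x for x in data if lower_limit <= x <= upper_limit]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "lower_limit": lower_limit, "upper_limit": upper_limit,
        # Whiskers end at the most extreme data values inside the limits.
        "whisker_low": inside[0], "whisker_high": inside[-1],
        "outliers": [x for x in data if x < lower_limit or x > upper_limit],
    }


stats = boxplot_stats([45, 53, 57, 60, 68, 72, 75, 80, 85, 90, 150])
print(stats["iqr"], stats["lower_limit"], stats["upper_limit"])  # 28.0 15.0 127.0
print(stats["whisker_low"], stats["whisker_high"])               # 45 90
print(stats["outliers"])                                         # [150]
```

Here the limits 15 and 127 are not data values, so the whiskers are drawn at 45 and 90, and 150 is flagged as an outlier.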

Covariance:

Covariance is a statistical measure that describes how two variables vary together. It answers the question: when one variable increases, does the other variable tend to increase as well, or does it tend to decrease?

If the covariance between two variables is positive, it means that the variables tend to move together in the same direction.
If the covariance is negative, it means that the variables tend to move in opposite directions.
A covariance of zero indicates that the variables are not linearly related.

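Population covariance: cov(x, y) = Σ (xᵢ - µx)*(yᵢ - µy) / N
Sample covariance: cov(x, y) = Σ (xᵢ - x̄)*(yᵢ - ȳ) / (n - 1)

Here “µ” is the population mean and x̄ is the sample mean.

Table: Data about employee experience vs. salary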

In the above example, the covariance is 86/5 = 17.2, which shows a positive relationship: as experience increases, salary also increases.

The limitation of covariance is that it does not tell us about the strength of the relationship between two variables, since its magnitude depends on the scale (units) of the variables and is also highly sensitive to outliers.
It is good for showing the direction of the relationship but not its strength.
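A short Python sketch makes the scale problem concrete. The `covariance` helper and the experience and salary figures below are hypothetical values invented for this illustration; they are not the article's table.

```python
def covariance(xs, ys, sample=False):
    """Population covariance (divide by N) or sample covariance (divide by n - 1)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (n - 1 if sample else n)


# Hypothetical data: years of experience and salary in thousands.
experience_years = [1, 2, 3, 4, 5]
salary_thousands = [30, 35, 45, 50, 60]

print(covariance(experience_years, salary_thousands))               # 15.0
print(covariance(experience_years, salary_thousands, sample=True))  # 18.75

# Same salaries expressed in single units instead of thousands:
salary_units = [s * 1000 for s in salary_thousands]
print(covariance(experience_years, salary_units))                   # 15000.0
```

The relationship between the two variables is identical in both cases, yet the covariance jumps from 15 to 15000 just because the unit changed. Only the positive sign carries over reliably.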

Correlation

Correlation refers to the statistical relationship between two variables, indicating the degree to which they are associated. A correlation coefficient is used to measure the strength and direction of the relationship between the variables, with values ranging from -1 to 1. A coefficient of 1 indicates a perfect positive correlation, 0 indicates no correlation, and -1 indicates a perfect negative correlation.

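Corr(x, y) = cov(x, y) / (σx * σy)

where σx and σy are the standard deviations of x and y respectively.

σx = (86/5)^0.5 = 4.1473
σy = (94/5)^0.5 = 4.3359
cov(x, y) = 86/5 = 17.2
Corr(x, y) = 17.2/(4.1473*4.3359) ≈ 0.956, which is close to 1 and indicates a strong positive relationship.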
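The same calculation can be sketched in Python. The `pearson_correlation` helper and the experience and salary lists are hypothetical (the article's table is not reproduced here); the last lines redo the article's arithmetic from the sums it quotes, 86/5 for the covariance and 86/5 and 94/5 for the variances.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (sd_x * sd_y)


# Hypothetical data: years of experience and salary in thousands.
experience_years = [1, 2, 3, 4, 5]
salary_thousands = [30, 35, 45, 50, 60]
r = pearson_correlation(experience_years, salary_thousands)
print(round(r, 3))  # about 0.993

# Changing the unit of salary leaves the correlation unchanged.
r_rescaled = pearson_correlation(experience_years, [s * 1000 for s in salary_thousands])
print(math.isclose(r, r_rescaled))  # True

# Re-doing the article's arithmetic from its quoted sums.
article_r = (86 / 5) / (math.sqrt(86 / 5) * math.sqrt(94 / 5))
print(round(article_r, 3))
```

Unlike covariance, the result always lies between -1 and 1 regardless of units, which is what makes it usable as a measure of strength.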

Correlation and Causation

The phrase “correlation does not imply causation” means that just because two variables are associated with each other, it does not necessarily mean that one causes the other. In other words, a correlation between two variables does not necessarily imply that one variable is the reason for the other variable’s behaviour.

Suppose there is a positive correlation between the number of firefighters present at a fire and the amount of damage caused by the fire. One might be tempted to conclude that the presence of firefighters causes more damage. However, this correlation could be explained by a third variable — the severity of the fire. More severe fires might require more firefighters to be present, and also cause more damage.
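The firefighter example can be imitated with a small simulation. Everything in the sketch below is made up for illustration: fire severity is drawn at random, and both the number of firefighters and the damage are generated from severity plus independent noise, with arbitrary coefficients 2 and 3. By construction, firefighters have no effect on damage at all.

```python
import math
import random


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


random.seed(0)  # fixed seed so the run is repeatable

# Severity is the hidden third variable that drives both quantities.
severity = [random.uniform(0, 10) for _ in range(2000)]
firefighters = [2 * s + random.gauss(0, 1) for s in severity]
damage = [3 * s + random.gauss(0, 1) for s in severity]

raw = pearson_correlation(firefighters, damage)

# Subtract the part explained by severity (coefficients are known here
# only because we generated the data ourselves).
firefighters_left = [f - 2 * s for f, s in zip(firefighters, severity)]
damage_left = [d - 3 * s for d, s in zip(damage, severity)]
adjusted = pearson_correlation(firefighters_left, damage_left)

print(f"raw correlation: {raw:.2f}")                  # strongly positive
print(f"after removing severity: {adjusted:.2f}")     # close to zero
```

The raw correlation is strong even though neither variable influences the other, and it essentially disappears once severity is accounted for.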

Skewness

Skewness is a measure of the asymmetry of a probability distribution. It can be positive, negative, zero, or undefined.

  • Positive Skew: This is the case when the tail on the right side of the curve is longer than the tail on the left side. For these distributions, the mean is typically greater than the mode.
  • Negative Skew: This is the case when the tail on the left side of the curve is longer than the tail on the right side. For these distributions, the mean is typically smaller than the mode.

The most commonly used method of calculating skewness is: [formula image not recovered]

If the skewness is zero, the distribution is symmetrical. If it is negative, the distribution is Negatively Skewed and if it is positive, it is Positively Skewed.
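The sign rule can be checked with a short Python sketch. Since the article's own formula is not reproduced here, the code assumes one common definition, the moment-based coefficient (third central moment divided by the 1.5 power of the second central moment); the helper name `skewness` and the three small datasets are invented for illustration.

```python
def skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (an assumed, common definition)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    if m2 == 0:
        # All values identical: skewness is undefined.
        raise ValueError("skewness is undefined for constant data")
    return m3 / m2 ** 1.5


symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 2, 2, 3, 10]      # long tail on the right
left_tailed = [-10, -3, -2, -2, -1]  # long tail on the left

print(skewness(symmetric))         # 0.0
print(skewness(right_tailed) > 0)  # True (positively skewed)
print(skewness(left_tailed) < 0)   # True (negatively skewed)
```

The constant-data case shows where the "undefined" outcome comes from: the denominator is zero.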

Kurtosis

Kurtosis describes whether the data is light-tailed (lack of outliers) or heavy-tailed (outliers present) when compared to a Normal distribution. There are three kinds of Kurtosis:

  • Mesokurtic: This is the case when the excess kurtosis is zero (kurtosis of 3), as for the normal distribution.
  • Leptokurtic: This is when the tails of the distribution are heavy (outliers present) and the kurtosis is higher than that of the normal distribution.
  • Platykurtic: This is when the tails of the distribution are light (few or no outliers) and the kurtosis is lower than that of the normal distribution.
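The three cases can be sketched in Python using excess kurtosis, that is, the fourth central moment divided by the squared second central moment, minus 3 so that a normal distribution scores zero. The helper names, the tolerance of 0.1 used to call something mesokurtic, and the two small datasets are all assumptions made for this illustration.

```python
def excess_kurtosis(values):
    """Excess kurtosis: m4 / m2**2 - 3 (zero for a normal distribution)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    return m4 / m2 ** 2 - 3


def kurtosis_type(values, tolerance=0.1):
    """Label data as mesokurtic, leptokurtic or platykurtic (tolerance is arbitrary)."""
    k = excess_kurtosis(values)
    if abs(k) <= tolerance:
        return "mesokurtic"
    return "leptokurtic" if k > 0 else "platykurtic"


flat = [1, 2, 3, 4, 5]                           # evenly spread, no extreme values
heavy_tails = [-10, 0, 0, 0, 0, 0, 0, 0, 0, 10]  # mostly central, two extremes

print(round(excess_kurtosis(flat), 2), kurtosis_type(flat))
# -1.3 platykurtic
print(excess_kurtosis(heavy_tails), kurtosis_type(heavy_tails))
# 2.0 leptokurtic
```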

This is the end of Part 2 of the Descriptive Statistics All In One Cheat Sheet.
In the first part, I discussed the Measure of Frequency, the Measure of Central Tendency, and the Measure of Dispersion, covering topics like Mean, Median, Mode, Variance, and Standard Deviation. Follow this link to read that.

Thanks for Reading!

If you like this post, follow me on Medium and connect with me on LinkedIn.

Hey, my name is Tarun Kumar Mohapatra. I strongly believe that knowledge multiplies when it is shared with others.