Shape Up Your Stats: A Guide to Measure Data Shape and Distribution

Arun Prakash Asokan
7 min read · Feb 3, 2023


Hey there! Next up in my Statistical Symphony Series is measuring the shape of data. I’m thrilled to share my knowledge on this topic with you all. Are you tired of having your data drowned in numbers, unable to understand its shape and distribution? Well, this article is just what you need! I’ll be taking you on a journey through the world of measuring the shape and distribution of your data. For better continuity, do check out my previous articles, where I discussed measures of central tendency and spread.

Whether you’re a beginner or a seasoned pro, I promise that you’ll leave this article with a newfound appreciation for how easy and important it is to measure the shape and distribution of your data.

Symmetry

Symmetry refers to the property of a dataset where the left and right sides of the distribution are mirror images of each other. In other words, if we were to fold the dataset along the center, the two sides would match up perfectly.

A symmetric dataset has the same shape on both sides of the center point, so its mean and median coincide. If the distribution also has a single peak, the mode sits at that same point. This is the case for a normal distribution, which is symmetric around the mean.

For example, imagine a dataset of the heights of 100 people. If the distribution of heights is symmetric and bell-shaped, the majority of the people will be of average height, with equally few people who are very tall or very short. The mean, median and mode of the dataset would then be the same, or very nearly so in a real sample.

Another example would be a dataset of the scores of a test taken by 100 students. If the distribution of scores is symmetric and bell-shaped, the majority of the students will have scored around the average, with equally few students who scored very high or very low. Again, the mean, median and mode would be the same, or very nearly so.

In real-world situations, data may not be perfectly symmetric. However, understanding the level of symmetry in the data can provide important insights into the underlying patterns in the data. Symmetry is often used in conjunction with other measures of shape, such as skewness and kurtosis, to understand the overall distribution of the data.

How do we measure symmetry in a dataset?

  1. Visual inspection: One of the simplest ways to determine symmetry is by creating a histogram or a boxplot of the data and visually inspecting the shape of the distribution. If the distribution is symmetric, the left and right sides will be mirror images of each other.
  2. Measures of central tendency: The mean, median, and mode are measures of central tendency that can be used to check for symmetry. In a symmetric dataset with a single peak, the mean, median, and mode will be the same. In a positively skewed dataset, the mean is typically greater than the median, and in a negatively skewed dataset, the mean is typically less than the median.
  3. Measures of kurtosis: Kurtosis describes peaked-ness and tail weight rather than symmetry, so it cannot confirm symmetry on its own. It is still worth checking alongside skewness: a normal distribution, which is symmetric, has an excess kurtosis of 0.
  4. Measures of skewness: Skewness is a measure of asymmetry that can be used to check for symmetry. A skewness coefficient of 0 is consistent with symmetric data, a positive value points to a longer right tail, and a negative value points to a longer left tail. The coefficient has no fixed bounds. As a common rule of thumb, values between -0.5 and 0.5 are treated as approximately symmetric, and values below -1 or above 1 as highly skewed.

It’s important to note that no method is foolproof and it’s always better to use a combination of methods to get a better understanding of the symmetry of a dataset.
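As a rough illustration of combining these checks, the sketch below tests a small made-up sample for perfect mirror symmetry and then compares its mean, median and mode. The sample values and the helper name `is_mirror_symmetric` are assumptions chosen for this example, not part of any library.

```python
import statistics

def is_mirror_symmetric(values, tol=1e-9):
    """Return True if the sorted values mirror each other around the midpoint."""
    ordered = sorted(values)
    center = (ordered[0] + ordered[-1]) / 2
    # Pair the i-th smallest value with the i-th largest one.
    # In a perfectly symmetric sample every pair averages to the same center.
    return all(
        abs((low + high) / 2 - center) <= tol
        for low, high in zip(ordered, reversed(ordered))
    )

sample = [2, 4, 5, 5, 5, 6, 8]  # hypothetical, perfectly symmetric sample

symmetric = is_mirror_symmetric(sample)
mean = statistics.mean(sample)
median = statistics.median(sample)
mode = statistics.mode(sample)

print(symmetric, mean, median, mode)
```

Real data will rarely pass an exact check like this one, so in practice it is more useful to look at how far apart the mean and median are than to expect a strict match.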

Skewness

Skewness is a measure of the asymmetry of the data distribution. It describes how far the data deviates from being symmetric. A normal distribution, for example, is symmetric around the mean, which means that the left and right sides of the distribution are mirror images of each other and its skewness is 0.

A dataset is considered to be positively skewed if the tail on the right side is longer or fatter than the tail on the left side. This means that the majority of the data is concentrated on the left side of the distribution and there are a few high values on the right side.

Right Skewed or Positively Skewed or Long Right Tailed Distribution

For example, imagine a dataset of the incomes of 100 people, where most of the people have low incomes and a few have high incomes. The distribution of incomes would be positively skewed, with a long tail on the right side representing the high incomes.

A dataset is considered to be negatively skewed if the tail on the left side is longer or fatter than the tail on the right side. This means that the majority of the data is concentrated on the right side of the distribution and there are a few low values on the left side.

Left Skewed or Negatively Skewed or Long Left Tailed Distribution

For example, imagine a dataset of the ages of 100 people, where most of the people are old and a few are young. The distribution of ages would be negatively skewed, with a long tail on the left side representing the young ages.

In positively skewed data sets, typically mean > median > mode.

In a normal distribution, mean = median = mode.

In negatively skewed data sets, typically mean < median < mode.
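To make these orderings concrete, here is a minimal sketch on two made-up datasets, one with a few large values and its mirror image with a few small values. The data and the helper name `summarize` are assumptions for illustration only.

```python
import statistics

right_skewed = [1, 2, 2, 2, 3, 3, 4, 5, 9]    # hypothetical: a few large values
left_skewed = [10 - x for x in right_skewed]  # mirror image: a few small values

def summarize(values):
    """Return (mean, median, mode) for a list of numbers."""
    return (
        statistics.mean(values),
        statistics.median(values),
        statistics.mode(values),
    )

for name, values in [("right-skewed", right_skewed), ("left-skewed", left_skewed)]:
    mean, median, mode = summarize(values)
    print(f"{name}: mean={mean:.2f} median={median} mode={mode}")
```

These orderings are a rule of thumb rather than a guarantee, so treat a mismatch on real data as a prompt to look at a histogram, not as an error.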

The most common way to measure skewness is the skewness coefficient, which is calculated by taking the sum of the cubed deviations of each data point from the mean, dividing by the number of data points, and then dividing by the cubed standard deviation.
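Written out, that description corresponds to the formula below. The symbols are labels chosen for this sketch, and using the population standard deviation (dividing by n rather than n - 1) is an assumption, since the description does not say which version is meant.

```latex
\text{skewness} = \frac{\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^3}{s^3},
\qquad
s = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}
```

Here n is the number of data points, x̄ is the mean, and s is the standard deviation.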

The skewness coefficient can be interpreted as follows:

  • A skewness coefficient of 0 is what symmetric data produces.
  • A skewness coefficient greater than 0 indicates positive skewness, which means that the tail on the right side of the distribution is longer or fatter than the tail on the left side, and the mean is usually greater than the median.
  • A skewness coefficient less than 0 indicates negative skewness, which means that the tail on the left side of the distribution is longer or fatter than the tail on the right side, and the mean is usually less than the median.
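The three cases above can be sketched in a few lines of plain Python. The function below is a minimal implementation of the coefficient using the population standard deviation, which is an assumption on my part. The datasets are made up for the example.

```python
import math

def skewness(values):
    """Population skewness: mean cubed deviation divided by the cubed standard deviation."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / n
    third_moment = sum((x - mean) ** 3 for x in values) / n
    return third_moment / math.sqrt(variance) ** 3

symmetric = [2, 4, 5, 5, 5, 6, 8]             # mirror image around 5
right_tailed = [1, 2, 2, 2, 3, 3, 4, 5, 9]    # a few large values
left_tailed = [10 - x for x in right_tailed]  # a few small values

for name, data in [
    ("symmetric", symmetric),
    ("right-tailed", right_tailed),
    ("left-tailed", left_tailed),
]:
    print(f"{name}: {skewness(data):.3f}")
```

Statistical libraries offer ready-made versions, for example `scipy.stats.skew` in SciPy, but they may apply a sample-size correction depending on their settings, so small differences from a hand-rolled version are expected.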

It’s important to note that skewness is affected by outliers, so it’s better to use other measures of central tendency and dispersion in conjunction with skewness to get a better understanding of the data distribution.
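To see how strongly a single outlier can move the coefficient, the sketch below adds one extreme value to a made-up symmetric sample. The `skewness` helper uses the population formula, and both the helper and the data are assumptions for this illustration.

```python
import math
import statistics

def skewness(values):
    """Population skewness: mean cubed deviation divided by the cubed standard deviation."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / n
    third_moment = sum((x - mean) ** 3 for x in values) / n
    return third_moment / math.sqrt(variance) ** 3

balanced = [2, 4, 5, 5, 5, 6, 8]  # hypothetical symmetric sample
with_outlier = balanced + [30]    # the same sample plus one extreme value

before = skewness(balanced)
after = skewness(with_outlier)

print(f"skewness before: {before:.2f}, after: {after:.2f}")
print(f"mean before: {statistics.mean(balanced)}, after: {statistics.mean(with_outlier)}")
print(f"median before: {statistics.median(balanced)}, after: {statistics.median(with_outlier)}")
```

The median stays put while the mean and the skewness shift, which is why it helps to read skewness next to robust measures such as the median.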

Kurtosis

Kurtosis is often described as a measure of the peaked-ness of the data distribution, but it is driven mainly by the tails. It describes how heavy or light the tails of the distribution are compared with those of a normal distribution.

A normal distribution is considered to be mesokurtic, which means that it has a moderate peak and tails that are not too thick or too thin.

A dataset is considered to be platykurtic if the distribution is flatter than a normal distribution, with lighter tails, and leptokurtic if the distribution is more peaked than a normal distribution, with heavier tails.

For example, imagine a dataset of the heights of 100 people. If the distribution of heights is platykurtic, the heights will be spread fairly evenly over a range instead of clustering tightly around the average, and there will be very few extreme values. This would mean that the kurtosis of the dataset would be less than 3, the value for a normal distribution.

Another example would be a dataset of the scores of a test taken by 100 students. If the distribution of scores is leptokurtic, most of the students will have scored very close to the average, while the few who did not will have scored far from it, giving the distribution a sharp peak and heavy tails. This would mean that the kurtosis of the dataset would be greater than 3.

To measure kurtosis, we can use the kurtosis coefficient, which is calculated by taking the sum of the fourth power of the deviations of each data point from the mean, dividing by the number of data points, and then dividing by the fourth power of the standard deviation. This coefficient equals 3 for a normal distribution. Subtracting 3 gives the excess kurtosis, which is the value many tools report.
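In formula form, the calculation looks like this. The symbols are labels chosen for this sketch, and the population standard deviation (dividing by n) is assumed.

```latex
\text{kurtosis} = \frac{\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^4}{s^4},
\qquad
\text{excess kurtosis} = \text{kurtosis} - 3,
\qquad
s = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}
```

Here n is the number of data points, x̄ is the mean, and s is the standard deviation.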

An excess kurtosis (the kurtosis coefficient minus 3) of 0 matches a normal distribution, which is mesokurtic.

An excess kurtosis greater than 0 indicates leptokurtic data. Positive excess kurtosis means that the distribution is more peaked and heavier-tailed than a normal distribution.

Negative excess kurtosis, i.e., an excess kurtosis less than 0, indicates platykurtic data, meaning the distribution is flatter and lighter-tailed than a normal distribution.
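The sketch below computes both the plain coefficient and the version shifted by 3 on two made-up datasets, one flat and one sharply peaked with a couple of far-out values. The function names, the data, and the use of the population standard deviation are assumptions for this example.

```python
def kurtosis(values):
    """Population kurtosis: mean fourth-power deviation divided by the squared variance."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / n
    fourth_moment = sum((x - mean) ** 4 for x in values) / n
    return fourth_moment / variance ** 2

def excess_kurtosis(values):
    """Kurtosis shifted so that a normal distribution scores 0."""
    return kurtosis(values) - 3

flat = list(range(1, 11))  # hypothetical: values spread evenly from 1 to 10
peaked = [5] * 8 + [0, 10]  # hypothetical: tight cluster plus two far-out values

for name, data in [("flat", flat), ("peaked", peaked)]:
    print(f"{name}: kurtosis={kurtosis(data):.3f} excess={excess_kurtosis(data):.3f}")
```

When using a library instead, check which of the two versions it returns. `scipy.stats.kurtosis`, for instance, reports the excess version by default.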

It’s important to note that kurtosis is affected by outliers, so it’s better to use other measures of central tendency and dispersion in conjunction with kurtosis to get a better understanding of the data distribution. Also, kurtosis is often used in conjunction with other measures of shape such as skewness to understand the overall distribution of the data.

That’s a wrap! Clap if you like :) I’m Arun Prakash Asokan. Do check out the next article in the Statistical Symphony Series, on the Central Limit Theorem. Stay tuned! See you soon!

