Understanding Histograms

Muhammad Azhar · Published in Analytics Vidhya · Apr 24, 2020

Histograms are graphs that demonstrate the distribution of your continuous data.

Why use Histograms?

Histograms offer a distinct advantage over summary statistics (mean, median, standard deviation): they reveal features of a distribution that numerical summaries cannot capture.

While summary statistics are valuable for conveying information about data distributions, they inherently simplify the complexities of the dataset.

By combining graphical representations, such as histograms, with statistical values, we can enhance our understanding of sample data.

What are the uses of histograms?

Histograms can be used to understand the distribution of your continuous data.

Histograms and Central Tendency:

Histograms can be used to find the center of a data sample. In the histogram given below, the center of the data sample lies between -1 and 0.

For a roughly symmetric distribution, this center corresponds to the mean. In other words, a histogram gives a visual indication of where the mean of the data lies.
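As a minimal sketch of this idea (using NumPy and an illustrative synthetic sample, not the article's original data), we can bin a roughly normal sample and check that the tallest bar sits near the mean:

```python
import numpy as np

# Hypothetical sample: 1,000 draws from a normal distribution
# centered at -0.5 (illustrative data only).
rng = np.random.default_rng(0)
sample = rng.normal(loc=-0.5, scale=1.0, size=1000)

# Bin the data the way a histogram would.
counts, edges = np.histogram(sample, bins=20)

# For symmetric data, the tallest bar sits near the center of the sample.
peak = counts.argmax()
print(f"mean = {sample.mean():.2f}")
print(f"tallest bin spans [{edges[peak]:.2f}, {edges[peak + 1]:.2f})")
```

Passing `sample` to `matplotlib.pyplot.hist` would draw the same bins as bars, which is how the figures in this post were produced.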

Histograms can also reveal overlap between two or more data distributions, which is useful for finding values the distributions have in common.

Histograms and Variability:

Summary statistics can create a false impression of a data distribution.

Suppose you have two data distributions and the only thing known about them is their mean. If both means are the same, this information could lead us to believe that the two distributions are practically equivalent.

However, if we graph those distributions, we find that they are in fact quite different, as shown below:

As we can see, the two distributions (A & C) are notably different even though they share the same mean. Distribution A ranges from 40 to 70, while distribution C ranges from 10 to 90. Thus, the mean alone does not provide the complete picture of our data and can be misleading.
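This same-mean, different-spread situation is easy to reproduce. The sketch below uses two synthetic normal samples as stand-ins for distributions A and C (the specific locations and spreads are my own illustrative choices, not the article's data):

```python
import numpy as np

rng = np.random.default_rng(1)

# Two hypothetical distributions with the same mean but very
# different spreads, mirroring distributions A and C in the text.
a = rng.normal(loc=55, scale=5, size=1000)    # clusters tightly around 55
c = rng.normal(loc=55, scale=20, size=1000)   # spreads widely around 55

print(f"mean A = {a.mean():.1f}, mean C = {c.mean():.1f}")  # nearly equal
print(f"std  A = {a.std():.1f},  std  C = {c.std():.1f}")   # very different
```

Comparing only the means would suggest the two samples are interchangeable; plotting both histograms on shared axes immediately shows they are not.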

Summary statistics, such as the mean and standard deviation, offer only partial information, while histograms show us which values are more or less common in the data.

Histograms and Skewed Distributions:

Understanding skewed distributions:

A distribution is said to be skewed if its data points are not spread evenly but are instead clustered toward one side of the scale.

Skewed distributions are asymmetric: the right side of the graph does not mirror the left side, and the distribution does not have the shape of a Gaussian bell curve. Skewness also affects summary statistics such as the mean, median, and mode.

The shape of a data distribution is a critical attribute that determines which measure of central tendency reflects the center of the data most accurately.

A skewed distribution can be described as:

  • Right Skewed / Positive Skewed
  • Left Skewed / Negative Skewed

Right Skewed Distribution:

In a right-skewed distribution, most of the data is clustered on the left side of the graph, with the tail extending toward the right.

Left Skewed Distribution:

A left-skewed distribution has most of its data points clustered on the right side of the graph, with the tail extending toward the left.
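One concrete consequence of skewness is that the mean and median drift apart. The sketch below uses a synthetic exponential sample (a classic right-skewed shape; my choice of distribution, not the article's) to show the long right tail pulling the mean above the median:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical right-skewed sample: an exponential distribution has a
# long tail to the right, which drags the mean above the median.
right_skewed = rng.exponential(scale=1.0, size=10000)

print(f"mean   = {right_skewed.mean():.2f}")      # pulled toward the tail
print(f"median = {np.median(right_skewed):.2f}")  # closer to the bulk of the data
```

For a left-skewed sample the relationship flips: the tail on the left drags the mean below the median. This is why the median is often the better measure of center for skewed data.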

Histograms are an excellent tool for revealing the shape of a data distribution, including its skewness. Additionally, the shape of the distribution affects which kind of hypothesis test, parametric or non-parametric, is appropriate. (I’ll talk about these types in a separate blog.)

Thus, histograms can be very useful in choosing both the summary statistics and the hypothesis tests that are most effective for our data distribution.

Using histograms to find outliers:

What are outliers?

In statistics, an outlier is an observation or data point that deviates significantly from the other observations in a dataset. Experimental errors or mistakes during data collection can introduce outliers into our data. However, outliers may also reflect true variability inherent in the data rather than being the result of error or unusual conditions.

Histograms can identify outliers: they appear as isolated bars, far from the rest of the distribution.
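This "isolated bar" signature is easy to see in the bin counts themselves. The sketch below injects a single extreme value into a synthetic normal sample (both the sample and the outlier value 15.0 are my own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical sample with one injected outlier far from the bulk of the data.
sample = np.append(rng.normal(loc=0.0, scale=1.0, size=500), 15.0)

counts, edges = np.histogram(sample, bins=30)

# The outlier occupies the final bin as an isolated bar of height 1,
# separated from the main cluster by a run of empty bins.
print(counts)
```

In a plotted histogram, that run of zero-count bins is exactly the visual gap that makes the outlier's bar stand apart from the main distribution.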

Identifying Multimodal Distribution

A multimodal distribution has more than one peak, or ‘mode’, in its probability distribution graph. The mode is the value that appears most often in a data distribution; it is the value at which the probability mass function reaches its maximum. Data distributions fall into the following types:

  • Unimodal: a distribution with only one peak/mode.
  • Bimodal: a distribution with two peaks/modes.
  • Multimodal: a distribution with two or more peaks/modes.

Cause of Multimodality:

A multimodal distribution indicates the presence of several distinct clusters of values in the data. Combining two distributions, for example samples drawn from two different populations, can cause multimodality in the resulting distribution.

It is important to note that combining two normal distributions with equal means does not always result in a unimodal distribution. Eisenberger derived conditions for the unimodality of the combined distribution, providing valuable insight into this phenomenon.

Summary statistics provide no information about the multimodality of a given data distribution.

Imagine your dataset has the properties shown below:

The distribution appears to be quite straightforward, but when we plot the histogram, we realize that it is multimodal.
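A mixture of two populations reproduces this effect. In the sketch below (synthetic data; the two population means of 10 and 40 are my own illustrative choices), the overall mean lands between the two peaks, at a value the data almost never takes:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical mixture of two normal populations. Summary statistics
# describe it as one distribution centered near 25, but a histogram
# reveals two separate peaks.
mixture = np.concatenate([
    rng.normal(loc=10, scale=2, size=1000),
    rng.normal(loc=40, scale=2, size=1000),
])

print(f"overall mean = {mixture.mean():.1f}")  # sits between the two modes

counts, edges = np.histogram(mixture, bins=40)
# Bins between the two clusters are nearly empty: that gap is the
# signature of bimodality that the mean alone cannot show.
```

Plotting `counts` against `edges` as bars would show two tall humps with a valley in between, which is exactly what the summary table conceals.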

The histogram demonstrates why we should graph our data rather than relying on just summary statistics.
