Histograms in Data Science

Fernando Aguilar
Mar 15, 2019


Histograms resemble vertical bar charts. However, where a bar chart compares separate categories, a histogram depicts the frequency of discrete or continuous data measured on an interval scale, with the values grouped into intervals called bins. This makes it easy to visualize the underlying distribution of the dataset and to inspect properties such as skewness and kurtosis.

As an example, let’s take the ages of the players in the FIFA 19 video game. To load the data and plot the histograms I am going to use the pandas and matplotlib.pyplot libraries. You can get a copy of this dataset on Kaggle at: https://www.kaggle.com/karangadiya/fifa19.

import matplotlib.pyplot as plt
import pandas as pd
plt.style.use('ggplot')
df = pd.read_csv('data.csv')
df.Age.describe()
---------------
count 18207.000000
mean 25.122206
std 4.669943
min 16.000000
25% 21.000000
50% 25.000000
75% 28.000000
max 45.000000
Name: Age, dtype: float64

Looking at the descriptive statistics above, I expect the histogram of the Age variable to look approximately normal, given that the mean (25.12) and the median (25) are fairly similar. Since the mean is a bit larger than the median, I’m expecting the distribution to be positively skewed, with some outliers on the right side of our plot. The maximum (45) also lies much farther from the mean than the minimum (16) does, which points the same way. This could be explained by older football players who decide to postpone their retirement beyond the age at which most football players retire. Let’s start by plotting our histogram using the matplotlib.pyplot.hist() function:

plt.figure(figsize=(12, 8))
# 10 equal-width bins spanning the minimum to the maximum age
plt.hist(df['Age'], color='green', bins=10)
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.title("FIFA 2019 Football Players' Age Histogram")
plt.show()
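The skew suggested by comparing the mean and the median can also be checked numerically. The following is a minimal sketch on a small, made-up sample of ages (not the FIFA 19 data), using the moment-based definition of skewness. On the real data, df['Age'].skew() should give a comparable answer, although pandas applies a sample-size correction, so the value may differ slightly.

```python
import numpy as np

# Hypothetical sample of ages, NOT the FIFA 19 data: mostly mid-twenties
# with a few older players forming a tail on the right.
ages = np.array([18, 20, 21, 22, 23, 24, 25, 25, 26, 27, 28, 30, 33, 38, 44])

mean = ages.mean()
median = np.median(ages)

# Moment-based (population) skewness: the average cubed z-score.
z = (ages - mean) / ages.std()
skewness = np.mean(z ** 3)

print(f"mean={mean:.2f}, median={median:.2f}, skewness={skewness:.2f}")
```

A mean above the median together with a positive skewness value is what a right tail of older players would produce.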

From the histogram, we can confirm that the distribution is roughly bell-shaped with a positive skew. In this scenario the bins, or data intervals, all have the same width, so the height of each bar reflects the frequency. This is not always the case: it is possible to have bins of different widths (e.g. bins for ages 16–18, 18–22, 22–25, etc.). In that scenario, the frequency for each bin is properly read from the area of the bar, that is, the height multiplied by the width of the bin. Note that plt.hist() only follows this convention when density=True is passed, in which case the area of each bar equals the share of observations in that bin; by default it keeps drawing the raw count as the bar height, whatever the bin width. Let’s look at the same histogram with different bin sizes:

plt.figure(figsize=(12, 8))
# Explicit bin edges of unequal width; bar heights are still raw counts
plt.hist(df['Age'], color='blue', bins=[16, 18, 22, 25, 30, 34, 39, 41, 45])
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.title("FIFA 2019 Football Players' Age Histogram - Different Bin Sizes")
plt.show()

As you can see, bins of different widths can distort the apparent shape of the distribution: high-frequency ages can be grouped together, or grouped with less frequent ages, which biases the plot. For this reason it is advisable to use bins of equal width so that the bars are visually comparable to each other.
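If unequal bins cannot be avoided, one way to keep the bars comparable is to plot density rather than raw counts. The sketch below uses synthetic ages (an assumption for illustration, not the FIFA 19 data) and np.histogram to show that, with density=True, the area of each bar equals that bin’s share of the observations. plt.hist() accepts the same density=True argument.

```python
import numpy as np

# Synthetic ages for illustration only, NOT the FIFA 19 data.
rng = np.random.default_rng(0)
ages = rng.integers(16, 46, size=1000)  # integers from 16 to 45 inclusive

edges = [16, 18, 22, 25, 30, 34, 39, 41, 45]  # same unequal edges as above

counts, _ = np.histogram(ages, bins=edges)
density, _ = np.histogram(ages, bins=edges, density=True)
widths = np.diff(edges)

# With density=True, height * width is each bin's share of the data,
# so the areas sum to 1 regardless of how wide each bin is.
areas = density * widths

print("counts:", counts)
print("areas: ", np.round(areas, 3))
```

Because a wide bin’s count is divided by its width, it no longer looks taller simply for covering more ages.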

Another consideration is the number of bins. If the bins are too narrow, the plot shows too much individual detail and noise, which makes the overall distribution harder to see. On the other hand, if the bins are too wide, the data is smoothed out so much that the shape of the distribution can be lost. Let’s compare them side by side:

# Three side-by-side histograms of the same data with different bin counts
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(18, 6))
ax1.hist(df['Age'], color='red', bins=4)
ax2.hist(df['Age'], color='green', bins=10)
ax3.hist(df['Age'], color='blue', bins=40)
ax1.set_title('Histogram with 4 bins')
ax2.set_title('Histogram with 10 bins')
ax3.set_title('Histogram with 40 bins')
ax2.set_xlabel('Age')
ax1.set_ylabel('Frequency')
plt.show()
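Rather than picking the number of bins by eye, a rule of thumb can suggest a starting point. The sketch below, again on synthetic ages (an assumption, not the FIFA 19 data), asks NumPy for equal-width bin edges using the Sturges and Freedman-Diaconis rules. It also builds one bin per integer age, which may suit whole-number data like ages. With 40 bins over the 30 possible integer ages from 16 to 45, at least ten bins necessarily stay empty.

```python
import numpy as np

# Synthetic ages for illustration only, NOT the FIFA 19 data.
rng = np.random.default_rng(0)
ages = rng.integers(16, 46, size=1000)  # integers from 16 to 45 inclusive

# Two common rules of thumb for equal-width bin edges.
sturges_edges = np.histogram_bin_edges(ages, bins='sturges')
fd_edges = np.histogram_bin_edges(ages, bins='fd')

# For integer data: one bin per age, centered on each whole number.
integer_edges = np.arange(ages.min() - 0.5, ages.max() + 1.5, 1)

print("Sturges bins:", len(sturges_edges) - 1)
print("Freedman-Diaconis bins:", len(fd_edges) - 1)
print("One-per-age bins:", len(integer_edges) - 1)
```

Any of these edge arrays can be passed as the bins argument of plt.hist(), which also accepts the rule names 'sturges' and 'fd' directly. These rules are only heuristics, so it is still worth comparing a few choices visually.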

Conclusion

Histograms are a powerful visualization tool for quickly inspecting the underlying distribution of a data set. However, it is important to keep the number of bins in mind, and to prefer bins of equal width for ease of interpretation.


Fernando Aguilar

Data Analyst at Enterprise Knowledge, currently pursuing an MS in Applied Statistics at PennState, and Flatiron Data Science Bootcamp Graduate.