Numpy uncovered: Histograms using Numpy and Matplotlib

Md Khalid Siddiqui
Published in Analytics Vidhya
4 min read · Sep 2, 2020

Part 2 of Guide to statistics using Numpy series

Introduction:

When we first look at a dataset, we want to be able to quickly understand certain things about it:

  • Do some values occur more often than others?
  • What is the range of the dataset (i.e., the min and the max values)?
  • Are there a lot of outliers?

We can visualize this information using a chart called a histogram.

For instance, suppose that we have the following dataset:

d = [1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4, 5]

A simple histogram might show us how many 1’s, 2’s, 3’s, etc. we have in this dataset.
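Before drawing anything, we can get these counts directly. The snippet below is a minimal sketch that uses np.unique, a NumPy function not otherwise used in this article, to tally how often each value occurs in d:

```python
import numpy as np

d = np.array([1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4, 5])

# np.unique with return_counts=True returns the distinct values
# and the number of times each one occurs.
values, counts = np.unique(d, return_counts=True)

for value, count in zip(values, counts):
    print(f"{value} occurs {count} times")
```

These counts are the bar heights we would expect a one-bar-per-value histogram of d to show.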

When graphed, our histogram would look like this:

Histograms vs Bar Charts

Although they appear somewhat similar, histograms and bar charts are different and serve different purposes. A detailed discussion of the differences between the two can be found here.

Histogram

Histograms are used for graphing a distribution of frequencies for quantitative data. Visually, the bars usually touch, with no space between them. Each bar represents the number of values in the dataset that fall within a ‘bin’, or range of values. For example, each bin might show the number of people within certain age ranges (10–20, 20–30, …).

Bar chart

Bar charts are used to group data based on categories. Visually, these are usually spaced apart. Each bar represents how much of the data falls into a category. For example, a bar chart might show how many students there are per major in a university, where each major is a “category”.

Bins

Suppose we had a larger dataset with values ranging from 0 to 50. We might not want to know exactly how many 0’s, 1’s, 2’s, etc. we have. Instead, we might want to know how many values fall between 0 and 5, 5 and 10, 10 and 15, etc.

These groupings are called bins. In the histograms in this article, all bins are the same size, which is also what Matplotlib produces when we ask for a number of bins. The width of each bin is the distance between its minimum and maximum values. In our example, the width of each bin would be 5.

Histogram with values ranging from 0 to 50 and bins of size 5.
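To make the idea of bins concrete, here is a small sketch that builds bin edges of width 5 and counts how many values land in each bin with np.histogram. The dataset is made up for illustration (200 random integers between 0 and 50); it is not the data behind the figure.

```python
import numpy as np

# Made-up data for illustration: 200 random integers from 0 to 50 inclusive.
rng = np.random.default_rng(seed=0)
data = rng.integers(0, 51, size=200)

# Bin edges 0, 5, 10, ..., 50 give ten bins, each of width 5.
edges = np.arange(0, 51, 5)
counts, edges = np.histogram(data, bins=edges)

# The width of each bin is the difference between neighbouring edges.
widths = np.diff(edges)

print(edges)   # the 11 bin edges
print(widths)  # the 10 bin widths
print(counts)  # how many values fall in each bin
```

Ten bins need eleven edges, and every one of the 200 values ends up in exactly one bin.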

MATPLOTLIB

We can graph histograms using a Python module known as Matplotlib.

For now, familiarize yourself with the following syntax to draw a histogram:

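# This imports the plotting package.  We only need to do this once.
from matplotlib import pyplot as plt
# data can be any list or array of numbers; here we reuse the small dataset from above
data = [1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4, 5]
# This plots a histogram
plt.hist(data)
# This displays the histogram
plt.show()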

When we call plt.hist with no keyword arguments, Matplotlib will automatically make a histogram with 10 bins of equal width that span the entire range of our data.

If you want a different number of bins, use the keyword bins. For instance, the following code would give us 5 bins, instead of 10:

plt.hist(data, bins=5)
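We can inspect what the bins keyword does without drawing anything. The sketch below uses np.histogram, which accepts the same bins argument and, to the best of my knowledge, is what plt.hist relies on internally to do the counting. It returns the counts and the bin edges instead of a plot.

```python
import numpy as np

d = np.array([1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4, 5])

# Default: 10 equal-width bins between the smallest and largest value.
counts_default, edges_default = np.histogram(d)

# bins=5: five equal-width bins over the same span.
counts_five, edges_five = np.histogram(d, bins=5)

print(len(counts_default), len(edges_default))
print(len(counts_five), len(edges_five))
print(edges_five)
print(counts_five)
```

Note that the five bins are spread evenly between the minimum (1) and the maximum (5), so their edges are not whole numbers.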

If you want a different range, you can pass in the minimum and maximum values that you want to histogram using the keyword range. We pass in a tuple of two numbers. The first number is the lower edge of the first bin and the second is the upper edge of the last bin. Both ends are included, and values outside the range are ignored.

For instance, if our dataset contained values between 0 and 100, but we only wanted to histogram numbers between 20 and 50, we could use this command:

# Values equal to 50 are counted, because the last bin includes its upper edge
plt.hist(data, range=(20, 50))
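A quick way to check how the range keyword treats its endpoints is to run the same settings through np.histogram, which takes the same range argument. The values below are made up for illustration.

```python
import numpy as np

# Made-up values: some inside 20..50, some outside.
data = np.array([0, 10, 20, 35, 50, 51, 100])

counts, edges = np.histogram(data, range=(20, 50))

print(edges[0], edges[-1])  # first and last bin edge
print(counts.sum())         # how many values were counted at all
```

Only 20, 35 and 50 are counted. The values 0, 10, 51 and 100 fall outside the range and are dropped, while both endpoints (20 and 50) are kept.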

Here’s a complete example:

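import numpy as np
from matplotlib import pyplot as plt

# The small dataset from the introduction
d = np.array([1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4, 5])

# Five bins of width 1 covering 1 to 6: [1, 2), [2, 3), [3, 4), [4, 5), [5, 6]
plt.hist(d, bins=5, range=(1, 6))
plt.show()

Plot using Matplotlib pyplot with 5 bins and range 1 to 6.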

A deeper look into what the height of a bar in the histogram represents

Each bar’s height in a histogram is the count of values that fall within that bin’s range. The magnitude of a value does not affect the height: every value adds exactly 1 to the bar of the bin it falls in.
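As a small illustration, the sketch below counts two made-up datasets with the same pair of bins. One holds values near the bottom of the first bin and the other holds values near the top, yet both produce the same bar heights.

```python
import numpy as np

# Two bins: 10 to 20 and 20 to 30.
edges = [10, 20, 30]

# Made-up values, all inside the first bin.
low, _ = np.histogram([10, 11, 12], bins=edges)
high, _ = np.histogram([17, 18, 19.9], bins=edges)

print(low)
print(high)
```

Each dataset puts three values in the first bin and none in the second, so the two histograms would look identical.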

Every bin of a histogram, except for the last bin, counts the values in its range in the following manner:

[start, end)

where the start value of a bin is inclusive toward the count, and the end value is exclusive. For example, with a histogram bin of values between 10 and 20, the value 10 is included, but the value 20 is not included.

The last bin of a histogram is inclusive for the end value. For example, if the last bin is between 80 and 90, then both 80 and 90 are included in the final bar.
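We can verify this edge behaviour with np.histogram and a handful of made-up values that sit exactly on the bin edges:

```python
import numpy as np

# Three bins: [0, 10), [10, 20) and [20, 30].
edges = [0, 10, 20, 30]

# Made-up values placed exactly on the edges.
data = [0, 10, 10, 20, 30]

counts, _ = np.histogram(data, bins=edges)
print(counts)
```

The value 0 lands in the first bin, both 10s land in the second bin (not the first), and 20 and 30 both land in the last bin, because only the last bin includes its upper edge.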
