Published in The Startup

Data Science Tutorial

Hands-on Histograms with Matplotlib

A practical guide on how to plot cool histograms.

Photo by Ostap Senyuk on Unsplash

If you’re not familiar with statistics, it’s easy to confuse histograms with bar plots. The two look similar, but they serve different purposes. A histogram shows the distribution of numerical data, whereas a bar plot is used for categorical data. In short, a histogram gives you a quick picture of how the values of a variable are spread out. In this post, I’ll cover the following topics:

  • What is a histogram?
  • A histogram vs a bar plot
  • Plotting a histogram
  • Two-dimensional histograms
  • Application with a real dataset

Let’s dive in!

What is a histogram?

A histogram is a type of plot that visualizes numerical data using bars. Each bar covers a range of values, called a bin, and its height shows how many values fall into that range. This makes the distribution of a variable easy to see. Histograms are also helpful for spotting outliers in a dataset.

An example histogram

A histogram vs a bar plot

Histograms are often confused with bar charts. The two look alike, but a histogram groups numbers into ranges, while a bar plot draws one bar per category. Histograms are used to show the distribution of a variable, while bar plots are used to compare categories. To show this, let’s import Matplotlib, NumPy, and pandas.
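As a rough sketch of the difference, the snippet below draws a bar plot of category counts next to a histogram of numeric values. The genre labels and the random durations are made-up illustration data, not taken from the dataset used later in this post.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Categorical data (made up): a bar plot draws one bar per category
genres = pd.Series(["Drama", "Comedy", "Drama", "Action", "Drama", "Comedy"])
genre_counts = genres.value_counts()

# Numerical data (made up): a histogram groups the values into ranges
rng = np.random.default_rng(0)
durations = rng.normal(loc=120, scale=20, size=200)

fig, (ax_bar, ax_hist) = plt.subplots(1, 2, figsize=(10, 4))
ax_bar.bar(genre_counts.index, genre_counts.values)
ax_bar.set_title("Bar plot: categories")
counts, edges, _ = ax_hist.hist(durations, bins=10)
ax_hist.set_title("Histogram: numeric ranges")
```

The bar plot has one bar per distinct label, while the histogram has one bar per bin, no matter how many distinct numbers there are.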

You can find this notebook here. Let’s use the %matplotlib inline magic command to display the plots inline in the notebook.

Let’s set the fivethirtyeight style as the graphic style.
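A minimal sketch of this setup step is shown here. The magic command only works inside Jupyter or IPython, so it appears as a comment to keep the snippet valid as plain Python.

```python
import matplotlib.pyplot as plt

# In a Jupyter notebook, run this magic in its own cell first:
# %matplotlib inline

# Apply the built-in "fivethirtyeight" style to all following plots
plt.style.use("fivethirtyeight")
```

You can list the other built-in styles with plt.style.available.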

How to plot a histogram?

A histogram helps you understand your data better. Let’s plot a simple histogram. First, let’s generate some sample data from the normal distribution.
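One way to generate such data is sketched below. The mean of 170, the standard deviation of 10, the sample size of 250, and the seed are illustrative choices, not values from the original notebook.

```python
import numpy as np

# Seeded generator so the example is reproducible
rng = np.random.default_rng(42)

# 250 draws from a normal distribution (mean 170, std 10 are assumptions)
data = rng.normal(loc=170, scale=10, size=250)

print(data[:5])
```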

Let’s use the hist method to see a histogram of this data.
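A minimal version of this step might look like the following. It assumes a sample like the one described above, regenerated here so the snippet runs on its own, and it uses the Axes method ax.hist; the pyplot function plt.hist takes the same arguments.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=170, scale=10, size=250)  # illustrative sample

fig, ax = plt.subplots()
# hist returns the bar heights, the bin edges, and the drawn rectangles
counts, edges, patches = ax.hist(data)
```

In a notebook the figure appears below the cell. In a plain script you would call plt.show() at the end.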

A simple histogram

Let’s give a title to this plot and name the axes.
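For example, with the same kind of illustrative sample, the title and axis names can be set like this. The label texts are placeholders.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=170, scale=10, size=250)  # illustrative sample

fig, ax = plt.subplots()
ax.hist(data)
ax.set_title("Histogram of the sample data")  # placeholder title
ax.set_xlabel("Value")                         # name of the x-axis
ax.set_ylabel("Frequency")                     # name of the y-axis
```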

Histograms with names of the axes

These rectangles show how many values fall into each range, that is, the frequencies. You can use the edgecolor parameter to draw the borders of the bars.
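A short sketch of the edgecolor parameter, again on an illustrative sample:

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=170, scale=10, size=250)  # illustrative sample

fig, ax = plt.subplots()
# edgecolor draws a border around every bar
counts, edges, patches = ax.hist(data, edgecolor="black")
```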

Histograms where we determine the border lines

You can set the number of bins with the bins parameter. Let’s plot a histogram with 10 bins.
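The sketch below passes an integer to bins on an illustrative sample. Matplotlib also uses 10 bins by default unless it is configured otherwise, so try a different value such as 30 to see the shape change.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=170, scale=10, size=250)  # illustrative sample

fig, ax = plt.subplots()
# An integer splits the data range into that many equal-width bins
counts, edges, patches = ax.hist(data, bins=10, edgecolor="black")
```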

Histograms that we divide into 10 parts

You can also pass a list of bin edges instead of a number. To show this, let’s create a variable named bins.

Let’s pass this variable to the hist method.
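These two steps could look like the sketch below. The edge values are assumptions chosen to fit the illustrative sample; values outside the first and last edge are simply not counted.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=170, scale=10, size=250)  # illustrative sample

# Bin edges: 6 bins, each 10 units wide (illustrative values)
bins = [140, 150, 160, 170, 180, 190, 200]

fig, ax = plt.subplots()
counts, edges, patches = ax.hist(data, bins=bins, edgecolor="black")
```

A list of n edges always produces n - 1 bins.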

Histograms with bins

You can also add a line to the histogram showing the median value of the data. First, let’s calculate the median value of the data.

Let’s add this median value to the plot with the axvline method.
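A possible version of these two steps, using an illustrative sample, is sketched here. The color, line style, and label are arbitrary choices.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=170, scale=10, size=250)  # illustrative sample

# Median of the sample
median_value = np.median(data)

fig, ax = plt.subplots()
ax.hist(data, edgecolor="black")
# Vertical line across the whole plot at the median
line = ax.axvline(median_value, color="red", linestyle="--", label="Median")
ax.legend()
```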

Histograms where we add lines

Comparing the histograms

You can compare the histograms of several variables. To show this, let’s generate three datasets from the normal distribution.

Let’s plot the histograms of these variables.
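One way to sketch this comparison is shown below. The means, standard deviations, and sample sizes are illustrative assumptions. The alpha parameter makes the bars semi-transparent so overlapping histograms stay visible.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
# Three illustrative samples with different centers and spreads
x1 = rng.normal(0, 1, 1000)
x2 = rng.normal(3, 1.5, 1000)
x3 = rng.normal(-3, 2, 1000)

fig, ax = plt.subplots()
for values, label in [(x1, "x1"), (x2, "x2"), (x3, "x3")]:
    # Same number of bins for each sample; alpha keeps overlaps readable
    ax.hist(values, bins=30, alpha=0.5, label=label)
ax.legend()
```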

Histograms of several variables

Two-dimensional histograms

So far we have created one-dimensional histograms. You can also create two-dimensional histograms, where the bins are rectangular cells and the color of each cell shows how many points fall into it. The hist2d method is used for this. First, let’s generate data for the x and y axes from the multivariate Gaussian distribution.

Let’s see the two-dimensional graph of this data with the hist2d method.
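The following sketch generates correlated x and y values and bins them on a grid. The mean, the covariance matrix, the sample size, the number of bins, and the colormap are all illustrative assumptions.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
mean = [0, 0]
cov = [[1, 1], [1, 2]]  # illustrative covariance matrix

# Draw 10,000 (x, y) pairs, then split them into two 1-D arrays
x, y = rng.multivariate_normal(mean, cov, size=10000).T

fig, ax = plt.subplots()
# h holds the count in each cell; image is the colored mesh
h, xedges, yedges, image = ax.hist2d(x, y, bins=30, cmap="Blues")
```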

Two-dimensional histograms

Let’s add a color bar to the plot and also name this bar.
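A color bar can be attached to the mesh that hist2d returns, as in this sketch. The data and the label text are illustrative.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
x, y = rng.multivariate_normal([0, 0], [[1, 1], [1, 2]], size=10000).T

fig, ax = plt.subplots()
h, xedges, yedges, image = ax.hist2d(x, y, bins=30, cmap="Blues")

# The color bar maps cell colors back to counts
cbar = fig.colorbar(image, ax=ax)
cbar.set_label("Counts in bin")
```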

Two-dimensional histograms with the colorbar

Application with a real dataset

To show how to use histograms in practice, I’m going to use the imdbrating dataset. This dataset includes information about movies such as the rating, genre, duration, etc. Let’s read the dataset with the read_csv function in pandas.

You can download the dataset here. Let’s see the first five rows of this dataset.
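The file itself is not reproduced here, so the sketch below reads a tiny stand-in CSV from memory. The column names (title, star_rating, genre, duration) and the rows are placeholders for illustration, not the real dataset. With the downloaded file, you would pass its path to read_csv instead.

```python
import io

import pandas as pd

# Placeholder rows standing in for the real file (not actual movie data)
csv_text = """title,star_rating,genre,duration
Movie A,8.1,Drama,142
Movie B,7.4,Comedy,98
Movie C,6.9,Action,125
Movie D,7.8,Drama,110
Movie E,8.3,Crime,175
Movie F,7.0,Comedy,101
"""

movies = pd.read_csv(io.StringIO(csv_text))

# head() returns the first five rows by default
print(movies.head())
```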

Let’s plot a histogram of the movie durations.

Histograms of the duration of the movies

Let’s make the bar edges black and the bars purple.
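The color and edgecolor parameters handle this, as sketched below. The duration column here is synthetic stand-in data (random values in minutes), not the real movie durations.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
# Synthetic stand-in for the duration column of the movie dataset
duration = pd.Series(rng.normal(120, 25, 500).round(), name="duration")

fig, ax = plt.subplots()
# color fills the bars, edgecolor draws their borders
counts, edges, patches = ax.hist(duration, color="purple", edgecolor="black")
```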

Purple histograms of the duration of the films

You can adjust the width of the bins. To show this, I’m going to create a variable named bins with the arange function in NumPy.

Let’s pass this variable to the hist method and label the bins.
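A sketch of this step follows. The range and step passed to arange are assumptions, the duration values are synthetic stand-in data, and set_xticks is one possible way to label the bin edges on the x-axis.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
# Synthetic stand-in for the duration column of the movie dataset
duration = pd.Series(rng.normal(120, 25, 500).round(), name="duration")

# Bin edges every 20 minutes: 60, 80, ..., 240 (the stop value is excluded)
bins = np.arange(60, 260, 20)

fig, ax = plt.subplots()
counts, edges, patches = ax.hist(
    duration, bins=bins, color="purple", edgecolor="black"
)
# Put a tick at every bin edge so each bar's range can be read off the axis
ax.set_xticks(bins)
```

A smaller step gives narrower bars and more detail, while a larger step gives a smoother but coarser picture.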

Histograms where we set bar widths

Let’s give the plot a title and name the axes.

Histograms with axis names and titles

Conclusion

A histogram is a plot that shows the frequency distribution of a numerical variable. In this post, I covered how histograms differ from bar plots, how to plot and customize them with Matplotlib, how to create two-dimensional histograms, and how to apply all of this to a real dataset. I hope you enjoyed it. Thank you for reading. You can find this notebook here.
