Data Science Tutorial
Hands-on Histograms with Matplotlib
A practical guide on how to plot cool histograms.
If you’re new to statistics, it is easy to confuse bar plots with histograms. The two look similar, but they serve different purposes. A histogram shows the distribution of numerical (quantitative) data, whereas a bar plot is used for categorical data. In short, a histogram lets you see the shape of a variable’s distribution at a glance: where the values cluster, how spread out they are, and whether there are outliers. In this post, I’ll cover the following topics:
- What is a histogram?
- A histogram vs a bar plot
- Plotting a histogram
- Two-dimensional histograms
- Application with a real dataset
Let’s dive in!
What is a histogram?
A histogram is a type of plot that visualizes data using bars. It is used to see the distribution of a variable. Each bar covers a range of values (a bin), and its height shows how many values fall into that range. Histograms are also helpful for spotting outliers in a dataset.
A histogram vs a bar plot
Histograms are often confused with bar charts. The two look alike, but a histogram groups numbers into ranges and shows the distribution of a single numerical variable, while a bar plot compares values across categories. To show this, let’s import Matplotlib, NumPy, and Pandas.
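Here is a minimal sketch of those imports, followed by a side-by-side comparison. The category names, counts, and the parameters of the normal sample are made up for illustration and are not from the original notebook.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up categorical data for the bar plot (illustrative only).
genre_counts = pd.Series({"Drama": 12, "Comedy": 8, "Action": 5})

# Made-up numerical data for the histogram (illustrative only).
rng = np.random.default_rng(0)
values = rng.normal(loc=50, scale=10, size=500)

fig, (ax_bar, ax_hist) = plt.subplots(1, 2, figsize=(10, 4))

# Bar plot: one bar per category, bar height is that category's value.
ax_bar.bar(genre_counts.index, genre_counts.values)
ax_bar.set_title("Bar plot (categorical)")

# Histogram: values are grouped into ranges, bar height is the count per range.
counts, bin_edges, _ = ax_hist.hist(values, bins=10)
ax_hist.set_title("Histogram (numerical)")
```

In the bar plot the x-axis holds labels with no inherent order, while in the histogram the x-axis is a continuous numeric scale cut into bins.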
You can find this notebook here. Let’s use the %matplotlib inline magic command so that the plots are displayed inline in the notebook.
Let’s set the fivethirtyeight style as the graphic style.
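Together with the inline magic from the previous step, the setup might look like the sketch below. The magic command is IPython syntax rather than plain Python, so I have left it as a comment here; uncomment it inside a Jupyter notebook.

```python
import matplotlib.pyplot as plt

# In a Jupyter notebook, this magic command renders figures inline.
# It is IPython syntax, not plain Python, so it is commented out here:
# %matplotlib inline

# Apply the built-in "fivethirtyeight" style to all subsequent plots.
plt.style.use("fivethirtyeight")
```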
How to plot a histogram?
A histogram helps you understand your data better. Let’s plot a simple one. First, let’s create some sample data from the normal distribution.
Let’s use the hist method to see a histogram of this data.
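A minimal sketch of these two steps follows. The seed, mean, standard deviation, and sample size are arbitrary choices of mine, not values from the original notebook.

```python
import matplotlib.pyplot as plt
import numpy as np

# Seeded generator so the example is reproducible; the seed, mean,
# standard deviation, and sample size are arbitrary assumptions.
rng = np.random.default_rng(42)
data = rng.normal(loc=0, scale=1, size=1000)

# plt.hist returns the bin counts, the bin edges, and the bar artists.
counts, bin_edges, patches = plt.hist(data)
```

Capturing the return values is optional, but it is handy when you want the numbers behind the picture.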
Let’s give a title to this plot and name the axes.
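For example, assuming the same kind of normal sample as before (the title and label texts are placeholders I picked):

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=0, scale=1, size=1000)  # arbitrary example data

plt.hist(data)
plt.title("Histogram of a normal sample")  # placeholder title
plt.xlabel("Value")                        # placeholder x-axis label
plt.ylabel("Frequency")                    # placeholder y-axis label
```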
These rectangles show how many values fall into each range, that is, the frequencies. You can use the edgecolor parameter to draw borders around the bars.
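A short sketch with black borders, again on an arbitrary normal sample:

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=0, scale=1, size=1000)  # arbitrary example data

# edgecolor draws an outline around every bar, which makes the
# individual bins easier to tell apart.
counts, bin_edges, patches = plt.hist(data, edgecolor="black")
```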
You can set the number of bins with the bins parameter. Let’s plot a histogram with 10 bins.
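A sketch of passing an integer to bins, using an arbitrary normal sample. Note that 10 also happens to be Matplotlib’s default, so try other values to see the effect.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=0, scale=1, size=1000)  # arbitrary example data

# An integer asks for that many equal-width bins between the
# smallest and largest value in the data.
counts, bin_edges, patches = plt.hist(data, bins=10, edgecolor="black")
```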
You can use a list for the boundaries of the bins. To show this, let’s create a variable named bins.
Let’s pass this variable to the hist method.
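As a sketch, with bin boundaries that I chose to suit a standard normal sample (the original values are not shown in the post):

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=0, scale=1, size=1000)  # arbitrary example data

# Explicit bin boundaries (assumed values). Each consecutive pair of
# numbers defines one bin, so 9 boundaries give 8 bins.
bins = [-4, -3, -2, -1, 0, 1, 2, 3, 4]

counts, bin_edges, patches = plt.hist(data, bins=bins, edgecolor="black")
```

Values that fall outside the first and last boundary are left out of the plot, so pick boundaries that cover the range you care about.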
You can also add a line to the histogram showing the median value of the data. First, let’s calculate the median value of the data.
Let’s add this median value to the plot with the axvline method.
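A minimal sketch of both steps; the line color and legend label are my own choices:

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=0, scale=1, size=1000)  # arbitrary example data

median = np.median(data)

plt.hist(data, edgecolor="black")
# axvline draws a vertical line across the full height of the axes
# at the given x position.
line = plt.axvline(median, color="red", label="Median")
plt.legend()
```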
Comparing the histograms
You can compare the histograms of several variables. To show this, let’s generate three samples from the normal distribution.
Let’s plot the histograms of these variables.
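One way to sketch this is shown below. The means and standard deviations are assumptions of mine, and using shared bin edges plus transparency is a design choice to keep the overlapping histograms comparable, not necessarily what the original notebook did.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(42)
# Three normal samples with assumed means and standard deviations.
x1 = rng.normal(loc=0, scale=1, size=1000)
x2 = rng.normal(loc=3, scale=1.5, size=1000)
x3 = rng.normal(loc=-3, scale=2, size=1000)

# Compute one set of bin edges over all samples so that every
# histogram uses bins of the same width and position.
edges = np.histogram_bin_edges(np.concatenate([x1, x2, x3]), bins=40)

all_counts = []
for sample, label in [(x1, "x1"), (x2, "x2"), (x3, "x3")]:
    # alpha makes the bars semi-transparent so overlaps stay visible.
    counts, _, _ = plt.hist(sample, bins=edges, alpha=0.5, label=label)
    all_counts.append(counts)

plt.legend()
```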
Two-dimensional histograms
So far we have created one-dimensional histograms. You can also create two-dimensional histograms, where the points are divided into rectangular bins along both axes and each bin is colored by the number of points it contains. The hist2d method is used for this. First, let’s generate data for the x and y axes from the multivariate Gaussian distribution.
Let’s see the two-dimensional graph of this data with the hist2d method.
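A sketch of generating the data and plotting it. The mean, covariance matrix, sample size, bin count, and colormap are all assumptions chosen for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(42)
mean = [0, 0]             # assumed mean of the 2D Gaussian
cov = [[1, 1], [1, 2]]    # assumed covariance matrix
x, y = rng.multivariate_normal(mean, cov, size=10000).T

# hist2d returns the 2D array of counts, the bin edges along x and y,
# and the image object that was drawn.
counts, xedges, yedges, image = plt.hist2d(x, y, bins=30, cmap="Blues")
```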
Let’s add a color bar to the plot and also name this bar.
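Continuing with the same assumed data, the color bar can be attached to the image returned by hist2d; the label text is a placeholder:

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(42)
# Assumed mean and covariance, for illustration only.
x, y = rng.multivariate_normal([0, 0], [[1, 1], [1, 2]], size=10000).T

counts, xedges, yedges, image = plt.hist2d(x, y, bins=30, cmap="Blues")

# The color bar maps each color back to the number of points in a bin.
cbar = plt.colorbar(image)
cbar.set_label("Counts in bin")  # placeholder label
```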
Application with a real dataset
To show how to use histograms in practice, I’m going to use the imdbrating dataset. This dataset includes information about movies such as the rating, genre, duration, etc. Let’s read the dataset with the read_csv function in Pandas.
You can download the dataset here. Let’s see the first five rows of this dataset.
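Since the file itself is not reproduced here, the sketch below reads a tiny placeholder CSV from a string so that it runs on its own. The column names and every row in it are invented stand-ins, not real entries from the dataset. With the real file, you would pass its path to read_csv instead.

```python
import io

import pandas as pd

# Placeholder CSV text standing in for the real file. The column names
# and all values are invented for illustration.
csv_text = """star_rating,title,genre,duration
8.1,Movie A,Drama,142
7.9,Movie B,Comedy,98
8.4,Movie C,Action,131
7.5,Movie D,Drama,116
8.0,Movie E,Crime,154
7.7,Movie F,Comedy,104
"""

# With the downloaded dataset this would be pd.read_csv("<path to file>").
movies = pd.read_csv(io.StringIO(csv_text))

# head() returns the first five rows by default.
first_rows = movies.head()
print(first_rows)
```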
Let’s plot a histogram of the movie durations.
Let’s make the bar lines black and the histograms purple.
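A sketch of the colored histogram. To keep it runnable without the file, I generate synthetic durations as a stand-in for the real column; the column name duration and the numbers are assumptions.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the real dataset: 500 invented durations in
# minutes. The "duration" column name is an assumption.
rng = np.random.default_rng(0)
movies = pd.DataFrame(
    {"duration": rng.normal(loc=120, scale=25, size=500).round().astype(int)}
)

# color sets the fill of the bars, edgecolor sets their outline.
counts, bin_edges, patches = plt.hist(
    movies["duration"], color="purple", edgecolor="black"
)
```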
You can adjust the width of the bins. To show this, I’m going to create a variable named bins with the arange function in NumPy.
Let’s pass this variable to the hist method and label the bins.
Let’s name the plot.
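Putting these last steps together might look like the sketch below. The durations are synthetic stand-ins for the real column, and the 20-minute bin width, title, and axis labels are my own choices.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the real dataset (invented durations, in minutes).
rng = np.random.default_rng(0)
movies = pd.DataFrame(
    {"duration": rng.normal(loc=120, scale=25, size=500).round().astype(int)}
)
durations = movies["duration"]

# Bin edges every 20 minutes, starting at or below the shortest movie
# and ending at or above the longest one.
bin_width = 20
start = durations.min() // bin_width * bin_width
bins = np.arange(start, durations.max() + bin_width, bin_width)

counts, bin_edges, patches = plt.hist(
    durations, bins=bins, color="purple", edgecolor="black"
)
plt.xticks(bins)  # put a tick label on every bin edge
plt.title("Distribution of movie durations")  # placeholder title
plt.xlabel("Duration (minutes)")
plt.ylabel("Number of movies")
```

Changing bin_width is the quickest way to see how the bin size affects the shape of the histogram.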
Conclusion
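A histogram is a plot that shows a frequency distribution; it is used to see how the values of a variable are distributed. In this post, I covered how to plot, customize, and compare histograms with Matplotlib, how to create two-dimensional histograms, and how to apply these steps to a real dataset. I hope you enjoyed it. Thank you for reading. You can find this notebook here.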