Create a grouped bar chart with Matplotlib and pandas

José Fernando Costa
Published in Analytics Vidhya
5 min read · Oct 22, 2020

As I was working on freeCodeCamp’s Data Analysis with Python certification, I came across a tricky Matplotlib visualization: a grouped bar chart. I’ve been making my way through the projects, but the guidance is minimal. This is good because it makes you put in the work to arrive at the desired solution, but it is awful if you don’t have much experience with Matplotlib, pandas and NumPy, or even if you’re just having difficulties with the current exercise.

So, I’m writing this article to share my solution on how to create the grouped bar chart from the “Page View Time Series Visualizer” project. I had a hard time understanding how to create this visualization in Matplotlib so I hope this article is enlightening for your data analysis projects.

The data

Since I’m sharing the solution for the certification’s exercise, the demo in this article will use the same data. The data is available in the sample repl.it environment set up by freeCodeCamp for the project.

This page views dataset contains only two columns: one with the date of recording, and another with the page views on that day.

Data preview after loading with pandas

Now that you know what data we’re working with, let’s move on to the data loading and pre-processing code.

Data loading and pre-processing

I will first show you all the code for loading and pre-processing the data, and then explain each step. You can find that code in the code gist below.

Data pre-processing code

The first few code lines are fairly straightforward pandas code: load a CSV file using the read_csv function, then change the data type of a column. In this case, we want the “date” data to be treated as datetime data. Afterwards, we sort the data by the date of page views recording and set that column as the DataFrame’s index. This will help with the transformations ahead.
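As a rough sketch of those loading steps (not the original gist), the snippet below reads a tiny made-up CSV from memory. The column names "date" and "page_views" are assumptions based on the pivot code later in the article; with a real file you would pass its path to read_csv instead.

```python
import io

import pandas as pd

# Made-up sample standing in for the project's CSV file.
# Assumption: the columns are called "date" and "page_views".
csv_text = """date,page_views
2020-01-03,120
2020-01-01,100
2020-01-02,110
"""

# Load the CSV (replace the StringIO object with a file path for real data).
df = pd.read_csv(io.StringIO(csv_text))

# Treat the "date" column as datetime data instead of plain strings.
df["date"] = pd.to_datetime(df["date"])

# Sort by recording date and use that column as the index.
df = df.sort_values("date").set_index("date")

print(df)
```

Having a DatetimeIndex is what later makes df.index.year and df.index.month available.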

On line 10, we filter the DataFrame to exclude rows in the top and bottom 2.5 percentiles of page views, to remove possible outliers (this is actually a step in the certification’s exercise).
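One way to express this filter is with the quantile method, sketched below on made-up data (page views from 1 to 100). Whether rows exactly on the boundary are kept is an assumption here; the exercise's own wording should decide that.

```python
import numpy as np
import pandas as pd

# Made-up data: 100 days with page views 1..100 (illustrative only).
df = pd.DataFrame(
    {"page_views": np.arange(1, 101)},
    index=pd.date_range("2020-01-01", periods=100, freq="D", name="date"),
)

# Cut-off values below/above which a row counts as a possible outlier.
lower = df["page_views"].quantile(0.025)
upper = df["page_views"].quantile(0.975)

# Keep only the rows between the two cut-offs (boundaries included here).
df_filtered = df[(df["page_views"] >= lower) & (df["page_views"] <= upper)]

print(len(df), "rows before,", len(df_filtered), "rows after")
```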

In the last block of code, we finish processing the data by creating columns for the year and month of the recordings. Because we changed the dates to the datetime type, we can extract their year and month by accessing the DataFrame’s index, and then the respective attributes: df.index.year and df.index.month.

Since the months come as integers (1 to 12), we also apply a transformation of mapping those integers to the correct month name, stored in the months list. We can use the months’ integer representation to retrieve the names from the list via index, adjusting for the 0-based indices of Python lists.

On the last line of this first code gist, we change the data type of the “month” column to Categorical, using the months list’s elements as the categories. This is useful because “month” now stores categories that keep the order of the months list. In other words, we can properly sort the months from January to December in the DataFrame. However, we won’t need to call a sorting function ourselves: the pivot table created later keeps the month columns in category order, and the bar chart follows that column order.
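The year and month columns and the categorical conversion could look roughly like this. The three rows are made up, and the exact way the original gist maps integers to names may differ; this is just one way to do it.

```python
import pandas as pd

months = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]

# Made-up rows, deliberately not in calendar order.
df = pd.DataFrame(
    {"page_views": [300, 100, 200]},
    index=pd.to_datetime(["2020-03-15", "2019-12-01", "2020-01-20"]),
)
df.index.name = "date"

# Year and month come straight from the DatetimeIndex.
df["year"] = df.index.year

# Month numbers run 1-12 but list indices run 0-11, hence the "- 1".
df["month"] = [months[m - 1] for m in df.index.month]

# Store the names as categories that follow the order of the months list.
df["month"] = pd.Categorical(df["month"], categories=months)

# Sorting now follows the calendar instead of the alphabet.
sorted_months = df.sort_values("month")["month"].tolist()
print(sorted_months)
```

Without the categorical conversion, the same sort would put December before January, because plain strings sort alphabetically.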

Data visualization

Now for the data visualization part: shaping the DataFrame into a useful format and plotting the chart.

Data visualization code

(Please note this second gist is still part of the previous script; I just split it in two for the explanations.)

The first thing we do is to transform the DataFrame into a pivot table DataFrame. In practice, the DataFrame changes from this

DataFrame before the pivot transformation

where “date” is the index and there are columns for the page views, year and month of the recording, into this pivot table:

Resulting pivot table dataframe

df_pivot = pd.pivot_table(
    df,
    values="page_views",
    index="year",
    columns="month",
    aggfunc=np.mean
)

Recalling the function that creates the pivot table, we have to specify:

  • The source DataFrame
  • The column whose values will be put in the cells
  • The column whose values will be used as the new index
  • The column whose values will be used as the new columns
  • The aggregation function to apply to the values in the data cells

In the end, as you can see in the screenshot above, we now have the years as the indices, a column for each month, and the average/mean page views per month and year in each cell. Please note that using an average aggregation function was another specification of the certification exercise. Any aggregation function could have been used.
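To see the reshaping on something small, here is a sketch with five made-up rows and only two months. It uses the string "mean" for aggfunc, which pandas accepts as an alternative to np.mean, and it sets observed explicitly rather than relying on the default; both choices are mine, not the article's.

```python
import pandas as pd

# Shortened category list for the example; the article uses all 12 months.
months = ["January", "February"]

# Made-up long-format data: one row per recording.
df = pd.DataFrame(
    {
        "page_views": [10, 20, 30, 50, 70],
        "year": [2019, 2019, 2019, 2020, 2020],
        "month": pd.Categorical(
            ["January", "January", "February", "January", "February"],
            categories=months,
        ),
    }
)

df_pivot = pd.pivot_table(
    df,
    values="page_views",  # numbers that end up in the cells
    index="year",         # one row per year
    columns="month",      # one column per month
    aggfunc="mean",       # average the page views of each year/month pair
    observed=False,       # keep every category as a column (set explicitly)
)

print(df_pivot)
```

The two January 2019 rows collapse into a single cell holding their mean, and the columns come out as January, February (category order) rather than alphabetically.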

The DataFrame is now ready for plotting.

On line 17 of the code gist we plot a bar chart for the DataFrame, which returns a Matplotlib Axes object. We use this object to obtain a Matplotlib Figure object that allows us to change the plot’s dimensions. We also change the axes labels afterwards.

At the end of the code gist, we export the plot as a PNG file, using the Figure object.
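The plotting steps can be sketched as follows, starting from a small hand-written pivot table. The figure size, axis labels and file name are placeholders I picked for the example, not values taken from the original gist.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is needed

import pandas as pd

# Hand-written stand-in for the pivot table: years as index, months as columns.
df_pivot = pd.DataFrame(
    {"January": [15.0, 50.0], "February": [30.0, 70.0]},
    index=pd.Index([2019, 2020], name="year"),
)

# Each index value becomes a group on the X-axis, each column a bar in it.
ax = df_pivot.plot(kind="bar")

# The Axes object gives access to the Figure, which controls the dimensions.
fig = ax.get_figure()
fig.set_size_inches(7, 6)  # placeholder size

# Placeholder axis labels.
ax.set_xlabel("Years")
ax.set_ylabel("Average Page Views")

# Export the plot as a PNG file (hypothetical file name).
fig.savefig("page_views_barplot.png", dpi=100)
```

The legend is built from the column names automatically, which is why the months appear there without any extra code.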

Resulting grouped bar plot

Conclusion

In summary, we created a bar chart of the average page views per year. But, since this is a grouped bar chart, each year is drilled down into its month-wise values.

It is true this solution is kind of magic, since we simply had to call the plot(kind="bar") method on the DataFrame. However, the trick was to pivot the DataFrame to have the X-axis data in the index and the grouping categories in the column headings. The Y-axis values are the values from the DataFrame’s cells. pandas and Matplotlib are smart enough to understand this, provided the data is in the required shape.

All in all, creating a grouped bar chart with Matplotlib is not easy. The code itself is tricky to work out, as you need to get the DataFrame into a specific shape, something that is not simple if you’re not used to manipulating data. Furthermore, there weren’t many resources or examples for this, and I found the solution through this Stack Overflow reply.

For comparison and curiosity, take a look at how to create a similar grouped bar chart in Plotly. The plotting function only requires two extra parameters to achieve this visualization and doesn’t require the extra pivoting step.

At any rate, I hope this solution is relevant for you and helps in future Matplotlib and pandas work!

Lastly, you can find all the code and resources on my GitHub repository. If you don’t want to visit GitHub, you can find below the complete script.

Complete script
