Stacked Bar Charts: what are they and how to make them

Stacked bar charts can be a great way to visualize data that has a categorical component and evolves over time. They can also be effective at revealing trends that are difficult to see by looking strictly at spreadsheets.

One example is looking at sales of various types of products year to year, where each type of product has a count or sum of sales. That said, stacked bar charts are most useful when there aren’t too many categories to visualize; otherwise the resulting plot becomes really noisy.
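As a quick illustration of that product-sales example, here is a minimal sketch using only pandas; the product names and yearly figures are made up for demonstration:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt

# Hypothetical yearly unit sales for three product types
sales = pd.DataFrame(
    {"Hardware": [120, 135, 150], "Software": [80, 95, 140], "Services": [40, 55, 70]},
    index=[2021, 2022, 2023],
)

# One bar per year, with each product type stacked inside it
ax = sales.plot(kind="bar", stacked=True)
ax.set_ylabel("Units sold")
plt.tight_layout()
```

Each row of the frame becomes one bar, and each column becomes one colored segment within that bar.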

Continuing from my two previous posts with pandas, I will be using the same dataset to demonstrate a simple way to make stacked bar charts. The context of this post will also be from working within a Jupyter notebook.

First off, I will import a few Python libraries: pandas for working with the data, NumPy for some calculations, and matplotlib for plotting.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

After importing these libraries, I will load in the dataset used in my previous posts and filter the columns to the ones that I want to visualize.

df = pd.read_csv('./clean_train.csv')
bar_chart_data = df[["Overall Qual","Overall Cond", "SalePrice","Year Built","Year Remod/Add", "Yr Sold"]]

A very simple way to make a stacked bar chart using only pandas is the following:

# Limit the data to the columns of interest
bar_chart_data = df[["Overall Qual", "Overall Cond", "SalePrice", "Year Built", "Year Remod/Add", "Yr Sold"]]
# Group total sales by year built and overall quality
temp = bar_chart_data.groupby(['Year Built', 'Overall Qual'])[['SalePrice']].sum()
# Reset the index
temp.reset_index(inplace=True)
# Pivot the data so we can visualize Overall Quality in a stacked manner
temp.pivot(index='Year Built',
           columns='Overall Qual',
           values='SalePrice').plot(kind='bar',
                                    figsize=(20, 10),
                                    stacked=True)

The result is the following:

While this is a very simple way to make a stacked bar chart, we might want to make some changes in order to render the plot a little clearer.

In the example above, we are visualizing the “Overall Quality” metric for houses built in Ames, Iowa between 1872 and 2010. With this dataset, I wanted to compare how the “Overall Condition” and “Overall Quality” metrics for the houses fared over time, taking total sales into account.

To do this, I will create lists of keys covering the possible “Overall Quality” and “Overall Condition” values for houses sold in Ames, Iowa between 1872 and 2010. These will be reflected in the legend portion of each chart.

qual_keys = [1,2,3,4,5,6,7,8,9,10]
cond_keys = [1,2,3,4,5,6,7,8,9]

Next I will set some colors for the charts. These will distinguish the different “Overall Quality” and “Overall Condition” values and will map to the keys set above:

# Legend Colors
colors = ['#a50026','#d73027','#f46d43','#fdae61','#fee08b','#d9ef8b','#a6d96a','#66bd63','#1a9850','#006837']

I chose these colors using ColorBrewer. I used a sequential-style scheme because the values for “Overall Quality” and “Overall Condition” are ordered; for other datasets, you might want to use a qualitative or diverging scheme instead.
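If you prefer not to hard-code hex values, matplotlib can sample a palette of any length from one of its built-in colormaps. This is a sketch, not the approach used above, and it assumes a reasonably recent matplotlib (the `matplotlib.colormaps` registry was added in 3.5):

```python
import matplotlib
import matplotlib.colors as mcolors

def sequential_colors(n, cmap_name="RdYlGn"):
    """Sample n evenly spaced hex colors from a named colormap."""
    cmap = matplotlib.colormaps[cmap_name]
    return [mcolors.to_hex(cmap(i / (n - 1))) for i in range(n)]

qual_colors = sequential_colors(10)  # one color per "Overall Qual" level
cond_colors = sequential_colors(9)   # one color per "Overall Cond" level
```

This way the palette automatically matches the number of keys in the legend.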

Moving on, I am going to create a function that prepares the charting data from the following inputs: one of the lists of keys above, whether to look at “Overall Condition” or “Overall Quality” data, and finally whether to aggregate by counts or by sale prices.

The following code gets a sorted list of unique years to set for the x-axis, creates a new data frame per key, and aggregates sales for houses with a given “Overall Condition” or “Overall Quality” value, either by the raw count or by the sale price of the house:

# Function for preparing the chart data
def map_data(keys, df, column_to_check, x_axis, increment_by=""):
    year_range = sorted(df[x_axis].unique())
    df_holder = {"data": [], "x_axis": year_range}
    for key in keys:
        mask = (df[column_to_check] == key)
        final_data_frame = pd.DataFrame(year_range, columns=[x_axis])
        if increment_by == "":
            counts = df[mask].loc[:, [x_axis, column_to_check]].groupby(x_axis).count()
            df_holder["data"].append(final_data_frame.join(counts, on=x_axis).fillna(0)[column_to_check].values)
        else:
            sums = df[mask].loc[:, [x_axis, increment_by]].groupby(x_axis).sum()
            df_holder["data"].append(final_data_frame.join(sums, on=x_axis).fillna(0)[increment_by].values)
    return df_holder

This differentiates between looking at raw sales counts and looking at sale prices, in each case grouped by a house’s given “Overall Condition” or “Overall Quality” metric.
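To make the count-versus-sum branches concrete, here is a toy walk-through of the same mask/groupby/join pattern on a small hypothetical frame (the rows and prices are made up):

```python
import pandas as pd

# Toy stand-in for the housing data: year built, quality rating, sale price
toy = pd.DataFrame({
    "Year Built": [2000, 2000, 2001, 2001, 2001],
    "Overall Qual": [5, 7, 5, 5, 7],
    "SalePrice": [100, 150, 110, 120, 160],
})

year_range = sorted(toy["Year Built"].unique())
base = pd.DataFrame(year_range, columns=["Year Built"])

# One key at a time, as in map_data
mask = toy["Overall Qual"] == 5

# Branch 1: count of sales per year for quality == 5
counts = toy[mask].groupby("Year Built")["Overall Qual"].count()
count_series = base.join(counts, on="Year Built").fillna(0)["Overall Qual"].values

# Branch 2: sum of SalePrice per year for quality == 5
sums = toy[mask].groupby("Year Built")["SalePrice"].sum()
sum_series = base.join(sums, on="Year Built").fillna(0)["SalePrice"].values
```

The join back onto the full year range (plus `fillna(0)`) is what guarantees every key produces an array of the same length, so the bars stack cleanly even for years with no matching sales.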

Next, I will create a function for charting the data. It will take one of the list of keys above, the dataframe of our data, the range of values for our x-axis, and how we want to aggregate our data in the visualization.

Normally we do not have to do this much work to plot a stacked bar chart; however, I am going to use the above functions in a for-loop in order to plot multiple stacked bar charts.

Here is the next function:

# Function for charting the data
def chart_data_set(keys, df, column_to_check, x_axis, increment_by=""):
    df = map_data(keys, df, column_to_check, x_axis, increment_by=increment_by)
    # Values of each group
    data_values = df["data"]
    names = df["x_axis"]
    plt.figure(figsize=(15, 8))
    for i, arr in enumerate(data_values):
        if i == 0:
            plt.bar(names, arr, color=colors[i])
        else:
            # Stack each key's bars on top of the cumulative total of the previous keys
            plt.bar(names, arr, bottom=np.sum(data_values[0:i], axis=0), color=colors[i])
    ax = plt.gca()
    ax.set_xlabel(x_axis, fontsize=20)
    ax.legend(keys, loc=(1.025, 0.5), title=column_to_check)
    if increment_by == "":
        ax.set_title(f"Distribution of {column_to_check} by {x_axis}", fontsize=30, pad=30)
        ax.set_ylabel(f"Count of {column_to_check}", fontsize=20)
    else:
        ax.set_title(f"Sum of {increment_by} by {column_to_check} and {x_axis}", fontsize=30, pad=30)
        ax.set_ylabel(f"Sum of {increment_by} for {column_to_check}", fontsize=20)
    plt.show()

Finally, I will create a for-loop to plot every permutation of how the “Overall Quality” and “Overall Condition” metrics interact with when a house was built, remodeled, and sold, aggregating the values either by raw count per year or by total sales.

Here’s the code:

for category, val in {"Overall Qual": qual_keys, "Overall Cond": cond_keys}.items():
    for x_axis in ["Year Built", "Year Remod/Add", "Yr Sold"]:
        chart_data_set(val, bar_chart_data, category, x_axis)
        chart_data_set(val, bar_chart_data, category, x_axis, increment_by="SalePrice")

Here are the resulting visualizations:

The main takeaways from these visualizations are that

  1. Newer houses tend to have a higher “Overall Quality” metric.
  2. Most houses sold in the last 20 years of the data have a rating of 5 for the “Overall Condition” metric.
  3. There are some statistical aberrations when looking at the data by the year a house was remodeled.
  4. Houses with a higher overall quality tend to sell for more.

While stacked bar charts are not always the best fit for visualization purposes, they can be powerful for visualizing categorical data over time.