Creating a dual axis Pareto chart in Altair

Simi Talkar
Published in Analytics Vidhya
4 min read · Mar 4, 2021

What’s a Pareto chart?

According to Wikipedia, a Pareto chart is a type of chart that contains both bars and a line graph, where individual values are represented in descending order by bars, and the cumulative total is represented by the line. The chart is named for the Pareto principle, which, in turn, derives its name from Vilfredo Pareto, a noted Italian economist.

The purpose of the Pareto chart is to highlight the most important among a (typically large) set of factors. In quality control, it often represents the most common sources of defects, the highest occurring type of defect, or the most frequent reasons for customer complaints, and so on.

In this article, we will analyze the percentage contribution of each type of negative review that patrons have left about a restaurant. The data consists of a set of action items, listed in the “Complaint Type” column, and the number of patrons who registered each complaint, listed in the “Count” column, as seen in the dataframe below.

# Imports used throughout this article
import pandas as pd
import altair as alt

# Capture the complaint type and count in a dictionary
data_dict = {
    "Complaint Type": ['Too Noisy', 'Overpriced', 'Food is tasteless', 'Food is not fresh', 'Food is too salty', 'Not clean', 'Unfriendly staff', 'Wait time', 'No atmosphere', 'Small portions'],
    "Count": [27, 789, 65, 9, 15, 30, 12, 109, 45, 621],
}
# Create a dataframe from the dictionary
df = pd.DataFrame(data_dict)
# The sort is key to calculating the cumulative sums,
# and also to displaying the complaints in descending order of importance
df = df.sort_values(by=['Count'], ascending=False)
# The cumulative count is calculated using pandas cumsum.
# Since the dataframe is sorted by the count, cumsum adds up the complaints from the largest to the smallest.
df["count cumsum"] = df["Count"].cumsum()
# Dividing the cumulative sum for each complaint by the total gets
# us to 100% at the end
df["cumpercentage"] = df["count cumsum"] / df["Count"].sum()
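To make the arithmetic concrete, here is a minimal sketch in plain Python (no pandas) that reproduces the running total and cumulative share for the same complaint counts. The helper name pareto_rows is our own, chosen for illustration; it is not part of pandas or Altair.

```python
from itertools import accumulate

# Complaint counts copied from the article's data_dict
counts = {
    "Too Noisy": 27, "Overpriced": 789, "Food is tasteless": 65,
    "Food is not fresh": 9, "Food is too salty": 15, "Not clean": 30,
    "Unfriendly staff": 12, "Wait time": 109, "No atmosphere": 45,
    "Small portions": 621,
}


def pareto_rows(counts):
    """Return (label, count, running total, cumulative share) rows, largest count first.

    Hypothetical helper for illustration; assumes the counts sum to a positive number.
    """
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    total = sum(counts.values())
    running = accumulate(count for _, count in ordered)
    return [(label, count, cum, cum / total)
            for (label, count), cum in zip(ordered, running)]


rows = pareto_rows(counts)
for label, count, cum, share in rows:
    # One line per complaint: label, count, running total, cumulative share
    print(f"{label:<18} {count:>4} {cum:>5} {share:>7.1%}")
```

The running totals should match the "count cumsum" column of the dataframe, and the last cumulative share is exactly 100% because the final running total equals the overall total of 1722.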

We start by creating a list from the already sorted “Complaint Type” column of the dataframe. This satisfies the condition of displaying the complaints in descending order, as required by the definition of a Pareto chart.

To visualize this data, we need two Y axes, one for the count of complaints and one for the cumulative percentage. The complaints themselves lie along the X axis, and we specify the list as the sort order of the shared X encoding so that the bars, line and labels all line up.

sort_order = df["Complaint Type"].tolist()
# The base element adds data (the dataframe) to the Chart
# The categories of complaints are positioned along the X axis
base = alt.Chart(df).encode(
    x=alt.X("Complaint Type:O", sort=sort_order),
).properties(
    width=500
)
# Create the bars with length encoded along the Y axis
bars = base.mark_bar(size=20).encode(
    y=alt.Y("Count:Q"),
).properties(
    width=500
)
# Create the line chart with the cumulative percentage encoded along the Y axis
line = base.mark_line(
    strokeWidth=1.5,
    color="#cb4154"
).encode(
    y=alt.Y('cumpercentage:Q',
            title='Cumulative Count',
            axis=alt.Axis(format=".0%")),
)
# Mark the percentage values on the line with Circle marks
points = base.mark_circle(
    strokeWidth=3,
    color="#cb4154"
).encode(
    y=alt.Y('cumpercentage:Q', axis=None),
)
# Mark the bar marks with the value text
bar_text = bars.mark_text(
    align='left',
    baseline='middle',
    dx=-10,  # the dx and dy can be manipulated to position text
    dy=-10,  # relative to the bar
).encode(
    y=alt.Y('Count:Q', axis=None),
    # we'll use the count as the text
    text=alt.Text('Count:Q'),
    color=alt.value("#000000")
)
# Mark the Circle marks with the value text
point_text = points.mark_text(
    align='left',
    baseline='middle',
    dx=-10,
    dy=-10,
).encode(
    y=alt.Y('cumpercentage:Q', axis=None),
    # we'll use the percentage as the text
    text=alt.Text('cumpercentage:Q', format="0.0%"),
    color=alt.value("#cb4154")
)
# Layer all the elements together
(bars + bar_text + line + points + point_text).resolve_scale(
    y='independent'
)

The key code to display the dual axis is the resolve_scale function used with “y” set to 'independent'. Each of the axes is scaled and formatted independently as a result. Here’s another reference example for dual axis.

(bars + bar_text + line + points + point_text).resolve_scale(
    y='independent'
)
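For readers curious about what this call changes, Altair compiles a chart into a Vega-Lite JSON specification, and as far as we understand it, resolve_scale(y='independent') shows up there as a "resolve" entry on the layered spec. The sketch below hand-builds a stripped-down spec of that shape as a plain Python dict. The dual_axis_spec helper and the two-row sample are our own illustration, not Altair API, and the spec Altair actually emits contains more detail.

```python
import json


def dual_axis_spec(rows):
    """Hand-written Vega-Lite layered spec with independent Y scales.

    Hypothetical helper for illustration; Altair generates a richer spec.
    """
    return {
        "data": {"values": rows},
        # Shared X encoding, sorted in the order the rows are given
        "encoding": {
            "x": {
                "field": "Complaint Type",
                "type": "ordinal",
                "sort": [row["Complaint Type"] for row in rows],
            }
        },
        "layer": [
            {"mark": "bar",
             "encoding": {"y": {"field": "Count", "type": "quantitative"}}},
            {"mark": "line",
             "encoding": {"y": {"field": "cumpercentage",
                                "type": "quantitative",
                                "axis": {"format": ".0%"}}}},
        ],
        # Each layer gets its own Y scale, hence two differently scaled axes
        "resolve": {"scale": {"y": "independent"}},
    }


# Two-row sample; cumpercentage here is relative to these two rows only
sample = [
    {"Complaint Type": "Overpriced", "Count": 789, "cumpercentage": 789 / 1410},
    {"Complaint Type": "Small portions", "Count": 621, "cumpercentage": 1.0},
]
spec = dual_axis_spec(sample)
print(json.dumps(spec["resolve"]))
```

Without the "resolve" entry, the layers would share one Y scale, and a percentage between 0 and 1 would be flattened against counts in the hundreds.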

And Finally

The Pareto principle, which applies to a wide variety of domains, states that for many outcomes, roughly 80% of consequences result from 20% of the causes (the “vital few”). It is also often referred to as the 80/20 rule, the law of the vital few, or the principle of factor sparsity.

From the line chart it is now easy to pick out the factors that contribute 80% of the discontent: “Overpriced” and “Small portions” together account for about 82% of all complaints. Fixing the menu pricing and portion sizes is therefore the most direct way to lure the customers back!
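The 80% cutoff that we read off the chart can also be computed directly. The following is a minimal sketch in plain Python; the vital_few helper and its threshold parameter are our own names, introduced only for illustration.

```python
# Complaint counts copied from the article's data_dict
counts = {
    "Too Noisy": 27, "Overpriced": 789, "Food is tasteless": 65,
    "Food is not fresh": 9, "Food is too salty": 15, "Not clean": 30,
    "Unfriendly staff": 12, "Wait time": 109, "No atmosphere": 45,
    "Small portions": 621,
}


def vital_few(counts, threshold=0.8):
    """Return the largest categories needed to reach `threshold`, plus their combined share.

    Hypothetical helper for illustration; assumes the counts sum to a positive number.
    """
    total = sum(counts.values())
    selected, running = [], 0
    # Walk the categories from the largest count to the smallest
    for label, count in sorted(counts.items(), key=lambda item: item[1], reverse=True):
        selected.append(label)
        running += count
        if running / total >= threshold:
            break
    return selected, running / total


labels, share = vital_few(counts)
print(labels, f"{share:.1%}")
```

For this dataset, two of the ten complaint types are enough to cross the 80% mark, which is the 80/20 pattern described above.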

Simi Talkar is a Certified DS Associate (DP-100, DA-100), pursuing a Masters in the University of Michigan’s Applied Data Science Program. https://www.linkedin.com/in/simi-talkar/