Stacked Bar Chart: Data Preparation and Visualization
A stacked bar chart requires your data to be in a specific format. Use R to prepare and visualize your data.
In this article, we are going to create a stacked bar chart. One challenge in doing so is to have our data formatted properly. But before transforming the data, we need to understand the structure of a stacked bar chart.
Anatomy of the stacked bar chart
Three elements are important in a stacked bar chart:
- Category: the categories are displayed on the x-axis. For example, the categories could be the most popular social media platforms.
- Height: the height of each bar represents the value displayed on the y-axis. In the same example, this could be the number of users of each platform.
Now this gives you a bar chart. Just add one element:
- Subcategory: each category contains subcategories stacked on top of each other. In our example, gender could be a subcategory, splitting each platform's user count by gender.
Now let’s create a stacked bar chart.
Load the data
Save the following data as “politics_approval_rates.csv”.
"Issue","Approve","Disapprove","No.Opinion"
"Race relations",52,38,10
"Education",49,40,11
"Terrorism",48,45,7
"Energy policy",47,42,11
"Foreign affairs",44,48,8
"Environment",43,51,6
"Situation in Iraq",41,53,6
"Taxes",41,54,5
"Healthcare policy",40,57,3
"Economy",38,59,3
"Situation in Afghanistan",36,57,7
"Federal budget deficit",31,64,5
"Immigration",29,62,9
Source: FlowingData | Nathan Yau
The data comes from a survey in which participants were asked whether they approved or disapproved of how a president handled each of 13 issues.
Read the CSV file with R
# read.csv() already returns a data frame, so no extra conversion is needed
df <- read.csv('politics_approval_rates.csv')
df
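Before reshaping anything, it can help to confirm that the file was parsed as expected. The sketch below inlines the first three rows of the CSV so that it runs without the file (with the real file, the `read.csv('politics_approval_rates.csv')` call above would take its place). It then checks that the three opinion columns of each issue add up to 100, which suggests the rates are percentages. The names `csv_text` and `row_totals` are introduced here for illustration only.

```r
# First three rows of the CSV, inlined so the example runs on its own
csv_text <- '"Issue","Approve","Disapprove","No.Opinion"
"Race relations",52,38,10
"Education",49,40,11
"Terrorism",48,45,7'

df <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Inspect the column names and types
str(df)

# The three opinions of each issue should sum to 100
row_totals <- rowSums(df[, 2:4])
stopifnot(all(row_totals == 100))
```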
Now, in order to make a stacked bar chart, we need to identify the 3 elements: the category, subcategory and height.
- Category: the issue (Race relations, Education, etc.) will be our main category.
- Height: the approval rate for each opinion.
- Subcategory: the opinion will be our subcategory. It can be either “approve”, “disapprove” or “no opinion”.
Now, look at the table. Each opinion has its own column, with one row per issue (a "wide" layout). A stacked bar chart needs the opposite: one row per combination of issue and opinion (a "long" layout). We need to reshape the table to get something like this:
Each issue has 3 opinions with their respective values.
Now let’s code.
Transform the data
The category — issues
First, we replicate each issue 3 times because we have 3 subcategories (see figure above).
issues <- c()
# loop through each of the 13 issues
for (issue in df$Issue) {
  # replicate the issue 3 times
  issues <- c(issues, rep(issue, 3))
}
# 13 * 3 = 39 issues
issues
[1] "Race relations" "Race relations" "Race relations"
[4] "Education" "Education" "Education"
[7] "Terrorism" "Terrorism" "Terrorism"
[10] "Energy policy" "Energy policy" "Energy policy"
[13] "Foreign affairs" "Foreign affairs" "Foreign affairs"
[16] "Environment" "Environment" "Environment"
[19] "Situation in Iraq" "Situation in Iraq" "Situation in Iraq"
[22] "Taxes" "Taxes" "Taxes"
[25] "Healthcare policy" "Healthcare policy" "Healthcare policy"
[28] "Economy" "Economy" "Economy"
[31] "Situation in Afghanistan" "Situation in Afghanistan" "Situation in Afghanistan"
[34] "Federal budget deficit" "Federal budget deficit" "Federal budget deficit"
[37] "Immigration" "Immigration" "Immigration"
Let’s understand the code above:
- the rep(vector, n) function replicates a vector n times.
- vector <- c(vector, value) means append “value” to the end of the vector.
So, the following line simply means: replicate “issue” 3 times, then append it to “issues”.
issues <- c(issues, rep(issue, 3))
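The same vector can also be built without a loop, since `rep()` accepts an `each` argument that repeats every element in place. The sketch below compares both approaches on three of the issues, which are hard-coded here so that the example is self-contained. The names `issue_names`, `issues_loop` and `issues_vec` are illustrative.

```r
issue_names <- c("Race relations", "Education", "Terrorism")

# Loop version, as in the article
issues_loop <- c()
for (issue in issue_names) {
  issues_loop <- c(issues_loop, rep(issue, 3))
}

# Vectorized version: repeat each element 3 times in place
issues_vec <- rep(issue_names, each = 3)

# Both approaches give the same character vector
identical(issues_loop, issues_vec)
```

With the real data, `rep(df$Issue, each = 3)` should give the same 39 values as the loop.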
The subcategory — opinions
Next, we create the subcategories. We have 3 opinions: approve, disapprove and no opinion.
opinions <- colnames(df[, 2:4])
opinions
[1] "Approve" "Disapprove" "No.Opinion"
We want to replicate the subcategories 13 times because there are 13 categories (issues).
Notice that the opinions are replicated in their entirety, whereas the issues were replicated individually.
opinions <- rep(opinions, 13)
opinions
[1] "Approve" "Disapprove" "No.Opinion" "Approve" "Disapprove" "No.Opinion" "Approve"
[8] "Disapprove" "No.Opinion" "Approve" "Disapprove" "No.Opinion" "Approve" "Disapprove"
[15] "No.Opinion" "Approve" "Disapprove" "No.Opinion" "Approve" "Disapprove" "No.Opinion"
[22] "Approve" "Disapprove" "No.Opinion" "Approve" "Disapprove" "No.Opinion" "Approve"
[29] "Disapprove" "No.Opinion" "Approve" "Disapprove" "No.Opinion" "Approve" "Disapprove"
[36] "No.Opinion" "Approve" "Disapprove" "No.Opinion"
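The difference between the two replication patterns is easiest to see on a small scale. In the sketch below, `times` repeats the whole vector, while `each` repeats every element in place. Only two repetitions are used to keep the output short, and the names `x`, `whole` and `in_place` are illustrative.

```r
x <- c("Approve", "Disapprove", "No.Opinion")

# Whole vector repeated twice (the pattern used for the opinions)
whole <- rep(x, times = 2)

# Every element repeated twice in place (the pattern used for the issues)
in_place <- rep(x, each = 2)

print(whole)
print(in_place)
```

With the real data, writing `rep(opinions, times = nrow(df))` would avoid hard-coding the number 13.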
The last thing to do is to extract the value for each opinion.
The height — approval rates
We proceed by getting the approval rates, transposing them, then flattening them to a 1D vector.
Get the approval rates
values <- df[2:4]
values
   Approve Disapprove No.Opinion
1       52         38         10
2       49         40         11
3       48         45          7
4       47         42         11
5       44         48          8
6       43         51          6
7       41         53          6
8       41         54          5
9       40         57          3
10      38         59          3
11      36         57          7
12      31         64          5
13      29         62          9
Transpose the values
values <- t(values)
values
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
Approve 52 49 48 47 44 43 41 41 40 38 36 31 29
Disapprove 38 40 45 42 48 51 53 54 57 59 57 64 62
No.Opinion 10 11 7 11 8 6 6 5 3 3 7 5 9
Convert them to a 1D vector
values <- as.vector(values)
values
[1] 52 38 10 49 40 11 48 45 7 47 42 11 44 48 8 43 51 6 41 53 6 41 54 5 40 57 3 38 59 3 36 57 7 31 64 5 29 62 9
as.vector() reads the transposed matrix column by column, so the three rates of each issue end up next to each other, in the same order as the issues vector.
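To see why the transpose matters, the sketch below flattens a small matrix with and without it. The matrix holds the first two rows of the approval rates, and the names `m`, `by_opinion` and `by_issue` are illustrative.

```r
# First two rows of the approval rates, as a 2 x 3 matrix
m <- matrix(c(52, 38, 10,
              49, 40, 11),
            nrow = 2, byrow = TRUE,
            dimnames = list(NULL, c("Approve", "Disapprove", "No.Opinion")))

# Without transposing, the values come out opinion by opinion
by_opinion <- as.vector(m)     # 52 49 38 40 10 11

# After transposing, the values come out issue by issue
by_issue <- as.vector(t(m))    # 52 38 10 49 40 11
```

Only the second ordering lines up with an issues vector in which each issue is repeated three times in a row.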
Load the data into a data frame
# create the data frame from the three vectors built above
final_df <- data.frame(issues, opinions, values)
# rename the third column
colnames(final_df)[3] <- "Approval Rates"
final_df
issues opinions Approval Rates
1 Race relations Approve 52
2 Race relations Disapprove 38
3 Race relations No.Opinion 10
4 Education Approve 49
5 Education Disapprove 40
6 Education No.Opinion 11
7 Terrorism Approve 48
8 Terrorism Disapprove 45
9 Terrorism No.Opinion 7
10 Energy policy Approve 47
11 Energy policy Disapprove 42
12 Energy policy No.Opinion 11
13 Foreign affairs Approve 44
14 Foreign affairs Disapprove 48
15 Foreign affairs No.Opinion 8
16 Environment Approve 43
17 Environment Disapprove 51
18 Environment No.Opinion 6
19 Situation in Iraq Approve 41
20 Situation in Iraq Disapprove 53
21 Situation in Iraq No.Opinion 6
22 Taxes Approve 41
23 Taxes Disapprove 54
24 Taxes No.Opinion 5
25 Healthcare policy Approve 40
26 Healthcare policy Disapprove 57
27 Healthcare policy No.Opinion 3
28 Economy Approve 38
29 Economy Disapprove 59
30 Economy No.Opinion 3
31 Situation in Afghanistan Approve 36
32 Situation in Afghanistan Disapprove 57
33 Situation in Afghanistan No.Opinion 7
34 Federal budget deficit Approve 31
35 Federal budget deficit Disapprove 64
36 Federal budget deficit No.Opinion 5
37 Immigration Approve 29
38 Immigration Disapprove 62
39 Immigration No.Opinion 9
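The three steps can also be condensed into a single `data.frame()` call. The sketch below does this on the first three rows of the data, inlined so that it runs on its own, and then checks that the rates of each issue still add up to 100. The names `n_opinions`, `long_df` and `totals` are illustrative, and `check.names = FALSE` is used so that the column can be called "Approval Rates" directly.

```r
df <- read.csv(text = '"Issue","Approve","Disapprove","No.Opinion"
"Race relations",52,38,10
"Education",49,40,11
"Terrorism",48,45,7', stringsAsFactors = FALSE)

n_opinions <- ncol(df) - 1   # every column except Issue

long_df <- data.frame(
  issues   = rep(df$Issue, each = n_opinions),
  opinions = rep(colnames(df)[-1], times = nrow(df)),
  `Approval Rates` = as.vector(t(as.matrix(df[, -1]))),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Sanity check: the three rates of every issue add up to 100
totals <- tapply(long_df$`Approval Rates`, long_df$issues, sum)
stopifnot(all(totals == 100))

long_df
```

Deriving the counts from `ncol(df)` and `nrow(df)` means the code should keep working if issues or opinions are added to the file.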
Beautiful! Now here comes the fun part, visualization! 🎉
Visualizing the data: the stacked bar chart
Explanation will follow.
# load the ggplot2 library
library(ggplot2)
ggplot(final_df, aes(fill=opinions, y=`Approval Rates`, x=issues)) +
geom_col(position="stack") +
# name the unit argument so that "pt" is not read as the right margin
theme(axis.text.x = element_text(angle = 45, margin = margin(t=30, unit="pt")))
ggplot()
First, we call the ggplot() function and pass it two arguments:
- The final data frame
- aes(): a function that maps columns of the data frame to visual properties of the plot (its aesthetics).
aes()
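We pass three arguments to the aes() function:
- x: the column displayed on the x-axis (the issues).
- y: the column that sets the height of the bars (the approval rates).
- fill: the column that sets the fill color of each segment (the opinions).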
geom_col()
- position: “stack” to stack the bars. Use “dodge” to group the bars (See figure below).
theme()
We use it to rotate the x labels to make the graph more readable:
- axis.text.x: rotates the x labels by 45 degrees and adds a top margin of 30 pt.
Tip: In RStudio, put “?” before a function and execute to display help. Ex: ?aes.
You can also place the cursor on a function and press F1.
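One thing to keep in mind: ggplot2 places a character column on a discrete axis in alphabetical order, so the issues will not appear in the order of the CSV file, which is sorted by approval rate. If that order matters, one option is to convert the column to a factor whose levels follow the order of first appearance. The sketch below uses a three-issue stand-in for the final data frame, with the illustrative names `demo_df` and `rate`, and only draws the chart if ggplot2 is installed.

```r
demo_df <- data.frame(
  issues   = rep(c("Race relations", "Education", "Terrorism"), each = 3),
  opinions = rep(c("Approve", "Disapprove", "No.Opinion"), times = 3),
  rate     = c(52, 38, 10, 49, 40, 11, 48, 45, 7),
  stringsAsFactors = FALSE
)

# Levels in order of first appearance instead of alphabetical order
demo_df$issues <- factor(demo_df$issues, levels = unique(demo_df$issues))
levels(demo_df$issues)

# Draw the chart only when ggplot2 is available
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(demo_df, aes(x = issues, y = rate, fill = opinions)) +
    geom_col(position = "stack")
  print(p)
}
```

The same `factor()` call applied to `final_df$issues` should keep the bars in the order of the original table.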
Grouped bar chart
ggplot(final_df, aes(x=issues, y=`Approval Rates`, fill=opinions)) +
geom_col(position="dodge") +
theme(axis.text.x = element_text(angle = 45, margin = margin(t=30, unit="pt")))
In this article, we created a stacked bar chart to understand people’s opinions on a president’s policies. The funny thing is that it’s always the same story: We want to visualize our data instantly but we can’t. Instead, we spend 90% of the time understanding and cleaning the data, but only 10% visualizing it. But hey, in the end, it’s worth it.