Advanced Bar Graphs in Stata (Part 3): Stacked Bar Graphs

Stacked bar graphs are a powerful way of visualizing discrete variables and relationships between them. This guide covers how to make them using Stata software. I provide readers with worked examples and detailed code for making a variety of stacked bar graphs plus numerous aesthetic improvements.

John V. Kane
The Stata Gallery

This third installment of guides on bar graphs in Stata covers how to make so-called “stacked” bar graphs. (Parts 1 and 2 covered how to make bar graphs of means with confidence intervals and bar graphs for discrete variables, respectively.)

Similar to the graphs discussed in Part 2, stacked bar graphs aim to convey variation in discrete (i.e., non-continuous) variables as well as relationships between discrete variables. The key difference is that, rather than having a separate bar for each category of a variable, categories are instead “stacked” on top of one another within the same bar.

How to choose between regular and stacked bar graphs?

Using a stacked bar graph (versus a traditional bar graph):

Upsides:

  1. More efficient use of plot space (one bar features multiple categories instead of just one)
  2. A bit fancier and more modern-looking
  3. A very visually intuitive way of seeing a relationship between discrete variables

Downsides:

  1. For a general audience, stacked bar graphs are perhaps a bit more challenging / unintuitive to interpret
  2. Comparing what are essentially bar heights becomes cumbersome, so there is a heavier reliance on labels

That said, stacked and traditional bar graphs convey very similar information and can, in most instances, be used interchangeably. Thus, the decision to use one over the other may be mostly a matter of taste.

Preliminaries

A couple quick things before we get started:

  1. Graphs will either use the “stcolor” scheme (new to Stata v.18) or schemes from “schemepack” by Asjad Naqvi. To install, simply execute:
ssc install schemepack, replace

You can then explore all your awesome new schemes by executing the following:

graph query, schemes

  2. Note: All graphs use “AbelPro-Regular” font, which can be downloaded here. For details on installing/using fonts that are not native to Stata, see here.
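Once a font is installed on your system, one way to make it the default for graphs shown in the Graph window is Stata's graph set command. This is only a sketch: it assumes the font is registered under the name "AbelPro-Regular", and the name your operating system reports may differ.

```stata
* Assumption: the font is installed and registered as "AbelPro-Regular"
graph set window fontface "AbelPro-Regular"

* To revert to Stata's default font later, run:
*     graph set window fontface default
```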

A Simple Stacked Bar Graph

As we did in the Part 2 guide, let’s begin by importing the lbw.dta data set (which features data on mothers and the birthweights of their newborn children):

* Import lbw
webuse lbw.dta, replace

Let’s again create a four-category variable to capture which age category each new mother is in. (Note: I cover data-cleaning commands in Stata, as well as the very useful -fre- package, here.)

* Create a four-category age variable
gen agecats=1 if inrange(age, 14, 19)
replace agecats=2 if inrange(age, 20, 23)
replace agecats=3 if inrange(age, 24, 26)
replace agecats=4 if inrange(age, 27, 45)

fre agecats // requires the user-written -fre- package (ssc install fre)

label define agecats 1 "14-19" 2 "20-23" 3 "24-26" 4 "27-45", replace
label values agecats agecats

tab agecats

The syntax for stacked bar graphs is very similar to the graphs discussed in Part 2, yet two crucial options must be specified together:

  1. asyvars : this option creates separate colors for each category of the first “over” variable
  2. stack : this option is what stacks categories on top of one another within the same bar

Knowing this, here is code to create a simple stacked bar graph:

* Stacked bar graph command
graph bar, /// basic bar graph command (displays vertically by default)
over(agecats) /// the groups *within* each bar (values seen in the legend)
stack asyvars // both options must be specified for a stacked bar graph

Note: if you want frequencies rather than percentages, simply add the (count) statistic right after bar (and before the comma) in the code above. (So, the first line of code would be: graph bar (count), … )
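For reference, the frequency version might look like the sketch below. To keep it runnable on its own, it reloads the data and rebuilds agecats in one line with recode, a compact stand-in for the gen/replace steps above rather than the code used elsewhere in this guide.

```stata
* Reload the data and rebuild the four age groups (compact alternative to gen/replace)
webuse lbw, clear
recode age (14/19=1 "14-19") (20/23=2 "20-23") (24/26=3 "24-26") (27/45=4 "27-45"), generate(agecats)

* Same stacked bar graph, but each segment shows a frequency rather than a percentage
graph bar (count), ///
    over(agecats) ///
    stack asyvars
```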

With Stata 18’s “stcolor” scheme as the default, here is the bare-bones graph we get:

bare-bones stacked bar graph

A few things to note. First, by default, the graph will be displayed vertically.

Second, we see where the “stacked” bar graph gets its name: the four age categories are stacked on top of one another, rather than being displayed as four separate bars. Indeed, if we simply remove the “stack” option from the code above, we get the following graph:

Same code but removing the “stack” option: it’s a regular bar graph.
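As a quick sketch, the non-stacked version is the same command with "stack" dropped. The block reloads the data and rebuilds agecats with recode (a compact stand-in for the gen/replace steps above) so that it can be run on its own.

```stata
* Reload the data and rebuild the four age groups (compact alternative to gen/replace)
webuse lbw, clear
recode age (14/19=1 "14-19") (20/23=2 "20-23") (24/26=3 "24-26") (27/45=4 "27-45"), generate(agecats)

* Without "stack", each age group gets its own bar; asyvars still colors them separately
graph bar, ///
    over(agecats) ///
    asyvars
```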

This traditional bar graph is identical to the stacked bar graph in terms of the information it contains. However, while the second graph makes it easy to quickly see the percent of the sample falling into each age category (simply look at the bar height relative to the y-axis), we’d have to do a bit of mental math to determine this in the stacked graph.

For example, in the stacked graph above, the bottom line of the “20–23” age category starts somewhere slightly below 30 on the y-axis and the top line of this category is slightly below 60, meaning that the percent of the sample in the “20–23” category is about 60 minus 30 = 30%.
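If you would rather not read these values off the axis, a one-way table gives them directly. In the sketch below, the "Percent" column of tabulate holds the height of each segment, and consecutive values in the "Cum." column should correspond to the lower and upper edges of each segment in the vertical stacked bar (assuming the first category is drawn at the bottom, as in the graph above). The recode line is a compact stand-in for the gen/replace steps used earlier.

```stata
* Reload the data and rebuild the four age groups (compact alternative to gen/replace)
webuse lbw, clear
recode age (14/19=1 "14-19") (20/23=2 "20-23") (24/26=3 "24-26") (27/45=4 "27-45"), generate(agecats)

* Percent = height of each segment; Cum. = where each segment ends on the y-axis
tabulate agecats

* The same percentage for the "20-23" group, computed directly
count if agecats == 2
display 100 * r(N) / _N
```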

As is probably becoming clear, the stacked bar graph does not easily allow one to determine bar heights (i.e., percentages within each category). For this reason, adding labels for the percentages is a must.

Some Key Options to Include

Here we’ll add labels for percentages, make the graph horizontal (simply specify graph hbar instead of graph bar), and use a different scheme:

graph hbar, ///
over(agecats) /// the groups *within* each bar
stack asyvars /// commands needed for stacking bars
blabel(bar, pos(center) format(%3.0f) size(medium) color(black)) /// add percentage labels
scheme(swift_red)

As we’ll use percentage labels throughout, here is a brief explanation about the “blabel(bar, )” options specified above:

  1. pos(center) : places percentage label in the center of each bar
  2. format(%3.0f) : displays each percentage label with a total width of 3 characters and 0 digits after the decimal point. In effect, this rounds each percentage to a whole number. Specifying “%3.1f” would round to one decimal place and “%3.2f” to two decimal places
  3. size(medium) : specifies the size of the percentage label
  4. color(black) : specifies that the labels be in black text
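To see what these display formats do outside of a graph, you can apply them to any number with display. The value 29.63 below is an arbitrary example, not a figure taken from the data.

```stata
* The same (arbitrary) number under the three formats discussed above
display %3.0f 29.63   // rounded to a whole number
display %3.1f 29.63   // rounded to one decimal place
display %3.2f 29.63   // rounded to two decimal places
```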

We get the following graph:

Our stacked bar graph is now horizontal and we can also read what percent of the sample falls into each age category.

But there remain many additional improvements that can be made.

Next, we’ll make the legend a bit nicer, make the percentage labels white (often easier to read than black), title the axis, and adjust the graph dimensions:

* Stacked bar graph with more options
graph hbar, /// basic horizontal bar graph command
over(agecats) /// the groups *within* each bar (values seen within the legend)
stack asyvars /// commands needed for stacking bars
blabel(bar, pos(center) format(%3.0f) size(medium) color(white)) /// add percentages
scheme(swift_red) /// specify scheme
legend(pos(3) col(1) title("Age Group of Mother", size(medsmall) box fcolor(lavender) color(white) lcolor(gs10)) size(medium)) ///
ytitle(Percent of Sample) /// title y-axis
graphregion(margin(small)) /// make region between plot and outer edge of graph size small
xsize(6.5) ysize(4.5) // make graph 6.5 inches wide and 4.5 inches tall

Since the legend is crucial in stacked bar graphs, here is a brief explanation of the options featured in the “legend( )” line of code above:

  1. pos(3) : place the legend in the 3 o’clock position (outside the plot, by default)
  2. col(1) : render the legend as one column
  3. title(“…”, size(medsmall) box fcolor(lavender) color(white) lcolor(gs10)) size(medium) : This gives the legend a title. The text of the title is medium-small and placed within a box that is filled with lavender color. The text of the title is white. The outline of the box is gs10 (a grayscale value; the scale runs from gs0, which is black, to gs16, which is white, so gs10 is a lighter gray). The size of the legend text itself is medium.

We get the following graph:

Stacked Bar Graphs For Relationships Between Two Variables

Let’s imagine we wanted to examine the relationship between mothers’ racial identification and their age when the child was born. The “race” variable has only three categories (Black, White, and Other). Let’s first look at the relationship as a cross-tab:

A cross-tab showing the relationship between mothers’ race and age category in the lbw.dta data set
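A command along the following lines should produce a comparable table. The options (column percentages, no raw frequencies, and a chi-squared test) mirror the ones used later in this guide, so treat them as an assumption about how the table above was made. The recode line is a compact stand-in for the gen/replace steps used earlier.

```stata
* Reload the data and rebuild the four age groups (compact alternative to gen/replace)
webuse lbw, clear
recode age (14/19=1 "14-19") (20/23=2 "20-23") (24/26=3 "24-26") (27/45=4 "27-45"), generate(agecats)

* Age group (rows) by race (columns): column percentages plus a chi-squared test
tabulate agecats race, column nofreq chi2
```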

While there isn’t a statistically significant relationship here, there is still some noteworthy variation that we might want to convey graphically (e.g., nearly a third of White mothers in this sample fall in the highest age category while less than 12% of Black mothers fall into this category).

How would we do it? Here are the key options to be aware of:

  1. Adding a second over( ) variable : this will determine the values that are arrayed along the axis rather than in the legend
  2. Specifying the percentage option: this ensures that each stacked bar adds up to 100%

Knowing this, here is some example code:

* Stacked bar graph for a relationship between two variables

graph hbar, /// basic horizontal bar graph command (can make vertical by removing "h")
over(agecats) /// the groups *within* each bar (values seen within bars)
over(race) /// the groups that appear on the axis, identifying each stacked bar
stack asyvars /// commands needed for stacking bars
percentage /// necessary for communicating percentages within each category of second over() variable
ylab(, glpattern(solid) glcolor(gs15)) /// add vertical solid light gray lines
blabel(bar, pos(center) format(%3.0f) size(medium) color(white)) /// add percentages
legend(pos(6) row(1) title("Age Group of Mother", size(medsmall) margin(vsmall) box fcolor(black) color(white) lcolor(gs10)) size(medium)) ///
ytitle(Percent of Sample) /// title y-axis
graphregion(margin(small)) /// make region between plot and outer edge of graph size small
xsize(6.5) ysize(4.5) // make graph 6.5 inches wide and 4.5 inches tall

We get the following graph:

This helps reveal one of the main strengths of a stacked graph: it is a very intuitive way of visualizing a cross-tab. We easily see, per the example above, that the percent of White mothers falling in the highest age group (32) is much larger than the percent of Black mothers (12).

Essentially, the less the colors align vertically (that is, from stacked bar to stacked bar), the stronger the relationship between the two variables. For example, we see above that the blue category is substantially wider for Black mothers than for mothers in the “Other” racial group. Thus, the blue categories do not align vertically: they are different widths.

Conversely, if each racial group had an identical-looking distribution of percentages (for example, if the blue category is the same width in all three racial groups, the red category is the same width in all three racial groups, etc.), it would indicate very little relationship between mothers’ racial identification and their age when they gave birth.
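To make the link between the graph and the cross-tab concrete, the sketch below rebuilds the within-race percentages by hand. The variable names race_total and pct are illustrative, and the recode line is a compact stand-in for the gen/replace steps used earlier. Note that contract replaces the data in memory, so reload the data before drawing further graphs.

```stata
* Reload the data and rebuild the four age groups (compact alternative to gen/replace)
webuse lbw, clear
recode age (14/19=1 "14-19") (20/23=2 "20-23") (24/26=3 "24-26") (27/45=4 "27-45"), generate(agecats)

* Collapse to one row per race-by-age-group cell; the cell count is stored in _freq
contract race agecats

* Total number of mothers in each racial group
bysort race: egen race_total = total(_freq)

* Share of each age group within its racial group; each group's shares sum to 100
generate pct = 100 * _freq / race_total

list race agecats _freq pct, sepby(race) noobs
```

These shares should match the labels in the stacked bars (up to rounding), since each bar is scaled to add up to 100%.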

Here’s another example of the same relationship but with some alternative options — e.g., a graph title, a different scheme, different color text for the percentages, and a different “fill color” for the plot.

In addition, the code below specifies the gap( ) option within the second over( ) option. The larger the number in this gap( ) option, the more space between the bars and the thinner the bars will be. This can be helpful for improving aesthetics in some cases.

graph hbar, /// basic horizontal bar graph command (can make vertical by removing "h")
over(agecats) /// the groups *within* each bar (values seen within bars)
over(race, gap(100)) /// note use of "gap()" to adjust distances / widths of stacked bars
stack asyvars ///
percentage ///
ytitle(Percent) ///
scheme(white_viridis) /// set scheme
plotregion(fcolor(gs15)) /// fill plot region with gs15 color
blabel(bar, pos(center) format(%3.0f) size(medium) color(gs4)) /// add percentages
ylab(, glpattern(solid) glcolor(gs12) glwidth(vthin)) ///
legend(pos(6) row(1) title("Age Group of Mother", size(medsmall) box fcolor(black) color(white) lcolor(gs10)) size(medium)) ///
title("Relationship Between Mothers' Race & Age Group", box fcolor(white) color(black) span) ///
graphregion(margin(vsmall)) ///
xsize(6.5) ysize(4.5) // graph dimensions

We get the following:

Here’s another example for users who may want/need the graph to appear in monochrome (that is, no color). The code below uses the “s2mono” scheme to accomplish this, but some other good options are “s1mono” and (in Stata 18) “stmono2”:

* Stacked bar graph in monochrome colors

graph hbar, /// basic horizontal bar graph command (can make vertical by removing "h")
over(agecats) /// the groups *within* each bar (values seen within bars)
over(race, gap(100)) /// note use of "gap()" to adjust distances / widths of stacked bars
stack asyvars /// commands needed for stacking bars
percentage ///
ytitle(Percent) ///
scheme(s2mono) ///
blabel(bar, pos(center) format(%3.0f) size(medium) color(white)) /// add percentages
ylab(, glpattern(solid) glcolor(gs12) glwidth(vthin)) ///
legend(pos(6) row(1) title("Age Group of Mother", size(medsmall) margin(vsmall) box fcolor(gs4) color(white) lcolor(gs10)) size(medium)) ///
title("Relationship Between Mothers' Race & Age Group", box fcolor(black) color(white) span) ///
graphregion(margin(vsmall)) ///
xsize(6.5) ysize(4.5) // graph dimensions

We get the following graph:

Stacked bar graph in monochrome

Customizing Bar Colors

So far we’ve been allowing the scheme to determine the color of each bar (that is, the colors of the categories appearing within the legend).

But, like with regular bar graphs, Stata allows us to have full control over each bar’s coloring and outline. As an illustration, see the bottom four lines of the following code:

* Stacked bar graphs with customized colors
graph hbar, /// basic horizontal bar graph command (can make vertical by removing "h")
over(agecats) /// the groups *within* each bar (values seen within bars)
over(race) ///
stack asyvars /// commands needed for stacking bars
percentage ///
ytitle(Percent) ///
scheme(white_jet) ///
blabel(bar, pos(center) format(%3.0f) size(medium) color(white)) /// add percentages
ylab(, glpattern(solid) glcolor(gs15) glwidth(vthin)) ///
legend(pos(6) row(1) title("Age Group of Mother", size(medsmall) box fcolor(gs6) color(white) lcolor(gs10)) size(medium)) ///
title("Relationship Between Mothers' Race & Age Group", box fcolor(black) color(white) span bexpand) ///
graphregion(margin(vsmall)) ///
xsize(6.5) ysize(4.5) /// graph dimensions
bar(1, fintensity(100) fcolor(red%90) lcolor(black) lwidth(vthin)) ///
bar(2, fintensity(100) fcolor(stc4%90) lcolor(black) lwidth(vthin)) ///
bar(3, fintensity(100) fcolor(ebblue%90) lcolor(black) lwidth(vthin)) ///
bar(4, fintensity(100) fcolor(stc15%90) lcolor(black) lwidth(vthin)) //

Here’s a brief explanation of the bar(#) option as well as the sub-options within it:

  1. bar(#) : Here you enter the number of the category you want to customize. For example, entering “1” will customize the first category of your main over( ) variable; entering “3” will customize the third category of your main over( ) variable; etc. Note: because “agecats” has four categories, we are customizing four bars (if your variable has fewer/more categories, you’d customize fewer/more bars)
  2. fintensity( ) : this adjusts the fill intensity. The default is 80, but I find that increasing it to 100 makes the colors a bit more vibrant. Beyond 100 begins to make the color substantially darker
  3. fcolor(%#) : this determines the fill color. Adding “%#” adjusts the fill opacity (the default is 100, which means the bar has 0 transparency). For example, specifying fcolor(red%90) will make the bar red at 90% opacity
  4. lcolor( ) : this determines the color of the line that outlines the bar
  5. lwidth( ) : this determines the width of the line that outlines the bar

Having customized each of the four categories of “agecats”, we get the following graph:

stacked bar graph with customized colors for the age group variable

Stacked Bar Graphs With Three Variables

Let’s say we wanted to make separate cross-tabs (like the one above) for each value of some third variable. For example, let’s look at the relationship between “race” and “agecats” for each value of “smoke” (the “smoke” variable =0 for mothers who did not smoke during their pregnancy and =1 for mothers who did smoke during their pregnancy).

We can obtain cross-tabs quickly with the following code:

* Obtain cross-tabs for agecats and race for each value of smoke
bysort smoke: tab agecats race, column nofreq chi2

Here is the output:

Notably, there is a significant relationship (p=.002) between mothers’ race and age category among non-smokers (see top cross-tab).

How might we graph these two cross-tabs as stacked bar graphs?

We could of course just make two separate stacked graphs (one “if smoke==0” and one “if smoke==1”), but there is an easier (and fancier!) way: we can specify a third over( ) variable (“smoke”).
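For comparison, the two-separate-graphs route might look like the sketch below. It is stripped of all styling; the graph names nonsmokers and smokers are illustrative, and the recode line is a compact stand-in for the gen/replace steps used earlier.

```stata
* Reload the data and rebuild the four age groups (compact alternative to gen/replace)
webuse lbw, clear
recode age (14/19=1 "14-19") (20/23=2 "20-23") (24/26=3 "24-26") (27/45=4 "27-45"), generate(agecats)

* Graph 1: mothers who did not smoke during pregnancy
graph hbar if smoke == 0, ///
    over(agecats) over(race) ///
    stack asyvars percentage ///
    title("Non-smokers") name(nonsmokers, replace)

* Graph 2: mothers who smoked during pregnancy
graph hbar if smoke == 1, ///
    over(agecats) over(race) ///
    stack asyvars percentage ///
    title("Smokers") name(smokers, replace)
```

This yields two graph windows with two identical legends, which is part of why a single command with a third over( ) variable is more convenient.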

Here is some example code. Notice in the code that some options have been added to adjust the color and angle of the labels of the third over( ) variable (“smoke”). Also, the bar colors of the first over( ) variable have again been customized:

* Stacked bar graph with three variables

graph hbar, /// basic horizontal bar graph command
over(agecats) /// the groups *within* each bar (values seen within bars)
over(race) /// the second over() variable: the values of each stacked bar
over(smoke, label(labcolor(white) angle(vertical))) /// third over() variable
stack asyvars /// commands needed for stacking bars
percentage /// necessary for communicating percentages within each category
ytitle(Percent) ///
scheme(black_jet) ///
blabel(bar, pos(center) format(%3.0f) size(medium) color(white)) /// add percentages
ylab(, glpattern(solid) glcolor(gs3) glwidth(vthin)) ///
legend(pos(6) row(1) title("Age Group of Mother", size(medsmall) box fcolor(gs3) color(white) lcolor(gs10)) size(medium)) ///
title("Relationship Between Mothers' Race & Age Group", box fcolor(black) color(white) span bexpand) ///
subtitle("Disaggregated by Smoking Status", box fcolor(black) color(white) span bexpand) ///
graphregion(margin(vsmall)) ///
xsize(6.5) ysize(4.5) /// graph dimensions
bar(1, fintensity(100) fcolor(blue%95) lcolor(white) lwidth(thin)) ///
bar(2, fintensity(100) fcolor(midblue%95) lcolor(white) lwidth(thin)) ///
bar(3, fintensity(110) fcolor(cyan%95) lcolor(white) lwidth(thin)) /// make intensity a little darker
bar(4, fintensity(100) fcolor(stgreen%95) lcolor(white) lwidth(thin)) //

We get the following graph 😁:

Stacked bar graph with three variables

Again, notice how nicely this visualizes what we saw in the two cross-tabs above: the top three stacked bars are the first cross-tab (non-smoker group), and the bottom three stacked bars are the second cross-tab (smoker group). The widths of each color clearly depend a great deal on which racial group we are looking at — this is an indication that the two variables are substantially related.

Combining Separate Stacked Bar Graphs

In the example above, the stacked bar graphs were separated according to each value of “smoke”.

However, what if we wanted to combine different stacked bar graphs into one single graph?

This can be a helpful thing to do so long as both graphs use the same (first) over( ) variable — that is, so long as both graphs will use the same legend. (If they would need different legends, then it is best to just have two separate graphs.)

To accomplish this final task, we’ll again use the terrific user-written package -grc1leg-, which was introduced in Part 1 and can be installed with the following code:

* Install -grc1leg-
net install grc1leg, from("http://www.stata.com/users/vwiggins") replace

Our first graph will look at the relationship between our age group variable and having a low-birthweight baby (the “low” variable, coded 0=no and 1=yes), while the second graph will look at the relationship between smoking status and having a low-birthweight baby.

The two graphs thus both use the “low” variable as the first over( ) variable.

Following the same steps covered in Part 1, we’re first going to include in our code the saving( ) option to save each graph to our working directory. (The working directory is where Stata saves files to and pulls files from. To set your working directory, go to File > Change working directory… > choose location.)
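The working directory can also be checked and changed from the command line. In the sketch below, the folder path is a hypothetical placeholder, so the cd line is left commented out; substitute your own folder before running it.

```stata
* Show the current working directory
pwd

* Change the working directory (hypothetical path; replace with your own folder)
* cd "C:/Users/yourname/Documents/stata_graphs"
```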

So that the graphs have useful labels, we’ll also define and apply labels for “low”.

Here is the example code:

* Define and apply labels to "low" variable
label define low 0 "Normal Weight" 1 "Low Weight"
label values low low

* First graph
graph hbar, /// basic horizontal bar graph command (can make vertical by removing "h")
over(low) /// the groups *within* each bar (values seen within bars)
over(agecats, label(labcolor(white) angle(vertical))) ///
stack asyvars /// commands needed for stacking bars
percentage ///
ytitle(" ") ///
scheme(black_jet) ///
blabel(bar, pos(center) format(%3.0f) size(medium) color(white)) /// add percentages
ylab(, nolabel notick glpattern(solid) glcolor(gs3) glwidth(vthin)) ///
legend(pos(6) row(1) title("Weight of Newborn Child", size(medsmall) box fcolor(gs3) color(white) lcolor(gs10)) size(medium)) ///
graphregion(margin(vsmall)) ///
l1title(Age Group of Mother, color(gs2) orientation(vertical) box fcolor(gs12) lcolor(black) bexpand) ///
xsize(6.5) ysize(4.5) /// graph dimensions
bar(1, fintensity(100) fcolor(midblue%90) lcolor(cyan) lwidth(thin)) ///
bar(2, fintensity(100) fcolor(lime%90) lcolor(cyan) lwidth(thin)) ///
saving(g1, replace) // this saves the graph to our working directory as "g1"

* Second graph
* Note: use gap suboption to make some bars thinner
graph hbar, /// basic horizontal bar graph command
over(low) /// the groups *within* each bar (values seen within bars)
over(smoke, gap(150) label(labcolor(white) angle(vertical))) ///
stack asyvars /// commands needed for stacking bars
percentage ///
ytitle(Percent) ///
scheme(black_jet) ///
blabel(bar, pos(center) format(%3.0f) size(medium) color(white)) /// add percentages
ylab(, glpattern(solid) glcolor(gs3) glwidth(vthin)) ///
legend(pos(6) row(1) title("Weight of Newborn Child", size(medsmall) box fcolor(gs3) color(white) lcolor(gs10)) size(medium)) ///
graphregion(margin(vsmall)) ///
l1title(Smoking Status of Mother, color(gs2) orientation(vertical) box fcolor(gs12) lcolor(black)) ///
xsize(6.5) ysize(4.5) /// graph dimensions
bar(1, fintensity(100) fcolor(midblue%90) lcolor(cyan) lwidth(thin)) ///
bar(2, fintensity(100) fcolor(lime%90) lcolor(cyan) lwidth(thin)) ///
saving(g2, replace) // this saves the graph to our working directory as "g2"

Now that we’ve created the two graphs, we can combine them using the grc1leg command. We’re also specifying the following key options:

  1. legendfrom( ) : specifies to use the legend from “g2” (in this case it doesn’t matter whether we choose g1 or g2)
  2. col( ) : specifies how many columns we want the graphs to be arrayed in (here we are specifying that we want the two graphs arrayed in one column)
  3. iscale(#) : values below 1 shrink the text so as to prevent crowding/overlapping

Here is the code:

* Combining g1 and g2 with -grc1leg-

grc1leg "g1.gph" "g2.gph", ///
legendfrom("g2.gph") ///
col(1) ///
iscale(.7) ///
scheme(black_jet) ///
graphregion(margin(vsmall)) ///
title("% Low Birthweight by Mothers' Age Group and Smoking Status", box fcolor(black) color(white) span) ///
subtitle("Example Using -grc1leg- Package", box fcolor(black) color(white) span bexpand) ///
xsize(6.5) ysize(5.5)

Last graph! 🙌

Two separate stacked bar graphs, combined via -grc1leg-

Notice that the top graph shows the relationship between “agecats” and “low”, while the bottom graph shows the relationship between “smoke” and “low”.

In other words, the graph shows two distinct cross-tabs, each of which features one variable that is common to both graphs (“low”) but another variable that is unique to each graph. And, through the magic of -grc1leg-, we have just one legend!

I sincerely hope this series on bar graphs in Stata has been helpful for you! Good luck and enjoy!

About the Author

John V. Kane is Clinical Associate Professor at the Center for Global Affairs and an Affiliated Faculty member of NYU’s Department of Politics. He received his Ph.D. in political science and his primary research interests include public opinion, political psychology, and experimental research methodology. His research has been published in a variety of top-ranking peer-reviewed journals, including the American Political Science Review, American Journal of Political Science, the Journal of Politics, and the Journal of Experimental Political Science. His research has been featured in numerous media outlets, including The New York Times, The Washington Post, and National Public Radio. He has taught graduate courses on political psychology, research methods, statistics and data analysis, and has also received teaching excellence awards from both New York University and Stony Brook University. His website is www.johnvkane.com. Follow him on X/Twitter, ResearchGate, LinkedIn, and/or BlueSky Social.

John V. Kane is an Associate Professor at NYU's Center for Global Affairs. He researches political attitudes & experimental methods. Twitter: @UptonOrwell