Top 25 Stata Visualizations — With Full Code

Fahad Mirza
Published in The Stata Gallery
25 min read · Nov 19, 2022

Having trouble deciding which visualization to use for the problem at hand? This fully replicable code tutorial will help you choose the right type of chart for your objective and show you how to code it in Stata.

Effective visualizations tend to share the following traits:

  1. They convey information without distorting the real picture
  2. They are simple, elegant, and require minimal effort to understand
  3. They are aesthetically pleasing, but not only that
  4. They are clutter-free and focused on the subject

There are many reasons to create a visualization. Below, you will find 4 of them. Chances are that the visualization you require will fall under one (or more) of these categories.

Many websites showcase top graphs made with R and Python, but hardly any focus on Stata. This blog is not about starting a competition but about showing people how the same visuals can be produced in Stata.

Inspiration for this post is taken from: Top 50 ggplot2 Visualizations — The Master List (With Full R Code)

This series will expand or will be developed into a sequel! So stay tuned for graphs on more categories!

0. Package Installations

1. Correlation

2. Deviation

3. Ranking

4. Distribution

More Categories to come soon! Stay Tuned!

Categories include Maps, Groups, Change, Composition, etc.

0. Package Installations

Before we start, we need to install the following packages in order to replicate the visuals shown for each section and category:

* Before we start off, let's install the following packages
ssc install schemepack, replace
ssc install colrspace, replace
ssc install palettes, replace
ssc install labutil, replace

* Correlation Coef with CI (Nick Cox - corrci) (type in command window: search pr0041_4)
net describe pr0041_4, from(http://www.stata-journal.com/software/sj21-3)
net install pr0041_4

ssc install violinplot, replace
ssc install dstat, replace
ssc install moremata, replace

* Credits:
* 1. Nick Cox
* 2. Ben Jann
* 3. Asjad Naqvi

Credit for these packages goes to:

  1. Nick Cox
  2. Ben Jann
  3. Asjad Naqvi

[Back to Top]

1. Correlation

The following visualizations allow us to examine the strength of correlation between 2 variables.

Scatterplot

This is probably the most used plot in data analysis. It is extremely useful when we want to understand the nature of the relationship between two variables.

  sysuse auto, clear
twoway (scatter price mpg, mcolor(%60) mlwidth(0)) (lowess price mpg), ///
title("{bf}Scatterplot", pos(11) size(2.75)) ///
subtitle("Price Vs. MPG", pos(11) size(2.5)) ///
legend(off) ///
scheme(white_tableau)

Stata allows multiple plots to be drawn as layers on top of each other (think of this as a visualization cake). Here we have a scatterplot overlaid with a Lowess curve.

[Back to Top]

Scatterplot by Group

Sometimes we require scatterplots to show us data by groups. This can be done in the same layering style that was seen above.

In the code below, a local macro accumulates one scatter command for each group, and twoway then draws all of them together.

  sysuse auto, clear

levelsof foreign, local(foreign)
foreach category of local foreign {
local scatter `scatter' scatter price mpg if foreign == `category', ///
mcolor(%60) mlwidth(0) ||
}

twoway `scatter' (lowess price mpg), ///
title("{bf}Scatterplot", pos(11) size(2.75)) ///
subtitle("Price Vs. MPG", pos(11) size(2.5)) ///
legend(order(1 "Domestic" 2 "Foreign") size(2)) ///
scheme(white_tableau)

The plot differentiates between Domestic and Foreign vehicles all while generating a lowess plot for the overall data. You can also modify this code to generate lowess for each vehicle category.
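
The per-category Lowess variant mentioned above could look roughly like the sketch below. It assumes the same auto dataset and the white_tableau scheme from schemepack; the local names smooth, levels and i are illustrative and not part of the original code. pstyle() is used so that each group's markers and curve share a color.

```stata
sysuse auto, clear

* One scatter layer and one lowess layer per level of foreign.
* pstyle(p#) ties both layers of a group to the same scheme color.
local scatter
local smooth
local i = 1
levelsof foreign, local(levels)
foreach category of local levels {
    local scatter `scatter' (scatter price mpg if foreign == `category', pstyle(p`i') mcolor(%60) mlwidth(0))
    local smooth `smooth' (lowess price mpg if foreign == `category', pstyle(p`i'))
    local ++i
}

* Scatter layers come first, so legend keys 1 and 2 are the two groups
twoway `scatter' `smooth', ///
    title("{bf}Scatterplot", pos(11) size(2.75)) ///
    subtitle("Price Vs. MPG", pos(11) size(2.5)) ///
    legend(order(1 "Domestic" 2 "Foreign") size(2)) ///
    scheme(white_tableau)
```

Keeping the scatter and Lowess layers in separate locals means the legend order stays the same as in the single-curve version.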

[Back to Top]

Jitter Plot

Using a dataset hosted on GitHub, we will now look at a scatterplot overlaid with a line of best fit (a linear fit, not Lowess).

  import delimited "https://raw.githubusercontent.com/tidyverse/ggplot2/master/data-raw/mpg.csv", clear

twoway (scatter hwy cty, mcolor(%60) mlwidth(0)) (lfit hwy cty), ///
title("{bf}Scatterplot with overlapping points", pos(11) size(2.75)) ///
subtitle("mpg: City vs Highway mileage", pos(11) size(2.5)) ///
legend(off) ///
scheme(white_tableau)

This is a neat-looking plot, with the markers nicely aligned in vertical lines. However, the number of points visible in the plot region is smaller than the number of observations in the data. This is because many points overlap, which hides part of the underlying data.

In order to view all the points, we need to randomly spread these points using jitter. The code below will help us visualize the entire data:

import delimited "https://raw.githubusercontent.com/tidyverse/ggplot2/master/data-raw/mpg.csv", clear  

twoway (scatter hwy cty, jitter(5) mcolor(%60) mlwidth(0)) (lfit hwy cty), ///
title("{bf}Jittered points", pos(11) size(2.75)) ///
subtitle("mpg: City vs Highway mileage", pos(11) size(2.5)) ///
legend(off) ///
scheme(white_tableau)

The higher the value for jitter, the more spread out these points will be. With jitter applied, we can now see far more of the data points.

[Back to Top]

Counts Chart

The overlapping data points we saw in the Jitter Plot can also be handled with a counts chart. Think of it as weighted markers: as the number of points at a location goes up, so does the size of the marker.

  import delimited "https://raw.githubusercontent.com/tidyverse/ggplot2/master/data-raw/mpg.csv", clear

egen total = group(cty hwy)
bysort total: egen count = count(total)

twoway (scatter hwy cty [aw = count], mcolor(%60) mlwidth(0) msize(1)) (lfit hwy cty), ///
title("{bf}Counts plot", pos(11) size(2.75)) ///
subtitle("mpg: City vs Highway mileage", pos(11) size(2.5)) ///
legend(off) ///
scheme(white_tableau)

[Back to Top]

Bubble Chart

This type of plot lets you compare categories against a continuous variable by varying the color and size of the markers.

Here, we look at comparison between Manufacturers, Car Displacement, and Mileage.

  import delimited "https://raw.githubusercontent.com/tidyverse/ggplot2/master/data-raw/mpg.csv", clear

keep if inlist(manufacturer, "audi", "ford", "honda", "hyundai")

recode hwy (15 16 17 18 19 = 1) (20 21 22 23 24 = 4) (25 26 27 28 29 = 8) ///
(30 31 32 33 34 = 16) (35 36 = 32), gen(weight)

levelsof manufacturer, local(options)
local wordcount : word count `options'

local i = 1
foreach option of local options {

colorpalette tableau, n(`wordcount') nograph

local scatter `scatter' scatter cty displ [fw = weight] if manufacturer == "`option'", ///
mcolor("`r(p`i')'%60") mlwidth(0) jitter(10) ||

local line `line' lfit cty displ if manufacturer == "`option'", lcolor("`r(p`i')'") ||

local ++i
}

twoway `scatter' `line', ///
title("{bf}Bubble Chart", pos(11) size(2.75)) ///
subtitle("mpg: Displacement vs City mileage", pos(11) size(2.5)) ///
ytitle("City Mileage", size(2)) ///
legend(order(3 "Honda" 4 "Hyundai" 1 "Audi" 2 "Ford" ) size(2)) ///
scheme(white_tableau)

[Back to Top]

Marginal Histogram

This plot type shows the relationship between two variables and their distributions at the same time.

The histograms of the X and Y variables are displayed in the margins of the scatterplot.

  import delimited "https://raw.githubusercontent.com/tidyverse/ggplot2/master/data-raw/mpg.csv", clear

egen total = group(cty hwy)
bysort total: egen count = count(total)

* Main scatter (count) plot, followed by one marginal histogram per axis

twoway (scatter hwy cty [aw = count], mcolor(%60) mlwidth(0) msize(1) legend(off)) ///
(lfit hwy cty), legend(off) name(main, replace) ytitle("Highway MPG") xtitle("City MPG") ///
graphregion(margin(t=-5))
twoway (histogram cty, yscale(off) xscale(off) ylabel(, nogrid) xlabel(, nogrid) bin(30)), name(cty_hist, replace) graphregion(margin(l=16)) fysize(15)
twoway (histogram hwy, horizontal yscale(off) xscale(off) ylabel(, nogrid) xlabel(, nogrid) bin(30)), name(hwy_hist, replace) graphregion(margin(b=15 t=-5)) fxsize(20)

graph combine cty_hist main hwy_hist, hole(2) commonscheme scheme(white_tableau) ///
title("{bf}Marginal Histogram - Scatter Count plot", size(2.75) pos(11)) subtitle("mpg: Highway vs. City Mileage", size(2.5) pos(11))

[Back to Top]

Marginal Boxplot

Like the marginal histogram, except that here we replace the histograms in the margins with box plots.

However, the box plot you will see here looks different and is minimalistic in comparison to its ‘chonky’ original. The gap between the whiskers is the IQR, while the dot represents the median.

  * Load Dataset 
import delimited "https://raw.githubusercontent.com/tidyverse/ggplot2/master/data-raw/mpg.csv", clear

egen total = group(cty hwy)
bysort total: egen count = count(total)

* Main scatter (count) plot; the loop further below builds the marginal box plots

twoway (scatter hwy cty [aw = count], mcolor(%60) mlwidth(0) msize(1) legend(off)) ///
(lfit hwy cty), legend(off) name(main, replace) ytitle("Highway MPG") xtitle("City MPG") ///
graphregion(margin(t=-5))

local i = 1
local j = 10
foreach var of varlist hwy cty {
sort `var', stable

quietly summarize `var', detail

local mean_`var' = `r(mean)'
local med_p_`var' = `r(p50)'
local p75_`var' = `r(p75)'
local p25_`var' = `r(p25)'
local iqr_`var' = `p75_`var'' - `p25_`var''

generate `var'uq = `var' if `var' <= `=`p75_`var''+(1.5*`iqr_`var'')'
generate `var'lq = `var' if `var' >= `=`p25_`var''-(1.5*`iqr_`var'')'

quietly summarize `var'uq
local max_`var'uq = `r(max)'
quietly summ `var'lq
local min_`var'lq = `r(min)'

if `i' == 1 {
colorpalette tableau, nograph
local lines`i' ///
(scatteri `p75_`var'' 1 `max_`var'uq' 1, recast(line) lpattern(solid) lcolor("`r(p`j')'") lwidth(1)) || ///
(scatteri `p25_`var'' 1 `min_`var'lq' 1, recast(line) lpattern(solid) lcolor("`r(p`j')'") lwidth(1)) || ///
(scatteri `med_p_`var'' 1, ms(square) mcolor(background) msize(2)) || ///
(scatteri `med_p_`var'' 1, ms(square) mcolor("`r(p`j')'")) ||
}

else {
colorpalette tableau, nograph
local lines`i' ///
(scatteri 1 `p75_`var'' 1 `max_`var'uq', recast(line) lpattern(solid) lcolor("`r(p`j')'") lwidth(1)) || ///
(scatteri 1 `p25_`var'' 1 `min_`var'lq', recast(line) lpattern(solid) lcolor("`r(p`j')'") lwidth(1)) || ///
(scatteri 1 `med_p_`var'', ms(square) mcolor(background) msize(2)) || ///
(scatteri 1 `med_p_`var'', ms(square) mcolor("`r(p`j')'")) ||
}


drop *lq *uq

local ++i
local j = `j' + 4

}

twoway `lines1', legend(off) xlabel(, nogrid) ylabel(, nogrid) yscale(off) xscale(off) name(hwy_box, replace) graphregion(margin(b=15 t=-5)) fxsize(5)
twoway `lines2', legend(off) xlabel(, nogrid) ylabel(, nogrid) yscale(off) xscale(off) name(cty_box, replace) graphregion(margin(l=16)) fysize(5)

graph combine cty_box main hwy_box, hole(2) commonscheme ycommon xcommon scheme(white_tableau) ///
title("{bf}Marginal Box Plot - Scatter Count plot", size(2.75) pos(11)) subtitle("mpg: Highway vs. City Mileage", size(2.5) pos(11))

The code might seem a bit advanced, but it's actually just applying the box plot formula (whiskers reach the most extreme values within 1.5 × IQR of the quartiles) to draw the box plots in a lean manner. This leaves outliers out of the box plot and allows for maximum customization.

The code is written so that you only need to add variables to the box plot loop, and it will generate the plots accordingly with some minor tinkering.

[Back to Top]

Correlogram

As the name suggests, this lets you examine the correlation between multiple continuous variables.

Unfortunately, Stata has no packaged command for this, so it requires manual coding. However, knowing how the process works gives you much finer control over the smallest details according to your specific needs.

Plus it is a good way to learn!

  * Load Dataset
sysuse auto, clear

* Only change names of variable in local var_corr.
* The code will hopefully do the rest of the work without any hitch
local var_corr price mpg trunk weight length turn foreign
local countn : word count `var_corr'

* Use correlation command
quietly correlate `var_corr'
matrix C = r(C)
local rnames : rownames C

* Now to generate a dataset from the Correlation Matrix
clear

* For no diagonal and total count
local tot_rows : display `countn' * `countn'
set obs `tot_rows'

generate corrname1 = ""
generate corrname2 = ""
generate y = .
generate x = .
generate corr = .
generate abs_corr = .

local row = 1
local y = 1
local rowname = 2

foreach name of local var_corr {
forvalues i = `rowname'/`countn' {
local a : word `i' of `var_corr'
replace corrname1 = "`name'" in `row'
replace corrname2 = "`a'" in `row'
replace y = `y' in `row'
replace x = `i' in `row'
replace corr = round(C[`i',`y'], .01) in `row'
replace abs_corr = abs(C[`i',`y']) in `row'

local ++row

}

local rowname = `rowname' + 1
local y = `y' + 1

}

drop if missing(corrname1)
replace abs_corr = 0.1 if abs_corr < 0.1 & abs_corr > 0.04

colorpalette HCL pinkgreen, n(10) nograph intensity(0.65)
*colorpalette CET CBD1, n(10) nograph //Color Blind Friendly option
generate colorname = ""
local col = 1
forvalues colrange = -1(0.2)0.8 {
replace colorname = "`r(p`col')'" if corr >= `colrange' & corr < `=`colrange' + 0.2'
replace colorname = "`r(p10)'" if corr == 1
local ++col
}


* Plotting
* Saving the plotting code in a local
forvalues i = 1/`=_N' {

local slist "`slist' (scatteri `=y[`i']' `=x[`i']' "`: display %3.2f corr[`i']'", mlabposition(0) msize(`=abs_corr[`i']*15') mcolor("`=colorname[`i']'"))"

}


* Gather Y axis labels
labmask y, val(corrname1)
labmask x, val(corrname2)

levelsof y, local(yl)
foreach l of local yl {
local ylab "`ylab' `l' `" "`:lab (y) `l''" "'"

}

* Gather X Axis labels
levelsof x, local(xl)
foreach l of local xl {
local xlab "`xlab' `l' `" "`:lab (x) `l''" "'"

}

* Plot all the locals saved above
twoway `slist', title("Correlogram of Auto Dataset Cars", size(3) pos(11)) ///
note("Dataset Used: Sysuse Auto", size(2) margin(t=5)) ///
xlabel(`xlab', labsize(2.5)) ylabel(`ylab', labsize(2.5)) ///
xscale(range(1.75 )) yscale(range(0.75 )) ///
ytitle("") xtitle("") ///
legend(off) ///
aspect(1) ///
scheme(white_tableau)

The code is designed in a way that only requires you to enter variable names in the local var_corr and the rest will be done automatically.

[Back to Top]

2. Deviation

Diverging Bars

Diverging Bars is really just a bar graph that visualizes both negative and positive values within the dataset.

Observations with a standardized MPG of 0 or above are colored green, while all negative values are in red.

  sysuse auto, clear

* Standardizing variable
egen double mpg_z = std(mpg)

* Generating indicator of below and above
generate above = (mpg_z >= 0)

* Sorting the mpg_z and assigning rank
sort mpg_z, stable
generate rank_des = _n * 2

* Assigning label
labmask rank_des, value(make)

colorpalette tableau, nograph intensity(0.8)
twoway (bar mpg_z rank_des if above == 1, horizontal lwidth(0) barwidth(1.5) bcolor("`r(p3)'")) ///
(bar mpg_z rank_des if above == 0, horizontal lwidth(0) barwidth(1.5) bcolor("`r(p4)'")), ///
ytitle("") xtitle("") ///
ylabel(2(2)148, valuelabel labsize(1.25) nogrid) xlabel(-4(1)4, nogrid) ///
xscale(range(-4 4)) ///
legend(off) ///
title("{bf}Diverging Bars (Normalized MPG)", size(2.75) pos(11)) ///
scheme(white_tableau)

[Back to Top]

Diverging Lollipop Graph

The lollipop chart you will see below conveys the same information as the diverging bar graph drawn above, but in a leaner format and with value labels on the markers.

This method uses a range spike and a scatterplot to keep everything as editable as possible. The other way of creating these is to use twoway dropline.

  sysuse auto, clear
keep in 1/20

* Standardizing variable
egen double mpg_z = std(mpg)

* Sorting the mpg_z and assigning rank
sort mpg_z, stable
generate rank_des = _n

* Assigning label
labmask rank_des, value(make)

* Generate 0 point
generate zero = 0

* Labels
tostring mpg_z, gen(mpg_z_lab) force format(%3.2f)
compress

* Plot
twoway (rspike zero mpg_z rank_des, horizontal) ///
(scatter rank_des mpg_z, msize(5.3) mlabel(mpg_z_lab) mlabsize(1.5) mlabposition(0)), ///
xlabel(-2.5(1)-0.5 0 0.5(1)2.5, labsize(2)) ylabel(1(1)20, valuelabel labsize(2)) ///
legend(off) ///
ytitle("Car Name") ///
title("{bf}Diverging Lollipop Chart (Normalized MPG)", size(2.75) pos(11)) ///
scheme(white_tableau)
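
The twoway dropline alternative mentioned above can be sketched as follows. This is a minimal version under the same assumptions as the main example (first 20 observations of auto, labmask from labutil, the white_tableau scheme); it trades the per-element control of the rspike and scatter layers for brevity and leaves out the value labels.

```stata
sysuse auto, clear
keep in 1/20

* Standardize, sort, and label the rank exactly as in the main example
egen double mpg_z = std(mpg)
sort mpg_z, stable
generate rank_des = _n
labmask rank_des, values(make)

* dropline draws each marker plus a line dropped to base(0).
* With horizontal, the first variable runs along the x axis,
* so the variable order is not switched.
twoway dropline mpg_z rank_des, horizontal base(0) ///
    msize(2) ///
    ylabel(1(1)20, valuelabel labsize(2)) ///
    legend(off) ///
    ytitle("Car Name") ///
    title("{bf}Diverging Lollipop Chart (dropline)", size(2.75) pos(11)) ///
    scheme(white_tableau)
```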

[Back to Top]

Diverging Dot Plot

This plot, like the previous 2 examples, conveys the same information but is even more slimmed down in its approach.

There are multiple ways of doing this as well, but this approach uses a scatterplot. We can also opt for twoway dot, which is a built-in Stata command for doing this.

  sysuse auto, clear

* Keeping first 20 observations as example
keep in 1/20

* Standardizing variable
egen double mpg_z = std(mpg)

* Sorting the mpg_z and assigning rank
sort mpg_z, stable
generate rank_des = _n

* Assigning label onto the sorted serial number
labmask rank_des, value(make)

* Generating indicator of below and above
generate above = (mpg_z >= 0)

* Labels
tostring mpg_z, gen(mpg_z_lab) force format(%3.2f)
compress

* Plot
colorpalette tableau, nograph intensity(0.8)
twoway (scatter rank_des mpg_z if above == 0, mcolor("`r(p4)'") msize(5) mlabel(mpg_z_lab) mlabsize(1.3) mlabposition(0)) ///
(scatter rank_des mpg_z if above == 1, mcolor("`r(p3)'") msize(5) mlabel(mpg_z_lab) mlabsize(1.3) mlabposition(0)) ///
, ///
xlabel(-2.5(1)-0.5 0 0.5(1)2.5, labsize(2)) ylabel(1(1)20, valuelabel labsize(2)) ///
legend(off) ///
ytitle("Car Name") ///
title("{bf}Diverging Dot Plot (Normalized MPG)", size(2.75) pos(11)) ///
scheme(white_tableau)

[Back to Top]

Diverging Bars — Correlation Plot

This plot is similar to the Correlogram created above, except that this version draws a bar plot instead of a correlation matrix.

A bar with a negative value means the 2 variables are negatively correlated, while a positive value means they are positively correlated.

The code may seem daunting at first, but it is just a simple use of locals and a matrix. Like the correlogram, you only need to add the names of variables to the local var_corr and the code will do the rest for you.

Because no packaged command exists for this plot, the code is a tad lengthy, but it is a great way of learning.

Once this code has been executed, you are left with a dataset that you can use in many other ways too, and not just for a bar graph.

You will also notice that the colors are progressive: their intensity increases with the strength of the correlation (in either direction).

  sysuse auto, clear 

* Only change names of variable in local var_corr.
* The code will hopefully do the rest of the work without any hitch
local var_corr price mpg trunk weight length turn foreign
local countn : word count `var_corr'

* Use correlation command
* https://journals.sagepub.com/doi/pdf/10.1177/1536867X0800800307
* SE = (upper limit – lower limit) / 3.92

quietly corrci `var_corr'
matrix C = r(corr)
local rnames : rownames C

* Now to generate a dataset from the Correlation Matrix
clear

* This will not have the diagonal of matrix (correlation of 1)
local tot_rows : display `countn' * `countn'
set obs `tot_rows'

generate corrname1 = ""
generate corrname2 = ""
generate byte y = .
generate byte x = .
generate double corr = .
generate double abs_corr = .

local row = 1
local y = 1
local rowname = 2

foreach name of local var_corr {
forvalues i = `rowname'/`countn' {
local a : word `i' of `var_corr'
replace corrname1 = "`name'" in `row'
replace corrname2 = "`a'" in `row'
replace y = `y' in `row'
replace x = `i' in `row'
replace corr = C[`i',`y'] in `row'
replace abs_corr = abs(C[`i',`y']) in `row'

local ++row

}

local rowname = `rowname' + 1
local y = `y' + 1

}

drop if missing(corrname1)

* Generating a variable that will contain color codes
* colorpalette HCL pinkgreen, n(20) nograph intensity(0.75) //Not Color Blind Friendly
colorpalette CET CBD1, n(20) nograph //Color Blind Friendly option
generate colorname = ""
local col = 1
forvalues colrange = -1(0.1)0.9 {
replace colorname = "`r(p`col')'" if corr >= `colrange' & corr < `=`colrange' + 0.1'
replace colorname = "`r(p20)'" if corr == 1
local ++col
}

* Grouped correlation of variables
generate group_corr = corrname1 + " - " + corrname2
compress


* Sort the plot
sort corr, stable
generate rank_corr = _n
labmask rank_corr, values(group_corr)


* Plotting
* If you have been running this code in chunks, run everything from here to the end in one go: local macros do not persist between separately executed chunks
* Saving the plotting code in a local
forvalues i = 1/`=_N' {

local barlist "`barlist' (scatteri `=rank_corr[`i']' 0 `=rank_corr[`i']' `=corr[`i']' , recast(line) lcolor("`=colorname[`i']'") lwidth(*6))"

}

* Saving labels for Y-Axis in a local
levelsof rank_corr, local(yl)
foreach l of local yl {

local ylab "`ylab' `l' `" "`:lab (rank_corr) `l''" "'"

}

twoway `barlist', ///
legend(off) scheme(white_tableau) ylabel(`ylab', labsize(2.5)) ///
xlab(, labsize(2.5)) ///
ytitle("Pairs") xtitle("Correlation Coeff.") ///
title("{bf}Correlation Coefficient (Diverging Bar Plot)", size(2.75) pos(11))

[Back to Top]

Diverging Bars — Correlation Plot with Confidence Intervals

This was requested by a user on Twitter and uses the same code as the previous correlation plot, with minor tweaks.

The code uses corrci, a user-written Stata command developed by Nick Cox, which returns matrices containing the upper and lower bounds of the confidence intervals.

The tip of each bar will have a confidence interval overlaid on it. As before, we only need to add variables to the local var_corr and the code does the rest.

  sysuse auto, clear 

* Only change names of variable in local var_corr.
* The code will hopefully do the rest of the work without any hitch
local var_corr price mpg trunk weight length turn foreign
local countn : word count `var_corr'

* Use correlation command
* https://journals.sagepub.com/doi/pdf/10.1177/1536867X0800800307
* SE = (upper limit – lower limit) / 3.92

quietly corrci `var_corr'
matrix C = r(corr)
local rnames : rownames C
matrix LB = r(lb)
matrix UB = r(ub)
matrix Z = r(z) //matrix of z = atanh r

egen miss = rowmiss(`var_corr')
count if miss == 0
local N = r(N)
* Now to generate a dataset from the Correlation Matrix
clear

* This will not have the diagonal of matrix (correlation of 1)
local tot_rows : display `countn' * `countn'
set obs `tot_rows'

generate corrname1 = ""
generate corrname2 = ""
generate byte y = .
generate byte x = .
generate double corr = .
generate double lb = .
generate double ub = .
generate double z = .
generate double abs_corr = .

local row = 1
local y = 1
local rowname = 2

foreach name of local var_corr {
forvalues i = `rowname'/`countn' {
local a : word `i' of `var_corr'
replace corrname1 = "`name'" in `row'
replace corrname2 = "`a'" in `row'
replace y = `y' in `row'
replace x = `i' in `row'
replace corr = C[`i',`y'] in `row'
replace lb = LB[`i',`y'] in `row'
replace ub = UB[`i',`y'] in `row'
replace z = Z[`i',`y'] in `row'
replace abs_corr = abs(C[`i',`y']) in `row'

local ++row

}

local rowname = `rowname' + 1
local y = `y' + 1

}

drop if missing(corrname1)

* Generating total non missing count and P-Values
generate N = `N'
generate double p = min(2 * ttail(N - 2, abs_corr * sqrt(N - 2) / sqrt(1 - abs_corr^2)), 1)

* Generate stars
generate stars = "*" if p <= 0.1 & p > 0.05
replace stars = "**" if p <= 0.05 & p > 0.01
replace stars = "***" if p <= 0.01

* Generating a variable that will contain color codes
* colorpalette HCL pinkgreen, n(20) nograph intensity(0.75) //Not Color Blind Friendly
colorpalette CET CBD1, n(20) nograph //Color Blind Friendly option
generate colorname = ""
local col = 1
forvalues colrange = -1(0.1)0.9 {
replace colorname = "`r(p`col')'" if corr >= `colrange' & corr < `=`colrange' + 0.1'
replace colorname = "`r(p20)'" if corr == 1
local ++col
}

* Grouped correlation of variables
generate group_corr = corrname1 + " - " + corrname2
compress


* Sort the plot
sort corr, stable
generate rank_corr = _n
labmask rank_corr, values(group_corr)


* Plotting
* If you have been running this code in chunks, run everything from here to the end in one go: local macros do not persist between separately executed chunks
* Saving the plotting code in a local
forvalues i = 1/`=_N' {

local barlist "`barlist' (scatteri `=rank_corr[`i']' 0 `=rank_corr[`i']' `=corr[`i']' , recast(line) lcolor("`=colorname[`i']'") lwidth(*6))"

}

* Saving labels for Y-Axis in a local
levelsof rank_corr, local(yl)
foreach l of local yl {

local ylab "`ylab' `l' `" "`:lab (rank_corr) `l''" "'"

}

twoway `barlist' ///
(rspike lb ub rank_corr, horizontal lcolor(white) lwidth(*2)) ///
(rspike lb ub rank_corr, horizontal lcolor(black*.5)), ///
legend(off) scheme(white_tableau) ylabel(`ylab', labsize(2.5)) ///
xlab(, labsize(2.5)) ///
ytitle("Pairs") xtitle("Correlation Coeff.") ///
title("{bf}Correlation Coefficient with Confidence Interval (Diverging Bar Plot)", size(2.75) pos(11))

Note: The code also generates p-values and significance stars, which you can add to the plot with an extra scatter layer if required.
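
One way the stars could be added is sketched below. It is a continuation rather than a standalone script: it assumes the dataset and the locals barlist and ylab built by the block above are still defined, which means running it in the same do-file execution. The variable star_x and the choice to place labels at the outer end of each interval are illustrative assumptions, not part of the original code.

```stata
* Assumes corr, lb, ub, stars, rank_corr and the locals barlist and ylab
* from the block above still exist (same do-file execution)

* Anchor each label at the outer end of its confidence interval
* (star_x is a hypothetical helper variable)
generate double star_x = cond(corr >= 0, ub, lb)

* Two invisible-marker scatter layers carry the star labels:
* to the right of positive bars and to the left of negative bars
twoway `barlist' ///
    (rspike lb ub rank_corr, horizontal lcolor(white) lwidth(*2)) ///
    (rspike lb ub rank_corr, horizontal lcolor(black*.5)) ///
    (scatter rank_corr star_x if corr >= 0, ms(i) mlabel(stars) mlabposition(3) mlabsize(2) mlabcolor(black)) ///
    (scatter rank_corr star_x if corr < 0, ms(i) mlabel(stars) mlabposition(9) mlabsize(2) mlabcolor(black)), ///
    legend(off) scheme(white_tableau) ylabel(`ylab', labsize(2.5)) ///
    xlab(, labsize(2.5)) ///
    ytitle("Pairs") xtitle("Correlation Coeff.") ///
    title("{bf}Correlation Coefficient with Confidence Interval and Stars", size(2.75) pos(11))
```

Pairs with p above 0.1 have an empty stars value, so they simply show no label.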

[Back to Top]

Area Chart

These are most commonly used to show how metrics such as % change or % returns compare to a specified baseline.

  import delimited "https://github.com/tidyverse/ggplot2/raw/main/data-raw/economics.csv", clear

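* Period-over-period change in the savings rate (the data are monthly, so this is a month-on-month change; the variable name yoy is kept)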
generate yoy = (psavert[_n] - psavert[_n-1]) / psavert[_n-1]

generate monthyear = ym(year(date(date, "YMD")), month(date(date, "YMD")))
format monthyear %tm

twoway (area yoy monthyear if monthyear <= tm(1975m12), lwidth(0)), ///
xla(84(12)185, format(%tmCY)) ///
plotregion(lstyle(solid) lwidth(.1)) ///
xtitle("") ///
ytitle("% Returns for Personal savings", size(2.75)) ///
xscale(noline) yscale(noline) ///
title("{bf}Area Chart", pos(11) size(3)) ///
subtitle("% Returns for Personal Savings", pos(11) size(2.5)) ///
scheme(white_tableau)

[Back to Top]

3. Ranking

Ordered Bar Charts

An ordered bar chart is the same as any regular bar graph, except that the bars are sorted by the Y-axis variable.

This plots the average city mileage for each vehicle brand.

  import delimited "https://raw.githubusercontent.com/tidyverse/ggplot2/master/data-raw/mpg.csv", clear

* Acquiring average mileage (city) by manufacturer
collapse (mean) cty, by(manufacturer)

graph bar (asis) cty, over(manufacturer, sort(1) label(labsize(1.75))) scheme(white_w3d) ///
title("{bf}Ordered Bar Chart", pos(11) size(2.75)) ///
ytitle("City" "Mileage", orient(horizontal) size(2)) ///
ylabel(, labsize(2)) ///
subtitle("Make Vs. Avg. Mileage", pos(11) size(2.5))

[Back to Top]

Lollipop Charts (Vertical)

This visual is the same ordered bar chart, but in lollipop form.

It uses the built-in Stata command twoway dropline, which does the job really well here.

 import delimited "https://raw.githubusercontent.com/tidyverse/ggplot2/master/data-raw/mpg.csv", clear

* Acquiring average mileage (city) by manufacturer
collapse (mean) cty, by(manufacturer)
sort cty, stable

generate order = _n

labmask order, values(manufacturer)

* Plotting
quietly summarize order
twoway dropline cty order, ///
msize(2) ///
yscale(range(0 25)) ///
ylabel(0(5)25) ///
ytitle("City" "Mileage", orient(horizontal)) ///
xscale(range(0.25)) ///
xlabel(`r(min)'(1)`r(max)', valuelabel labsize(1.75)) ///
xtitle("") ///
title("{bf}Lollipop Chart", pos(11) size(2.75)) ///
subtitle("Make Vs. Avg. Mileage", pos(11) size(2.5)) ///
scheme(white_w3d)

We can also use a scatterplot and a range spike to create a more flexible version, as seen previously in the Diverging Lollipop Graph.
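
A minimal sketch of that scatter-plus-range-spike version, applied to the same manufacturer averages, might look like this. The helper variable zero and the pstyle(p1) choice (so the stick and the head share a color) are assumptions made for illustration.

```stata
import delimited "https://raw.githubusercontent.com/tidyverse/ggplot2/master/data-raw/mpg.csv", clear

* Average city mileage by manufacturer, ranked from lowest to highest
collapse (mean) cty, by(manufacturer)
sort cty, stable
generate order = _n
labmask order, values(manufacturer)

* Stick: a spike from 0 up to the mean; head: a marker at the mean
generate zero = 0

quietly summarize order
twoway (rspike zero cty order, pstyle(p1)) ///
    (scatter cty order, pstyle(p1) msize(2)), ///
    yscale(range(0 25)) ///
    ylabel(0(5)25) ///
    ytitle("City" "Mileage", orient(horizontal)) ///
    xscale(range(0.25)) ///
    xlabel(`r(min)'(1)`r(max)', valuelabel labsize(1.75)) ///
    xtitle("") ///
    legend(off) ///
    title("{bf}Lollipop Chart (rspike + scatter)", pos(11) size(2.75)) ///
    subtitle("Make Vs. Avg. Mileage", pos(11) size(2.5)) ///
    scheme(white_w3d)
```

Because the stick and the head are separate layers, each can be styled or labeled independently, which dropline does not allow as easily.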

[Back to Top]

Dot Plot (Horizontal)

Dot plots are the same as lollipop graphs minus the line, transposed to a horizontal layout. They visualize the rank ordering of items by their calculated values and how far each value is from the others.

 import delimited "https://raw.githubusercontent.com/tidyverse/ggplot2/master/data-raw/mpg.csv", clear

* Acquiring average mileage (city) by manufacturer
collapse (mean) cty, by(manufacturer)
sort cty, stable

generate order = _n

labmask order, values(manufacturer)

* Plotting
quietly summarize cty
local xmin = `r(min)'

quietly summarize order
twoway dot cty order, horizontal ///
msize(2) ///
yscale(range(`r(min)' `r(max)')) ///
ylabel(`r(min)'(1)`r(max)', valuelabel labsize(1.75)) ///
ytitle("Make", orient(horizontal) size(2)) ///
xscale(range(`xmin')) ///
xlabel(10(5)25, nogrid) ///
xtitle("Mileage", size(2)) ///
title("{bf}Dot Plot", pos(11) size(2.75)) ///
subtitle("Make Vs. Avg. Mileage", pos(11) size(2.5)) ///
scheme(white_w3d)

[Back to Top]

Slope Chart

A great way of comparing pre-post values or positions between 2 points in time. The code is also written in a manner that allows for flexibility.

 import delimited "https://raw.githubusercontent.com/selva86/datasets/master/gdppercap.csv", varnames(1) clear

* Adding variable names to imported data
rename (v2 v3) (y1952 y1957)

* Checking which value is lower than previous data value
generate negative = (y1957 < y1952)

generate lab1952 = continent + ", " + string(round(y1952))
generate lab1957 = continent + ", " + string(round(y1957))

generate continent1 = 1
generate continent2 = 2

colorpalette w3, nograph
twoway (pcspike y1952 continent1 y1957 continent2 if negative == 0, legend(off) lcolor("`r(p11)'")) ///
(pcspike y1952 continent1 y1957 continent2 if negative == 1, legend(off) lcolor("`r(p1)'")) ///
(scatter y1952 continent1, ms(i) mlabposition(9) mlabel(lab1952)) ///
(scatter y1957 continent2, ms(i) mlabposition(3) mlabel(lab1957)) ///
(scatteri 12700 1 "{bf}Year 1952", ms(i) mlabpos(9)) ///
(scatteri 12700 2 "{bf}Year 1957", ms(i) mlabpos(3)) ///
, ///
ylabel(0(4000)12000, labsize(2) nogrid) ///
ytitle("Avg." "GDP/Capita", size(2) orient(horizontal)) ///
yscale(range(0 13000)) ///
xlabel(1(1)2) ///
xscale(off) ///
xtitle("") ///
xscale(range(0.2 2.8)) ///
aspect(1.3) ///
title("{bf}Slope Chart", pos(11) size(2.75)) ///
subtitle("Mean GDP per capita: 1952 Vs. 1957" " ", pos(11) size(2)) ///
graphregion(margin(r=25)) ///
scheme(white_w3d)

[Back to Top]

Dumbbell Plot

A convenient and great method if you wish to:

  1. Visualize positions between different time periods in a single row.
  2. Visualize distance between the 2 points

import delimited "https://raw.githubusercontent.com/selva86/datasets/master/health.csv", varnames(1) clear

* Preparing Y-axis
generate srno = _n * 3
labmask srno, values(area)

foreach var of varlist pct* {

replace `var' = `var' * 100

}

colorpalette w3, nograph
twoway (rspike pct_2013 pct_2014 srno, horizontal lcolor("`r(p6)'*0.4")) ///
(scatter srno pct_2013, mcolor("`r(p6)'*0.4")) ///
(scatter srno pct_2014, mcolor("`r(p6)'")) ///
, ///
ylabel(3(3)78, valuelabel angle(horizontal) labsize(2)) ///
legend(order(3 "2014" 2 "2013") pos(11) row(1) size(2)) ///
ytitle("") ///
title("{bf}Dumbbell Plot", pos(11) size(2.75)) ///
subtitle("% Change in Health Indicators by Area: 2014 vs. 2013", pos(11) size(2)) ///
scheme(white_tableau)

[Back to Top]

4. Distribution

Histogram on Continuous Variable (Over Category)

Stata's built-in histogram command generates a plot only for the single variable supplied to it.

To generate a histogram of a continuous variable split by category, I used one of Stata's undocumented commands (see help undocumented) to generate the histogram variables within the dataset before plotting.

This was a bit tricky, and there is a chance that it might not work for all use cases, so feel free to shoot me a message if there is a problem with the code or it requires updating.

The plot is built with twoway__histogram_gen, applied to each class of vehicle.

 clear frames

import delimited "https://raw.githubusercontent.com/tidyverse/ggplot2/master/data-raw/mpg.csv", clear
replace class = subinstr(class, "2", "two", .)

frame copy default original, replace

* Build one histogram dataset per vehicle class, each in its own frame
levelsof class, local(cls)
foreach l of local cls {

    frame put displ class, into(`l')
    frame change `l'

    * h holds the bin frequencies and x the bin positions
    twoway__histogram_gen displ if class == "`l'", start(1) width(0.1) frequency generate(h x, replace)
    rename (h) (h_`l')

    keep x h_`l'
    drop if missing(x)

    save `l', replace

    frame change original

}

frame change original
twoway__histogram_gen displ, start(1) width(0.1) frequency generate(h x, replace)
drop h
generate tag = 1 if missing(x)
replace x = _n if missing(x)

foreach l of local cls {

    merge 1:1 x using `l', nogen
    * erase `l'.dta

}

replace x = . if tag == 1
drop tag


keep x h_*
drop if missing(x)
reshape long h_, i(x) j(type) string
bysort x (type) : gen cumul_sum_ = sum(h_) if !missing(h_)
drop h_*
reshape wide cumul_sum_, i(x) j(type) string

* Plotting
ds cumul*
local wcount: word count `r(varlist)'

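* Draw the largest cumulative total first so that the shorter bars overlay it
forvalues i = `wcount'(-1)1 {

    * Re-run ds on each pass because colorpalette overwrites r(varlist)
    ds cumul*
    local a : word `i' of `r(varlist)'
    display "`a'"
    colorpalette tableau, nograph n(`i')
    local bar "`bar' (bar `a' x, fcolor("`r(p`i')'") barwidth(0.1) lwidth(0.1) lcolor(gs4))"

}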

twoway `bar', xlabel(1(1)7) scheme(white_tableau) ///
legend(order(1 "2 Seater" 2 "SUV" 3 "Subcompact" 4 "Pickup" 5 "Minivan" 6 "Midsize" 7 "Compact") rowgap(0) size(2)) ///
xlabel(, labsize(2)) ylabel(, labsize(2)) ///
ytitle("Count", size(2)) xtitle("Displacement", size(2)) ///
title("{bf}Histogram with Auto Binning", pos(11) size(2.75)) ///
subtitle("Engine Displacement across Vehicle Classes", pos(11) size(2))

[Back to Top]

Histogram on Categorical Variable

This produces a stacked bar chart of frequencies, with one bar per manufacturer and one segment per vehicle class.

 import delimited "https://raw.githubusercontent.com/tidyverse/ggplot2/master/data-raw/mpg.csv", clear


* Plotting
forvalues i = 1/7 {

    local barlwidth "`barlwidth' bar(`i', lwidth(0)) "

}

graph bar (count), over(class) over(manufacturer, label(alternate labsize(2))) asyvars stack ///
scheme(white_w3d) ///
ylabel(, nogrid) ///
legend(order(7 "SUV" 6 "Subcompact" 5 "Pickup" 4 "Minivan" 3 "Midsize" 2 "Compact" 1 "2 Seater") rowgap(0.25) size(2)) ///
lintensity(*0) ///
`barlwidth' ///
title("{bf}Histogram on Categorical Variable", pos(11) size(2.75)) ///
subtitle("Manufacturer across Vehicle Classes", pos(11) size(2))

The legend lists the classes in the order they would stack if all of them were present in a bar.

[Back to Top]

Density Plot (By Category)

As the name suggests, this generates a kernel density (kdensity) plot of city mileage for each cylinder category.

Each cylinder category gets its own density curve here.

 import delimited "https://raw.githubusercontent.com/tidyverse/ggplot2/master/data-raw/mpg.csv", clear

levelsof cyl, local(cylinders)
foreach cylinder of local cylinders {

    quietly summarize cty
    local kden "`kden' (kdensity cty if cyl == `cylinder', range(`r(min)' `r(max)') recast(area) fcolor(%70) lwidth(*0.25))"

}

twoway `kden', scheme(white_w3d) ///
legend(subtitle("Cylinders", size(2)) label(1 "4") label(2 "5") label(3 "6") label(4 "8") rowgap(0.25) size(2)) ///
title("{bf}Density Plot", pos(11) size(2.75)) ///
ytitle("Density", size(2) orient(horizontal)) ///
ylabel(, nogrid labsize(2)) ///
xtitle("City Mileage", size(2)) ///
xlabel(, nogrid labsize(2)) ///
subtitle("City Mileage over number of cylinders", pos(11) size(2))

[Back to Top]

The Box Plot

This is the regular box plot we are all familiar with, easily generated using the built-in Stata command.

It is a great tool for studying a distribution: it shows the median, the spread and the outliers.

The line within the red box is the median. The top of the box is the 75th percentile and the bottom is the 25th percentile. The lines above and below the box are the whiskers; each extends to the most extreme value lying within 1.5*IQR (interquartile range) of the box. All points beyond the ends of the whiskers are flagged as outside values (outliers).
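The arithmetic behind those elements can be sketched by hand. This is only an illustration, assuming Stata's built-in auto dataset rather than the mpg data used here; the local macro names are made up for the example, and graph box may compute its percentiles slightly differently from summarize.

```stata
* Minimal sketch on the built-in auto data (an assumption, not the post's data)
sysuse auto, clear

* Quartiles and median from summarize, detail
quietly summarize mpg, detail
local q1  = r(p25)
local med = r(p50)
local q3  = r(p75)

* Interquartile range and the 1.5*IQR limits on either side of the box
local iqr = `q3' - `q1'
local lower_fence = `q1' - 1.5 * `iqr'
local upper_fence = `q3' + 1.5 * `iqr'

display "Q1: `q1'   Median: `med'   Q3: `q3'   IQR: `iqr'"
display "Limits: `lower_fence' to `upper_fence'"

* The upper whisker stops at the largest observation inside the upper limit
quietly summarize mpg if mpg <= `upper_fence'
display "Upper whisker ends at: " r(max)

* Observations beyond either limit would be drawn as individual points
count if (mpg < `lower_fence' | mpg > `upper_fence') & !missing(mpg)
```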

 import delimited "https://raw.githubusercontent.com/tidyverse/ggplot2/master/data-raw/mpg.csv", clear

graph box cty, over(class) ///
ytitle("City Mileage", size(2.25)) ///
ylabel(, nogrid) ///
title("{bf}Box Plot", pos(11) size(2.75)) ///
b1title(" " "Class of vehicle", size(2.5)) ///
subtitle("City Mileage grouped by class of vehicle", pos(11) size(2)) ///
scheme(white_w3d)

[Back to Top]

Tufte Style Box Plot (Over Category)

This is your regular box plot but on a diet.

It is a minimalistic version of the plot that shows essentially the same information. The gap between the whiskers is the IQR.

The box plot is generated over vehicle class.

 import delimited "https://raw.githubusercontent.com/tidyverse/ggplot2/master/data-raw/mpg.csv", clear

* Tufte styled box plot
graph box cty, over(class) ///
box(1, color(white%0)) ///
medtype(marker) ///
medmarker(mcolor(black) mlwidth(0)) ///
cwhiskers ///
alsize(0) ///
intensity(0) ///
lintensity(1) ///
lines(lpattern(solid) lwidth(medium)) ///
ylabel(, nogrid) ///
yscale(noline) ///
title("{bf}Box Plot", pos(11) size(2.75)) ///
subtitle("City Mileage grouped by class of vehicle", pos(11) size(2)) ///
scheme(white_w3d)

[Back to Top]

Minimalistic Box Plot (Over Category & By Type)

This version is also a slimmed-down box plot, but it lets you group the boxes by type for better viewing.

 import delimited "https://raw.githubusercontent.com/tidyverse/ggplot2/master/data-raw/mpg.csv", clear


levelsof cyl, local(cylinders)
local catcount: word count `cylinders'
forvalues i = 1/`catcount' {

    colorpalette tableau, nograph n(`catcount')
    local boxopt "`boxopt' box(`i', color("`r(p`i')'")) "

}

display `"`boxopt'"'

graph box cty, over(cyl) ///
by(class, ///
row(1) legend(pos(3)) imargin(l=1.5 r=1.5) style(compact) ///
title("{bf}Box Plot", pos(11) size(2.75)) ///
subtitle("City Mileage over number of cylinders" " ", pos(11) size(2)) ///
note(, size(2)) ///
) ///
asyvars ///
`boxopt' ///
boxgap(50) ///
medtype(marker) ///
medmarker(mcolor() mlwidth(0) msize(1)) ///
cwhiskers ///
alsize(0) ///
intensity(0) ///
lintensity(1) ///
lines(lpattern(solid) lwidth(medium)) ///
ylabel(, nogrid) ///
yscale(noline) ///
ytitle("City Mileage", size(2.25)) ///
subtitle(, size(2.5)) /// //size of group headers
legend(size(2.25) rowgap(0.25) subtitle("Cylinders", size(2.25))) ///
scheme(white_tableau)

[Back to Top]

Violin Plot

This is similar to the box plot but lets us see the density as well.

The first visual below contains both the box and the density, while the second shows the density only.

 import delimited "https://raw.githubusercontent.com/tidyverse/ggplot2/master/data-raw/mpg.csv", clear

* Original Violin Plot
* This plot contains box and distribution
violinplot cty, over(class) vertical scheme(white_w3d) ///
ytitle("City Mileage", size(2.25)) ///
ylabel(, nogrid) ///
title("{bf}Violin Plot", pos(11) size(2.75)) ///
b1title(" " "Class of vehicle", size(2.5)) ///
subtitle("City Mileage grouped by class of vehicle", pos(11) size(2))


* To make a version without box we can use:
violinplot cty, over(class) vertical scheme(white_w3d) nobox nomedian noline nowhiskers ///
ytitle("City Mileage", size(2.25)) ///
ylabel(, nogrid) ///
title("{bf}Violin Plot (Density Only)", pos(11) size(2.75)) ///
b1title(" " "Class of vehicle", size(2.5)) ///
subtitle("City Mileage grouped by class of vehicle", pos(11) size(2))

[Back to Top]

Population Pyramid Plot

A population pyramid lets us compare two groups across categories. Such plots are common in surveys, where they typically show gender across age groups.

The plot is also an interesting way of showing how counts are funneled through stages or over time.
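The mirroring trick behind such plots can be sketched on toy data: one group's values are made negative so its bars extend left of zero, and the axis is relabeled with absolute values. The numbers below are invented purely for illustration and are not taken from any dataset in this post.

```stata
* Toy data, invented for illustration only
clear
input str6 gender byte stage long users
"Male"   1 900
"Male"   2 600
"Male"   3 250
"Female" 1 850
"Female" 2 640
"Female" 3 300
end

* Mirror one group: negative values are drawn to the left of zero
replace users = -users if gender == "Male"

* Relabel the axis so both sides read as positive counts
twoway (bar users stage if gender == "Female", horizontal barwidth(0.8)) ///
       (bar users stage if gender == "Male", horizontal barwidth(0.8)), ///
       xlabel(-1000 "1000" -500 "500" 0 "0" 500 "500" 1000 "1000") ///
       ylabel(1(1)3) ///
       legend(order(1 "Female" 2 "Male"))
```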

Example below:

 import delimited "https://raw.githubusercontent.com/selva86/datasets/master/email_campaign_funnel.csv", clear 

format users %20.0g
replace users = round(users)
replace users = -(users) if users < 0 & gender == "Female"
replace users = -(users) if users > 0 & gender == "Male"

encode stage, gen(stage_n)

forvalues i = -15000000(5000000)15000000 {

    if `i' != 0 {
        * Use of compound quotes to work with labels with absolute (abs) values
        local xlab "`xlab' `i' `"`=abs(`i')/1000000'm"'"
    }
    else {
        local xlab "`xlab' 0 `"0"'"
    }
}

* display `"`xlab'"'

twoway (bar users stage_n if gender == "Female", horizontal lwidth(0) barwidth(0.8)) ///
(bar users stage_n if gender == "Male", horizontal lwidth(0) barwidth(0.8)) ///
, ///
yscale(noline) ///
xlabel(`xlab', nogrid) ///
ylabel(1(1)18, nogrid noticks valuelabel labsize(2)) ///
ytitle("Stage") ///
legend(order(1 "Female" 2 "Male") size(2)) ///
title("{bf}Email Campaign Funnel", size(2.75)) ///
scheme(white_tableau)

[Back to Top]

Conclusion

This concludes our guide to creating some of the most commonly used visualizations in Stata with relative ease! I hope you enjoyed it. Please do share your own versions using other color schemes or data.

Also, please do point out if there are any bugs with the code.

Happy Visualizing!

About the author

I am an Economist/Data Analyst by profession and, quite frankly, a Stata addict. I am currently based in Islamabad, Pakistan, working as a Consultant at The World Bank & Centre for Economic Research in Pakistan (CERP).

I am a graduate of Lahore University of Management Sciences (LUMS) & National University of Computer & Emerging Sciences.

You can connect with me via GitHub, Medium, Twitter, LinkedIn or simply via email: 17160013@lums.edu.pk

Economist, Data Analyst, Data Visualizer & Stata Enthusiast. F1 Fanatic! My Twitter: https://twitter.com/theFstat
