Stata Graph Replication: Infant Mortality Rate visualization

Published in The Stata Gallery, Jun 20, 2022

This article presents the Stata code for replicating the Infant Mortality Rate (IMR) visualization made by The Times of India. The original visualization is not very complex: it presents a comparative picture of IMR at three time points (2010, 2015 and 2020) along with the state-wise improvement in IMR over the years.

This guide assumes that readers have a basic knowledge of Stata, such as importing a dataset, writing and running a do-file, and using the Stata user interface.

The program replicates the visualization in three steps, described below.

Step 1: Setting up the data

The dataset for the visualization is available on my GitHub page in Excel format. It was created by taking IMR values for states from the published Sample Registration System (SRS) bulletins. The dataset is not automated at present, since it involves extracting data values from PDF documents available on the internet. However, I am working on automating the entire process by parsing the PDF documents for data values and storing them in Excel/CSV formats. Currently the dataset contains four columns (State, IMR 2020, IMR 2015 and IMR 2010).

In the first step we import the Excel file into Stata, and after all data processing we save it as a .dta file for further analysis. All files are placed in the same folder, and the folder structure described in The Stata Guide by Asjad Naqvi has been used to organize the files and folders.

clear all

*** specifying the working directory
** change this directory to your directory name
cap cd "/Users/ritulkamal/OneDrive/Stata/master"

*** Installing user written packages for customizing the graphs and fonts
ssc install schemepack, replace
set scheme white_tableau
ssc install palettes, replace
ssc install colrspace, replace
graph set window fontface "Arial"

*** importing the raw data from excel to Stata
import excel "C:\Users\hp\OneDrive\Stata\raw\SRS_bulletin.xlsx", sheet("Sheet") firstrow   //update the path here

Step 2: Data cleaning, generating variables and installing User Written Commands

The first step is to rename the variables:

ren (State IMR2020 IMR2015 IMR2010) (statename imr_2020 imr_2015 imr_2010)

The IMR for Telangana for the year 2010 is not available, as the state was formed in 2014. The value is given as “NA” in the original visualization, which makes Stata read the whole column as a string variable. Hence, we destring the variable first:

destring imr_2010, replace force

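Note that the force option is required here because the variable contains a non-numeric value (“NA”). With force, destring converts such values to missing instead of refusing to convert the variable.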
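To see what force does in isolation, here is a minimal sketch on a toy dataset. The variable name imr_toy and its three values are made up for illustration and are not taken from the SRS data.

```stata
* toy illustration with hypothetical values (not the SRS dataset)
clear
input str4 imr_toy
"44"
"NA"
"31"
end

* without force, destring would leave the variable as a string because of "NA";
* with force, the non-numeric entry is converted to missing (.)
destring imr_toy, replace force
list
```

After this runs, the second observation should be listed as missing while the other two are ordinary numbers.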

The next step is to calculate the improvement in IMR:

gen improvement = ((imr_2010 - imr_2020)/imr_2010)*100
format improvement %5.1f
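As a quick check of the formula, the sketch below works through it with two hypothetical IMR values (50 and 30, chosen only to keep the arithmetic easy). It also shows that a missing starting value gives a missing result, which is what happens for Telangana.

```stata
* hypothetical values, chosen only to illustrate the formula
scalar imr_start = 50
scalar imr_end   = 30

* percentage improvement relative to the earlier year: (50 - 30)/50*100
scalar improvement_pct = ((imr_start - imr_end)/imr_start)*100
display %5.1f improvement_pct

* if the earlier value is missing, the improvement is missing as well
scalar improvement_na = ((. - imr_end)/.)*100
display improvement_na
```

The first display shows an improvement of 40 percent; the second shows a missing value.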

Now comes the fun part (for me personally), where we set up our data to plot the numeric IMR values as value labels in scatter plots. This step is crucial for the visualization: we need the IMR values for all years to sit in straight rows, and plotting the values as they are will not give that result. To overcome this problem, we plot an increasing number sequence and assign the actual IMR values as value labels.
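The idea can be sketched with Stata's built-in label commands on a toy variable. This is only an illustration of the concept: it uses label define and label values rather than the commands used later in this guide, and the variable name, label name and IMR values are hypothetical.

```stata
* toy illustration: an increasing sequence carrying IMR values as labels
clear
input code
1
2
3
end

* the underlying numbers stay 1, 2, 3 (one row each),
* while the labels carry the values we want to show, including a repeat
label define imr_lbl 1 "28" 2 "28" 3 "20"
label values code imr_lbl

list code            // shows the labels
list code, nolabel   // shows the underlying sequence
```

Because the plotted numbers are just 1, 2, 3, every row lands at its own evenly spaced position, regardless of what the label says.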

The first step in achieving the above is to convert our numeric IMR values into string values.

tostring imr_2020, replace force
tostring imr_2015, replace force
tostring imr_2010, replace force
replace imr_2010 = "NA" if imr_2010 == "."

Next we generate a variable for storing the ranks of states as per the original visualization.

*** creating a variable for state ranks
cap gen st_rank = .
replace st_rank = 1 if statename=="India"
replace st_rank = 2 if statename=="Telangana"
replace st_rank = 3 if statename=="J&K"
replace st_rank = 4 if statename=="Delhi"
replace st_rank = 5 if statename=="Kerala"
replace st_rank = 6 if statename=="Karnataka"
replace st_rank = 7 if statename=="Andhra Pradesh"
replace st_rank = 8 if statename=="Gujarat"
replace st_rank = 9 if statename=="Punjab"
replace st_rank = 10 if statename=="Tamil Nadu"
replace st_rank = 11 if statename=="Bihar"
replace st_rank = 12 if statename=="Maharashtra"
replace st_rank = 13 if statename=="Rajasthan"
replace st_rank = 14 if statename=="Haryana"
replace st_rank = 15 if statename=="Odisha"
replace st_rank = 16 if statename=="Jharkhand"
replace st_rank = 17 if statename=="West Bengal"
replace st_rank = 18 if statename=="Assam"
replace st_rank = 19 if statename=="Uttar Pradesh"
replace st_rank = 20 if statename=="Uttarakhand"
replace st_rank = 21 if statename=="Madhya Pradesh"
replace st_rank = 22 if statename=="Chhattisgarh"
order statename st_rank

Now comes the interesting part. After converting the IMR values into string variables, we need to encode them so that they can be plotted on scatter plots. However, the encode command does not give the desired result, as it assigns codes in alphabetical order, whereas we need them in the order of the state ranks.
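The alphabetical behaviour of encode is easy to see on a toy dataset. The three names and ranks below are hypothetical and are used only to show the mismatch.

```stata
* toy illustration: encode numbers strings alphabetically, not by rank
clear
input str10 name rank
"India" 1
"Telangana" 2
"Delhi" 3
end

encode name, gen(name_code)

* compare the desired rank with the code that encode assigned
list name rank name_code, nolabel
```

Here Delhi receives code 1, India code 2 and Telangana code 3, so the codes do not follow the rank column.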

Here two user-written commands come to our rescue: sencode and labmask. The sencode command lets us encode the state names in the order of their ranks, via its gsort() option, to create value labels. The labmask command assigns the values of one variable as the value labels of another.

Some temporary variables are generated to create unique labels, because identical IMR values would otherwise be allotted the same label. To circumvent this, a temporary variable is created by concatenating the state rank and the respective IMR value, thereby making each value unique.

*** encoding statename based on the ranking of states
ssc install sencode, replace
ssc install labmask, replace
sencode statename, gen(state) gsort(st_rank)

*** converting the numeric values of IMR into labels for scatter plots
gen st_imr_2020 = string(st_rank) + "-" + imr_2020
gen st_imr_2015 = string(st_rank) + "-" + imr_2015
gen st_imr_2010 = string(st_rank) + "-" + imr_2010

sencode st_imr_2010, gen(imr_2010_code) gsort(st_rank)
sencode st_imr_2015, gen(imr_2015_code) gsort(st_rank)
sencode st_imr_2020, gen(imr_2020_code) gsort(st_rank)

gen imr_2010_code_1 = imr_2010_code
gen imr_2015_code_1 = imr_2015_code
gen imr_2020_code_1 = imr_2020_code

labmask imr_2010_code_1, values(imr_2010)
labmask imr_2015_code_1, values(imr_2015)
labmask imr_2020_code_1, values(imr_2020)

*** dropping the temporary variables
drop st_imr_2010 st_imr_2015 st_imr_2020 imr_2010_code imr_2015_code imr_2020_code
ren (imr_2010_code_1 imr_2015_code_1 imr_2020_code_1) (imr_2010_code imr_2015_code imr_2020_code)

*** saving the dataset
save "SRS bulletin.dta", replace
compress

Step 3: Data Visualization

After getting the data in shape, we now start putting together the visualization. The visualization is created in two parts and then both are merged using the graph combine command in Stata.

First part of the graph

The first part of the process deals with plotting the state names and the IMR values for 2010, 2015 and 2020. This is achieved by overlaying four scatter plots, one for the state names and the other three for the IMR values. The headings are plotted using text boxes.

*** first part of graph ***
*** creating variables for axis of scatter plots
cap gen three = -3
cap gen four = -4
cap gen five = -5
cap gen seven = -7

*** scatter plots for state names and IMR values
scatter state seven, mlabel(state) mlabcolor(black) msymbol(none) mlabsize(medium) mlabgap(2) ysc(reverse) || ///
    scatter imr_2020_code five, mlabel(imr_2020_code) mlabcolor(black) msymbol(none) mlabsize(medium) ysc(reverse) || ///
    scatter imr_2015_code four, mlabel(imr_2015_code) mlabcolor(black) msymbol(none) mlabsize(medium) ysc(reverse) || ///
    scatter imr_2010_code three, mlabel(imr_2010_code) mlabcolor(black) msymbol(none) mlabsize(medium) ysc(reverse) ///
    title(" ") xscale(off) xtitle(" ") ytitle(" ") ytitle(" ", size(tiny)) yscale(noline) ylabel(none) ///
    xlabel(-6.5 " " -6 " " -4 " " -2 " ") legend(off) ///
    text(-1 -5 "{fontface Arial Bold:2020}" "{fontface Arial Bold:(May 2022 bulletin)}", size(tiny) color(white) box just(center) linegap(1) margin(l+2 t+1 b+1 r+1) fcolor(navy%80) lw(none)) ///
    text(-1 -4 "{fontface Arial Bold:2015}" "{fontface Arial Bold:(Dec 2016 bulletin)}", size(tiny) color(white) box just(center) linegap(1) margin(l+2 t+1 b+1 r+1) fcolor(red%60) lw(none)) ///
    text(-1 -3 "{fontface Arial Bold:2010}" "{fontface Arial Bold:(Dec 2011 bulletin)}", size(tiny) color(white) box just(center) linegap(1) margin(l+2 t+1 b+1 r+1) fcolor(blue%60) lw(none)) ///
    xsize(1) ysize(1) name(g1)

The above code overlays four scatter plots showing the value labels of the state name, IMR 2020, IMR 2015 and IMR 2010 at x-axis positions of -7, -5, -4 and -3 respectively. The rest of the options customize the graph.
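Stripped of all styling, the technique is just overlaid scatter plots whose markers are hidden and whose labels do the work. Here is a minimal sketch with two columns on a toy dataset; the variable names, the graph name toy_cols and the values are hypothetical.

```stata
* toy illustration: two text columns built from overlaid scatter plots
clear
input row x_name x_val str10 name str2 val
1 -7 -5 "India" "28"
2 -7 -5 "Delhi" "12"
3 -7 -5 "Kerala" "6"
end

* msymbol(none) hides the markers so only the labels are drawn;
* each scatter sits at its own fixed x position, forming a column
twoway (scatter row x_name, mlabel(name) msymbol(none) mlabcolor(black)) ///
       (scatter row x_val,  mlabel(val)  msymbol(none) mlabcolor(black)), ///
       yscale(reverse) legend(off) name(toy_cols, replace)
```

Reversing the y-axis puts row 1 at the top, which is why the full graph uses ysc(reverse) together with rank-ordered codes.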

Second part of the graph

The next step is to plot a horizontal bar graph of the improvement in IMR. This is the simplest step in the entire process.

*** second part of graph ***
*** bar graph for improvement in IMR
graph hbar (mean) improvement, over(state, axis(off)) missing bar(1, fcolor(cranberry)) intensity(80) ///
    blabel(bar, color(black%100) format(%5.1f)) ytitle(, size(tiny)) yscale(off) ///
    ylabel(0(10)70, valuelabel labsize(zero)) xsize(1) ysize(1) legend(off) graphregion(margin(0 0 0 6)) ///
    text(10 107 "{fontface Arial Bold:Improvement}", size(vsmall) color(white) box just(center) linegap(1) margin(l+4 t+2 b+2 r+4) fcolor(cranberry%90) lw(none)) ///
    text(2 93.5 "{fontface Arial Bold: NA}", size(medium) color(cranberry) box just(left) linegap(0) margin(0 t+2 0 0) fcolor(none) lw(none)) ///
    name(g2)

Now that we have the two parts of the graph, we simply combine the two using the graph combine command and export the final graph.

The final graph

**** Final graph ****
** putting it all together **
graph combine g1 g2, title("{fontface Arial Bold: Comparison of IMR in Indian states}", size(medsmall) margin(l+1 t+1 b+4 r+1)) ///
    note("{fontface Arial Bold: Source: SRS bulletins, December 2011 to May 2022, *Includes Telangana prior to 2015}", size(vsmall) margin(l+1 t+1 b+2 r+1))

*** exporting the graph ***
graph export ..\graphs\srs_bulletin_toi.png, replace wid(2000)

All suggestions for modifying and refining both the Stata code and visualization are most welcome and will surely improve my coding skills in future. The replication might not be perfect but it was fun to come up with the code.

Note: This is my first attempt at replicating a visualization and it may not be 100% accurate, so discretion is requested when drawing interpretations from it.

About the author

I am a Statistical Analyst by profession and have been using Stata since 2016; however, I only really started coding in Stata in 2020. I am based in India and work as Assistant Director for the Census of India. I can be found on Twitter, LinkedIn and also by email (ritul2387@gmail.com).