The Stata workflow guide

Asjad Naqvi
Published in The Stata Guide · 24 min read · Jun 7, 2021

Last updated: Aug 2022

When we start using software like Stata, we often overlook structuring our files and folders. Most of the time, neatly organizing the data and code is left as a last step, for example, when finalizing an article. I have also seen researchers only think about organizing files and folders once they get revisions, or are expected to submit their files for replication. This is exactly the approach we want to avoid.

Workflow management is extremely important, and organizing our folders should be the very first step; otherwise a lot of time is lost later figuring out what we actually did, where the files actually are, and where the latest code is saved. A good structure also makes life easier not only for ourselves but also for our co-authors and collaborators, or even for others trying to replicate the code. A good litmus test of a workflow system is that when we open an old project, we can immediately figure out the purpose of each file and folder.

In this guide, I will cover some basic principles of workflow management. This article will not discuss coding as such, but rather how to organize the code and all the files around it. While I use Stata as an example here, the broad workflow tips can be extended to any project in any software. Additionally, these are not hard rules, but rather guidelines for best practices to get you started with organizing your research projects.

The guide is in six parts that cover the following areas:

  • Part 1: Organizing folders
  • Part 2: Naming conventions for files and folders
  • Part 3: The importance of relative file paths
  • Part 4: Splitting tasks across different dofiles
  • Part 5: The master dofile: put everything together
  • Part 6: Styling the code

So let’s get started!

Part 1: Sort EVERYTHING into folders

Over the life cycle of a project, we accumulate files. And lots of them. Just on the data analysis side, this includes getting raw data, writing scripts to process the data, putting together data for analysis, and generating tables, graphs, and figures. And, of course, there are versions of all of the above.

Additionally, research is not just the data part. It also includes accumulating literature, and writing articles, project documents, reports, and deliverables. All of these also have many versions.

ALL of these files should be sorted into folders. If you have followed my other guides, then you will have noticed that I always recommend the following folder structure:

Let’s look at these folders carefully:

  • raw: contains all the raw data files
  • dofiles: contains all the scripts to process, clean, and analyze the raw files
  • temp: contains intermediate files that are generated from the raw data
  • master: contains the final data that is ready for analysis
  • GIS: contains the spatial layers
  • graphs: contains the figures

This list is not exhaustive, and can vary from project to project, but it is a good starting point. We can also add more folders for more control. For example, we can differentiate between small projects (e.g. single papers) and large projects (large datasets, multiple papers).

Organizing small projects

For small projects or standalone papers, I recommend a folder structure that looks something like this:

First of all, number the folders in the order they are used. For example, you will get raw data first (01_raw), then you will write the scripts (02_dofiles). While writing the scripts, you will create temporary files (03_temp), before the final files are saved in the master folder (04_master). You might also have spatial data that can go in its own folder (05_GIS). Once the data is clean, it can be used to generate tables (06_tables) and figures (07_figures).

Adding a number prefix to folders keeps the sequence in check. Plus, if anyone else uses the folder, they can follow the logical order of the workflow. Numbering also sorts the folders in numerical order; otherwise they appear alphabetically, which tells us little about the sequence in which they are used. This can create problems even with our own projects. Several times I have downloaded replication files where the folders are placed in alphabetical order, which can be confusing since it does not correspond to the workflow.

TIP: Never have spaces in folder or file names

While spaces are less of a problem than they used to be, since most operating systems can deal with them, they can still create issues on certain systems and in scripts. Sometimes you even see spaces replaced with %20 in URLs. Similarly, other characters like ~#@!\%^& and - (basically most of the symbols) should be avoided at all costs. Most of these have special meanings in programming and usually require some special treatment to be included in scripts. In short, skip the special characters.

TIP: Never use hyphens (-) or other special characters in folder or file names. If you need spaces, ONLY use underscores (_).

Technically, Stata can read names with spaces if they are enclosed in double quotes "". But as good practice, use underscores for both folder and file names.

TIP: Avoid capital letters in file, folder, and variable names.

Capital letters are technically not an issue, but they are considered bad practice. The aim of scripting and programming is to develop a flow, and having to deal with capital letters does affect the speed. If I see capital letters in variable names, I just make them lower case unless there is some special reason to keep them in upper case. Properly formatted text should be reserved for variable and value labels.

Variable names should also be kept as simple as possible. Stata has a 32-character limit for variable names, and staying well within this limit is usually considered good practice. Also keep this in mind if you are generating primary data through surveys and questionnaires. If you do get messy data with weird or very long variable names, try to clean them up as much as possible. Pattern-replacing names using string functions or regular expressions is highly useful for these tasks.
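As a quick sketch (the variable names here are made up), such a clean-up can be as simple as:

* tidy up messy variable names (hypothetical names for illustration)
rename *, lower                                       // make all variable names lower case
rename q1_household_head_education_level q1_hh_edu    // shorten an unwieldy survey name
rename healthexp_* hexp_*                             // pattern-replace a common prefix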

Organizing large projects

If you are working on large projects, which involve multiple datasets, several papers, and project deliverables, then I would suggest splitting the data setup from the analysis part.

For large projects, I use this type of a folder structure:

Here the “data” folder is just raw data and scripts that clean up the raw files. The aim here is to generate the final data files in the master folder. This setup works well for projects that are fairly large in terms of size and complexity of handling and managing datasets.

The aim of the scripts in the data folder is to put everything together after doing all sorts of spatial merges, cleaning, processing etc., which can have their own set of challenges. Therefore, using a separate folder for data can also help us document the steps. Once the final dataset is created, the large scripts that put the data together sort of become redundant unless some modifications are required. Once the master file is created, this portion of data management usually remains untouched.

The master data files are picked up by paper-specific dofiles in the “paperX” folders. The paper folder follows the same structure as the small project folder. It also contains a setup file where one can merge, append, and modify the data files to create a paper-specific master datafile. Essentially, the master datasets in the “data” folder play the role of raw data for the paper folders.

TIP: Throw out the variables you don’t need for specific papers. Every unused data point is a burden on the memory.

Useless variables that are loaded in the dataset will take up memory, and if the file is large, your computer will feel it as well. Plus, this also helps with the analysis part, especially if one doesn’t have to scroll through some 200 variables. See the Stata Tips article for suggestions on optimizing variable storage.
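A minimal sketch of this step, with made-up file and variable names, would be:

* keep only what this specific paper needs (file and variable names are made up)
use ./master/master.dta, clear
keep countryid year gdp co2 population    // throw out everything else
compress                                  // store the remaining variables efficiently
save ./master/paper1_data.dta, replace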

TIP: Write up a readme.txt file.

This might seem like a pain, but leaving readme files helps a lot! For example, they include information about the folders and files and the order in which to use them. More importantly, they can also contain information on where and how the data was accessed. Here I would even suggest leaving notes in your dofiles with the URLs from where one can find the raw files. These could be links or even table numbers from large databases like the World Bank, Eurostat, OECD, etc. There have been times when I have opened some old projects and could not figure out where to update the data from. This seems trivial during the data cleaning process, but one forgets later on!
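A readme does not have to be fancy. A short plain-text sketch along these lines (the contents are purely illustrative) already goes a long way:

myproject — readme.txt
01_raw      raw downloads, never modified by hand
02_dofiles  run in numerical order; 06_master_v1.do runs everything
03_temp     intermediate files, can be deleted and regenerated
04_master   final analysis-ready data

Data sources:
- emissions data: <URL and table number here>, accessed <date>
- economic data: <URL and table number here>, accessed <date>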

Part 2: Get those filenames in order

Files, especially scripts, should be sequenced so that you do not have to spend a lot of time figuring out which script is the latest. Usually, I use not only prefixes, but also suffixes in file names. The prefixes are the order in which files are used, and suffixes are version numbers.

Having sequence numbers prefixed to dofiles helps sort them out and align them with the workflow, while suffixes track the version numbers.

For example, I use variations of the following structure:

01_setup_v4.do
02_merge_v2.do
03_tables_v11.do
04_regressions_v15.do
05_figures_v2.do
06_master_v1.do

Here you can see that I use _v1, _v2, etc. This can be replaced with dates as well. For example, in some projects, I will have 01_setup_210531.do, 01_setup_210602.do etc.

TIP: When using dates, follow the yymmdd (year-month-day) format. This will automatically sort the files in the correct chronological order.

Here I would reiterate: avoid the day-month-year or the American month-day-year formats, since these cannot be sorted chronologically.

TIP: Move the older files away from the main folders.

Don’t collect dozens of versions in the same folder. At some point, older versions become redundant or useless. Delete them. Or, if you are a collector (like me), throw them in a temp folder. I always have a temp folder for old dofiles, tables, and graphs. Keep your main folders and sub-folders as neat as possible. Especially if you are working on a project for months, do some house cleaning once in a while. There is a good chance that those versions from a few weeks back won’t be useful anymore, and there is a chance that you might accidentally mix up the scripts by opening the wrong file.

The naming convention should not be limited to just dofiles. Do this for all the graphs, tables, paper versions etc. It might seem like a pain, but in retrospect, you won’t regret it. Especially if you don’t have to sift through hundreds of versions to find the one that you really need.

Part 3: Using relative paths

Relative paths are key to a smooth workflow. Here the idea is that you should set the path to your main directory once. From this point on, everything should be relative to that directory. It is sometimes ok to directly set a subdirectory as well, for example, due to restrictions of some packages that do not allow cross folder loading and/or saving of files (e.g. spmap). But if you can avoid this, then avoid it.

So what do we mean by all of this? Let’s go back to the first folder structure shown at the beginning:

Let’s assume that the main folder is called “myproject” and it is on your D: drive (on Windows). We can point to this folder in the dofile as follows:

clear
cd "D:/myproject"

Note here that I use clear, which clears everything in the memory. Then I specify cd <folder path>. Also note here that I use forward slash /.

TIP: For file paths ALWAYS use forward slashes /.

Backslashes \ are reserved for special operations and can sometimes cause errors under certain circumstances. They do work MOST of the time, but avoid using them!

The folder name has no spaces or special characters, so the double quotes are not necessary. But if the path is enclosed in double quotes anyway, it does not matter.

Here I would like to point to another scenario. Assume you have your files synced on Dropbox and you are working across multiple computers (e.g. home and office). It is very likely that the paths will be different across the computers. While there are packages to find the Dropbox main folder path, a simple trick is to just use Stata’s capture command:

clear
cap cd "C:/Program files/Dropbox/myproject" // home
cap cd "D:/Program files (x86)/Dropbox/myproject" // office

The capture or cap command skips Stata errors and moves on to the next line. Without capture, Stata would just give a red error code on the screen and stop the script. This also makes cap an extremely risky command that should be used with caution, especially when doing analysis. There are, of course, other ways of setting file paths, for example in the Stata profile file (see help profile), or using some file path packages, e.g. dropbox (ssc install dropbox), but I leave these up to you to explore.

In the above code we are telling Stata to set the first path. If it doesn’t find the path, it will give an error. But the capture command says: skip the error and execute the next line, where another path is defined. One of the two paths has to be correct; otherwise the rest of the script will not work, but Stata will let you know in that case. If you are collaborating with others, for example on Dropbox or some network drive, then you can keep adding as many cap cd lines as you want.

Also note the use of // at the end of the line for commenting on the code.

TIP: use // to comment at the end of a line of code!

Here // is particularly useful since it allows you to add a comment after the code on the same line. For example:

*** here is my first regression
reg y x1 x2 x3 // some comment here

whereas * is basically used for marking out code but can also be used to leave notes on new lines. More on commenting in Part 6.

Coming back to relative paths: if the above code runs correctly, we should be in the main “myproject” directory. From this point onward, the subdirectories can be accessed via relative paths. For example, if you have a data file “rawdata.xls” in the raw folder, it can be read as follows:

import excel using ./raw/rawdata.xls, firstrow clear

The more crude way of doing it would be:

import excel using "C:/Program files/Dropbox/myproject/raw/rawdata.xls", clear first

and here you would have to correct the path every time, especially if you are working on multiple computers. The chances of mistakes and errors are also higher when using full paths, making it a frustrating experience. Also note that since there are spaces in the folder names, everything is enclosed in double quotes.

Similarly, once you are done with the cleaning, you can use relative paths to save the file in the temp folder:

import excel using ./raw/rawdata.xls, firstrow clear

<some data cleaning stuff>

order var1 var2 x* y*
sort var1 var2
compress
save ./temp/rawdata.dta, replace

The middle part, where we generate variables, clean them, find outliers, label them, label the values, etc., is more or less what most people learn first. So let’s focus on the last part, where we finish up after the cleaning is done.

First step: order the variables using order. This helps organize the columns. For example, if your data has variables like countryid, countryname, provinceid, provincename, districtid, districtname, year, month, etc., then order them in this way in the dataset as well. Whatever is not specified in the order list is moved after the ordered variables (see help order for more details).
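For the example just described, that would simply be:

order countryid countryname provinceid provincename districtid districtname year month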

Next, we sort the data to make it look neater when viewing the file. We also compress the variables to store them more efficiently. For example, text variables might be stored with a larger character length than required, perhaps due to white spaces. Compress also checks whether numeric variables can be stored more efficiently, for example, by converting doubles to floats or integers (to learn more about number precision, see the Power of Precision guide). Compress can save tremendous amounts of space, especially when working with a very large dataset.

In the last step, the file is saved in the temp folder ./temp/, which also makes use of a relative path. Here I use the same name as the raw file. This is not strictly necessary, but if you are cleaning dozens of files, the last thing you want to do is figure out which file was saved as which. It is usually good if you can trace your intermediate files back to the raw data files, especially if you are generating dozens of them using loops. For one or two raw files this is not really necessary, but it is good to start internalizing these rules!
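As a sketch of this idea (the file names are hypothetical), a loop that keeps the temp files traceable to the raw files could look like this:

* clean several raw files in one loop, reusing the original names (hypothetical file names)
foreach f in emissions economic population {
	import excel using ./raw/`f'.xlsx, firstrow clear
	* <cleaning steps here>
	compress
	save ./temp/`f'.dta, replace
}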

Moving in and out of sub-directories and relative paths

Let’s now explore the scenario where you change the directory to a subfolder. For example, you might want to switch to the GIS directory to make life easy when dealing with shapefiles. But this applies to any subfolder. Changing to a subdirectory in Stata can be done in two ways:

clear
cd "C:/Program files/Dropbox/myproject/GIS"

or using relative paths:

clear
cd "C:/Program files/Dropbox/myproject/"
cd ./GIS

Once in the GIS folder, you can process the shapefiles etc. Let’s say you want to generate a map and save it in the “graphs” folder, which is one directory up and then another directory down again. Again we can make use of relative paths here:

clear
cd "C:/Program files/Dropbox/myproject/"
cd ./GIS
<some shapefile processing here>
spmap <commands>
graph export ../graphs/map1.png, replace

Here note that to access the graphs folder we use double dots .. which means move up one directory and then go down one directory in the graphs folder.

Similar logic can be applied to navigate the directory tree. Let’s say you were in another directory inside GIS called for example “country1”. The relative path to the graphs folder would be defined as:

clear
cd "C:/Program files/Dropbox/myproject/"
cd ./GIS/country1
<some shapefile processing here>
spmap <commands>
graph export ../../graphs/map1.png, replace

where we move up twice ../../ and then down once into graphs/.

Navigating like this can get fairly confusing. Try to stick to the root directory so that all the directories are one level below. But if you do need to move around within folders a lot, then the next step helps:

File paths using macros (globals and locals)

Storing file paths in macros is probably the most common option seen in replication files. Macros store information either temporarily (local) or for the whole session (global). Locals disappear after a code instance ends, while globals stay in memory until you close Stata. This also makes globals a bit risky, potentially resulting in unintended consequences, but that is not so relevant for this part. What authors usually do is define the key paths at the very beginning in globals:

clear

*** replace this with your main directory path
global projectdir "C:/Program files/Dropbox/myproject"

global graphdir "$projectdir/graphs"
global tabledir "$projectdir/tables"

cd "$projectdir"
cd "$graphdir"
cd "$tabledir"

Notice how the main project path contains the full directory path, while the graph and table paths are defined relative to it. Therefore, you just need to define the projectdir global and the rest sorts itself out. Globals are called with the dollar sign, e.g. $graphdir.

This can be done with locals as well, but that means everything has to run in one go, so stick to globals if you are swapping between files and working on different code chunks.

TIP: Make sure global names don’t clash with variable names.

Try and keep global names as unique as possible in order to avoid scripts getting messed up.
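One simple convention, shown here as a suggestion rather than a rule, is to prefix globals with the project name so they cannot be confused with variable names:

global myproj_dir    "C:/Program files/Dropbox/myproject"
global myproj_graphs "$myproj_dir/graphs"
global myproj_tables "$myproj_dir/tables"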

Now one can go into whichever directory and use the global names to access the correct folders. The earlier GIS example with globals would look like this:

clear

global projectdir "C:/Program files/Dropbox/myproject"
global graphdir "$projectdir/graphs"
global tabledir "$projectdir/tables"

cd "$projectdir/GIS"
<some shapefile processing here>
spmap <commands>
graph export "$graphdir/map1.png", replace

This can also be very useful for large projects where the master datasets are stored somewhere else entirely. For example, I usually dump very large files on my hard drive and leave the scripts, data subsets, tables, and graphs on Dropbox. Globals therefore help me navigate completely separate parts of the project very efficiently. Just imagine if you replaced the globals or the relative paths with full paths; it would immediately become messy!

Just for the sake of completeness, if we replace globals with locals, they would look like this:

clear
local projectdir "C:/Program files/Dropbox/myproject/"
cd "`projectdir'"

but this sort of use of locals is highly unusual.

Part 4: Different dofiles for DIFFERENT tasks!

Use different dofiles for different tasks. Repeating the same example provided earlier:

01_setup_v4.do
02_merge_v2.do
03_tables_v11.do
04_regressions_v15.do
05_figures_v2.do

we can see that the first dofile, 01_setup_v4.do, loads the raw data, cleans it up and dumps it in the temp folder. Here one can process all the raw files or even have different setup dofiles for different raw files. For example, I can have

01_setup_emissions_v1.do
01_setup_economic_v4.do

Since they are prefixed with 01_, these files will be grouped together in the folder. The second file, 02_merge_v2.do, takes all the cleaned datasets and puts them together to create the master file. Everything up till here is data cleaning. Unless you are very lucky and have a very clean set of files, this will be 60–70% of the time spent with the data. You can also do version control with dates as suffixes, but remember to stick to the yymmdd rule. The next two files, 03_tables_v11.do and 04_regressions_v15.do, conduct the analysis, while 05_figures_v2.do generates the figures.
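As a rough sketch of what the merge dofile might contain (the file and variable names below follow the setup example above but are otherwise illustrative):

* 02_merge_v2.do (sketch, hypothetical file and variable names)
use ./temp/emissions.dta, clear
merge 1:1 countryid year using ./temp/economic.dta, nogen
order countryid year
sort  countryid year
compress
save ./master/master.dta, replace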

Some authors also prefer to use different dofiles for different tables and figures, etc. This is absolutely fine. What is important to remember is that the file name should reflect the figure or table number. For example:

03_table1_v1.do
03_table2_v4.do
04_regression1_v2.do
04_regression2_v6.do
05_figure1_v1.do
05_figure3_v5.do

and so on.

At the beginning of a project, the sequence of tables and figures is usually not clear. It is mostly at the end of the project, when things start getting finalized, that the sequence of outputs gets defined.

Some tips for starting new projects

The initial phase of a project is usually a messy one, even if everything is structured in folders. Hence, dofiles tend to be messy as well. For example, an early-stage workflow might look like this:

import data > generate some variables > make some figures > generate more variables > some more figures/tables > modify some data > do some complex loop to generate regressions.

Here I would suggest two things:

TIP: Partition the code within the same dofile

If you are generating different figures/tables from the same dofile, write it up such that you can split it up later into different code blocks within the dofile (see the sketch below), or across different dofiles.
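A partitioned dofile could look something like this, using the star banners discussed in Part 6:

*************************
***      Figure 1     ***
*************************

<code for figure 1 here>

*************************
***      Table 2      ***
*************************

<code for table 2 here>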

TIP: Move the variable generation to the beginning of the dofile or ideally in the setup or merge dofiles.

Ideally, the core variables needed across all the files should go in the setup file. Or, if you are generating variables from different files, they can also go in the merge.do file after the cleaned files are put together and before the final master datasets are saved. If you do need to generate variables in figure/table-specific files (for example, if you are collapsing or reshaping data), then don’t do it in the middle of the dofile; bunch them all up together at the beginning of the dofile and comment them as much as possible.
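For example, a figure-specific dofile might start like this (the variable names are made up), with all the generated variables bunched at the top:

use ./master/master.dta, clear

* all paper-specific variables generated here, at the top (made-up variables)
gen log_gdp = ln(gdp)
gen co2_pc  = co2 / population

* <collapse/reshape and figure code follows below>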

Part 5: Linking the dofiles

In the last section, we discussed using different dofiles for different tasks. Once these files are relatively advanced, the master dofile should run all the other dofiles and compile everything for your paper or project. In essence, the master dofile is one very long script split into different dofiles. Let’s go back to the example above:

01_setup_v4.do
02_merge_v2.do
03_tables_v11.do
04_regressions_v15.do
05_figures_v2.do
06_master_v1.do

The last dofile, 06_master_v1.do, runs all the previous dofiles to compile everything for the project. Its structure should look something like this:

**** <some project info here> ****

clear
cd <set directory here>
<set some locals and globals here>

// run all the dofiles
do ./dofiles/01_setup_v11.do
do ./dofiles/02_merge_v2.do
do ./dofiles/03_tables_v4.do
do ./dofiles/04_regressions_v5.do
do ./dofiles/05_figures_v2.do

*** END OF FILE ***

The aim of your project is to reach this file so you can run everything in one go without breaking the code.

Here you can see a sample of my COVID-19 Tracker project done in Stata. The master file is actually two files, simply because the run time is extremely long and individual files sometimes give errors if there are changes in the raw data that is downloaded directly from the internet. The file on the left runs all the country files, while the file on the right appends them and then goes on to prepare the master file and other files for graphs and maps. Here you can also see the use of file numbering, relative paths, commenting, and spacing. All the packages that are needed to run the files and compile the figures are also there, but commented out.

Here are three additional tips to make sure this file and individual country files run without hiccups.

Tip 1: Each dofile should be able to run on its own. At no point should you make it depend on the master file or on some directory path defined elsewhere. A lot of authors who publish their code do this: when sharing their code, they strip the individual dofiles of their standalone functionality. Whether this is intentional or not, it is extremely annoying to deal with, and maybe even a bit off-putting, especially for researchers who are not comfortable unpacking very complex interlinked code. We will deal with unpacking code written by other users in another guide. Even if everything depends on the master file, each individual dofile should contain the syntax required for it to run on its own, even if it is commented out.

Tip 2: Similar to the tip above, sometimes authors define globals and locals only in the master file and strip the individual dofiles of these macros, making them useless on their own. For example, let’s assume that you have a fixed set of controls that are used in multiple regression files. Rather than copy-pasting the variables, one can define a local or global and pass it on to the regressions:

global depvars "var1 var2 var3"xtreg y $depvars
xtreg y $depvars, fe

This makes the code compact and neat, and one also avoids errors. The same set of globals can be used across multiple dofiles. If they are defined only once, that is the safest option, since they only need to be checked once.

But if they are defined only once, for whatever reason, it is good to leave a note in the individual dofiles on where these macros are defined. Plus, naming conventions matter here as well. Names like these will throw anyone off:

$depvars1
$depvars1_1
$depvars11
$depvars2_1

And if these are used in various loops which also do a bunch of other operations besides running regressions, then it will just be confusing. One does not notice these things when one is in the flow and at one with the code, but later on this can become messy very fast. Again, when looking at replication files, where one has to “unpack” the code, poor naming conventions make a lot of difference.

So try to name the globals in a way that makes them easy to follow for replication, and make sure individual files have them duplicated even if they are commented out. Or at least leave some comments on where to find them. For example, look at this code:


*** globals for regressions set in the master.do file
foreach x of global depvars {
	xtreg `x' $indvars1, fe
	xtreg `x' $indvars1 $controls1, fe
	xtreg `x' $indvars1 $controls1 $controls2, fe
}

This is fairly minimalistic for running a large set of regressions, and one usually finds this sort of structure in dofiles, but it can also be very confusing if you just see this code without knowing what the globals are, or if you have to hunt them down.

Tip 3: If you are running multiple dofiles, make sure that when an individual dofile finishes executing, it always resets to the directory level that is needed to run the next dofile. In the left screenshot above, individual country dofiles sometimes navigate inside country-specific raw folders for checks and merges, but each country dofile resets back to the root directory at the very end. This allows me to run the next dofile without errors. If at some point a dofile gives an error, the code will stop and sometimes the paths will be left in the wrong place.
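A hypothetical country dofile, for example, would end by pointing back to the root:

*** inside a country-specific dofile (sketch) ***
cd ./raw/country1
* <country-specific checks and merges here>
cd "$projectdir"    // reset to the root directory so the next dofile starts from a known place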

In summary, it is OK to make the individual dofiles dependent on the parameters defined in the master dofile (directory, macros, packages, etc.), but each dofile should be able to run on its own, or it should at least contain the information that allows it to run on its own.

Part 6: Code styling

Here I would emphasize three things: commenting your code, splitting your code across lines, and using tabs for code spacing.

Comments and commenting out code

Stata allows commenting in three ways:

*  comments on individual lines
// comments on individual lines and after some code
/* for marking out code blocks */

Use these as much as possible. The stars can even be used to format your dofiles to make them stand out more. For example, I sometimes use this to start a dofile with notes and comments:

*************************
*************************
***                   ***
***  The Stata Guide  ***
***     Tutorials     ***
***                   ***
*************************
*************************

I also use stars to section and partition the dofiles:

************************
***  COVID 19 data   ***
************************

insheet using "https://covid.ourworldindata.org/data/owid-covid-data.csv", clear

gen date2 = date(date, "YMD")
format date2 %tdDD-Mon-yy
drop date
ren date2 date

save "./data/OWID_data.dta", replace

*********************************
***  Country classifications  ***
*********************************

copy "https://github.com/asjadnaqvi/COVID19-Stata-Tutorials/blob/master/master/country_codes.dta?raw=true" "./data/country_codes.dta", replace

This just makes the dofile look neater and easier to navigate by adding a sort of visual bookmark.

Stars can be used to mark out individual lines of code as well:

*spshape2dta "wien_building.shp", replace saving(wien_building)
*spshape2dta "wien_leisure.shp" , replace saving(wien_leisure)
*spshape2dta "wien_roads.shp" , replace saving(wien_roads)
*spshape2dta "wien_water.shp" , replace saving(wien_water)
*spshape2dta "wien_railway.shp" , replace saving(wien_railway)

Here we don’t need to generate the shapefiles every time since this is a one-time process. But we still keep the code in the dofile and mark it out. Sometimes I also specify packages that need to be installed in the dofiles and mark them out:

** this dofile needs these packages
*ssc install geo2xy, replace
*ssc install palettes, replace

The first code block can be marked out in one go as follows:

/*
spshape2dta "wien_building.shp", replace saving(wien_building)
spshape2dta "wien_leisure.shp" , replace saving(wien_leisure)
spshape2dta "wien_roads.shp" , replace saving(wien_roads)
spshape2dta "wien_water.shp" , replace saving(wien_water)
spshape2dta "wien_railway.shp" , replace saving(wien_railway)
*/

This is particularly useful if you have test code inside your dofiles that is then adapted to work with some loops or other large code.

The two forward slashes can be used to add notes to the dofiles:

* Raw file downloaded from
* https://datahelpdesk.worldbank.org/knowledgebase/articles/906519
gen region = .

replace region = 1 if group29==1 // North America
replace region = 2 if group20==1 // Latin America and Caribbean
replace region = 3 if group10==1 // EU
replace region = 4 if group26==1 // MENA
replace region = 5 if group37==1 // Sub-saharan Africa
replace region = 6 if group35==1 // South Asia
replace region = 7 if group6 ==1 // East Asia and Pacific

Here I used * to say where I got the raw file from and // to identify the country groups. Otherwise I would have no idea what these numbers mean!

Splitting the code across lines

Using the three forward slashes /// one can split code across lines. This really helps with making graphs, whose code can look fairly incomprehensible on a single line.

For example look at this single line code of a fairly basic graph:

twoway (connected y1 x1, msize(vsmall))(connected y2 x2, msize(vsmall)), aspect(1) xlabel(-1(0.5)1) ylabel(-1(0.5)1) xline(0) yline(0)

and if we split it up across lines:

twoway ///
(connected y1 x1, msize(vsmall)) ///
(connected y2 x2, msize(vsmall)) ///
, ///
xlabel(-1(0.5)1) ylabel(-1(0.5)1) ///
xline(0) yline(0) ///
aspect(1)

it looks much neater. Another way to split code across lines is to use the delimit command (help delimit), but I am not a big fan of it since /// is fairly fast and easy to use.
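For completeness, the same graph with delimit would look roughly like this:

#delimit ;
twoway
	(connected y1 x1, msize(vsmall))
	(connected y2 x2, msize(vsmall))
	, xlabel(-1(0.5)1) ylabel(-1(0.5)1)
	  xline(0) yline(0) aspect(1) ;
#delimit cr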

For example have a look at this graph code that I regularly post on Twitter:

This makes use of spaces and /// and is very easy to read. This whole code block in one line would drive anyone crazy. At least, here I can easily modify similar elements (like marker sizes or line widths) without having to search for them in one long code syntax.

Tabs for spacing

Tabs (usually the button next to Q on the keyboard) help space the code neatly. This is a highly underused formatting tool, and I cannot emphasize enough how much it helps. The screenshot above also uses tabs to align the different code lines.

Let’s space the code shown earlier using tabs:

twoway                                   ///
	(connected y1 x1, msize(vsmall)) ///
	(connected y2 x2, msize(vsmall)) ///
	,                                ///
	xlabel(-1(0.5)1)                 ///
	ylabel(-1(0.5)1)                 ///
	xline(0)                         ///
	yline(0)                         ///
	aspect(1)

The last three code blocks are doing exactly the same thing, but the last one is way easier to read.

Commenting for sanity

Here is another example where we combine line splitting with spacing and commenting:

drop if ///
_ID==14 | /// // Puerto Rico
_ID==28 | /// // Alaska
_ID==38 | /// // American Samoa
_ID==39 | /// // United States Virgin Islands
_ID==43 | /// // Hawaii
_ID==45 | /// // Guam
_ID==46 // North Mariana Islands

The IDs are also sequenced in numerical order. The same code in a single line would look like this:

drop if _ID==14 | _ID==28 | _ID==38 | _ID==39 | _ID==43 | _ID==45 |  _ID==46

or, neater still, the inlist() function:

drop if inlist(_ID, 14, 28, 38, 39, 43, 45, 46)

but remember that inlist has limits on how many conditions can be specified (see help inlist).

Without comments on the IDs, the job still gets done, since the code is perfectly fine, but if I look at it months later, I will have to reverse engineer the IDs to figure out which regions they refer to.

So use these tips generously! They are mostly about styling your code, and they make life much easier both for yourself and for anyone reading your code.

And that is it for this guide! I hope you found it useful. If I missed something, or you have suggestions for what else should be covered, please reach out.

About the author

I am an economist by profession and I have been using Stata since 2003. I am currently based in Vienna, Austria. You can see my profile, research, and projects on GitHub or my website. You can connect with me via Medium, Twitter, LinkedIn, or simply via email: asjadnaqvi@gmail.com. If you have questions regarding the Guide or Stata in general post them on The Code Block Discord server.

The Stata Guide releases awesome new content regularly. Subscribe, Clap, and/or Follow the guide if you like the content!
