Tips to Improve Your Stata Code

Easier collaborations

Arieda Muço
Train Your Coauthor
Oct 7, 2024 · 6 min read


Photo credit: Emile Perron on Unsplash

Most economists are self-taught coders, primarily in Stata. Unlike professional software developers, we rarely benefit from collaboration on code or thorough code review, so our code is often inefficient and buggy. Even with the introduction of data editors in economics, most checks focus on whether the code runs, rather than on its quality or efficiency.

Frequently, we will write or encounter code that looks like this:

* Cleaning the first dataset
import excel "Data/Districts/Region1 Changes.xlsx", clear first

unab varnames: DIST* MUN*
foreach var in `varnames' {
    replace `var' = lower(`var')
    replace `var' = subinstr(`var', ".", "", .)
    replace `var' = subinstr(`var', "'", "", .)
    replace `var' = subinstr(`var', "-", "", .)
    replace `var' = subinstr(`var', " ", "", .)
    replace `var' = subinstr(`var', "(", "", .)
    replace `var' = subinstr(`var', ")", "", .)
}

save "Data/Districts/Cleaned Region 1.dta", replace

* Cleaning the second dataset
import excel "Data/Districts/Region2 Changes.xlsx", clear first

unab varnames: DIST* MUN*
foreach var in `varnames' {
    replace `var' = lower(`var')
    replace `var' = subinstr(`var', ".", "", .)
    replace `var' = subinstr(`var', "'", "", .)
    replace `var' = subinstr(`var', "-", "", .)
    replace `var' = subinstr(`var', " ", "", .)
    replace `var' = subinstr(`var', "(", "", .)
    replace `var' = subinstr(`var', ")", "", .)
}

* Appending Region 1 data to Region 2
append using "Data/Districts/Cleaned Region 1.dta"

* Renaming variable for clarity
rename ID DIST_CODE
sort DIST_CODE

* Saving the final combined dataset
save "Data/Districts/Combined Region Changes.dta", replace

This is a relatively decent piece of code — it does what it should. However, it’s neither the best nor the most efficient Stata code you could write.

In this article, I’ll break things down into two parts. First, I’ll explain what the code does well. Then, I’ll discuss where it can be improved and how to implement those improvements.

What the code does well:

  1. Use of relative paths: One of the strong points of this code is its use of relative paths; not a single line relies on an absolute path. This improves portability, allowing the code to run across different environments without modification.
  2. Consistency and standardization: The code follows a consistent pattern in naming datasets and cleaning them (for both Region 1 and Region 2). It ensures that variable names are similarly cleaned and standardized by converting them to lowercase and removing unwanted characters (e.g., periods, spaces, parentheses). Overall, the author of this piece of code has made sure that there is uniformity, which is crucial when doing analysis.
  3. Incremental Saving: The code saves intermediate steps (Region 1 cleaned, combined datasets) to separate files. This approach can serve as a checkpoint, allowing for easier debugging or reusing specific datasets without re-running the entire code.
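
On the first point: relative paths only resolve correctly if every coauthor launches Stata from the same project root, so a common convention is to set the working directory once at the top of the main do-file. A minimal sketch (the folder name here is a placeholder, not from the original project):

```stata
* Set the project root once, at the top of the main do-file.
* Each coauthor adjusts only this line ("~/projects/districts" is hypothetical).
cd "~/projects/districts"

* All later paths can then stay relative, exactly as in the code above.
import excel "Data/Districts/Region1 Changes.xlsx", clear firstrow
```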

Main areas for improvement and (suggested) solutions:

Spaces in File Names: Using spaces in file names, such as "Data/Districts/Region1 Changes.xlsx", can cause issues when sharing code across different systems or environments. Spaces can lead to compatibility problems or errors, particularly in tools and shells that do not handle file paths with spaces well.

How to Improve? Replace spaces with underscores or dashes, e.g., "Data/Districts/Region1_Changes.xlsx". This makes file names more robust, consistent, and compatible across platforms.
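
To see why this matters in Stata specifically: an unquoted path is split at the space, so commands on such files only work when the path is wrapped in quotes, while an underscore name is robust either way. A small illustration (assuming the files exist):

```stata
* Breaks: Stata parses the space as a token separator
* use Data/Districts/Cleaned Region 1.dta, clear

* Works, but only because of the surrounding quotes
use "Data/Districts/Cleaned Region 1.dta", clear

* Robust with or without quotes
use "Data/Districts/Cleaned_Region1.dta", clear
```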


Unnecessary Permanent Saving of Data Files: The code permanently saves intermediate datasets (such as the cleaned Region 1 data) to disk. These files are only needed while the script runs, so they end up cluttering the project directory.

How to improve? Instead of saving intermediate datasets permanently, we can use tempfile to create temporary files, which automatically get deleted after the script finishes running. This avoids cluttering your directories with unnecessary intermediate files.

Example:

* Create a temporary file for the cleaned data
tempfile Region1

* Cleaning the first dataset
import excel "Data/Districts/Region1_Changes.xlsx", clear first

unab varnames: DIST* MUN*
foreach var in `varnames' {
    replace `var' = lower(`var')
    replace `var' = subinstr(`var', ".", "", .)
    replace `var' = subinstr(`var', "'", "", .)
    replace `var' = subinstr(`var', "-", "", .)
    replace `var' = subinstr(`var', " ", "", .)
    replace `var' = subinstr(`var', "(", "", .)
    replace `var' = subinstr(`var', ")", "", .)
}

* Save cleaned data to the temporary file
save "`Region1'", replace

Code Repetition: The cleaning process for both datasets is duplicated, which makes the code longer and harder to maintain. If the process needs to be updated, changes would have to be made in multiple places, increasing the risk of errors.

How to Improve? The cleaning steps can be modularized into a reusable function or loop, which processes both datasets sequentially. This approach keeps the code clean, reduces repetition, and makes it easier to update if needed.

If you’re not familiar with ado files, here’s a simple and effective solution: you can create a separate do file for the cleaning process and call it whenever needed. This way, you modularize the code without needing to write a full Stata program.

How to Implement:

  1. Create a clean_string.do file:
* clean_string.do: A file to clean and standardize string variables
unab varnames: DIST* MUN*

foreach var in `varnames' {
    replace `var' = lower(`var')
    replace `var' = subinstr(`var', ".", "", .)
    replace `var' = subinstr(`var', "'", "", .)
    replace `var' = subinstr(`var', "-", "", .)
    replace `var' = subinstr(`var', " ", "", .)
    replace `var' = subinstr(`var', "(", "", .)
    replace `var' = subinstr(`var', ")", "", .)
}
  2. Call the clean_string.do file in your main script:
* Cleaning the first dataset
import excel "Data/Districts/Region1_Changes.xlsx", clear first
do clean_string.do
tempfile Region1
save "`Region1'", replace

* Cleaning the second dataset
import excel "Data/Districts/Region2_Changes.xlsx", clear first
do clean_string.do

* Appending the cleaned Region 1 data
append using "`Region1'"

* Renaming variable for clarity
rename ID dist_code
sort dist_code

* Saving the final combined dataset
save "Data/Districts/combined_regions.dta", replace

Why this works well:

  • Modularity: The cleaning process is stored in a separate file (clean_string.do), making the main script cleaner and easier to follow.
  • Reusability: You can use the clean_string.do without duplicating code.
  • Simplicity: It’s easier than writing a full ado file or program, especially if you prefer a straightforward solution.
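
If you do eventually want to go beyond a helper do-file, the same logic can live in a small Stata program defined at the top of your script. This is only a sketch of that alternative, under the same variable-name assumptions (DIST* and MUN*) as above:

```stata
* Alternative: wrap the cleaning steps in a program instead of a do-file
capture program drop clean_string
program define clean_string
    unab varnames : DIST* MUN*
    foreach var of local varnames {
        replace `var' = lower(`var')
        * Strip each unwanted character in turn
        foreach ch in "." "'" "-" " " "(" ")" {
            replace `var' = subinstr(`var', "`ch'", "", .)
        }
    }
end
```

After the definition, each dataset is cleaned with a single `clean_string` call, playing the same role as `do clean_string.do` in the version above.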

And if you’d like an extra boost for your code, you can streamline all the steps with a forvalues loop to avoid repetition:

* Loop through the regions (Region1 and Region2) and save them to tempfiles
forvalues i = 1/2 {
    * Import each dataset
    import excel "Data/Districts/Region`i'_Changes.xlsx", clear first

    * Call the external cleaning script for string replacements
    do clean_string.do

    * Save the cleaned dataset to a tempfile
    tempfile Region`i'  // stored as Region1 or Region2
    save "`Region`i''", replace
}

* Since Region2 is still in memory, append Region1 to Region2
append using "`Region1'"

* Renaming and sorting for consistency
rename ID dist_code
sort dist_code

* Save the final combined dataset
save "Data/Districts/combined_regions.dta", replace

Why this loop works well:

  • Efficiency: The loop reduces code repetition by processing both datasets in a single sequence.
  • Scalability: This structure allows you to easily add more datasets in the future if necessary, without having to duplicate code.
  • Advanced Approach: While this approach makes the code shorter and more efficient, it’s perfectly fine to stick with the previous example, which is more straightforward to read and understand if you’re just starting out.
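
On the scalability point: if future files are named rather than numbered, the same structure carries over to a foreach loop over a list of names (Region3 here is invented purely for illustration):

```stata
* Hypothetical extension: loop over named regions instead of 1/2
foreach r in Region1 Region2 Region3 {
    import excel "Data/Districts/`r'_Changes.xlsx", clear firstrow
    do clean_string.do
    tempfile `r'
    save "``r''", replace
}
```

Adding a dataset then means adding one name to the list, with no other changes.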

Personal Note: I really dislike capitalized variable names. While there’s no strong technical reason against using them, I find switching to capital letters while coding quite annoying. If you’re one of my co-authors reading this, please take note. 😉

Some time ago, we wrote a guide on Stata for Economists, which you may find useful. Take a look: https://datacarpentry.org/stata-economics/

Thank you for taking the time to read my thoughts. If you enjoyed the article, feel free to reach out at arieda.muco@gmail.com or on Twitter, LinkedIn, or Instagram. Feel free to also share this article with others.

Written by Arieda Muço

Econ Ph.D., Researcher, Coding, ML and NLP Instructor
