Leveraging Python with Stata: Part 1

Collin Zoeller
11 min read · Aug 11, 2024

An under-utilized feature can dramatically decrease your data headaches and improve your Stata workflow.

modern computer monitor with code and data symbols on the screen
Image Generated by DALL-E

If you use Stata for research, work, or education, you have no doubt encountered some frustrating roadblocks that have nothing to do with your data and everything to do with Stata. This is especially true if you approach programming in Stata the same way as in Python. Maybe you have spent hours in the documentation or the help menu attempting to “hack” your way through. I have been there. I went so far as to tune a custom GPT to translate Python code into Stata and vice-versa.

“The most disastrous thing that you can ever learn is your first programming language.” (Alan Kay)

Fortunately, Stata 16+ leverages the strengths of both Python and Stata in a surprisingly unified way. Its PyStata API and the ability to call Python from within a running Stata session greatly improve productivity when building data tasks. In this two-part series, we will walk through the in-session Python feature and the PyStata API and demonstrate several cases in which both are especially useful. This article covers only Part 1 of the series: using Python in Stata. Check out Part 2 to learn how to use the PyStata API to run Stata through Python.

Prerequisites:

  • Stata 16+ license and software
  • Python 3.0+
  • Pandas library 1.0.0+
  • basic familiarity with Python and Stata

Accessing Python in Stata

Before accessing certain Pythonic features in Stata, we need to set the PATH to include the built-in packages shipped with Stata. While the code used here is specifically for Unix-like systems (Linux and macOS), the concept is similar and extends to any platform. Be sure to check the documentation for your specific platform.

First, let’s get the path to the Stata distribution. We can do this by running this command in Stata.

. display "`c(sysdir_stata)'"

Let’s say the Stata distribution is located in /usr/local/stata. Then, after opening .bash_profile in an editor, we simply append its utilities folder to PATH, save and quit the editor, and reload the terminal.

For more information about editing .bash_profile, consider reading this Medium article or StackExchange forum.

export PATH=$PATH:/usr/local/stata/utilities

Once this one-time setup is completed, accessing Python in the console or do-file is pretty straightforward:

python:
#pythonic code
end

** Stata code **

And we can define in-line functions just as easily:

python:
def print_number(num):
    print("This is the Number: ", num)
end

** Stata Code **
local number = mod(256, 3)
python: print_number(`number')

When Stata reads python: it activates an instance of Python and treats every subsequent command as Pythonic code until it hits end. Typing python in the Stata console is equivalent to typing python in a bash terminal in that it opens an instance of interactive Python. This means that we can interact with Python in the same way we interact with Stata in the Stata console — except we are writing in Python.

Stata window with the python module activated
Author’s Image

This does not mean we have simply changed our input language to Python and can now manipulate our data as we would in Python. The Python and Stata instances occupy different areas of the system’s memory, so this new instance of Python starts out empty. To allow Python code to affect a Stata process, we lean on a powerful library shipped with the Stata distribution.

Stata Function Interface (SFI)

python:
import sfi  # Stata Function Interface
import numpy as np
sfi.Macro.setLocal("macroname", "some value")  # create a local macro
sfi.Macro.getLocal("macroname")  # retrieve a local macro
sfi.Data.getAsDict(var="varname", missingval=np.nan)  # make a dict() of data in memory
end

** Stata code **
di "`macroname'"

Stata and Python code are like little islands in the system’s memory. We can think of the Stata Function Interface (SFI) Python library as a bridge that connects them, allowing information from either island to move to the other. Because SFI is shipped with Stata, it is not available from package repositories like PyPI or conda-forge.

Bridges connecting islands in an idealistic tropical scene
Image from DALL-E

There are a host of useful classes in this library, so it is worth taking a gander at the documentation. The library exposes most of Stata’s functionality to Python, which extends Stata to a new world of possibilities. Let’s consider a few examples.
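Before the examples, here is a minimal sketch of the "bridge" idea that also runs outside Stata. It relies on the `Macro.setLocal` and `Macro.getLocal` calls from `sfi`; the fallback `Macro` class is a hypothetical stand-in (an assumption for illustration, not part of Stata) that mimics those two methods with a dictionary so the snippet can be tried in plain Python.

```python
try:
    # Available only inside Stata's embedded Python
    from sfi import Macro
except ImportError:
    class Macro:
        """Hypothetical stand-in that mimics two sfi.Macro methods with a dict."""
        _locals = {}

        @staticmethod
        def setLocal(name, value):
            # Stata macros hold text, so the value is stored as a string
            Macro._locals[name] = str(value)

        @staticmethod
        def getLocal(name):
            return Macro._locals.get(name, "")


# Send a value across the bridge, then read it back
Macro.setLocal("greeting", "hello from Python")
print(Macro.getLocal("greeting"))
```

Inside a python: block in Stata, the same two calls write to and read from the local macros of the running session.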

Example 1: Ordering a Macro of Files

Suppose we have a directory of estimate files in which the names of all files are formatted uniformly: “{function}_{y}_{x}_{fixedeffects}.ster”, where function could be something like reg or logit, y is the output variable, x represents the regressors, and fixedeffects indicates which fixed-effect level was used in that regression. We want to load all of these estimate files in order and group them according to the y and x values so that we can output the results neatly using estout.

This can be surprisingly difficult in Stata, and while some Stata packages exist to help order macros, many programmers would probably start with something like this:

clear all
cd ~/estimates
* list every .ster file in the current directory
local files : dir "." files "*.ster"
local funcs reg logit
local ys price n_sold
local xs size color
local fixedeffects state year state&year

foreach func of local funcs {
    foreach y of local ys {
        foreach x of local xs {
            foreach file of local files {
                foreach fe of local fixedeffects {
                    if "`file'" == "`func'_`y'_`x'_`fe'.ster" {
                        estimates use `file'
                        estimates store `func'_`y'_`x'_`fe'
                        estout * using outfile.csv, append
                        estimates clear
                    }
                }
            }
        }
    }
}

Notice that this method requires a five-deep nested foreach loop and a conditional statement. In addition to requiring five local macros, every extra loop level multiplies the number of iterations: each file is compared against every combination of parameters, so the run time grows quickly with each new file or parameter value we add.
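To make that cost concrete, here is a rough count of the string comparisons the nested approach performs. The list sizes are taken from the Stata snippet above (2 functions, 2 outputs, 2 regressors, 3 fixed-effect levels); the file counts and the helper name `nested_loop_comparisons` are hypothetical, chosen only for illustration.

```python
from math import prod

# Sizes of the four parameter macros in the Stata snippet
list_sizes = {"funcs": 2, "ys": 2, "xs": 2, "fixedeffects": 3}


def nested_loop_comparisons(n_files, sizes=list_sizes):
    """Comparisons made when every file is checked against every parameter combination."""
    return n_files * prod(sizes.values())


# Doubling the number of files doubles the work; adding a parameter value multiplies it
for n_files in (24, 48, 96):
    print(n_files, "files ->", nested_loop_comparisons(n_files), "comparisons")
```

With 24 files, the loop already performs 576 comparisons to find 24 matches.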

Now let’s try this task again using the power of Python.

clear all
cd ~/estimates

python:
from sfi import Macro
from glob import glob
import os

def sort_key(filepath):
    filename = os.path.basename(filepath)
    return filename.split(".")[0].split("_")

files = " ".join(sorted(glob(os.path.join(os.getcwd(), "*.ster")), key=sort_key))
Macro.setLocal("files", files)
end

** Stata Code **
foreach file of local files {
    estimates use "`file'"
    * name the stored estimates after the file, without folder or extension
    mata: st_local("name", pathrmsuffix(pathbasename(st_local("file"))))
    estimates store `name'
    estout * using outfile.csv, append
    estimates clear
}

The Python code first defines a function sort_key() that returns a list of the identifying information separated by an underscore in the filename. This becomes the key on which the sorted() function sorts the list of file paths.

Next, it searches for all filenames in the directory with the .ster extension (the default format for Stata estimate files), sorts them on the defined key, returns a string concatenation readable by Stata’s macros, and pushes that organized list into a macro. Stata then takes over and iterates over this freshly ordered macro.

Simply ordering the macro before execution reduces a 5-deep nested loop to a single loop, reducing computational and visual complexity.
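As a quick check of the sorting step, the sketch below applies the same sort_key() to a handful of shuffled file names. The paths are hypothetical and nothing is read from disk; the point is only to show the order in which the names would reach the Stata macro.

```python
import os


def sort_key(filepath):
    """Split 'reg_price_size_state.ster' into ['reg', 'price', 'size', 'state']."""
    filename = os.path.basename(filepath)
    return filename.split(".")[0].split("_")


# Hypothetical, deliberately shuffled file names
paths = [
    "estimates/reg_price_size_year.ster",
    "estimates/logit_price_color_state.ster",
    "estimates/reg_price_color_state.ster",
    "estimates/logit_price_size_year.ster",
    "estimates/reg_price_size_state.ster",
]

ordered = sorted(paths, key=sort_key)
files_macro = " ".join(ordered)  # the string that would be pushed into the macro

for path in ordered:
    print(os.path.basename(path))
```

The files come out grouped by function first, then by y, x, and fixed effect, because Python compares the key lists element by element.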

Example 2: Hack to Read and Write Parquet Files in Stata

The Parquet file format has become popular for its compact columnar storage and fast read/write speeds, which is especially attractive for large datasets. As of Stata 18, Stata cannot directly read Parquet files. We can use our data “bridge” to bypass this limitation and send the Parquet data to Stata using the pandas library (or any Parquet reader) in Python. We can easily extend this process to any format pandas can read, including SAS, JSON, HTML, SQL tables, etc.

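clear all
cd ~/data

python:
import os
import pandas as pd
from sfi import Data

df = pd.read_parquet(os.path.join(os.getcwd(), "data.parquet"))

# create the observations and variables, then store df in Stata's memory
# (column names must be valid Stata variable names)
Data.setObsTotal(len(df))
for col in df.columns:
    if pd.api.types.is_numeric_dtype(df[col]):
        Data.addVarDouble(col)
    else:
        df[col] = df[col].astype(str)
        Data.addVarStr(col, max(1, int(df[col].str.len().max())))
    Data.store(col, None, df[col].tolist())
end

desc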

Saving to a .parquet (or other Pandas-supported format) using Python is just as easy.

python:
from sfi import Data
import pandas as pd
import os

def save_parquet(filepath):
    statadata = Data.getAsDict()  # returns dataset in memory as a dict
    data = pd.DataFrame(statadata)  # each key (variable) becomes a column
    data.to_parquet(os.path.join(os.getcwd(), filepath))
end

cd ~/data
use data.dta

python: save_parquet("newdata.parquet")
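The conversion step inside save_parquet() can be tried without Stata. In this sketch, `statadata` is a hypothetical stand-in for the dictionary returned by Data.getAsDict(), assuming it maps each variable name to a list of that variable's observations; the values are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for Data.getAsDict(): one key per variable,
# each holding that variable's observations in order
statadata = {
    "price": [4099.0, 4749.0, 3799.0],
    "state": ["PA", "OH", "PA"],
}

# Passing the dict directly makes each key a column and each list its rows
df = pd.DataFrame(statadata)

print(df.shape)
print(list(df.columns))
```

Passing the dictionary itself, rather than its values and keys separately, keeps variables as columns and observations as rows.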

Advanced Methods: Creating a Macro of Commands

Let’s think of a situation similar to Example 1, except now instead of iterating over estimate files we are going to run the commands that generate the estimate files. Here is the task:

We are given 6 output variables of interest and want to consider how the conditional means change when we vary the regressors among 2 choices. We are also interested in how these effects differ among 6 subgroups or partitions of the data (including no partitions) and if the results are consistent when we vary the fixed effect levels twice.

In all, these are 6 outputs x 2 regressors x 6 partitions x 2 fixed effect levels = 144 different regressions to run!

It is feasible to do this in Stata in the same way we first approached Example 1: by creating a macro for each parameter we want to vary and then iterating through each one.

clear all
cd ~/estimates

local ys y1 y2 y3 y4 y5 y6
local xs x1 x2
local fes `" "state year" "state" "'
local partitions1 `" "var1>1" "var1<1" "" "'
local partitions2 `" "var2==1" "" "'

foreach y of local ys {
    foreach x of local xs {
        foreach fe of local fes {
            foreach p1 of local partitions1 {
                foreach p2 of local partitions2 {
                    if "`p1'" != "" & "`p2'" != "" {
                        reg `y' `x' i.(`fe') if `p1' & `p2'
                    }
                    else if "`p1'`p2'" != "" {
                        reg `y' `x' i.(`fe') if `p1'`p2'
                    }
                    else {
                        reg `y' `x' i.(`fe')
                    }
                }
            }
        }
    }
}

Let’s view this problem from a different angle.

At the end of the day, we just want to run similar specifications with varying parameters. For example,

reg y1 x1 i.fe1 i.fe2
reg y1 x1 i.fe1
reg y1 x1
reg y1 x1 i.fe1 i.fe2 if partition_var==1
reg y1 x1 i.fe1 if partition_var==1
reg y1 x1 if partition_var==1
...
reg y6 x2 i.fe1 i.fe2 if partition_var==1
reg y6 x2 i.fe1 if partition_var==1
reg y6 x2 if partition_var==1

We know that Stata can run a series of commands if we put them in a macro.

local cmds `" "reg y1 x1 i.fe1 i.fe2" "reg y1 x1 i.fe1" "reg y1 x1" "'
foreach cmd of local cmds{
#delimit ;
`cmd';
#delimit cr
}

It would certainly be tedious to manually input all 144!

We could instead use Python to create a Cartesian product of the arguments and build the macro automatically for all 144 regression specifications.

First, let’s go to Python and define the arguments. Below is only Python code (instead of Python within Stata) for the sake of clarity and reproducibility.

# Define arguments: output, regressors, fixed effects, and subgroups
args = {
    "y": ['y1', 'y2', 'y3', 'y4', 'y5', 'y6'],
    "x": ['x1', 'x2'],
    "fe": ["state year", "state"],
    "partition": ['var1>1', 'var1<1', '', 'var2==1', 'var1>1 & var2==1', 'var1<1 & var2==1']
}

Now let’s define the skeleton of the command we want to pass into Stata. We’ll put placeholders for the values that will change.

eststo: reg {y} {x} i.({fe}) if {partition}

We want to create a macro that lists all possible combinations of arguments in the command. This is where the Cartesian product comes in handy.

from itertools import product

cartesian_product = list(product(*args.values()))
print(len(cartesian_product))
for tup in cartesian_product:
    print(tup)

When you run this in Python, you get a result like this:

144
('y1', 'x1', 'state year', 'var1>1')
('y1', 'x1', 'state year', 'var1<1')
('y1', 'x1', 'state year', '')
('y1', 'x1', 'state year', 'var2==1')
('y1', 'x1', 'state year', 'var1>1 & var2==1')
('y1', 'x1', 'state year', 'var1<1 & var2==1')
('y1', 'x1', 'state', 'var1>1')
('y1', 'x1', 'state', 'var1<1')
('y1', 'x1', 'state', '')
('y1', 'x1', 'state', 'var2==1')
('y1', 'x1', 'state', 'var1>1 & var2==1')
('y1', 'x1', 'state', 'var1<1 & var2==1')
('y1', 'x2', 'state year', 'var1>1')
('y1', 'x2', 'state year', 'var1<1')

...

('y6', 'x2', 'state year', 'var2==1')
('y6', 'x2', 'state year', 'var1>1 & var2==1')
('y6', 'x2', 'state year', 'var1<1 & var2==1')
('y6', 'x2', 'state', 'var1>1')
('y6', 'x2', 'state', 'var1<1')
('y6', 'x2', 'state', '')
('y6', 'x2', 'state', 'var2==1')
('y6', 'x2', 'state', 'var1>1 & var2==1')
('y6', 'x2', 'state', 'var1<1 & var2==1')

Now it is just a matter of putting the arguments in the command skeleton!

from itertools import product
from sfi import Macro

# Define arguments: output, regressors, fixed effects, and subgroups
# (an empty string means "no restriction" for that partition)
args = {
    "y": ['y1', 'y2', 'y3', 'y4', 'y5', 'y6'],
    "x": ['x1', 'x2'],
    "fe": ["state year", "state"],
    "partition1": ['var1>1', 'var1<1', ''],
    "partition2": ['var2==1', '']
}

cartesian_product = list(product(*args.values()))

cmds = []

# Create the command strings
for y, x, fe, p1, p2 in cartesian_product:
    conditions = [p for p in (p1, p2) if p]  # drop empty partitions
    cmd = f"eststo: reg {y} {x} i.({fe})"
    if conditions:
        cmd += " if " + " & ".join(conditions)
    cmds.append(cmd)

# create a string list of integers over which to iterate
counts = " ".join(str(i) for i in range(len(cmds)))  # macro elements are separated by " "
Macro.setLocal("counts", counts)

def get_cmd(i):
    Macro.setLocal("cmd", cmds[int(i)])

This code creates a list of tuples containing the Cartesian product of all parameters in args. It then iterates through this list, replaces the placeholders in the skeleton we developed earlier with the corresponding parameter values in each tuple, and adds the resulting string to cmds. This is now our corpus of commands to run. counts is a string of space-separated integers corresponding to the indices of cmds, so that we can easily iterate through the list of regressions.

get_cmd() takes whatever count of the iteration the code happens to be on and sets a macro for the corresponding command of our corpus. Iterating through these commands in Stata is now relatively easy.

*** Stata Code***
foreach i of local counts{
python: get_cmd(`i')
# delimit ;
`cmd' ;
# delimit cr
}

As a recap:

  1. define the space of arguments
  2. define a command skeleton
  3. create a Cartesian product of the arguments to get all possible combinations
  4. input the arguments into the skeleton for each Cartesian tuple
  5. create a list of the command strings
  6. create a macro counts of integers of size len(command list) to access each command and place that command in the current cmd macro
  7. iterate over counts to get a new command, and then run the command
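
The steps above can be sketched end to end in plain Python. This is a minimal sketch under stated assumptions: `stata_locals` is a hypothetical dictionary standing in for Stata's local macros (inside Stata, Macro.setLocal would take its place), `build_command` is an illustrative helper name, and an empty string marks "no partition".

```python
from itertools import product

# Step 1: the space of arguments (an empty string means no restriction)
args = {
    "y": ["y1", "y2", "y3", "y4", "y5", "y6"],
    "x": ["x1", "x2"],
    "fe": ["state year", "state"],
    "partition1": ["var1>1", "var1<1", ""],
    "partition2": ["var2==1", ""],
}


def build_command(y, x, fe, p1, p2):
    """Steps 2 and 4: fill the command skeleton, adding `if` only when needed."""
    conditions = [p for p in (p1, p2) if p]
    suffix = " if " + " & ".join(conditions) if conditions else ""
    return f"eststo: reg {y} {x} i.({fe}){suffix}"


# Steps 3 and 5: Cartesian product of the arguments, one command per tuple
cmds = [build_command(*combo) for combo in product(*args.values())]

# Step 6: a macro-like string of indices, plus a helper that sets the current command
stata_locals = {}  # hypothetical stand-in for Stata's local macros
stata_locals["counts"] = " ".join(str(i) for i in range(len(cmds)))


def get_cmd(i):
    stata_locals["cmd"] = cmds[int(i)]


# Step 7: iterate over the indices (only the first three here) and "run" each command
for i in stata_locals["counts"].split()[:3]:
    get_cmd(i)
    print(stata_locals["cmd"])
```

The last loop plays the role of the Stata foreach: it asks for one command at a time by index instead of holding all 144 commands in a single macro.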

To get the resulting estimate files like those of Example 1, we can just tweak the get_cmd() function to create a macro for the filename.

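def get_cmd(i):
    i = int(i)
    # unique name from the parameters, joined by "_" as in Example 1
    # (operators such as > or < may need replacing on some file systems)
    parts = [p.replace(" ", "") for p in cartesian_product[i] if p]
    filename = "_".join(parts) + ".ster"
    Macro.setLocal("filename", filename)
    Macro.setLocal("cmd", cmds[i])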

Then we adjust the Stata chunk accordingly:

*** Stata Code***
foreach i of local counts{

python: get_cmd(`i')
# delimit ;
`cmd' ;
estimates save "`filename'", replace ;
# delimit cr
}

And done!

Read Further: Read Part 2 to see how we could go even further and parallelize these regressions using Python!

Conclusion

Integrating Python into a Stata workflow can significantly enhance productivity by leveraging the strengths of both languages. Stata’s ability to run Python code within its environment opens up a range of possibilities, from simplifying complex tasks to enabling the use of modern data formats like Parquet. By utilizing Stata’s built-in Python interface, data analysis processes can be streamlined, computational complexity reduced, and more sophisticated tasks tackled with relative ease. Whether organizing files, processing data, or automating repetitive tasks, the combination of Python and Stata provides a powerful toolkit that can save time and produce much more comprehensible code. Be sure to check out part 2 of this series, where we explore how to use the Pystata API to access Stata functionalities in Python.

Follow me for more publications in the world of computational data science!

Collin Zoeller

Data Scientist, ML Enthusiast, and Research Associate at Carnegie Mellon University and the Internal Revenue Service