Use pandas to lag your timeseries data in order to examine causal relationships

Frequently in social sciences, it is difficult to see cause and effect relationships in our data. Here I explore the pandas.shift() function in Python to help us establish temporal precedence in our data in order to derive insights.

Applications where lagging our data is useful include (but are not limited to) analyzing how a change in policy affects patient wellness, and seeing how a new social media strategy implementation influences engagement.

For this example, we are trying to determine the effect of aid provided to countries experiencing crisis. Specifically, aid provided to women and family planning. I chose to look at these specifically since the treatment of women is a proven indicator of state security and peace. Furthermore, looking at women’s health is a proxy for the state of families and innocent civilians of a country in crisis. To see my completed blog on this specific relationship, click here.

So the motivation to do this is to establish a temporal precedence in my data. I want to examine the effect of aid on the percentage of women completing secondary school. If we want to examine causal relationships, we cannot examine these measures at the same time. In order for one factor to cause another, it must precede it.

So to break down the logic into steps:

First off, I want the amount of aid to precede the measure of women completing secondary school.

Secondly, I want political and economic factors to precede the amount of aid provided.

I’m going to lag economic and political variables by two years, and the aid variables by one year, and keep my column containing percentage of women completing secondary school how it is.

I will be creating new columns with “_lag” at the end of each column name.

*Be sure to sort your data by country and year first!*

# lag control variables by two years
count_df.pop_lag = count_df.groupby('COWname')['SP.POP.TOTL'].shift(2)
count_df.gdp_lag = count_df.groupby('COWname')['NY.GDP.MKTP.CD'].shift(2)
count_df.gdp_growth_lag = count_df.groupby('COWname')['NY.GDP.MKTP.KD.ZG'].shift(2)
count_df.gdp_pcap_lag = count_df.groupby('COWname')['NY.GDP.PCAP.CD'].shift(2)
count_df.polity2_lag = count_df.groupby('COWname')['polity2'].shift(2)
count_df.durable_lag = count_df.groupby('COWname')['durable'].shift(2)

Note that I have two columns listed in the groupby; to specify that lagging it by year does not change the country.

# lag aid by one year
count_df.UN_aid_lag = count_df.groupby('COWname')['United Nations'].shift(1)
count_df.US_aid_lag = count_df.groupby('COWname')['United States'].shift(1)
count_df.UN_W_aid_lag = count_df.groupby('COWname')['United Nations-Womens Aid'].shift(1)
count_df.US_W_aid_lag = count_df.groupby('COWname')['United States-Womens Aid'].shift(1)

Here’s a subset of my data that illustrates what we did. Notice that the lagged columns contain the same values as the non_lagged values, but offset by the indicated shift.

Great, now we’re ready to plug this into our model!

Happy coding!

Please check out how I used my lagged data: Humanitarian Aid: Using women’s completion of secondary education to measure aid effectiveness.