How to time shift data in Pandas
What do you do if you want to get a column value from two weeks ago today?
Let’s say you want to make a new column in a dataframe of temperature data from a particular weather station two weeks ago — in other words, for each row, you want to know “what was the average temperature two weeks ago today?”
If you have a dataframe with daily weather data for two weather stations from several months in a row, there are a couple ways you could solve this problem. You could sort the dataframe by date and station, then shift the rows down by 28 rows (remember, we have data from two separate weather stations for each date). However, this solution is slow and computationally costly, and does not take into account potential missing data. (Ie if data were missing for one weather station on one date, every subsequent observation would be off).
Luckily, pandas has a lot of features that allow you to manipulate data without having to loop through a dataframe. Here is a solution that is quicker and will not be thrown off by missing data points.
Here are the first few rows of our dataframe, weather
First, calculate the date from two weeks ago and create a new column “date_2_weeks_ago”.
Now our weather dataframe looks like this:
Next make a copy of the dataframe and add the suffix of “two weeks ago” to the date and average temperature columns
Here are the first few rows of the new dataframe, weather_2_weeks_ago:
Merge the two data frames together on the “station” column and “date_2_weeks_ago.”
Here are the first few rows of weather (Note null values because we don’t have data for dates in “date_2_weeks_ago” that are not also in the original “date” column, so we will lose some observations.)
Here are the last few rows of weather, which do have data because the date in “date_2_weeks_ago” is also in the original “date” column of the dataframe
Now you have both the average temperature for the original date and the temperature from two weeks before that date!