Sliding Windows in Pandas
Identify Patterns in Time-Series Data with Overlapping Window Techniques
Windowing techniques enable data analysts to identify valuable patterns in time-series data. Sliding windows are particularly powerful because they allow you to spot patterns earlier than other techniques do. This matters in situations where making a key decision a few minutes (or seconds) earlier can save you money.
This article will show you 3 ways to perform windowing in pandas and will discuss the tradeoffs and benefits of each approach. By the end, you’ll be able to create a sliding window visualisation that looks something like this:
Imagine you’re an Uber or Lyft driver. You’re driving around New York waiting for your next ride…and burning expensive gas in the meantime. Not cool. You need to find a way to maximise your chance of getting a new ride quickly. How could you do this?
You could take an experience-driven approach and test out different strategies over a period of time to find out which parts of New York have high demand. You try starting in one neighbourhood the first day, another the second, and so on… But New York City is BIG and this way it’s going to take you a loooong time before you can start to identify patterns with any degree of confidence.
It would be much more efficient to take a data-driven approach. Which is exactly what we’ll be doing in this article.
Let’s get to it 🚀
The Dataset
We’ll be working with 1 hour of rideshare (Uber/Lyft) data from the New York City Taxi and Limousine Commission (TLC) Trip Record* dataset, subsampled from the data for January 2022. You can download the complete Parquet file for January 2022, which contains data for more than 14.5 million rides (!). To replicate the dataset I’m using in this article, subsample it to the first 100K rows. You can also access the data and notebook directly from a dedicated Github repo.
import pandas as pd

df = pd.read_parquet("fhvhv_tripdata_2022-01.parquet").head(100_000)
Remember: our objective is to identify zones with peaks in demand, defined as number of requests per unit of time.
1. Quick-n-Dirty: a GroupBy over the Entire Dataset
We’ll start by doing a very rough pass to identify high-demand zones over the entire 1-hour dataset.
After using a time_mask to subset the data to a specific hour and sorting the values by the request_datetime column, this is essentially just a groupby operation.
# define time mask
time_mask = (
    (df['request_datetime'].dt.month == 1)
    & (df['request_datetime'].dt.day == 1)
    & (df['request_datetime'].dt.hour == 0)
)
# sort values
one_hour = df[time_mask].sort_values(by=['request_datetime'])
# perform a simple groupby with .count()
high_demand = one_hour[['PULocationID','hvfhs_license_num']].groupby(['PULocationID']).count()
We now have a list of the number of requests per pickup location. Let’s clean up the column names and sort the values to get a meaningful output.
high_demand = high_demand.rename(columns={'hvfhs_license_num':'n_requests'})
high_demand.sort_values(by='n_requests', ascending=False, inplace=True)
high_demand.head(10)
Sweet. We can now identify the top-10 pickup zones with the highest demand during the hour for which we have data. We could run this code in batch once an hour. This is great…but only sort of.
Processing the data in 1-hour batches means that drivers will only receive the signal for a high-demand area at the end of the hour. What if the peak was actually happening during the first 15 or even 5 minutes of that hour? Then there would be absolutely no use in rushing to that part of town only to realise that the peak demand was long over.
2. Tumbling Windows in Pandas
The drivers aren’t happy. We need to do better. Let’s build some code that can chop up our hour of data into tumbling windows of 15 minutes each. This way we will get some more granularity and be able to give the drivers more helpful instructions on when to be in which part of town.
We can use the pandas pivot_table() and .resample() methods to do this. We’ll need the index of our DataFrame to be set to a datetime column. Since we’re interested in the number of requests, we’ll use the request_datetime column.
Note that the dataset is not sorted by incoming request time and includes some trips that were requested before midnight, i.e. in the previous year. So let’s clean those up first.
# subset df to requests from 2022 only
cols = ['request_datetime', 'on_scene_datetime', 'pickup_datetime',
        'dropoff_datetime', 'PULocationID', 'DOLocationID']
requests = df.loc[df.request_datetime.dt.year == 2022, cols]
# sort by request_datetime
requests.sort_values(by=['request_datetime'], inplace=True)
# set request_datetime as index
requests.set_index('request_datetime', inplace=True)
requests.head()
Our DataFrame now looks like this:
Remember that we are looking to group our dataset into 15-minute windows with the number of requests per pickup zone for each window. This means we want our output to have 4 rows (15 x 4 = 60 minutes) and a column for each PULocationID.
We need a pivot:
# expand df to have a unique column for each PULocationID
pivot = pd.pivot_table(
    requests,
    values='pickup_datetime',
    index=requests.index,
    columns=['PULocationID'],
    aggfunc='count'
)
pivot.head()
This is a sparse DataFrame with simply a count (1) for each row (request) in the column corresponding to that request’s pickup zone.
Let’s resample our DataFrame with .sum() to group this data into 15-minute windows and sum up the counts for each PULocationID:
tumbling = pivot.resample('15min').sum()
tumbling
Nice! We now have the number of requests per pickup zone for each 15-minute window. All that’s left is to define exactly what we mean by “high demand” — let’s say 40 requests per 15-minute window — and then we can use this DataFrame to filter out the windows and pickup zones that exceed that threshold.
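As a minimal sketch (using the tumbling DataFrame from above and the 40-request threshold we just picked), that filter could look something like this:
# keep only values above the high-demand threshold,
# then drop pickup zones that never exceed it in any window
threshold = 40
high_demand_windows = tumbling[tumbling > threshold].dropna(axis=1, how="all")
high_demand_windows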
We can also inspect the data visually to identify the peak-demand zones:
tumbling.plot(legend=False)
This is great…but still only sort of. We’re still missing out on lots of granularity here. We need to wait for each 15-min window to finish before we can aggregate the data. What if the spike happens in the 15-minute window between minute 1 and minute 16 (rather than minute 0 and minute 15)? We will have missed the spike by just a minute but will need to wait 14 minutes before receiving the signal. By that time it may be too late to instruct drivers to that pickup zone.
Do not fear, perfection is near!
3. Sliding Windows in Pandas
Ideally, we want a signal every minute that tells us which zones are experiencing peak demand. We also want to reduce the size of our windows to enable our drivers to react to the incoming data faster. So let’s work with 5-minute windows and define “peak demand” as more than 50 requests per 5-minute window.
We can accomplish this using the pandas .rolling() method to create sliding windows: 5-minute windows with a 4-minute overlap, so that we can track the rate of incoming requests every minute.
You can create sliding windows in pandas using the .resample() and .rolling() methods. Make sure to .resample() to the size of your desired signal interval instead of the size of your window:
# create sliding windows in pandas:
# resample to the signal interval, then roll over the window size
interval_size = '1min'   # how often we want a signal
window_size = 5          # window length, in resampled rows (minutes)
res = pivot.resample(interval_size).sum()
windows = res.rolling(window_size).sum()
Let’s pick it apart step by step.
We want our output to be a series of rows in which each row is a 5-minute window and the columns contain the sum of requests per pickup zone for that 5-minute window. However, instead of starting the second row at the end of the first 5-minute window, we want it to start at the end of the first minute of that first 5-minute window.
Still with me here? If not, maybe this diagram below will help:
What we really want is 56 rows of 5 minutes each, with a 4-minute overlap between consecutive rows. This will give us a signal every minute telling us which zones (if any) are getting more than 50 requests per 5 minutes.
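If you want to double-check where that 56 comes from, the arithmetic is a one-liner:
# number of complete 5-minute windows in a 60-minute hour, stepping by 1 minute
n_minutes, window_size, step = 60, 5, 1
n_windows = (n_minutes - window_size) // step + 1
print(n_windows)  # 56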
To do this, we’ll have to resample our DataFrame first in order to get a single row for every minute:
res_1m = pivot.resample('1min').sum()
res_1m.head(3)
(If you wanted to get a signal every second, say, you would resample to '1s' here.)
Now that every row represents a single minute, we can use .rolling() and .sum() to get the sum of the number of requests for every 5 rows (minutes):
windows = res_1m.rolling(5).sum()
windows.head()
😱 Yikes, that’s a lot of NaNs.
But that’s only because .rolling() by default places the output of the window operation (in this case the sum) at the right edge of the window. This is technically correct, since we only get the first signal at the end of minute 4, when the first window has been completed.
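You can see this right-edge behaviour on a toy Series (unrelated to our taxi data, just to illustrate the default):
import pandas as pd

# a rolling window of 3 labels each sum at the window's right edge,
# so the first two positions have no complete window and stay NaN
s = pd.Series([1, 2, 3, 4, 5])
print(s.rolling(3).sum())
# 0     NaN
# 1     NaN
# 2     6.0
# 3     9.0
# 4    12.0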
We can drop the first 4 rows if we want to get rid of those empty minutes:
windows_drop = windows.iloc[4:]
windows_drop.head()
Side note: Forward-Looking Windows
In some cases, you might want to move the values to the left edge to make it more easily readable. If you do so, be careful not to make the mistake of thinking that you will actually have data for window 1 at minute 1 (!) — you’re really just shifting the results over to the first row.
You can use the FixedForwardWindowIndexer object to place the window output on the left edge and create forward-looking windows:
# create custom window indexer
indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=5)
# use indexer to create rolling window
windows_f = res_1m.rolling(window=indexer).sum()
windows_f.head()
The last 4 rows of our DataFrame are now empty (because there is no full 5-minute window starting after minute 55) and we can drop them.
# drop the last 4 rows
windows_f_drop = windows_f.iloc[:-4]
windows_f_drop.tail()
This way of using the .rolling() method is not very well-known and is a little buried in the docs, but once you understand how it works it’s pretty easy to use.
Visualise Sliding Windows in Pandas
Let’s visualise our data to identify some patterns and spot the pickup zones with the highest peaks in demand. We’ll work with windows to make it extra clear that we don’t have any data for the first 5 minutes:
import matplotlib.pyplot as plt
windows.plot(kind='bar', legend=False);
This is a very rough visualisation but even here we can already clearly see that:
- We don’t have any data for the first 5 minutes.
- There are clear peaks in the rate of pickup requests in specific zones.
From here, we can now define a cutoff/threshold and identify zones with above-threshold requests in specific windows. We can then motivate drivers to go to that zone.
One way to clarify the plot is to remove any columns with values below the threshold. Let’s say we want to identify zones with more than 50 requests per 5-minute window:
signal = windows[windows > 50].dropna(axis=1, how="all")
signal.plot(figsize=(10,8))
plt.legend(bbox_to_anchor=(1.0,1.0))
plt.show()
There’s much more we could do here. Play around with the code in the notebook to perfect this further.
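One hypothetical starting point (not in the notebook, just an idea to explore): use idxmax() to pull out the busiest pickup zone for each window.
# for each minute, the pickup zone with the most requests
# in the trailing 5-minute window
top_zone = windows_drop.idxmax(axis=1)
top_zone.head()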
Sliding Windows in Pandas: Conclusion
This article has introduced you to the powerful capabilities of performing sliding windows on time-series data. By chopping up your dataset into small, overlapping windows you are now able to get valuable data signals at an actionable rate.
Of course, ideally you would be doing all of this in real-time. Watch this space for more on that soon, but in the meantime you might want to check out this real-time monitoring tutorial. Be prepared for a slight context change from New York traffic to website traffic instead ;)
I hope you found this post helpful and engaging. Follow me on LinkedIn to catch my latest content.
Happy data crunching! 👋
*New York City Taxi and Limousine Commission (TLC) Trip Record Data was accessed on March 15, 2023 from https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page. License: https://www.nyc.gov/home/terms-of-use.page