Run-Time Optimization in Pandas

Robert Ervin
6 min read · Jun 17, 2016


There are many times in a programmer's life when they hack out some code, run it, and are disappointed by the run time. They're then forced to go back to the drawing board and re-evaluate their approach. Times like these are perfect opportunities for growth. This is one of those times.

Before you try to slog through this explanation, let me show you the results so you see the value of reading the rest of this. The following chart shows the run time to process X days' worth of data (I explain what the data looks like in the next section) against the method we used to process it. If you're not familiar with the Pandas Python toolkit, I'd recommend looking into it. For now, let's just say it gives us a way to easily manipulate large datasets in Python.

To give you an overview of the problem we’re tackling, let me first show you a basic form of the dataset:
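(The post shows the dataset as a graphic. As a rough stand-in, here is a tiny illustrative frame with assumed column names: location, item, section, date, and quantity. None of these names come from the original, so treat them as placeholders.)

```python
import pandas as pd

# Purely illustrative rows; the real schema isn't shown in the post. The
# quantity is the item's total sold for that day, repeated on each of the
# item's section rows.
df = pd.DataFrame(
    {
        "date": ["2016-06-01"] * 5,
        "location": ["Store 1"] * 5,
        "item": ["Item 1", "Item 1", "Item 1", "Item 2", "Item 2"],
        "section": ["Section A", "Section B", "Section C", "Section A", "Section D"],
        "quantity": [10, 10, 10, 4, 4],
    }
)
print(df)
```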

From this you can see that an item can belong to many sections, and a section can contain many items. Something that isn't shown here, but is also important to keep in mind, is that there are many items in a given day.

Let’s say during a given day, 10 instances of Item 1 were sold. Our goal is to try to predict the distribution of sales in the sections it belongs to. For instance, we may predict 6 instances were sold in Section A, 3 in Section B, and 1 in Section C.

So here we hit our first problem: we have to predict the weight of each section. For the sake of this post, we're going to ignore how we go about doing that and just assume we can calculate the weights semi-accurately.

Now we can go about calculating the predicted amount sold in a section for an item on a particular day via the following:
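(The formula is shown as an image in the original post. Reconstructed from the surrounding description, it amounts to the following, where q is the quantity sold, w the predicted section weight, and S(i) the set of sections item i belongs to; the notation is mine, not the original's.)

```latex
\hat{q}_{i,s,d} = q_{i,d} \cdot \frac{w_{s,d}}{\sum_{s' \in S(i)} w_{s',d}}
```

Here i is the item, s the section, and d the day.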

You can see we re-normalize the weights for a given section since not all items belong to all sections.

Now for implementing it. Since our team is well-versed in Python and Pandas is our bread-and-butter, we decided to roll with that. (Haha butter-roll puns)

The implementation resembled the following:
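(The implementation only appears as a code image in the original post. Below is a rough sketch of what such a first pass could look like, reusing the assumed column names from above and assuming df_weights has one row per (section, date) with a weight column. The row-by-row loop and the boolean-mask lookups are the important part, not the exact details.)

```python
import pandas as pd

def predict_distribution(df: pd.DataFrame, df_weights: pd.DataFrame) -> pd.DataFrame:
    """Naive first pass: loop row by row and select with boolean masks."""
    predictions = []
    # One prediction set per (location, item, date) combination.
    for _, task in df[["location", "item", "date"]].drop_duplicates().iterrows():
        # Each selection below is a boolean mask over the full frame, i.e. a
        # complete scan of all ~60M rows for every single lookup.
        df_rows = df[
            (df["location"] == task["location"])
            & (df["item"] == task["item"])
            & (df["date"] == task["date"])
        ]
        total_sold = df_rows["quantity"].iloc[0]  # day total, repeated per section row

        # The weight dataframe is dated, so pull that day's weights for just
        # the sections this item belongs to, then re-normalize them.
        weights = df_weights[
            (df_weights["date"] == task["date"])
            & (df_weights["section"].isin(df_rows["section"]))
        ]
        normalized = weights["weight"] / weights["weight"].sum()

        for section, w in zip(weights["section"], normalized):
            predictions.append({
                "location": task["location"],
                "item": task["item"],
                "section": section,
                "date": task["date"],
                "predicted_quantity": total_sold * w,
            })
    return pd.DataFrame(predictions)
```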

Awesome. We have the initial working code. To give a sense of the size of data we’re dealing with, we’re processing about 60M rows spread out over the past 15 days. Also, the weight dataframe is dated, since the particular weight for a section can (and probably should) change slightly every day.

We ran this and measured a single metric: the time to process 10k rows. This implementation processed 10k rows in 30 minutes; at 60M rows that's 6,000 batches, or roughly 3,000 hours (about four months) for the entire dataset.

Now, given the SLA with our client and the time we get the data in the system, we had at most 4 hours. This clearly wouldn’t work.

After some line-by-line run-time analysis, we saw that the df_row lookup took about 0.5 seconds. Some light digging into Pandas selection brought us to the conclusion that it was doing a full scan of the 60M-row dataframe. So the lookup time made sense, and it was clearly the bottleneck.

The next course of action was to eliminate that bottleneck. So we added a column to the dataframe:
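(The exact snippet is an image in the original. A sketch of the idea, using the same assumed column names as above and a hypothetical lookup_key name:)

```python
# Build a single sortable key that uniquely identifies a row. The column
# names and the "lookup_key" name are assumptions, not the original code.
df["lookup_key"] = (
    df["location"].astype(str) + "|"
    + df["item"].astype(str) + "|"
    + df["section"].astype(str) + "|"
    + df["date"].astype(str)
)
```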

Since the combination of location, item, section, and date resulted in a unique row, we simply combined those and were able to sort the dataframe by the result. We then used pandas searchsorted to leverage binary search to find the index of our row. We would then simply select the row by index.
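(Again a sketch rather than the original code: sort once on the composite key, then binary-search it with Series.searchsorted.)

```python
# Sort once so binary search applies, then find any row in O(log n).
df = df.sort_values("lookup_key").reset_index(drop=True)

def lookup_row(location, item, section, date):
    key = f"{location}|{item}|{section}|{date}"
    idx = int(df["lookup_key"].searchsorted(key))
    if idx >= len(df) or df.iloc[idx]["lookup_key"] != key:
        raise KeyError(key)  # the key isn't actually present
    return df.iloc[idx]
```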

This takes the theoretical lookup cost from O(# rows) down to O(log(# rows)).

An initial run gave us a speed of 1 second/10k rows, or about 1.5 hours for all 60M rows. At this point we could have packed up and gone home. We technically hit our SLA time.

However, our ETL tool is still quite new and hiccups about once a week for one reason or another. If the system crapped out and we didn't catch it until two hours before our SLA, we'd be hard-pressed to re-process in time. We could do better than 1.5 hours.

One optimization we could easily make is to segment the 60M row dataframe by date and process the dates in parallel. That proved to be a very simple task, and the code was more or less the following:
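(The real code is also an image in the original. A minimal sketch with Python's multiprocessing, assuming a process_one_day() helper that wraps the per-day prediction logic from earlier:)

```python
from multiprocessing import Pool

import pandas as pd

def process_one_day(df_day: pd.DataFrame) -> pd.DataFrame:
    # Stand-in for the per-day prediction step (the searchsorted-based logic
    # sketched above); returning the slice unchanged keeps the example runnable.
    return df_day

def process_all_days(df: pd.DataFrame, workers: int) -> pd.DataFrame:
    # One task per date: each worker processes a single day's slice.
    day_frames = [day_df for _, day_df in df.groupby("date")]
    with Pool(processes=workers) as pool:
        results = pool.map(process_one_day, day_frames)
    return pd.concat(results, ignore_index=True)

# e.g. process_all_days(df, workers=df["date"].nunique())
```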

Great, now we could theoretically spin up one of those fancy 16-core 122GB RAM servers on AWS and process the entire thing in less than 15 minutes!

So we did just that. We allocated a core to each date and told 'er to run as fast as she could. The results: 3 minutes/10k rows, or 10 hours for the entire dataset. That made no sense at first, given that a single core had done it in 1.5 hours and that in our local testing we achieved 1s/10k rows with parallelization. As we soon found out, we needed to keep a couple of cores free just for IO and for managing the rest of the system.

When we freed up 2 cores, the world was at peace again. We then achieved the 1s/10k rows we hoped for, and could process all rows in 10 minutes.

However, remember those weights? Yeah, those still took an hour to compute, and we found no way to speed that up, so we were sitting at a total processing time of about 1 hour 10 minutes. Still not the greatest. Given how simply we were calculating the weights, we had a hunch that step should be much faster. We also suspected we had reached the maximum speed Pandas could give us.

So, after some brainstorming, we decided to try porting it to SQL. The problem seemed to fit the tools SQL provides, and it didn't look like all that much work. Over the course of the next three days we ported it over, making heavy use of window functions. Computing the weights in SQL took about 2 seconds, and the sales-distribution computation ran at about 37ms/10k rows, which put the full 60M rows at only a few minutes. This we could definitely live with.

All in all, here are some of the benefits we gained by porting to SQL:

  • Increased code readability
  • Removed about 100 lines of code
  • Decreased complexity by eliminating side-processes
  • Saved $120/month in server costs
  • Ensured we can always meet our SLA

And here are some nifty graphs and tables to illustrate the benefits:

Let me know what you think of this post, and share any experience you've had with Pandas that has helped you write better, more efficient code.

If you enjoyed this and are interested in empowering enterprises with predictive analytics, drop me a line at robert@algomus.com.
