Pandas, Fast and Slow

How to 1000x pandas performance when adding multiple columns

Eric Ness
When I Work Data
5 min read · May 13, 2019



Context

The Care Team at When I Work resolves an incredible number of support tickets every day. While this is great for our customers, processing the tickets into our data lake presents a challenge. The processing job transforms 250k tickets at a time, so there is no room for slow steps in our process. One lesson learned at great cost is how widely the performance of different techniques for appending multiple columns to a pandas.DataFrame can vary.

Problem Statement

Adding a single column to a DataFrame is a straightforward operation in pandas: assign adds a column in modern pandas style. To add multiple columns simultaneously, however, custom functions need to be used.

One method is elegant but slow; the other is ugly but fast. To illustrate the difference, we split a string column into multiple columns using both methods.

The string being split consists of four randomly generated UUIDs concatenated with double colons ("::").

Code to generate a DataFrame with a column of these UUID hierarchies is available on GitHub. The two methods split the hierarchies at the double colons and add each component to the DataFrame as a separate column.
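The generation code itself isn't embedded here, but a minimal sketch might look like the following. The column name hierarchy and the row count of 10,000 are assumptions for illustration, not taken from the original script.

```python
import uuid

import pandas as pd


def make_hierarchy() -> str:
    # Four random UUIDs joined by double colons, e.g.
    # "9f1b...::3c42...::...::...".
    return "::".join(str(uuid.uuid4()) for _ in range(4))


# Column name and row count are illustrative assumptions.
records = pd.DataFrame({"hierarchy": [make_hierarchy() for _ in range(10_000)]})
print(records.head())
```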

Slow

One way to add multiple columns at once is with apply. This method turns each row of the DataFrame into a Series and then passes it to a custom function for transformation. The function adds additional entries to the Series and returns it. Finally, all of the transformed rows are combined back into a DataFrame with additional columns. Here is code that will perform the process:
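The original script isn't embedded here, but a minimal sketch of the apply-based approach follows. The function name append_hierarchy_levels and the returned name updated_ticket come from the article; the column names hierarchy and level_0 through level_3, and the row count, are assumptions.

```python
import uuid

import pandas as pd


def append_hierarchy_levels(ticket: pd.Series) -> pd.Series:
    # Split the "::"-delimited string and add one new index entry per level.
    updated_ticket = ticket.copy()
    for i, level in enumerate(ticket["hierarchy"].split("::")):
        # Each assignment to a new label goes through Series.__setitem__.
        updated_ticket[f"level_{i}"] = level
    return updated_ticket


records = pd.DataFrame(
    {"hierarchy": ["::".join(str(uuid.uuid4()) for _ in range(4)) for _ in range(1_000)]}
)
# axis="columns" passes each row to the function as a Series.
transformed_records = records.apply(append_hierarchy_levels, axis="columns")
print(transformed_records.columns.tolist())
```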

This is the output from the script:

Output of add_multi_with_apply.py

Let’s take a little tour of this code. Using apply with axis="columns" applies a function to each row of records, passing the row to append_hierarchy_levels as a pandas.Series. The DataFrame’s column names become the Series index, and the row’s data become the Series values. The function adds four more entries to the index, so that when updated_ticket is returned those extra index entries turn back into column names. Finally, all the transformed rows are concatenated to create a new DataFrame.

Profiling

It is easy to get the overall run time for this script. It would be even more useful to see how much time append_hierarchy_levels takes to process each row. Fortunately, Python has a profiler that will do exactly that. We invoke cProfile on the script using the command:
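One plausible invocation is sketched below. The script name matches the article; a trivial stand-in script is generated first so the command is runnable as-is.

```shell
# Create a trivial stand-in so the command below runs anywhere;
# in the article this file is the real apply-based script.
printf 'print("profiled")\n' > add_multi_with_apply.py

# Profile the script, writing the timing results to a .profile file.
python -m cProfile -o add_multi_with_apply.profile add_multi_with_apply.py
```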

This times each function in the script and outputs the results to the file add_multi_with_apply.profile. Then use the following command to interactively view the results:
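A plausible form of that command is sketched below, assuming the profiler wrote add_multi_with_apply.profile. A stand-in profile is generated first so the snippet is self-contained, and the interactive commands are piped in rather than typed.

```shell
# Re-create a profile so this snippet is self-contained; in the article
# the .profile file comes from profiling the real script.
printf 'print("profiled")\n' > add_multi_with_apply.py
python -m cProfile -o add_multi_with_apply.profile add_multi_with_apply.py

# Browse the results with the pstats module. Run without the pipe for an
# interactive session; here "stats" prints all timings, then "quit" exits.
printf 'stats\nquit\n' | python -m pstats add_multi_with_apply.profile
```

Inside the interactive session, stats followed by a function name restricts the output to matching functions, e.g. stats append_hierarchy_levels.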

Inside the profiler’s interactive mode, the command stats append_hierarchy_levels gives the results for this function.

There are a few key numbers to look at. ncalls is the number of times the function was called. Here ncalls is 10001, which corresponds to one call for each row of the DataFrame. cumtime is the cumulative time spent inside the function, in this case 45.29 seconds. Near the top of the output, the total script run time is 48.737 seconds, so almost all of the time is spent in this function. Finally, percall is the amount of time each call took: 0.005 seconds, or 5 milliseconds.

Take a step further down the stack and look at which part of append_hierarchy_levels took the most time. The command callees append_hierarchy_levels shows how much time was spent in each function called from inside append_hierarchy_levels.

The function that used the most time is __setitem__, which is called when setting a new index entry on the Series. The cumulative time across all __setitem__ calls is 43.758 seconds, nearly the entire run time of the script.

Since adding new index entries is this slow, it’s unlikely that any approach that updates a Series row by row will perform well.

Fast

There is another approach to this problem which seems counter-intuitive. It involves four times the number of calls to a custom function, but performs orders of magnitude better. The method leverages assign instead of relying on apply. Here is the code and its output:
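The code isn't embedded here, but a minimal sketch of the assign-based approach follows. The helper name get_hierarchy_level comes from the article; the column names and the use of map to call the helper once per row per level are assumptions.

```python
import uuid

import pandas as pd


def get_hierarchy_level(hierarchy: str, level: int) -> str:
    # Takes the raw string (not a Series) and returns a single level.
    return hierarchy.split("::")[level]


records = pd.DataFrame(
    {"hierarchy": ["::".join(str(uuid.uuid4()) for _ in range(4)) for _ in range(1_000)]}
)
# One assignment per level; the helper is called once per row per level,
# but no Series is mutated row by row.
transformed_records = records.assign(
    level_0=records["hierarchy"].map(lambda h: get_hierarchy_level(h, 0)),
    level_1=records["hierarchy"].map(lambda h: get_hierarchy_level(h, 1)),
    level_2=records["hierarchy"].map(lambda h: get_hierarchy_level(h, 2)),
    level_3=records["hierarchy"].map(lambda h: get_hierarchy_level(h, 3)),
)
print(transformed_records.columns.tolist())
```

The per-call work is the same string split as before; what disappears is the per-row Series mutation, which is where the apply version spends its time.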

Here get_hierarchy_level is doing the transformation work. This function takes the hierarchy string as an input parameter instead of processing the record as a Series. It returns a single level of the hierarchy to be assigned to a column in transformed_records. This means get_hierarchy_level is called four times as often as append_hierarchy_levels was in the slow method. Common sense says this method should take more time. Let’s look at the profiler output to see the actual performance.

In the profiler’s interactive mode, use stats get_hierarchy_level to look at the performance of this function.

This is a huge performance boost over the previous method! The previous method’s cumtime was 45.29 seconds; the same metric for this method is 0.035 seconds. That’s over 1000 times faster.

Conclusion

The clear winner for adding multiple columns to a DataFrame is to use assign instead of apply. This is true even if multiple assign calls need to be made to replace a single apply call.

Going back to the Customer Care Ticket problem at When I Work, we reduced the data processing time from over 4 hours to less than 10 minutes using this technique alone! This turned an unwieldy processing job that would soak up time and resources into a manageable job that can easily fit into the daily ETL pipeline.


Principal Machine Learning Engineer at C.H.Robinson, a Fortune 250 supply chain company.