Pandas, Fast and Slow
How to 1000x pandas performance when adding multiple columns
Context
The Care Team at When I Work resolves an incredible number of support tickets every day. While this is great for our customers, processing the tickets into our data lake presents a challenge. The processing job transforms 250k tickets at a time, so there is no room for slow steps in our process. One lesson learned at great cost is the relative performance of different techniques for appending multiple columns to a pandas.DataFrame.
Problem Statement
Adding a single column to a DataFrame is a straightforward operation in pandas: assign adds a column to a DataFrame in modern pandas style. To add multiple columns simultaneously, however, custom functions need to be used. One method is elegant and slow; the other is ugly and fast. To illustrate the difference, we split a string column into multiple columns using both methods.
The string that is split has four randomly-generated UUIDs concatenated together with double colons. An example looks like:
f3c02a24-dafb-4294-bb88-0bd26c011568::50bd097d-375a-42d7-a04d-2331534e4f9f::7ad921e2-538d-4ef6-b03c-b49f75549946::73b19fac-9ae7-49af-bb32-49d2da8f2ca8
Code is available on GitHub to generate a DataFrame with a column of these UUID hierarchies. The two methods split these hierarchies at the double colons and add each component to the DataFrame as a separate column.
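The generator itself isn’t reproduced here; a minimal sketch of what it might look like (the column name hierarchy and the row count are assumptions, not the original code):

```python
import uuid

import pandas as pd


def build_records(n_rows: int = 10_000) -> pd.DataFrame:
    """Build a DataFrame with one column of '::'-joined UUID hierarchies."""
    hierarchies = [
        "::".join(str(uuid.uuid4()) for _ in range(4))
        for _ in range(n_rows)
    ]
    return pd.DataFrame({"hierarchy": hierarchies})


records = build_records()
```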
Slow
One way to add multiple columns at once is with apply. This method turns each row of the DataFrame into a Series and then passes it to a custom function for transformation. The function adds additional entries to the Series and returns it. Finally, all of the transformed rows are combined back into a DataFrame with the additional columns. Here is code that will perform the process:
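The original listing isn’t reproduced here; a minimal sketch of the apply-based approach, assuming hypothetical column names hierarchy and hierarchy_level_0 through hierarchy_level_3:

```python
import pandas as pd


def append_hierarchy_levels(ticket: pd.Series) -> pd.Series:
    """Split the hierarchy string and add each level as a new Series entry."""
    for i, level in enumerate(ticket["hierarchy"].split("::")):
        # each assignment to a new label goes through Series.__setitem__
        ticket[f"hierarchy_level_{i}"] = level
    return ticket


records = pd.DataFrame({"hierarchy": ["a::b::c::d", "w::x::y::z"]})
transformed_records = records.apply(append_hierarchy_levels, axis="columns")
print(transformed_records)
```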
The script’s output shows records with the four new hierarchy-level columns appended.
Let’s take a little tour of this code. Using apply with axis="columns" will apply a function to each row of records. This passes each row to append_hierarchy_levels in the form of a pandas.Series: the column names from the DataFrame become the index and the row data become the Series values. The function adds four more entries to the index, so when updated_ticket is returned, these extra index entries transform back into column names. Finally, all of the transformed rows are concatenated to create a new DataFrame.
Profiling
It is easy to get the overall run time for this script, but it would be even more useful to see how much time append_hierarchy_levels takes to process each row. Fortunately, Python has a profiler that will do exactly that. We invoke cProfile on the script using the command:
python -m cProfile -o add_multi_with_apply.profile add_multi_with_apply.py
This times each function in the script and outputs the results to the file add_multi_with_apply.profile. Then use the following command to interactively view the results:
python -m pstats add_multi_with_apply.profile
Inside the profiler’s interactive mode, the command stats append_hierarchy_levels gives the results for this function.
There are a few key numbers to look at. ncalls is the number of times the function is called; here ncalls is 10001, which corresponds to one call for each row of the DataFrame. cumtime is the cumulative time spent inside the function, which in this case is 45.29 seconds. Looking near the top of the output, the total script takes 48.737 seconds, so almost all of the time is spent in this function. Finally, percall is the amount of time each function call took: 0.005 seconds, or 5 milliseconds.
Take a step further down the stack and look at what part of append_hierarchy_levels took the most time. The command callees append_hierarchy_levels shows how much time each function called inside append_hierarchy_levels took.
The function that used the most time is __setitem__. This function is called when setting a new index entry in the Series. The cumulative time for all __setitem__ calls is 43.758 seconds, which is nearly the entire run time of the script.
It’s unlikely that any method that updates a Series row by row will perform well, since setting a new index entry is so slow.
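To see that cost in isolation, here is a quick timeit comparison of adding a new label to a Series versus a plain dict (a sketch, not from the original post; the exact ratio will vary by machine and pandas version):

```python
import timeit

import pandas as pd

row = pd.Series({"hierarchy": "a::b::c::d"})
mapping = {"hierarchy": "a::b::c::d"}

# Each trial copies first so the label is always new (the slow enlargement path).
series_time = timeit.timeit(
    lambda: row.copy().__setitem__("hierarchy_level_0", "a"), number=1_000
)
dict_time = timeit.timeit(
    lambda: dict(mapping).__setitem__("hierarchy_level_0", "a"), number=1_000
)
print(f"Series __setitem__: {series_time:.4f}s, dict __setitem__: {dict_time:.4f}s")
```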
Fast
There is another approach to this problem that seems counter-intuitive: it involves four times the number of calls to a custom function, yet performs orders of magnitude better. The method leverages assign instead of relying on apply. Here is the code:
Here get_hierarchy_level is doing the transformation work. This function takes the hierarchy string as an input parameter instead of processing the record as a Series, and it returns a single level of the hierarchy to be assigned to a column in transformed_records. This method requires get_hierarchy_level to be called four times as often as append_hierarchy_levels in the slow method, so common sense says that it should take more time. Let’s look at the profiler output to see what the actual performance is.
python -m cProfile -o add_multi_with_assign.profile add_multi_with_assign.py
python -m pstats add_multi_with_assign.profile
In the profiler’s interactive mode, use stats get_hierarchy_level to look at the performance of this function.
This is a huge performance boost over the previous method! The previous method’s cumtime is 45.29 seconds, while the same metric for this method is 0.035 seconds. That’s over 1000 times faster.
Conclusion
The clear winner for adding multiple columns to a DataFrame is assign instead of apply. This is true even if multiple assign calls need to be made to replace a single apply call.
Going back to the Customer Care Ticket problem at When I Work, we reduced the data processing time from over 4 hours to less than 10 minutes using this technique alone! This turned an unwieldy processing job that would soak up time and resources into a manageable job that can easily fit into the daily ETL pipeline.