Swifter 1.0.0: automatically efficient pandas and modin dataframe apply operations

Jason Carpenter
Oct 12, 2020


Since its conception over two years ago, the swifter algorithm has enabled performant pandas applies of user-defined functions (UDFs) for thousands of Python programmers.

The algorithm combines three techniques: attempting the quick solution first (vectorization), estimation and extrapolation, and validation of the output. Together these ensure that dataframe applies are swift and accurate.
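
To make the "attempt, then validate" idea concrete, here is a minimal sketch in plain pandas. It is not swifter's actual implementation: the helper names `matches_elementwise` and `smart_apply`, the 1000-row sample, and the tolerance-based comparison are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def matches_elementwise(series, func, sample_size=1000):
    """Hypothetical helper: does func(whole sample) agree with a per-element apply?"""
    sample = series.iloc[:sample_size]
    try:
        whole = func(sample)           # vectorized attempt on the sample
        per_item = sample.apply(func)  # reference result, one element at a time
        return (
            isinstance(whole, pd.Series)
            and len(whole) == len(per_item)
            and bool(np.allclose(whole.to_numpy(), per_item.to_numpy(), equal_nan=True))
        )
    except Exception:
        # Any failure (e.g. an if-else on a whole Series) means "not vectorizable".
        return False


def smart_apply(series, func):
    """Hypothetical helper: use the vectorized path only when it validates."""
    if matches_elementwise(series, func):
        return func(series), "vectorized"
    return series.apply(func), "apply"


values = pd.Series([1.0, -2.0, 3.0])
squared, squared_path = smart_apply(values, lambda x: x**2)
branched, branched_path = smart_apply(values, lambda x: x**2 if x > 0 else -x)
```

In this sketch the squaring lambda validates and takes the vectorized path, while the branching lambda raises on a whole Series and falls back to an ordinary apply.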

The impact made by the swifter algorithm is best captured by the following performance benchmark.

Swifter apply performance benchmark log-log plot. Small dataframe applies make use of linear complexity applies with minimal overhead and perform at near-pandas speed. Large dataframe applies leverage Dask multiprocessing and converge to Dask performance seamlessly without any effort from the user.

Due to swifter’s algorithmic approach, dataframe applies automatically converge to either serial processing or multiprocessing, depending on the size of the data and the complexity of the function applied.
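
One way to picture estimation and extrapolation is to time the function on a small sample and scale that timing up to the full column. The sketch below does this in plain pandas. The helper names, the 1000-row sample, and the one-second threshold are illustrative assumptions, not swifter's internals.

```python
import time

import pandas as pd


def estimate_apply_seconds(series, func, sample_size=1000):
    """Hypothetical helper: time a sample apply and extrapolate to the full length."""
    sample = series.iloc[:sample_size]
    if len(sample) == 0:
        return 0.0
    start = time.perf_counter()
    sample.apply(func)
    elapsed = time.perf_counter() - start
    # Linear extrapolation: assume cost grows in proportion to the row count.
    return elapsed * len(series) / len(sample)


def choose_strategy(series, func, threshold_seconds=1.0):
    """Hypothetical helper: pick parallel only when the estimate exceeds the threshold."""
    estimate = estimate_apply_seconds(series, func)
    return "parallel" if estimate > threshold_seconds else "serial"


tiny = pd.Series(range(10))
tiny_strategy = choose_strategy(tiny, lambda x: x + 1)
```

A small column with a cheap function stays serial because the overhead of starting parallel workers would outweigh any gain. Larger data or a slower function pushes the estimate over the threshold.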

However, this isn’t the entire story. Prior to deciding whether to leverage Dask for multiprocessing or revert to a simple pandas apply, swifter first attempts to vectorize the operation. This attempt can massively enhance the performance of dataframe applies, but it requires that the user structure the function in a way that can be vectorized. I will walk through how to structure your apply function to enable vectorization and parallelization in the Best Practices section below.

Swifter is not only a performant means of applying UDFs to dataframes, but is also easy to use and comes with a progress bar. Incorporating the swifter algorithm is as simple as follows:

import pandas as pd
import swifter
df = pd.DataFrame(...)
df.swifter.apply(...)

All we have to do is add .swifter to the command chain and now the dataframe has access to swifter’s efficient apply methods.

Version 1.0.0

Swifter recently released version 1.0.0, which now includes the following efficient apply functionalities for pandas dataframes:

  • df.swifter.apply(...)
  • df.swifter.applymap(...)
  • df.swifter.rolling(...).apply(...)
  • df.swifter.resample(...).apply(...)

Swifter 1.0.0 introduces compatibility with Modin dataframes.

Not only that, I am pleased to announce that version 1.0.0 also includes compatibility with Modin dataframes. If you’re not familiar with Modin dataframes, see the Modin links in the References section. Modin’s homepage concisely explains its value proposition: “Scale your pandas workflow by changing one line of code.” It’s as simple as changing your import statement.

import modin.pandas as pd
import swifter
df = pd.DataFrame(...)
df.swifter.apply(...)

This new compatibility between swifter and Modin now allows users of Modin to continue scaling pandas workflows by only changing the import statement, even when the code leverages swifter for performance enhancements.

Swifter apply performance benchmark log-log plot for Modin dataframes. Modin dataframe apply is already performant, but swifter can further improve performance of Modin dataframes by leveraging vectorization.

It is worth noting that because Modin dataframes are already distributed and performant, swifter’s primary value-add to Modin dataframes is the vectorized apply attempt before reverting to Modin applies. Though this is the current state, tighter swifter-Modin integration and performance-enhancing capabilities are in the works. Stay tuned for more on this soon.

Latest improvements

As swifter continues to build out API compatibility with various pandas and Modin apply methods, we have in parallel managed to solve longstanding issues with the library.

The most critical shortcoming that users of swifter have historically had to deal with is related to string processing. Swifter was largely unable to improve performance for dataframe columns that included strings. Until now, that is.

In recent versions of Dask DataFrame, this string processing shortcoming was resolved. Over several string processing functions, we see approximately 2x speed-up over pandas applies.

By leveraging Modin dataframes for swifter dataframe applies which target string dtype columns, swifter now significantly improves performance over pandas when applying to text data.

Previously, users of swifter could call df.swifter.allow_dask_on_strings(True).apply(...) to force swifter to use Dask in an attempt to improve performance over pandas applies. Since Dask did not improve performance in all cases, this was not the default option.

Now, it is no longer necessary to call allow_dask_on_strings, because the default apply on string data uses Modin, which consistently improves performance over pandas for string processing. Users of swifter can still use the allow_dask_on_strings method to force swifter to use Dask instead of Modin for string processing, if desired.

Swifter performance benchmark string processing plot (actual scale). Dask apply was formerly less performant than pandas for processing string dtype columns. Now, Dask apply improves performance by 2x. New default Modin apply reduces string processing time by an additional 25% over Dask apply.

Best practices

The swifter apply examples notebook indicates how to leverage swifter most effectively, but I find that the key to unlocking swifter’s potential is sometimes closer than we realize. As I mentioned in the beginning, before determining whether to use the base dataframe apply or leverage parallelism to increase performance, swifter first attempts to vectorize the operation.

Though vectorization is only possible in a subset of apply functions, it is important to know when this is possible. More generally, it is critical to know how to structure function code such that we can fully unlock swifter’s performance enhancing algorithm. A simple example follows, with setup:

import pandas as pd
import numpy as np
import swifter
df = pd.DataFrame({
    "gaussian": np.random.normal(size=1_000_000),
    "str_date": ["2020-01-01"] * 1_000_000,
})

Functions are vectorizable if they are simple and direct enough. For example, we can square every number in a column.

df.gaussian.swifter.apply(lambda x: x**2)

Or we can convert strings to datetimes.

df.str_date.swifter.apply(pd.to_datetime)

But notice how these functions have no conditionality in them. They are simple in the sense that they are applying a specific data manipulation to the entire column. These direct applies of the operation can be done in a vectorized fashion because there is no control-flow involved.
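
A quick way to check whether a function is "simple and direct" in this sense is to call it on the whole column and compare the result with an element-by-element apply. The sketch below does that in plain pandas for the two examples above. It uses a smaller frame than the article's setup, and the variable names are my own.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)
small = pd.DataFrame({
    "gaussian": rng.normal(size=1_000),
    "str_date": ["2020-01-01"] * 1_000,
})

# Squaring: one call on the whole column versus one call per element.
squared_whole = small["gaussian"] ** 2
squared_each = small["gaussian"].apply(lambda x: x**2)

# Date parsing: pd.to_datetime accepts a whole Series as well as a single string.
dates_whole = pd.to_datetime(small["str_date"])
dates_each = small["str_date"].apply(pd.to_datetime)

squares_match = bool(np.allclose(squared_whole, squared_each))
dates_match = bool((dates_whole == dates_each).all())
```

When both comparisons hold, the function gives the same answer whether it sees one value or the whole column, which is the property a vectorized attempt relies on.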

As we add control-flow to our user-defined functions, we can end up preventing the function from being vectorized, or even from improving performance at all. The function below will not only fail to leverage vectorization, but due to the if-else statement it will not even utilize parallelization. Instead, it will revert to a pandas apply.

def square_or_sqrt(x):
    return x**2 if x > 0 else x**(1/2)
df.gaussian.swifter.apply(square_or_sqrt)
Pandas Apply: 100%|██████████████████████████████| 1000000/1000000 [00:00<00:00, 1051074.93it/s]

So how can we solve this? Well, one way is to use np.where instead of if-else control-flow statements. With this reformulation of the function we can now at least leverage automatic parallelization with Dask via swifter.

def square_or_sqrt(x):
    return np.where(x > 0, x**2, x**(1/2))
df.gaussian.swifter.apply(square_or_sqrt)
Dask Apply: 100%|█████████████████████████████████████████| 24/24 [00:00<00:00, 47.32it/s]
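
The reason the two versions behave differently can be seen without swifter at all. The sketch below calls both on a whole pandas Series. The renamed functions and the three-element column are only there for illustration.

```python
import numpy as np
import pandas as pd


def square_or_sqrt_branching(x):
    # Python's if-else needs a single True/False, which a whole Series cannot give.
    return x**2 if x > 0 else x**(1/2)


def square_or_sqrt_where(x):
    # np.where picks per element, so it accepts a whole column at once.
    return np.where(x > 0, x**2, x**(1/2))


column = pd.Series([4.0, -9.0, 0.25])

try:
    square_or_sqrt_branching(column)
    branching_accepts_series = True
except ValueError:
    # pandas raises because the truth value of a Series is ambiguous.
    branching_accepts_series = False

# np.where computes both branches for every element; the fractional power of a
# negative float is NaN, so silence the floating-point warning for that branch.
with np.errstate(invalid="ignore"):
    where_result = square_or_sqrt_where(column)
```

Two things are worth keeping in mind with this pattern. First, np.where evaluates both branches over the full input before selecting, so each branch must be safe to compute on every element. Second, it returns a NumPy array rather than a pandas Series.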

Understanding how the structure of the function you wish to apply to the dataframe can impact swifter’s ability to improve performance is essential to unlocking swifter’s full potential.

References

Swifter GitHub: https://github.com/jmcarpenter2/swifter

Swifter Docs: Documentation + Changelog

Swifter Examples: Apply examples + Performance Benchmark

Modin: GitHub + Documentation

Dask: GitHub + Documentation

Contributor Contact: Jason Carpenter

LinkedIn: https://www.linkedin.com/in/jasonmcarpenter/

GitHub: https://github.com/jmcarpenter2
