The Usain Bolt of Data Processing: Pandas Lags Behind

Fareed Khan
6 min read · Sep 3, 2023

--

In the world of data, Python is a popular language because it is versatile and comes with a rich ecosystem of libraries. When we work with data, we often need to do things quickly, especially when the data is really big and complicated.

Imagine you have a big puzzle made of data pieces. Python is like a tool that helps you solve the puzzle. But sometimes, when the puzzle gets really big, Python can take a long time to finish.

That’s where Polars comes in. Polars is like a special tool that’s really good at solving big data puzzles quickly.

Polars Official Documentation — link

All the provided examples, code snippets, and visuals have been sourced from DataCamp.

What Makes Polars Special?

  1. Fast and Speedy — Polars is designed to work really fast. It can solve big data puzzles much quicker than regular Python tools.
  2. Easy to Use — Polars is easy to learn and use. If you already know Python, you won’t have a hard time learning how to use Polars.
  3. Helpful — Polars has lots of helpful things you can use to work with your data. You can do things like sorting, grouping, and putting data together.
  4. Smart Thinking — Polars is smart. It looks at your problems and figures out the best way to solve them quickly and without using too much computer memory.

Why Choose Polars Over Regular Python Tools (Pandas)?

Regular Python tools like pandas are like regular tools you use at home. They can be slow when the puzzle is very big. Polars is like a super-powered tool that can finish the big puzzles much faster.

So, if you’re dealing with big and tricky data puzzles and you want to get the answers quickly, Polars might be the right tool for you. Stick with us, and we’ll show you how to use Polars to make your data work a whole lot easier.

An image comparing Polars and Pandas is shown below:

Pandas vs. Polars — A Simple Comparison

So, you’ve learned about the theory, and now it’s time to dive into the practical side of things. In the upcoming sections, we’ll run some tests to compare how fast Pandas and Polars can perform various tasks. Our measure of success will be the time it takes to complete each task. To do this, we’ll use the “time” module, which helps us measure how long things take.

In addition to performance, we’ll also look at how compatible the code is between Polars and Pandas. In other words, we’ll see how much code you need to change if you want to switch from Pandas to Polars.
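As a sketch of how such timing works, a small helper built on Python’s “time” module (the helper name and example workload are illustrative, not taken from the original benchmark) could look like this:

```python
import time

def timed(label, func, *args, **kwargs):
    """Run func, print how long it took, and return its result."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.3f}s")
    return result

# Example: time a simple summation
total = timed("sum", sum, range(1_000_000))
```

Note that time.perf_counter() is generally preferred over time.time() for short measurements, since it uses a monotonic, high-resolution clock.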

Dataset Description

We’re going to test Pandas and Polars using a made-up dataset. This dataset contains hourly sales for a company with offices in different countries, spanning from 1980 to 2022. To make things a bit more challenging, we’ve added some missing data here and there. The dataset is quite big, with over 22 million rows and more than 1.6 GB of memory usage. It’s a tough challenge for Pandas! You can find all the code for the benchmark in this DataCamp Workspace.
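The benchmark dataset itself is not distributed with this article, but a scaled-down synthetic stand-in with the same column layout (values invented, only 1,000 rows instead of 22 million) might be generated like this:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1_000  # the real benchmark uses 22M+ rows; keep this toy version small

df = pd.DataFrame({
    "id": np.arange(n),
    "date": pd.date_range("1980-01-01", periods=n, freq="60min"),  # hourly
    "office": rng.choice(["France", "Italy", "Germany"], size=n),
    "sales": rng.normal(1_000, 250, size=n),
})
# Sprinkle in missing values, as in the benchmark dataset
df.loc[df.sample(frac=0.05, random_state=0).index, "sales"] = np.nan

print(df.shape)
```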

Import Test

Let’s start by comparing how Pandas and Polars perform when importing data. Polars boasts about being great at importing data from CSV files. Is that really the case?

We’ll test how long it takes for both Pandas and Polars to read specific columns and rows. For example, imagine you only want to analyze sales data for the France office. With Pandas, you would typically read all the rows and then filter out the unwanted ones using the “.query()” method. For Pandas 2.0, we’ll import the CSV using both the traditional NumPy engine and the new PyArrow engine.

On the other hand, Polars provides the “filter()” method, which lets us read only the rows we’re interested in.

# Pandas query
import pandas as pd

df_pd = pd.read_csv("mydata.csv", engine="pyarrow")
df_pd = df_pd[['id', 'date', 'office', 'sales']]
df_pd = df_pd.query("office == 'France'")

# Polars filter
import polars as pl

df_pl = (
    pl.read_csv("mydata.csv")
    .filter(pl.col('office') == 'France')
    .select(['id', 'date', 'office', 'sales'])
)

Running this test, Polars outperforms Pandas by a significant margin. Interestingly, the PyArrow engine doesn’t provide better results than the standard NumPy engine.

Group By Test

Grouping data is a great candidate for parallelization. Let’s see how Pandas and Polars handle group-by operations with aggregated calculations. We’ll calculate the mean of sales by office and month (the full benchmark does the same for the median).

# Pandas groupby mean
df_pd.groupby([df_pd.office, df_pd.date.dt.month])['sales'].agg('mean')

# Polars groupby mean
df_pl.groupby([pl.col('office'), pl.col('date').dt.month()]).agg([pl.mean('sales')])

The results show that, in group-by and aggregation calculations, Polars performs better than Pandas. The syntax of Polars is quite similar to Pandas.
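On a toy frame (data invented for illustration), the pandas version of the aggregation, extended to compute both the mean and the median, looks like this:

```python
import pandas as pd

df = pd.DataFrame({
    "office": ["France", "France", "Italy", "Italy"],
    "date": pd.to_datetime(["2022-01-05", "2022-01-20",
                            "2022-01-10", "2022-02-01"]),
    "sales": [100.0, 300.0, 50.0, 80.0],
})

# Mean and median of sales per office and month
out = (
    df.groupby(["office", df["date"].dt.month])["sales"]
      .agg(["mean", "median"])
)
print(out)
```

For the January rows of the France office, for example, the mean and median of 100 and 300 both come out to 200.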

Rolling Statistics Test

Another interesting operation to test is rolling statistics. This involves complex calculations, and optimizing them is a challenge. Let’s see how Pandas and Polars handle calculating a rolling mean of sales by day.

# Pandas rolling mean
df_pd['sales'].rolling(1440, min_periods=1).mean()

# Polars rolling mean
df_pl['sales'].rolling_mean(1440, min_periods=1)

Once again, Polars comes out as the winner. There’s a small syntax difference: Polars has the “rolling_mean()” method, while Pandas uses the “rolling()” method combined with other functions like “mean()”.
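A tiny example (invented series, window of 2) shows what min_periods=1 buys you in either library: without it, the first, incomplete window would yield a missing value instead of a number. Here is the pandas form:

```python
import pandas as pd

s = pd.Series([10.0, 20.0, 30.0, 40.0])

# Rolling window of 2; min_periods=1 lets the first (partial) window
# emit a value instead of NaN
roll = s.rolling(2, min_periods=1).mean()
print(roll.tolist())  # [10.0, 15.0, 25.0, 35.0]
```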

Sampling Test

Sampling is a common task in statistics and data science. It involves taking random samples from a dataset to analyze it. This can be challenging if you don’t have a lot of computing power. Let’s see how Pandas and Polars handle bootstrapping, a resampling technique with replacement. We’ll estimate the average of sales in 10,000 samples, each comprising 1,000 data points.

# Pandas bootstrap
import random
import numpy as np

simu_weights = []
for i in range(10000):
    bootstrap_sample = random.choices(df_pd['sales'], k=1000)
    simu_weights.append(np.nanmean(bootstrap_sample))

# Polars bootstrap
simu_weights = []
for i in range(10000):
    bootstrap_sample = df_pl['sales'].sample(n=1000, with_replacement=True)
    simu_weights.append(bootstrap_sample.mean())

Polars is nearly five times faster than Pandas in this sampling test.
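Scaled down to a few hundred resamples of a five-point dataset (numbers invented for illustration), the bootstrap loop looks like this in plain Python:

```python
import random
import statistics

random.seed(0)
data = [12.0, 15.0, 9.0, 21.0, 18.0]

# 200 bootstrap resamples of 5 points each
# (the article's benchmark uses 10,000 samples of 1,000 points)
means = []
for _ in range(200):
    sample = random.choices(data, k=5)   # sampling with replacement
    means.append(statistics.mean(sample))

print(round(statistics.mean(means), 2))
```

Because every resample is drawn from the original five values, each sample mean, and hence the mean of means, must fall between the dataset’s minimum and maximum.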

Compound Manipulations Test

In the final test, we perform a series of connected data manipulation tasks. This is something data professionals do regularly. We’ll join our dataset with another dataset called “italy_2022” and then sort and select data to identify the top interns based on sales in 2022.

# Pandas compound operations
(df_pd.merge(italy_2022, on='id')
      .query("responsibility == 'Sales Intern'")
      .sort_values('sales', ascending=False)
      .head(10)[['name', 'surname', 'sales', 'sex']])

# Polars compound manipulations
(df_pl.join(italy_2022_pl, on='id')
      .filter(pl.col('responsibility') == 'Sales Intern')
      .sort('sales', descending=True)
      .head(10)
      .select(['name', 'surname', 'sales', 'sex']))

Once again, Polars outperforms Pandas in terms of speed, especially in the join operation. The syntax is similar, but there are slight differences due to how the libraries work internally.
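Here is the same join-filter-sort-select chain in pandas on two tiny invented frames, so you can see each step’s effect:

```python
import pandas as pd

sales = pd.DataFrame({"id": [1, 2, 3], "sales": [500.0, 900.0, 700.0]})
people = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["Ada", "Ben", "Cleo"],
    "responsibility": ["Sales Intern", "Manager", "Sales Intern"],
})

top = (
    sales.merge(people, on="id")                      # join on the shared key
         .query("responsibility == 'Sales Intern'")   # keep only interns
         .sort_values("sales", ascending=False)       # best sellers first
         .head(10)[["name", "sales"]]                 # top rows, chosen columns
)
print(top)
```

The manager (Ben) is filtered out, and the two interns come back ordered by sales: Cleo (700), then Ada (500).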

Comparison Table

Here’s a summary of all the tests for a quick comparison:

  1. Import — Polars is significantly faster; the PyArrow engine gives Pandas no edge over NumPy.
  2. Group by — Polars is faster.
  3. Rolling statistics — Polars is faster.
  4. Sampling — Polars is nearly five times faster.
  5. Compound manipulations — Polars is faster, especially in the join.

In conclusion, according to our tests, Polars performs considerably better than Pandas in almost all aspects. However, Pandas still retains the advantage of having a larger user community and being the go-to option for data manipulation in Python. If you’re working with large datasets and need to boost your performance, Polars is a powerful alternative worth considering.
