Beyond Pandas — working with big(ger) data more efficiently using Polars and Parquet

Jack Vines
Data science at Nesta
7 min read · Jan 30, 2023

As data scientists/engineers, we often deal with large datasets that can be challenging to work with. Pandas, a popular Python package for data manipulation, is great for small- to medium-sized datasets, but it can become slow and resource-intensive when working with larger datasets.

In this article, we will discuss how using the Python package Polars and the Parquet file format can help improve the efficiency and scalability of your data workflow. We will cover the benefits of using these tools and provide step-by-step instructions on how to implement them. Whether you’re a beginner or an experienced data scientist/engineer, this article will hopefully provide insights on how to optimise your workflow and work with larger datasets more effectively.

Note that although the tools we cover will help maximise the performance of your existing computing resources, the limitations of those resources will still exist!

Polars

Polars is a library for processing large datasets in Python. It provides many of the same features as the Pandas library.

One key advantage of Polars is that it supports lazy evaluation: it optimises queries on the dataset before executing them, and only loads and processes the data when it is actually needed. Polars is also written in Rust, which lets it safely parallelise work across all available CPU cores.

Pandas, by comparison, runs largely on a single thread and executes every operation eagerly, which can lead to wasted resources and longer execution times.

Parquet

The Parquet file format is a columnar storage format that is designed for efficient storage and data transfer. It is commonly used in the big data ecosystem and it is supported by a wide range of tools and systems including Apache Spark and Amazon S3.

One of the key advantages of the Parquet file format is its ability to handle large datasets. It stores data in a highly compressed and efficient manner which reduces the amount of storage and bandwidth required. It also supports advanced features such as data partitioning, which can be useful when working with complex data structures and changing requirements.

In comparison, the comma-separated values (CSV) file format is a simple, flat text file format that stores data in a row-based manner. It is widely supported and easy to use but it can be inefficient when working with large datasets due to its lack of compression and advanced features. It is also more prone to data loss and corruption due to its lack of error checking and recovery mechanisms. On the other hand, it’s a much more accessible format for non-coders due to its compatibility with spreadsheet software, so may still be a better choice for outputs.

A workflow combining Polars and Parquet can be several orders of magnitude faster than Pandas and CSV, and is also very fast compared with other Python libraries designed to solve similar problems, such as Vaex and Dask (see Figure 1 below, from the TPCH speed benchmark, showing read times for various libraries, with Polars the fastest).

Figure 1: Benchmark speed comparisons between Polars, Dask, Pandas, Modin and Vaex (source: Polars TPCH benchmark)

The Polars project also publishes further benchmarking information and results.

How to use Polars and the Parquet file format in your data science workflow

Let’s dive into how to quickly implement these tools to speed up your data workflow. One of my favourite features of Polars is how syntactically similar it is to Pandas, allowing a very quick transition from Pandas code to Polars code when Pandas performance becomes an issue. Additionally, both Polars and Pandas can work with Parquet files, so they can be integrated into your workflow without the need for additional tools.

We’re going to use the Daily Power Grid Dataset from Kaggle for a worked example. It isn’t a particularly big dataset, but it is still indicative of the speed improvements that can be made by using Polars and Parquet.

Using Polars

To start using Polars, you will need to install it in your Python environment. You can do this using the following command:

pip install polars

Once you have installed Polars, you can begin using it to manipulate and analyse your data. Here is an example of how to load and preview a dataset using Polars:

import polars as pl
df = pl.read_csv("~/Downloads/archive/PowerGeneration.csv")
df.head()

A simple comparison to Pandas shows significant differences in performance. On a dataset this small the difference may feel negligible, but for datasets that are multiple gigabytes in size, reading with Polars rather than Pandas can reduce waiting times from minutes to seconds:

Polars’ speed improvements are not just limited to reading files, but also for processing data frames. In the following example, I want to group the power generation dataset by the Power Station variable, calculating the sum of the Excess(+) / Shortfall (-) variable, then sorting from highest to lowest values. Here’s how to do it in Pandas and in Polars, and the difference between the two in terms of speed:

Polars has many data frame manipulation functions similar to comparable libraries such as Pandas. See here for examples in the documentation of more methods in Polars.

A small difference to be aware of between Pandas and Polars is that Polars does not have an index; each row is identified by its position in the table. Arguably this simplifies things, but if you're coming from Pandas, be aware that iloc doesn't exist!

Additionally, you may reach a stage where you've used Polars for some heavy reading, processing or aggregation, but it would now be useful to continue your workflow in Pandas. In that case, if you have a Polars data frame df, you can make use of the to_pandas() functionality:

df = df.to_pandas()

Even if all you do with Polars is read the dataset and then immediately convert it to a Pandas data frame, it's likely quicker than a direct read with Pandas:

Although exact improvements will vary depending on the specific use case, in this example, the resulting data frame in both instances is a pandas.DataFrame, but reading it via Polars is 5 times faster.

Using the Parquet file format

Whether using Pandas or Polars, it’s straightforward to interact with Parquet files, and reading files is pretty much the same with either library:

import polars as pl
import pandas as pd
# polars
df = pl.read_parquet("my_dataset.parquet")
# pandas
df = pd.read_parquet("my_dataset.parquet")

Parquet files will usually be smaller than CSV files containing the same information, because of the efficient compression, and these files tend to be much quicker to read and write in comparison to other file types.

Although most data scientists will by default opt to use CSV files for storage of data, the small change of using Parquet files over CSV will lead to savings in storage space, and quicker load/save times in your workflow.

In the example below, I’m comparing writing and reading the power generation dataset, firstly from a Pandas data frame to a CSV file, and back to a Pandas data frame, and secondly from a Polars data frame to a Parquet file and back to a Polars data frame:

The combination of Polars and Parquet in this instance results in a ~30x speed increase!

Conclusion

In this article, we looked at how the Python package Polars and the Parquet file format can help improve the efficiency and scalability of a workflow when working with large datasets. We went over the differences between these tools and alternatives like Pandas and the CSV file format and provided step-by-step instructions on how to use them in your workflow.

Overall, if you’re a data scientist dealing with large datasets, it’s worth considering adding Polars and Parquet to your toolkit. These tools can make working with large datasets much more manageable, and can help you scale up your workflow as your datasets grow. Give them a try and see how they can benefit your data projects.

Further reading

This tutorial covers a very basic approach to using Polars and the Parquet format in your workflow, but both tools are powerful well beyond the simple approach outlined above. To dive into more advanced usage of Polars and Parquet files, in a less Pandas/CSV-adjacent way, consider exploring the official Polars documentation and the Apache Parquet documentation.
