Python Polars Cookbook

Part 2: Efficient Data Manipulation with Python Polars: Lazy Frames, Table Combining and Deduplication

Deliciously Simple Recipes to Kickstart Your Data Transformations

Arkimetrix Analytics
7 min read · Jan 16, 2023


In this article, the second in a series of practical Polars recipes, we will explore some of the transformations we have used in our recent projects.

In the previous article, we discussed our fondness for Polars and provided code on how to connect to supported databases such as Postgres, as well as a workaround for connecting to SQL Server.

In this article, we will cover the following:

  1. Create a fake dataset and save it as a Parquet file
  2. Lazy (or eager) load the data into a Polars data frame
  3. Learn how to combine multiple data frames
  4. Remove duplicate rows from a data frame

To begin, we will create a fake dataset to demonstrate practical transformations. This dataset is intentionally slightly more complex than the usual toy data frame to reflect real-life complexity.

import random
import time
from datetime import datetime

import polars as pl

customers = []
LETTERS = 'abcdefghijklmnopqrstuvwxyz'


def create_fake_data(nr_of_rows, path, file):
    '''Create a fake transactions dataset and materialize it to the defined format.
    The dataframe can be made as large as required using the 'nr_of_rows' argument.
    The other required arguments are 'path' and 'file'.'''

    for _ in range(nr_of_rows):

        # create a transaction date within a one-year range
        d1 = datetime.strptime('1/1/2021', '%m/%d/%Y')
        d2 = datetime.strptime('12/31/2021', '%m/%d/%Y')

        # get the timestamps of the beginning and end of the range
        start_timestamp = int(d1.timestamp())
        end_timestamp = int(d2.timestamp())

        # generate a random timestamp within the range
        random_timestamp = random.uniform(start_timestamp, end_timestamp)
        transaction_date = datetime.fromtimestamp(random_timestamp)
        transaction_date = transaction_date.strftime('%m/%d/%Y %H:%M:%S.%f')

        # create customer IDs
        cust_id = random.randint(1, 135)

        # create shipping method
        ship_method = random.choice(['Truck', 'Air', 'Rail'])

        # create item numbers
        item = random.randint(1, 1000)

        # create reference numbers
        random_letters = ''.join(random.sample(LETTERS, k=3))
        random_numbers = ''.join(str(random.randint(0, 9)) for _ in range(7))
        ref = f'id-{random_numbers}{random_letters}' + ';' + \
            f'id-{random_numbers[::-1]}{random_letters}'

        # append the generated elements as one row
        customers.append([transaction_date, ref, cust_id, ship_method, item])

    # convert the list of rows to a dataframe using Polars
    df = pl.DataFrame(customers,
                      columns=['Transaction_date', 'Ref',
                               'Cust_ID', 'Ship_method', 'Item'])

    # print(df)  # debug

    # write the file to the desired format, Parquet in this example.
    # The arguments 'use_pyarrow' and 'compression' are optional,
    # but we found that specifying them resulted in faster write speeds
    df.write_parquet(f'{path}/{file}.parquet',
                     use_pyarrow=True, compression='zstd')


start_time = time.time()  # optional: record start time

create_fake_data(1000000, '/path', 'output_file_name')

# optional: record execution time
end_time = time.time()
execution_time = end_time - start_time
print(f"Execution time: {execution_time} seconds")

The size of the dataframe can be adjusted using the ‘nr_of_rows’ argument. In the examples provided, the number of rows is set to 1,000,000 (10⁶). Benchmarking was performed locally on a laptop with a 12th Gen Intel i7-1260P CPU (2100 MHz, 12 cores), 16 GB RAM, and Ubuntu 20.04 (WSL2). Generating 10⁶ rows takes ~26 seconds, and the resulting Parquet file is ~24.5 MB. The generated data frame looks like this:

shape: (1000000, 5)
┌────────────────────────────┬─────────────────────────────┬─────────┬─────────────┬──────┐
│ Transaction_date ┆ Ref ┆ Cust_ID ┆ Ship_method ┆ Item │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 ┆ str ┆ i64 │
╞════════════════════════════╪═════════════════════════════╪═════════╪═════════════╪══════╡
│ 10/15/2021 21:10:17.889427 ┆ id-1094579svb;id-9754901svb ┆ 87 ┆ Truck ┆ 969 │
│ 01/03/2021 23:00:20.705979 ┆ id-6778571yva;id-1758776yva ┆ 57 ┆ Rail ┆ 463 │
│ 10/25/2021 08:21:38.892811 ┆ id-9868666djm;id-6668689djm ┆ 40 ┆ Rail ┆ 442 │
│ 12/24/2021 10:24:11.919099 ┆ id-8161638iyl;id-8361618iyl ┆ 37 ┆ Rail ┆ 42 │
│ ... ┆ ... ┆ ... ┆ ... ┆ ... │
│ 04/13/2021 19:50:16.876760 ┆ id-7819853vwc;id-3589187vwc ┆ 14 ┆ Rail ┆ 626 │
│ 11/07/2021 06:54:43.312792 ┆ id-5154523ukt;id-3254515ukt ┆ 41 ┆ Truck ┆ 798 │
│ 07/16/2021 10:36:44.143773 ┆ id-1350711rpi;id-1170531rpi ┆ 15 ┆ Air ┆ 349 │
│ 04/11/2021 20:08:08.716959 ┆ id-7493362bni;id-2633947bni ┆ 38 ┆ Rail ┆ 942 │
└────────────────────────────┴─────────────────────────────┴─────────┴─────────────┴──────┘
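To sanity-check what was written, the size of the file on disk can be read back with the standard library (a quick sketch; the path and file name are the placeholders passed to create_fake_data above):

import os

# size of the Parquet file written by create_fake_data, in megabytes
size_mb = os.path.getsize('/path/output_file_name.parquet') / 1e6
print(f"Parquet file size: {size_mb:.1f} MB")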

Lazy loading data

Let’s create a dataframe by ‘Lazy Loading’ the parquet file from the previous step into a Polars dataframe.

Polars allows you to scan a Parquet input. Scanning delays the actual parsing of the file and instead returns a lazy computation holder called a LazyFrame.

Execution time to load 1 million rows into a dataframe: 0.00879 seconds.

import polars as pl
import time

df = pl.scan_parquet("path.parquet")

The data can also be read ‘eagerly’ from the parquet file into memory:

# replace pl.scan_parquet with read_parquet
df = pl.read_parquet("path.parquet")

Execution time in this case is 0.0840 seconds, roughly 10 times longer than the lazy scan, but still lightning fast.
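If you want to reproduce the comparison yourself, a simple timing harness looks like this (a minimal sketch; the path is the same placeholder used above):

import time

import polars as pl

# lazy: only the file metadata is touched at this point
t0 = time.time()
lazy_df = pl.scan_parquet("path.parquet")
print(f"scan_parquet: {time.time() - t0:.5f} seconds")

# eager: the whole file is parsed into memory
t0 = time.time()
eager_df = pl.read_parquet("path.parquet")
print(f"read_parquet: {time.time() - t0:.5f} seconds")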

A little about LazyFrame Objects

Several libraries such as Dask, Vaex and Modin provide Lazy DataFrame functionality.

Lazy DataFrames are a concept in Python that refers to a way of working with data in a DataFrame format, but without loading all the data into memory at once. A LazyFrame holds information found during the brief scan and reads the rest of the file only when it is needed.

This can be useful when working with large datasets that would otherwise consume too much memory to fit into RAM. Instead of loading the entire dataset into memory, Lazy DataFrames only load and process the data as it is needed. This allows for more efficient use of resources and can make it possible to work with much larger datasets than would otherwise be possible.
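As a concrete illustration, the query below is built against a lazy frame, but nothing is read until .collect() is called, and Polars can push the filter and the column selection down into the Parquet scan (a sketch using the columns of the fake dataset created earlier):

import polars as pl

lazy_df = (
    pl.scan_parquet("path.parquet")
    .filter(pl.col("Ship_method") == "Air")
    .select(["Transaction_date", "Cust_ID", "Item"])
)

# nothing has been executed yet; this triggers the optimized query
air_shipments = lazy_df.collect()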

Some downsides of using Lazy DataFrames include:

Increased complexity: Adds an additional layer of complexity to the process of working with data, as the data is not loaded into memory all at once. This can make it more difficult to understand and debug issues that may arise.

Reduced performance for small datasets: Lazy DataFrames may not provide any performance benefits for small datasets, and in some cases, may even be slower than working with regular DataFrames due to the overhead of managing the lazy operations.

Incompatibility with certain libraries: Some libraries and tools are not designed to work with Lazy DataFrames, which can limit the functionality that is available when working with the data.

Limitations on operations: Some operations that can be performed on regular DataFrames cannot be performed on Lazy DataFrames.

Lazy frames cannot simply be viewed or printed; if you try, you will get output like this:

df
Out[25]: <polars.LazyFrame object at 0x7FF963E705B0>

print(df)
naive plan: (run LazyFrame.describe_optimized_plan() to see the optimized plan)

PARQUET SCAN /path/fake_data.parquet
PROJECT */5 COLUMNS
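The message above points to LazyFrame.describe_optimized_plan(), which shows the plan Polars will actually run after applying its query optimizations:

print(df.describe_optimized_plan())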

Some useful commands to get details about a Lazy Frame:

LazyFrame.schema: returns a dict mapping column names to data types

df.schema
Out[28]:
{'Transaction_date': Utf8,
'Ref': Utf8,
'Cust_ID': Int64,
'Ship_method': Utf8,
'Item': Int64}

LazyFrame.collect: Polars evaluates the query eagerly when the .collect method is called. To run a query in eager mode you can also simply replace scan_xxx with read_xxx in the Polars code.

df.collect()
Out:
shape: (1000000, 5)
┌────────────────────────────┬─────────────────────────────┬─────────┬─────────────┬──────┐
│ Transaction_date ┆ Ref ┆ Cust_ID ┆ Ship_method ┆ Item │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 ┆ str ┆ i64 │
╞════════════════════════════╪═════════════════════════════╪═════════╪═════════════╪══════╡
│ 10/15/2021 21:10:17.889427 ┆ id-1094579svb;id-9754901svb ┆ 87 ┆ Truck ┆ 969 │
│ 01/03/2021 23:00:20.705979 ┆ id-6778571yva;id-1758776yva ┆ 57 ┆ Rail ┆ 463 │
│ 10/25/2021 08:21:38.892811 ┆ id-9868666djm;id-6668689djm ┆ 40 ┆ Rail ┆ 442 │
│ 12/24/2021 10:24:11.919099 ┆ id-8161638iyl;id-8361618iyl ┆ 37 ┆ Rail ┆ 42 │
│ ... ┆ ... ┆ ... ┆ ... ┆ ... │

LazyFrame.fetch: collects a small number of rows for debugging purposes. Fetch is like the collect() operation, but it overrides the number of rows read by every scan operation.

df.fetch(2)
Out:
shape: (2, 5)
┌────────────────────────────┬─────────────────────────────┬─────────┬─────────────┬──────┐
│ Transaction_date ┆ Ref ┆ Cust_ID ┆ Ship_method ┆ Item │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 ┆ str ┆ i64 │
╞════════════════════════════╪═════════════════════════════╪═════════╪═════════════╪══════╡
│ 10/15/2021 21:10:17.889427 ┆ id-1094579svb;id-9754901svb ┆ 87 ┆ Truck ┆ 969 │
│ 01/03/2021 23:00:20.705979 ┆ id-6778571yva;id-1758776yva ┆ 57 ┆ Rail ┆ 463 │
└────────────────────────────┴─────────────────────────────┴─────────┴─────────────┴──────┘

Vertical concatenation — let’s make the dataset larger

Polars allows both vertical and horizontal concatenation of two or more dataframes. Horizontal concatenation will be covered another time, but for now let’s double the size of the dataframe from the previous section.

Why? Why not :). Data analysts often have to combine tables, and this is an extremely fast way to do that. Of course, vertical concatenation only works when the dataframes have the same column names.

# continuing from the previous section where df has been loaded
# the syntax is the same whether the df comes from a scan or a read
# concatenating lazy dfs returns a lazy df

df_vertical_concat = pl.concat([
    df,
    df,
])

# df_vertical_concat Out: shape: (2000000, 5)

With lazy data frames this took only ~2.5 seconds. Of course, concatenating in-memory data frames (using read_parquet instead of scan_parquet) took less time: 0.0636 seconds. Just for kicks, concatenating the frame ten times to create a 10-million-row data frame took 1.001 seconds.

Remove Duplicates

In the previous step we duplicated each row of the original dataframe. Let’s get it back into the original form. Easy!

df_back = df_vertical_concat.unique()
#Execution time: 0.0010294914245605469 seconds
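By default, unique considers entire rows. If duplicates should instead be identified by a subset of columns, unique also accepts a subset argument (an illustrative sketch; ‘Ref’ is one of the columns from the fake dataset):

# keep one row per distinct Ref value instead of per fully identical row
df_by_ref = df_vertical_concat.unique(subset=['Ref'])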

In Conclusion

This article has provided an overview of some basic actions that can be achieved using Python Polars. The article demonstrated how to create a fake dataset that reflects real-life complexity, and provided examples of how some transformations can be applied.

As a reminder, this is part of a series and these recipes are currently somewhat random. Subsequent articles will cover more detailed transformations.

We hope that this series will be informative and useful for the Python Polars community, and that it inspires further exploration and experimentation with this powerful tool.


Arkimetrix Analytics

Arkimetrix Analytics turns raw data into insights via Power BI and Python, streamlining decision-making and operations for clients.