[Image: A polar bear with the Python Polars logo stripes overlaid on it, meant to convey the power of the Python Polars library.]
Python Polars Cookbook = POOK BOOK

Part 3. Python Polars - Data Type Operations, Replacements, Extractions, and Explosions

Easier Syntax and Faster than Pandas, Dask, Vaex etc.

Arkimetrix Analytics
8 min read · Jan 20, 2023


I want to add some qualifications at the outset. While the title suggests that Polars is both easier to use and faster than its competitors, the article itself does not prove either claim. The claim about speed is supported by a multitude of articles on the topic and by our own tests, some covered in Part 2 of the series. The claim about easier syntax reflects my opinion and my team’s experience using Polars, but the article does not compare and contrast Polars with the rest.

This series of articles is targeted at analysts who are getting started with Python, and might also be useful for Python experts not familiar with the Polars library. It is focused on the data type operations defined in the next section.

The first article covered why Polars is phenomenal (as a replacement for Pandas, Dask, etc.). Part 2 covered basic operations such as reading and writing data.

This article examines frequently used transformations for data analysis or modeling after data extraction, as part of an ongoing series. Part 3 covers:

  1. Data type transforms
  2. Replace strings and drop columns
  3. Split or Explode strings

We will continue to use the dataframe created in Part 2. The fake dataframe ‘df’ has the following structure:

df.fetch(5)
Out[]:
shape: (5, 5)
┌────────────────────────────┬─────────────────────────────┬─────────┬─────────────┬──────┐
│ Transaction_date ┆ Ref ┆ Cust_ID ┆ Ship_method ┆ Item │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 ┆ str ┆ i64 │
╞════════════════════════════╪═════════════════════════════╪═════════╪═════════════╪══════╡
│ 10/15/2021 21:10:17.889427 ┆ id-1094579svb;id-9754901svb ┆ 87 ┆ Truck ┆ 969 │
│ 01/03/2021 23:00:20.705979 ┆ id-6778571yva;id-1758776yva ┆ 57 ┆ Rail ┆ 463 │
│ 10/25/2021 08:21:38.892811 ┆ id-9868666djm;id-6668689djm ┆ 40 ┆ Rail ┆ 442 │
│ 12/24/2021 10:24:11.919099 ┆ id-8161638iyl;id-8361618iyl ┆ 37 ┆ Rail ┆ 42 │
│ 03/12/2021 16:18:25.103390 ┆ id-8669336nsx;id-6339668nsx ┆ 13 ┆ Air ┆ 27 │
└────────────────────────────┴─────────────────────────────┴─────────┴─────────────┴──────┘

We are also using lazy evaluation throughout, since the ‘scan_parquet’ function was used to load the parquet file into the df LazyFrame. Execution can be requested at any time with the .collect() method. So, for example, df = df.collect() would run the query over all available data.

Data type conversions

Column renaming

Before we get started with the transformations, your boss/customer complains that ‘Transaction_date’ is too long and needs to be shortened to ‘date’, and ‘Ref’ should instead be ‘reference’. Easily done with a single line:

# Rename columns
df = df.rename({"Transaction_date": "date",'Ref':'reference'})

Data types: Int to Str to Int

Polars data types can be found in the documentation.

To change data types, Polars uses the expression ‘polars.Expr.cast’. So, for example, if you want the ‘Cust_ID’ column to be in string format instead of i64, you can use the expression below.

# Illustrative: change column datatypes (two separate with_columns calls,
# since two expressions with the same output name cannot share one call)
df = df.with_columns(
    pl.col('Cust_ID').cast(pl.Utf8)   # from 'i64' to string ('Utf8')
).with_columns(
    pl.col('Cust_ID').cast(pl.Int64)  # change back to 'i64'
)

Data type String to Date

It is almost certain that date operations will be required on df. As you can see above, the ‘Transaction_date’ column (since renamed to ‘date’) is in string format. Let’s change that to a date format. However, you can’t just use ‘polars.Expr.cast’ to do that.

Here is a script to illustrate how that would return an error:

# illustrative, useful for multiple columns
convert_to_date = ['Transaction_date']

''' This block of code is wrong, here just to illustrate possible errors.'''

df = (
    df.with_columns([
        pl.col(convert_to_date).cast(pl.Date).alias("Date_Transact")
    ])
)

'''Converting a string (Utf8 data type in Polars) to date will fail
and return an error of the type: ComputeError: Strict conversion from Utf8
to Date failed for values ["12/24/2021 10:24:11.919099",
"10/25/2021 08:21:38.892811", ... "01/03/2021 23:00:20.705979"]...
Ideally this should have been fixed before saving to the parquet file'''

To convert a string to a date, we need to use the ‘strptime’ function, as below:

''' Right way to convert string to date'''
df = (
    df.with_columns([
        pl.col(convert_to_date).str.strptime(
            pl.Date,
            format='%m/%d/%Y %H:%M:%S%.f', strict=False
        )
    ])
)

If you are not familiar with it, Polars’ str.strptime is modeled on the strptime() method from Python’s datetime module, which parses a string representation of a date and time and converts it to a datetime object. Here it takes two arguments: the target data type (pl.Date) and a string describing the format of the date and time (format).

For example, the first date (string) in df is “10/15/2021 21:10:17.889427”. The matching format is ‘%m/%d/%Y %H:%M:%S%.f’ (the string has time down to microseconds, captured by the fractional-seconds directive %.f). Here is a good article that covers the various Python strptime() date and time directives.

Parsing column values

For the next operation, let’s look at the ‘reference’ column (originally ‘Ref’). The ‘reference’ column holds string values of the type ‘id-1094579svb;id-9754901svb’.

Let’s say a business SME has just informed you that the column actually holds two different items that are separated by “;”. They would also like the prefix “id-” to be removed.

Let’s do the transformations one step at a time.

Replace

First, let’s get rid of the prefix ‘id-’:

df = df.with_columns([
    pl.col('reference').str.replace(r"id-", "")
])

The script above returns:

df.fetch(5) # returns first 5 rows from the Lazy frame
Out[]:
shape: (5, 5)
┌────────────┬──────────────────────────┬─────────┬─────────────┬──────┐
│ date ┆ reference ┆ Cust_ID ┆ Ship_method ┆ Item │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ date ┆ str ┆ i64 ┆ str ┆ i64 │
╞════════════╪══════════════════════════╪═════════╪═════════════╪══════╡
│ 2021-10-15 ┆ 1094579svb;id-9754901svb ┆ 87 ┆ Truck ┆ 969 │
│ 2021-01-03 ┆ 6778571yva;id-1758776yva ┆ 57 ┆ Rail ┆ 463 │
│ 2021-10-25 ┆ 9868666djm;id-6668689djm ┆ 40 ┆ Rail ┆ 442 │
│ 2021-12-24 ┆ 8161638iyl;id-8361618iyl ┆ 37 ┆ Rail ┆ 42 │
│ 2021-03-12 ┆ 8669336nsx;id-6339668nsx ┆ 13 ┆ Air ┆ 27 │
└────────────┴──────────────────────────┴─────────┴─────────────┴──────┘

Here you can see that only the first instance of ‘id-’ was removed, but we want both instances gone!

Easy: replace ‘str.replace’ with ‘str.replace_all’.

df = df.with_columns([
    pl.col('reference').str.replace_all(r"id-", "")
])

df.fetch(5)
Out[]:
shape: (5, 5)
┌────────────┬───────────────────────┬─────────┬─────────────┬──────┐
│ date ┆ reference ┆ Cust_ID ┆ Ship_method ┆ Item │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ date ┆ str ┆ i64 ┆ str ┆ i64 │
╞════════════╪═══════════════════════╪═════════╪═════════════╪══════╡
│ 2021-10-15 ┆ 1094579svb;9754901svb ┆ 87 ┆ Truck ┆ 969 │
│ 2021-01-03 ┆ 6778571yva;1758776yva ┆ 57 ┆ Rail ┆ 463 │
│ 2021-10-25 ┆ 9868666djm;6668689djm ┆ 40 ┆ Rail ┆ 442 │
│ 2021-12-24 ┆ 8161638iyl;8361618iyl ┆ 37 ┆ Rail ┆ 42 │
│ 2021-03-12 ┆ 8669336nsx;6339668nsx ┆ 13 ┆ Air ┆ 27 │
└────────────┴───────────────────────┴─────────┴─────────────┴──────┘

Split, Split_exact, Drop and Explode

For the second step, we want to split the ‘reference’ strings, which are separated by ‘;’, into two parts.

df = df.with_columns([
    pl.col('reference').str.split(by=";")
])

This returns the ‘reference’ column with a list of strings. Not exactly what we want.

df.fetch(5)
Out[]:
shape: (5, 5)
┌────────────┬──────────────────────────────┬─────────┬─────────────┬──────┐
│ date ┆ reference ┆ Cust_ID ┆ Ship_method ┆ Item │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ date ┆ list[str] ┆ i64 ┆ str ┆ i64 │
╞════════════╪══════════════════════════════╪═════════╪═════════════╪══════╡
│ 2021-10-15 ┆ ["1094579svb", "9754901svb"] ┆ 87 ┆ Truck ┆ 969 │
│ 2021-01-03 ┆ ["6778571yva", "1758776yva"] ┆ 57 ┆ Rail ┆ 463 │
│ 2021-10-25 ┆ ["9868666djm", "6668689djm"] ┆ 40 ┆ Rail ┆ 442 │
│ 2021-12-24 ┆ ["8161638iyl", "8361618iyl"] ┆ 37 ┆ Rail ┆ 42 │
│ 2021-03-12 ┆ ["8669336nsx", "6339668nsx"] ┆ 13 ┆ Air ┆ 27 │
└────────────┴──────────────────────────────┴─────────┴─────────────┴──────┘

There are two ways to solve the problem depending on what output is required.

Case 1. Split into columns: We want to split the string values in the ‘reference’ column into exactly 2 parts and assign each part to a new column. To do that, instead of ‘str.split’ we can use ‘str.split_exact’.

df = df.with_columns([
    pl.col('reference').str.split_exact(";", 1)
    .struct.rename_fields(["first_part", "second_part"])
    .alias("fields"),
]).unnest("fields").drop('reference')

This returns what we wanted: two new columns. We also used the df.drop(‘col name’) method to remove the original ‘reference’ column.

df.fetch(5)
Out[]:
shape: (5, 6)
┌────────────┬─────────┬─────────────┬──────┬────────────┬─────────────┐
│ date ┆ Cust_ID ┆ Ship_method ┆ Item ┆ first_part ┆ second_part │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ date ┆ i64 ┆ str ┆ i64 ┆ str ┆ str │
╞════════════╪═════════╪═════════════╪══════╪════════════╪═════════════╡
│ 2021-10-15 ┆ 87 ┆ Truck ┆ 969 ┆ 1094579svb ┆ 9754901svb │
│ 2021-01-03 ┆ 57 ┆ Rail ┆ 463 ┆ 6778571yva ┆ 1758776yva │
│ 2021-10-25 ┆ 40 ┆ Rail ┆ 442 ┆ 9868666djm ┆ 6668689djm │
│ 2021-12-24 ┆ 37 ┆ Rail ┆ 42 ┆ 8161638iyl ┆ 8361618iyl │
│ 2021-03-12 ┆ 13 ┆ Air ┆ 27 ┆ 8669336nsx ┆ 6339668nsx │
└────────────┴─────────┴─────────────┴──────┴────────────┴─────────────┘

Case 2. Split into rows: We want to split the string values in the ‘reference’ column into exactly 2 parts, but keep them in the same column.

This can be accomplished using the explode method:

df = (
    df
    .with_columns(
        pl.col('reference').str.split(by=";")
    )
    .explode('reference')
)

This returns a long dataframe where every split item is expanded to a new row.

df.fetch(4)
Out[]:
shape: (8, 5)
┌────────────┬────────────┬─────────┬─────────────┬──────┐
│ date ┆ reference ┆ Cust_ID ┆ Ship_method ┆ Item │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ date ┆ str ┆ i64 ┆ str ┆ i64 │
╞════════════╪════════════╪═════════╪═════════════╪══════╡
│ 2021-10-15 ┆ 1094579svb ┆ 87 ┆ Truck ┆ 969 │
│ 2021-10-15 ┆ 9754901svb ┆ 87 ┆ Truck ┆ 969 │
│ 2021-01-03 ┆ 6778571yva ┆ 57 ┆ Rail ┆ 463 │
│ 2021-01-03 ┆ 1758776yva ┆ 57 ┆ Rail ┆ 463 │
│ 2021-10-25 ┆ 9868666djm ┆ 40 ┆ Rail ┆ 442 │
│ 2021-10-25 ┆ 6668689djm ┆ 40 ┆ Rail ┆ 442 │
│ 2021-12-24 ┆ 8161638iyl ┆ 37 ┆ Rail ┆ 42 │
│ 2021-12-24 ┆ 8361618iyl ┆ 37 ┆ Rail ┆ 42 │
└────────────┴────────────┴─────────┴─────────────┴──────┘

Note, we see 8 rows even though we specified df.fetch(4). This is because fetch(4) limits how many rows are read from the source file before the query runs; each of those 4 source rows is then exploded into 2 rows, giving 8 rows in the result.

An aside: readers somewhat familiar with the Polars library might have noticed that the .lazy() method was not used in any of the examples.

The .lazy() method converts an eager DataFrame into a LazyFrame, so that computations are delayed and optimized until the result is requested, rather than executed immediately. It was not needed here because scan_parquet already returns a LazyFrame, so every operation above was lazy from the start.

Conclusion and Recap

Part 3 has covered some transformations that can be done easily and quickly using Python Polars. The different scripts above can be combined into one, of course. It would look something like this:

import polars as pl

df = pl.scan_parquet(
    '/home/dhb/pythonscripts/final_scripts/Database/fake_data.parquet')

# illustrative, useful for multiple columns
convert_to_date = ['Transaction_date']

# case 1: split reference into two columns
df = (
    df
    .with_columns([
        pl.col(convert_to_date).str.strptime(
            pl.Date,
            format='%m/%d/%Y %H:%M:%S%.f', strict=False
        )
    ])
    .rename({'Transaction_date': 'date', 'Ref': 'reference'})
    .with_columns([
        # chain replace_all before split_exact: expressions within the same
        # with_columns call all read the input columns, so the split must
        # happen after the prefix is stripped, inside a single expression
        pl.col('reference').str.replace_all(r"id-", "")
        .str.split_exact(";", 1)
        .struct.rename_fields(["first_part", "second_part"])
        .alias("fields")
    ])
    .unnest("fields").drop('reference')
)

Do clap if you found this article useful.


Arkimetrix Analytics

Arkimetrix Analytics turns raw data into insights via Power BI and Python, streamlining decision-making and operations for clients.