
Part 1. Phenomenal Python Polars: The new go-to for ETL and data analysis

Polars is a fast and easy-to-learn library for ETL and data analysis, and it makes connecting to databases easy and fast.

Arkimetrix Analytics
4 min read · Jan 15, 2023


Polars offers several advantages, including fast data ingestion and export, much lower memory overhead, and optimized caching and performance. If you haven’t checked it out yet, there are a number of articles on Medium and elsewhere benchmarking it against Pandas, so I will not repeat the details here. Here is a link to an excellent introduction to Polars.

The documentation and resources may not be as extensive as Pandas’s, but there is an extremely helpful community, including the library’s author, available for support.

This article covers the very first step: how to load the data. It is important to first decide whether you want to ‘Lazy’ or ‘Eager’ load the data. Lazy versus Eager is a large topic in itself, and I will not attempt to discuss it here.

There is so much that can be done with Polars and as indicated above, a lot of introductory articles and other reviews are available.

We’d prefer to lazy load the data if possible. As Polars is so quick, we have never had to worry about load times versus fully loading the data in memory. However, the Lazy API documentation is sparse and at times doesn’t translate 1:1 from the eager expressions.
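To make the difference concrete, here is a minimal sketch of the same query written eagerly and lazily (the file name and columns are just illustrative placeholders):

import polars as pl

# Eager: the whole file is read into memory immediately
df = pl.read_csv("sales.csv")
eager_result = df.filter(pl.col("quantity") > 10).select(["order_id", "quantity"])

# Lazy: nothing is read until .collect(); Polars can push the filter and
# column selection down into the scan so only the needed data is loaded
lazy_result = (
    pl.scan_csv("sales.csv")
    .filter(pl.col("quantity") > 10)
    .select(["order_id", "quantity"])
    .collect()
)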

Polars supports reading data from various formats (CSV, Parquet, and JSON) and connecting to databases like Postgres, MySQL, and Redshift. Reading or ‘scanning’ data from CSV, Parquet, JSON etc. is quite straightforward and well covered in the usage guide.

Connecting to most databases is usually simple as well, but currently there are some gaps. First, the easy case, which works beautifully straight out of the box: connecting to Postgres (the same approach works for MySQL, SQLite, Redshift, and ClickHouse).

How to connect to Postgres using Polars

Polars reads from databases through the connector-x library, which needs to be installed. If this saves you time, here’s the script we use:

import polars as pl

dialect = 'postgresql'
username = 'username'
password = 'password'
host = 'host'
port = 'port'  # usually 5432
database = 'database'
table = 'table'
partition_on = 'quantity'  # Optional: a numeric field in the source table
num_part = 12              # Optional: number of parallel operations to run

def connect_download(tablename, partition_on, num_part):
    conn = f"{dialect}://{username}:{password}@{host}:{port}/{database}"
    query = f'SELECT * FROM public.{tablename}'
    df = pl.read_sql(query, conn, partition_on=partition_on,
                     partition_num=num_part)
    # If not defining the partition, comment out the line above and use this one:
    # df = pl.read_sql(query, conn)

    # Alternatively, you can read in parallel using 2 threads
    # by explicitly providing two SQL queries:
    # queries = [
    #     "SELECT * FROM lineitem WHERE partition_col <= 10",
    #     "SELECT * FROM lineitem WHERE partition_col > 10",
    # ]
    # df = pl.read_sql(queries, conn)

    # Modify as required if you want to materialize the data frame
    # (vv.data_path is the output directory defined elsewhere in our config)
    df.write_parquet(f'{vv.data_path}/{database}_{table}.parquet')

Polars allows data to be materialized in many different formats, including CSV, JSON, and Parquet.
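For example, the last line of the function above could be swapped for any of the following (output paths here are placeholders):

df.write_csv("output.csv")          # comma-separated text
df.write_json("output.json")        # JSON
df.write_parquet("output.parquet")  # columnar Parquet (our usual choice)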

Polars (or more accurately, connector-x) currently doesn’t support SQL Server. We do need to connect to SQL Server very often, and still use Pandas or alternatives, depending on the data size.

Regardless, we still prefer to use Polars and clone the pandas data frame using the Polars function polars.from_pandas().
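The conversion itself is a one-liner; a minimal sketch, assuming a pandas frame has already been loaded from SQL Server:

import pandas as pd
import polars as pl

pdf = pd.DataFrame({"order_id": [1, 2], "quantity": [5, 12]})  # stand-in for a frame read from SQL Server
df = pl.from_pandas(pdf)  # clone the pandas DataFrame into a Polars DataFrame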

This isn’t related to Polars, but if you need to connect to SQL Server, here’s a script we use. Make sure you have the MS SQL Server driver installed correctly; in our experience that tends to be an adventure (at least on Linux). You will also need pyodbc.

How to connect to SQL Server (Polars workaround)

import pyodbc
from sqlalchemy import create_engine
from sqlalchemy.sql import text
import pandas as pd
import polars as pl
from tenacity import retry, wait_exponential, stop_after_attempt
import os
from dotenv import load_dotenv
load_dotenv()

server = r'00.000.0.0\Abc123'
database = 'database'
username = os.environ.get('DBUSER')    # replace with username if not using .env
password = os.environ.get('PASSWORD')  # replace with password if not using .env
driver = r'ODBC Driver 18 for SQL Server'

# Function to materialize SQL output to the specified location
def sql_download(tablename, output_loc):
    '''Connects to SQL Server and fetches the required table.
    Requires the following arguments:
    (SQL Server) username, password, tablename, output path
    Requires the following libraries: os, dotenv, pyodbc, pandas, polars, tenacity, pyarrow, sqlalchemy'''
    DATABASE_CONNECTION = f'mssql+pyodbc://{username}:{password}@{server}/{database}?driver={driver}&TrustServerCertificate=yes'
    engine = create_engine(DATABASE_CONNECTION)

    # This section is optional. Use it to monitor or log.
    # `exponential backoff` is used to poll the database for a status
    # and make sure that it is available for a connection
    # by sending a basic query at an incremental delay.
    # The number of retry attempts as well as the multiplier
    # and time between attempts can be modified as required.
    @retry(wait=wait_exponential(multiplier=2, min=1, max=10),
           stop=stop_after_attempt(5))
    def try_connection():
        try:
            # open a fresh connection on each attempt so a failed try
            # does not leave a closed connection behind
            with engine.connect() as connection:
                stmt = text("SELECT 1")
                connection.execute(stmt)
                print("Connection to database successful.")
        except Exception:
            print("Connection to database failed, retrying.")
            raise

    try_connection()

    # The connection object returned by `Engine.connect()` is closed
    # automatically by `Connection.close()` at the end of the `with` block
    with engine.connect() as connection:
        sql = 'select * from ' + tablename
        df = pd.read_sql_query(sql, connection)
        df = pl.from_pandas(df)       # convert the pandas df to Polars
        df.write_parquet(output_loc)  # replace with the required output format

# download the table to a data frame and materialize it as a Parquet file
sql_download('table_name', r'file_path\file_name.parquet')

Final thoughts

A recommendation for anyone new to data engineering or data analysis would be to skip Pandas and start with Polars. My team has been using the Polars library for the past few months and it has been a huge productivity boost for all our projects. It is fast becoming our ETL tool of choice, both for its speed and especially for the simplicity and power of Polars expressions. More in the next article.
