Q-strings for better SQL through Python — Part 1

SELECT * INTO Pay It Forward
7 min read · Apr 30, 2024


Meet the Q-string. Yes, Q is for “query”.

Like many people drawn to python from non-software backgrounds, there was a time when I did not care much for SQL. It smelled of crummy keyboards, fax machines, and CRT monitors. We’re in the 21st century! Python is so much cooler! I’d write the simplest SELECT * commands to get as much data as I could into pandas, so I could get that antigravity feeling as soon as possible. I’ve changed, but over the years I’ve discovered that many people in our field still operate this way. I won’t expound on why this is not optimal, though you’re certainly welcome to do so in the comments!

df = pd.read_sql(select_star_from_sometable, connection)

If you’ve typed many variations and wrappers of this line, you’ve likely faced frustration. You’ve come to the right place. Whether you need yet another encouragement to pick up more SQL or you conjure up nested CTEs with window functions before breakfast, this article series seeks to be The Guide for mixing SQL and python in a functionally and stylistically sensible way, for better code and improved developer experience.

Across dozens of projects, I found myself wishing for the same small set of tools, which we’ll implement in a single python class over the course of the next few articles. Mix and match the code artifacts according to what features you find useful!

SELECT contents FROM table

see what I did there?

Part 1. Basic Q-string (this article).

1.1. The query string class.
1.2. Timing and results.
1.3. Error handling.
1.4. Data provenance.

Part 2. Templated queries.

2.1. Queries with variables.
2.2. Queries from files.
2.3. The dbt way (and why you don’t need to worry about it).

Part 3. Misc features for convenience and safety

3.1. Check query for unsafe statements
3.2. Pre-flight row count
3.3. Fun with ULIDs

Meet the Q-string.

Preparations

We’ll load some data (a table of atomic weights and isotopic compositions of chemical elements) into a Pandas dataframe and use DuckDB to query that dataframe directly. See? You’ve already learned something, and are well on your way to elevating your SQL game!

import duckdb
import pandas as pd
pd.options.display.max_colwidth=128

url = "https://raw.githubusercontent.com/liquidcarbon/chembiodata/main/isotopes.csv"
isotopes = pd.read_csv(url)
assert isotopes.shape == (354, 4)
assert duckdb.sql("SELECT * FROM isotopes").shape == isotopes.shape

Feature scope.

  1. What’s a good, quick way to write and run readable queries?
  2. If the query worked, tell me the result shape, and how long it took.
  3. If the query did not work, tell me exactly what happened, without scrolling through kilometers of traceback to see that there was a syntax error.
  4. If a dataframe came from a SQL query, what was that query?

None of this is at all hard — but it will be awkward to implement until we realize that the star of the show here is the string itself. All that the various libraries do is pass a string to a SQL engine. But there’s quite a lot more we can do with a string! Recall that a python string is an object, objects have methods (functions like .replace), and methods can have arguments… To endow our humble string with superpowers, we can subclass str and create any number of methods on that class. In the simplest form, the class definition for a Q-string is three lines of code:

class Q(str):
    def run(self):
        return duckdb.sql(self)

For readability, it’s best to write Q-strings as triple-quoted string literals. This way, the SQL sits on separate lines: multi-line queries can be properly indented for readability and easily copy-pasted to share with a friend. The instance q of class Q is still a string. You can print it, split it, replace characters, do regex tricks, and… it also runs SQL! Item 1 done. Antigravity!

q = Q("""
SELECT 42
""")
q.run()

# Returns:
# ┌───────┐
# │  42   │
# │ int32 │
# ├───────┤
# │    42 │
# └───────┘

Note: DuckDB’s python client can return results as a plain-text table or as Pandas, Polars, Arrow, or Numpy objects. The default, as shown here, is the text table.
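Because Q subclasses str, everything strings can do still works on a Q-string. Here is a minimal, dependency-free sketch (a stub Q without .run(), just to show the string behavior):

```python
class Q(str):
    """Stub Q-string; .run() omitted to keep this sketch dependency-free."""

q = Q("""
SELECT 42
""")

assert isinstance(q, str)   # a Q is a str
print(q.strip().lower())    # select 42
print(q.count("4"))         # 1

# caveat: str methods return a plain str, so rewrap if you need a Q back
q2 = Q(q.replace("42", "43"))
assert isinstance(q2, Q)
```

Note the caveat in the last lines: str methods like .replace return a plain str, which matters once Q grows methods of its own.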

To return a Pandas dataframe, let’s add one more method .df() that would call the previously defined .run() and return a dataframe. Do we need both? Yes, because .run() will be used for SQL statements such as CREATE and DROP. These commands do not return anything. The dataframe will come from a dedicated method.

import duckdb
import pandas as pd

class Q(str):
    """Query string for running DuckDB SQL."""
    def run(self):
        return duckdb.sql(self)

    def df(self) -> pd.DataFrame:
        return self.run().df()

Q("""
CREATE OR REPLACE TABLE halogens AS (
    SELECT * FROM isotopes
    WHERE Number IN (9, 17, 35, 53, 85)
)
""").run()

Q("""
SELECT * FROM halogens
""").df()

# Returns:
# ┌─────────┬────────┬────────────┬───────────┐
# │ Symbol  │ Number │    Mass    │ Abundance │
# │ varchar │ int64  │   double   │  double   │
# ├─────────┼────────┼────────────┼───────────┤
# │ F       │      9 │  18.998403 │       1.0 │
# │ Cl      │     17 │  34.968853 │    0.7576 │
# │ Cl      │     17 │  36.965903 │    0.2424 │
# │ Br      │     35 │  78.918338 │    0.5069 │
# │ Br      │     35 │   80.91629 │    0.4931 │
# │ I       │     53 │ 126.904472 │       1.0 │
# │ At      │     85 │ 209.987148 │       0.0 │
# │ At      │     85 │ 210.987497 │       0.0 │
# └─────────┴────────┴────────────┴───────────┘

Let’s handle item 2: result shape (number of rows and columns) and execution time. Of the many possible ways to do this, let’s get fancy and write a decorator. A python decorator modifies or extends the behavior of a function without changing its implementation, which is exactly what we’re after here, and it cleanly separates the two code blocks. Decorators can be reused on other functions, and you don’t need to tweak the whole code base should you decide later to touch up the decorations.
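If decorators feel foggy, here is a standalone refresher before we apply the idea to Q (generic names, not part of the Q class): a decorator takes a function and returns a wrapped version of it.

```python
from time import time

def timer(func):
    """Return a wrapped func that records elapsed seconds on the wrapper."""
    def wrapper(*args, **kwargs):
        start = time()
        result = func(*args, **kwargs)
        wrapper.elapsed = round(time() - start, 4)
        return result
    return wrapper

@timer
def add(a, b):
    return a + b

print(add(2, 3))     # 5
print(add.elapsed)   # close to 0.0 for such a trivial call
```

In the Q class below, the same trick stores the timing on the Q-string instance itself instead of on the wrapper.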

import duckdb
import pandas as pd
from time import time

class Q(str):
    """Query string with timer and results."""
    def timer(func):
        def wrapper(self, *args, **kwargs):
            self.start = time()
            result = func(self, *args, **kwargs)
            self.time = round(time() - self.start, 4)
            return result
        return wrapper

    @timer
    def run(self):
        return duckdb.sql(self)

    def df(self) -> pd.DataFrame:
        result = self.run()
        result_df = result.df()
        self.rows, self.cols = result.shape
        msg = f"{self.rows} rows x {self.cols} cols in {self.time} sec"
        print(msg)  # TODO: change to log.info(msg)
        return result_df

q = Q("""
SELECT Symbol, COUNT(*) AS n_isotopes
FROM isotopes
GROUP BY Symbol
ORDER BY 2 DESC
LIMIT 4
""")
q.df()

# Prints the stats line, then returns a dataframe:
# 4 rows x 2 cols in 0.0016 sec
#   Symbol  n_isotopes
# 0     Sn          10
# 1     Xe           9
# 2     Cd           8
# 3     Te           8

Note: I’m using print to show query stats for simplicity. In real code, please use logs. Logs are your friends. I usually use loguru.
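If you’d rather not add the loguru dependency, the stdlib logging module gives the same effect; a sketch of the swap (the logger name "qstring" is arbitrary, my choice for this example):

```python
import logging

# one-time setup, typically at application entry point
logging.basicConfig(format="%(asctime)s | %(levelname)s | %(message)s")
log = logging.getLogger("qstring")
log.setLevel(logging.INFO)

# inside Q.df(), print(msg) would become:
rows, cols, elapsed = 4, 2, 0.0016
log.info(f"{rows} rows x {cols} cols in {elapsed} sec")
```

Unlike print, a logger gives you timestamps, levels, and a single switch to silence or redirect the query stats later.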

What if we make a SQL syntax error?

Before: big red traceback blanket

This traceback isn’t too bad, but in real applications, it goes on and on through the chain of functions falling like dominoes due to bad SQL. All you need is the last line. So let’s figure out exception handling, item 3 on our feature list.

At this stage it’s quite hard to arrive at a universally useful solution without considering the bigger picture. For exploration in a notebook, a simple try/except will do. But if you’re developing an application, a data pipeline, really anything meant to be used by more than you alone, you’ll have to make some compromises. At a minimum, it’s a good idea to keep the promise of returning a dataframe, so that we don’t worry about this part downstream.

def df(self) -> pd.DataFrame:
    ...

In practice, we’ll put a try/except in .run(), where the except block returns a dictionary that describes the error. A dict is easily turned into a pandas dataframe or any other tabular format we’d like.
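To see why a dict is convenient here, this is roughly what the error payload becomes (the message below is a hypothetical parser error, not a real capture):

```python
import pandas as pd

# hypothetical payload produced by the except branch
error = {
    "error": [
        'Parser Error: syntax error at or near "SELCT"',  # str(e)
        "<class 'duckdb.ParserException'>",               # repr(type(e))
    ]
}
error_df = pd.DataFrame(error, index=["message", "type"])
print(error_df)
```

The result is a tiny 2x1 dataframe, so downstream code that expects a dataframe keeps working even when the query fails.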

import duckdb
import pandas as pd
from time import time

class Q(str):
    """Query string with timer, results, and exception handling."""
    def timer(func):
        def wrapper(self, *args, **kwargs):
            self.start = time()
            result = func(self, *args, **kwargs)
            self.time = round(time() - self.start, 4)
            return result
        return wrapper

    @timer
    def run(self):
        try:
            return duckdb.sql(self)
        except Exception as e:
            # log.error(e)
            return {"error": [str(e), repr(type(e))]}

    def df(self) -> pd.DataFrame:
        result = self.run()
        if isinstance(result, dict) and "error" in result:
            result_df = pd.DataFrame(result, index=["message", "type"])
        else:
            result_df = result.df()
        self.rows, self.cols = result_df.shape
        msg = f"{self.rows} rows x {self.cols} cols in {self.time} sec"
        print(msg)  # TODO: change to log.info(msg)
        result_df.q = self
        return result_df

After: neat, informative description of the problem

Although the “50 lines of code” declared in the headline is a silly constraint, it’s nice to see we only needed about 30 to create a pretty powerful tool. The last feature, already in the final code above, took just one line (immediately before return result_df):

result_df.q = self

This creates a new attribute on the resulting dataframe that stores the entire Q-string instance, unlocking a bunch of tools related to data provenance and audits:

df = Q("""
SELECT * FROM isotopes WHERE Name LIKE 'A%'
""").df()
# 13 rows x 4 cols in 0.0027 sec

print(df.q) # print query associated with a dataframe
df.q.df() # run query again (did data change?)
# still 13 rows x 4 cols

# not implemented here - run same query against another db:
# df.q.df(engine=another_engine)
# pd.read_sql(df.q, another_connection)
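One caveat worth knowing: pandas emits a UserWarning when you set an arbitrary attribute like .q on a dataframe, and such attributes are lost by most operations (merges, concatenation, pickling). DataFrame.attrs is the dict pandas provides for exactly this kind of metadata; a sketch of the alternative (toy data, not the isotopes table):

```python
import pandas as pd

df = pd.DataFrame({"Symbol": ["F", "Cl"], "Number": [9, 17]})
# stash the originating query in the metadata dict pandas provides
df.attrs["q"] = "SELECT * FROM isotopes WHERE Number IN (9, 17)"

print(df.attrs["q"])
# attrs survives a copy, unlike an ad-hoc attribute
assert df.copy().attrs["q"] == df.attrs["q"]
```

The direct-attribute approach in the article is fine for notebooks and keeps the df.q.df() ergonomics; reach for attrs when provenance must survive more of the pandas machinery.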

In this post, we saw how placing the string at the center of attention yields a number of handy tools that make the SQL + python union even more delightful. Even though I’ve written a dozen variations of the Q-string for different projects, I still discovered new solutions while writing this post. In future articles, we’ll further extend the Q class to allow programmatic construction of queries, plus other features for safety and convenience. Stay tuned, and let me know which features you found useful!

Follow SELECT * INTO Pay It Forward for original content on data supply chain and personal growth, connect with me on LinkedIn, or join SelectStars group on Telegram.
