Q-strings for better SQL without dbt — Part 2

SELECT * INTO Pay It Forward
May 2, 2024 · 6 min read


No hidden catch, no strings attached.

Let’s face it, a substantial fraction of programming projects are really just a series of string formatting exercises in disguise. Word on the street is that they’re a major part of the interview process at some organizations that pay really, really well.

SELECT contents FROM table

see what I did there…again?

Part 1. Basic Q-string.

1.1. The query string class
1.2. Timing and results
1.3. Error handling
1.4. Data provenance

Part 2. Templated queries (this article).

2.1. Queries with variables
2.2. Queries from files
2.3. The dbt way (and why you don’t need to worry about it)

Part 3. Misc features for convenience and safety

3.1. Check query for unsafe statements
3.2. Pre-flight row count
3.3. Fun with ULIDs

In this article series, I offer a guided tour through ~50 lines of python code, built around the really simple idea of the Q-string, that you might find useful in any data project mixing python and SQL. In part 1, we implemented the title class: it inherits from the plain python string and is endowed with some SQL powers, which are about to be leveled up. Picking up where we left off, the features from part 1 are implemented in the code below (run in Colab):

import duckdb
import pandas as pd
from time import time

url = "https://raw.githubusercontent.com/liquidcarbon/chembiodata/main/isotopes.csv"
isotopes = pd.read_csv(url)
assert isotopes.shape == (354, 4)
assert duckdb.sql("SELECT * FROM isotopes").shape == isotopes.shape

class Q(str):
    """Query string with timer, results, and exception handling."""

    def timer(func):
        def wrapper(self, *args, **kwargs):
            self.start = time()
            result = func(self, *args, **kwargs)
            self.time = round(time() - self.start, 4)
            return result
        return wrapper

    @timer
    def run(self):
        try:
            return duckdb.sql(self)
        except Exception as e:
            # log.error(e)
            return {"error": [str(e), repr(type(e))]}

    def df(self) -> pd.DataFrame:
        result = self.run()
        if isinstance(result, dict) and "error" in result:
            result_df = pd.DataFrame(result, index=["message", "type"])
        else:
            result_df = result.df()
        self.rows, self.cols = result_df.shape
        msg = f"{self.rows} rows x {self.cols} cols in {self.time} sec"
        print(msg)  # TODO: change to log.info(msg)
        result_df.q = self
        return result_df

df = Q("""
SELECT * FROM isotopes WHERE Symbol LIKE 'A%'
""").df()
print(df.q) # get query that produced the dataframe

# Returns:
# 13 rows x 4 cols in 0.0027 sec
# SELECT * FROM isotopes WHERE Symbol LIKE 'A%'

Q-strings with variables

For a quick refresher on ways to inject variables into a string template, consider the game of Mad Libs, where you generate funny stories by filling in the blanks in a template with specified parts of speech. Goes like this:

Mad Libs, the original GenAI. Running half the internet since 1969 BC.
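In python terms, the fill-in-the-blanks game maps directly onto str.format with named blanks (a toy sketch with made-up blanks):

```python
# A Mad Libs-style template: named blanks, filled in by keyword
template = "The {adjective} {noun} {verb} over the lazy dog."

story = template.format(adjective="quick", noun="fox", verb="jumps")
print(story)  # The quick fox jumps over the lazy dog.
```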

In the simplest form, previously defined variables go into f-strings. I’d call them f-Q-strings but it sounds too much like what you might mutter when woken up in the middle of the night. This “just-in-time” formatting already covers lots of use cases, especially in interactive (notebook) development.

x=42; colname="answer"
Q(f"""
SELECT {x} AS {colname}
""").run()

# Returns:
# ┌────────┐
# │ answer │
# │ int32 │
# ├────────┤
# │ 42 │
# └────────┘

For other tasks, you may need to define a query template first (maybe even in a separate file) and populate it later using the .format method. No problem! Let’s implement Q-strings with arguments. Surely we just need to add an __init__ that accepts extra arguments. Disappointment awaits:

Same as str(“xyz”, x=42) — won’t work.
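A minimal sketch of the disappointment (the class name is made up for illustration): the extra keyword never reaches our __init__, because str.__new__ sees it first and rejects it:

```python
class BrokenQ(str):
    """Naive attempt: pass template variables through __init__."""
    def __init__(self, string, **kwargs):
        super().__init__()

try:
    BrokenQ("SELECT {x}", x=42)
except TypeError as e:
    print(type(e).__name__)  # TypeError
```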

Instead, we turn to the lesser-known __new__, which usually hides in the shadows behind __init__:

class LessBrokenQ(str):
    def __new__(cls, string, **kwargs):
        instance = Q(string.format(**kwargs))
        return instance

assert Q("SELECT 42") == LessBrokenQ("SELECT {x}", x=42)
assert Q == type(LessBrokenQ("SELECT {x}", x=42))

However, we cannot use the same __new__ construct for the Q-string itself. When the instance is created in the third line, a call is made to Q.__new__, so we end up in an infinite recursion loop (try it yourself). Instead, we’ll use the same method from the parent class, which we can call with super().__new__ or str.__new__ as shown:

class Q(str):
    """Query string that reads from templates and runs queries."""

    def __new__(cls, string, file=False, **kwargs):
        if file:
            with open(string, "r") as f:
                file_content = f.read()
            return cls(file_content, **kwargs)
        q_string = str.__new__(cls, string.format(**kwargs))
        q_string.template = string
        q_string.kwargs = kwargs
        return q_string

    def timer(func):
        def wrapper(self, *args, **kwargs):
            self.start = time()
            result = func(self, *args, **kwargs)
            self.time = round(time() - self.start, 4)
            return result
        return wrapper

    @timer
    def run(self):
        try:
            return duckdb.sql(self)
        except Exception as e:
            # log.error(e)
            return {"error": [str(e), repr(type(e))]}

    def df(self) -> pd.DataFrame:
        result = self.run()
        if isinstance(result, dict) and "error" in result:
            result_df = pd.DataFrame(result, index=["message", "type"])
        else:
            result_df = result.df()
        self.rows, self.cols = result_df.shape
        msg = f"{self.rows} rows x {self.cols} cols in {self.time} sec"
        print(msg)  # TODO: change to log.info(msg)
        result_df.q = self
        return result_df
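As for the infinite recursion mentioned above, here is a minimal reproduction (with a made-up class name), in case you’d rather not try it on the real class:

```python
class RecursiveQ(str):
    """What happens if __new__ calls its own class instead of str.__new__."""
    def __new__(cls, string, **kwargs):
        return cls(string.format(**kwargs))  # re-enters __new__ forever

try:
    RecursiveQ("SELECT {x}", x=42)
except RecursionError:
    print("RecursionError, as promised")
```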

Queries from files

We also introduced a new argument file, defaulting to False, in the Q.__new__ constructor. When set, it simply reads text from a file and passes the content into a new instance of itself. Now your queries can reside in separate files containing just the SQL templates: you get SQL syntax highlighting (first world problems) as well as cleaner, more modular python code. Your environment may look something like this:

Running SQL from external files in Jupyter Lab

Pro tip: with smart_open, those external SQL files can live not only on your local drive, but just about anywhere on the internet.

Unit tests

# test template strings
q1 = Q("SELECT {x} AS {colname}", x=42, colname="answer")
assert q1 == "SELECT 42 AS answer"

df1 = pd.DataFrame({"answer": [42]}, dtype="int32")
pd.testing.assert_frame_equal(df1, q1.df())

# test template files
with open("query1.sql", "w") as f:
    f.write("""/* query1.sql */
SELECT
    Symbol,
    Number,
    Mass,
    Abundance
FROM 'https://raw.githubusercontent.com/liquidcarbon/chembiodata/main/isotopes.csv'
WHERE Mass::INT = {x}
""")

assert "error" in Q("query1.sql", file=True, x="abc").run()
assert Q("query1.sql", file=True, x=42).df().loc[0, "Symbol"] == "Ca"

import pytest
with pytest.raises(FileNotFoundError):
    Q("file_does_not_exist.sql", file=True)
with pytest.raises(KeyError):
    Q("query1.sql", file=True)

The last two tests ensure that a Q-string can’t be created from a non-existent file, or from a template without providing all the keywords. We chose not to handle these errors here in the Q class itself. Depending on the context and requirements, you may choose to accommodate these failure modes differently.
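For instance, if you did want templates to tolerate missing keywords, one option (a sketch, not part of the Q class above) is str.format_map with a dict subclass that leaves unknown blanks in place for a later pass:

```python
class Defaults(dict):
    """Leave unfilled placeholders intact instead of raising KeyError."""
    def __missing__(self, key):
        return "{" + key + "}"

template = "SELECT {x} AS {colname}"
partial = template.format_map(Defaults(x=42))
print(partial)  # SELECT 42 AS {colname}
```

The partially filled string is still a valid template, so it can go through another round of formatting once the remaining values are known.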

dbt and Jinja templates

We created a Q-string class that makes it extremely convenient to build, run, and test SQL queries from string or file templates. Much of that functionality is part of more advanced tools, most notably dbt (data build tool), which leverages customized Jinja templates. In my view, our ~40 lines of code take care of a decent chunk of what dbt is about (version control, documentation, modularity). But we wrote it ourselves exactly the way we wanted, without the significant learning curve or the strings attached to an open-source tool backed by a commercial product.

There is much said on the subject of “should data engineers learn dbt”, and my take is: I don’t think you need to worry about dbt and friends in the first versions of your project(s). I’ve built a number of applications with Q-strings and versioned queries and data that are still running after I’m gone. I’m not claiming Q-strings are better than dbt, only that they’re a very versatile block of code that you might find useful in your data products as well, including those that use dbt.

In this and the previous article, we’ve been exploring a simple idea: subclass the python string to create a number of handy tools that make the SQL + Python union even more delightful. In future articles, we’ll further extend the Q class with more bells and whistles for safety and convenience. Stay tuned, and let me know which features you found useful!


SELECT * INTO Pay It Forward

Share knowledge, gain wisdom. Original content on coding (python & friends of modern data supply chain) and personal growth. https://www.linkedin.com/in/alekis