Unit testing SQL queries with DuckDB

Coste Virgile · Clarity AI Tech · Jan 18, 2023

If you work in data wonderland, chances are that SQL is one of your main programming languages: combined with a powerful engine (BigQuery, Snowflake, Redshift…), SQL queries let you quickly transform datasets the way you want.

However, SQL queries do not seem to be unit-tested that much… And pushing code (SQL queries) to production without proper unit testing seems dangerous to me: when you or your teammates have to modify existing queries, it quickly becomes a tedious and error-prone process. This situation can lead team members to create new, often duplicated tables instead of modifying existing ones, just to avoid the risk of corrupting or breaking pipelines that run fine in production…

There must be a better way! And this is what I would like to discuss in this article. Before jumping into how to use DuckDB for unit testing SQL queries, let me briefly introduce the options I came across. We'll then see how DuckDB, combined with pandas, can be a good complementary solution.

Zero data cloning

The best solution that I've seen involves zero data cloning. I used it at some of my previous jobs, and a friend and former teammate wrote a great article about it.

The idea is to clone your production data warehouse into much lighter testing warehouses (with a limited lifespan) and run your queries against those testing warehouses.

It works well and will definitely help you create a safer working environment. It really protects you when you modify a table's structure: adding/renaming/deleting columns and/or changing a column's data type. With the testing warehouse, you can check whether all the downstream queries of a modified query (table) still run correctly.

However, because you don't use data here, you are not really testing the output of your query given a set of inputs. Of course, you could add data to your testing warehouses. But this can quickly become a hassle. How/where do you store your testing inputs? How/where do you store your expected outputs given your set of inputs? How do you automate this in your CI/CD pipeline so you can quickly and easily change your input/output data?

Production data validation

To do data validation you can use open-source tools (such as Great Expectations or Soda Core). They are awesome and will clearly do a great job at checking your output datasets.

However, in a typical setup, you would use them in your production data pipeline. So you will be checking that your production outputs have some expected properties, and break if that's not the case. It's great, but it doesn't really serve the same purpose as unit tests.
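To make the distinction concrete, a validation check asserts properties of whatever the production pipeline just produced, regardless of the inputs. Here is a minimal, plain-pandas stand-in for that kind of check (the function and column names are made up for illustration; in practice you would express these expectations with Great Expectations or Soda Core):

import pandas as pd


def validate_order_output(df: pd.DataFrame) -> None:
    # Properties we expect of the production output, whatever data went in.
    assert df["order_id"].is_unique, "order_id must be unique"
    assert df["email"].notna().all(), "every row must have an email"
    assert (df["order_value"] > 0).all(), "order values must be positive"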

Now that I've presented some of the options I know about and what seems to be missing, let's see how DuckDB fits into the picture and how it can help you test queries against a set of inputs.

DuckDB for data unit testing

Recently, I found this blog post about using DuckDB to create a "poor man's data warehouse". What struck me was how the author designed his code to run either against his data warehouse or against pandas DataFrames for unit testing.

To make more sense of what the author is doing, let’s take a simple example:

import duckdb
import pandas as pd


def order_table_fixture() -> pd.DataFrame:
    return pd.DataFrame(
        [
            {"order_id": 1, "order_value": 9.9, "user_id": "a"},
            {"order_id": 2, "order_value": 8.9, "user_id": "b"},
            {"order_id": 3, "order_value": 15.5, "user_id": "c"},
        ]
    )


def user_table_fixture() -> pd.DataFrame:
    return pd.DataFrame(
        [
            {"id": "a", "email": "user_a@domain.fr"},
            {"id": "b", "email": "user_b@domain.de"},
        ]
    )

# Retrieve email addresses of users having placed an order of more than 9 euros
QUERY = """
SELECT
    order_table.order_id
    , order_value
    , email
FROM dfA AS order_table          -- <= querying dfA!
LEFT JOIN dfB AS user_table      -- <= querying dfB!
    ON order_table.user_id = user_table.id
WHERE order_table.order_value > 9
    AND user_table.email IS NOT NULL
"""


if __name__ == "__main__":
    dfA = order_table_fixture()
    dfB = user_table_fixture()
    result = duckdb.query(QUERY)

Take a look at the query and notice how we referenced the tables (dfA and dfB). DuckDB automatically resolves these table names against the Python variables in scope: dfA and dfB are pandas DataFrames containing our example data.
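To see what this gives us, we can materialize the relation as a pandas DataFrame (a small addition to the script above): only order 1 survives the filter, since order 2 is below 9 euros and order 3 belongs to a user with no known email address.

# Continuing the script above: turn the DuckDB relation into a pandas DataFrame.
df = result.to_df()
print(df)
# Expected: a single row -> order_id=1, order_value=9.9, email=user_a@domain.fr

That's great! But how do we take it to the next level?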

A viable testing scenario

Building on the previous example, we could design the following scenario: we write our production queries to run against our data warehouse, but we can take those exact queries and run them against pandas DataFrames using DuckDB and some string replacement.
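As a rough sketch of that dual-execution idea (the run_query helper below and its parameters are made up for illustration, not part of any library), the same query text could be routed either to your warehouse client or to DuckDB with DataFrames registered under the expected table names:

from typing import Dict, Optional

import duckdb
import pandas as pd


def run_query(query: str, test_frames: Optional[Dict[str, pd.DataFrame]] = None) -> pd.DataFrame:
    """Run `query` on DuckDB when test DataFrames are provided, else on the warehouse."""
    if test_frames is not None:
        con = duckdb.connect()
        for name, frame in test_frames.items():
            con.register(name, frame)  # expose each DataFrame as a named table
        return con.execute(query).df()
    # In production, hand the same query string to your BigQuery/Snowflake/Redshift client.
    raise NotImplementedError("plug in your warehouse client here")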

[Diagram: Production vs unit tests pipeline]

Here is how we could implement this in a quick and dirty way:

import duckdb
import pandas as pd
import pytest
from pandas.testing import assert_frame_equal
from sql_metadata import Parser


class TestOrderUser:
    def setup_method(self):
        self.replacements = {
            "my_database.my_schema.my_order_table": "order_table",
            "my_database.my_schema.my_user_table": "user_table",
        }
        with open("production_queries/order_user.sql") as f:
            self.query = f.read()

    def test_query(self, order_table: pd.DataFrame, user_table: pd.DataFrame):
        query = self.query
        tables = Parser(self.query).tables
        for table in tables:
            if table in self.replacements:
                query = query.replace(table, self.replacements[table])

        result = duckdb.query(query).to_df()

        assert_frame_equal(
            result,
            pd.DataFrame(
                [{"order_id": 1, "order_value": 9.9, "email": "user_a@domain.fr"}]
            ),
        )

    @pytest.fixture
    def order_table(self) -> pd.DataFrame:
        return pd.DataFrame(
            [
                {"order_id": 1, "order_value": 9.9, "user_id": "a"},
                {"order_id": 2, "order_value": 8.9, "user_id": "b"},
                {"order_id": 3, "order_value": 15.5, "user_id": "c"},
            ]
        )

    @pytest.fixture
    def user_table(self) -> pd.DataFrame:
        return pd.DataFrame(
            [
                {"id": "a", "email": "user_a@domain.fr"},
                {"id": "b", "email": "user_b@domain.de"},
            ]
        )
-- production_queries/order_user.sql

SELECT
    order_table.order_id
    , order_value
    , email
FROM my_database.my_schema.my_order_table AS order_table
LEFT JOIN my_database.my_schema.my_user_table AS user_table
    ON order_table.user_id = user_table.id
WHERE order_table.order_value > 9
    AND user_table.email IS NOT NULL

Interesting parts:

  • we use sql_metadata to automatically extract the tables used in a given query: query parsing is a solved problem, and it's now very easy to extract all kinds of information from a SQL query (tables used, columns used, …). A short example follows this list.
  • we create a replacements dictionary to map the extracted table names to what we want to replace them with, and we do the replacement.
  • we use pytest fixtures to return pandas DataFrames with the interesting values we want to test our query against.
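For example, extracting the tables from our production query is a one-liner with sql_metadata (the expected output is shown as a comment; exact ordering may vary):

from sql_metadata import Parser

query = """
SELECT order_table.order_id, order_value, email
FROM my_database.my_schema.my_order_table AS order_table
LEFT JOIN my_database.my_schema.my_user_table AS user_table
    ON order_table.user_id = user_table.id
"""

# .tables returns the referenced table names (without aliases), e.g.:
# ['my_database.my_schema.my_order_table', 'my_database.my_schema.my_user_table']
print(Parser(query).tables)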

Pros and cons

✅ You could now unit test your SQL queries along with testing how your queries behave with each other (Zero data cloning option) and monitoring the quality of your production outputs (Data validation option)!

✅ You can design fixture DataFrames with your own convoluted data (for the edge cases you want to handle). But you could also leverage data-generating tools (Faker, for example; a small Faker-based sketch follows at the end of this section) or even use property-based testing (with hypothesis, for example; check out this article to learn more about it).

❌ You use a different query engine for production and for unit tests. I’m not sure how big of an issue this is. If anyone has some examples that would work on Snowflake/BigQuery and not on DuckDB I would really appreciate you letting me know in the comment section.

ℹ️ You do table name replacements. I feel I need to talk a bit about this, but because I don't really think it's a con, I list it as information. Here is why I don't think it's a con:

  • you do the replacement only at the unit-testing stage. The SQL queries that you write are production queries with production table names. You can copy/paste those queries and run them as-is on your production data warehouse.
  • if you use dbt (a fantastic tool), you kind of do the same thing: you use Jinja templating to specify table name placeholders in your query. In that case, though, you are no longer able to copy-paste your query directly into your production data warehouse (see this article about some limitations of dbt, which is still an awesome tool).
  • as mentioned, extracting the table names used in a query (for lineage or dynamic replacement) is a solved problem, with very good libraries doing the job for you.
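As mentioned in the pros above, fixture data does not have to be handwritten. Here is a minimal sketch with Faker (the fixture shape mirrors the user table from the example; seeding keeps the test reproducible):

import pandas as pd
from faker import Faker

Faker.seed(0)  # deterministic data so the test stays reproducible
fake = Faker()


def user_table_fixture(n: int = 5) -> pd.DataFrame:
    # Generate n synthetic users; column names match the real user table.
    return pd.DataFrame(
        [{"id": fake.uuid4(), "email": fake.email()} for _ in range(n)]
    )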

Next steps

Let me know if you also think that this feature of DuckDB is useful! I think seeing how to use it within a dbt project could be a nice follow-up article.

Thanks for reading and happy to take your feedback!
