How to quickly compare two datasets using a generic & powerful SQL query
A step-by-step guide to ease datasets comparison via a ready-to-use Structured Query Language template
Introduction
In their daily activities, Data Engineers and Data Analysts are required to improve data ingestion processes. Beyond the usual unit tests, it may be interesting to easily and quickly compare two datasets (e.g. tables, views, queries) for different purposes such as impact analysis or non-regression testing. Identifying discrepancies between two snapshots of a table is also quite useful for ad hoc analysis or debugging.
With this in mind and for SQL practitioners, a pre-built and reusable script makes sense, hence the purpose of this article.
Ready-to-use SQL template explained
Let’s go to the heart of the matter with the expected query:
WITH dataset_1 AS (
-- Add your first dataset here (query, table, view...)
-- Eg: SELECT field_a, field_b FROM table_1
), dataset_2 AS (
-- Add your second dataset here (query, table, view...)
-- Eg: SELECT field_a, field_b FROM table_2
)
SELECT t.*
FROM (
SELECT 'd1' AS flag, dataset_1_only.*
FROM (
SELECT * FROM dataset_1
EXCEPT
SELECT * FROM dataset_2
) AS dataset_1_only
UNION ALL
SELECT 'd2' AS flag, dataset_2_only.*
FROM (
SELECT * FROM dataset_2
EXCEPT
SELECT * FROM dataset_1
) AS dataset_2_only
) AS t
ORDER BY t.key, t.flag -- Replace t.key by the field(s) referring to the dataset granularity
;
Now, let’s explain it:
- First, we use CTEs (Common Table Expressions, the WITH ... AS statements) to define the two datasets to be compared. Thereby, the following query lines can stay unchanged.
- Then, we construct two symmetrical subqueries, aliased dataset_1_only and dataset_2_only, in order to keep the rows that are only present in dataset 1 (compared with dataset 2) and vice versa, thanks to the EXCEPT set operator.
- Finally, a complementary flag field is defined in order to identify the source of the mismatching rows. Both subqueries are then gathered via the UNION ALL set operator.
Note: obviously, having two datasets with comparable fields (same number, order and types of columns) is a prerequisite to using the above query.
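To see the template in action end to end, here is a minimal sketch that runs it with Python's built-in sqlite3 module; table_1/table_2 and field_a/field_b are the placeholder names from the template comments, and the data is invented for illustration:

```python
# Minimal sketch: run the comparison template via Python's built-in sqlite3.
# table_1/table_2 and field_a/field_b are the template's placeholder names;
# the sample data is invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_1 (field_a INTEGER, field_b TEXT);
    CREATE TABLE table_2 (field_a INTEGER, field_b TEXT);
    INSERT INTO table_1 VALUES (1, 'x'), (2, 'y');
    INSERT INTO table_2 VALUES (1, 'x'), (2, 'z');
""")

rows = conn.execute("""
    WITH dataset_1 AS (SELECT field_a, field_b FROM table_1),
         dataset_2 AS (SELECT field_a, field_b FROM table_2)
    SELECT t.*
    FROM (
        SELECT 'd1' AS flag, dataset_1_only.*
        FROM (SELECT * FROM dataset_1 EXCEPT SELECT * FROM dataset_2) AS dataset_1_only
        UNION ALL
        SELECT 'd2' AS flag, dataset_2_only.*
        FROM (SELECT * FROM dataset_2 EXCEPT SELECT * FROM dataset_1) AS dataset_2_only
    ) AS t
    ORDER BY t.field_a, t.flag  -- field_a plays the role of t.key here
""").fetchall()

print(rows)  # only the diverging row (field_a = 2) appears, once per side
```

Identical rows (field_a = 1) are cancelled out by EXCEPT, so only the mismatch surfaces, flagged on each side.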
Illustrated example
Enough with theory, now to practice! Suppose a supermarket wants to retrieve, in a single table, all the products eligible for a promotion according to the week’s promotion conditions.
As an initial statement, we consider the following products in stock:
+------------+------------+----------------+
| product_id | shelf | original_price |
+------------+------------+----------------+
| 10 | fruits | 1 |
| 20 | fruits | 1 |
| 30 | vegetables | 1 |
| 40 | vegetables | 1 |
| 50 | herbs | 1 |
| 60 | ice-creams | 1 |
+------------+------------+----------------+
According to the promotion conditions in week 1:
- 10% discount on fruits, vegetables and herbs shelves
the dataset_1 is:
+------------+------------+----------------+---------------+
| product_id | shelf | original_price | reduced_price |
+------------+------------+----------------+---------------+
| 10 | fruits | 1 | 0.9 |
| 20 | fruits | 1 | 0.9 |
| 30 | vegetables | 1 | 0.9 |
| 40 | vegetables | 1 | 0.9 |
| 50 | herbs | 1 | 0.9 |
+------------+------------+----------------+---------------+
In week 2, the promotion conditions evolve to:
- 10% discount on fruits and ice-creams shelves
- 20% discount on vegetables shelf
- no discount on herbs shelf
The data pipeline evolution generates the dataset_2:
+------------+------------+----------------+---------------+
| product_id | shelf | original_price | reduced_price |
+------------+------------+----------------+---------------+
| 10 | fruits | 1 | 0.9 |
| 20 | fruits | 1 | 0.9 |
| 30 | vegetables | 1 | 0.8 |
| 40 | vegetables | 1 | 0.8 |
| 60 | ice-creams | 1 | 0.9 |
+------------+------------+----------------+---------------+
But… are we really confident in the results? Typical questions are:
- are there any missing or extra rows?
- are the implemented evolutions correct on the impacted fields?
- is there any regression on the non-impacted fields?
Note: even if the above questions seem easy to answer intuitively on such a trivial example with a small dataset, real use cases usually involve plenty of rows and columns coming from complex queries (transformations, joins, aggregates, window functions, …), which is where this query takes on its full meaning.
Let’s try to answer the three questions by interpreting the comparison query results:
+------+------------+------------+----------------+---------------+
| flag | product_id | shelf | original_price | reduced_price |
+------+------------+------------+----------------+---------------+
| d1 | 30 | vegetables | 1 | 0.9 |
| d2 | 30 | vegetables | 1 | 0.8 |
| d1 | 40 | vegetables | 1 | 0.9 |
| d2 | 40 | vegetables | 1 | 0.8 |
| d1 | 50 | herbs | 1 | 0.9 |
| d2 | 60 | ice-creams | 1 | 0.9 |
+------+------------+------------+----------------+---------------+
Observations:
- No row with products from the fruits shelf: fine, as the “10% discount on fruits” remained unchanged from week 1 to week 2
- Product_ID #50 (herbs shelf) is missing from dataset_2 (only in dataset_1): fine, as the discount on herbs was stopped in week 2
- Product_ID #60 (ice-creams shelf) is missing from dataset_1 (only in dataset_2): fine, as the discount on ice-creams was introduced in week 2
- Rows concerning Product_ID #30 and #40 (vegetables shelf) diverge on the reduced_price field: fine, as the discount on vegetables increased from 10% in week 1 to 20% in week 2
Finally, everything looks good:
- are there any missing or extra rows? No!
- are the implemented evolutions correct on the impacted fields? Yes!
- is there any regression on the non-impacted fields? No!
Tips and tricks
In the CTEs, even if you can copy-paste your ingestion SQL queries before/after the evolution, it can be a good idea to store their results in temporary tables first: this simplifies the comparison, improves query performance and benefits from caching if you run the comparison multiple times.
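As a sketch of this tip (using Python's built-in sqlite3 module, with a simplified version of the supermarket example), each pipeline query is materialized once into a temporary table, which the comparison then reads cheaply:

```python
# Sketch of the temporary-table tip with Python's built-in sqlite3 module;
# the products table and the two promotion queries are simplified from the
# supermarket example.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (product_id INTEGER, shelf TEXT, original_price REAL);
    INSERT INTO products VALUES
        (10, 'fruits', 1), (30, 'vegetables', 1), (50, 'herbs', 1);

    -- Run each (potentially expensive) pipeline query once and materialize it,
    -- instead of re-evaluating it inside the comparison CTEs
    CREATE TEMP TABLE dataset_1 AS
        SELECT product_id, shelf, original_price,
               original_price * 0.9 AS reduced_price
        FROM products
        WHERE shelf IN ('fruits', 'vegetables', 'herbs');

    CREATE TEMP TABLE dataset_2 AS
        SELECT product_id, shelf, original_price,
               CASE shelf WHEN 'vegetables' THEN original_price * 0.8
                          ELSE original_price * 0.9 END AS reduced_price
        FROM products
        WHERE shelf IN ('fruits', 'vegetables');
""")

# The comparison now reads the pre-computed temporary tables directly
rows = conn.execute("""
    SELECT t.* FROM (
        SELECT 'd1' AS flag, d1o.*
        FROM (SELECT * FROM dataset_1 EXCEPT SELECT * FROM dataset_2) AS d1o
        UNION ALL
        SELECT 'd2' AS flag, d2o.*
        FROM (SELECT * FROM dataset_2 EXCEPT SELECT * FROM dataset_1) AS d2o
    ) AS t
    ORDER BY t.product_id, t.flag
""").fetchall()

print(rows)  # vegetables diverge, herbs is d1-only, fruits match and vanish
```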
In the real world, divergences between two datasets can be messy. Sorting the results by key (the granularity of the datasets) and then by flag greatly helps in interpreting and comparing equivalent rows coming from both datasets.
To ease the investigation and identify the origin of a gap, the suspicious common fields can be removed from the comparison (i.e. commented out in the CTEs): if there is no result, it means all the compared fields are equal. You can then focus on the remaining suspicious fields only in the next comparison, and proceed step by step.
Note: an equivalent analysis could have been performed using a LEFT JOIN strategy, but it would have been much more difficult to maintain (NULL handling and field-by-field comparisons) and often less efficient, as database engines tend to optimize set operators well.
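To illustrate the maintenance burden, here is a rough sketch of what a join-based diff looks like (run with Python's built-in sqlite3 module; the id/value schema is invented): every field needs its own NULL-safe comparison, and missing rows on each side need their own branch.

```python
# Sketch of the LEFT JOIN alternative mentioned above, using Python's
# built-in sqlite3 module; the id/value schema is invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dataset_1 (id INTEGER, value TEXT);
    CREATE TABLE dataset_2 (id INTEGER, value TEXT);
    INSERT INTO dataset_1 VALUES (1, 'a'), (2, 'b'), (4, 'x');
    INSERT INTO dataset_2 VALUES (1, 'a'), (3, 'c'), (4, 'y');
""")

rows = conn.execute("""
    -- Rows of dataset_1 that are missing or different in dataset_2 ...
    SELECT d1.id AS id, d1.value AS value_d1, d2.value AS value_d2
    FROM dataset_1 AS d1
    LEFT JOIN dataset_2 AS d2 ON d1.id = d2.id
    WHERE d2.id IS NULL               -- no counterpart in dataset_2
       OR d1.value IS NOT d2.value    -- NULL-safe comparison, one per field!
    UNION ALL
    -- ... plus rows of dataset_2 with no counterpart at all in dataset_1
    SELECT d2.id, d1.value, d2.value
    FROM dataset_2 AS d2
    LEFT JOIN dataset_1 AS d1 ON d2.id = d1.id
    WHERE d1.id IS NULL
    ORDER BY id
""").fetchall()

print(rows)  # [(2, 'b', None), (3, None, 'c'), (4, 'x', 'y')]
```

With the EXCEPT-based template, none of this per-field plumbing is needed: adding a column to both CTEs is enough.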
Summary
This templated query therefore makes it easier for developers to quickly validate changes on complex data pipelines. It is a useful complement to more traditional unit tests, and can even be used more generally to compare any two datasets with similar structures. Last but not least, the logic of this query is easy to learn by heart!
Thank you for reading, I hope it was clear and I would be glad to hear your feedback :)