How to quickly compare two datasets using a generic & powerful SQL query
A step-by-step guide to ease datasets comparison via a ready-to-use Structured Query Language template
Introduction
In their daily activities, Data Engineers and Data Analysts are required to improve data ingestion processes. Beyond the usual unit tests, it may be interesting to easily and quickly compare two datasets (e.g. tables, views, queries) for different purposes such as impact analysis or non-regression testing. Identifying discrepancies between two snapshots of a table is also quite useful for ad hoc analysis or debugging.
With this in mind and for SQL practitioners, a pre-built and reusable script makes sense, hence the purpose of this article.
Ready-to-use SQL template explained
Let’s go to the heart of the matter with the expected query:
WITH dataset_1 AS (
-- Add your first dataset here (query, table, view...)
-- Eg: SELECT field_a, field_b FROM table_1
), dataset_2 AS (
-- Add your second dataset here (query, table, view...)
-- Eg: SELECT field_a, field_b FROM table_2
)
SELECT t.*
FROM (
SELECT 'd1' AS flag, dataset_1_only.*
FROM (
SELECT * FROM dataset_1
EXCEPT
SELECT * FROM dataset_2
) AS dataset_1_only
UNION ALL
SELECT 'd2' AS flag, dataset_2_only.*
FROM (
SELECT * FROM dataset_2
EXCEPT
SELECT * FROM dataset_1
) AS dataset_2_only
) AS t
ORDER BY t.key, t.flag -- Replace t.key by the field(s) referring to the dataset granularity
;
Now, let’s explain it:
- First, we use CTEs (Common Table Expressions, the WITH ... AS statements) to define the two datasets to be compared. Thereby, the following query lines can stay unchanged.
- Then, we construct two symmetrical subqueries, aliased dataset_1_only and dataset_2_only, in order to keep the rows that are only present in dataset 1 (compared with dataset 2) and vice versa, thanks to the EXCEPT set operator.
- Finally, a complementary flag field is defined in order to identify the source of the mismatching rows. Both subqueries are then gathered via the UNION ALL set operator.
Note: obviously, having two datasets with comparable fields (same number, order and types of columns) is a prerequisite to using the above query.
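To see the template in action end to end, here is a minimal sketch that runs it with Python's built-in sqlite3 module; table_1/table_2 and field_a/field_b are the placeholder names from the template comments, and the data is invented for illustration:

```python
# Minimal sketch: run the comparison template via Python's built-in sqlite3.
# table_1/table_2 and field_a/field_b are the template's placeholder names;
# the sample data is invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_1 (field_a INTEGER, field_b TEXT);
    CREATE TABLE table_2 (field_a INTEGER, field_b TEXT);
    INSERT INTO table_1 VALUES (1, 'x'), (2, 'y');
    INSERT INTO table_2 VALUES (1, 'x'), (2, 'z');
""")

rows = conn.execute("""
    WITH dataset_1 AS (SELECT field_a, field_b FROM table_1),
         dataset_2 AS (SELECT field_a, field_b FROM table_2)
    SELECT t.*
    FROM (
        SELECT 'd1' AS flag, dataset_1_only.*
        FROM (SELECT * FROM dataset_1 EXCEPT SELECT * FROM dataset_2) AS dataset_1_only
        UNION ALL
        SELECT 'd2' AS flag, dataset_2_only.*
        FROM (SELECT * FROM dataset_2 EXCEPT SELECT * FROM dataset_1) AS dataset_2_only
    ) AS t
    ORDER BY t.field_a, t.flag  -- field_a plays the role of t.key here
""").fetchall()

print(rows)  # only the diverging row (field_a = 2) appears, once per side
```

Identical rows (field_a = 1) are cancelled out by EXCEPT, so only the mismatch surfaces, flagged on each side.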
Illustrated example
Enough with theory, now to practice! Suppose a supermarket wants to retrieve, in a single table, all the products eligible for a promotion according to the week’s promotion conditions.
As an initial statement, we consider the following products in stock:
+------------+------------+----------------+
| product_id | shelf | original_price |
+------------+------------+----------------+
| 10 | fruits | 1 |
| 20 | fruits | 1 |
| 30 | vegetables | 1 |
| 40 | vegetables | 1 |
| 50 | herbs | 1 |
| 60 | ice-creams | 1 |
+------------+------------+----------------+
According to the promotion conditions in week 1:
- 10% discount on fruits, vegetables and herbs shelves
the dataset_1 is:
+------------+------------+----------------+---------------+
| product_id | shelf | original_price | reduced_price |
+------------+------------+----------------+---------------+
| 10 | fruits | 1 | 0.9 |
| 20 | fruits | 1 | 0.9 |
| 30 | vegetables | 1 | 0.9 |
| 40 | vegetables | 1 | 0.9 |
| 50 | herbs | 1 | 0.9 |
+------------+------------+----------------+---------------+
In week 2, the promotion conditions evolve to:
- 10% discount on fruits and ice-creams shelves
- 20% discount on vegetables shelf
- no discount on herbs shelf
The data pipeline evolution generates the dataset_2:
+------------+------------+----------------+---------------+
| product_id | shelf | original_price | reduced_price |
+------------+------------+----------------+---------------+
| 10 | fruits | 1 | 0.9 |
| 20 | fruits | 1 | 0.9 |
| 30 | vegetables | 1 | 0.8 |
| 40 | vegetables | 1 | 0.8 |
| 60 | ice-creams | 1 | 0.9 |
+------------+------------+----------------+---------------+
But… are we really confident in the results? Typical questions are:
- are there any missing or extra rows?
- are the implemented evolutions correct on the impacted fields?
- is there any regression on the non-impacted fields?
Note: even if the above questions seem easy to answer intuitively on such a trivial example with a small dataset, real use cases usually involve plenty of rows and columns coming from complex queries (transformations, joins, aggregates, window functions, …), which is where this query takes on its full meaning.
Let’s try to answer the three questions by interpreting the comparison query results:
+------+------------+------------+----------------+---------------+
| flag | product_id | shelf | original_price | reduced_price |
+------+------------+------------+----------------+---------------+
| d1 | 30 | vegetables | 1 | 0.9 |
| d2 | 30 | vegetables | 1 | 0.8 |
| d1 | 40 | vegetables | 1 | 0.9 |
| d2 | 40 | vegetables | 1 | 0.8 |
| d1 | 50 | herbs | 1 | 0.9 |
| d2 | 60 | ice-creams | 1 | 0.9 |
+------+------------+------------+----------------+---------------+
Observations:
- No row with products from the fruits shelf: fine, as the “10% discount on fruits” remained unchanged from week 1 to week 2
- Product_ID #50 (herbs shelf) is missing from dataset_2 (only in dataset_1): fine, as the discount on herbs was stopped in week 2
- Product_ID #60 (ice-creams shelf) is missing from dataset_1 (only in dataset_2): fine, as the discount on ice-creams was introduced in week 2
- Rows concerning Product_ID #30 and #40 (vegetables shelf) diverge on the reduced_price field: fine, as the discount on vegetables increased from 10% in week 1 to 20% in week 2
Finally, everything looks good:
- are there any missing or extra rows? No!
- are the implemented evolutions correct on the impacted fields? Yes!
- is there any regression on the non-impacted fields? No!
Tips and tricks
In the CTEs, even if you can copy-paste your ingestion SQL queries before/after the evolution, it can be a good idea to store their results in temporary tables first: this simplifies the comparison, improves query performance and benefits from caching if you run the comparison multiple times.
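As a sketch of this tip (using Python's built-in sqlite3 module, with a simplified version of the supermarket example), each pipeline query is materialized once into a temporary table, which the comparison then reads cheaply:

```python
# Sketch of the temporary-table tip with Python's built-in sqlite3 module;
# the products table and the two promotion queries are simplified from the
# supermarket example.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (product_id INTEGER, shelf TEXT, original_price REAL);
    INSERT INTO products VALUES
        (10, 'fruits', 1), (30, 'vegetables', 1), (50, 'herbs', 1);

    -- Run each (potentially expensive) pipeline query once and materialize it,
    -- instead of re-evaluating it inside the comparison CTEs
    CREATE TEMP TABLE dataset_1 AS
        SELECT product_id, shelf, original_price,
               original_price * 0.9 AS reduced_price
        FROM products
        WHERE shelf IN ('fruits', 'vegetables', 'herbs');

    CREATE TEMP TABLE dataset_2 AS
        SELECT product_id, shelf, original_price,
               CASE shelf WHEN 'vegetables' THEN original_price * 0.8
                          ELSE original_price * 0.9 END AS reduced_price
        FROM products
        WHERE shelf IN ('fruits', 'vegetables');
""")

# The comparison now reads the pre-computed temporary tables directly
rows = conn.execute("""
    SELECT t.* FROM (
        SELECT 'd1' AS flag, d1o.*
        FROM (SELECT * FROM dataset_1 EXCEPT SELECT * FROM dataset_2) AS d1o
        UNION ALL
        SELECT 'd2' AS flag, d2o.*
        FROM (SELECT * FROM dataset_2 EXCEPT SELECT * FROM dataset_1) AS d2o
    ) AS t
    ORDER BY t.product_id, t.flag
""").fetchall()

print(rows)  # vegetables diverge, herbs is d1-only, fruits match and vanish
```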
In the real world, divergences between two datasets can be messy. Sorting the results by key (the granularity of the datasets) and then by flag greatly helps in interpreting and comparing equivalent rows coming from both datasets.
To ease the investigation and identify the origin of a gap, the suspicious common fields can be removed from the comparison (i.e. commented out in the CTEs): if there is no result, it means all the compared fields are equal. You can then focus on the remaining suspicious fields only in the next comparison, and proceed step by step.
Note: an equivalent analysis could have been performed using a LEFT JOIN strategy, but it would have been much more difficult to maintain (NULL handling and field-by-field comparisons) and often less efficient, as database engines tend to optimize set operators well.
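To illustrate the maintenance burden, here is a rough sketch of what a join-based diff looks like (run with Python's built-in sqlite3 module; the id/value schema is invented): every field needs its own NULL-safe comparison, and missing rows on each side need their own branch.

```python
# Sketch of the LEFT JOIN alternative mentioned above, using Python's
# built-in sqlite3 module; the id/value schema is invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dataset_1 (id INTEGER, value TEXT);
    CREATE TABLE dataset_2 (id INTEGER, value TEXT);
    INSERT INTO dataset_1 VALUES (1, 'a'), (2, 'b'), (4, 'x');
    INSERT INTO dataset_2 VALUES (1, 'a'), (3, 'c'), (4, 'y');
""")

rows = conn.execute("""
    -- Rows of dataset_1 that are missing or different in dataset_2 ...
    SELECT d1.id AS id, d1.value AS value_d1, d2.value AS value_d2
    FROM dataset_1 AS d1
    LEFT JOIN dataset_2 AS d2 ON d1.id = d2.id
    WHERE d2.id IS NULL               -- no counterpart in dataset_2
       OR d1.value IS NOT d2.value    -- NULL-safe comparison, one per field!
    UNION ALL
    -- ... plus rows of dataset_2 with no counterpart at all in dataset_1
    SELECT d2.id, d1.value, d2.value
    FROM dataset_2 AS d2
    LEFT JOIN dataset_1 AS d1 ON d2.id = d1.id
    WHERE d1.id IS NULL
    ORDER BY id
""").fetchall()

print(rows)  # [(2, 'b', None), (3, None, 'c'), (4, 'x', 'y')]
```

With the EXCEPT-based template, none of this per-field plumbing is needed: adding a column to both CTEs is enough.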
Summary
This templated query therefore makes it easier for developers to quickly validate changes on complex data pipelines. It is a useful complement to more traditional unit tests, and can even be used more generally to compare any two datasets with similar structures. Last but not least, the logic of this query is easy to learn by heart!
Thank you for reading, I hope it was clear and I would be glad to hear your feedback :)