IMPLEMENT INCREMENTAL DATA LOAD ETL WITH SQL (FOR LARGE DATASETS AND REAL WORLD SCENARIOS)

Kingsley Okoye
7 min read · May 25, 2024


In today's data-driven landscape, organizations depend on data more than ever to make informed decisions. With transactional data constantly flowing in, it is of utmost importance for data practitioners to understand how to integrate new data with existing data and ensure that stored data is accurate and reflects the true numbers the business is generating.

This is why understanding data integration is important. SQL, one of the most common languages used to interact with databases, can make this process seamless, and in this article I'll explain how you can get it done using a simple and intuitive SQL statement.

In my last article (https://medium.com/@exceldispensing/this-simple-sql-function-will-optimize-your-etl-process-b1744bad87c6), I talked about how the MERGE statement in SQL can simplify your data integration process and ensure that the data getting into your data warehouse is accurate and up to date. In that article, however, I presented a very simple example where we assumed each table would have a primary key: a column where every single row has a value completely different from every other row in the table. While this is fairly common when dealing with real-world data, it is not always the case.

So in this article, I'll explain how you can use the MERGE statement even when there is no primary key column in your table.

Consider the table below:

FACT TABLE

The table has no UNIQUE ID column and no column that automatically increases by 1 when new data is inserted.
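Because the screenshot may not render here, here is a hypothetical sketch of that fact table's shape, built from the column names that appear in the deduplication query later in this article (the data types are assumptions):

CREATE TABLE [dbo].[M_ARTICLE_SALES_BY_STORE_TABLE] (
    [transaction_id]   INT,   -- repeats across line items, not unique
    [transaction_date] DATE,
    [transaction_time] TIME,
    [store_id]         INT,
    [staff_id]         INT,
    [customer_id]      INT,
    [instore_yn]       VARCHAR(1),
    [orderING]         INT,
    [line_item_id]     INT,
    [product_id]       INT,
    [quantity_sold]    INT    -- the measure that gets updated later
    -- note: no PRIMARY KEY and no IDENTITY column
);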

So what do we do in this situation?

Let's go back to the concept of JOINS in SQL.

When you join two tables based on a COMMON FIELD OR COLUMN, one of two relationships is expected:

  1. One table has totally unique values in that column, while the other table can have several occurrences of each of those values.

For example, a customer table will have one row for each customer, but each customer might have placed one order or multiple orders, as seen below.

You can easily join these two tables using just the CUSTOMER ID column because each CUSTOMER ID in the CUSTOMER TABLE is unique. This is a ONE TO MANY relationship.

2. The other scenario is that each unique value occurs only once in both tables, as seen below. This is a ONE TO ONE relationship.

ONE TO ONE RELATIONSHIP

Let's reconsider the CUSTOMERS and ORDERS relationship and imagine a scenario where the CUSTOMER TABLE has several occurrences of the same CUSTOMER ID, as seen below.

Two different customers (YEMI ALADE and BOMA DAVID) have the exact same CUSTOMER ID, which is 2.
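Since the screenshots may not render here, here is a hypothetical version of that data. The table shapes and the amounts are made up for illustration, but they reproduce the duplicate-ID situation and the 2,270 total referenced below:

CREATE TABLE CUSTOMERS (CUSTOMER_ID INT, CUSTOMER_NAME VARCHAR(50), CUSTOMER_REGION VARCHAR(20));
CREATE TABLE ORDERS (ORDER_ID INT, CUSTOMER_ID INT, CUSTOMER_NAME VARCHAR(50), ORDER_AMOUNT DECIMAL(10,2));

INSERT INTO CUSTOMERS VALUES
(2, 'YEMI ALADE', 'AFRICA'),
(2, 'BOMA DAVID', 'AFRICA');  -- two different customers, same CUSTOMER_ID

INSERT INTO ORDERS VALUES
(101, 2, 'YEMI ALADE', 1000),
(102, 2, 'YEMI ALADE', 500),
(103, 2, 'BOMA DAVID', 770);  -- true total spend: 2,270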

How would you join these two tables in this situation?

Now imagine you need to calculate TOTAL CUSTOMER SPEND BY CUSTOMER REGION.

Let's calculate TOTAL CUSTOMER SPEND BY REGION by joining both tables on the CUSTOMER ID column only.

SELECT C.CUSTOMER_REGION, SUM(O.ORDER_AMOUNT) AS TOTAL_SPEND_BY_REGION
FROM ORDERS O
LEFT JOIN CUSTOMERS C
ON O.CUSTOMER_ID = C.CUSTOMER_ID
GROUP BY C.CUSTOMER_REGION

Using the script above to calculate total spend by region gives the wrong answer. RESULT BELOW:

The answer above is totally wrong because we know the total spend by our only two AFRICAN customers was 2,270, which you can confirm by summing the order amounts directly, as in the check below.
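Here is that check as a script, using the hypothetical sample data sketched earlier (no join involved, so the duplicate IDs cannot distort it):

SELECT SUM(ORDER_AMOUNT) AS TRUE_TOTAL_SPEND
FROM ORDERS;
-- Returns 2,270 with the sample data above.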

The reason we got a wrong answer is simply that we used the wrong join logic: joining on CUSTOMER ID alone matches every order carrying CUSTOMER ID 2 against both customer rows with that ID, so those order amounts are counted twice. This is why understanding the relationship between tables before joining them is extremely important in SQL.

In a nutshell, the ideal join for these tables is a join on multiple fields/columns, as seen below.

CUSTOMER ID is duplicated, so we join on both CUSTOMER ID and CUSTOMER NAME.

SCRIPT BELOW:


SELECT C.CUSTOMER_REGION, SUM(O.ORDER_AMOUNT) AS TOTAL_SPEND_BY_REGION
FROM ORDERS O
LEFT JOIN CUSTOMERS C
ON O.CUSTOMER_ID = C.CUSTOMER_ID       -- join on CUSTOMER_ID
AND O.CUSTOMER_NAME = C.CUSTOMER_NAME  -- and on CUSTOMER_NAME
GROUP BY C.CUSTOMER_REGION

RESULT:

Now, after correcting our query, we get the right answer.

By the way, if you've found this insightful so far, kindly connect with me on LinkedIn using the link below.

https://www.linkedin.com/in/ifeanyi-okoye-a857911ba/


You will encounter scenarios like these when working with real data, so always take your time to check how your tables relate before joining them.

Going forward, this is the exact logic we will use in our scenario where the FACT TABLE has no unique column.

We will join the SOURCE TABLE to the DESTINATION TABLE on a specific set of columns using the MERGE statement, then update records that have changed and insert new records.

SIMPLE.

Imagine this is your source data and it has 907,841 rows. You haven't pulled data from the source table into the destination table in 4 months, and you need to create the quarterly report for stakeholders.

Over the course of the last 4 months, lots of new records have been inserted into the source table and some records have also been updated. You need to reflect all of this in your upcoming report. Here is how to go about integrating the data from your SOURCE TABLE into your DESTINATION TABLE using the MERGE statement.

The DESTINATION TABLE only has data for 2017 and 2018, while the SOURCE TABLE has records for 2017, 2018 and 2019.
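You can confirm a gap like this with a quick aggregate on each side. A minimal sketch, assuming the source and destination table names used in the rest of this article:

-- Which years exist in the source table?
SELECT YEAR([transaction_date]) AS SALES_YEAR, COUNT(*) AS NUM_ROWS
FROM [dbo].[M_ARTICLE_SALES_BY_STORE_TABLE]
GROUP BY YEAR([transaction_date])
ORDER BY SALES_YEAR;

-- And in the destination table?
SELECT YEAR([transaction_date]) AS SALES_YEAR, COUNT(*) AS NUM_ROWS
FROM [dbo].[DW_SALES_BY_STORE_TAB]
GROUP BY YEAR([transaction_date])
ORDER BY SALES_YEAR;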

These records below have changed in the SOURCE TABLE and we must reflect these changes in the DESTINATION TABLE as well.

QUANTITY SOLD column in SOURCE TABLE (updated in SOURCE TABLE)

QUANTITY SOLD column in DESTINATION TABLE (not yet updated in DESTINATION TABLE)

This change has to be reflected in the DESTINATION TABLE.

STEP 1

ENSURE THAT THE SOURCE TABLE HAS NO DUPLICATES


WITH NUM_RANKING AS
(
    SELECT *,
        -- ROW_NUMBER (not RANK) is needed here: with PARTITION BY on all
        -- columns, RANK() would give every row in a partition rank 1,
        -- so duplicates would never be flagged.
        ROW_NUMBER() OVER (
            PARTITION BY
                [transaction_id],
                [transaction_date],
                [transaction_time],
                [store_id],
                [staff_id],
                [customer_id],
                [instore_yn],
                [orderING],
                [line_item_id],
                [product_id]
            ORDER BY
                [transaction_id]) AS RANKING
    FROM [dbo].[M_ARTICLE_SALES_BY_STORE_TABLE]
)
SELECT *
FROM NUM_RANKING
WHERE RANKING > 1

We get an empty table, as seen below, meaning there are no duplicates.
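If the check had returned rows instead, one way to clear them (a sketch, not part of the original walkthrough) is to delete through the same CTE, which keeps the first copy of each group and removes the rest:

WITH NUM_RANKING AS
(
    SELECT *,
        ROW_NUMBER() OVER (
            PARTITION BY [transaction_id], [transaction_date], [transaction_time],
                         [store_id], [staff_id], [customer_id], [instore_yn],
                         [orderING], [line_item_id], [product_id]
            ORDER BY [transaction_id]) AS RANKING
    FROM [dbo].[M_ARTICLE_SALES_BY_STORE_TABLE]
)
DELETE FROM NUM_RANKING  -- deletes the surplus copies from the base table
WHERE RANKING > 1;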

STEP 2: MERGE THE SOURCE TABLE INTO THE DESTINATION TABLE ON THE COLUMNS USED ABOVE, UPDATING CHANGED RECORDS, INSERTING NEW ONES, AND DELETING DESTINATION RECORDS THAT NO LONGER MATCH. QUERY CAN BE VIEWED BELOW.
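The original query was shared as a screenshot, so here is a minimal reconstruction of a MERGE of that shape. It assumes the column names from STEP 1, takes the destination table name from the STEP 4 validation query, and shows only [quantity_sold] in the UPDATE and INSERT lists for brevity; extend both to cover every column in your table:

MERGE [dbo].[DW_SALES_BY_STORE_TAB] AS DEST           -- destination
USING [dbo].[M_ARTICLE_SALES_BY_STORE_TABLE] AS SRC   -- source
ON  DEST.[transaction_id]   = SRC.[transaction_id]
AND DEST.[transaction_date] = SRC.[transaction_date]
AND DEST.[transaction_time] = SRC.[transaction_time]
AND DEST.[store_id]         = SRC.[store_id]
AND DEST.[staff_id]         = SRC.[staff_id]
AND DEST.[customer_id]      = SRC.[customer_id]
AND DEST.[instore_yn]       = SRC.[instore_yn]
AND DEST.[orderING]         = SRC.[orderING]
AND DEST.[line_item_id]     = SRC.[line_item_id]
AND DEST.[product_id]       = SRC.[product_id]
-- update rows present in both tables whose values have changed
WHEN MATCHED AND DEST.[quantity_sold] <> SRC.[quantity_sold] THEN
    UPDATE SET [quantity_sold] = SRC.[quantity_sold]
-- insert rows that exist only in the source
WHEN NOT MATCHED BY TARGET THEN
    INSERT ([transaction_id], [transaction_date], [transaction_time], [store_id],
            [staff_id], [customer_id], [instore_yn], [orderING],
            [line_item_id], [product_id], [quantity_sold])
    VALUES (SRC.[transaction_id], SRC.[transaction_date], SRC.[transaction_time], SRC.[store_id],
            SRC.[staff_id], SRC.[customer_id], SRC.[instore_yn], SRC.[orderING],
            SRC.[line_item_id], SRC.[product_id], SRC.[quantity_sold])
-- delete destination rows that no longer exist in the source
WHEN NOT MATCHED BY SOURCE THEN
    DELETE;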

With the query above, you are telling SQL to do three things: update the DESTINATION TABLE rows that match the SOURCE TABLE on the specified fields but whose values have changed, insert every source row that has no match in the destination, and delete destination rows that no longer exist in the source.

STEP 3: CHECK THAT THE UPDATED DATA FROM THE SOURCE TABLE HAS BEEN SUCCESSFULLY APPLIED TO THE DESTINATION TABLE.

Updated records in SOURCE TABLE have been successfully updated in DESTINATION TABLE
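For a scripted version of that check, a hedged sketch using the same assumed column names (joining on a few identifying columns for brevity):

SELECT SRC.[transaction_id],
       SRC.[quantity_sold] AS SOURCE_QTY,
       DEST.[quantity_sold] AS DESTINATION_QTY
FROM [dbo].[M_ARTICLE_SALES_BY_STORE_TABLE] SRC
JOIN [dbo].[DW_SALES_BY_STORE_TAB] DEST
  ON  DEST.[transaction_id] = SRC.[transaction_id]
  AND DEST.[line_item_id]   = SRC.[line_item_id]
  AND DEST.[product_id]     = SRC.[product_id]
WHERE SRC.[quantity_sold] <> DEST.[quantity_sold];
-- After a successful merge this should return no rows.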

STEP 4: VALIDATE THAT THE SOURCE AND DESTINATION TABLES MATCH USING THIS SQL QUERY. IF YOU HAVE DONE THE RIGHT THING, THE QUERY SHOULD RETURN AN EMPTY TABLE.

VIEW QUERY BELOW.

SELECT * FROM M_ARTICLE_SALES_BY_STORE_TABLE
EXCEPT
SELECT * FROM DW_SALES_BY_STORE_TAB
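One caveat: EXCEPT only checks one direction. It is worth running the reverse as well, to catch rows that exist in the destination but are missing from the source; if both queries return empty tables, the two tables match:

SELECT * FROM DW_SALES_BY_STORE_TAB
EXCEPT
SELECT * FROM M_ARTICLE_SALES_BY_STORE_TABLE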

There you go: you have successfully integrated new data into your DESTINATION TABLE and validated it.

Simple and straightforward.

Once again, remember that MERGE works in SQL Server and in newer versions of PostgreSQL (15 and above), but not in MySQL.

If you found this article helpful, kindly give it a CLAP, and feel free to comment if you want to contribute to the article.

I'm actively looking for a role in data analytics, so kindly contact me if I fit any position you are hiring for.

You can connect with me on LinkedIn via the link below:

www.linkedin.com/in/ifeanyi-okoye-a857911ba

Thank you for reading.
