THIS SIMPLE SQL FUNCTION WILL OPTIMIZE YOUR ETL PROCESS.

Kingsley Okoye
6 min readMay 19, 2024

--

One process you will have to go through as a data engineer or analytics engineer even sometimes as a data analyst is integrating new data into your already existing data warehouse as well as ensuring your integrated data in properly validated.

As an ETL developer, it is important to ensure that the data you are making available to your downstream users is as accurate as possible, nothing is as bad as a stake holder questioning a bar chart visual in a report, after which you go back to your source data only to find out you or your reporting analyst actually created a dashboard using wrong numbers and the stake holder was actually right.

This is why proper data integration is important and in this article I will show you how you can use the MERGE function in SQL to seamlessly integrate data from the source transactional database (OLTP) into your already existing data warehouse.

Consider table below as our source table on day 1

From our source table we simply insert all records from our source table into our destination table using SQL statement below.

                
INSERT INTO TARGET_ORDERS_TABLE
(ORDER_ID, PRODUCT_NAME, PRODUCT_COST,
PRODUCT_SALES_PRICE, QUANTITY_SOLD, TOTAL_ORDER_AMOUNT)
SELECT ORDER_ID, PRODUCT_NAME, PRODUCT_COST,
PRODUCT_SALES_PRICE, QUANTITY_SOLD, TOTAL_ORDER_AMOUNT
FROM SOURCE_ORDERS_TABLE;

The next day new records are added to our SOURCE TABLE and some records are updated as well. Lets take a look at our SOURCE TABLE before new data is added and some old data is updated.

Below are our source table records before new transactions were added and some existing transactions were updated.

SOURCE TABLE BEFORE UPDATE

Below is our SOURCE TABLE records after new transactions have been added and some existing transactions have been updated.

NEW DATA UPDATED DATA

In order to bring integrate new data, we can decide to only insert new records from the source table in the target table using this syntax below.

               DECLARE @MAX_ORDER_ID INT;

SELECT @MAX_ORDER_ID = MAX(ORDER_ID)
FROM SOURCE_ORDERS_TABLE;

SELECT *
FROM SOURCE_ORDERS_TABLE
WHERE ORDER_ID > @MAX_ORDER_ID;

What the above query is doing is selecting all new records using order_id as the filtering column, basically all records from the SOURCE TABLE which have a higher ORDER_ID number than the max ORDER_ID in the target table will be inserted into the target table.

It generates this result.

With this we would only be getting new data into our target table but we would be omitting updated data from source table.

This is exactly where MERGE statement comes in handy.

MERGE function is a SQL function that you can use to INSERT new records into a table, UPDATE records existing in a table as well as DELETE unwanted records from a table based on a specific criteria.

Below is the basic syntax for MERGE statement.

BTW, you can connect with me on LINKED IN using link below.

www.linkedin.com/in/ifeanyi-okoye-a857911ba

It only takes 10 seconds.

MERGE INTO target_table
USING source_table
ON merge_condition
WHEN MATCHED THEN
UPDATE SET column1 = value1 [, column2 = value2 …]
WHEN NOT MATCHED THEN
INSERT (column1 [, column2 …])
VALUES (value1 [, value2 …]);

Here is a better way to integrate new data

     BEGIN TRAN;
MERGE INTO TARGET_ORDERS_TABLE AS TARGET
USING SOURCE_ORDERS_TABLE AS SOURCE
ON TARGET.ORDER_ID = SOURCE.ORDER_ID
WHEN MATCHED THEN UPDATE
SET TARGET.PRODUCT_NAME = SOURCE.PRODUCT_NAME,
TARGET.PRODUCT_COST = SOURCE.PRODUCT_COST,
TARGET.PRODUCT_SALES_PRICE = SOURCE.PRODUCT_SALES_PRICE,
TARGET.QUANTITY_SOLD = SOURCE.QUANTITY_SOLD,
TARGET.TOTAL_ORDER_AMOUNT = SOURCE.TOTAL_ORDER_AMOUNT
WHEN NOT MATCHED THEN
INSERT (ORDER_ID, PRODUCT_NAME, PRODUCT_COST,
PRODUCT_SALES_PRICE, QUANTITY_SOLD, TOTAL_ORDER_AMOUNT)
VALUES (SOURCE.ORDER_ID,
SOURCE.PRODUCT_NAME,
SOURCE.PRODUCT_COST,
SOURCE.PRODUCT_SALES_PRICE,
SOURCE.QUANTITY_SOLD,
SOURCE.TOTAL_ORDER

Now our target table and source table are matching.

You can validate that both tables have matching records by using the validation script below.

 SELECT ORDER_ID, PRODUCT_COST, PRODUCT_SALES_PRICE, 
QUANTITY_SOLD, TOTAL_ORDER_AMOUNT
FROM SOURCE_ORDERS_TABLE
EXCEPT
SELECT ORDER_ID, PRODUCT_COST, PRODUCT_SALES_PRICE,
QUANTITY_SOLD, TOTAL_ORDER_AMOUNT
FROM TARGET_ORDERS_TABLE

It generates an empty table meaning both tables are matching.

Also, note that when working on anything like updating, inserting or deleting records in tables, it is important to use the BEGIN TRAN, ROLLBACK OR COMMIT commands. This ensures that if there is any mistake, you can go back to the previous state of the table and correct your mistakes, this can be done by running the whole script and putting BEGIN TRAN at the beginning of the script and ROLLBACK command in end of the script as seen below.

BEGIN TRAN;
MERGE INTO TARGET_ORDERS_TABLE AS TARGET
USING SOURCE_ORDERS_TABLE AS SOURCE
ON TARGET.ORDER_ID = SOURCE.ORDER_ID
WHEN MATCHED THEN UPDATE
SET TARGET.PRODUCT_NAME = SOURCE.PRODUCT_NAME,
TARGET.PRODUCT_COST = SOURCE.PRODUCT_COST,
TARGET.PRODUCT_SALES_PRICE = SOURCE.PRODUCT_SALES_PRICE,
TARGET.QUANTITY_SOLD = SOURCE.QUANTITY_SOLD,
TARGET.TOTAL_ORDER_AMOUNT = SOURCE.TOTAL_ORDER_AMOUNT
WHEN NOT MATCHED THEN
INSERT (ORDER_ID, PRODUCT_NAME, PRODUCT_COST,
PRODUCT_SALES_PRICE, QUANTITY_SOLD, TOTAL_ORDER_AMOUNT)
VALUES (SOURCE.ORDER_ID,
SOURCE.PRODUCT_NAME,
SOURCE.PRODUCT_COST,
SOURCE.PRODUCT_SALES_PRICE,
SOURCE.QUANTITY_SOLD,
SOURCE.TOTAL_ORDER;
ROLLBACK; <---USE THIS TO GO BACK TO THE PREVIOUS STATE OF TABLE

Once you confirm that you have done the right thing then, use the COMMIT command to keep the table in the current state.

BEGIN TRAN;
MERGE INTO TARGET_ORDERS_TABLE AS TARGET
USING SOURCE_ORDERS_TABLE AS SOURCE
ON TARGET.ORDER_ID = SOURCE.ORDER_ID
WHEN MATCHED THEN UPDATE
SET TARGET.PRODUCT_NAME = SOURCE.PRODUCT_NAME,
TARGET.PRODUCT_COST = SOURCE.PRODUCT_COST,
TARGET.PRODUCT_SALES_PRICE = SOURCE.PRODUCT_SALES_PRICE,
TARGET.QUANTITY_SOLD = SOURCE.QUANTITY_SOLD,
TARGET.TOTAL_ORDER_AMOUNT = SOURCE.TOTAL_ORDER_AMOUNT
WHEN NOT MATCHED THEN
INSERT (ORDER_ID, PRODUCT_NAME, PRODUCT_COST,
PRODUCT_SALES_PRICE, QUANTITY_SOLD, TOTAL_ORDER_AMOUNT)
VALUES (SOURCE.ORDER_ID,
SOURCE.PRODUCT_NAME,
SOURCE.PRODUCT_COST,
SOURCE.PRODUCT_SALES_PRICE,
SOURCE.QUANTITY_SOLD,
SOURCE.TOTAL_ORDER;
COMMIT; <---USE THIS TO KEEP CURRENT STATE OF TABLE

NOTE: When using MERGE function, it is important to understand that there must be a primary key column, a primary key column is column with unique values as well as no NULL values, if there is no primary key this function has to be used in different way.

THIS→ 1,2,3,4,5,6,7,8,9,10

NOT THIS → 1,2,3,4,NULL,6,7,8,9,10

NOT THIS → 1,2,2,4,8,8,2,9

No number should repeat itself.

Also note that this works in SQL server as well as version 16 and newer for POSTGRESQL and might not also work in every version MYSQL yet.

Finally, this is a very simple example but when working with real world data you will come across situations where the tables are much more complicated, possibly having no primary key column but there is a way around it.

I have addressed this is a NEW article, check out the article by clicking link below:

https://medium.com/@exceldispensing/data-integration-made-easy-with-this-simple-sql-function-e5b37edcde45

Once again remember that you can connect with me on LINKED IN by using this link below:

www.linkedin.com/in/ifeanyi-okoye-a857911ba

Feel free to contribute to the article by commenting in the comment section and if you have found the article helpful, kindly like and share.

Till next time.

--

--

Kingsley Okoye

Data analyst| | Business Analytics | ETL developer | Connect with me on LINKED IN | LINKED IN: www.linkedin.com/in/ifeanyi-okoye-a857911ba