Delta Load with SCD2: Applying Adjustments to Your Base Data

Abhishek Trehan
1 min read · Jul 18, 2024


This is a follow-up to the previous article, where we covered implementing an SCD2 load into your target. Here we look at the case where the source sends only incremental changes.


The data used for testing has just over 200 million incoming records, and the target table holds about 2 billion records in total. The load time I was able to achieve was about 138 seconds on Spark 3.5.

Assume your source operates in delta mode, meaning it sends you only new or updated records. Scenario #3 from the previous article is not valid in this situation.
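For reference, here is a minimal sketch of the kind of SCD2 target this pattern assumes. The column names (key, col1, START_Dt, END_Dt) are inferred from the merge statement below; your actual schema will differ.

# Minimal sketch of an SCD2 target table, assuming the columns referenced
# in the merge below. END_Dt is NULL (or '9999-12-31') for the row that is
# currently active for a given key.
spark.sql("""
CREATE TABLE IF NOT EXISTS Target (
    key      BIGINT,
    col1     STRING,
    START_Dt DATE,
    END_Dt   DATE
) USING DELTA
""")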

The process is as follows: stage the incoming data into a staging table, run a merge to close out the current versions of the changed keys, and then insert the staged rows as the new current versions.

# Stage the incoming delta feed, replacing the previous staging load
(df
    .write
    .mode("overwrite")
    .option("mergeSchema", "true")  # pick up any new columns from the source
    .saveAsTable("source_stg"))

spark.sql(f"""MERGE INTO Target AS T
USING Source_stg AS S
ON T.key= S.key
WHEN MATCHED and T.col1 = S.col1 and coalesce(T.END_Dt, '9999-12-31') = '9999-12-31'
THEN UPDATE SET T.END_Dt=S.START_Dt
""")
spark.sql(f"INSERT INTO Target TABLE Source_stg")

Your load time will vary with compute size. The compute that performed best in this case was configured on Databricks as follows:

Driver: m6i.xlarge

Workers: m6i.8xlarge

20–30 workers

DBR: 14.3.x-photon-scala2.12
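As a sketch, here is how that configuration maps onto a Databricks cluster spec, written as the Python dict you would send to the Clusters API; cluster_name and autotermination_minutes are illustrative placeholders, not from the original setup.

# Sketch of the above configuration as a Databricks Clusters API spec.
# cluster_name and autotermination_minutes are placeholders.
cluster_spec = {
    "cluster_name": "scd2-delta-load",           # placeholder
    "spark_version": "14.3.x-photon-scala2.12",  # DBR from the article
    "driver_node_type_id": "m6i.xlarge",
    "node_type_id": "m6i.8xlarge",
    "autoscale": {"min_workers": 20, "max_workers": 30},
    "autotermination_minutes": 30,               # placeholder
}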
