Delta Load with SCD2: Applying Adjustments to Your Base Data

Abhishek Trehan
1 min read · Jul 18, 2024


This is a follow-up to the previous article, where we covered implementing an SCD2 load into your target. Here we look at the case where the source sends only incremental changes.


The data used for testing has just over 200 million incoming records, and the target table holds about 2 billion records in total. The load time I was able to achieve was about 138 seconds on Spark 3.5.

Assume your source operates in delta mode, meaning it sends you only new or updated records. Scenario #3 from the previous article is not valid in this situation.
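For reference, here is a minimal sketch of the kind of SCD2 target this pattern assumes. The column names (key, col1, START_Dt, END_Dt) are inferred from the merge statement below; your actual schema will differ.

# Minimal sketch of an SCD2 target table, assuming the columns referenced
# in the merge below. END_Dt is NULL (or '9999-12-31') for the row that is
# currently active for a given key.
spark.sql("""
CREATE TABLE IF NOT EXISTS Target (
    key      BIGINT,
    col1     STRING,
    START_Dt DATE,
    END_Dt   DATE
) USING DELTA
""")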

The process is as follows: stage the incoming data into a staging table, run a merge to close out the current versions of the changed keys, and then insert the staged rows as the new current versions.

# Stage the incoming delta feed, replacing the previous staging load
(df
    .write
    .mode("overwrite")
    .option("mergeSchema", "true")  # pick up any new columns from the source
    .saveAsTable("source_stg"))

spark.sql(f"""MERGE INTO Target AS T
USING Source_stg AS S
ON T.key= S.key
WHEN MATCHED and T.col1 = S.col1 and coalesce(T.END_Dt, '9999-12-31') = '9999-12-31'
THEN UPDATE SET T.END_Dt=S.START_Dt
""")
spark.sql(f"INSERT INTO Target TABLE Source_stg")

Your load time will vary with compute size. The compute that performed best in this case was configured on Databricks as follows:

Driver: m6i.xlarge

Workers: m6i.8xlarge

20–30 workers

DBR: 14.3.x-photon-scala2.12
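As a sketch, here is how that configuration maps onto a Databricks cluster spec, written as the Python dict you would send to the Clusters API; cluster_name and autotermination_minutes are illustrative placeholders, not from the original setup.

# Sketch of the above configuration as a Databricks Clusters API spec.
# cluster_name and autotermination_minutes are placeholders.
cluster_spec = {
    "cluster_name": "scd2-delta-load",           # placeholder
    "spark_version": "14.3.x-photon-scala2.12",  # DBR from the article
    "driver_node_type_id": "m6i.xlarge",
    "node_type_id": "m6i.8xlarge",
    "autoscale": {"min_workers": 20, "max_workers": 30},
    "autotermination_minutes": 30,               # placeholder
}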
