Slowly Changing Dimensions (SCD) Type 2 and effective ways of handling it in Apache Spark

Jafeer Ali
3 min read · May 8, 2019


What is SCD Type 2?

As per the Oracle documentation, “A Type 2 SCD retains the full history of values. When the value of a chosen attribute changes, the current record is closed. A new record is created with the changed data values and this new record becomes the current record. Each record contains the effective time and expiration time to identify the time period between which the record was active.”

Let's start with an example. Our current Hive/Delta table looks like this:

[Image: current table]

The new CSV, effective from June 2019, looks like this:

[Image: new CSV, June 2019]

The current table should then be updated as follows:

Explanation:

  • Employee 1 remains unchanged, so Effective From and Effective Upto are untouched.
  • Employee 2 has changed, so Effective Upto is updated to May 2019 for the old record and a new record is created with Effective From June 2019.
  • Employee 3 has no record in the new CSV file, so its Effective Upto is updated to May 2019.
  • Employee 4 is a new record in the new CSV file, so a new record is created with Effective From June 2019.

Implementation steps in Spark:

Assumptions:

  • Current dataframe — the existing data read from the Hive/Delta table.
  • New dataframe — the incoming data that will be merged into the current dataframe.

Record Hash Computation:

  • Concatenate all columns of the new dataframe and apply a hash function to produce a record hash (see the sketch after this list).
  • Assume the record hash already exists in the current dataframe.
  • Row metadata columns such as effective from and effective upto are excluded from the record hash computation. [dropColsInHash]
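A minimal sketch of the hash computation in Spark/Scala, assuming the new data sits in a dataframe called newDF and the metadata columns are named effective_from and effective_upto (the helper withRecordHash is hypothetical):

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{col, concat_ws, sha2}

    // Metadata columns to exclude from the record hash (dropColsInHash).
    val dropColsInHash = Seq("effective_from", "effective_upto")

    // Concatenate all remaining columns and hash them into a record_hash column.
    def withRecordHash(df: DataFrame): DataFrame = {
      val hashCols = df.columns.filterNot(dropColsInHash.contains).map(col)
      df.withColumn("record_hash", sha2(concat_ws("||", hashCols: _*), 256))
    }

    val hashedNewDF = withRecordHash(newDF)

The explicit separator in concat_ws guards against two different combinations of column values concatenating to the same string.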

Steps:

From the example above, we can see that we need to handle four scenarios:

1) Unchanged

2) New

3) Delete

4) Update

1) Unchanged

  • Inner join both dataframes with the record hash as the join column (see the sketch below).
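A sketch, assuming the current data is in currentDF with a record_hash column and the hashed new data is in hashedNewDF (names carried over from the sketch above):

    // Unchanged: rows whose record hash appears in both dataframes.
    // Keep the current side so effective_from / effective_upto stay untouched.
    val unchangedDF = currentDF
      .join(hashedNewDF.select("record_hash"), Seq("record_hash"), "inner")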

2) New

  • Left anti join of the new dataframe against the current dataframe's active records (effective upto is null), as sketched below.
  • Effective from — the new date.
  • Effective upto — null.
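A sketch of this branch, with the example's dates hard-coded as placeholder string literals:

    import org.apache.spark.sql.functions.{col, lit}

    // Active rows are the ones not yet closed (effective_upto is null).
    val activeDF = currentDF.filter(col("effective_upto").isNull)

    // New: present in the new data but not among the active current rows.
    val insertDF = hashedNewDF
      .join(activeDF.select("record_hash"), Seq("record_hash"), "left_anti")
      .withColumn("effective_from", lit("2019-06"))            // the new date
      .withColumn("effective_upto", lit(null).cast("string"))  // still open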

3) Delete

  • Left anti join of the current dataframe against the new dataframe, as sketched below.
  • Effective from — no change.
  • Effective upto — the previous month of the new date.
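The matching sketch for this branch; restricting the anti join to the active rows (activeDF from the previous sketch) is an assumption, so that already-closed history rows keep their original dates:

    import org.apache.spark.sql.functions.lit

    // Delete: active current rows with no counterpart in the new data.
    val deleteDF = activeDF
      .join(hashedNewDF.select("record_hash"), Seq("record_hash"), "left_anti")
      .withColumn("effective_upto", lit("2019-05"))  // previous month of the new date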

4) Update

  • The old record is handled as part of Delete.
  • The new record is handled as part of New (insert).

The final result is the union of the unchanged, new, and deleted records; the Hive/Delta table is then overwritten with this result, as sketched below.
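A sketch of the final step (the table name employee_scd is a placeholder):

    // Union the three branches by column name and overwrite the target table.
    val resultDF = unchangedDF
      .unionByName(insertDF)
      .unionByName(deleteDF)

    resultDF.write.mode("overwrite").saveAsTable("employee_scd")

If the table already holds closed history rows, they would need to be unioned back in as well. Note also that Spark may refuse to overwrite a table it is currently reading from, so in practice the result is often staged or checkpointed before the overwrite.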

That’s it. Refer to my GitHub project for the complete implementation.

Advantages of this method

  • This overwrite approach works around Hive's lack of MERGE support.
