Implementation of SCD-2 (Slowly Changing Dimension) with Apache Hudi & Spark

Jayasheel Kalgal
Walmart Global Tech Blog
Aug 24, 2022

Authored & Contributed by : Jayasheel Kalgal, Esha Dhing, Prashant Mishra

Stack of files and paperwork
Image source : https://pixabay.com/photos/files-paper-office-paperwork-stack-1614223/

What are we trying to solve?

DATA is an invaluable asset in today's analytical world. While serving data to end users, it is very important to track how the data changes over time. A Slowly Changing Dimension (SCD) is a dimension that stores and manages both current and historical data over time. Among the SCD types, we will concentrate on Type 2 (SCD-2), which retains the full history of values. Each record carries an effective time and an expiration time that identify the period during which the record was active. This can be achieved with the help of a few audit columns, e.g. Effective Start Date, Effective End Date, and Active Record Indicator.

Let's learn how this SCD-2 table design can be implemented using Apache Hudi and Spark.

Apache Hudi is the next generation streaming data lake platform. Apache Hudi brings core warehouse and database functionality directly to a data lake. Hudi provides tables, transactions, efficient upserts/deletes, advanced indexes, streaming ingestion services, data clustering/compaction optimizations, and concurrency all while keeping your data in open source file formats.

Apache Hudi always serves the snapshot view of a table, i.e. the latest data as of the most recent commit. If we want to track historical changes, we need to leverage Hudi's Point in Time Query (https://hudi.apache.org/docs/quick-start-guide#point-in-time-query).
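For reference, a point in time read over a Hudi table looks roughly like the sketch below. The table path and the commit times are placeholders; the option keys are the standard Hudi datasource read options.

```scala
// Minimal sketch of a Hudi point in time (incremental) query.
// basePath and the begin/end instant times are placeholders.
val basePath  = "s3://<bucket>/hudi/product_dtl"
val beginTime = "000"             // earliest possible commit time
val endTime   = "20220824120000"  // hypothetical commit time of interest

val pointInTimeDF = spark.read.format("hudi").
  option("hoodie.datasource.query.type", "incremental").
  option("hoodie.datasource.read.begin.instanttime", beginTime).
  option("hoodie.datasource.read.end.instanttime", endTime).
  load(basePath)
```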

Hudi allows reading the latest data, and time travel to older versions via point in time queries. However, traversing through historical data changes with point in time queries is tedious and would require analysing multiple time intervals for a given piece of data.

Let's see how we can overcome this with a workaround using a classical approach.

Let us consider a table with product details and seller discount for this illustration.

Hive Table (product_dtl)

Steps

  1. Let’s put this data into a Hudi table using Spark.
Launching Spark Shell with Hudi dependencies

Once the Spark shell is launched, we can import the libraries and create our Hudi table as below.

Importing Hudi libraries & creating Hudi target table
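A minimal sketch of what this step might look like in the Spark shell. The Hudi bundle version, the bucket path, the column names (product_id, seller_discount, eff_start_ts, eff_end_ts, actv_ind), and the record key / precombine / partition choices are illustrative assumptions, not the exact configuration of the original job.

```scala
// spark-shell can be launched with the Hudi bundle, for example:
//   spark-shell --packages org.apache.hudi:hudi-spark3-bundle_2.12:0.11.1 \
//     --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer"

import org.apache.spark.sql.SaveMode

val tableName = "product_dtl"
val basePath  = "s3://<bucket>/hudi/product_dtl"   // placeholder target location

// Source data read from the existing Hive table
val productDF = spark.table("product_dtl")

productDF.write.format("hudi").
  option("hoodie.datasource.write.recordkey.field", "product_id").
  option("hoodie.datasource.write.precombine.field", "eff_start_ts").
  option("hoodie.datasource.write.partitionpath.field", "actv_ind").
  option("hoodie.table.name", tableName).
  mode(SaveMode.Overwrite).
  save(basePath)
```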

After adding data to the bucket, this is what our Hudi target table will look like.

Hudi target table (product_dtl)

2. Let us assume that our incremental data is stored in the table below (in a non-Hudi format; it can be a Hive table).
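Assuming the incremental data lives in a Hive table named, say, product_dtl_delta (the name is purely illustrative), both the delta and the Hudi target can be loaded into dataframes for the steps that follow:

```scala
// Illustrative name: product_dtl_delta holds the incoming delta records
val deltaDF = spark.table("product_dtl_delta")

// Snapshot read of the Hudi target table written in step 1
val targetDF = spark.read.format("hudi").load(basePath)
```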

3. Now let's filter out all the insert-only records in the delta table by doing a left anti join with the target table.

Filter Insert only records
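A sketch of the left anti join; the join key product_id is an assumption for this illustration:

```scala
// Records present in the delta but not in the target are pure inserts
val insertsDF = deltaDF.join(targetDF, Seq("product_id"), "left_anti")
```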

4. We now have a dataframe with insert-only records. Next, let's create a dataframe that contains attributes from both the delta and the target table, using an inner join with the target, which fetches the records that need to be updated.

Update records, in both delta and target tables
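One way to build this combined dataframe is sketched below; the old_*/new_* column names are hypothetical, and seller_discount plus the effective timestamps stand in for the full attribute list:

```scala
import org.apache.spark.sql.functions.col

// Inner join keeps only records that exist in both delta and target,
// i.e. candidates for update; old (target) and new (delta) values are
// renamed so they can live side by side in a single row.
val updatesJoinedDF = deltaDF.alias("d")
  .join(targetDF.alias("t"), col("d.product_id") === col("t.product_id"), "inner")
  .select(
    col("d.product_id").as("product_id"),
    col("d.seller_discount").as("new_seller_discount"),
    col("d.eff_start_ts").as("new_eff_start_ts"),
    col("t.seller_discount").as("old_seller_discount"),
    col("t.eff_start_ts").as("old_eff_start_ts"),
    col("t.eff_end_ts").as("old_eff_end_ts")
  )
```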

5. Now that we have a dataframe containing both new and old data in a single record, let's pull the active and inactive instances of each updated record into separate dataframes.

Split update records into active and inactive

While doing the above exercise, we will obsolete the inactive record by changing its eff_end_ts to the eff_start_ts - 1 of the active (new) record and updating actv_ind = 0.

Create active and inactive updates dataframes
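A sketch of this split, following the description above. The open-ended high date, the interpretation of "- 1" as one second before the new record's start, and the literal 1/0 values for actv_ind are assumptions made for this illustration:

```scala
import org.apache.spark.sql.functions.{col, expr, lit}

// Active instance: new values from the delta, with an open-ended effective period
val activeUpdatesDF = updatesJoinedDF.select(
  col("product_id"),
  col("new_seller_discount").as("seller_discount"),
  col("new_eff_start_ts").as("eff_start_ts"),
  lit("9999-12-31 23:59:59").cast("timestamp").as("eff_end_ts"),  // assumed high date
  lit(1).as("actv_ind")
)

// Inactive instance: old values, closed off just before the new record starts
val inactiveUpdatesDF = updatesJoinedDF.select(
  col("product_id"),
  col("old_seller_discount").as("seller_discount"),
  col("old_eff_start_ts").as("eff_start_ts"),
  expr("new_eff_start_ts - INTERVAL 1 SECOND").as("eff_end_ts"),
  lit(0).as("actv_ind")
)
```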

6. Now we will pull the inserts, active updates and inactive updates into a single dataframe with a union operation, and use this dataframe as the delta source for the final Hudi write logic.

Hudi upsert logic & final data of the target table after upsert operation
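Putting it together, a sketch of the final union and Hudi upsert. As before, the audit-column defaults for brand new inserts and the record key / precombine / partition settings are illustrative assumptions rather than the exact production configuration:

```scala
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.lit

// Brand new inserts also need their audit columns populated
// (assuming the delta feed carries only the business columns and eff_start_ts)
val insertsWithAuditDF = insertsDF
  .select("product_id", "seller_discount", "eff_start_ts")
  .withColumn("eff_end_ts", lit("9999-12-31 23:59:59").cast("timestamp"))
  .withColumn("actv_ind", lit(1))

// Single delta source: inserts + active updates + inactive updates
val finalDeltaDF = insertsWithAuditDF
  .unionByName(activeUpdatesDF)
  .unionByName(inactiveUpdatesDF)

finalDeltaDF.write.format("hudi").
  option("hoodie.datasource.write.operation", "upsert").
  option("hoodie.datasource.write.recordkey.field", "product_id").
  option("hoodie.datasource.write.precombine.field", "eff_start_ts").
  option("hoodie.datasource.write.partitionpath.field", "actv_ind").
  option("hoodie.table.name", tableName).
  mode(SaveMode.Append).
  save(basePath)
```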

A few points to consider during implementation

  • For every update of an existing record, the Parquet file will be rewritten/moved across storage; this might have an impact on write performance.
  • It's always a better idea to partition the target table on attributes that represent the major filters used when querying the data, e.g. sale dates for a sales table, or registered sellers for a product catalog. In our example, we have chosen actv_ind since we wanted to keep the explanation simple and keep all active records in one partition.

Conclusion

As we evolve in using Apache Hudi for Spark applications, we will continue to refine our strategies for loading data. The above attempt is just a start towards achieving SCD-2 functionality with Hudi.
