Slowly Changing Dimensions at Scale

Strategies to model slowly changing dimensions in Big Data

Aaditya Bhat
Analytics Vidhya
4 min read · May 19, 2022

In a dimensional model, data is categorized as Facts or Dimensions. Facts are typically (but not always) numeric values that describe the activity of the business, while Dimensions are descriptors that provide more context around the facts. If you consider a dimensional model for a bank, transactions are the fact, while customer, branch, date, device, etc. are the dimensions. The attributes of dimensions change much more slowly than the facts accumulate. For example, if you consider the User dimension, a user's address, email, phone, etc. may change over time, but only occasionally.

In Ralph Kimball's The Data Warehouse Toolkit, he proposes various design patterns to accommodate slowly changing dimensions: Type 0, Type 1, Type 2, Type 3, and so on. Here's a quick summary of these types. In Type 0, we never update the changed attribute of the dimension; this seems very counterintuitive and has very limited real-life use-cases. Type 1 is the opposite end of the spectrum, where the changed attribute is always overwritten in the dimension table; with this approach the ability to track history is lost. Type 2 is the most comprehensive approach, where a new row is added whenever a dimension attribute changes. Type 3 is a tradeoff where the changes are tracked in columns instead of rows. The Type 3 approach works well when there is a fixed number of changes at a regular cadence; for example, the tax_filing_status attribute in the user dimension can be modeled as Type 3 with columns tax_filing_status_2020, tax_filing_status_2021, tax_filing_status_2022, etc. Type 3 is not suitable when the attribute changes arrive at irregular, unpredictable times. Fig 1 provides a visual representation of slowly changing dimension Types 0 through 3. There are various other slowly changing dimension types which are combinations of the above four, but those are designed for very specific use-cases and are not widely used.
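To make the Type 1 vs Type 2 contrast concrete, here is a minimal sketch in Python against a tiny, hypothetical user dimension. The column names (user_id, address, valid_from, valid_to, is_current) are illustrative assumptions, not from any particular schema.

```python
from datetime import date

def apply_type1(dim, user_id, new_address):
    """Type 1: overwrite the attribute in place; history is lost."""
    for row in dim:
        if row["user_id"] == user_id:
            row["address"] = new_address
    return dim

def apply_type2(dim, user_id, new_address, change_date):
    """Type 2: expire the current row and append a new versioned row."""
    new_rows = []
    for row in dim:
        if row["user_id"] == user_id and row["is_current"]:
            row["is_current"] = False
            row["valid_to"] = change_date
            new_rows.append({
                "user_id": user_id,
                "address": new_address,
                "valid_from": change_date,
                "valid_to": None,   # open-ended: still the current row
                "is_current": True,
            })
    dim.extend(new_rows)
    return dim

dim = [{"user_id": 1, "address": "12 Oak St",
        "valid_from": date(2021, 1, 1), "valid_to": None, "is_current": True}]

# After a Type 2 update, the dimension holds two rows for user 1:
# the expired old address and the current new one.
apply_type2(dim, 1, "99 Elm Ave", date(2022, 5, 1))
```

Under Type 1 the old address would simply vanish; under Type 2 both versions survive, at the cost of the extra valid_from/valid_to bookkeeping.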

Fig 1: Slowly Changing Dimensions

With the introduction of slowly changing dimensions out of the way, let's explore some good strategies for modeling slowly changing dimensions in a big data ecosystem. The Type 2 approach is very comprehensive: it gives the ability to do point-in-time analysis as well as a quick view of the current state of the dimension. However, it adds complexity to the codebase. The data pipelines for Type 2 dimension tables are complex and cumbersome to maintain, and consuming a Type 2 dimension table is not always straightforward for downstream use-cases. The partitioning scheme for a Type 2 dimension table is also tricky, as opposed to the common date/hour partitioning scheme for fact tables. The mismatched partitioning schemes for dimension versus fact tables, coupled with the complex downstream queries needed to consume a Type 2 dimension table, can result in inefficient queries. Therefore, if you are dealing with big data scale, you are better off avoiding the Type 2 dimension table.
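The awkwardness of consuming a Type 2 table shows up in the join condition. A point-in-time lookup has to match each fact against a validity window rather than a single key, as this small sketch shows (table layouts here are illustrative assumptions):

```python
from datetime import date

# A Type 2 dimension: one row per version, with a validity window.
type2_dim = [
    {"user_id": 1, "address": "12 Oak St",
     "valid_from": date(2021, 1, 1), "valid_to": date(2022, 5, 1)},
    {"user_id": 1, "address": "99 Elm Ave",
     "valid_from": date(2022, 5, 1), "valid_to": None},
]

facts = [
    {"user_id": 1, "txn_date": date(2021, 6, 1), "amount": 100},
    {"user_id": 1, "txn_date": date(2022, 6, 1), "amount": 250},
]

def point_in_time_join(facts, dim):
    """Attach the dimension attributes valid on each fact's date.

    The range condition (valid_from <= d < valid_to) is what makes Type 2
    joins awkward at scale: it is not an equi-join on a partition key, so
    engines cannot prune partitions the way they can for date-partitioned
    tables."""
    out = []
    for f in facts:
        for d in dim:
            if (d["user_id"] == f["user_id"]
                    and d["valid_from"] <= f["txn_date"]
                    and (d["valid_to"] is None
                         or f["txn_date"] < d["valid_to"])):
                out.append({**f, "address": d["address"]})
    return out

joined = point_in_time_join(facts, type2_dim)
```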

An obvious question arises: if not Type 2 slowly changing dimensions, then what? The surprising answer follows the KISS (keep it simple, stupid) principle. Storing daily snapshots of the dimension table, with a view pointing to the latest snapshot, is a good design for modeling slowly changing dimensions. For example, the user dimension and the view would be modeled as in Fig 2, where dt, the date of the snapshot, is the partitioning column, and the view points to the latest snapshot of the dimension.
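A minimal sketch of this design in Python: dim_user holds one full copy of the dimension per dt partition, and the "view" simply selects the latest partition. The names dim_user and dt mirror the figure; the row layout is an assumption.

```python
from datetime import date

# One full snapshot of the dimension per day, keyed by the dt partition.
dim_user = {
    date(2022, 5, 17): [{"user_id": 1, "address": "12 Oak St"}],
    date(2022, 5, 18): [{"user_id": 1, "address": "12 Oak St"}],
    date(2022, 5, 19): [{"user_id": 1, "address": "99 Elm Ave"}],
}

def dim_user_view(snapshots):
    """The 'latest snapshot' view: roughly
    SELECT * FROM dim_user WHERE dt = (SELECT MAX(dt) FROM dim_user)."""
    latest_dt = max(snapshots)
    return snapshots[latest_dt]
```

Queries that only need the current state read the view; queries that need history read the relevant dt partitions directly.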

Fig 2: dim_user and dim_user_view example

At first, this approach seems redundant and repetitive; storing almost the same data every day looks inefficient, but we will get to the storage aspect of this approach later. Let's compare it with the Type 2 slowly changing dimension approach. Type 2 gave us the ability to do quick point-in-time analyses as well as a view of the current state of the dimension; this approach gives us both of those benefits. Additionally, the shortcomings of Type 2 in terms of the complexity to build, maintain, and consume the dimension table are mitigated here. The pipeline to create and maintain tables in this approach is quite straightforward. Consuming the table is also very easy: if the use-case requires the current state of the dimension, users can query the view; if the use-case requires a point-in-time view of the dimension, users can query the relevant dt partitions or join the fact tables to the dimension table on the dt column. Partitioning the dimension table on the dt column avoids expensive full table scans and results in more efficient queries. This approach also opens up time-series use-cases for dimension data that would have been very difficult with the Type 2 approach.
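With daily snapshots, the point-in-time join described above becomes a plain equi-join on (user_id, dt): each fact row simply picks up the snapshot taken on its own date, and an engine can prune straight to the matching partitions. A small sketch, with illustrative table layouts:

```python
from datetime import date

# Snapshot-per-day dimension, keyed by the dt partition.
dim_user = {
    date(2022, 5, 18): [{"user_id": 1, "address": "12 Oak St"}],
    date(2022, 5, 19): [{"user_id": 1, "address": "99 Elm Ave"}],
}

fact_txn = [
    {"user_id": 1, "dt": date(2022, 5, 18), "amount": 100},
    {"user_id": 1, "dt": date(2022, 5, 19), "amount": 250},
]

def join_on_dt(facts, snapshots):
    """Equi-join of facts to the snapshot dimension on (dt, user_id):
    no validity-window range condition needed."""
    out = []
    for f in facts:
        for row in snapshots.get(f["dt"], []):
            if row["user_id"] == f["user_id"]:
                out.append({**f, "address": row["address"]})
    return out

joined = join_on_dt(fact_txn, dim_user)
```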

Having looked at the various benefits of this approach, let's revisit the storage aspect. Given that dimension tables are generally much smaller than fact tables, keeping daily snapshots of a dimension table won't make a huge dent in the overall storage requirements. Since storage is much cheaper than computation and development time, trading off storage for reduced computation and development time is a good decision. Lastly, there are strategies to manage the storage implications of this approach with retention policies and cold data storage. Deleting older partitions of dimension data according to retention policies guided by legal and business requirements, and moving rarely accessed partitions to cold data storage, can alleviate the storage costs of this approach.
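Such a retention sweep can be sketched in a few lines. The thresholds, tier names, and the idea of returning a plan rather than acting immediately are all illustrative assumptions:

```python
from datetime import date, timedelta

def tier_partitions(partition_dates, today, warm_days=90, retain_days=365):
    """Classify each dt partition: keep recent ones hot, move older ones
    to cold storage, and delete those past the retention window."""
    plan = {}
    for dt in partition_dates:
        age = (today - dt).days
        if age > retain_days:
            plan[dt] = "delete"
        elif age > warm_days:
            plan[dt] = "cold"
        else:
            plan[dt] = "hot"
    return plan

today = date(2022, 5, 19)
partitions = [today, today - timedelta(days=100), today - timedelta(days=400)]
plan = tier_partitions(partitions, today)
```

In practice the actual thresholds should come from the legal and business requirements mentioned above, not from code defaults.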
