Resolving Late Arriving Dimensions

how to handle late arriving dimensions

Soumendra Mishra
DataSeries
4 min readSep 19, 2021

--

In Data Warehouse ETL process, Late Arriving Dimension represents the detection of record(s) arrives in a fact table and its corresponding dimension doesn’t exists. As a result, a late arriving dimension has a natural key which exists in the new fact data, which does not yet exist in the dimension.

Late arriving dimensions also called as Early Arriving Facts, which occurs when dimension data arriving in the data warehouse later than the fact data that references that dimension record.

Late Arriving Dimension Scenario

In the common scenario for a data warehouse, ETL (extract, transform, and load) processes are designed in a manner that dimensions are fully processed first. Once the dimension processing has been completed, the facts are loaded, with the assumption that all required dimension data is already in place.

Late arriving dimensions break this pattern, because the fact data is processed first, before the associated dimension information. The ETL processes can encounter fact records that are related to new dimension members i.e. members that have yet to be processed.

Influence of Data Warehouse Keys in Late Arriving Dimension

In the context of dimensional models, natural keys and surrogate keys has distinguished roles while conferring late arriving dimensions.

When processing late arriving dimension rows, the natural key helps in identifying whether the dimension record is new or already exists in the dimension table. The surrogate key allows to link the fact table to the dimension.

Data Model

  1. ProductSales represents the source of sales transactions
  2. FactSales is the fact table where the sales transactions are recorded
  3. DimCustomer is the late arriving dimension. CustomerCode is the natural key for a customer, and CustomerKey is the surrogate key

Design Approaches

  1. Never Process Fact
  2. Park and Retry
  3. Inferred Flag

Approach-1: Never Process Fact

  1. Lookup for CustomerCode in DimCustomer dimension table. If there is no corresponding dimension record found, the approach is to discard the record from loading into fact table i.e. never process fact record until dimension record is available.
  2. In this example, record having CustomerCode [CUST-004] is discarded while loading into FactSales table.

Approach-2: Park and Retry

  1. Lookup for CustomerCode in DimCustomer dimension table. If there is no corresponding dimension record, the approach is to insert the unmatched transactional record into a landing table and retry in next batch process to load data into fact table.
  2. In this example, record having CustomerCode [CUST-004] is inserted into a landing table and will retry in next batch process to load from landing table to fact table if it matches with corresponding dimension record.

Approach-3: Inferred Flag

  1. Lookup for CustomerCode in DimCustomer dimension table. If there is no corresponding dimension record, the approach is to insert a new dimension record and push the reference key to fact table.
  2. A new dimension record will be inserted with autogenerated surrogate key, but all the other column values will be stored as N/A (not available).
  3. InferredFlag will be added as True.
  4. When actual dimension record available, columns marked as N/A will be updated with new data received and InferredFlag will be updated as False
Before Dimension Record Available
After Dimension Record Available

Conclusion

Late Arriving Dimensions is a common scenario in data warehouse solution. There are multiple techniques adopted by various organization to overcome this challenge. The main objective of this blog is to safeguard data integrity by adopting to one of the specified solutions and achieving the goals necessary to help business run smoothly and effectively.

--

--

Soumendra Mishra
DataSeries

Passionate Leader, Technology Enthusiast, Innovator, and Mentor